Custom fieldtype with sharding?

2011-03-10 Thread Peter Cline

Hi all,
I'm having an issue with using a custom fieldtype with distributed 
search.  It may be the case that what I'm looking for could be 
accomplished in a different way, but this is my first stab at it.


I'm looking to store XML in a field.  What I've done, which works fine, 
is to:

- on ingest, wrap the XML in a CDATA tag
- write a simple class that extends org.apache.solr.schema.TextField, 
which writes an XML node much in the way that a textfield would, but 
without escaping the contents


It looks like this:
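import java.io.IOException;
import java.io.Writer;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.schema.TextField;
// TextResponseWriter's package depends on the Solr version, so its import is omitted here.

public class XMLField extends TextField {
    // Write the stored value inside an <xml> element without escaping it.
    @Override
    public void write(TextResponseWriter xmlWriter, String name, Fieldable f)
            throws IOException {
        Writer writer = xmlWriter.getWriter();
        String value = f.stringValue();
        writer.write("<xml name=\"" + name + "\">");
        if (value != null) {
            writer.write(value);
        }
        writer.write("</xml>");
    }
}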

Like I said, simple.  Not especially pretty, but it does the job.  Works 
fine for normal searching, I get back a response like:

<xml name="xmlField"><xml-contents-unescaped/></xml>

When I try to use this with distributed searching, though, it comes back 
written as a normal textfield, like:

<str name="xmlField">&lt;xml-contents-have-been-escaped/&gt;</str>

It looks like it doesn't know anything about my custom fieldtype at all, 
and is defaulting to writing it as a StrField or TextField instead.


So, my question:
- is there a better way to do this?  I'd be fine if it came back with a 
'str' element name, as long as it's not escaped.
- is there perhaps a different class I should extend to do this with 
sharded searching?
- should I just bite the bullet and manually unescape the xml after 
receiving the response?  I'd really prefer not to do this if I can get 
around it.
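
For reference, the manual-unescape fallback from the last bullet could look roughly like the sketch below. XmlUnescape is a hypothetical helper, not part of Solr, and it only handles the five predefined XML entities (no numeric character references). If the response is read with an XML parser, the parser already returns the text unescaped, so a step like this only matters when the raw response text is handled directly.

```java
// Hypothetical helper (not part of Solr): reverses the five predefined XML
// entity escapes in a field value that came back escaped.
final class XmlUnescape {
    static String unescape(String escaped) {
        if (escaped == null) {
            return null;
        }
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                // "&amp;" goes last so that "&amp;lt;" becomes "&lt;" and not "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescape("&lt;xml-contents-have-been-escaped/&gt;"));
    }
}
```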


Thanks in advance for any help.

Peter


facet.offset with facet.sort=lex and shards problem?

2011-02-24 Thread Peter Cline

Hi all,

I'm having a problem using distributed search in conjunction with the 
facet.offset parameter and lexical facet value sorting.  Is there an 
incompatibility between these?  I'm using Solr 1.4.1.


I have a facet with ~100k values in one index.  I'm wanting to page 
through them alphabetically.  When not using distributed search, 
everything works just fine, and very quick.  A query like this works, 
returning 10 facet values starting at the 50,001st:


http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5
# Butterflies - Indiana !
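
Since the parameter separators are easy to lose when a URL like this is pasted around, here is a minimal sketch of assembling the same query string in code. FacetPageUrl and its method names are made up for illustration, and the offset of 50000 is an assumption taken from "starting at the 50,001st".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the query string for one page of a
// lexically sorted facet.
final class FacetPageUrl {
    static Map<String, String> facetPage(String field, int offset, int limit) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "*:*");
        params.put("facet", "true");
        params.put("facet.field", field);
        params.put("f." + field + ".facet.limit", Integer.toString(limit));
        params.put("facet.sort", "lex");
        params.put("facet.offset", Integer.toString(offset));
        return params;
    }

    // Joins the parameters with '&' and URL-encodes keys and values as UTF-8.
    static String queryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed offset: the 50,001st value is at offset 50000.
        System.out.println("http://server:port/solr/select/?"
                + queryString(facetPage("subject_full_facet", 50000, 10)));
    }
}
```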

However, if I enable distributed search, using a single shard (which is 
the same index), I get no facet values returned.


http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5&shards=server:port/solr
# empty list :(

Doing a little more testing, I'm finding that with sharding I often get 
an empty list any time facet.offset >= facet.limit.  Also, by 
example, if I do facet.limit=100 and facet.offset=90, I get 10 facet 
values.  Doing so without sharding, I get the expected (by me, at least) 
100 values (starting at what would normally be the 91st).
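
To make the expected behaviour concrete, here is a small sketch of the paging semantics described above: merge the per-shard counts, order the values lexically, then skip facet.offset values and return the next facet.limit. This is an illustration only, not Solr's actual merge code. ShardFacetPager is a hypothetical name, and it assumes each shard has returned at least offset + limit values starting from its own offset 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of offset/limit paging over merged shard facets.
final class ShardFacetPager {
    static List<Map.Entry<String, Integer>> page(
            List<Map<String, Integer>> shardCounts, int offset, int limit) {
        // Sum the counts per value; TreeMap keeps the values in String order,
        // which stands in for Solr's lexical (index) order here.
        TreeMap<String, Integer> merged = new TreeMap<String, Integer>();
        for (Map<String, Integer> shard : shardCounts) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Apply the offset first, then the limit, clamping both to the list size.
        List<Map.Entry<String, Integer>> all =
                new ArrayList<Map.Entry<String, Integer>>(merged.entrySet());
        int from = Math.min(offset, all.size());
        int to = Math.min(from + limit, all.size());
        return new ArrayList<Map.Entry<String, Integer>>(all.subList(from, to));
    }
}
```

With these semantics, facet.limit=100 and facet.offset=90 yields values 91 through 190. Getting only 10 back is what you would see if the limit were applied to the merged list before the offset.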


Can anybody shed any light on this for me?

Thanks,
Peter


Re: facet.offset with facet.sort=lex and shards problem?

2011-02-24 Thread Peter Cline

On 02/24/2011 12:37 PM, Yonik Seeley wrote:

Sounds like a bug.
Have you tried a 3x or trunk development build to see if it's fixed there?

-Yonik
http://lucidimagination.com


I haven't.  I'll try the current trunk and get back to you.

Thanks,
Peter


Re: facet.offset with facet.sort=lex and shards problem?

2011-02-24 Thread Peter Cline

On 02/24/2011 02:58 PM, Peter Cline wrote:

I haven't.  I'll try the current trunk and get back to you.

Thanks,
Peter


I tried today's builds for the 3.x branch and the trunk.  The problem 
persists in both.


Peter


Re: Question about facet.prefix usage

2008-10-27 Thread Peter Cline

Hi Simon,
I came across your post to the solr users list about using facet 
prefixes, shown below.  I was wondering if you were still using your 
modified version of SimpleFacets.java, and if so -- if you could send me 
a copy.  I'll need to implement something similar, and it never hurts to 
start from existing material.


Thanks,
Peter

Simon Hu wrote:

I also need the exact same feature. I was not able to find an easy solution
and ended up modifying class SimpleFacets to make it accept an array of
facet prefixes per field. If you are interested, I can email you the
modified SimpleFacets.java. 


-Simon


steve berry-2 wrote:
  
Question: Is it possible to pass complex queries to facet.prefix? 
Example instead of facet.prefix:foo I want facet.prefix:foo OR 
facet.prefix:bar


My application is for browsing business records that fall into 
categories. The user is only allowed to see businesses falling into 
categories which they have access to.


I have a series of documents dumped into the following basic structure 
which I was hoping would help me deal with this:


<doc>
<field name="id">123</field>
<field name="name">Business Corp.</field>
<field name="zip">28255-0001</field>
.
<field name="market_category">charlotte_2006 Banks</field>
<field name="market_category">charlotte_2007 Banks</field>
<field name="market_category">sanfrancisco_2006 Banks</field>
<field name="market_category">sanfrancisco_2007 Banks</field>
... (lots more market_category entries) ...
</doc>
<doc>
<field name="id">124</field>
<field name="name">Factory Corp.</field>
<field name="zip">28205-0001</field>
.
<field name="market_category">charlotte_2006 Banks</field>
<field name="market_category">charlotte_2007 Banks</field>
<field name="market_category">austin_2006 Banks</field>
<field name="market_category">austin_2007 Banks</field>
... (lots more market_category entries) ...
</doc>
.

The multivalued market_category fields are flattened relational data 
attributed to that business and I want to use those values for faceted 
navigation /but/ I want the facets to be restricted depending on what 
products the user has access to. For example a user may have access to 
sanfrancisco_2007 and sanfrancisco_2006 data but nothing else.


So I've created a request using facet.prefix that looks something like
this:
http://SOLRSERVER:8080/solr/select?q.op=AND&q=docType:gen&facet.field=market_category&facet.prefix=charlotte_2007

This ends up producing perfectly suitable facet results that look like
this:
..
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="market_category">
<int name="charlotte_2007 Banks/Financial Institutions">1</int>
<int name="charlotte_2007 Employers">1</int>
<int name="charlotte_2007 Highest-Paid Executives/Public Officials/Athletes">1</int>
<int name="charlotte_2007 Mergers &amp; Acquisitions">1</int>
<int name="charlotte_2007 Miscellaneous">1</int>
<int name="charlotte_2007 Public Companies">1</int>
<int name="charlotte_2007">0</int>
</lst>
.


Bingo! facet.prefix does exactly what I want it to.

Now I want to go a step further and pass a compound statement to the 
facet.prefix along the lines of facet.prefix:charlotte_2007 OR 
sanfrancisco_2007 or facet.prefix:charlotte_2007 OR charlotte_2006 to 
return more complex facet sets. As far as I can tell looking at the docs 
this won't work.


Is this possible using the existing facet.prefix functionality? Anyone 
have a better idea of how I should accomplish this?
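
One possible workaround, sketched here under assumptions and not a feature of facet.prefix itself, is to apply the prefix list on the client: either request the facet without facet.prefix and filter the returned values, or issue one request per prefix and merge the results. MultiPrefixFacets below is a hypothetical helper showing the filtering step over a map of facet values to counts.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper: keeps only the facet values that start
// with one of the prefixes the user is allowed to see.
final class MultiPrefixFacets {
    static Map<String, Integer> filterByPrefixes(
            Map<String, Integer> facetCounts, Collection<String> prefixes) {
        // LinkedHashMap preserves the order in which the facets were returned.
        Map<String, Integer> kept = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : facetCounts.entrySet()) {
            for (String prefix : prefixes) {
                if (e.getKey().startsWith(prefix)) {
                    kept.put(e.getKey(), e.getValue());
                    break;
                }
            }
        }
        return kept;
    }
}
```

Filtering on the client means the unfiltered facet list has to be fetched first, which may be expensive for a field with many values. Running one facet.prefix request per allowed prefix avoids that at the cost of extra round trips.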


Thanks,
steve berry
American City Business Journals







  


uriEncoding for solr in glassfish

2008-03-13 Thread Peter Cline

Hi all,

This is a little off-topic, so I apologize.  I asked a question not too 
long ago about uri encoding problems, and got a quick and accurate 
response, so I thought I would try again.


I need to pass utf-8 encoded characters to solr instances, so I need the 
uri encoding to be done in UTF-8.  In tomcat, this was accomplished by 
setting an attribute of the Connector (thanks Nicholas and Yonik).  
We're considering moving from tomcat to Glassfish (for various reasons), 
so I'm trying to get this working there as well.  I found a very similar 
setting, setting the uriEncoding property in the http-listener, but it's 
not seeming to have any effect--solr is getting garbled strings.


So, in effect, my question is this: has anybody used solr in glassfish 
and had to address this problem?


Seems unlikely, but it's worth a shot. 


Thanks,
Peter


Re: Accented search

2008-03-11 Thread Peter Cline
I'm not sure about a way to boost scores in this case, but you can 
achieve the basic matching by applying a filter to the index and the 
queries.  The ISOLatin1AccentFilter seems like it may work for you, 
though I'm not entirely certain if that will cover all the accent 
characters you need.


My approach has been to write new filters, one to normalize the unicode 
into the decomposed version, then one to manually strip out all of the 
add-on characters (with decimal codepoint greater than 256).  I don't 
know if this will always work, but it's worked well for me so far.
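
As a rough illustration of those two steps outside of Solr's filter API, the sketch below decomposes the text with java.text.Normalizer and then drops every code point above 256. AccentFolding is a hypothetical class name; a real Solr filter would wrap the same logic in a TokenFilter.

```java
import java.text.Normalizer;

// Hypothetical standalone version of the two-step folding described above.
final class AccentFolding {
    static String fold(String input) {
        // Step 1: canonical decomposition, so each accented letter becomes a
        // base letter followed by its combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop every code point above 256, which removes the marks.
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            if (cp <= 256) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "L\u1EADp Tr\u00ECnh Vi\u00EAn" is the accented title from the question.
        System.out.println(fold("L\u1EADp Tr\u00ECnh Vi\u00EAn"));
    }
}
```

One caveat with the 256 cutoff: a letter with no decomposition, such as the Vietnamese đ (U+0111), is removed entirely instead of being mapped to d, so it would need a separate mapping.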


I would test out adding a <filter class="solr.ISOLatin1AccentFilterFactory"/> 
to your analyzer.  It might do the trick.  Once again, with this 
approach I'm not sure how to boost either score, so someone else may 
have better ideas.  I'm pretty new to all of this stuff.


Peter

climbingrose wrote:

Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google did with UTF-8 languages.

My requirements include:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters Lập Trình Viên, then Doc B is also matched and Lập
Trình Viên is highlighted.
  On the other hand, if the query is Lap Trinh Vien, Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters Lập Trình Viên, then Doc A should be given a higher
score than Doc B.
  if the query is Lap Trinh Vien, Doc B should be given a higher score.

Any ideas guys? Thanks in advance!

  


Re: Illegal xml/html character; unicode problems near solr

2008-03-07 Thread Peter Cline

Nicolas and Yonik,

Thank you both for your excellent responses--this fixed my problem.  Now 
it's time to go back and remove all the hacks I was using to pin this 
thing together without proper utf-8 support. 


Thanks again,
Peter

[EMAIL PROTECTED] wrote:

I think Tomcat defaults to the operating system default, e.g. cp1252 on a
classic windows.

You need to add an attribute URIEncoding="UTF-8" to the Connector you use in
the server.xml conf.

Nicolas

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Friday, March 7, 2008 18:53
To: solr-user@lucene.apache.org
Subject: Re: Illegal xml/html character; unicode problems near solr

On Fri, Mar 7, 2008 at 12:30 PM, Peter Cline [EMAIL PROTECTED] wrote:
  

 The following is a snippet of a link to use a facet:
 search-faceted.html?q=[* TO
 *]&amp;facet=true&amp;rows=25&amp;fq=name_facet:&#34;Brasseur de
 Bourbourg, abb%C3%A9, 1814-1874, former owner&#34;

 These characters are correctly specified. When it returns, I get an
 illegal character error. Examining the XML, I get an fq value of:
 name_facet:Brasseur de Bourbourg, abbÃÂ(c), 1814-1874, former owner



Is this bad XML part of the responseHeader (parameters that are simply
being echoed back)?
If so, it's most likely the config on whatever servlet container you
are using... you need to configure it to accept UTF-8 URLs rather than
latin-1 (Tomcat defaults to the old-style latin-1 AFAIK)

-Yonik
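
The effect described in this thread can be reproduced without a servlet container. The sketch below is only an illustration (UriDecodingDemo is a made-up name): it decodes the percent-encoded fragment from the link above once as UTF-8 and once as ISO-8859-1, which is what a container configured for latin-1 URLs effectively does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo: the same percent-encoded bytes decoded with two charsets.
final class UriDecodingDemo {
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "abb%C3%A9";
        // As UTF-8, the two bytes C3 A9 form one character: "abb\u00E9".
        System.out.println(decode(encoded, "UTF-8"));
        // As ISO-8859-1, each byte becomes its own character: "abb\u00C3\u00A9".
        System.out.println(decode(encoded, "ISO-8859-1"));
    }
}
```

The second result is the kind of garbling that then gets echoed back in the response, and escaping it again on the way out can mangle it further.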