Re: Link to download solr4.0 is not working?

2010-11-11 Thread Shawn Heisey

On 11/11/2010 7:44 PM, Deche Pangestu wrote:

Hello,
Does anyone know where to download solr4.0 source?
I tried downloading from this page:
http://wiki.apache.org/solr/FrontPage#solr_development
but the link is not working...


Your best bet is to use svn.
http://lucene.apache.org/solr/version_control.html

For Solr 4.0, you need to check out trunk:
http://svn.apache.org/repos/asf/lucene/dev/trunk

For Solr 3.1, you'd use branch_3x:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
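
For example, to grab trunk (the destination directory name is up to you):

svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene-trunk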

Shawn



Re: Best practices to rebuild index on live system

2010-11-11 Thread Shawn Heisey


On 11/11/2010 4:45 PM, Robert Gründler wrote:

So far, i can only think of 2 scenarios for rebuilding the index, if we need to 
update the schema after the rollout:

1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After 
importing, switch the application to cores A1, B1, C1

This will most likely cause a corrupt index, as in the 1.5 hours of indexing, 
the database might get inserts/updates/deletes.

2. Put the live system in a read-only mode and rebuild the index during that 
time. This will ensure data integrity in the index, with the drawback that 
users are not able to write to the app.


I can tell you how we handle this.  The actual build system is more 
complicated than I have mentioned here, involving replication and error 
handling, but this is the basic idea.  This isn't the only possible 
approach, but it does work.


I have 6 main static shards and one incremental shard, each on their own 
machine (Xen VM, actually).  Data is distributed by taking the Did value 
(primary key in the database) and doing a "mod 6" on it, the resulting 
value is the static shard number.


The system tracks two values at all times - minDid and maxDid.  The 
static shards have Did values <= minDid.  The incremental is > minDid 
and <= maxDid.  Once an hour, I write the current Did value to an RRD.  
Once a day, I use that RRD to figure out the Did value corresponding to 
one week ago.  All documents > minDid and <= newMinDid are 
delta-imported into the static indexes and deleted from the incremental 
index, and minDid is updated.
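
A sketch of that routing rule (hypothetical code, variable names assumed, not 
the actual build system):

public class ShardRouter {
  // Which of the 6 static shards owns a given Did (the database primary key).
  static int staticShardFor(long did) {
    return (int) (did % 6);
  }
  // Documents with Did > minDid and <= maxDid live in the incremental shard.
  static boolean isIncremental(long did, long minDid, long maxDid) {
    return did > minDid && did <= maxDid;
  }
}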


When it comes time to rebuild, I first rebuild the static indexes in a 
core named "build" which takes 5-6 hours.  When that's done, I rebuild 
the incremental in its build core, which only takes about 10 minutes.  
Then on all the machines, I swap the build and live cores.  While all 
the static builds are happening, the incremental continues to get new 
content, until it too is rebuilt.
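
(For reference, the core swap can be done with the CoreAdmin SWAP action; the 
host and core names here are assumed:

http://localhost:8983/solr/admin/cores?action=SWAP&core=build&other=live )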


Shawn



Looking for help with Solr implementation

2010-11-11 Thread AC
Hi,

Not sure if this is the correct place to post but I'm looking for someone to 
help finish a Solr install on our LAMP based website.  This would be a paid 
project.  


The programmer that started the project got too busy with his full-time job to 
finish the project.  Solr has been installed and a basic search is working but 
we need to configure it to work across the site and also set up faceted 
search.  I tried posting on some popular freelance sites but haven't been able 
to find anyone with real Solr expertise / experience.


If you think you can help me with this project please let me know and I can 
supply more details.  


Regards,

Abe


  

Re: Rollback can't be done after committing?

2010-11-11 Thread gengshaoguang
Oh, Pardeep:
I don't think Lucene is an advanced storage engine that supports rollback to a 
historical checkpoint (that would be supported only in distributed systems, such 
as two-phase commit or transactional web services).
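
A minimal SolrJ sketch of the semantics (server URL and field name assumed): 
rollback only discards changes that have not yet been committed.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RollbackDemo {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "demo-1");  // field name assumed
    server.add(doc);
    server.rollback();             // discards the pending add: it was never committed
    server.add(doc);
    server.commit();               // permanent: rollback() cannot undo it now
  }
}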

yours

On Friday, November 12, 2010 11:25:45 am Pradeep Singh wrote:
> In some cases you can rollback to a named checkpoint. I am not too sure but
> I think I read in the lucene documentation that it supported named
> checkpointing.
> 
> On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang 
wrote:
> > Hi, Kouta:
> > No data store supports rollback AFTER commit; rollback works
> > only BEFORE.
> > 
> > On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote:
> > > Hi, all
> > > 
> > > I have a question about Solr and SolrJ's rollback.
> > > 
> > > I try to rollback like below
> > > 
> > > try{
> > > server.addBean(dto);
> > > server.commit();
> > > }catch(Exception e){
> > > 
> > >  if (server != null) { server.rollback();}
> > > 
> > > }
> > > 
> > > I expect that if any Exception is thrown, the "rollback" process runs, so
> > > none of the data would be updated.
> > > 
> > > But once committed, the rollback does not take effect.
> > > 
> > > Does rollback only work correctly when "commit" has not yet run?
> > > 
> > > Is Solr and SolrJ's rollback not the same as an RDB's rollback?



RE: importing from java

2010-11-11 Thread Eric Martin
http://wiki.apache.org/solr/DIHQuickStart
http://wiki.apache.org/solr/DataImportHandlerFaq
http://wiki.apache.org/solr/DataImportHandler
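
If DIH doesn't fit, a minimal SolrJ sketch for pushing an Iterator of Java 
objects directly (MyApi/MyRecord stand in for the restricted .jar's API, and 
the field names are assumed):

import java.util.Iterator;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Importer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    Iterator<MyRecord> it = MyApi.getRecords();  // hypothetical .jar API
    while (it.hasNext()) {
      MyRecord r = it.next();
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", r.getId());             // field mappings assumed
      doc.addField("title", r.getTitle());
      server.add(doc);
    }
    server.commit();
  }
}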


-Original Message-
From: Tri Nguyen [mailto:tringuye...@yahoo.com] 
Sent: Thursday, November 11, 2010 9:34 PM
To: solr-user@lucene.apache.org
Subject: Re: importing from java

another question is, can I write my own DataImportHandler class?

thanks,

Tri





From: Tri Nguyen 
To: solr user 
Sent: Thu, November 11, 2010 7:01:25 PM
Subject: importing from java

Hi,

I'm restricted to the following in regards to importing.

I have access to a list (Iterator) of Java objects I need to import into
solr.

Can I import the java objects as part of solr's data import interface (whenever 
an http request tells solr to do a dataimport, it'll call my java class to get 
the objects)?


Before, I had direct read-only access to the db and specified the column 
mappings, and things were fine with the data import.


But now I am restricted to using a .jar file that has an api to get the
records in the database, and I need to publish these records from the db to
Solr.  I do see solrj, but solrj is separate from the solr webapp.

Can I write my own dataimporthandler?

Thanks,

Tri



Re: importing from java

2010-11-11 Thread Tri Nguyen
another question is, can I write my own DataImportHandler class?

thanks,

Tri





From: Tri Nguyen 
To: solr user 
Sent: Thu, November 11, 2010 7:01:25 PM
Subject: importing from java

Hi,

I'm restricted to the following in regards to importing.

I have access to a list (Iterator) of Java objects I need to import into solr.

Can I import the java objects as part of solr's data import interface (whenever 
an http request tells solr to do a dataimport, it'll call my java class to get 
the objects)?


Before, I had direct read-only access to the db and specified the column 
mappings, and things were fine with the data import.


But now I am restricted to using a .jar file that has an api to get the records 
in the database, and I need to publish these records from the db to Solr.  I do 
see solrj, but solrj is separate from the solr webapp.

Can I write my own dataimporthandler?

Thanks,

Tri

A Newbie Question

2010-11-11 Thread K. Seshadri Iyer
Hi,

Pardon me if this sounds very elementary, but I have a very basic question
regarding Solr search. I have about 10 storage devices running Solaris with
hundreds of thousands of text files (there are other files, as well, but my
target is these text files). The directories on the Solaris boxes are
exported and are available as NFS mounts.

I have installed Solr 1.4 on a Linux box and have tested the installation,
using curl to post documents. However, the manual says that curl is not the
recommended way of posting documents to Solr. Could someone please tell me
what the preferred approach is in such an environment? I am not a programmer
and would appreciate some hand-holding here :o)
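
(For reference, the curl posting referred to above typically looks like this, 
with host and file assumed:

curl http://localhost:8983/solr/update -H 'Content-type: text/xml' --data-binary @docs.xml

followed by a commit. The post.jar / post.sh tools shipped in the Solr example 
directory wrap the same call.)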

Thanks in advance,

Sesh


Re: Rollback can't be done after committing?

2010-11-11 Thread Pradeep Singh
In some cases you can rollback to a named checkpoint. I am not too sure but
I think I read in the lucene documentation that it supported named
checkpointing.

On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang wrote:

> Hi, Kouta:
> No data store supports rollback AFTER commit; rollback works only
> BEFORE.
>
> On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote:
> > Hi, all
> >
> > I have a question about Solr and SolrJ's rollback.
> >
> > I try to rollback like below
> >
> > try{
> > server.addBean(dto);
> > server.commit();
> > }catch(Exception e){
> >  if (server != null) { server.rollback();}
> > }
> >
> > I expect that if any Exception is thrown, the "rollback" process runs, so
> > none of the data would be updated.
> >
> > But once committed, the rollback does not take effect.
> >
> > Does rollback only work correctly when "commit" has not yet run?
> >
> > Is Solr and SolrJ's rollback not the same as an RDB's rollback?
>
>


Re: Rollback can't be done after committing?

2010-11-11 Thread gengshaoguang
Hi, Kouta:
No data store supports rollback AFTER commit; rollback works only 
BEFORE.

On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote:
> Hi, all
> 
> I have a question about Solr and SolrJ's rollback.
> 
> I try to rollback like below
> 
> try{
> server.addBean(dto);
> server.commit();
> }catch(Exception e){
>  if (server != null) { server.rollback();}
> }
> 
> I expect that if any Exception is thrown, the "rollback" process runs, so
> none of the data would be updated.
> 
> But once committed, the rollback does not take effect.
> 
> Does rollback only work correctly when "commit" has not yet run?
> 
> Is Solr and SolrJ's rollback not the same as an RDB's rollback?



importing from java

2010-11-11 Thread Tri Nguyen
Hi,

I'm restricted to the following in regards to importing.

I have access to a list (Iterator) of Java objects I need to import into solr.

Can I import the java objects as part of solr's data import interface (whenever 
an http request tells solr to do a dataimport, it'll call my java class to get 
the objects)?


Before, I had direct read-only access to the db and specified the column 
mappings, and things were fine with the data import.


But now I am restricted to using a .jar file that has an api to get the records 
in the database, and I need to publish these records from the db to Solr.  I do 
see solrj, but solrj is separate from the solr webapp.

Can I write my own dataimporthandler?

Thanks,

Tri

Link to download solr4.0 is not working?

2010-11-11 Thread Deche Pangestu
Hello,
Does anyone know where to download solr4.0 source?
I tried downloading from this page:
http://wiki.apache.org/solr/FrontPage#solr_development
but the link is not working...


Best,
Deche


Looking for help with Solr implementation

2010-11-11 Thread AC
Hi,


Not sure if this is the correct place to post but I'm looking for someone to 
help finish a Solr install on our LAMP based website.  This would be a paid 
project.  


The programmer that started the project got too busy with his full-time job to 
finish the project.  Solr has been installed and a basic search is working but 
we need to configure it to work across the site and also set up faceted 
search.  I tried posting on some popular freelance sites but haven't been able 
to find anyone with real Solr expertise / experience.


If you think you can help me with this project please let me know and I can 
supply more details.  


Regards


  

Re: Boosting

2010-11-11 Thread Shalin Shekhar Mangar
On Thu, Nov 11, 2010 at 10:35 AM, Solr User  wrote:
> Hi,
>
> I have a question about boosting.
>
> I have the following fields in my schema.xml:
>
> 1. title
> 2. description
> 3. ISBN
>
> etc
>
> I want to boost the field title. I tried index time boosting but it did not
> work. I also tried Query time boosting but with no luck.
>
> Can someone help me on how to implement boosting on a specific field like
> title?
>

If you use index time boosting, you have to restart Solr and re-index
the documents after making the change to the schema.xml. For debugging
problems with query-time boosting, append debugQuery=on as a request
parameter to see the parsed query and scoring information.
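
For example, a query-time boost on the title field looks like this (description 
field assumed):

/select?q=title:(little bird)^10 description:(little bird)&debugQuery=on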

-- 
Regards,
Shalin Shekhar Mangar.


Re: index just new articles from rss feeds - Data Import Request Handler

2010-11-11 Thread Shalin Shekhar Mangar
On Thu, Nov 11, 2010 at 8:21 AM, Matteo Moci  wrote:
> Hello,
> I'd like to use solr to index some documents coming from an rss feed,
> like the example at [1], but it seems that the configuration used
> there is just for a one-time indexing, trying to get all the articles
> exposed in the rss feed of the website.
>
> Is it possible to manage and index just the new articles coming from
> the rss source?
>

Each item in an RSS feed has a publishing date which you can use to
ingest only the new articles.

> I found that maybe the delta-import can be useful but, from what I understand,
> the delta-import is used to just update the index with contents of
> documents that have been modified since the last indexing:
> this is obviously useful, but I'd like to index just the new articles
> coming from an rss feed.
>
> Is it something managed automatically by solr or I have to deal with
> it in a separate way? Maybe a full import with &clean=false
> parameters?
> Are there any solutions that you would suggest?
> Maybe storing the article feeds in a table like [2] and have a module
> that periodically sends each row to solr for indexing it?
>

The RSS import example is more of a proof-of-concept that it can be
done; it may not be the best way to do it, though. Storing the article
feeds in a table is essential if you have multiple ones. You can use a
parent entity for the table and a child entity to make the actual http
calls to the RSS. Be sure to use onError="continue" so that a bad RSS
feed does not stop the whole process. It will probably work fine for a
handful of feeds but if you are looking to develop a large feed
ingestion system, I'd suggest looking into alternate methods.
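
A minimal sketch of that parent/child wiring (table, column, and dataSource 
names are assumed; the child entity needs an HTTP/URL data source):

<entity name="feed" query="SELECT url FROM feeds">
  <entity name="article" pk="link" dataSource="rss"
          processor="XPathEntityProcessor" url="${feed.url}"
          forEach="/rss/channel/item" onError="continue">
    <field column="title" xpath="/rss/channel/item/title"/>
    <field column="link"  xpath="/rss/channel/item/link"/>
    <field column="date"  xpath="/rss/channel/item/pubDate"/>
  </entity>
</entity>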

-- 
Regards,
Shalin Shekhar Mangar.


Re: Spatial search in Solr 1.5

2010-11-11 Thread Scott K
I just upgraded to a later version of the trunk and noticed my
geofilter queries stopped working, apparently because the sfilt
function was renamed to geofilt.
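
For reference, the renamed filter is used like this (location field and point 
assumed):

fq={!geofilt sfield=store pt=45.15,-93.85 d=5}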

I realize trunk is not stable, but other than looking at every change,
is there an easy way to find changes that are not backward compatible
so developers know what they need to update when upgrading?

Thanks, Scott

On Tue, Oct 12, 2010 at 17:42, Yonik Seeley  wrote:
> On Tue, Oct 12, 2010 at 8:07 PM, PeterKerk  wrote:
>>
>> Ok, so does this actually say:
>> for now you have to do calculations based on bounding box instead of great
>> circle?
>
> I tried to make the documentation a little simpler... there's
>  - geofilt... filters within a radius of "d" km  (i.e. "great circle 
> distance")
>  - bbox... filters using a bounding box
>  - geodist... function query that yields the distance (again, "great
> circle distance")
>
> If you point out the part to the docs you found confusing, I can try
> and improve it.
> Did you try and step through the quick start?  Those links actually work!
>
>> And the fact that on top of the page it says "Solr4.0", does that imply I
>> cant use this right now? Or where could I find the latest trunk for this?
>
> The wiki says "If you haven't already, get a recent nightly build of 
> Solr4.0"...
> and links to the Solr4.0 page, which points to
> http://wiki.apache.org/solr/FrontPage#solr_development
> for nightly builds.
>
> -Yonik
>
> http://www.lucidimagination.com
>


Re: Best practices to rebuild index on live system

2010-11-11 Thread Erick Erickson
If by "corrupt index" you mean an index that's just not quite
up to date, could you do a delta import? In other words, how
do you make our Solr index reflect changes to the DB even
without a schema change? Could you extend that method
to handle your use case?

So the scenario is something like this:
Record the time
rebuild the index
import all changes since you recorded the original time.
switch cores or replicate.
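
A rough sketch of that sequence in DIH/CoreAdmin terms (core names and handler 
paths assumed):

/solr/build/dataimport?command=full-import                 <- rebuild into a "build" core
/solr/build/dataimport?command=delta-import&clean=false    <- catch up changes made meanwhile
/solr/admin/cores?action=SWAP&core=build&other=live        <- swap build and live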

Best
Erick

2010/11/11 Robert Gründler 

> Hi again,
>
> we're coming closer to the rollout of our newly created solr/lucene based
> search, and i'm wondering
> how people handle changes to their schema on live systems.
>
> In our case, we have 3 cores (ie. A,B,C), where the largest one takes about
> 1.5 hours for a full dataimport from the relational
> database. The Index is being updated in realtime, through post
> insert/update/delete events in our ORM.
>
> So far, i can only think of 2 scenarios for rebuilding the index, if we
> need to update the schema after the rollout:
>
> 1. Create 3 more cores (A1,B1,C1) - Import the data from the database -
> After importing, switch the application to cores A1, B1, C1
>
> This will most likely cause a corrupt index, as in the 1.5 hours of
> indexing, the database might get inserts/updates/deletes.
>
> 2. Put the Livesystem in a Read-Only mode and rebuild the index during that
> time. This will ensure data integrity in the index, with the drawback for
> users not being
> able to write to the app.
>
> Does Solr provide any built-in approaches to this problem?
>
>
> best
>
> -robert
>
>
>
>


Re: Best practices to rebuild index on live system

2010-11-11 Thread Jonathan Rochkind
You can do a similar thing to your case #1 with Solr replication, 
handling a lot of the details for you instead of you manually switching 
cores and such. Index to a new core, then tell your production solr to 
be a slave replicating from that master new core. It still may have some 
of the same downsides as your scenario #1; it's essentially the same 
thing, but with Solr replication taking care of some of the nuts and 
bolts for you.
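
A minimal sketch of the slave side in solrconfig.xml (master host and core 
name assumed):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://buildhost:8983/solr/corename/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>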


I haven't heard of any better solutions. In general, Solr seems not 
really so great at use cases where the index changes frequently in 
response to user actions; it doesn't seem to really have been designed 
that way.


You could store all your user-created data in an external store (rdbms 
or no-sql), as well as indexing it, and then when you rebuild the index 
you can get it all from there, so you won't lose any. To get along with 
Solr's assumptions, it often works best never to treat a Solr index as 
the canonical storage location of any data -- Solr isn't really designed 
to be storage, it's designed to be an index. Always keep the canonical 
copy of any data in some actual store, with Solr just being an index. 
That approach tends to make it 
easier to work out things like this, although there can still be some 
tricks. (Like, after you're done building your new index, but before you 
replicate it to production, you might have to check the actual canonical 
store for any data that changed in between the time you started your 
re-index and now -- and then re-index that. And then any data that 
changed between the time your second re-index began and... this could go 
on forever. )


Robert Gründler wrote:

Hi again,

we're coming closer to the rollout of our newly created solr/lucene based 
search, and i'm wondering
how people handle changes to their schema on live systems. 


In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 
hours for a full dataimport from the relational
database. The Index is being updated in realtime, through post 
insert/update/delete events in our ORM.

So far, i can only think of 2 scenarios for rebuilding the index, if we need to 
update the schema after the rollout:

1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After 
importing, switch the application to cores A1, B1, C1

This will most likely cause a corrupt index, as in the 1.5 hours of indexing, 
the database might get inserts/updates/deletes.

2. Put the Livesystem in a Read-Only mode and rebuild the index during that 
time. This will ensure data integrity in the index, with the drawback for users 
not being
able to write to the app.

Does Solr provide any built-in approaches to this problem?


best

-robert




  


Best practices to rebuild index on live system

2010-11-11 Thread Robert Gründler
Hi again,

we're coming closer to the rollout of our newly created solr/lucene based 
search, and i'm wondering
how people handle changes to their schema on live systems. 

In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 
hours for a full dataimport from the relational
database. The Index is being updated in realtime, through post 
insert/update/delete events in our ORM.

So far, i can only think of 2 scenarios for rebuilding the index, if we need to 
update the schema after the rollout:

1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After 
importing, switch the application to cores A1, B1, C1

This will most likely cause a corrupt index, as in the 1.5 hours of indexing, 
the database might get inserts/updates/deletes.

2. Put the live system in a read-only mode and rebuild the index during that 
time. This will ensure data integrity in the index, with the drawback that 
users are not able to write to the app.

Does Solr provide any built-in approaches to this problem?


best

-robert





Re: WELCOME to solr-user@lucene.apache.org

2010-11-11 Thread Ramavtar Meena
Hi,

If you are looking for query time boosting on title field you can do
the following:
/select?q=title:android^10

Also unless you have a very good reason to use string for date data
(in your case pubdate and reldate), you should be using
solr.DateField.

regards,
Ram
On Fri, Nov 12, 2010 at 3:41 AM, Ahmet Arslan  wrote:
> There are several mistakes in your approach:
>
> copyField just copies data. Index time boost is not copied.
>
> There is no such boosting syntax. /select?q=Each&title^9&fl=score
>
> You are searching on your default field.
>
> This is not the cause of your problem, but omitNorms="true" disables index 
> time boosts.
>
> http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need.
>
>
> --- On Thu, 11/11/10, Solr User  wrote:
>
>> From: Solr User 
>> Subject: Re: WELCOME to solr-user@lucene.apache.org
>> To: solr-user@lucene.apache.org
>> Date: Thursday, November 11, 2010, 11:54 PM
>> Eric,
>>
>> Thank you so much for the reply and apologize for not
>> providing all the
>> details.
>>
>> The following are the field definitions in my schema.xml:
>>
>> [field definitions stripped by the mailing list archive]
>>
>> Copy Fields:
>>
>> [copyField definitions stripped by the mailing list archive]
>>
>> [defaultSearchField: searchFields]
>>
>> Before creating the indexes I feed XML file to the Solr job
>> to create index
>> files. I added Boost attribute to the title field before
>> creating indexes
>> and an example is below:
>>
>> [example <add> documents stripped by the archive; the title field carried an
>> index-time boost, e.g. <field boost="10.0" name="title">Each Little Bird
>> That Sings</field>]

Re: EdgeNGram relevancy

2010-11-11 Thread Jonathan Rochkind
Without the parens, the "edgytext:" applied only to "Mr"; the default 
field still applied to "Scorsese".


The double quotes are necessary in the second case (rather than parens) 
because, on a non-tokenized field, the standard query parser will 
"pre-tokenize" on whitespace before sending the individual 
whitespace-separated words to match against the index. If the index 
contains multi-word tokens with internal whitespace, they will never 
match. With a quoted phrase, though, the query parser doesn't 
"pre-tokenize" like this; it passes the whole phrase to the index intact.


Robert Gründler wrote:

Did you run your query without using () and "" operators? If yes can you try 
this?
&q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0



I didn't use () and "" in my query before. Using the query with those operators
works now, and stopwords are thrown out as they should be, thanks.

However, i don't understand how the () and "" operators affect the 
StopWordFilter.

Could you give a brief explanation for the above example?

thanks!


-robert





  


Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
> 
> Did you run your query without using () and "" operators? If yes can you try 
> this?
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators
works now, and stopwords are thrown out as they should be, thanks.

However, i don't understand how the () and "" operators affect the 
StopWordFilter.

Could you give a brief explanation for the above example?

thanks!


-robert






Re: facet+shingle in autosuggest

2010-11-11 Thread Lukas Kahwe Smith

On 11.11.2010, at 17:42, Erick Erickson wrote:

> I don't know all the implications here, but can't you just
> insert the StopwordFilterFactory before the ShingleFilterFactory
> and turn it loose?


haven't tried this, but i would suspect that i would then get in trouble with 
stuff like "united states of america". it would then generate a shingle with 
"united states america", which in turn wouldn't generate a proper phrase search 
string.

one option of course would be to restrict the shingles to 2 words, and then 
using the stop word filter would work as expected; see the sketch below.
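
a sketch of that 2-word variant (untested, field wiring omitted):

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
          outputUnigrams="true" outputUnigramIfNoNgram="false"/>
</analyzer>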

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: using CJKTokenizerFactory for Japanese language

2010-11-11 Thread Koji Sekiguchi

(10/11/12 1:49), Kumar Pandey wrote:

I am exploring support for Japanese language in solr.
Solr seems to provide CJKTokenizerFactory.
How useful is this module? Has anyone been using this in production for
Japanese language?


CJKTokenizer is used in a lot of places in Japan.
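
For reference, a minimal way to wire it into schema.xml (field type name 
assumed):

<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>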


One shortfall it seems to have, from what I have been able to read up on, is
that it can generate a lot of false matches. For example, matching kyoto when
searching for tokyo, etc.


Yep, it is a well-known problem.


I did not see many questions related to this module so I wonder if people
are actively using it.
If not, are there any other solutions on the market that are recommended by
solr users?


You may want to look at morphological analyzers. There are several of them in 
Japan; search for MeCab, Sen, or GoSen on Google. Or in Lucene, there is a 
patch for a morphological-style analyzer:

https://issues.apache.org/jira/browse/LUCENE-2522

Koji

--
http://www.rondhuit.com/en/


Re: facet+shingle in autosuggest

2010-11-11 Thread Erick Erickson
I don't know all the implications here, but can't you just
insert the StopwordFilterFactory before the ShingleFilterFactory
and turn it loose?

Best
Erick

On Thu, Nov 11, 2010 at 4:02 PM, Lukas Kahwe Smith wrote:

> Hi,
>
> I am using a facet.prefix search with shingles in my autosuggest:
>
> [field and analyzer definition stripped by the archive; it declares
> positionIncrementGap="100" stored="false" multiValued="true" and a
> ShingleFilterFactory with maxShingleSize="3" outputUnigrams="true"
> outputUnigramIfNoNgram="false"]
>
> Now I would like to prevent stop words from appearing in the suggestions:
>
> [facet output stripped by the archive; the term names are missing, only the
> counts survive: 52, 6, 6, 5, 25, 7]
>
> Here I would like to filter out the last 4 suggestions really. Is there a
> way I can sensibly bring in a stop word filter here? Actually in theory the
> stop words could appear as the first or second word as well.
>
> So I guess when producing shingle's I want to skip any stop word from being
> part of any shingle.
>
> regards,
> Lukas Kahwe Smith
> m...@pooteeweet.org
>
>
>
>


Re: WELCOME to solr-user@lucene.apache.org

2010-11-11 Thread Ahmet Arslan
There are several mistakes in your approach:

copyField just copies data. Index time boost is not copied.

There is no such boosting syntax. /select?q=Each&title^9&fl=score

You are searching on your default field. 

This is not the cause of your problem, but omitNorms="true" disables index-time 
boosts.

http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need.
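
For example (description field assumed):

/select?defType=dismax&qf=title^10 description&q=each little bird&debugQuery=on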


--- On Thu, 11/11/10, Solr User  wrote:

> From: Solr User 
> Subject: Re: WELCOME to solr-user@lucene.apache.org
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 11:54 PM
> Eric,
> 
> Thank you so much for the reply and apologize for not
> providing all the
> details.
> 
> The following are the field definitions in my schema.xml:
> 
> [field definitions stripped by the mailing list archive]
> 
> Copy Fields:
> 
> [copyField definitions stripped by the mailing list archive]
> 
> [defaultSearchField: searchFields]
> 
> Before creating the indexes I feed XML file to the Solr job
> to create index
> files. I added Boost attribute to the title field before
> creating indexes
> and an example is below:
> 
> [example <add> documents stripped by the archive; the title field carried an
> index-time boost, e.g. <field boost="10.0" name="title">Each Little Bird That
> Sings</field>]

Re: WELCOME to solr-user@lucene.apache.org

2010-11-11 Thread Solr User
Eric,

Thank you so much for the reply and apologize for not providing all the
details.

The following are the field definitions in my schema.xml:

[field definitions stripped by the mailing list archive]

Copy Fields:

[copyField definitions stripped by the mailing list archive]

[defaultSearchField: searchFields]

Before creating the indexes I feed an XML file to the Solr job to create the
index files. I added a boost attribute to the title field before creating the
indexes; an example is below:

[example <add> documents stripped by the archive; the title field carried an
index-time boost, e.g. <field boost="10.0" name="title">Each Little Bird That
Sings</field>]

I am trying to boost the title field so that the search results bring the
actual match on title up as the first item in the results. Adding a boost
attribute to the title field and index-time boosting did not change the
search results. I tried query-time boosting as well, as mentioned below, but
no luck:

/select?q=Each+Little+Bird+That+Sings&title^9&fl=score

Any help to fix this issue would be really helpful.

Thanks,

Solr User

On Thu, Nov 11, 2010 at 10:32 AM, Solr User wrote:
> Hi,
>
> I have a question about boosting.
>
> I have the following fields in my schema.xml:
>
> 1. title
> 2. description
> 3. ISBN
>
> etc
>
> I want to boost the field title. I tried index time boosting but it did not
> work. I also tried Query time boosting but with no luck.
>
> Can someone help me on how to implement boosting on a specific field like
> title?
>
> Thanks,
> Solr User

Re: WELCOME to solr-user@lucene.apache.org

2010-11-11 Thread Erick Erickson
There's not much to go on here. Boosting works,
and index-time as opposed to query-time boosting
addresses two different needs. Could you add some
detail? All you've really said is "it didn't work", which
doesn't allow a very constructive response.

Perhaps you could review:
http://wiki.apache.org/solr/HowToContribute

Best
Erick



On Thu, Nov 11, 2010 at 10:32 AM, Solr User  wrote:

> Hi,
>
> I have a question about boosting.
>
> I have the following fields in my schema.xml:
>
> 1. title
> 2. description
> 3. ISBN
>
> etc
>
> I want to boost the field title. I tried index time boosting but it did not
> work. I also tried Query time boosting but with no luck.
>
> Can someone help me on how to implement boosting on a specific field like
> title?
>
> Thanks,
> Solr User
>
>
>


Re: EdgeNGram relevancy

2010-11-11 Thread Andy
Ah I see. Thanks for the explanation.

Could you set the defaultOperator to "AND"? That way both "Bill" and "Cl" must 
match, and that would exclude "Clyde Phillips".
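(In schema.xml that would be <solrQueryParser defaultOperator="AND"/>; note it 
applies to every query handled by the standard parser, not just this field.)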


--- On Thu, 11/11/10, Robert Gründler  wrote:

> From: Robert Gründler 
> Subject: Re: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 3:51 PM
> according to the fieldtype i posted
> previously, i think it's because of:
> 
> 1. WhiteSpaceTokenizer splits the String "Clyde Phillips"
> into 2 tokens: "Clyde" and "Phillips"
> 2. EdgeNGramFilter gets the 2 tokens, and creates an
> EdgeNGram for each token: "C" "Cl" "Cly"
> ...   AND  "P" "Ph" "Phi" ...
> 
> The Query String "Bill Cl" gets split up in 2 Tokens "Bill"
> and "Cl" by the WhitespaceTokenizer.
> 
> This creates a match between the 2nd token "Cl" of the query,
> and one of the "sub"tokens the EdgeNGramFilter created:
> "Cl".
> 
> 
> -robert
> 
> 
> 
> 
> On Nov 11, 2010, at 21:34 , Andy wrote:
> 
> > Could anyone help me understand why "Clyde
> > Phillips" appears in the results for "Bill Cl"??
> > 
> > "Clyde Phillips" doesn't produce any EdgeNGram that
> would match "Bill Cl", so why is it even in the results?
> > 
> > Thanks.
> > 
> > --- On Thu, 11/11/10, Ahmet Arslan 
> wrote:
> > 
> >> You can add an additional field, with
> >> using KeywordTokenizerFactory instead of
> >> WhitespaceTokenizerFactory. And query both these
> fields with
> >> an OR operator. 
> >> 
> >> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> >> 
> >> You can even apply boost so that begins with
> matches comes
> >> first.
> >> 
> >> --- On Thu, 11/11/10, Robert Gründler 
> >> wrote:
> >> 
> >>> From: Robert Gründler 
> >>> Subject: EdgeNGram relevancy
> >>> To: solr-user@lucene.apache.org
> >>> Date: Thursday, November 11, 2010, 5:51 PM
> >>> Hi,
> >>> 
> >>> consider the following fieldtype (used for
> >>> autocompletion):
> >>> 
> >>> <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>             words="stopwords.txt" enablePositionIncrements="true"/>
> >>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
> >>>             replacement="" replace="all"/>
> >>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>             words="stopwords.txt" enablePositionIncrements="true"/>
> >>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
> >>>             replacement="" replace="all"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>> 
> >>> 
> >>> This works fine as long as the query string is
> a
> >> single
> >>> word. For multiple words, the ranking is
> weird
> >> though.
> >>> 
> >>> Example:
> >>> 
> >>> Query String: "Bill Cl"
> >>> 
> >>> Result (in that order):
> >>> 
> >>> - Clyde Phillips
> >>> - Clay Rogers
> >>> - Roger Cloud
> >>> - Bill Clinton
> >>> 
> >>> "Bill Clinton" should have the highest rank in
> that
> >>> case.  
> >>> 
> >>> Has anyone an idea how to to configure this
> fieldtype
> >> to
> >>> make matches in both tokens rank higher than
> those who
> >> match
> >>> in either token?
> >>> 
> >>> 
> >>> thanks!
> >>> 
> >>> 
> >>> -robert
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> >> 
> >> 
> > 
> > 
> > 
> 
> 





Re: problem with wildcard

2010-11-11 Thread Ahmet Arslan
> select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0
> > 
> > "wildcard queries are not analyzed" http://search-lucene.com/m/pnmlH14o6eM1/
> > 
> 
> Yeah I found out about this a couple of minutes after I
> posted my problem. If there is no analyzer then
> why is Solr not finding any documents when a single quote
> precedes the wildcard?


Probably your index analyzer (WordDelimiterFilterFactory) is eating that single 
quote. You can verify this at the admin/analysis.jsp page. In other words, there 
is no term beginning with (lowe') in your index. You can try searching for just 
lowe*


  


facet+shingle in autosuggest

2010-11-11 Thread Lukas Kahwe Smith
Hi,

I am using a facet.prefix search with shingles in my autosuggest:

[field and analyzer definition stripped by the archive; it uses a
ShingleFilterFactory with maxShingleSize="3" outputUnigrams="true"
outputUnigramIfNoNgram="false"]


Now I would like to prevent stop words from appearing in the suggestions:

[facet output stripped by the archive; the term names are missing, only the
counts survive: 52, 6, 6, 5, 25, 7]

Here I would like to filter out the last 4 suggestions really. Is there a way I 
can sensibly bring in a stop word filter here? Actually in theory the stop 
words could appear as the first or second word as well.

So I guess when producing shingle's I want to skip any stop word from being 
part of any shingle.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: problem with wildcard

2010-11-11 Thread Jean-Sebastien Vachon

On 2010-11-11, at 3:45 PM, Ahmet Arslan wrote:

>> I'm having some trouble with a query using some wildcard
>> and I was wondering if anyone could tell me why these two
>> similar queries do not return the same number of results.
>> Basically, the query I'm making should return all docs whose
>> title starts
>> (or contain) the string "lowe'". I suspect some analyzer is
>> causing this behaviour and I'd like to know if there is a
>> way to fix this problem.
>> 
>> 1)
>> select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0
> 
> "wildcard queries are not analyzed" http://search-lucene.com/m/pnmlH14o6eM1/
> 

Yeah, I found out about this a couple of minutes after I posted my problem. If 
there is no analyzer, then why is Solr not finding any documents when a single 
quote precedes the wildcard?


FAST ESP -> Solr migration webinar

2010-11-11 Thread Yonik Seeley
We're holding a free webinar on migration from FAST to Solr.  Details below.

-Yonik
http://www.lucidimagination.com

=
Solr To The Rescue: Successful Migration From FAST ESP to Open Source
Search Based on Apache Solr

Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT)
Hosted by SearchDataManagement.com

For anyone concerned about the future of their FAST ESP applications
since the purchase of Fast Search and Transfer by Microsoft in 2008,
this webinar will provide valuable insights on making the switch to
Solr.  A three-person roundtable will discuss factors driving the need
for FAST ESP alternatives, differences between FAST and Solr, a
typical migration project lifecycle & methodology, complementary open
source tools, best practices, customer examples, and recommended next
steps.

The speakers for this webinar are:

Helge Legernes, Founding Partner & CTO of Findwise
Michael McIntosh, VP Search Solutions for TNR Global
Eric Gaumer, Chief Architect for ESR Technology.

For more information and to register, please go to:

http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2
=


Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
according to the fieldtype i posted previously, i think it's because of:

1. WhiteSpaceTokenizer splits the String "Clyde Phillips" into 2 tokens: 
"Clyde" and "Phillips"
2. EdgeNGramFilter gets the 2 tokens, and creates an EdgeNGram for each token: 
"C" "Cl" "Cly" ...   AND  "P" "Ph" "Phi" ...

The Query String "Bill Cl" gets split up in 2 Tokens "Bill" and "Cl" by the 
WhitespaceTokenizer.

This creates a match between the 2nd token "Cl" of the query and one of the 
"sub"tokens the EdgeNGramFilter created: "Cl".


-robert




On Nov 11, 2010, at 21:34 , Andy wrote:

> Could anyone help me understand why "Clyde Phillips" appears in the 
> results for "Bill Cl"??
> 
> "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so 
> why is it even in the results?
> 
> Thanks.
> 
> --- On Thu, 11/11/10, Ahmet Arslan  wrote:
> 
>> You can add an additional field, with
>> using KeywordTokenizerFactory instead of
>> WhitespaceTokenizerFactory. And query both these fields with
>> an OR operator. 
>> 
>> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>> 
>> You can even apply boost so that begins with matches comes
>> first.
>> 
>> --- On Thu, 11/11/10, Robert Gründler 
>> wrote:
>> 
>>> From: Robert Gründler 
>>> Subject: EdgeNGram relevancy
>>> To: solr-user@lucene.apache.org
>>> Date: Thursday, November 11, 2010, 5:51 PM
>>> Hi,
>>> 
>>> consider the following fieldtype (used for
>>> autocompletion):
>>> 
>>> <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
>>>             replacement="" replace="all"/>
>>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
>>>             replacement="" replace="all"/>
>>>   </analyzer>
>>> </fieldType>
>>> 
>>> 
>>> This works fine as long as the query string is a
>> single
>>> word. For multiple words, the ranking is weird
>> though.
>>> 
>>> Example:
>>> 
>>> Query String: "Bill Cl"
>>> 
>>> Result (in that order):
>>> 
>>> - Clyde Phillips
>>> - Clay Rogers
>>> - Roger Cloud
>>> - Bill Clinton
>>> 
>>> "Bill Clinton" should have the highest rank in that
>>> case.  
>>> 
>>> Has anyone an idea how to to configure this fieldtype
>> to
>>> make matches in both tokens rank higher than those who
>> match
>>> in either token?
>>> 
>>> 
>>> thanks!
>>> 
>>> 
>>> -robert
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 



Re: problem with wildcard

2010-11-11 Thread Ahmet Arslan
> I'm having some trouble with a query using some wildcard
> and I was wondering if anyone could tell me why these two
> similar queries do not return the same number of results.
> Basically, the query I'm making should return all docs whose
> title starts
> (or contain) the string "lowe'". I suspect some analyzer is
> causing this behaviour and I'd like to know if there is a
> way to fix this problem.
> 
> 1)
> select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0

"wildcard queries are not analyzed" http://search-lucene.com/m/pnmlH14o6eM1/


  


Re: EdgeNGram relevancy

2010-11-11 Thread Andy
Could anyone help me understand why "Clyde Phillips" appears in the 
results for "Bill Cl"??

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so 
why is it even in the results?

Thanks.

--- On Thu, 11/11/10, Ahmet Arslan  wrote:

> You can add an additional field, with
> using KeywordTokenizerFactory instead of
> WhitespaceTokenizerFactory. And query both these fields with
> an OR operator. 
> 
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> 
> You can even apply boost so that begins with matches comes
> first.
> 
> --- On Thu, 11/11/10, Robert Gründler 
> wrote:
> 
> > From: Robert Gründler 
> > Subject: EdgeNGram relevancy
> > To: solr-user@lucene.apache.org
> > Date: Thursday, November 11, 2010, 5:51 PM
> > Hi,
> > 
> > consider the following fieldtype (used for
> > autocompletion):
> > 
> > <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
> >             replacement="" replace="all"/>
> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
> >             replacement="" replace="all"/>
> >   </analyzer>
> > </fieldType>
> > 
> > 
> > This works fine as long as the query string is a
> single
> > word. For multiple words, the ranking is weird
> though.
> > 
> > Example:
> > 
> > Query String: "Bill Cl"
> > 
> > Result (in that order):
> > 
> > - Clyde Phillips
> > - Clay Rogers
> > - Roger Cloud
> > - Bill Clinton
> > 
> > "Bill Clinton" should have the highest rank in that
> > case.  
> > 
> > Has anyone an idea how to to configure this fieldtype
> to
> > make matches in both tokens rank higher than those who
> match
> > in either token?
> > 
> > 
> > thanks!
> > 
> > 
> > -robert
> > 
> > 
> > 
> > 
> 
> 
> 
> 





Re: Retrieving indexed content containing multiple languages

2010-11-11 Thread Dennis Gearon
I look forward to the answers to this one.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Tod 
To: solr-user@lucene.apache.org
Sent: Thu, November 11, 2010 11:35:23 AM
Subject: Retrieving indexed content containing multiple languages

My Solr corpus is currently created by indexing metadata from a relational 
database as well as content pointed to by URLs from the database.  I'm using a 
pretty generic out-of-the-box Solr schema.  The search results are presented 
via an AJAX-enabled HTML page.

When I perform a search the document title (for example) has a mix of english 
and chinese characters.  Everything there is fine - I can see the english and 
chinese returned from a facet query on title.  I can search against the title 
using english words it contains and I get back an expected result.  I asked a 
chinese friend to perform the same search using chinese and nothing is returned.

How should I go about getting this search to work?  Chinese is just one 
language, I'll probably need to support more in the future.

My thought is that the chinese characters are indexed as their unicode 
equivalent so all I'll need to do is make sure the query is encoded 
appropriately and just perform a regular search as I would if the terms were in 
english.  For some reason that sounds too easy.

I see there is a CJK tokenizer that would help here.  Do I need that for my 
situation?  Is there a fairly detailed tutorial on how to handle these types of 
language challenges?


Thanks in advance - Tod



Re: EdgeNGram relevancy

2010-11-11 Thread Nick Martin

On 12 Nov 2010, at 01:46, Ahmet Arslan  wrote:

>> This setup now makes troubles regarding StopWords, here's
>> an example:
>> 
>> Let's say the index contains 2 Strings: "Mr Martin
>> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
>> list.
>> 
>> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>> 
>> This way, the only result i get is "Mr Martin Scorsese",
>> because the strict field edgytext2 is boosted by 2.0. 
>> 
>> Any idea why in this case "Martin Scorsese" is not in the
>> result at all?
> 
> Did you run your query without using () and "" operators? If yes can you try 
> this?
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
> 
> If no can you paste output of &debugQuery=on
> 
> 
> 

This would still not deal with the problem of removing stop words from the 
indexing and query analysis stages.

I really need something that will allow that and give a single token as in the 
example below.

Best

Nick

Re: Concatenate multiple tokens into one

2010-11-11 Thread Robert Gründler
this is the full source code, but be warned, i'm not a java developer, and i 
have no background in lucene/solr development:

// ConcatFilter

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ConcatFilter extends TokenFilter {

  public ConcatFilter(TokenStream input)
  {
    super(input);
  }

  @Override
  public Token next() throws IOException
  {
    Token token = new Token();
    StringBuilder builder = new StringBuilder();

    // Attribute views onto the wrapped stream; their contents change on
    // each call to input.incrementToken().
    TermAttribute termAttribute = (TermAttribute)
        input.getAttribute(TermAttribute.class);
    TypeAttribute typeAttribute = (TypeAttribute)
        input.getAttribute(TypeAttribute.class);

    boolean hasToken = false;

    // Consume the whole wrapped stream and append every token of type
    // "word", so the entire input collapses into a single token.
    while (input.incrementToken())
    {
      if (typeAttribute.type().equals("word")) {
        builder.append(termAttribute.term());
        hasToken = true;
      }
    }

    // Emit the one concatenated token, or null if there was no input.
    if (hasToken) {
      token.setTermBuffer(builder.toString());
      return token;
    }

    return null;
  }
}

//ConcatFilterFactory:

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {

@Override
public TokenStream create(TokenStream stream) {
return new ConcatFilter(stream);
}
}

and in your schema.xml, you can simply add the filterfactory using this element:

<filter class="ConcatFilterFactory" />

Jar files i have included in the buildpath (can be found in the solr download 
package):

apache-solr-core-1.4.1.jar
lucene-analyzers-2.9.3.jar
lucene-core-2.9.3.jar


good luck ;)


-robert




On Nov 11, 2010, at 8:45 PM, Nick Martin wrote:

> Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm 
> not sure what i need in the classpath and where Token comes from.
> Will check the thread you mention.
> 
> Best
> 
> Nick
> 
> On 11 Nov 2010, at 18:13, Robert Gründler wrote:
> 
>> I've posted a ConcatFilter in my previous mail which does concatenate tokens. 
>> This works fine, but i
>> realized that what i wanted to achieve is implemented easier in another way 
>> (by using 2 separate field types).
>> 
>> Have a look at a previous mail i wrote to the list and the reply from Ahmet 
>> Arslan (topic: "EdgeNGram relevancy").
>> 
>> 
>> best
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
>> See 
>> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>> 
>>> Hi Robert, All,
>>> 
>>> I have a similar problem, here is my fieldType, 
>>> http://paste.pocoo.org/show/289910/
>>> I want to include stopword removal and lowercase the incoming terms. The 
>>> idea being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the 
>>> EdgeNgram filter factory.
>>> If anyone can tell me a simple way to concatenate tokens into one token 
>>> again, similar to the KeywordTokenizer, that would be super helpful.
>>> 
>>> Many thanks
>>> 
>>> Nick
>>> 
>>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>> 
 
 On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
 
> Are you sure you really want to throw out stopwords for your use case?  I 
> don't think autocompletion will work how you want if you do. 
 
 in our case i think it makes sense. the content is targeting the 
 electronic music / dj scene, so we have a lot of words like "DJ" or 
 "featuring" which
 make sense to throw out of the query. Also searches for "the beastie boys" 
 and "beastie boys" should return a match in the autocompletion.
 
> 
> And if you don't... then why use the WhitespaceTokenizer and then try to 
> jam the tokens back together? Why not just NOT tokenize in the first 
> place. Use the KeywordTokenizer, which really should be called the 
> NonTokenizingTokenizer, because it doesn't tokenize at all, it just 
> creates one token from the entire input string. 
 
 I started out with the KeywordTokenizer, which worked well, except the 
 StopWord problem.
 
 For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
 does what i'm after:
 
 public class ConcatFilter extends TokenFilter {
 
private TokenStream tstream;
 
protected ConcatFilter(TokenStream input) {
super(input);
this.tstream = input;
}
 
@Override
public Token next() throws IOException {

Token token = new Token();
StringBuilder builder = new StringBuilder();

TermAttribute termAttribute = (TermAttribute) 
 tstream.getAttribute(TermAttribute.class);
TypeAttribute typeAttribute = (TypeAttribute) 
 tstream.getAttribute(TypeAtt

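For anyone wiring the ConcatFilter above into plain Lucene for a quick test, here is a minimal driver - a sketch assuming Lucene 2.9.x (matching the jars listed earlier); the class name and sample input are made up:

// ConcatFilterDemo (illustrative only)

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class ConcatFilterDemo {
  public static void main(String[] args) throws Exception {
    // Tokenize on whitespace, then collapse all "word" tokens into one.
    TokenStream stream = new ConcatFilter(
        new WhitespaceTokenizer(new StringReader("Foo Bar Baz Ltd")));

    // ConcatFilter overrides the deprecated next() API, so consume it that way.
    Token token;
    while ((token = stream.next()) != null) {
      System.out.println(token.term());   // expected: FooBarBazLtd
    }
  }
}
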
Re: Concatenate multiple tokens into one

2010-11-11 Thread Nick Martin
Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not 
sure what i need in the classpath and where Token comes from.
Will check the thread you mention.

Best

Nick

On 11 Nov 2010, at 18:13, Robert Gründler wrote:

> I've posted a ConcatFilter in my previous mail which does concatenate tokens. 
> This works fine, but i
> realized that what i wanted to achieve is implemented easier in another way 
> (by using 2 separate field types).
> 
> Have a look at a previous mail i wrote to the list and the reply from Ahmet 
> Arslan (topic: "EdgeNGram relevancy").
> 
> 
> best
> 
> 
> -robert
> 
> 
> 
> 
> See 
> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
> 
>> Hi Robert, All,
>> 
>> I have a similar problem, here is my fieldType, 
>> http://paste.pocoo.org/show/289910/
>> I want to include stopword removal and lowercase the incoming terms. The 
>> idea being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the 
>> EdgeNgram filter factory.
>> If anyone can tell me a simple way to concatenate tokens into one token 
>> again, similar to the KeywordTokenizer, that would be super helpful.
>> 
>> Many thanks
>> 
>> Nick
>> 
>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>> 
>>> 
>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>> 
 Are you sure you really want to throw out stopwords for your use case?  I 
 don't think autocompletion will work how you want if you do. 
>>> 
>>> in our case i think it makes sense. the content is targeting the 
>>> electronic music / dj scene, so we have a lot of words like "DJ" or 
>>> "featuring" which
>>> make sense to throw out of the query. Also searches for "the beastie boys" 
>>> and "beastie boys" should return a match in the autocompletion.
>>> 
 
 And if you don't... then why use the WhitespaceTokenizer and then try to 
 jam the tokens back together? Why not just NOT tokenize in the first 
 place. Use the KeywordTokenizer, which really should be called the 
 NonTokenizingTokenizer, because it doesn't tokenize at all, it just 
 creates one token from the entire input string. 
>>> 
>>> I started out with the KeywordTokenizer, which worked well, except the 
>>> StopWord problem.
>>> 
>>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
>>> does what i'm after:
>>> 
>>> public class ConcatFilter extends TokenFilter {
>>> 
>>> private TokenStream tstream;
>>> 
>>> protected ConcatFilter(TokenStream input) {
>>> super(input);
>>> this.tstream = input;
>>> }
>>> 
>>> @Override
>>> public Token next() throws IOException {
>>> 
>>> Token token = new Token();
>>> StringBuilder builder = new StringBuilder();
>>> 
>>> TermAttribute termAttribute = (TermAttribute) 
>>> tstream.getAttribute(TermAttribute.class);
>>> TypeAttribute typeAttribute = (TypeAttribute) 
>>> tstream.getAttribute(TypeAttribute.class);
>>> 
>>> boolean incremented = false;
>>> 
>>> while (tstream.incrementToken()) {
>>> 
>>> if (typeAttribute.type().equals("word")) {
>>> builder.append(termAttribute.term());   
>>> 
>>> }
>>> incremented = true;
>>> }
>>> 
>>> token.setTermBuffer(builder.toString());
>>> 
>>> if (incremented == true)
>>> return token;
>>> 
>>> return null;
>>> }
>>> }
>>> 
>>> I'm not sure if this is a safe way to do this, as i'm not familiar with the 
>>> whole solr/lucene implementation after all.
>>> 
>>> 
>>> best
>>> 
>>> 
>>> -robert
>>> 
>>> 
>>> 
>>> 
 
 Then lowercase, remove whitespace (or not), do whatever else you want to 
 do to your single token to normalize it, and then edgengram it. 
 
 If you include whitespace in the token, then when making your queries for 
 auto-complete, be sure to use a query parser that doesn't do 
 "pre-tokenization", the 'field' query parser should work well for this. 
 
 Jonathan
 
 
 
 
 From: Robert Gründler [rob...@dubture.com]
 Sent: Wednesday, November 10, 2010 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Concatenate multiple tokens into one
 
 Hi,
 
 i've created the following filterchain in a field type, the idea is to use 
 it for autocompletion purposes:
 
 <analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
   <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
 </analyzer>
 
 With that kind of filterchain, the EdgeNGramFilterFactory will receive 
 multiple tokens on input strings with whitespaces in it. This leads to the 
 following results:
>>>

Retrieving indexed content containing multiple languages

2010-11-11 Thread Tod
My Solr corpus is currently created by indexing metadata from a 
relational database as well as content pointed to by URLs from the 
database.  I'm using a pretty generic out of the box Solr schema.  The 
search results are presented via an AJAX enabled HTML page.


When I perform a search the document title (for example) has a mix of 
English and Chinese characters.  Everything there is fine - I can see 
the English and Chinese returned from a facet query on title.  I can 
search against the title using English words it contains and I get back 
an expected result.  I asked a Chinese friend to perform the same search 
using Chinese and nothing is returned.


How should I go about getting this search to work?  Chinese is just one 
language, I'll probably need to support more in the future.


My thought is that the Chinese characters are indexed as their Unicode 
equivalent, so all I'll need to do is make sure the query is encoded 
appropriately and just perform a regular search as I would if the terms 
were in English.  For some reason that sounds too easy.


I see there is a CJK tokenizer that would help here.  Do I need that for 
my situation?  Is there a fairly detailed tutorial on how to handle 
these types of language challenges?



Thanks in advance - Tod

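In case it helps with the Chinese side of this: the CJK analyzer that ships in lucene-analyzers indexes CJK text as overlapping character bigrams, so a Chinese query only needs the same analysis applied at query time - no Unicode escaping tricks. A small sketch (Lucene 2.9.x assumed; the sample title is made up):

// CjkDemo (illustrative only)

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class CjkDemo {
  public static void main(String[] args) throws Exception {
    CJKAnalyzer analyzer = new CJKAnalyzer();
    TokenStream ts = analyzer.tokenStream("title",
        new StringReader("Solr 用户指南"));
    TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      // prints roughly: solr, 用户, 户指, 指南 (overlapping bigrams)
      System.out.println(term.term());
    }
  }
}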

Search Result Differences a Puzzle

2010-11-11 Thread Eric Martin
Hi,

 

I cannot find out how this is occurring: 

 

 

nolosearch.com/search/apachesolr_search/law

 

 

You can see that the John Paul Stevens result yields more description in the
search result because of the keyword relevancy, whereas, the other results
just give you a snippet of the title based on keywords found. 

 

I am trying to figure out how to get a standard size search result no matter
what the relevancy is. While application of this type of result would be
irrelevant to many search engines it is completely practical in a legal
setting as a keyword is only as good as how it is being referenced in the
sentence or paragraph. What a dilemma I have!

 

 

I have been trying to figure out if it is the actual schema.xml file or
solrconfig.xml file and for the life of me, I can't find it referenced
anywhere. I tried changing the fragsize to 200 instead of a default of like
70. Didn't do any damage at re-index.

 

 

This problem is super critical to my search results. Like I said, as an
attorney, the word is superfluous until it is attached to a long sentence or
two in order to describe whether the keyword we searched for is relevant, let
alone worthy of a click. That is why my titles are set to open in a new
window: faster access, and if the result is crud, you just close the window
and get back to research.

 

 

Eric

 

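On the snippet-size question: hl.fragsize is a query-time highlighting parameter (it can also go in the handler defaults in solrconfig.xml), so changing it requires no re-index. A minimal SolrJ sketch; the URL and the highlighted field name ("body") are placeholders:

// HighlightDemo (illustrative only)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("law");
    q.setHighlight(true);
    q.addHighlightField("body");   // placeholder field holding the full text
    q.setHighlightSnippets(3);     // up to 3 fragments per document
    q.setHighlightFragsize(200);   // ~200 chars per fragment (hl.fragsize)

    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getHighlighting());
  }
}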


Re: EdgeNGram relevancy

2010-11-11 Thread Ahmet Arslan
> This setup now causes trouble with stopwords; here's
> an example:
> 
> Let's say the index contains 2 Strings: "Mr Martin
> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
> list.
> 
> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
> 
> This way, the only result i get is "Mr Martin Scorsese",
> because the strict field edgytext2 is boosted by 2.0. 
> 
> Any idea why in this case "Martin Scorsese" is not in the
> result at all?

Did you run your query without using () and "" operators? If yes can you try 
this?
&q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

If no can you paste output of &debugQuery=on


  


Memory used by facet queries

2010-11-11 Thread Charlie Gildawie
Hello All.

My first time posting, so be kind. We are developing a document store with lots 
and lots of very small documents (200 million at the moment; the final size will 
probably be double this, at 400 million documents). This is proof-of-concept 
development, so we are seeing what a single core can do for us before we 
consider sharding. We'd rather not shard if we don't have to.

I'm using SOLR 4.0 (for the simple facet pivots and groups which work well).

We're into week 4 of our development and have the production servers etc set 
up. Everything working very well until we start to test queries with production 
volumes of data.

I'm running into Java Heap Space exceptions during simple faceting on inverted 
fields. The fields we are currently faceting on are names - Country / Continent 
/ City names all stored as a Solr.StringField (there are other fields using 
tokenization to provide initial search but we want to use the simple 
StringFields to provide faceted navigation). In total we have 10 fields we'd 
ever want to facet on: 8 name fields that are strings, and 2 date-part fields 
(year and yearMonth) that are also strings.

This is our first time using SOLR and I didn't realise that we'd need so much 
heap for facets!

Solr is running in a tomcat container and I've currently set tomcat to use a max 
of

JAVA_OPTS="$JAVA_OPTS -server -Xms512m -Xmx3g"

I've been reading all I can find online and have seen advice to populate the 
facets caches first as soon as we've started the solr service. However I'd 
really like to know if there are ways to reduce the memory footprint. We 
currently have 32g of physical ram. Adding more ram is an option but I'm being 
asked the (completely reasonable) question -- "Why do you need so much?"

Please help!

Charlie.


-Original Message-
From: Robert Gründler [mailto:rob...@dubture.com]
Sent: 11 November 2010 18:14
To: solr-user@lucene.apache.org
Subject: Re: Concatenate multiple tokens into one

I've posted a ConcatFilter in my previous mail which does concatenate tokens. 
This works fine, but I realized that what I wanted to achieve is implemented 
more easily in another way (by using 2 separate field types).

Have a look at a previous mail i wrote to the list and the reply from Ahmet 
Arslan (topic: "EdgeNGram relevancy").


best


-robert




See
On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:

> Hi Robert, All,
>
> I have a similar problem, here is my fieldType,
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea 
> being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the 
> EdgeNgram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token 
> again, similar to the KeywordTokenizer, that would be super helpful.
>
> Many thanks
>
> Nick
>
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
>>
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>
>>> Are you sure you really want to throw out stopwords for your use case?  I 
>>> don't think autocompletion will work how you want if you do.
>>
>> in our case i think it makes sense. the content is targeting the
>> electronic music / dj scene, so we have a lot of words like "DJ" or 
>> "featuring" which make sense to throw out of the query. Also searches for 
>> "the beastie boys" and "beastie boys" should return a match in the 
>> autocompletion.
>>
>>>
>>> And if you don't... then why use the WhitespaceTokenizer and then try to 
>>> jam the tokens back together? Why not just NOT tokenize in the first place. 
>>> Use the KeywordTokenizer, which really should be called the 
>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates 
>>> one token from the entire input string.
>>
>> I started out with the KeywordTokenizer, which worked well, except the 
>> StopWord problem.
>>
>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
>> does what i'm after:
>>
>> public class ConcatFilter extends TokenFilter {
>>
>>  private TokenStream tstream;
>>
>>  protected ConcatFilter(TokenStream input) {
>>  super(input);
>>  this.tstream = input;
>>  }
>>
>>  @Override
>>  public Token next() throws IOException {
>>
>>  Token token = new Token();
>>  StringBuilder builder = new StringBuilder();
>>
>>  TermAttribute termAttribute = (TermAttribute) 
>> tstream.getAttribute(TermAttribute.class);
>>  TypeAttribute typeAttribute = (TypeAttribute)
>> tstream.getAttribute(TypeAttribute.class);
>>
>>  boolean incremented = false;
>>
>>  while (tstream.incrementToken()) {
>>
>>  if (typeAttribute.type().equals("word")) {
>>  builder.append(termAttribute.term());
>>  }
>>  incremented = true;
>>  }
>>
>>  token.setTermBuffer(builder.toString());
>>
>> 

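One knob worth trying before buying RAM for the faceting question above: facet.method=enum walks the terms and uses the filterCache, instead of un-inverting each field into FieldCache arrays sized by the number of documents, which can cut the heap cost a lot on low-cardinality fields like country or continent. A hedged SolrJ sketch (the URL is a placeholder; the field names follow the post):

// FacetEnumDemo (illustrative only)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FacetEnumDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("country", "continent", "city");
    // Enumerate terms via the filterCache rather than building a
    // per-field FieldCache array over all 200M documents.
    q.set("facet.method", "enum");

    System.out.println(solr.query(q).getFacetFields());
  }
}
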
Re: Concatenate multiple tokens into one

2010-11-11 Thread Robert Gründler
I've posted a ConcatFilter in my previous mail which does concatenate tokens. 
This works fine, but I realized that what I wanted to achieve is implemented 
more easily in another way (by using 2 separate field types).

Have a look at a previous mail i wrote to the list and the reply from Ahmet 
Arslan (topic: "EdgeNGram relevancy").


best


-robert




See 
On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:

> Hi Robert, All,
> 
> I have a similar problem, here is my fieldType, 
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea 
> being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the 
> EdgeNgram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token 
> again, similar to the KeywordTokenizer, that would be super helpful.
> 
> Many thanks
> 
> Nick
> 
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
> 
>> 
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>> 
>>> Are you sure you really want to throw out stopwords for your use case?  I 
>>> don't think autocompletion will work how you want if you do. 
>> 
>> in our case i think it makes sense. the content is targeting the electronic 
>> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
>> make sense to throw out of the query. Also searches for "the beastie boys" 
>> and "beastie boys" should return a match in the autocompletion.
>> 
>>> 
>>> And if you don't... then why use the WhitespaceTokenizer and then try to 
>>> jam the tokens back together? Why not just NOT tokenize in the first place. 
>>> Use the KeywordTokenizer, which really should be called the 
>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates 
>>> one token from the entire input string. 
>> 
>> I started out with the KeywordTokenizer, which worked well, except the 
>> StopWord problem.
>> 
>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
>> does what i'm after:
>> 
>> public class ConcatFilter extends TokenFilter {
>> 
>>  private TokenStream tstream;
>> 
>>  protected ConcatFilter(TokenStream input) {
>>  super(input);
>>  this.tstream = input;
>>  }
>> 
>>  @Override
>>  public Token next() throws IOException {
>>  
>>  Token token = new Token();
>>  StringBuilder builder = new StringBuilder();
>>  
>>  TermAttribute termAttribute = (TermAttribute) 
>> tstream.getAttribute(TermAttribute.class);
>>  TypeAttribute typeAttribute = (TypeAttribute) 
>> tstream.getAttribute(TypeAttribute.class);
>>  
>>  boolean incremented = false;
>>  
>>  while (tstream.incrementToken()) {
>>  
>>  if (typeAttribute.type().equals("word")) {
>>  builder.append(termAttribute.term());   
>> 
>>  }
>>  incremented = true;
>>  }
>>  
>>  token.setTermBuffer(builder.toString());
>>  
>>  if (incremented == true)
>>  return token;
>>  
>>  return null;
>>  }
>> }
>> 
>> I'm not sure if this is a safe way to do this, as i'm not familiar with the 
>> whole solr/lucene implementation after all.
>> 
>> 
>> best
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
>>> 
>>> Then lowercase, remove whitespace (or not), do whatever else you want to do 
>>> to your single token to normalize it, and then edgengram it. 
>>> 
>>> If you include whitespace in the token, then when making your queries for 
>>> auto-complete, be sure to use a query parser that doesn't do 
>>> "pre-tokenization", the 'field' query parser should work well for this. 
>>> 
>>> Jonathan
>>> 
>>> 
>>> 
>>> 
>>> From: Robert Gründler [rob...@dubture.com]
>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Concatenate multiple tokens into one
>>> 
>>> Hi,
>>> 
>>> i've created the following filterchain in a field type, the idea is to use 
>>> it for autocompletion purposes:
>>> 
>>> <analyzer type="index">
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>> </analyzer>
>>> 
>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive 
>>> multiple tokens on input strings with whitespaces in it. This leads to the 
>>> following results:
>>> Input Query: "George Cloo"
>>> Matches:
>>> - "George Harrison"
>>> - "John Clooridge"
>>> - "George Smith"
>>> -"George Clooney"
>>> - etc
>>> 
>>> However, only "George Clooney" should match in the autocompletion use case.
>>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, 
>>> which concatenates all the tokens generated by the 
>>> WhitespaceTokenizerFac

Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
thanks a lot, that setup works pretty well now.

the only problem now is that the stopwords do not work that well anymore. I'll 
provide an example, but first the 2 fieldtypes:

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
   </analyzer>
  </fieldType>

  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
   </analyzer>
  </fieldType>


This setup now causes trouble with stopwords; here's an example:

Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin 
Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result i get is "Mr Martin Scorsese", because the strict 
field edgytext2 is boosted by 2.0. 

Any idea why in this case "Martin Scorsese" is not in the result at all?


thanks again!


-robert






On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:

> You can add an additional field using KeywordTokenizerFactory instead
> of WhitespaceTokenizerFactory, and query both these fields with an OR
> operator.
> 
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> 
> You can even apply a boost so that begins-with matches come first.
> 
> --- On Thu, 11/11/10, Robert Gründler  wrote:
> 
>> From: Robert Gründler 
>> Subject: EdgeNGram relevancy
>> To: solr-user@lucene.apache.org
>> Date: Thursday, November 11, 2010, 5:51 PM
>> Hi,
>> 
>> consider the following fieldtype (used for
>> autocompletion):
>> 
>>   <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
>>    <analyzer type="index">
>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>      <filter class="solr.LowerCaseFilterFactory"/>
>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>    </analyzer>
>>    <analyzer type="query">
>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>      <filter class="solr.LowerCaseFilterFactory"/>
>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>    </analyzer>
>>   </fieldType>
>> 
>> 
>> This works fine as long as the query string is a single
>> word. For multiple words, the ranking is weird though.
>> 
>> Example:
>> 
>> Query String: "Bill Cl"
>> 
>> Result (in that order):
>> 
>> - Clyde Phillips
>> - Clay Rogers
>> - Roger Cloud
>> - Bill Clinton
>> 
>> "Bill Clinton" should have the highest rank in that
>> case.  
>> 
>> Has anyone an idea how to configure this fieldtype to
>> make matches in both tokens rank higher than those who match
>> in either token?
>> 
>> 
>> thanks!
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
> 
> 
> 



Re: Issue with facet fields

2010-11-11 Thread Paige Cook
Are you storing the upload_by and business fields? You will not be able to
retrieve a field from your index if it is not stored. Check that you have
stored="true" for both of those fields.

- Paige

On Thu, Nov 11, 2010 at 10:23 AM, gauravshetti wrote:

>
> I am facing this weird issue in facet fields
>
> Within config xml
> under
> 
> 
> −
>
> I have defined the fl as
> 
>
>file_id folder_id display_name file_name priority_text content_type
> last_upload upload_by business indexed
>
> 
>
> But my output xml doesn't contain the elements upload_by and business.
> But I am able to do a search by upload_by: and business:
>
> Even when i add in the url &fl=* i do not get this facet field in the
> response
>
> Any idea what I am doing wrong?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Issue-with-facet-fields-tp1883106p1883106.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: EdgeNGram relevancy

2010-11-11 Thread Ahmet Arslan
You can add an additional field using KeywordTokenizerFactory instead of 
WhitespaceTokenizerFactory, and query both these fields with an OR operator. 

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first.

--- On Thu, 11/11/10, Robert Gründler  wrote:

> From: Robert Gründler 
> Subject: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 5:51 PM
> Hi,
> 
> consider the following fieldtype (used for
> autocompletion):
> 
>   <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
>    <analyzer type="index">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>    </analyzer>
>    <analyzer type="query">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>    </analyzer>
>   </fieldType>
> 
> 
> This works fine as long as the query string is a single
> word. For multiple words, the ranking is weird though.
> 
> Example:
> 
> Query String: "Bill Cl"
> 
> Result (in that order):
> 
> - Clyde Phillips
> - Clay Rogers
> - Roger Cloud
> - Bill Clinton
> 
> "Bill Clinton" should have the highest rank in that
> case.  
> 
> Has anyone an idea how to configure this fieldtype to
> make matches in both tokens rank higher than those who match
> in either token?
> 
> 
> thanks!
> 
> 
> -robert
> 
> 
> 
> 



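A sketch of that boosted two-field query issued through SolrJ (the URL is a placeholder; edgytext and edgytext2 are the field names from this thread):

// AutocompleteQueryDemo (illustrative only)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class AutocompleteQueryDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // edgytext matches inside any token; edgytext2 (KeywordTokenizer based)
    // only matches from the start of the whole phrase, so boosting it
    // floats "begins with" hits like "Bill Clinton" to the top.
    SolrQuery q = new SolrQuery(
        "edgytext:(Bill Cl) OR edgytext2:\"Bill Cl\"^2.0");
    System.out.println(solr.query(q).getResults());
  }
}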


using CJKTokenizerFactory for Japanese language

2010-11-11 Thread Kumar Pandey
I am exploring support for the Japanese language in solr.
Solr seems to provide CJKTokenizerFactory.
How useful is this module? Has anyone been using it in production for
the Japanese language?

One shortfall it seems to have, from what I have been able to read up on, is
that it can generate a lot of false matches - for example matching kyoto when
searching for tokyo, etc.

I did not see many questions related to this module so I wonder if people
are actively using it.
If not, are there any other solutions on the market that are recommended by
solr users?

Thanks
Kumar


Re: Rollback can't be done after committing?

2010-11-11 Thread Jonathan Rochkind

What you say is true. Solr is not an RDBMS.

Kouta Osabe wrote:

Hi, all

I have a question about Solr and SolrJ's rollback.

I try to rollback like below

try {
  server.addBean(dto);
  server.commit();
} catch (Exception e) {
  if (server != null) { server.rollback(); }
}

My expectation is that if any exception is thrown, the "rollback" runs, so
none of the data would be updated.

But once the commit has happened, the rollback no longer has any effect.

Does rollback only work correctly as long as commit has not yet run?

Is Solr and SolrJ's rollback system not the same as an RDB's rollback?

  


Rollback can't be done after committing?

2010-11-11 Thread Kouta Osabe
Hi, all

I have a question about Solr and SolrJ's rollback.

I try to rollback like below:

try {
  server.addBean(dto);
  server.commit();
} catch (Exception e) {
  if (server != null) { server.rollback(); }
}

My expectation is that if any exception is thrown, the "rollback" runs, so
none of the data would be updated.

But once the commit has happened, the rollback no longer has any effect.

Does rollback only work correctly as long as commit has not yet run?

Is Solr and SolrJ's rollback system not the same as an RDB's rollback?

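For reference, SolrJ's rollback() withdraws the uncommitted adds and deletes made since the last commit, so it only helps while commit has not yet succeeded. A minimal sketch of the working pattern (the URL and the bean are placeholders; the bean's field must exist in the schema):

// RollbackDemo (illustrative only)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.beans.Field;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class RollbackDemo {
  // Hypothetical bean standing in for the "dto" above.
  public static class Doc {
    @Field public String id;
    public Doc(String id) { this.id = id; }
  }

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    try {
      server.addBean(new Doc("42"));
      server.commit();   // once this returns, the add is permanent
    } catch (Exception e) {
      // rollback() discards only uncommitted adds/deletes; after a
      // successful commit there is nothing left for it to undo.
      server.rollback();
    }
  }
}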

Re: Concatenate multiple tokens into one

2010-11-11 Thread Nick Martin
Hi Robert, All,

I have a similar problem, here is my fieldType, 
http://paste.pocoo.org/show/289910/
I want to include stopword removal and lowercase the incoming terms. The idea 
being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNgram 
filter factory.
If anyone can tell me a simple way to concatenate tokens into one token again, 
similar to the KeywordTokenizer, that would be super helpful.

Many thanks

Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote:

> 
> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
> 
>> Are you sure you really want to throw out stopwords for your use case?  I 
>> don't think autocompletion will work how you want if you do. 
> 
> in our case i think it makes sense. the content is targeting the electronic 
> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
> make sense to throw out of the query. Also searches for "the beastie boys" 
> and "beastie boys" should return a match in the autocompletion.
> 
>> 
>> And if you don't... then why use the WhitespaceTokenizer and then try to jam 
>> the tokens back together? Why not just NOT tokenize in the first place. Use 
>> the KeywordTokenizer, which really should be called the 
>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates 
>> one token from the entire input string. 
> 
> I started out with the KeywordTokenizer, which worked well, except the 
> StopWord problem.
> 
> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
> does what i'm after:
> 
> public class ConcatFilter extends TokenFilter {
> 
>   private TokenStream tstream;
> 
>   protected ConcatFilter(TokenStream input) {
>   super(input);
>   this.tstream = input;
>   }
> 
>   @Override
>   public Token next() throws IOException {
>   
>   Token token = new Token();
>   StringBuilder builder = new StringBuilder();
>   
>   TermAttribute termAttribute = (TermAttribute) 
> tstream.getAttribute(TermAttribute.class);
>   TypeAttribute typeAttribute = (TypeAttribute) 
> tstream.getAttribute(TypeAttribute.class);
>   
>   boolean incremented = false;
>   
>   while (tstream.incrementToken()) {
>   
>   if (typeAttribute.type().equals("word")) {
>   builder.append(termAttribute.term());   
> 
>   }
>   incremented = true;
>   }
>   
>   token.setTermBuffer(builder.toString());
>   
>   if (incremented == true)
>   return token;
>   
>   return null;
>   }
> }
> 
> I'm not sure if this is a safe way to do this, as i'm not familiar with the 
> whole solr/lucene implementation after all.
> 
> 
> best
> 
> 
> -robert
> 
> 
> 
> 
>> 
>> Then lowercase, remove whitespace (or not), do whatever else you want to do 
>> to your single token to normalize it, and then edgengram it. 
>> 
>> If you include whitespace in the token, then when making your queries for 
>> auto-complete, be sure to use a query parser that doesn't do 
>> "pre-tokenization", the 'field' query parser should work well for this. 
>> 
>> Jonathan
>> 
>> 
>> 
>> 
>> From: Robert Gründler [rob...@dubture.com]
>> Sent: Wednesday, November 10, 2010 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Concatenate multiple tokens into one
>> 
>> Hi,
>> 
>> i've created the following filterchain in a field type, the idea is to use 
>> it for autocompletion purposes:
>> 
>> <analyzer type="index">
>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>   <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>> </analyzer>
>> 
>> With that kind of filterchain, the EdgeNGramFilterFactory will receive 
>> multiple tokens on input strings with whitespaces in it. This leads to the 
>> following results:
>> Input Query: "George Cloo"
>> Matches:
>> - "George Harrison"
>> - "John Clooridge"
>> - "George Smith"
>> -"George Clooney"
>> - etc
>> 
>> However, only "George Clooney" should match in the autocompletion use case.
>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which 
>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>> Are there filters which can do such a thing?
>> 
>> If not, are there examples how to implement a custom TokenFilter?
>> 
>> thanks!
>> 
>> -robert
>> 
>> 
>> 
>> 
> 



Re: Any Copy Field Caveats?

2010-11-11 Thread Tod

I've noticed that using camelCase in field names causes problems.


On 11/5/2010 11:02 AM, Will Milspec wrote:

Hi all,

we're moving from an old lucene version to solr  and plan to use the "Copy
Field" functionality. Previously we had "rolled our own" implementation,
sticking title, description, etc. in a field called 'content'.

We lose some flexibility (i.e. the java layer can no longer control what gets in
the new copied field), in exchange for simplicity. A fair tradeoff IMO.

My question: has anyone found any subtle issues or "gotchas" with copy
fields?

(from the subject line "caveat"--pronounced 'kah-VEY-AT'  is Latin as in
"Caveat Emptor"..."let the buyer beware").

thanks,

will

will





Re: solr dynamic core creation

2010-11-11 Thread Robert Sandiford

No - in reading what you just wrote, and what you originally wrote, I think
the misunderstanding was mine, based on the architecture of my code.  In my
code, it is our 'server' level that does the SolrJ indexing calls, but you
meant 'server' to be the Solr instance, and what you mean by 'client' is
what I was thinking of (without thinking) as the 'server'...

Sorry about that.  Hopefully someone else can chime in on your specific
issue...
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883354.html
Sent from the Solr - User mailing list archive at Nabble.com.


EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
Hi,

consider the following fieldtype (used for autocompletion):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
   </analyzer>
  </fieldType>


This works fine as long as the query string is a single word. For multiple 
words, the ranking is weird though.

Example:

Query String: "Bill Cl"

Result (in that order):

- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

"Bill Clinton" should have the highest rank in that case.  

Has anyone an idea how to configure this fieldtype to make matches in both 
tokens rank higher than those that match in only one token?


thanks!


-robert





Re: Crawling with nutch and mapping fields to solr

2010-11-11 Thread Jean-Luc

I'm going down the route of patching nutch so I can use this ParseMetaTags
plugin:
https://issues.apache.org/jira/browse/NUTCH-809

Also wondering whether I will be able to use the XMLParser to allow me to
parse well-formed XHTML; using XPath would be a bonus:
https://issues.apache.org/jira/browse/NUTCH-185

Any thoughts appreciated...
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawling-with-nutch-and-mapping-fields-to-solr-tp1879060p1883295.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr dynamic core creation

2010-11-11 Thread nizan

Hi,

Maybe I just don't understand the whole concept and I'm mixing up server and
client...

Client - the place where I make the http calls (for index, search etc.) -
where I use the CommonsHttpSolrServer as the solr server. This machine isn't
defined as master or slave, it just uses solr as a search engine.

Server - the http calls I make in the client go to another server, the
master solr server (or one of the slaves), where I have an embeddedSolrServer,
don't they?

thanks, nizan
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883269.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr dynamic core creation

2010-11-11 Thread Robert Sandiford

Hmmm.  Maybe you need to define what you mean by 'server' and what you mean
by 'client'.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883238.html
Sent from the Solr - User mailing list archive at Nabble.com.


problem with wildcard

2010-11-11 Thread Jean-Sebastien Vachon

Hi All,

I'm having some trouble with a query using a wildcard and I was wondering if 
anyone could tell me why these two
similar queries do not return the same number of results. Basically, the query 
I'm making should return all docs whose title starts with
(or contains) the string "lowe'". I suspect some analyzer is causing this 
behaviour and I'd like to know if there is a way to fix this problem.

1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0



rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: title:(  lowe')
parsed_filter_queries: title:low


2) select?q=*:*&fq=title:(+lowe'*)&debugQuery=on&rows=0 



rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: title:(  lowe'*)
parsed_filter_queries: title:lowe'*

...



The title field is defined as:



where the text type is:


  







  
  






  







Re: solr dynamic core creation

2010-11-11 Thread nizan

Hi,

Thanks for the offers, I'll take deeper look into them.

In the offers you showed me, if I understand correctly, the call for core
creation is done on the client side. I need a mechanism that will work on the
server side.

I know it sounds stupid, but I need the client side not to know which cores
exist or not; on the server side (maybe with a handler?) Solr should detect
that the core has not been created yet, and create it if needed.

Thanks, nizan
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883213.html
Sent from the Solr - User mailing list archive at Nabble.com.


Boosting

2010-11-11 Thread Solr User
Hi,

I have a question about boosting.

I have the following fields in my schema.xml:

1. title
2. description
3. ISBN

etc

I want to boost the field title. I tried index time boosting but it did not
work. I also tried Query time boosting but with no luck.

Can someone help me with how to implement boosting on a specific field like
title?

Thanks,
Solr User

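For query-time boosting of one field, a common route is the dismax handler's qf weights; a hedged SolrJ sketch (the URL, query text and weights are illustrative - the field names are the ones listed above):

// TitleBoostDemo (illustrative only)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TitleBoostDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("lucene in action");
    q.set("defType", "dismax");
    // Search title, description and ISBN, weighting title matches 4x.
    q.set("qf", "title^4 description ISBN");

    System.out.println(solr.query(q).getResults());
  }
}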

Re: WELCOME to solr-user@lucene.apache.org

2010-11-11 Thread Solr User
Hi,

I have a question about boosting.

I have the following fields in my schema.xml:

1. title
2. description
3. ISBN

etc

I want to boost the field title. I tried index time boosting but it did not
work. I also tried Query time boosting but with no luck.

Can someone help me with how to implement boosting on a specific field like
title?

Thanks,
Solr User

On Thu, Nov 11, 2010 at 10:26 AM,  wrote:

> Hi! This is the ezmlm program. I'm managing the
> solr-user@lucene.apache.org mailing list.
>
> I'm working for my owner, who can be reached
> at solr-user-ow...@lucene.apache.org.
>
> Acknowledgment: I have added the address
>
>   solr...@gmail.com
>
> to the solr-user mailing list.
>
> Welcome to solr-u...@lucene.apache.org!
>
> Please save this message so that you know the address you are
> subscribed under, in case you later want to unsubscribe or change your
> subscription address.
>

Issue with facet fields

2010-11-11 Thread gauravshetti

I am facing this weird issue in facet fields

Within config xml
under




I have defined the fl as 


file_id folder_id display_name file_name priority_text content_type
last_upload upload_by business indexed
 


But my output xml doesn't contain the elements upload_by and business.
But I am able to do a search by upload_by: and business:

Even when i add in the url &fl=* i do not get this facet field in the
response

Any idea what I am doing wrong?


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-with-facet-fields-tp1883106p1883106.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr dynamic core creation

2010-11-11 Thread Robert Sandiford

Hi, nizan.  I didn't realize that just replying to a thread from my email
client wouldn't get back to you.  Here's some info on this thread since your
original post:


On Nov 10, 2010, at 12:30pm, Bob Sandiford wrote:

> Why not use replication?  Call it inexperience...
>
> We're really early into working with and fully understanding Solr and 
> the best way to approach various issues.  I did mention that this was 
> a prototype and non-production code, so I'm covered, though :)
>
> We'll take a look at the replication feature...

Replication doesn't replicate the top-level solr.xml file that defines
available cores, so if dynamic cores is a requirement then your custom code
isn't wasted :)

-- Ken


>> -Original Message-
>> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
>> Sent: Wednesday, November 10, 2010 3:26 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Dynamic creating of cores in solr
>>
>> You could use the actual built-in Solr replication feature to 
>> accomplish that same function -- complete re-index to a 'master', and 
>> then when finished, trigger replication to the 'slave', with the 
>> 'slave' being the live index that actually serves your applications.
>>
>> I am curious if there was any reason you chose to roll your own 
>> solution using JSolr and dynamic creation of cores, instead of simply 
>> using the replication feature. Were there any downsides of using the 
>> replication feature for this purpose that you amerliorated through 
>> your solution?
>>
>> Jonathan
>>
>> Bob Sandiford wrote:
>>> We also use SolrJ, and have a dynamically created Core capability -
>> where we don't know in advance what the Cores will be that we 
>> require.
>>>
>>> We almost always do a complete index build, and if there's a 
>>> previous
>> instance of that index, it needs to be available during a complete 
>> index build, so we have two cores per index, and switch them as 
>> required at the end of an indexing run.
>>>
>>> Here's a summary of how we do it (we're in an early prototype /
>> implementation right now - this isn't  production quality code - as 
>> you can tell from our voluminous javadocs on the methods...)
>>>
>>> 1) Identify if the core exists, and if not, create it:
>>>
>>>   /**
>>> * This method instantiates two SolrServer objects, solr and
>> indexCore.  It requires that
>>> * indexName be set before calling.
>>> */
>>>private void initSolrServer() throws IOException
>>>{
>>>String baseUrl = "http://localhost:8983/solr/";
>>>solr = new CommonsHttpSolrServer(baseUrl);
>>>
>>>String indexCoreName = indexName +
>> SolrConstants.SUFFIX_INDEX; // SUFIX_INDEX = "_INDEX"
>>>String indexCoreUrl = baseUrl + indexCoreName;
>>>
>>>// Here we create two cores for the indexName, if they don't
>> already exist - the live core used
>>>// for searching and a second core used for indexing. After
>> indexing, the two will be switched so the
>>>// just-indexed core will become the live core. The way that
>> core swapping works, the live core will always
>>>// be named [indexName] and the indexing core will always be
>> named [indexname]_INDEX, but the
>>>// dataDir of each core will alternate between [indexName]_1
>> and [indexName]_2.
>>>createCoreIfNeeded(indexName, indexName + "_1", solr);
>>>createCoreIfNeeded(indexCoreName, indexName + "_2", solr);
>>>indexCore = new CommonsHttpSolrServer(indexCoreUrl);
>>>}
>>>
>>>
>>>   /**
>>> * Create a core if it does not already exists. Returns true if a
>> new core was created, false otherwise.
>>> */
>>>private boolean createCoreIfNeeded(String coreName, String
>> dataDir, SolrServer server) throws IOException
>>>{
>>>boolean coreExists = true;
>>>try
>>>{
>>>// SolrJ provides no direct method to check if a core
>> exists, but getStatus will
>>>// return an empty list for any core that doesn't.
>>>CoreAdminResponse statusResponse =
>> CoreAdminRequest.getStatus(coreName, server);
>>>coreExists =
>> statusResponse.getCoreStatus(coreName).size() > 0;
>>>if(!coreExists)
>>>{
>>>// Create the core
>>>LOG.info("Creating Solr core: " + coreName);
>>>CoreAdminRequest.Create create = new
>> CoreAdminRequest.Create();
>>>create.setCoreName(coreName);
>>>create.setInstanceDir(".");
>>>create.setDataDir(dataDir);
>>>create.process(server);
>>>}
>>>}
>>>catch (SolrServerException e)
>>>{
>>>e.printStackTrace();
>>>}
>>>return !coreExists;
>>>}
>>>
>>>
>>> 2) Do the index, clearing it first if it's a complete rebuild:
>>>
>>> [snip]
>>>if (fullIndex)
>>>{
>>>try
>>>{
>>>indexCore.delete
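
The core swap at the end of an indexing run can also be driven from SolrJ; a sketch assuming the CoreAdminRequest SWAP action (core names follow the [indexName]/[indexName]_INDEX convention described above):

// CoreSwapDemo (illustrative only)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

public class CoreSwapDemo {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // After the build core finishes re-indexing, swap it with the live
    // core so searches hit the fresh index.
    CoreAdminRequest swap = new CoreAdminRequest();
    swap.setAction(CoreAdminAction.SWAP);
    swap.setCoreName("myIndex");
    swap.setOtherCoreName("myIndex_INDEX");
    swap.process(server);
  }
}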

Re: Adding new field after data is already indexed

2010-11-11 Thread Erick Erickson
@Jerry Li

What version of Solr were you using? And was there any
data in the new field?  I have no problems here with a quick
test I ran on trunk...

Best
Erick

On Thu, Nov 11, 2010 at 1:37 AM, Jerry Li | 李宗杰 wrote:

> but if I use this field to do sorting, an error occurs and an
> ArrayIndexOutOfBounds exception is thrown.
>
> On Thursday, November 11, 2010, Robert Petersen  wrote:
> > 1)  Just put the new field in the schema and stop/start solr.  Documents
> > in the index will not have the field until you reindex them but it won't
> > hurt anything.
> >
> > 2)  Just turn off their handlers in solrconfig is all I think that
> > takes.
> >
> > -Original Message-
> > From: gauravshetti [mailto:gaurav.she...@tcs.com]
> > Sent: Monday, November 08, 2010 5:21 AM
> > To: solr-user@lucene.apache.org
> > Subject: Adding new field after data is already indexed
> >
> >
> > Hi,
> >
> >  I had a few questions regarding Solr.
> > Say my schema file looks like
> > 
> > 
> >
> > and i index data on the basis of these fields. Now, incase i need to add
> > a
> > new field, is there a way i can add the field without corrupting the
> > previous data. Is there any feature which adds a new field with a
> > default
> > value to the existing records.
> >
> >
> > 2) Is there any security mechanism/authorization check to prevent url
> > like
> > /admin and /update to only a few users.
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Adding-new-field-after-data-is-alread
> > y-indexed-tp1862575p1862575.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> --
>
> Best Regards.
> Jerry. Li | 李宗杰
> 
>


solr 1.3 how to parse "rich" documents

2010-11-11 Thread Nikola Garafolic

Hi,

I use solr 1.3 with the patch for parsing rich documents, and when uploading 
for example a pdf file, the only thing I see in solr.log is the following:


INFO: [] webapp=/solr path=/update/rich 
params={id=250&stream.type=pdf&fieldnames=id,name&commit=true&stream.fieldname=body&name=iphone+user+guide+pdf+iphone_user_guide.pdf} 
status=0 QTime=12656


solrconfig.xml contains the line:

 class="solr.RichDocumentRequestHandler" startup="lazy" />


What else am I missing?

Since I am running solr standalone, I do not need to build it with 
ant, do I?


Regards,
Nikola

--
Nikola Garafolic
SRCE, Sveucilisni racunski centar
tel: +385 1 6165 804
email: nikola.garafo...@srce.hr


IndexTank technology...

2010-11-11 Thread Glen Newton
Does anyone know what technology they are using: http://www.indextank.com/
Is it Lucene under the hood?

Thanks, and apologies for cross-posting.
-Glen

http://zzzoot.blogspot.com

-- 

-


index just new articles from rss feeds - Data Import Request Handler

2010-11-11 Thread Matteo Moci
Hello,
I'd like to use solr to index some documents coming from an rss feed,
like the example at [1], but it seems that the configuration used
there is just for a one-time indexing, trying to get all the articles
exposed in the rss feed of the website.

Is it possible to manage and index just the new articles coming from
the rss source?

I found that maybe the delta-import can be useful but, from what I understand,
the delta-import is used to just update the index with contents of
documents that have been modified since the last indexing:
this is obviously useful, but I'd like to index just the new articles
coming from an rss feed.

Is it something managed automatically by solr, or do I have to deal with
it in a separate way? Maybe a full-import with the &clean=false
parameter?
Are there any solutions that you would suggest?
Maybe storing the article feeds in a table like [2] and have a module
that periodically sends each row to solr for indexing it?

Thanks,
Matteo

[1] http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example
[2] http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS

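The full-import with clean=false idea should work when the feed items carry a stable uniqueKey: new articles get added, and re-fetched ones simply overwrite themselves. One way to trigger it from Java - a sketch assuming the handler is registered at /dataimport as in the wiki example:

// DihTriggerDemo (illustrative only)

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihTriggerDemo {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("command", "full-import");
    params.set("clean", "false");  // keep existing docs; dupes collapse on uniqueKey
    params.set("commit", "true");

    QueryRequest req = new QueryRequest(params);
    req.setPath("/dataimport");    // matches the DIH requestHandler name
    System.out.println(req.process(solr));
  }
}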

Error while indexing files with Solr

2010-11-11 Thread Kaustuv Royburman

Hi,
I am trying to index documents (PDF, Doc, XLS, RTF) using the 
ExtractingRequestHandler.


I am following the tutorial at 
http://wiki.apache.org/solr/ExtractingRequestHandler

But when i run the following command

curl 
"http://localhost:8983/solr/update/extract?literal.id=mydoc.doc&uprefix=attr_&fmap.content=attr_content" 
-F "myfile=@/home/system/Documents/mydoc.doc"


I am getting the following error:




Error 500

HTTP ERROR: 500 - lazy loading error

org.apache.solr.common.SolrException: lazy loading error
   at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:249)
   at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)

   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)

   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)

   at org.mortbay.jetty.Server.handle(Server.java:285)
   at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)

   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.solr.common.SolrException: Error loading class 
'org.apache.solr.handler.extraction.ExtractingRequestHandler'
   at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)

   at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
   at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
   at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)

   ...21 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.solr.handler.extraction.ExtractingRequestHandler not found in 
java.net.URLClassLoader{urls=[], parent=contextloa...@null}

   at java.net.URLClassLoader.findClass(libgcj.so.90)
   at java.lang.ClassLoader.loadClass(libgcj.so.90)
   at java.lang.ClassLoader.loadClass(libgcj.so.90)
   at java.lang.Class.forName(libgcj.so.90)
   at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)

   ...24 more

RequestURI=/solr/update/extract

Powered by Jetty://

I am running Debian Lenny and java version "1.6.0_22".
I am running apache-solr-1.4.1 and running it from the examples directory.

Please point me in the right direction and help me solve the problem.



--
---
Regards,
Kaustuv Royburman

Senior Software Developer
infoservices.in
DLF IT Park,
Rajarhat, 1st Floor, Tower - 3
Major Arterial Road,
Kolkata - 700156,
India


Re: How to use polish stemmer - Stempel - in schema.xml?

2010-11-11 Thread Jakub Godawa
Hi! Sorry for such a break, but I was moving house... anyway:

1. I took the
~/apache-solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java
file and modified it in Vim (saving it as StempelFilterFactory.java) as
follows:

package org.getopt.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;
import org.getopt.stempel.lucene.StempelFilter;

public class StempelTokenFilterFactory extends BaseTokenFilterFactory {
  // Wrap the incoming token stream with Stempel's stemming filter.
  public StempelFilter create(TokenStream input) {
    return new StempelFilter(input);
  }
}

2. Then I put the file into the extracted stempel-1.0.jar tree, under
./org/getopt/solr/analysis/
3. Then I created a class from it: jar -cf
StempelTokenFilterFactory.class StempelFilterFactory.java
4. Then I created a new stempel-1.0.jar archive: jar -cf stempel-1.0.jar
-C ./stempel-1.0/ .
5. Then in schema.xml I've put:

<fieldType ...>
  <analyzer>
    ...
    <filter class="org.getopt.solr.analysis.StempelTokenFilterFactory"/>
  </analyzer>
</fieldType>

6. I started the solr server and I received the following error:

2010-11-11 11:50:56 org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassFormatError: Incompatible magic value 1347093252 in class file org/getopt/solr/analysis/StempelTokenFilterFactory
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
...

Question: What is wrong? :) I use "jar (fastjar) 0.98" to create the jars.
I googled that error, but no answer gave me an idea of what is wrong in
my .java file.
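
One observation: the magic value 1347093252 is 0x504B0304, the ZIP/jar header, so the file at org/getopt/solr/analysis/StempelTokenFilterFactory.class is actually a jar archive rather than compiled bytecode -- consistent with step 3 using jar -cf where javac is needed. A sketch of the usual compile-and-repackage steps, assuming the Solr, Lucene, and Stempel jars from the 1.4.1 distribution sit in the current directory (the jar names and paths are assumptions), and noting that the source file must be named StempelTokenFilterFactory.java to match the public class:

    # compile the factory against the Solr/Lucene/Stempel jars (paths assumed)
    javac -cp apache-solr-core-1.4.1.jar:lucene-core-2.9.3.jar:stempel-1.0.jar \
        org/getopt/solr/analysis/StempelTokenFilterFactory.java

    # add the compiled .class file to the existing stempel jar
    jar -uf stempel-1.0.jar org/getopt/solr/analysis/StempelTokenFilterFactory.class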

Please help, as I believe I am close to the end of that subject.

Cheers,
Jakub Godawa.

2010/11/3 Lance Norskog :
> Here's the problem: Solr is a little dumb about these Filter classes,
> and so you have to make a Factory object for the Stempel Filter.
>
> There are a lot of other FilterFactory classes. You would have to just
> copy one and change the names to Stempel and it might actually work.
>
> This will take some Solr programming- perhaps the author can help you?
>
> On Tue, Nov 2, 2010 at 7:08 AM, Jakub Godawa  wrote:
>> Sorry, I am not a Java programmer at all. I would appreciate more
>> verbose (or step-by-step) help.
>>
>> 2010/11/2 Bernd Fehling :
>>>
>>> So you call org.getopt.solr.analysis.StempelTokenFilterFactory.
>>> In this case I would assume a file StempelTokenFilterFactory.class
>>> in your directory org/getopt/solr/analysis/.
>>>
>>> And a class which extends the BaseTokenFilterFactory, right?
>>> ...
>>> public class StempelTokenFilterFactory extends BaseTokenFilterFactory 
>>> implements ResourceLoaderAware {
>>> ...
>>>
>>>
>>>
>>> Am 02.11.2010 14:20, schrieb Jakub Godawa:
 This is what stempel-1.0.jar consist of after jar -xf:

 jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R org/
 org/:
 egothor  getopt

 org/egothor:
 stemmer

 org/egothor/stemmer:
 Cell.class     Diff.class    Gener.class  MultiTrie2.class
 Optimizer2.class  Reduce.class        Row.class    TestAll.class
 TestLoad.class  Trie$StrEnum.class
 Compile.class  DiffIt.class  Lift.class   MultiTrie.class
 Optimizer.class   Reduce$Remap.class  Stock.class  Test.class
 Trie.class

 org/getopt:
 stempel

 org/getopt/stempel:
 Benchmark.class  lucene  Stemmer.class

 org/getopt/stempel/lucene:
 StempelAnalyzer.class  StempelFilter.class
 jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R META-INF/
 META-INF/:
 MANIFEST.MF
 jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R res
 res:
 tables

 res/tables:
 readme.txt  stemmer_1000.out  stemmer_100.out  stemmer_2000.out
 stemmer_200.out  stemmer_500.out  stemmer_700.out

 2010/11/2 Bernd Fehling :
> Hi Jakub,
>
> if you unzip your stempel-1.0.jar do you have the
> required directory structure and file in there?
> org/getopt/stempel/lucene/StempelFilter.class
>
> Regards,
> Bernd
>
> Am 02.11.2010 13:54, schrieb Jakub Godawa:
>> Erick, I've put the jar files like that before. I also added the <lib>
>> directive and put the file in instanceDir/lib
>>
>> What is still a problem is that even though the files are loaded:
>> 2010-11-02 13:20:48 org.apache.solr.core.SolrResourceLoader 
>> replaceClassLoader
>> INFO: Adding 
>> 'file:/home/jgodawa/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar'
>> to classloader
>>
>> I am not able to use the FilterFactory... maybe I am attempting it in
>> the wrong way?
>>
>> Cheers,
>> Jakub Godawa.
>>
>> 2010/11/2 Erick Erickson :
>>> The Polish stemmer jar file needs to be findable by Solr; if you copy
>>> it to <solr home>/lib and restart Solr, you should be set.
>>>
>>> Alternatively, you can add another <lib> directive to the solrconfig.xml
>>> file
>>> (there are several examples in that file already).
>>>
>>> I'm a little confused about not being able to find TokenFilter, is that

Re: To cache or to not cache

2010-11-11 Thread Em

Jonathan,

thanks for your statement. In fact, you are quite right: a lot of people
have developed great caching mechanisms.
However, the solution I have in mind is something like an HTTP cache, in
most cases running on the same box.

I talked to some experts who told me that Squid would be rather heavyweight
for this, since we only want it for HTTP caching.
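
For any front-side HTTP cache to help, Solr itself has to emit validators and cache-control headers; a minimal sketch of the <httpCaching> element from solrconfig.xml, assuming Solr 1.4 defaults (the max-age value is an assumption -- tune it to your commit frequency):

    <!-- sketch for solrconfig.xml: derive Last-Modified from the searcher's
         open time so a front-side cache can validate responses -->
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
      <cacheControl>max-age=60, public</cacheControl>
    </httpCaching>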

Do you know of any benchmarks on responses per second when most of the
queried data is in the cache?

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/To-cache-or-to-not-cache-tp1875289p1881714.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr dynamic core creation

2010-11-11 Thread nizan

Does anyone have any idea how to do this?
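
If the goal is creating cores at runtime, the CoreAdmin handler can be driven over plain HTTP; a sketch, assuming the default adminPath from solr.xml and that the new core's instance directory (with its conf/) already exists on disk -- the core name and path below are placeholders:

    # a sketch: create a core at runtime via the CoreAdmin handler
    curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=/path/to/core1"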
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1881374.html
Sent from the Solr - User mailing list archive at Nabble.com.


data import scheduling

2010-11-11 Thread Tri Nguyen
Hi,

Has anyone gotten Solr to schedule data imports at a certain time interval 
purely through Solr configuration?

I tried setting interval=1, which should trigger an import every minute, but 
I don't see it happening.

I'm trying to avoid cron jobs.
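
As far as I know, the interval property in dataimport.properties belongs to a wiki-contributed DataImportScheduler add-on and is not part of stock Solr 1.4, so setting it alone does nothing. Short of cron, a small client-side timer that hits the DataImportHandler endpoint achieves the same effect; a sketch in Java, assuming DIH is registered at /dataimport (the URL, command, and one-minute period are assumptions):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Timer;
    import java.util.TimerTask;

    // A client-side sketch, not a stock Solr feature: trigger a DIH
    // delta-import once a minute. URL and period are assumptions.
    public class DihTimer {
        public static void main(String[] args) {
            final String url =
                "http://localhost:8983/solr/dataimport?command=delta-import";
            Timer timer = new Timer("dih-timer"); // non-daemon, keeps JVM alive
            timer.scheduleAtFixedRate(new TimerTask() {
                public void run() {
                    try {
                        // Requesting the URL is enough to kick off the import.
                        InputStream in = new URL(url).openStream();
                        in.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }, 0L, 60L * 1000L);
        }
    }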

Thanks,

Tri