There are several mistakes in your approach:
copyField just copies data; an index-time boost is not copied along with it.
There is no such boosting syntax as /select?q=Each&title^9&fl=score
You are searching on your default field.
This is not the cause of your problem, but omitNorms="true" disables index-time
boosts.
http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need.
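A minimal dismax request along those lines might look like the following (the field names title and searchFields are taken from the schema fragments later in this message and may not match the actual schema):

  /select?defType=dismax&q=Each+Little+Bird+That+Sings&qf=title^9+searchFields&fl=*,score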
--- On Thu, 11/11/10, Solr User wrote:
> From: Solr User
> Subject: Re: WELCOME to solr-user@lucene.apache.org
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 11:54 PM
> Eric,
>
> Thank you so much for the reply and apologize for not
> providing all the
> details.
>
> The following are the field definitions in my schema.xml:
>
> [field definitions omitted: the XML markup was stripped by the mail archive;
> only attribute fragments such as indexed="true", stored="true",
> multiValued="true" and omitNorms="true"/omitNorms="false" survive]
>
> Copy Fields:
>
> [copyField declarations omitted: the XML was stripped by the mail archive;
> the catch-all destination appears to be a field named "searchFields"]
>
> Before creating the indexes I feed an XML file to a Solr job that creates the
> index files. I added a boost attribute to the title field before creating the
> indexes; an example is below:
>
> standalone="no"?> name="material">1785440 boost="10.0" name="title">Each Little
> Bird That Sings name="price">16.0 name="isbn10">0152051139 name="isbn13">9780152051136 name="format">Hardcover name="pubdate">2005-03-01 name="pubyear">2005 name="reldate">2005-02-22 name="pages">272 name="bisacstatus">Active name="season">Spring
> 2005 name="imprint">Children's name="age">8.0-12.0 name="grade">3-6 name="author">Marla Frazee name="authortype">Jacket
> IllustratorDeborah
> Wiles name="authortype">Author name="bisacsub">Social
> Issues/Friendship name="bisacsub">Social Issues/General (see
> also headings under Family) name="bisacsub">General name="bisacsub">Girls &
> Women name="category">Fiction/Middle
> Grade name="category">Fiction/Award
> WinnersComing
> of AgeSocial
> Situations/Death &
> DyingSocial
> Situations/Friendship name="path">/assets/product/0152051139.gif name="desc">Ten-year-old Comfort
> Snowberger has attended 247
> funerals. But that's not surprising, considering that her
> family runs the
> town funeral home. And even though Great-uncle Edisto
> keeled over with a
> heart attack and Great-great-aunt Florentine dropped
> dead--just like
> that--six months later, Comfort knows how to deal with
> loss, or so she
> thinks. She's more concerned with avoiding her crazy cousin
> Peach and trying
> to figure out why her best friend, Declaration, suddenly
> won't talk to her.
> Life is full of surprises. And the biggest one of all is
> learning what it
> takes to handle them.
>
Deborah Wiles has created a
> unique, funny, and utterly real cast of characters in this
> heartfelt, and
> quintessentially Southern coming-of-age novel. Comfort will
> charm young
> readers with her wit, her warmth, and her struggles as she
> learns about
> life, loss, and ultimately,
> triumph.
name="shortdesc">Ten-year-old Comfort Snowberger learns
> about life's
> surprises in this funny, poignant, and very Southern
> coming-of-age
> story. name="material">1195443 boost="10.0" name="title">Baby Bear's
> Chairs name="price">16.0 name="isbn10">0152051147 name="isbn13">9780152051143 name="format">Hardcover name="pubdate">2005-09-01 name="pubyear">2005 name="reldate">2005-08-01 name="pages">40 name="bisacstatus">Active name="season">Fall
> 2005 name="imprint">Children's name="age">2.0-5.0 name="grade">P-K name="author">Jane Yolen name="authortype">Author name="author">Melissa
> Sweet name="authortype">Illustrator name="bisacsub">Bedtime &
> Dreams name="bisacsub">Animals/Bears name="bisacsub">Family/General
> (see also headings under Social
> Issues)Social
> Issues/Emotions & Feelings name="bisacsub">Family/Parents name="category">Animals/Bears name="category">Bedtime
> BooksFamily
> Relationships/Parent-Child name="path">/assets/product/0152051147.gif name="desc">Baby Bear is the littlest
> bear in his family, and
> sometimes that's not so easy. Mama and Papa Bear get to
> stay up late in
> their great big chairs. Big brother gets to play fun games
> in his
> middle-sized ch
Eric,
Thank you so much for the reply, and apologies for not providing all the
details.
The following are the field definitions in my schema.xml:
[field definitions stripped by the mail archive]
Copy Fields:
[copyField declarations stripped by the mail archive; the catch-all destination appears to be "searchFields"]
Before creating the indexes I feed an XML file to a Solr job that creates the
index files. I added a boost attribute to the title field before creating the
indexes; an example is below:
[example add documents stripped by the mail archive: two documents, "Each
Little Bird That Sings" (material 1785440, ISBN 0152051139) and "Baby Bear's
Chairs" (material 1195443, ISBN 0152051147), each with boost="10.0" on the
title field plus price, format, pubdate, author, category, description and
shortdesc fields]
I am trying to boost the title field so that the search results bring the
document whose title actually matches up as the first item in the results.
Adding the boost attribute to the title field (index-time boosting) did not
change the search results. I also tried query-time boosting as shown
below, but with no luck:
/select?q=Each+Little+Bird+That+Sings&title^9&fl=score
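For comparison, with the standard query parser a query-time boost has to be attached to a field query rather than passed as a bare parameter. A sketch (assuming "searchFields" is the catch-all copyField destination mentioned above):

  /select?q=title:(Each+Little+Bird+That+Sings)^9+OR+searchFields:(Each+Little+Bird+That+Sings)&fl=*,score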
Any help to fix this issue would be really helpful.
Thanks,
Solr User
On Thu, Nov 11, 2010 at 10:32 AM, Solr User wrote:
> Hi,
>
> I have a question about boosting.
>
> I have the following fields in my schema.xml:
>
> 1. title
> 2. description
> 3. ISBN
>
> etc
>
> I want to boost the field title. I tried index time boosting but it did not
> work. I also tried Query time boosting but with no luck.
>
> Can someone help me on how to implement boosting on a specific field like
> title?
>
> Thanks,
> Solr User
>
>
>
There's not much to go on here. Boosting works,
and index-time and query-time boosting
address two different needs. Could you add some
detail? All you've really said is "it didn't work", which
doesn't allow a very constructive response.
Perhaps you could review:
http://wiki.apache.org/solr/HowToContribute
Best
Erick
On Thu, Nov 11, 2010 at 10:32 AM, Solr User wrote:
> Hi,
>
> I have a question about boosting.
>
> I have the following fields in my schema.xml:
>
> 1. title
> 2. description
> 3. ISBN
>
> etc
>
> I want to boost the field title. I tried index time boosting but it did not
> work. I also tried Query time boosting but with no luck.
>
> Can someone help me on how to implement boosting on a specific field like
> title?
>
> Thanks,
> Solr User
>
>
>
Ah, I see. Thanks for the explanation.
Could you set the defaultOperator to "AND"? That way both "Bill" and "Cl" must
match, which would exclude "Clyde Phillips".
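For reference, a minimal way to try that: in schema.xml,

  <solrQueryParser defaultOperator="AND"/>

or, without touching the schema, append &q.op=AND to the request.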
--- On Thu, 11/11/10, Robert Gründler wrote:
> From: Robert Gründler
> Subject: Re: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 3:51 PM
> according to the fieldtype i posted
> previously, i think it's because of:
>
> 1. WhiteSpaceTokenizer splits the String "Clyde Phillips"
> into 2 tokens: "Clyde" and "Phillips"
> 2. EdgeNGramFilter gets the 2 tokens, and creates an
> EdgeNGram for each token: "C" "Cl" "Cly"
> ... AND "P" "Ph" "Phi" ...
>
> The Query String "Bill Cl" gets split up in 2 Tokens "Bill"
> and "Cl" by the WhitespaceTokenizer.
>
> This creates a match between the 2nd token "Cl" of the query
> and one of the "sub" tokens the EdgeNGramFilter created:
> "Cl".
>
>
> -robert
>
>
>
>
> On Nov 11, 2010, at 21:34 , Andy wrote:
>
> > Could anyone help me understand what does "Clyde
> Phillips" appear in the results for "Bill Cl"??
> >
> > "Clyde Phillips" doesn't produce any EdgeNGram that
> would match "Bill Cl", so why is it even in the results?
> >
> > Thanks.
> >
> > --- On Thu, 11/11/10, Ahmet Arslan
> wrote:
> >
> >> You can add an additional field, with
> >> using KeywordTokenizerFactory instead of
> >> WhitespaceTokenizerFactory. And query both these
> fields with
> >> an OR operator.
> >>
> >> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> >>
> >> You can even apply boost so that begins with
> matches comes
> >> first.
> >>
> >> --- On Thu, 11/11/10, Robert Gründler
> >> wrote:
> >>
> >>> From: Robert Gründler
> >>> Subject: EdgeNGram relevancy
> >>> To: solr-user@lucene.apache.org
> >>> Date: Thursday, November 11, 2010, 5:51 PM
> >>> Hi,
> >>>
> >>> consider the following fieldtype (used for
> >>> autocompletion):
> >>>
> >>> name="edgytext"
> >> class="solr.TextField"
> >>> positionIncrementGap="100">
> >>>
> >>> >>> class="solr.WhitespaceTokenizerFactory"/>
> >>> >>> class="solr.LowerCaseFilterFactory"/>
> >>> >>> class="solr.StopFilterFactory"
> ignoreCase="true"
> >>> words="stopwords.txt"
> enablePositionIncrements="true"
> >>> />
> >>> >>> class="solr.PatternReplaceFilterFactory"
> >> pattern="([^a-z])"
> >>> replacement="" replace="all" />
> >>> >>> class="solr.EdgeNGramFilterFactory"
> minGramSize="1"
> >>> maxGramSize="25" />
> >>>
> >>>
> >>> >>> class="solr.WhitespaceTokenizerFactory"/>
> >>> >>> class="solr.LowerCaseFilterFactory"/>
> >>> >>> class="solr.StopFilterFactory"
> ignoreCase="true"
> >>> words="stopwords.txt"
> enablePositionIncrements="true"
> >> />
> >>> >>> class="solr.PatternReplaceFilterFactory"
> >> pattern="([^a-z])"
> >>> replacement="" replace="all" />
> >>>
> >>>
> >>>
> >>>
> >>> This works fine as long as the query string is
> a
> >> single
> >>> word. For multiple words, the ranking is
> weird
> >> though.
> >>>
> >>> Example:
> >>>
> >>> Query String: "Bill Cl"
> >>>
> >>> Result (in that order):
> >>>
> >>> - Clyde Phillips
> >>> - Clay Rogers
> >>> - Roger Cloud
> >>> - Bill Clinton
> >>>
> >>> "Bill Clinton" should have the highest rank in
> that
> >>> case.
> >>>
> >>> Has anyone an idea how to to configure this
> fieldtype
> >> to
> >>> make matches in both tokens rank higher than
> those who
> >> match
> >>> in either token?
> >>>
> >>>
> >>> thanks!
> >>>
> >>>
> >>> -robert
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> >
>
>
> select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0
> >
> > "wildcard queries are not analyzed" http://search-lucene.com/m/pnmlH14o6eM1/
> >
>
> Yeah I found out about this a couple of minutes after I
> posted my problem. If there is no analyzer then
> why is Solr not finding any documents when a single quote
> precedes the wildcard?
Probably your index analyzer (WordDelimiterFilterFactory) is eating that single
quote. You can verify this on the admin/analysis.jsp page. In other words, there is
no term beginning with lowe' in your index. You can try searching for just lowe*
Hi,
I am using a facet.prefix search with shingles in my autosuggest.
Now I would like to prevent stop words from appearing in the suggestions:
[suggestion terms stripped by the mail archive; only the facet counts 52, 6, 6, 5, 25 and 7 remain]
Here I would really like to filter out the last 4 suggestions. Is there a way I
can sensibly bring in a stop word filter here? In theory the stop
words could appear as the first or second word as well.
So I guess that when producing shingles I want to keep any stop word from becoming
part of any shingle, as sketched below.
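A minimal field type along those lines, assuming the stop words should simply be dropped before shingling (the type name is illustrative; whether shingles then bridge the gap left by a removed stop word depends on how position increments are handled, so this is only a starting point to experiment with):

  <fieldType name="autosuggest_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    </analyzer>
  </fieldType>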
regards,
Lukas Kahwe Smith
m...@pooteeweet.org
On 2010-11-11, at 3:45 PM, Ahmet Arslan wrote:
>> I'm having some trouble with a query using some wildcard
>> and I was wondering if anyone could tell me why these two
>> similar queries do not return the same number of results.
>> Basically, the query I'm making should return all docs whose
>> title starts
>> (or contain) the string "lowe'". I suspect some analyzer is
>> causing this behaviour and I'd like to know if there is a
>> way to fix this problem.
>>
>> 1)
>> select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0
>
> "wildcard queries are not analyzed" http://search-lucene.com/m/pnmlH14o6eM1/
>
Yeah I found out about this a couple of minutes after I posted my problem. If
there is no analyzer then
why is Solr not finding any documents when a single quote precedes the wildcard?
We're holding a free webinar on migration from FAST to Solr. Details below.
-Yonik
http://www.lucidimagination.com
=
Solr To The Rescue: Successful Migration From FAST ESP to Open Source
Search Based on Apache Solr
Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT)
Hosted by SearchDataManagement.com
For anyone concerned about the future of their FAST ESP applications
since the purchase of Fast Search and Transfer by Microsoft in 2008,
this webinar will provide valuable insights on making the switch to
for FAST ESP alternatives, differences between FAST and Solr, a
typical migration project lifecycle & methodology, complementary open
source tools, best practices, customer examples, and recommended next
steps, discussed by a three-person roundtable.
for FAST ESP alternatives, differences between FAST and Solr, a
typical migration project lifecycle & methodology, complementary open
source tools, best practices, customer examples, and recommended next
steps.
The speakers for this webinar are:
Helge Legernes, Founding Partner & CTO of Findwise
Michael McIntosh, VP Search Solutions for TNR Global
Eric Gaumer, Chief Architect for ESR Technology.
For more information and to register, please go to:
http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2
=
according to the fieldtype i posted previously, i think it's because of:
1. WhiteSpaceTokenizer splits the String "Clyde Phillips" into 2 tokens:
"Clyde" and "Phillips"
2. EdgeNGramFilter gets the 2 tokens, and creates an EdgeNGram for each token:
"C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ...
The Query String "Bill Cl" gets split up in 2 Tokens "Bill" and "Cl" by the
WhitespaceTokenizer.
This creates a match between the 2nd token "Cl" of the query and one of the
"sub" tokens the EdgeNGramFilter created: "Cl".
-robert
On Nov 11, 2010, at 21:34 , Andy wrote:
> Could anyone help me understand what does "Clyde Phillips" appear in the
> results for "Bill Cl"??
>
> "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so
> why is it even in the results?
>
> Thanks.
>
> --- On Thu, 11/11/10, Ahmet Arslan wrote:
>
>> You can add an additional field, with
>> using KeywordTokenizerFactory instead of
>> WhitespaceTokenizerFactory. And query both these fields with
>> an OR operator.
>>
>> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>>
>> You can even apply boost so that begins with matches comes
>> first.
>>
>> --- On Thu, 11/11/10, Robert Gründler
>> wrote:
>>
>>> From: Robert Gründler
>>> Subject: EdgeNGram relevancy
>>> To: solr-user@lucene.apache.org
>>> Date: Thursday, November 11, 2010, 5:51 PM
>>> Hi,
>>>
>>> consider the following fieldtype (used for
>>> autocompletion):
>>>
>>> > class="solr.TextField"
>>> positionIncrementGap="100">
>>>
>>> >> class="solr.WhitespaceTokenizerFactory"/>
>>> >> class="solr.LowerCaseFilterFactory"/>
>>> >> class="solr.StopFilterFactory" ignoreCase="true"
>>> words="stopwords.txt" enablePositionIncrements="true"
>>> />
>>> >> class="solr.PatternReplaceFilterFactory"
>> pattern="([^a-z])"
>>> replacement="" replace="all" />
>>> >> class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>> maxGramSize="25" />
>>>
>>>
>>> >> class="solr.WhitespaceTokenizerFactory"/>
>>> >> class="solr.LowerCaseFilterFactory"/>
>>> >> class="solr.StopFilterFactory" ignoreCase="true"
>>> words="stopwords.txt" enablePositionIncrements="true"
>> />
>>> >> class="solr.PatternReplaceFilterFactory"
>> pattern="([^a-z])"
>>> replacement="" replace="all" />
>>>
>>>
>>>
>>>
>>> This works fine as long as the query string is a
>> single
>>> word. For multiple words, the ranking is weird
>> though.
>>>
>>> Example:
>>>
>>> Query String: "Bill Cl"
>>>
>>> Result (in that order):
>>>
>>> - Clyde Phillips
>>> - Clay Rogers
>>> - Roger Cloud
>>> - Bill Clinton
>>>
>>> "Bill Clinton" should have the highest rank in that
>>> case.
>>>
>>> Has anyone an idea how to to configure this fieldtype
>> to
>>> make matches in both tokens rank higher than those who
>> match
>>> in either token?
>>>
>>>
>>> thanks!
>>>
>>>
>>> -robert
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
> I'm having some trouble with a query using some wildcard
> and I was wondering if anyone could tell me why these two
> similar queries do not return the same number of results.
> Basically, the query I'm making should return all docs whose
> title starts
> (or contain) the string "lowe'". I suspect some analyzer is
> causing this behaviour and I'd like to know if there is a
> way to fix this problem.
>
> 1)
> select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0
"wildcard queries are not analyzed" http://search-lucene.com/m/pnmlH14o6eM1/
Could anyone help me understand why "Clyde Phillips" appears in the
results for "Bill Cl"??
"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so
why is it even in the results?
Thanks.
--- On Thu, 11/11/10, Ahmet Arslan wrote:
> You can add an additional field, with
> using KeywordTokenizerFactory instead of
> WhitespaceTokenizerFactory. And query both these fields with
> an OR operator.
>
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>
> You can even apply boost so that begins with matches comes
> first.
>
> --- On Thu, 11/11/10, Robert Gründler
> wrote:
>
> > From: Robert Gründler
> > Subject: EdgeNGram relevancy
> > To: solr-user@lucene.apache.org
> > Date: Thursday, November 11, 2010, 5:51 PM
> > Hi,
> >
> > consider the following fieldtype (used for
> > autocompletion):
> >
> > class="solr.TextField"
> > positionIncrementGap="100">
> >
> > > class="solr.WhitespaceTokenizerFactory"/>
> > > class="solr.LowerCaseFilterFactory"/>
> > > class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" enablePositionIncrements="true"
> > />
> > > class="solr.PatternReplaceFilterFactory"
> pattern="([^a-z])"
> > replacement="" replace="all" />
> > > class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > maxGramSize="25" />
> >
> >
> > > class="solr.WhitespaceTokenizerFactory"/>
> > > class="solr.LowerCaseFilterFactory"/>
> > > class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" enablePositionIncrements="true"
> />
> > > class="solr.PatternReplaceFilterFactory"
> pattern="([^a-z])"
> > replacement="" replace="all" />
> >
> >
> >
> >
> > This works fine as long as the query string is a
> single
> > word. For multiple words, the ranking is weird
> though.
> >
> > Example:
> >
> > Query String: "Bill Cl"
> >
> > Result (in that order):
> >
> > - Clyde Phillips
> > - Clay Rogers
> > - Roger Cloud
> > - Bill Clinton
> >
> > "Bill Clinton" should have the highest rank in that
> > case.
> >
> > Has anyone an idea how to to configure this fieldtype
> to
> > make matches in both tokens rank higher than those who
> match
> > in either token?
> >
> >
> > thanks!
> >
> >
> > -robert
> >
> >
> >
> >
>
>
>
>
I look forward to the answers to this one.
Dennis Gearon
Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a
better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
EARTH has a Right To Life,
otherwise we all die.
- Original Message
From: Tod
To: solr-user@lucene.apache.org
Sent: Thu, November 11, 2010 11:35:23 AM
Subject: Retrieving indexed content containing multiple languages
My Solr corpus is currently created by indexing metadata from a relational
database as well as content pointed to by URLs from the database. I'm using a
pretty generic out of the box Solr schema. The search results are presented
via
an AJAX enabled HTML page.
When I perform a search the document title (for example) has a mix of english
and chinese characters. Everything there is fine - I can see the english and
chinese returned from a facet query on title. I can search against the title
using english words it contains and I get back an expected result. I asked a
chinese friend to perform the same search using chinese and nothing is returned.
How should I go about getting this search to work? Chinese is just one
language, I'll probably need to support more in the future.
My thought is that the chinese characters are indexed as their unicode
equivalent so all I'll need to do is make sure the query is encoded
appropriately and just perform a regular search as I would if the terms were in
english. For some reason that sounds too easy.
I see there is a CJK tokenizer that would help here. Do I need that for my
situation? Is there a fairly detailed tutorial on how to handle these types of
language challenges?
Thanks in advance - Tod
On 12 Nov 2010, at 01:46, Ahmet Arslan wrote:
>> This setup now makes troubles regarding StopWords, here's
>> an example:
>>
>> Let's say the index contains 2 Strings: "Mr Martin
>> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
>> list.
>>
>> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>>
>> This way, the only result i get is "Mr Martin Scorsese",
>> because the strict field edgytext2 is boosted by 2.0.
>>
>> Any idea why in this case "Martin Scorsese" is not in the
>> result at all?
>
> Did you run your query without using () and "" operators? If yes can you try
> this?
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
>
> If no can you paste output of &debugQuery=on
>
>
>
This would still not deal with the problem of removing stop words from the
indexing and query analysis stages.
I really need something that will allow that and give a single token as in the
example below.
Best
Nick
this is the full source code, but be warned, i'm not a java developer, and i
have no background in lucene/solr development:
// ConcatFilter
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
public class ConcatFilter extends TokenFilter {

    protected ConcatFilter(TokenStream input)
    {
        super(input);
    }

    @Override
    public Token next() throws IOException
    {
        Token token = new Token();
        StringBuilder builder = new StringBuilder();

        // Attribute views onto the wrapped stream: term text and token type.
        TermAttribute termAttribute = (TermAttribute)
            input.getAttribute(TermAttribute.class);
        TypeAttribute typeAttribute = (TypeAttribute)
            input.getAttribute(TypeAttribute.class);

        boolean hasToken = false;

        // Consume the entire upstream token stream and glue all "word" tokens
        // together into one string, skipping tokens of any other type.
        while (input.incrementToken())
        {
            if (typeAttribute.type().equals("word")) {
                builder.append(termAttribute.term());
                hasToken = true;
            }
        }

        // Emit exactly one concatenated token, or end the stream if the
        // input produced no word tokens at all.
        if (hasToken == true) {
            token.setTermBuffer(builder.toString());
            return token;
        }
        return null;
    }
}
//ConcatFilterFactory:
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;
public class ConcatFilterFactory extends BaseTokenFilterFactory {
    @Override
    public TokenStream create(TokenStream stream) {
        // Wrap the incoming stream so ConcatFilter can join all word tokens
        // into a single token.
        return new ConcatFilter(stream);
    }
}
and in your schema.xml, you can simply add the filterfactory using this element:
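[the element itself was stripped by the mail archive; presumably it was something like the following, assuming the compiled classes sit in the default package on Solr's classpath]

  <filter class="ConcatFilterFactory"/>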
Jar files i have included in the buildpath (can be found in the solr download
package):
apache-solr-core-1.4.1.jar
lucene-analyzers-2.9.3.jar
lucene-core-2.9.3.jar
good luck ;)
-robert
On Nov 11, 2010, at 8:45 PM, Nick Martin wrote:
> Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm
> not sure what i need in the classpath and where Token comes from.
> Will check the thread you mention.
>
> Best
>
> Nick
>
> On 11 Nov 2010, at 18:13, Robert Gründler wrote:
>
>> I've posted a ConcaFilter in my previous mail which does concatenate tokens.
>> This works fine, but i
>> realized that what i wanted to achieve is implemented easier in another way
>> (by using 2 separate field types).
>>
>> Have a look at a previous mail i wrote to the list and the reply from Ahmet
>> Arslan (topic: "EdgeNGram relevancy).
>>
>>
>> best
>>
>>
>> -robert
>>
>>
>>
>>
>> See
>> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>>
>>> Hi Robert, All,
>>>
>>> I have a similar problem, here is my fieldType,
>>> http://paste.pocoo.org/show/289910/
>>> I want to include stopword removal and lowercase the incoming terms. The
>>> idea being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
>>> EdgeNgram filter factory.
>>> If anyone can tell me a simple way to concatenate tokens into one token
>>> again, similar too the KeyWordTokenizer that would be super helpful.
>>>
>>> Many thanks
>>>
>>> Nick
>>>
>>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>>
On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
> Are you sure you really want to throw out stopwords for your use case? I
> don't think autocompletion will work how you want if you do.
in our case i think it makes sense. the content is targetting the
electronic music / dj scene, so we have a lot of words like "DJ" or
"featuring" which
make sense to throw out of the query. Also searches for "the beastie boys"
and "beastie boys" should return a match in the autocompletion.
>
> And if you don't... then why use the WhitespaceTokenizer and then try to
> jam the tokens back together? Why not just NOT tokenize in the first
> place. Use the KeywordTokenizer, which really should be called the
> NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just
> creates one token from the entire input string.
I started out with the KeywordTokenizer, which worked well, except the
StopWord problem.
For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
does what i'm after:
public class ConcatFilter extends TokenFilter {
private TokenStream tstream;
protected ConcatFilter(TokenStream input) {
super(input);
this.tstream = input;
}
@Override
public Token next() throws IOException {
Token token = new Token();
StringBuilder builder = new StringBuilder();
TermAttribute termAttribute = (TermAttribute)
tstream.getAttribute(TermAttribute.class);
TypeAttribute typeAttribute = (TypeAttribute)
tstream.getAttribute(TypeAtt
Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not
sure what i need in the classpath and where Token comes from.
Will check the thread you mention.
Best
Nick
On 11 Nov 2010, at 18:13, Robert Gründler wrote:
> I've posted a ConcaFilter in my previous mail which does concatenate tokens.
> This works fine, but i
> realized that what i wanted to achieve is implemented easier in another way
> (by using 2 separate field types).
>
> Have a look at a previous mail i wrote to the list and the reply from Ahmet
> Arslan (topic: "EdgeNGram relevancy).
>
>
> best
>
>
> -robert
>
>
>
>
> See
> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>
>> Hi Robert, All,
>>
>> I have a similar problem, here is my fieldType,
>> http://paste.pocoo.org/show/289910/
>> I want to include stopword removal and lowercase the incoming terms. The
>> idea being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
>> EdgeNgram filter factory.
>> If anyone can tell me a simple way to concatenate tokens into one token
>> again, similar too the KeyWordTokenizer that would be super helpful.
>>
>> Many thanks
>>
>> Nick
>>
>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>
>>>
>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>>
Are you sure you really want to throw out stopwords for your use case? I
don't think autocompletion will work how you want if you do.
>>>
>>> in our case i think it makes sense. the content is targetting the
>>> electronic music / dj scene, so we have a lot of words like "DJ" or
>>> "featuring" which
>>> make sense to throw out of the query. Also searches for "the beastie boys"
>>> and "beastie boys" should return a match in the autocompletion.
>>>
And if you don't... then why use the WhitespaceTokenizer and then try to
jam the tokens back together? Why not just NOT tokenize in the first
place. Use the KeywordTokenizer, which really should be called the
NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just
creates one token from the entire input string.
>>>
>>> I started out with the KeywordTokenizer, which worked well, except the
>>> StopWord problem.
>>>
>>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
>>> does what i'm after:
>>>
>>> public class ConcatFilter extends TokenFilter {
>>>
>>> private TokenStream tstream;
>>>
>>> protected ConcatFilter(TokenStream input) {
>>> super(input);
>>> this.tstream = input;
>>> }
>>>
>>> @Override
>>> public Token next() throws IOException {
>>>
>>> Token token = new Token();
>>> StringBuilder builder = new StringBuilder();
>>>
>>> TermAttribute termAttribute = (TermAttribute)
>>> tstream.getAttribute(TermAttribute.class);
>>> TypeAttribute typeAttribute = (TypeAttribute)
>>> tstream.getAttribute(TypeAttribute.class);
>>>
>>> boolean incremented = false;
>>>
>>> while (tstream.incrementToken()) {
>>>
>>> if (typeAttribute.type().equals("word")) {
>>> builder.append(termAttribute.term());
>>>
>>> }
>>> incremented = true;
>>> }
>>>
>>> token.setTermBuffer(builder.toString());
>>>
>>> if (incremented == true)
>>> return token;
>>>
>>> return null;
>>> }
>>> }
>>>
>>> I'm not sure if this is a safe way to do this, as i'm not familar with the
>>> whole solr/lucene implementation after all.
>>>
>>>
>>> best
>>>
>>>
>>> -robert
>>>
>>>
>>>
>>>
Then lowercase, remove whitespace (or not), do whatever else you want to
do to your single token to normalize it, and then edgengram it.
If you include whitespace in the token, then when making your queries for
auto-complete, be sure to use a query parser that doesn't do
"pre-tokenization", the 'field' query parser should work well for this.
Jonathan
From: Robert Gründler [rob...@dubture.com]
Sent: Wednesday, November 10, 2010 6:39 PM
To: solr-user@lucene.apache.org
Subject: Concatenate multiple tokens into one
Hi,
i've created the following filterchain in a field type, the idea is to use
it for autocompletion purposes:
[filterchain stripped by the mail archive; the surviving fragments show a
StopFilter (words="stopwords.txt", enablePositionIncrements="true"), a
PatternReplaceFilter (replacement="", replace="all") and an EdgeNGramFilter
(maxGramSize="25")]
With that kind of filterchain, the EdgeNGramFilterFactory will receive
multiple tokens for input strings with whitespace in them. This leads to the
following results:
My Solr corpus is currently created by indexing metadata from a
relational database as well as content pointed to by URLs from the
database. I'm using a pretty generic out of the box Solr schema. The
search results are presented via an AJAX enabled HTML page.
When I perform a search the document title (for example) has a mix of
english and chinese characters. Everything there is fine - I can see
the english and chinese returned from a facet query on title. I can
search against the title using english words it contains and I get back
an expected result. I asked a chinese friend to perform the same search
using chinese and nothing is returned.
How should I go about getting this search to work? Chinese is just one
language, I'll probably need to support more in the future.
My thought is that the chinese characters are indexed as their unicode
equivalent so all I'll need to do is make sure the query is encoded
appropriately and just perform a regular search as I would if the terms
were in english. For some reason that sounds too easy.
I see there is a CJK tokenizer that would help here. Do I need that for
my situation? Is there a fairly detailed tutorial on how to handle
these types of language challenges?
Thanks in advance - Tod
Hi,
I cannot find out how this is occurring:
Nolosearch/com/search/apachesolr_search/law
You can see that the John Paul Stevens result yields more description in the
search result because of the keyword relevancy, whereas, the other results
just give you a snippet of the title based on keywords found.
I am trying to figure out how to get a standard size search result no matter
what the relevancy is. While application of this type of result would be
irrelevant to many search engines it is completely practical in a legal
setting as a keyword is only as good as how it is being referenced in the
sentence or paragraph. What a dilemma I have!
I have been trying to figure out whether it is controlled by schema.xml or
solrconfig.xml and, for the life of me, I can't find it referenced
anywhere. I tried changing the fragsize to 200 instead of the default of about
70, but it made no difference after re-indexing.
This problem is super critical to my search results. Like I said, as an
attorney, the keyword is superfluous until it is attached to a long sentence or
two that shows whether the keyword we searched for is relevant, let
alone worthy of a click. That is why my titles are set to open in a new
window for faster access; if the result is crud, you just close the window
and get back to research.
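For what it's worth, the snippet length is a highlighter request parameter rather than a schema setting, so changing it does not require re-indexing. A minimal sketch of the raw Solr parameters (the field name "body" is only an illustration; set the equivalent options wherever your front end builds the query, or as defaults on the request handler in solrconfig.xml):

  &hl=true&hl.fl=body&hl.fragsize=200&hl.snippets=2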
Eric
> This setup now makes troubles regarding StopWords, here's
> an example:
>
> Let's say the index contains 2 Strings: "Mr Martin
> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
> list.
>
> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>
> This way, the only result i get is "Mr Martin Scorsese",
> because the strict field edgytext2 is boosted by 2.0.
>
> Any idea why in this case "Martin Scorsese" is not in the
> result at all?
Did you run your query without using () and "" operators? If yes can you try
this?
&q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
If no can you paste output of &debugQuery=on
Hello All.
My first time post so be kind. Developing a document store with lots and lots
of very small documents. (200 million at the moment. Final size will probably
be double this at 400 million documents). This is proof-of-concept development,
so we are seeing what a single core can do for us before we consider sharding.
We'd rather not shard if we don't have to.
I'm using SOLR 4.0 (for the simple facet pivots and groups which work well).
We're into week 4 of our development and have the production servers etc set
up. Everything working very well until we start to test queries with production
volumes of data.
I'm running into Java Heap Space exceptions during simple faceting on inverted
fields. The fields we are currently faceting on are names - Country / Continent
/ City names all stored as a Solr.StringField (there are other fields using
tokenization to provide initial search but we want to use the simple
StringFields to provide faceted navigation). In total we have 10 fields we'd
ever want to facet on (8 names fields that are strings and 2 Datepart fields
(year and yearMonth) that are also strings)).
This is our first time using SOLR and I didn't realise that we'd need so much
heap for facets!
Solr is running in tomcat container and I've currently set tomcat to use a max
of
JAVA_OPTS="$JAVA_OPTS -server -Xms512m -Xmx3m"
I've been reading all I can find online and have seen advice to populate the
facets caches first as soon as we've started the solr service. However I'd
really like to know if there are ways to reduce the memory footprint. We
currently have 32g of physical ram. Adding more ram is an option but I'm being
asked the (completely reasonable) question -- "Why do you need so much?"
Please help!
Charlie.
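One knob that may be worth trying before adding RAM: per-field enum faceting uses the filterCache instead of building large in-memory field caches, which can reduce the heap needed for low-cardinality string facets. A hedged example (field names are placeholders):

  &facet=true&facet.field=country&facet.field=city&f.country.facet.method=enum&f.city.facet.method=enum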
-Original Message-
From: Robert Gründler [mailto:rob...@dubture.com]
Sent: 11 November 2010 18:14
To: solr-user@lucene.apache.org
Subject: Re: Concatenate multiple tokens into one
I've posted a ConcaFilter in my previous mail which does concatenate tokens.
This works fine, but i realized that what i wanted to achieve is implemented
easier in another way (by using 2 separate field types).
Have a look at a previous mail i wrote to the list and the reply from Ahmet
Arslan (topic: "EdgeNGram relevancy).
best
-robert
See
On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
> Hi Robert, All,
>
> I have a similar problem, here is my fieldType,
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea
> being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
> EdgeNgram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token
> again, similar too the KeyWordTokenizer that would be super helpful.
>
> Many thanks
>
> Nick
>
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
>>
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>
>>> Are you sure you really want to throw out stopwords for your use case? I
>>> don't think autocompletion will work how you want if you do.
>>
>> in our case i think it makes sense. the content is targetting the
>> electronic music / dj scene, so we have a lot of words like "DJ" or
>> "featuring" which make sense to throw out of the query. Also searches for
>> "the beastie boys" and "beastie boys" should return a match in the
>> autocompletion.
>>
>>>
>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>> jam the tokens back together? Why not just NOT tokenize in the first place.
>>> Use the KeywordTokenizer, which really should be called the
>>> NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just creates
>>> one token from the entire input string.
>>
>> I started out with the KeywordTokenizer, which worked well, except the
>> StopWord problem.
>>
>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
>> does what i'm after:
>>
>> public class ConcatFilter extends TokenFilter {
>>
>> private TokenStream tstream;
>>
>> protected ConcatFilter(TokenStream input) {
>> super(input);
>> this.tstream = input;
>> }
>>
>> @Override
>> public Token next() throws IOException {
>>
>> Token token = new Token();
>> StringBuilder builder = new StringBuilder();
>>
>> TermAttribute termAttribute = (TermAttribute)
>> tstream.getAttribute(TermAttribute.class);
>> TypeAttribute typeAttribute = (TypeAttribute)
>> tstream.getAttribute(TypeAttribute.class);
>>
>> boolean incremented = false;
>>
>> while (tstream.incrementToken()) {
>>
>> if (typeAttribute.type().equals("word")) {
>> builder.append(termAttribute.term());
>> }
>> incremented = true;
>> }
>>
>> token.setTermBuffer(builder.toString());
>>
>>
I've posted a ConcaFilter in my previous mail which does concatenate tokens.
This works fine, but i
realized that what i wanted to achieve is implemented easier in another way (by
using 2 separate field types).
Have a look at a previous mail i wrote to the list and the reply from Ahmet
Arslan (topic: "EdgeNGram relevancy).
best
-robert
See
On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
> Hi Robert, All,
>
> I have a similar problem, here is my fieldType,
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea
> being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
> EdgeNgram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token
> again, similar too the KeyWordTokenizer that would be super helpful.
>
> Many thanks
>
> Nick
>
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
>>
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>
>>> Are you sure you really want to throw out stopwords for your use case? I
>>> don't think autocompletion will work how you want if you do.
>>
>> in our case i think it makes sense. the content is targetting the electronic
>> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
>> make sense to throw out of the query. Also searches for "the beastie boys"
>> and "beastie boys" should return a match in the autocompletion.
>>
>>>
>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>> jam the tokens back together? Why not just NOT tokenize in the first place.
>>> Use the KeywordTokenizer, which really should be called the
>>> NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just creates
>>> one token from the entire input string.
>>
>> I started out with the KeywordTokenizer, which worked well, except the
>> StopWord problem.
>>
>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
>> does what i'm after:
>>
>> public class ConcatFilter extends TokenFilter {
>>
>> private TokenStream tstream;
>>
>> protected ConcatFilter(TokenStream input) {
>> super(input);
>> this.tstream = input;
>> }
>>
>> @Override
>> public Token next() throws IOException {
>>
>> Token token = new Token();
>> StringBuilder builder = new StringBuilder();
>>
>> TermAttribute termAttribute = (TermAttribute)
>> tstream.getAttribute(TermAttribute.class);
>> TypeAttribute typeAttribute = (TypeAttribute)
>> tstream.getAttribute(TypeAttribute.class);
>>
>> boolean incremented = false;
>>
>> while (tstream.incrementToken()) {
>>
>> if (typeAttribute.type().equals("word")) {
>> builder.append(termAttribute.term());
>>
>> }
>> incremented = true;
>> }
>>
>> token.setTermBuffer(builder.toString());
>>
>> if (incremented == true)
>> return token;
>>
>> return null;
>> }
>> }
>>
>> I'm not sure if this is a safe way to do this, as i'm not familar with the
>> whole solr/lucene implementation after all.
>>
>>
>> best
>>
>>
>> -robert
>>
>>
>>
>>
>>>
>>> Then lowercase, remove whitespace (or not), do whatever else you want to do
>>> to your single token to normalize it, and then edgengram it.
>>>
>>> If you include whitespace in the token, then when making your queries for
>>> auto-complete, be sure to use a query parser that doesn't do
>>> "pre-tokenization", the 'field' query parser should work well for this.
>>>
>>> Jonathan
>>>
>>>
>>>
>>>
>>> From: Robert Gründler [rob...@dubture.com]
>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Concatenate multiple tokens into one
>>>
>>> Hi,
>>>
>>> i've created the following filterchain in a field type, the idea is to use
>>> it for autocompletion purposes:
>>>
>>>
>>>
>>> >> words="stopwords.txt" enablePositionIncrements="true" />
>>> >> replacement="" replace="all" />
>>>
>>>
>>>
>>> >> maxGramSize="25" />
>>>
>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>>> multiple tokens on input strings with whitespaces in it. This leads to the
>>> following results:
>>> Input Query: "George Cloo"
>>> Matches:
>>> - "George Harrison"
>>> - "John Clooridge"
>>> - "George Smith"
>>> -"George Clooney"
>>> - etc
>>>
>>> However, only "George Clooney" should match in the autocompletion use case.
>>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory,
>>> which concatenates all the tokens generated by the
>>> WhitespaceTokenizerFac
thanks a lot, that setup works pretty well now.
the only problem now is that the StopWords do not work that well anymore. I'll
provide an example, but first the 2 fieldtypes:
[field type definitions stripped by the mail archive]
This setup now causes trouble with StopWords; here's an example:
Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin
Scorsese". "Mr" is in the stopword list.
Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
This way, the only result i get is "Mr Martin Scorsese", because the strict
field edgytext2 is boosted by 2.0.
Any idea why in this case "Martin Scorsese" is not in the result at all?
thanks again!
-robert
On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:
> You can add an additional field, with using KeywordTokenizerFactory instead
> of WhitespaceTokenizerFactory. And query both these fields with an OR
> operator.
>
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>
> You can even apply boost so that begins with matches comes first.
>
> --- On Thu, 11/11/10, Robert Gründler wrote:
>
>> From: Robert Gründler
>> Subject: EdgeNGram relevancy
>> To: solr-user@lucene.apache.org
>> Date: Thursday, November 11, 2010, 5:51 PM
>> Hi,
>>
>> consider the following fieldtype (used for
>> autocompletion):
>>
>> > positionIncrementGap="100">
>>
>> > class="solr.WhitespaceTokenizerFactory"/>
>> > class="solr.LowerCaseFilterFactory"/>
>> > class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt" enablePositionIncrements="true"
>> />
>> > class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
>> replacement="" replace="all" />
>> > class="solr.EdgeNGramFilterFactory" minGramSize="1"
>> maxGramSize="25" />
>>
>>
>> > class="solr.WhitespaceTokenizerFactory"/>
>> > class="solr.LowerCaseFilterFactory"/>
>> > class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt" enablePositionIncrements="true" />
>> > class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
>> replacement="" replace="all" />
>>
>>
>>
>>
>> This works fine as long as the query string is a single
>> word. For multiple words, the ranking is weird though.
>>
>> Example:
>>
>> Query String: "Bill Cl"
>>
>> Result (in that order):
>>
>> - Clyde Phillips
>> - Clay Rogers
>> - Roger Cloud
>> - Bill Clinton
>>
>> "Bill Clinton" should have the highest rank in that
>> case.
>>
>> Has anyone an idea how to to configure this fieldtype to
>> make matches in both tokens rank higher than those who match
>> in either token?
>>
>>
>> thanks!
>>
>>
>> -robert
>>
>>
>>
>>
>
>
>
Are you storing the upload_by and business fields? You will not be able to
retrieve a field from your index if it is not stored. Check that you have
stored="true" for both of those fields.
- Paige
On Thu, Nov 11, 2010 at 10:23 AM, gauravshetti wrote:
>
> I am facing this weird issue in facet fields
>
> Within the config xml, under [element names stripped by the mail archive],
> I have defined the fl as:
>
>   file_id folder_id display_name file_name priority_text content_type
>   last_upload upload_by business indexed
>
> But my output xml doesn't contain the elements upload_by and business,
> even though I am able to search by upload_by: and business:.
> Even when I add &fl=* to the url I do not get these fields in the
> response.
> Any idea what I am doing wrong?
>
> Any idea what i am doing wrong.
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Issue-with-facet-fields-tp1883106p1883106.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
You can add an additional field, using KeywordTokenizerFactory instead of
WhitespaceTokenizerFactory, and query both of these fields with an OR operator:
edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
You can even apply a boost so that begins-with matches come first.
--- On Thu, 11/11/10, Robert Gründler wrote:
> From: Robert Gründler
> Subject: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 5:51 PM
> Hi,
>
> consider the following fieldtype (used for
> autocompletion):
>
> positionIncrementGap="100">
>
> class="solr.WhitespaceTokenizerFactory"/>
> class="solr.LowerCaseFilterFactory"/>
> class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"
> />
> class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
> replacement="" replace="all" />
> class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="25" />
>
>
> class="solr.WhitespaceTokenizerFactory"/>
> class="solr.LowerCaseFilterFactory"/>
> class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
> class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
> replacement="" replace="all" />
>
>
>
>
> This works fine as long as the query string is a single
> word. For multiple words, the ranking is weird though.
>
> Example:
>
> Query String: "Bill Cl"
>
> Result (in that order):
>
> - Clyde Phillips
> - Clay Rogers
> - Roger Cloud
> - Bill Clinton
>
> "Bill Clinton" should have the highest rank in that
> case.
>
> Has anyone an idea how to to configure this fieldtype to
> make matches in both tokens rank higher than those who match
> in either token?
>
>
> thanks!
>
>
> -robert
>
>
>
>
I am exploring support for Japanese language in solr.
Solr seems to provide CJKTokenizerFactory.
How useful is this module? Has anyone been using this in production for
Japanese language?
One shortfall it seems to have, from what I have been able to read up on, is
that it can generate a lot of false matches, for example matching kyoto when
searching for tokyo, etc.
I did not see many questions related to this module so I wonder if people
are actively using it.
If not are there any other solution in the market that are recommended by
solr users?
Thanks
Kumar
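For reference, a minimal field type wiring in that tokenizer might look like the following (the type name is illustrative; CJKTokenizerFactory indexes overlapping character bigrams rather than dictionary words, so it is worth testing precision against your own queries):

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>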
What you say is true. Solr is not an rdbms.
Kouta Osabe wrote:
Hi, all
I have a question about Solr and SolrJ's rollback.
I try to roll back like below:
try {
    server.addBean(dto);
    server.commit();
} catch (Exception e) {
    if (server != null) { server.rollback(); }
}
I expect that if any exception is thrown, the rollback runs, so none of the
data would be updated.
But once a commit has happened, the rollback no longer takes effect.
Is rollback only effective as long as no commit has been issued?
Is Solr/SolrJ's rollback not the same as an RDBMS rollback?
Hi, all
I have a question about Solr and SolrJ's rollback.
I try to roll back like below:
try {
    server.addBean(dto);
    server.commit();
} catch (Exception e) {
    if (server != null) { server.rollback(); }
}
I expect that if any exception is thrown, the rollback runs, so none of the
data would be updated.
But once a commit has happened, the rollback no longer takes effect.
Is rollback only effective as long as no commit has been issued?
Is Solr/SolrJ's rollback not the same as an RDBMS rollback?
Hi Robert, All,
I have a similar problem, here is my fieldType,
http://paste.pocoo.org/show/289910/
I want to include stopword removal and lowercase the incoming terms. The idea
being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNgram
filter factory.
If anyone can tell me a simple way to concatenate tokens into one token again,
similar to the KeywordTokenizer, that would be super helpful.
Many thanks
Nick
On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>
>> Are you sure you really want to throw out stopwords for your use case? I
>> don't think autocompletion will work how you want if you do.
>
> in our case i think it makes sense. the content is targetting the electronic
> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
> make sense to throw out of the query. Also searches for "the beastie boys"
> and "beastie boys" should return a match in the autocompletion.
>
>>
>> And if you don't... then why use the WhitespaceTokenizer and then try to jam
>> the tokens back together? Why not just NOT tokenize in the first place. Use
>> the KeywordTokenizer, which really should be called the
>> NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just creates
>> one token from the entire input string.
>
> I started out with the KeywordTokenizer, which worked well, except the
> StopWord problem.
>
> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
> does what i'm after:
>
> public class ConcatFilter extends TokenFilter {
>
> private TokenStream tstream;
>
> protected ConcatFilter(TokenStream input) {
> super(input);
> this.tstream = input;
> }
>
> @Override
> public Token next() throws IOException {
>
> Token token = new Token();
> StringBuilder builder = new StringBuilder();
>
> TermAttribute termAttribute = (TermAttribute)
> tstream.getAttribute(TermAttribute.class);
> TypeAttribute typeAttribute = (TypeAttribute)
> tstream.getAttribute(TypeAttribute.class);
>
> boolean incremented = false;
>
> while (tstream.incrementToken()) {
>
> if (typeAttribute.type().equals("word")) {
> builder.append(termAttribute.term());
>
> }
> incremented = true;
> }
>
> token.setTermBuffer(builder.toString());
>
> if (incremented == true)
> return token;
>
> return null;
> }
> }
>
> I'm not sure if this is a safe way to do this, as i'm not familar with the
> whole solr/lucene implementation after all.
>
>
> best
>
>
> -robert
>
>
>
>
>>
>> Then lowercase, remove whitespace (or not), do whatever else you want to do
>> to your single token to normalize it, and then edgengram it.
>>
>> If you include whitespace in the token, then when making your queries for
>> auto-complete, be sure to use a query parser that doesn't do
>> "pre-tokenization", the 'field' query parser should work well for this.
>>
>> Jonathan
>>
>>
>>
>>
>> From: Robert Gründler [rob...@dubture.com]
>> Sent: Wednesday, November 10, 2010 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Concatenate multiple tokens into one
>>
>> Hi,
>>
>> i've created the following filterchain in a field type, the idea is to use
>> it for autocompletion purposes:
>>
>>
>>
>> > words="stopwords.txt" enablePositionIncrements="true" />
>> > replacement="" replace="all" />
>>
>>
>>
>> > />
>>
>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>> multiple tokens on input strings with whitespaces in it. This leads to the
>> following results:
>> Input Query: "George Cloo"
>> Matches:
>> - "George Harrison"
>> - "John Clooridge"
>> - "George Smith"
>> -"George Clooney"
>> - etc
>>
>> However, only "George Clooney" should match in the autocompletion use case.
>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which
>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>> Are there filters which can do such a thing?
>>
>> If not, are there examples how to implement a custom TokenFilter?
>>
>> thanks!
>>
>> -robert
>>
>>
>>
>>
>
I've noticed that using camelCase in field names causes problems.
On 11/5/2010 11:02 AM, Will Milspec wrote:
Hi all,
we're moving from an old lucene version to solr and plan to use the "Copy
Field" functionality. Previously we had "rolled our own" implementation,
sticking title, description, etc. in a field called 'content'.
We lose some flexibility (i.e. the java layer can no longer control what goes into
the new copied field) in exchange for simplicity. A fair tradeoff IMO.
My question: has anyone found any subtle issues or "gotchas" with copy
fields?
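(For reference, the kind of setup being described would be roughly the following; field names are illustrative. One commonly cited gotcha: the destination usually needs multiValued="true" when several sources are copied into it, and copyField copies the raw incoming value, so it is the destination field's own analyzer that applies.)

  <field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
  <copyField source="title" dest="content"/>
  <copyField source="description" dest="content"/>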
(from the subject line "caveat"--pronounced 'kah-VEY-AT' is Latin as in
"Caveat Emptor"..."let the buyer beware").
thanks,
will
No - in reading what you just wrote, and what you originally wrote, I think
the misunderstanding was mine, based on the architecture of my code. In my
code, it is our 'server' level that does the SolrJ indexing calls, but you
meant 'server' to be the Solr instance, and what you mean by 'client' is
what I was thinking of (without thinking) as the 'server'...
Sorry about that. Hopefully someone else can chime in on your specific
issue...
--
View this message in context:
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883354.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hi,
consider the following fieldtype (used for autocompletion):
This works fine as long as the query string is a single word. For multiple
words, the ranking is weird though.
Example:
Query String: "Bill Cl"
Result (in that order):
- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton
"Bill Clinton" should have the highest rank in that case.
Has anyone an idea how to configure this fieldtype so that matches in both
tokens rank higher than matches in only one of them?
thanks!
-robert
I'm going down the route of patching nutch so I can use this ParseMetaTags
plugin:
https://issues.apache.org/jira/browse/NUTCH-809
Also wondering whether I will be able to use the XMLParser to let me parse
well-formed XHTML; being able to use XPath would be a bonus:
https://issues.apache.org/jira/browse/NUTCH-185
Any thoughts appreciated...
--
View this message in context:
http://lucene.472066.n3.nabble.com/Crawling-with-nutch-and-mapping-fields-to-solr-tp1879060p1883295.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hi,
Maybe I just don't understand the whole concept and I'm mixing up server and
client...
Client - the place where I make the HTTP calls (for index, search etc.), where
I use the CommonsHttpSolrServer as the Solr server. This machine isn't defined
as master or slave, it just uses Solr as a search engine.
Server - the HTTP calls I make on the client go to another server, the master
Solr server (or one of the slaves), where I have an EmbeddedSolrServer. Don't
they?
thanks, nizan
--
View this message in context:
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883269.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hmmm. Maybe you need to define what you mean by 'server' and what you mean
by 'client'.
--
View this message in context:
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883238.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hi All,
I'm having some trouble with a query using a wildcard and I was wondering if
anyone could tell me why these two similar queries do not return the same
number of results. Basically, the query I'm making should return all docs
whose title starts with (or contains) the string "lowe'". I suspect some
analyzer is causing this behaviour and I'd like to know if there is a way to
fix this problem.
1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0
rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: title:( lowe')
parsed_filter_queries: title:low
2) select?q=*:*&fq=title:(+lowe'*)&debugQuery=on&rows=0
rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: title:( lowe'*)
parsed_filter_queries: title:lowe'*
...
The field is defined as:
where the text type is:
Hi,
Thanks for the offers, I'll take deeper look into them.
In the solutions you suggested, if I understand correctly, the call for core
creation is done on the client side. I need the mechanism to work on the
server side.
I know it sounds stupid, but the client side shouldn't know which cores exist
or not; on the server side (maybe with a handler?) Solr should figure out that
the core is not created yet, and create it if needed.
Thanks, nizan
--
View this message in context:
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1883213.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hi,
I have a question about boosting.
I have the following fields in my schema.xml:
1. title
2. description
3. ISBN
etc
I want to boost the field title. I tried index time boosting but it did not
work. I also tried Query time boosting but with no luck.
Can someone help me on how to implement boosting on a specific field like
title?
Thanks,
Solr User
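As a hedged illustration of query-time boosting (field names and boost values below are only examples, not taken from the poster's schema), the dismax query parser lets you weight matches in title more heavily than matches in other fields without any index-time boost:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TitleBoostExample {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("bird");
            // dismax searches every field listed in qf; a match in title counts
            // nine times as much as a match in description when scoring.
            query.set("defType", "dismax");
            query.set("qf", "title^9 description");
            query.setFields("title", "score");

            QueryResponse response = solr.query(query);
            System.out.println(response.getResults());
        }
    }

The same parameters can also be set as defaults in solrconfig.xml or passed directly on the request URL.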
On Thu, Nov 11, 2010 at 10:26 AM, wrote:
> Hi! This is the ezmlm program. I'm managing the
> solr-user@lucene.apache.org mailing list.
>
> I'm working for my owner, who can be reached
> at solr-user-ow...@lucene.apache.org.
>
> Acknowledgment: I have added the address
>
> solr...@gmail.com
>
> to the solr-user mailing list.
>
> Welcome to solr-u...@lucene.apache.org!
>
> Please save this message so that you know the address you are
> subscribed under, in case you later want to unsubscribe or change your
> subscription address.
>
>
I am facing a weird issue with facet fields.
Within the config xml I have defined the fl as:
file_id folder_id display_name file_name priority_text content_type
last_upload upload_by business indexed
But my output xml doesn't contain the elements upload_by and business,
although I am able to search by upload_by: and business:.
Even when I add &fl=* to the URL I do not get this facet field in the
response.
Any idea what I am doing wrong?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Issue-with-facet-fields-tp1883106p1883106.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hi, nizan. I didn't realize that just replying to a thread from my email
client wouldn't get back to you. Here's some info on this thread since your
original post:
On Nov 10, 2010, at 12:30pm, Bob Sandiford wrote:
> Why not use replication? Call it inexperience...
>
> We're really early into working with and fully understanding Solr and
> the best way to approach various issues. I did mention that this was
> a prototype and non-production code, so I'm covered, though :)
>
> We'll take a look at the replication feature...
Replication doesn't replicate the top-level solr.xml file that defines
available cores, so if dynamic cores is a requirement then your custom code
isn't wasted :)
-- Ken
>> -Original Message-
>> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
>> Sent: Wednesday, November 10, 2010 3:26 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Dynamic creating of cores in solr
>>
>> You could use the actual built-in Solr replication feature to
>> accomplish that same function -- complete re-index to a 'master', and
>> then when finished, trigger replication to the 'slave', with the
>> 'slave' being the live index that actually serves your applications.
>>
>> I am curious if there was any reason you chose to roll your own
>> solution using SolrJ and dynamic creation of cores, instead of simply
>> using the replication feature. Were there any downsides of using the
>> replication feature for this purpose that you ameliorated through
>> your solution?
>>
>> Jonathan
>>
>> Bob Sandiford wrote:
>>> We also use SolrJ, and have a dynamically created Core capability -
>> where we don't know in advance what the Cores will be that we
>> require.
>>>
>>> We almost always do a complete index build, and if there's a
>>> previous
>> instance of that index, it needs to be available during a complete
>> index build, so we have two cores per index, and switch them as
>> required at the end of an indexing run.
>>>
>>> Here's a summary of how we do it (we're in an early prototype /
>> implementation right now - this isn't production quality code - as
>> you can tell from our voluminous javadocs on the methods...)
>>>
>>> 1) Identify if the core exists, and if not, create it:
>>>
>>> /**
>>>  * This method instantiates two SolrServer objects, solr and indexCore.
>>>  * It requires that indexName be set before calling.
>>>  */
>>> private void initSolrServer() throws IOException
>>> {
>>>     String baseUrl = "http://localhost:8983/solr/";
>>>     solr = new CommonsHttpSolrServer(baseUrl);
>>>
>>>     String indexCoreName = indexName + SolrConstants.SUFFIX_INDEX; // SUFFIX_INDEX = "_INDEX"
>>>     String indexCoreUrl = baseUrl + indexCoreName;
>>>
>>>     // Here we create two cores for the indexName, if they don't already exist - the live core used
>>>     // for searching and a second core used for indexing. After indexing, the two will be switched so the
>>>     // just-indexed core will become the live core. The way that core swapping works, the live core will
>>>     // always be named [indexName] and the indexing core will always be named [indexName]_INDEX, but the
>>>     // dataDir of each core will alternate between [indexName]_1 and [indexName]_2.
>>>     createCoreIfNeeded(indexName, indexName + "_1", solr);
>>>     createCoreIfNeeded(indexCoreName, indexName + "_2", solr);
>>>     indexCore = new CommonsHttpSolrServer(indexCoreUrl);
>>> }
>>>
>>> /**
>>>  * Create a core if it does not already exist. Returns true if a new core was created, false otherwise.
>>>  */
>>> private boolean createCoreIfNeeded(String coreName, String dataDir, SolrServer server) throws IOException
>>> {
>>>     boolean coreExists = true;
>>>     try
>>>     {
>>>         // SolrJ provides no direct method to check if a core exists, but getStatus will
>>>         // return an empty list for any core that doesn't.
>>>         CoreAdminResponse statusResponse = CoreAdminRequest.getStatus(coreName, server);
>>>         coreExists = statusResponse.getCoreStatus(coreName).size() > 0;
>>>         if (!coreExists)
>>>         {
>>>             // Create the core
>>>             LOG.info("Creating Solr core: " + coreName);
>>>             CoreAdminRequest.Create create = new CoreAdminRequest.Create();
>>>             create.setCoreName(coreName);
>>>             create.setInstanceDir(".");
>>>             create.setDataDir(dataDir);
>>>             create.process(server);
>>>         }
>>>     }
>>>     catch (SolrServerException e)
>>>     {
>>>         e.printStackTrace();
>>>     }
>>>     return !coreExists;
>>> }
>>>
>>>
>>> 2) Do the index, clearing it first if it's a complete rebuild:
>>>
>>> [snip]
>>>if (fullIndex)
>>>{
>>>try
>>>{
>>>indexCore.delete
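For completeness, a hedged sketch of the swap step the message above describes happening at the end of an indexing run. It assumes the same core naming as the code above and uses the CoreAdmin SWAP action through SolrJ; the helper name is made up:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;
    import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

    public class CoreSwapExample {
        /**
         * Swap the live core ([indexName]) with the indexing core
         * ([indexName]_INDEX) so the just-built index starts serving queries.
         * adminServer must point at the base /solr URL, not at a single core.
         */
        public static void swapCores(SolrServer adminServer, String indexName) throws Exception {
            CoreAdminRequest swap = new CoreAdminRequest();
            swap.setAction(CoreAdminAction.SWAP);
            swap.setCoreName(indexName);
            swap.setOtherCoreName(indexName + "_INDEX");
            swap.process(adminServer);
        }
    }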
@Jerry Li
What version of Solr were you using? And was there any
data in the new field? I have no problems here with a quick
test I ran on trunk...
Best
Erick
On Thu, Nov 11, 2010 at 1:37 AM, Jerry Li | 李宗杰 wrote:
> but if I use this field to do sorting, an error occurs and an
> ArrayIndexOutOfBounds exception is thrown.
>
> On Thursday, November 11, 2010, Robert Petersen wrote:
> > 1) Just put the new field in the schema and stop/start solr. Documents
> > in the index will not have the field until you reindex them but it won't
> > hurt anything.
> >
> > 2) Just turning off their handlers in solrconfig is all it takes, I
> > think.
> >
> > -Original Message-
> > From: gauravshetti [mailto:gaurav.she...@tcs.com]
> > Sent: Monday, November 08, 2010 5:21 AM
> > To: solr-user@lucene.apache.org
> > Subject: Adding new field after data is already indexed
> >
> >
> > Hi,
> >
> > I had a few questions regarding Solr.
> > Say my schema file looks like
> >
> >
> >
> > and I index data on the basis of these fields. Now, in case I need to add
> > a new field, is there a way I can add the field without corrupting the
> > previous data? Is there any feature which adds a new field with a default
> > value to the existing records?
> >
> >
> > 2) Is there any security mechanism/authorization check to restrict URLs
> > like /admin and /update to only a few users?
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Adding-new-field-after-data-is-alread
> > y-indexed-tp1862575p1862575.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> --
>
> Best Regards.
> Jerry. Li | 李宗杰
>
>
Hi,
I use Solr 1.3 with the patch for parsing rich documents, and when uploading,
for example, a PDF file, the only thing I see in solr.log is the following:
INFO: [] webapp=/solr path=/update/rich
params={id=250&stream.type=pdf&fieldnames=id,name&commit=true&stream.fieldname=body&name=iphone+user+guide+pdf+iphone_user_guide.pdf}
status=0 QTime=12656
solrconfig.xml contains the line:
class="solr.RichDocumentRequestHandler" startup="lazy" />
What else am I missing?
Since I am running Solr standalone, I do not need to build it with ant, do I?
Regards,
Nikola
--
Nikola Garafolic
SRCE, Sveucilisni racunski centar
tel: +385 1 6165 804
email: nikola.garafo...@srce.hr
Does anyone know what technology they are using: http://www.indextank.com/
Is it Lucene under the hood?
Thanks, and apologies for cross-posting.
-Glen
http://zzzoot.blogspot.com
--
-
Hello,
I'd like to use solr to index some documents coming from an rss feed,
like the example at [1], but it seems that the configuration used
there is just for a one-time indexing, trying to get all the articles
exposed in the rss feed of the website.
Is it possible to manage and index just the new articles coming from
the rss source?
I found that maybe the delta-import can be useful but, from what I understand,
the delta-import is used to just update the index with contents of
documents that have been modified since the last indexing:
this is obviously useful, but I'd like to index just the new articles
coming from an rss feed.
Is this something managed automatically by Solr, or do I have to deal with it
separately? Maybe a full-import with the &clean=false parameter?
Are there any solutions that you would suggest?
Maybe storing the article feeds in a table like [2] and have a module
that periodically sends each row to solr for indexing it?
Thanks,
Matteo
[1] http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example
[2] http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
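Related to the &clean=false idea above, a hedged SolrJ sketch (the /dataimport path and parameters assume the standard DataImportHandler registration from [1]) that triggers a full-import without wiping what is already in the index; previously indexed documents are simply overwritten by uniqueKey and new feed items are added:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class RssImportTrigger {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import");
            params.set("clean", "false");   // do not delete existing documents first
            params.set("commit", "true");

            QueryRequest request = new QueryRequest(params);
            request.setPath("/dataimport");  // the DIH handler path from solrconfig.xml
            System.out.println(request.process(solr));
        }
    }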
Hi,
I am trying to index documents (PDF, Doc, XLS, RTF) using the
ExtractingRequestHandler.
I am following the tutorial at
http://wiki.apache.org/solr/ExtractingRequestHandler
But when i run the following command
curl
"http://localhost:8983/solr/update/extract?literal.id=mydoc.doc&uprefix=attr_&fmap.content=attr_content"
-F "myfile=@/home/system/Documents/mydoc.doc"
i am getting the following error :
Error 500
HTTP ERROR 500: lazy loading error
org.apache.solr.common.SolrException: lazy loading error
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:249)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.extraction.ExtractingRequestHandler'
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
...21 more
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.handler.extraction.ExtractingRequestHandler not found in
java.net.URLClassLoader{urls=[], parent=contextloa...@null}
at java.net.URLClassLoader.findClass(libgcj.so.90)
at java.lang.ClassLoader.loadClass(libgcj.so.90)
at java.lang.ClassLoader.loadClass(libgcj.so.90)
at java.lang.Class.forName(libgcj.so.90)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
...24 more
RequestURI=/solr/update/extract
Powered by Jetty://
I am running Debian Lenny and java version "1.6.0_22".
I am running apache-solr-1.4.1 and running it from the examples directory.
Please point me in the right direction and help me solve the problem.
--
---
Regards,
Kaustuv Royburman
Senior Software Developer
infoservices.in
DLF IT Park,
Rajarhat, 1st Floor, Tower - 3
Major Arterial Road,
Kolkata - 700156,
India
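The ClassNotFoundException above indicates the ExtractingRequestHandler class is not on Solr's classpath, so this is a jar/lib problem rather than an API one; once the handler loads, the same request can also be issued from SolrJ. A hedged sketch against the Solr 1.4 SolrJ API, mirroring the parameters in the curl command above:

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractUploadExample {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/home/system/Documents/mydoc.doc"));
            // Same parameters as the curl command: a literal id, a prefix for
            // unmapped fields, and a mapping of the extracted body field.
            req.setParam("literal.id", "mydoc.doc");
            req.setParam("uprefix", "attr_");
            req.setParam("fmap.content", "attr_content");
            req.setParam("commit", "true");

            System.out.println(solr.request(req));
        }
    }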
Hi! Sorry for such a break, but I was moving house... anyway:
1. I took the
~/apache-solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java
file and modified it (saved as StempelFilterFactory.java) in Vim like
this:
package org.getopt.solr.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;
import org.getopt.stempel.lucene.StempelFilter; // the filter and base factory classes used below
public class StempelTokenFilterFactory extends BaseTokenFilterFactory {
public StempelFilter create(TokenStream input) {
return new StempelFilter(input);
}
}
2. Then I put the file to the extracted stempel-1.0.jar in
./org/getopt/solr/analysis/
3. Then I created a class from it: jar -cf
StempelTokenFilterFactory.class StempelFilterFactory.java
4. Then I created new stempel-1.0.jar archive: jar -cf stempel-1.0.jar
-C ./stempel-1.0/ .
5. Then in schema.xml I've put:
6. I started the Solr server and I received the following error:
2010-11-11 11:50:56 org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassFormatError: Incompatible magic value
1347093252 in class file
org/getopt/solr/analysis/StempelTokenFilterFactory
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
...
Question: What is wrong? :) I use "jar (fastjar) 0.98" to create jars; I
googled that error but no answer gave me an idea of what is wrong in my
.java file.
Please help, as I believe I am close to the end of that subject.
Cheers,
Jakub Godawa.
2010/11/3 Lance Norskog :
> Here's the problem: Solr is a little dumb about these Filter classes,
> and so you have to make a Factory object for the Stempel Filter.
>
> There are a lot of other FilterFactory classes. You would have to just
> copy one and change the names to Stempel and it might actually work.
>
> This will take some Solr programming- perhaps the author can help you?
>
> On Tue, Nov 2, 2010 at 7:08 AM, Jakub Godawa wrote:
>> Sorry, I am not Java programmer at all. I would appreciate more
>> verbose (or step by step) help.
>>
>> 2010/11/2 Bernd Fehling :
>>>
>>> So you call org.getopt.solr.analysis.StempelTokenFilterFactory.
>>> In this case I would assume a file StempelTokenFilterFactory.class
>>> in your directory org/getopt/solr/analysis/.
>>>
>>> And a class which extends the BaseTokenFilterFactory, right?
>>> ...
>>> public class StempelTokenFilterFactory extends BaseTokenFilterFactory
>>> implements ResourceLoaderAware {
>>> ...
>>>
>>>
>>>
>>> Am 02.11.2010 14:20, schrieb Jakub Godawa:
This is what stempel-1.0.jar consists of after jar -xf:
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R org/
org/:
egothor getopt
org/egothor:
stemmer
org/egothor/stemmer:
Cell.class Diff.class Gener.class MultiTrie2.class
Optimizer2.class Reduce.class Row.class TestAll.class
TestLoad.class Trie$StrEnum.class
Compile.class DiffIt.class Lift.class MultiTrie.class
Optimizer.class Reduce$Remap.class Stock.class Test.class
Trie.class
org/getopt:
stempel
org/getopt/stempel:
Benchmark.class lucene Stemmer.class
org/getopt/stempel/lucene:
StempelAnalyzer.class StempelFilter.class
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R META-INF/
META-INF/:
MANIFEST.MF
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R res
res:
tables
res/tables:
readme.txt stemmer_1000.out stemmer_100.out stemmer_2000.out
stemmer_200.out stemmer_500.out stemmer_700.out
2010/11/2 Bernd Fehling :
> Hi Jakub,
>
> if you unzip your stempel-1.0.jar do you have the
> required directory structure and file in there?
> org/getopt/stempel/lucene/StempelFilter.class
>
> Regards,
> Bernd
>
> Am 02.11.2010 13:54, schrieb Jakub Godawa:
>> Erick I've put the jar files like that before. I also added the
>> directive and put the file in instanceDir/lib
>>
>> What is still a problem is that even the files are loaded:
>> 2010-11-02 13:20:48 org.apache.solr.core.SolrResourceLoader
>> replaceClassLoader
>> INFO: Adding
>> 'file:/home/jgodawa/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar'
>> to classloader
>>
>> I am not able to use the FilterFactory... maybe I am attempting it in
>> a wrong way?
>>
>> Cheers,
>> Jakub Godawa.
>>
>> 2010/11/2 Erick Erickson :
>>> The Polish stemmer jar file needs to be findable by Solr; if you copy
>>> it to /lib and restart Solr you should be set.
>>>
>>> Alternatively, you can add another directive to the solrconfig.xml
>>> file
>>> (there are several examples in that file already).
>>>
>>> I'm a little confused about not being able to find TokenFilter, is that
Jonathan,
thanks for your statement. In fact, you are quite right: a lot of people have
developed great caching mechanisms.
However, the solution I had in mind was something like an HTTP cache, in most
cases on the same box.
I talked to some experts who told me that Squid would be a relatively large
monster, since we only want it for HTTP caching.
Do you know of any benchmarks on responses per second when most of the queried
data is in the cache?
Regards
--
View this message in context:
http://lucene.472066.n3.nabble.com/To-cache-or-to-not-cache-tp1875289p1881714.html
Sent from the Solr - User mailing list archive at Nabble.com.
Does anyone have any idea how to do this?
--
View this message in context:
http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1881374.html
Sent from the Solr - User mailing list archive at Nabble.com.
Hi,
Has anyone gotten Solr to schedule data imports at a certain time interval
through configuration alone?
I tried setting interval=1, which should import every minute, but I don't see
it happening.
I'm trying to avoid cron jobs.
Thanks,
Tri