Re: Displaying highlights in formatted HTML document

2011-06-08 Thread Ahmet Arslan


--- On Thu, 6/9/11, Bryan Loofbourrow  wrote:

> From: Bryan Loofbourrow 
> Subject: Displaying highlights in formatted HTML document
> To: solr-user@lucene.apache.org
> Date: Thursday, June 9, 2011, 2:14 AM
> Here is my use case:
> 
> 
> 
> I have a large number of HTML documents, sizes in the
> 0.5K-50M range, most
> around, say, 10M.
> 
> 
> 
> I want to be able to present the user with the formatted
> HTML document, with
> the hits tagged, so that he may iterate through them, and
> see them in the
> context of the document, with the document looking as it
> would be presented
> by a browser; that is, fully formatted, with its tables and
> italics and font
> sizes and all.
> 
> 
> 
> This is something that the user would explicitly request
> from within a set
> of search results, not something I’d expect to have
> returned from an initial
> search – the initial search merely returns the snippets
> around the hits. But
> if the user wants to dive into one of the returned results
> and see them in
> context, I need to be able to go get that.
> 
> 
> 
> We are currently solving this problem by using an entirely
> separate search
> engine (dtSearch), which performs the tagging of the hits
> in the HTML just
> fine. But the solution is unsatisfactory because there are
> Solr searches
> that dtSearch’s capabilities cannot reasonably match.
> 
> 
> 
> Can anyone suggest a good way to use Solr/Lucene for this
> instead? I’m
> thinking a separate core for this purpose might make sense,
> so as not to
> burden the primary search core with the full contents of
> the document. But
> after that, I’m stuck. How can I get Solr to express the
> highlighting in the
> context of the formatted HTML document?
> 
> 
> 
> If Solr does not do this currently, and anyone can suggest
> ways to add the
> feature, any tips on how this might best be incorporated
> into the
> implementation would be welcome.

I am doing the same thing (Solr trunk) using the following field type:
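(The XML was stripped in the archive; a minimal sketch of a field type for this purpose, assuming solr.HTMLStripCharFilterFactory to remove markup before tokenization; names and analyzer choices are guesses, not necessarily the original definition:

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip HTML markup before tokenizing; token offsets still point into the raw markup,
         which is what lets the highlighter tag hits inside the formatted document -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
)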
In your separate core - which is queried when the user wants to dive into
one of the returned results - feed your HTML files into this field.

You may want to increase the maximum analyzed characters too, e.g.:

hl.maxAnalyzedChars=2147483647


Code for getting distinct facet counts across shards (Distributed Process).

2011-06-08 Thread rajini maski
In Solr 1.4.1, for getting the "distinct facet terms count" across shards,
the piece of code added to the distributed process is as follows:





Class: FacetComponent.java

Function: finishStage(ResponseBuilder rb)



  for (DistribFieldFacet dff : fi.facets.values()) {
    // ... just after this line of code:
    else { // TODO: log error or throw exception?
      counts = dff.getLexSorted();

      int namedistint = 0;
      namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
          FacetParams.FACET_NAMEDISTINCT, 0);

      if (namedistint == 0)
        facet_fields.add(dff.getKey(), fieldCounts);

      if (namedistint == 1)
        facet_fields.add("numfacetTerms", counts.length);

      if (namedistint == 2) {
        NamedList resCount = new NamedList();
        resCount.add("numfacetTerms", counts.length);
        resCount.add("counts", fieldCounts);
        facet_fields.add(dff.getKey(), resCount);
      }




Is this flow correct? I have tested it with a few test cases and it has worked
fine, but I want to know if there are any bugs that could creep in here. (My
concern is that this piece of code should not affect the rest of the logic.)
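(For reference, a request exercising this patch could look like the following; the parameter name comes from SOLR-2242, while the field and shard values are hypothetical:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&f.category.facet.numFacetTerms=2&shards=localhost:8983/solr,localhost:7574/solr
)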




*Code flow with comments for reference:*


 Function: finishStage(ResponseBuilder rb)

  // in this for loop,
  for (DistribFieldFacet dff : fi.facets.values()) {

    // just after this line of code:
    else { // TODO: log error or throw exception?
      counts = dff.getLexSorted();

      int namedistint = 0;  // default

      // get the value of facet.numFacetTerms from the input query
      namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
          FacetParams.FACET_NAMEDISTINCT, 0);

      // based on the value of facet.numFacetTerms (0, 1 or 2), the conditions below:

      // get only the facet field counts
      if (namedistint == 0) {
        facet_fields.add(dff.getKey(), fieldCounts);
      }

      // get only the distinct facet term count
      if (namedistint == 1) {
        facet_fields.add("numfacetTerms", counts.length);
      }

      // get the facet field counts and the distinct term count
      if (namedistint == 2) {
        NamedList resCount = new NamedList();
        resCount.add("numfacetTerms", counts.length);
        resCount.add("counts", fieldCounts);
        facet_fields.add(dff.getKey(), resCount);
      }





Regards,

Rajani





On Fri, May 27, 2011 at 1:14 PM, rajini maski  wrote:

>  No such issues. Successfully integrated with 1.4.1 and it works across a
> single index.
>
> For the f.2.facet.numFacetTerms=1 parameter it will give the distinct count
> result.
>
> For the f.2.facet.numFacetTerms=2 parameter it will give the counts as well
> as the results for the facets.
>
> But this is working only across a single index, not the distributed process.
> The conditions you have added in SimpleFacets.java - "if namedistinct count
> == int" (0, 1 and 2 conditions)... Should it be added in the distributed
> process function to enable it to work across shards?
>
> Rajani
>
>
>
> On Fri, May 27, 2011 at 12:33 PM, Bill Bell  wrote:
>
>> I am pretty sure it does not yet support distributed shards..
>>
>> But the patch was written for 4.0... So there might be issues with running
>> it on 1.4.1.
>>
>> On 5/26/11 11:08 PM, "rajini maski"  wrote:
>>
>> > The patch SOLR-2242 for getting the count of distinct facet terms doesn't
>> > work for distributedProcess
>> > (https://issues.apache.org/jira/browse/SOLR-2242)
>> >
>> > The error log says:
>> >
>> > HTTP ERROR 500
>> > Problem accessing /solr/select. Reason:
>> >
>> > For input string: "numFacetTerms"
>> >
>> > java.lang.NumberFormatException: For input string: "numFacetTerms"
>> > at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>> > at java.lang.Long.parseLong(Long.java:403)
>> > at java.lang.Long.parseLong(Long.java:461)
>> > at org.apache.solr.schema.TrieField.readableToIndexed(TrieField.java:331)
>> > at org.apache.solr.schema.TrieField.toInternal(TrieField.java:344)
>> > at org.apache.solr.handler.component.FacetComponent$DistribFieldFacet.add(FacetComponent.java:619)
>> > at org.apache.solr.handler.component.FacetComponent.countFacets(FacetComponent.java:265)
>> > at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:235)
>> > at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
>> > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>> > at org.apache.solr.servlet.Solr


Re: Tokenising based on known words?

2011-06-08 Thread Gora Mohanty
On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel  wrote:
> Not sure if this is possible, but figured I would ask the question.
>
> Basically, we have some users who do some pretty ridiculous things ;o)
>
> Rather than writing "red jacket", they write "redjacket", which obviously
> returns no results.
[...]

Have you tried using synonyms,
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
It seems like they should fit your use case.
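(To illustrate, not from the original reply: a synonyms.txt line such as

redjacket => red jacket

wired into the field's analyzer with

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

would rewrite the concatenated form into the two-word phrase at analysis time.)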

Regards,
Gora


Multiple Values not getting Indexed

2011-06-08 Thread Pawan Darira
Hi

I am trying to index 2 fields with multiple values. But it is only putting
1 value in each and ignoring the rest of the values after the commas. I am
fetching the query through DIH. It works fine if I have only 1 value in each
of the 2 fields.

E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
& Field2 - Chandigarh,Gurgaon,New
Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others

*Schema.xml*

p.s. I tried multiValued="true" but it was of no help.
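(Not part of the original message: the usual way to split a comma-separated column into multiple values in DIH is the RegexTransformer's splitBy attribute; a sketch with hypothetical entity and column names:

<entity name="item" transformer="RegexTransformer" query="SELECT field1, field2 FROM items">
  <field column="field1" splitBy=","/>
  <field column="field2" splitBy=","/>
</entity>

combined with multiValued="true" on the corresponding schema fields.)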

-- 
Thanks,
Pawan Darira


Re: tika integration exception and other related queries

2011-06-08 Thread Naveen Gupta
Hi Gary

It started working. Though I did not test it for Zip files, for rar files it
is working fine.

The only thing I wanted to do is to index the metadata (text mapped to
content), not store the data. Also, in the search results I want to filter
the stuff, and that started working fine. I don't want to show the content
to the end user, since the way it extracts the information is not very
helpful to the user. Although we can apply a few of the analyzers and
filters to remove the unnecessary tags, the information would still not be
of much help. Looking for your opinion: what did you do in order to filter
out the content, or are you showing the extracted content to the end user?

Even in the case where we are showing the text part to the end user, how can
I limit the number of characters while querying the search results? Is there
any feature where we can achieve this, the concept of a snippet kind of
thing?

Thanks
Naveen

On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor  wrote:

> Naveen,
>
> For indexing Zip files with Tika, take a look at the following thread :
>
>
> http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html
>
> I got it to work with the 3.1 source and a couple of patches.
>
> Hope this helps.
>
> Regards,
> Gary.
>
>
>
> On 08/06/2011 04:12, Naveen Gupta wrote:
>
>> Hi Can somebody answer this ...
>>
>> 3. can somebody tell me an idea how to do indexing for a zip file ?
>>
>> 1. while sending docx, we are getting following error.
>>
>
>


Displaying highlights in formatted HTML document

2011-06-08 Thread Bryan Loofbourrow
Here is my use case:



I have a large number of HTML documents, sizes in the 0.5K-50M range, most
around, say, 10M.



I want to be able to present the user with the formatted HTML document, with
the hits tagged, so that he may iterate through them, and see them in the
context of the document, with the document looking as it would be presented
by a browser; that is, fully formatted, with its tables and italics and font
sizes and all.



This is something that the user would explicitly request from within a set
of search results, not something I’d expect to have returned from an initial
search – the initial search merely returns the snippets around the hits. But
if the user wants to dive into one of the returned results and see them in
context, I need to be able to go get that.



We are currently solving this problem by using an entirely separate search
engine (dtSearch), which performs the tagging of the hits in the HTML just
fine. But the solution is unsatisfactory because there are Solr searches
that dtSearch’s capabilities cannot reasonably match.



Can anyone suggest a good way to use Solr/Lucene for this instead? I’m
thinking a separate core for this purpose might make sense, so as not to
burden the primary search core with the full contents of the document. But
after that, I’m stuck. How can I get Solr to express the highlighting in the
context of the formatted HTML document?



If Solr does not do this currently, and anyone can suggest ways to add the
feature, any tips on how this might best be incorporated into the
implementation would be welcome.



Thanks,



-- Bryan


Re: FilterQuery and Ors

2011-06-08 Thread Erick Erickson
try fq=age:[1 TO 10] OR age:[10 TO 20]

I'm pretty sure

fq=age:([1 TO 10] OR [10 TO 20])

will work too.

But you're right, multiple fq clauses are intersections, so specifying more
than one fq clause on the SAME field results in what you're seeing.

Best
Erick
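(Putting that together with the tag/exclusion Jamie already has, the multi-select request might look like this; field and tag names are made up:

q=*:*&fq={!tag=ageTag}age:([1 TO 10] OR [10 TO 20])&facet=true&facet.field={!ex=ageTag}age

The OR keeps documents matching either range, while ex=ageTag lets the age facet counts ignore that filter.)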

On Wed, Jun 8, 2011 at 5:34 PM, Jamie Johnson  wrote:
> I'm looking for a way to do a filter query and Ors.  I've done a bit of
> googling and found an open jira but nothing indicating this is possible.
> I'm looking to do something like the search at
> http://www.lucidimagination.com/search/?q=test
> where you can do multi selects for the facets.  I've read about it at
> http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParamsso
> I have the tag/exclusion working but if I select two items from a
> facet
> group (say age from 1 to 10 and age from 10 to 20) I get nothing because
> nothing meets both of those criteria.  I can obviously write something
> custom to build an OR out of this but that seems less elegant.  Any guidance
> would be appreciated
>


Tokenising based on known words?

2011-06-08 Thread Mark Mandel
Not sure if this is possible, but figured I would ask the question.

Basically, we have some users who do some pretty ridiculous things ;o)

Rather than writing "red jacket", they write "redjacket", which obviously
returns no results.

Is there any way, with Solr, to go hunting for known words (maybe if there
are no results) within the word set? Or even tokenise based on known words in
the index?

Last time I played with spell check suggestions, it didn't seem to handle
this very well,  but I've yet to try it again on 3.2.0 (just upgraded from
1.4.1).

Any help/thoughts appreciated, as they do this all the time.

Mark

-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


FilterQuery and Ors

2011-06-08 Thread Jamie Johnson
I'm looking for a way to do a filter query and Ors.  I've done a bit of
googling and found an open jira but nothing indicating this is possible.
I'm looking to do something like the search at
http://www.lucidimagination.com/search/?q=test
where you can do multi selects for the facets.  I've read about it at
http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParamsso
I have the tag/exclusion working but if I select two items from a
facet
group (say age from 1 to 10 and age from 10 to 20) I get nothing because
nothing meets both of those criteria.  I can obviously write something
custom to build an OR out of this but that seems less elegant.  Any guidance
would be appreciated


RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
Hi Erick,

Thanks for asking, yes we have termVectors=true set:



I guess I should also mention that highlighting works fine using the
FastVectorHighlighter as long as we don't do a MultiTerm query. For example,
see the query and results appended below (using the same hl parameters listed
in the previous email).


Tom

ocr:tinkham



 John {lt:}b style="background:#00"{gt:}Tinkham{lt:}/b{gt:}, who
married Miss Mallie Kingsbury; Mr. William Ash-
ley, and Mr. Leavitt, who, I believe, built the big
stone house, now left high and dry by itself, on
the top of Lyon street hill. As 





-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, June 08, 2011 4:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

Just to check, does the field have termVectors="true" set?
I think it's required for FVH to work.
Best
Erick



Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Erick Erickson
Just to check, does the field have termVectors="true" set?
I think it's required for FVH to work.
Best
Erick

On Wed, Jun 8, 2011 at 3:24 PM, Burton-West, Tom  wrote:
> We are trying to implement highlighting for wildcard (MultiTerm) queries.  
> This seems to work fine with the regular highlighter but when we try to use 
> the fastVectorHighlighter we don't see any results in the  highlighting 
> section of the response.  Appended below are the parameters we are using.
>
> Tom Burton-West
>
> query
> ocr:tink*
> highlighting params:
>
> true
> 200
> true
> 200
> colored
> simple
> ocr
> true
> true
>
>


Re: wildcard search

2011-06-08 Thread Ahmet Arslan

> > I don't use it myself  (but I will soon), so I
> may be wrong, but did you try
> > to use the ComplexPhraseQueryParser :
> > 
> > ComplexPhraseQueryParser
> >          QueryParser which
> permits complex phrase query syntax eg "(john
> > jon jonathan~) peters*".
> > 
> > It seems that you could do such type of queries :
> > 
> > GOK:"IA 38*"
> 
> yes that sounds interesting.
> But I don't know how to get and install it into Solr. Can
> you give me a hint?

https://issues.apache.org/jira/browse/SOLR-1604

But it seems that you can achieve what you want with vanilla solr.

I don't follow the multivalued part in your example, but you can tokenize
"IA 300; IC 330; IA 317; IA 318" into these 4 tokens:

IA 300
IC 330
IA 317
IA 318

using PatternTokenizerFactory. And you can use the PrefixQParserPlugin for
searching:

http://lucene.apache.org/solr/api/org/apache/solr/search/PrefixQParserPlugin.html
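(A sketch of that combination; the pattern and type name are assumptions:

<fieldType name="gok_codes" class="solr.TextField">
  <analyzer>
    <!-- one token per semicolon-separated entry, surrounding whitespace trimmed -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*;\s*"/>
  </analyzer>
</fieldType>

and then a prefix search that bypasses the regular query parser's whitespace handling:

q={!prefix f=GOK}IA 31
)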





Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
We are trying to implement highlighting for wildcard (MultiTerm) queries.  This 
seems to work fine with the regular highlighter but when we try to use the 
fastVectorHighlighter we don't see any results in the  highlighting section of 
the response.  Appended below are the parameters we are using.

Tom Burton-West

query
ocr:tink*
highlighting params:

true
200
true
200
colored
simple
ocr
true
true



RE: huge shards (300GB each) and load balancing

2011-06-08 Thread Burton-West, Tom
Hi Dmitry,

I am assuming you are splitting one very large index over multiple shards
rather than replicating an index multiple times.

Just for a point of comparison, I thought I would describe our experience with 
large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.  This is 
split over 4 machines with 3 shards per machine and our shards are about 
400-500GB.  We get average response times of around 200 ms with the 99th 
percentile queries up around 1-2 seconds. We have a very low qps rate, i.e. 
less than 1 qps.  We also index offline on a separate machine and update the 
indexes nightly.

Some of the issues we have found with very large shards are:
1) Because of the very large shard size, I/O tends to be the bottleneck, with
phrase queries containing common words being the slowest.
2) Because of the I/O issues, running cache-warming queries to get postings
into the OS disk cache is important, as is leaving significant free memory
for the OS to use for disk caching.
3) Because of the I/O issues, using stop words or CommonGrams produces a
significant performance increase.
4) We have a huge number of unique terms in our indexes. In order to reduce
the amount of memory needed by the in-memory terms index, we set the
termInfosIndexDivisor to 8, which causes Solr to only load every 8th term
from the tii file into memory. This reduced memory use from over 18GB to
below 3GB and got rid of 30-second stop-the-world Java garbage collections.
(See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
for details.) We later ran into memory problems when indexing, so we instead
changed the index-time parameter termIndexInterval from 128 to 1024.
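(A sketch of where those two settings live in a 1.4/3.x-era solrconfig.xml; the element names are from memory of the example config, so verify against your version:

<!-- reader side: load every 8th term of the .tii file into memory -->
<indexReaderFactory name="IndexReaderFactory" class="solr.StandardIndexReaderFactory">
  <int name="setTermIndexDivisor">8</int>
</indexReaderFactory>

<!-- index side: write a term-index entry every 1024 terms instead of every 128 -->
<indexDefaults>
  <termIndexInterval>1024</termIndexInterval>
</indexDefaults>
)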

(More details here: http://www.hathitrust.org/blogs/large-scale-search)

Tom Burton-West



Re: wildcard search

2011-06-08 Thread Thomas Fischer
Hi Ludovic,


> I don't use it myself  (but I will soon), so I may be wrong, but did you try
> to use the ComplexPhraseQueryParser :
> 
> ComplexPhraseQueryParser
>  QueryParser which permits complex phrase query syntax eg "(john
> jon jonathan~) peters*".
> 
> It seems that you could do such type of queries :
> 
> GOK:"IA 38*"

yes that sounds interesting.
But I don't know how to get and install it into Solr. Can you give me a hint?

Thanks
Thomas

Re: solr 3.1 java.lang.NoClassDefFoundError org/carrot2/core/ControllerFactory

2011-06-08 Thread Stanislaw Osinski
Hi Bryan,

You'll also need to make sure that your
${solr.dir}/contrib/clustering/lib directory is in the classpath; that
directory contains the Carrot2 JARs that provide the classes you're missing.
I think the example solrconfig.xml has the relevant <lib> declarations.

Cheers,

S.
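(A sketch of such declarations; the relative paths depend on where Solr runs from, so treat them as assumptions:

<lib dir="../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../dist/" regex="apache-solr-clustering-.*\.jar" />
)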

On Tue, Jun 7, 2011 at 13:48, bryan rasmussen wrote:

> As per the subject I am getting java.lang.NoClassDefFoundError
> org/carrot2/core/ControllerFactory
> when I try to run clustering.
>
> I am using Solr 3.1:
>
> I get the following error:
>
> java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
>     at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:74)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>     at java.lang.reflect.Constructor.newInstance(Unknown Source)
>     at java.lang.Class.newInstance0(Unknown Source)
>     at java.lang.Class.newInstance(Unknown Source)
>     at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:412)
>     at org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:203)
>     at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:522)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:594)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:458)
>     at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
>     at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
>     at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
>     at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
>     at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
>     at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
>     at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
>     at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>     at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>     at org.mortbay.jetty.Server.doStart(Server.java:224)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.mortbay.start.Main.invokeMain(Main.java:194)
>     at org.mortbay.start.Main.start(Main.java:534)
>     at org.mortbay.start.Main.start(Main.java:441)
>     at org.mortbay.start.Main.main(Main.java:119)
> Caused by: java.lang.ClassNotFoundException: org.carrot2.core.ControllerFactory
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(Unknown Source)
>     at java.lang.ClassLoader.loadClass(Unknown Source)
>     at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>
> using the following configuration
>
>
>   class="org.apache.solr.handler.clustering.ClusteringComponent"
> name="clustering">
>  
>default
> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm
>
>
>20
>  
> 
>   class="org.apache.solr.handler.component.SearchHandler">
>
>  explicit
>
>
>title
>all_text
>all_text title
> 
> 150
>  
>  
>clustering
>  
> 
>
>
>
> with the following command  to start solr
> java -Dsolr.clustering.enabled=true
> -Dsolr.solr.home="C:\projects\solrexample\solr" -jar start.jar
>
> Any idea as to why crusty is not working?
>
> Thanks,
> Bryan Rasmussen
>


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hi, Bill. Thanks, always nice to have options!

Dmitry

On Wed, Jun 8, 2011 at 4:47 PM, Bill Bell  wrote:

> Re Amazon elb.
>
> This is not exactly true. The ELB does load balance internal IPs. But the
> ELB IP address must be external. Still a major issue unless you use
> authentication. Nginx and others can also do load balancing.
>
> Bill Bell
> Sent from mobile
>
>
> On Jun 8, 2011, at 3:32 AM, "Upayavira"  wrote:
>
> >
> >
> > On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
> > wrote:
> >> Hello list,
> >>
> >> Thanks for attending to my previous questions so far, have learnt a lot.
> >> Here is another one, I hope it will be interesting to answer.
> >>
> >>
> >>
> >> We run our SOLR shards and front end SOLR on the Amazon high-end
> >> machines.
> >> Currently we have 6 shards with around 200GB in each. Currently we have
> >> only
> >> one front end SOLR which, given a client query, redirects it to all the
> >> shards. Our shards are constantly growing, data is at times reindexed
> (in
> >> batches, which is done by removing a decent chunk before replacing it
> >> with
> >> updated data), constant stream of new data is coming every hour (usually
> >> hits the latest shard in time, but can also hit other shards, which have
> >> older data). Since the front end SOLR has started to be a SPOF, we are
> >> thinking about setting up some sort of load balancer.
> >>
> >> 1) do you think ELB from Amazon is a good solution for starters? We
> don't
> >> need to maintain sessions between SOLR and client.
> >> 2) What other load balancers have been used specifically with SOLR?
> >>
> >>
> >> Overall: does SOLR scale to such size (200GB in an index) and what can
> be
> >> recommended as next step -- resharding (cutting existing shards to
> >> smaller
> >> chunks), replication?
> >
> > Really, it is going to be up to you to work out what works in your
> > situation. You may be reaching the limit of what a Lucene index can
> > handle, don't know. If your query traffic is low, you might find that
> > two 100Gb cores in a single instance performs better. But then, maybe
> > not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
> > :-)
> >
> > The principal issue with Amazon's load balancers (at least when I was
> > using them last year) is that the ports that they balance need to be
> > public. You can't use an Amazon load balancer as an internal service
> > within a security group. For a service such as Solr, that can be a bit
> > of a killer.
> >
> > If they've fixed that issue, then they'd work fine (I used them quite
> > happily in another scenario).
> >
> > When looking at resolving single points of failure, handling search is
> > pretty easy (as you say, stateless load balancer). You will need to give
> > more attention though to how you handle it regarding indexing.
> >
> > Hope that helps a bit!
> >
> > Upayavira
> >
> >
> >
> >
> >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
>



-- 
Regards,

Dmitry Kan


Re: Sorting on solr.TextField

2011-06-08 Thread Yonik Seeley
On Wed, Jun 8, 2011 at 1:21 PM, Jamie Johnson  wrote:
> Thanks exactly what I was looking for.
>
> With this new field used just for sorting is there a way to have it be case
> insensitive?

From the example schema:
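(The stripped snippet was presumably the example schema's alphaOnlySort type; a sketch of that definition, from memory, so verify against your schema.xml:

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as a single lowercased, trimmed token: sorting becomes case-insensitive -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Use copyField to copy the text field into a field of this type and sort on that.)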
-Yonik
http://www.lucidimagination.com


Re: Sorting on solr.TextField

2011-06-08 Thread Jamie Johnson
Thanks exactly what I was looking for.

With this new field used just for sorting is there a way to have it be case
insensitive?

On Wed, Jun 8, 2011 at 12:50 PM, Ahmet Arslan  wrote:

> > Is there any documentation which
> > details sorting behaviors on the different
> > types of solr fields?  My question is specifically
> > about solr.TextField but
> > I'd just like to know in general at this point.
>
>
> http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F
>


Re: Sorting on solr.TextField

2011-06-08 Thread Ahmet Arslan
> Is there any documentation which
> details sorting behaviors on the different
> types of solr fields?  My question is specifically
> about solr.TextField but
> I'd just like to know in general at this point. 

http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F


Sorting on solr.TextField

2011-06-08 Thread Jamie Johnson
Is there any documentation which details sorting behaviors on the different
types of solr fields?  My question is specifically about solr.TextField but
I'd just like to know in general at this point.  Currently when executing a
query and I say to sort on a text field I am getting results as follows:


Beth Cross
Beth Cross
Caroline Cross
Arlene Cross
Calvin Cross
Brett Cross
Brandon Cross
Beth Cross
Beth Cross
Caroline Cross

where I would have expected

Arlene Cross
Beth Cross
Beth Cross
Beth Cross
Beth Cross
Brandon Cross
Brett Cross
Calvin Cross
Caroline Cross
Caroline Cross


Re: KeywordTokenizerFactory and stopwords

2011-06-08 Thread Matt Mitchell
Hi Erik. Yes something like what you describe would do the trick. I
did find this:

http://lucene.472066.n3.nabble.com/Concatenate-multiple-tokens-into-one-td1879611.html

I might try the pattern replace filter with stopwords, even though
that feels kinda clunky.

Matt

On Wed, Jun 8, 2011 at 11:04 AM, Erik Hatcher  wrote:
> This seems like it deserves some kind of "collecting" TokenFilter(Factory) 
> that will slurp up all incoming tokens and glue them together with a space 
> (and allow separator to be configurable).   Hmmm surprised one of those 
> doesn't already exist.  With something like that you could have a standard 
> tokenization chain, and put it all back together at the end.
>
>        Erik
>
> On Jun 8, 2011, at 10:59 , Matt Mitchell wrote:
>
>> Hi,
>>
>> I have an "autocomplete" fieldType that works really well, but because
>> the KeywordTokenizerFactory (if I understand correctly) is emitting a
>> single token, the stopword filter will not detect any stopwords.
>> Anyone know of a way to strip out stopwords when using
>> KeywordTokenizerFactory? I did try the reg-exp replace filter, but I'm
>> not sure I want to add a bunch of reg-exps for replacing every
>> stopword.
>>
>> Thanks,
>> Matt
>>
>> Here's the fieldType definition:
>>
>
>


Re: solr index losing entries

2011-06-08 Thread Marius Hanganu
We have an API built in 2007 which at the lowest level submits requests with
<add>. We haven't changed anything in the API, and it worked well until the
beginning of this year.

Unique key is solr_id with this definition: 

The number of documents is determined using this HTTP request:
http://server/app_name/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on

Thanks,
Marius

2011/6/8 Tomás Fernández Löbbe 

> That's strange. How do you add documents to Solr? What do you have as the
> primary key?
> How do you determine the number of documents in the index?
>
> The value of "maxDoc" of the stats page considers deleted documents too,
> which are eliminated at merging.
>
> On Wed, Jun 8, 2011 at 12:18 PM, Marius Hanganu 
> wrote:
>
> > Hello,
> >
> > We've been using for 1.5 years now solr 1.4 for one of the indexes in our
> > application with a special configuration with maxDocs=1 and maxTime=1.
> The
> > number of documents is 10.000, with index size around 10MB.
> >
> > For a few months now, SOLR has this strange behavior. Our code did not
> > change, however, documents started disappearing from the index. And it's
> > decreasing constantly, at various speeds.
> >
> > Our first modification was to raise maxTime to 30sec, which seemed to fix
> > the problem for some time. After a few weeks, the problem started showing
> > again, so we've upgraded to SOLR 3.1.
> >
> > This upgrade did not fix the problem either. Our last try was with
> > maxDocs=500 and maxTime=60sec. After a complete reindex, SOLR shows all
> > 10081 documents, but after a few minutes, it suddenly goes down to 10078,
> > after a few hours to 10076, and it's stabilizing around this number.
> >
> > We have another SOLR index with ~400.000 objects, maxDocs=1000 and
> > maxTime=3
> > minutes and it never showed this problem.
> >
> > Do you have any idea why this is happening? Or how we can identify the
> > problem?
> >
> > Thanks,
> > Marius
> >
>


Re: solr index losing entries

2011-06-08 Thread Tomás Fernández Löbbe
That's strange. How do you add documents to Solr? What do you have as the
primary key?
How do you determine the number of documents in the index?

The value of "maxDoc" of the stats page considers deleted documents too,
which are eliminated at merging.

On Wed, Jun 8, 2011 at 12:18 PM, Marius Hanganu  wrote:

> Hello,
>
> We've been using for 1.5 years now solr 1.4 for one of the indexes in our
> application with a special configuration with maxDocs=1 and maxTime=1. The
> number of documents is 10.000, with index size around 10MB.
>
> For a few months now, SOLR has this strange behavior. Our code did not
> change, however, documents started disappearing from the index. And it's
> decreasing constantly, at various speeds.
>
> Our first modification was to raise maxTime to 30sec, which seemed to fix
> the problem for some time. After a few weeks, the problem started showing
> again, so we've upgraded to SOLR 3.1.
>
> This upgrade did not fix the problem either. Our last try was with
> maxDocs=500 and maxTime=60sec. After a complete reindex, SOLR shows all
> 10081 documents, but after a few minutes, it suddenly goes down to 10078,
> after a few hours to 10076, and it's stabilizing around this number.
>
> We have another SOLR index with ~400.000 objects, maxDocs=1000 and
> maxTime=3
> minutes and it never showed this problem.
>
> Do you have any idea why this is happening? Or how we can identify the
> problem?
>
> Thanks,
> Marius
>


Re: wildcard search

2011-06-08 Thread lboutros
Hi Thomas,

I don't use it myself  (but I will soon), so I may be wrong, but did you try
to use the ComplexPhraseQueryParser :

ComplexPhraseQueryParser
  QueryParser which permits complex phrase query syntax eg "(john
jon jonathan~) peters*".

It seems that you could do such type of queries :

GOK:"IA 38*"

Ludovic.


-
Jouve
France.


solr index losing entries

2011-06-08 Thread Marius Hanganu
Hello,

We've been using Solr 1.4 for 1.5 years now for one of the indexes in our
application, with a special configuration with maxDocs=1 and maxTime=1. The
number of documents is 10,000, with an index size around 10MB.

For a few months now, SOLR has shown this strange behavior. Our code did not
change; however, documents started disappearing from the index. And it's
decreasing constantly, at various speeds.

Our first modification was to raise maxTime to 30sec, which seemed to fix
the problem for some time. After a few weeks, the problem started showing
again, so we've upgraded to SOLR 3.1.

This upgrade did not fix the problem either. Our last try was with
maxDocs=500 and maxTime=60sec. After a complete reindex, SOLR shows all
10081 documents, but after a few minutes, it suddenly goes down to 10078,
after a few hours to 10076, and it's stabilizing around this number.

We have another SOLR index with ~400.000 objects, maxDocs=1000 and maxTime=3
minutes and it never showed this problem.

Do you have any idea why this is happening? Or how we can identify the
problem?

Thanks,
Marius


Re: KeywordTokenizerFactory and stopwords

2011-06-08 Thread Erik Hatcher
This seems like it deserves some kind of "collecting" TokenFilter(Factory) that 
will slurp up all incoming tokens and glue them together with a space (and 
allow separator to be configurable).   Hmmm surprised one of those doesn't 
already exist.  With something like that you could have a standard tokenization 
chain, and put it all back together at the end.

Erik

On Jun 8, 2011, at 10:59 , Matt Mitchell wrote:

> Hi,
> 
> I have an "autocomplete" fieldType that works really well, but because
> the KeywordTokenizerFactory (if I understand correctly) is emitting a
> single token, the stopword filter will not detect any stopwords.
> Anyone know of a way to strip out stopwords when using
> KeywordTokenizerFactory? I did try the reg-exp replace filter, but I'm
> not sure I want to add a bunch of reg-exps for replacing every
> stopword.
> 
> Thanks,
> Matt
> 
> Here's the fieldType definition:
> 



KeywordTokenizerFactory and stopwords

2011-06-08 Thread Matt Mitchell
Hi,

I have an "autocomplete" fieldType that works really well, but because
the KeywordTokenizerFactory (if I understand correctly) is emitting a
single token, the stopword filter will not detect any stopwords.
Anyone know of a way to strip out stopwords when using
KeywordTokenizerFactory? I did try the reg-exp replace filter, but I'm
not sure I want to add a bunch of reg-exps for replacing every
stopword.

Thanks,
Matt

Here's the fieldType definition:
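(The XML did not survive the archive; a representative autocomplete type matching the description, with KeywordTokenizer plus an edge n-gram filter; names and gram sizes are guesses:

<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- the entire input is one token, which is why StopFilter never sees individual words -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
)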



Re: wildcard search

2011-06-08 Thread Erick Erickson
Hmmm, have you tried EdgeNGrams? This works for me (at the expense
of a somewhat larger index, of course)...
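(The fieldType XML was stripped; a plausible reconstruction given the description, with the tokenizer choice and gram sizes as guesses, per the note about arbitrary parameters below:

<fieldType name="edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
)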
and a field of type "edge" named "thomasfield"


Now searches like
thomasfield:"GOK IA 3"
(include quotes!) should work. The various parameters (min/max gram size)
I chose arbitrarily, you'll want to tweak them.

I include a lowercasefilter for safety's sake if people are actually
going to type
things in...

It's probably instructive to look at the admin/analysis page to see how
this all plays out

Best
Erick


On Wed, Jun 8, 2011 at 9:29 AM, Thomas Fischer  wrote:
> Hi Erick,
>
> I have a multivalued field "GOK" (local classification scheme) with separate 
> entries of the sort
>  IA 300; IC 330; IA 317; IA 318, i.e. 1 to 3 capital characters, space, 3 
> digits.
> I want to be able to perform a truncated search on that field:
> either just the string before the space, or a combination of that string with 
> 1 or 2 digits, something like:
> GOK:IA
> or
> GOK:IA 3*
> or
> GOK:IA 31?
> My problem is the clash between the phrase (GOK:"IA 317" works) and the 
> wildcards.
>
> As a start I tried as type
>  autoGeneratePhraseQueries="true">
> from the solr 3.2 distribution schema
> (apache-solr-3.2.0/example/solr/conf/schema.xml),
> the field is just
> 
>
> BTW, I have another field "DDC" with entries of the form "t1:086643" with 
> analogous requirements which yields similar problems due to the colon, also 
> indexed as text.
> Here also
> DDC:T1\:086643
> works, but not
> DDC:T1\:08664?
>
> Thanks in advance
> Thomas
>
>> Yes there is, but you haven't provided enough information to
>> make a suggestion. What isthe fieldType definition? What is
>> the field definition?
>>
>> Two resources that'll help you greatly are:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> and the admin/analysis page...
>>
>> Best
>> Erick
>>
>> On Tue, Jun 7, 2011 at 6:23 PM, Thomas Fischer  wrote:
>>> Hello,
>>>
>>> I am testing solr 3.2 and have problems with wildcards.
>>> I am indexing values like "IA 300; IC 330; IA 317; IA 318" in a field 
>>> "GOK", and can't find a way to search with wildcards.
>>> I want to use a wild card search to match something like "IA 31?" but 
>>> cannot find a way to do so.
>>> GOK:IA\ 38* doesn't work with the contents of GOK indexed as text.
>>> Is there a way to index and search that would meet my requirements?
>>>
>>> Thomas
>>>
>>>
>>>
>
> Mit freundlichen Grüßen
> Thomas Fischer
>
>
>


Re: Problem with boosting function

2011-06-08 Thread Denis Kuzmenok
try:

q=title:Unicamp&defType=dismax&bf=question_count^5.0
"title:Unicamp" in any search handler will search only in requested field

> The queries I am trying to do are
> q=title:Unicamp

> and

> q=title:Unicamp&bf=question_count^5.0

> The boosting factor (5.0) is just to verify if it was really used.

> Thanks

> Alex





Re: Problem with boosting function

2011-06-08 Thread Alex Grilo
The queries I am trying to do are
q=title:Unicamp

and

q=title:Unicamp&bf=question_count^5.0

The boosting factor (5.0) is just to verify if it was really used.

Thanks

Alex

On Wed, Jun 8, 2011 at 10:25 AM, Denis Kuzmenok  wrote:

> Show your full request to solr (all params)
>
> > Hi,
> > I'm trying to use bf parameter in solr queries but I'm having some
> problems.
>
> > The context is: I have some topics and a integer weight of popularity
> > (number of users that follow the topic). I'd like to boost the documents
> > according to this weight field, and it changes (users may start following
> or
> > unfollowing that topic). I through the best way to do that is adding a bf
> > parameter to the query.
>
> > First of all I was trying to include it in a query processed by a default
> > SearchHandler. I debugged the results and the scores didn't change. So I
> > tried to change the defType of the SearchHandler to dismax (I didn't add
> any
> > other field in solrconfig), and queries didn't work anymore.
>
> > What is the best way to achieve what I want? Do I really need to use a
> > dismax SearchHandler (I read about it, and I don't want to search in
> > multiple fields - I want to search in one field and boost on another one)?
>
> > Thanks in advance
>
> > Alex Grilo
>
>
>


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Bill Bell
Re Amazon elb.

This is not exactly true. The ELB does load balancer internal IPs. But the ELB  
 IP address must be external. Still a major issue unless you use 
authentication. Nginx and others can also do load balancing.

Bill Bell
Sent from mobile


On Jun 8, 2011, at 3:32 AM, "Upayavira"  wrote:

> 
> 
> On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
> wrote:
>> Hello list,
>> 
>> Thanks for attending to my previous questions so far, have learnt a lot.
>> Here is another one, I hope it will be interesting to answer.
>> 
>> 
>> 
>> We run our SOLR shards and front end SOLR on the Amazon high-end
>> machines.
>> Currently we have 6 shards with around 200GB in each. Currently we have
>> only
>> one front end SOLR which, given a client query, redirects it to all the
>> shards. Our shards are constantly growing, data is at times reindexed (in
>> batches, which is done by removing a decent chunk before replacing it
>> with
>> updated data), constant stream of new data is coming every hour (usually
>> hits the latest shard in time, but can also hit other shards, which have
>> older data). Since the front end SOLR has started to be a SPOF, we are
>> thinking about setting up some sort of load balancer.
>> 
>> 1) do you think ELB from Amazon is a good solution for starters? We don't
>> need to maintain sessions between SOLR and client.
>> 2) What other load balancers have been used specifically with SOLR?
>> 
>> 
>> Overall: does SOLR scale to such size (200GB in an index) and what can be
>> recommended as next step -- resharding (cutting existing shards to
>> smaller
>> chunks), replication?
> 
> Really, it is going to be up to you to work out what works in your
> situation. You may be reaching the limit of what a Lucene index can
> handle, don't know. If your query traffic is low, you might find that
> two 100Gb cores in a single instance performs better. But then, maybe
> not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
> :-)
> 
> The principal issue with Amazon's load balancers (at least when I was
> using them last year) is that the ports that they balance need to be
> public. You can't use an Amazon load balancer as an internal service
> within a security group. For a service such as Solr, that can be a bit
> of a killer.
> 
> If they've fixed that issue, then they'd work fine (I used them quite
> happily in another scenario).
> 
> When looking at resolving single points of failure, handling search is
> pretty easy (as you say, stateless load balancer). You will need to give
> more attention though to how you handle it regarding indexing.
> 
> Hope that helps a bit!
> 
> Upayavira
> 
> 
> 
> 
> 
> --- 
> Enterprise Search Consultant at Sourcesense UK, 
> Making Sense of Open Source
> 


Re: Problem with boosting function

2011-06-08 Thread Yonik Seeley
The boost qparser should do the trick if you want a multiplicative boost.
http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html
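(A sketch using the thread's field names; whether to dampen with log() is an extra choice, not from the thread:

q={!boost b=question_count}title:Unicamp

or, dampened:

q={!boost b=log(question_count)}title:Unicamp

Unlike bf, which adds the function value to the score, {!boost} multiplies the relevancy score by it.)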

-Yonik
http://www.lucidimagination.com



On Wed, Jun 8, 2011 at 9:22 AM, Alex Grilo  wrote:
> Hi,
> I'm trying to use bf parameter in solr queries but I'm having some problems.
>
> The context is: I have some topics and a integer weight of popularity
> (number of users that follow the topic). I'd like to boost the documents
> according to this weight field, and it changes (users may start following or
> unfollowing that topic). I thought the best way to do that is adding a bf
> parameter to the query.
>
> First of all I was trying to include it in a query processed by a default
> SearchHandler. I debugged the results and the scores didn't change. So I
> tried to change the defType of the SearchHandler to dismax (I didn't add any
> other field in solrconfig), and queries didn't work anymore.
>
> What is the best way to achieve what I want? Do I really need to use a
> dismax SearchHandler (I read about it, and I don't want to search in multiple
> fields - I want to search in one field and boost on another one)?
>
> Thanks in advance
>
> Alex Grilo
>


Re: Solr Cloud and Range Facets

2011-06-08 Thread Jamie Johnson
One last piece of information... regular range queries seem to work fine;
it's only date ranges which seem to be intermittent.

On Wed, Jun 8, 2011 at 9:03 AM, Jamie Johnson  wrote:

> Some more information
>
> I am currently doing the following:
>
> SolrQuery query = new SolrQuery();
> query.setQuery("test");
> query.setParam("distrib", true);
> query.setFacet(true);
> query.setParam(FacetParams.FACET_RANGE, "dateTime");
> query.setParam("f.dateTime." + FacetParams.FACET_RANGE_GAP, "+1MONTH");
> query.setParam("f.dateTime." + FacetParams.FACET_RANGE_START, "2011-06-01T00:00:00Z-1YEAR");
> query.setParam("f.dateTime." + FacetParams.FACET_RANGE_END, "2011-07-01T00:00:00Z");
> query.setParam("f.dateTime." + FacetParams.FACET_MINCOUNT, "1");
>
> System.out.println(query);
> int failure = 0;
> for (int x = 0; x < 1000; x++) {
>     QueryResponse response = mainServer.query(query);
>     List<RangeFacet> ranges = response.getFacetRanges();
>     for (RangeFacet range : ranges) {
>         if ("dateTime".equals(range.getName())) {
>             if (range.getCounts().size() == 0) {
>                 failure++;
>             }
>         }
>     }
> }
> System.out.println("Failed: " + failure);
>
>
> After this has run I get anywhere 30 - 40% failures (300 - 400).  If I set
> distrib to false or take off the query it works fine.  Any insight would be
> greatly appreciated.
>
>
> On Tue, Jun 7, 2011 at 2:27 PM, Jamie Johnson  wrote:
>
>> I have a solr cloud setup wtih 2 servers, when executing a query against
>> them of the form:
>>
>>
>> http://localhost:8983/solr/select/?distrib=true&q=*:*&facet=true&facet.mincount=1&facet.range=dateTime&f.dateTime.facet.range.gap=%2B1MONTH&f.dateTime.facet.range.start=2011-06-01T00%3A00%3A00Z-1YEAR&f.dateTime.facet.range.end=2011-07-01T00%3A00%3A00Z&f.dateTime.facet.mincount=1&start=0&rows=0
>>
>> I am seeing that sometimes the date facet has a count, and other times it
>> does not.  Specifically I am seeing sometimes:
>>
>> 
>>   
>> 
>> +1MONTH
>> 2010-06-01T00:00:00Z
>> 2011-07-01T00:00:00Z
>>   
>> 
>>
>> and others
>> 
>>   
>> 
>>   250
>> 
>> +1MONTH
>> 2010-06-01T00:00:00Z
>> 2011-07-01T00:00:00Z
>>   
>> 
>>
>> What could be causing this inconsistency?
>>
>
>


Re: wildcard search

2011-06-08 Thread Thomas Fischer
Hi Erick,

I have a multivalued field "GOK" (local classification scheme) with separate 
entries of the sort
 IA 300; IC 330; IA 317; IA 318, i.e. 1 to 3 capital characters, space, 3 
digits.
I want to be able to perform a truncated search on that field:
either just the string before the space, or a combination of that string with 1 
or 2 digits, something like:
GOK:IA
or
GOK:IA 3*
or
GOK:IA 31?
My problem is the clash between the phrase (GOK:"IA 317" works) and the 
wildcards.

As a start I tried as type

from the solr 3.2 distribution schema
(apache-solr-3.2.0/example/solr/conf/schema.xml),
the field is just


BTW, I have another field "DDC" with entries of the form "t1:086643" with 
analogous requirements which yields similar problems due to the colon, also 
indexed as text.
Here also 
DDC:T1\:086643
works, but not 
DDC:T1\:08664?

Thanks in advance
Thomas

> Yes there is, but you haven't provided enough information to
> make a suggestion. What isthe fieldType definition? What is
> the field definition?
> 
> Two resources that'll help you greatly are:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> 
> and the admin/analysis page...
> 
> Best
> Erick
> 
> On Tue, Jun 7, 2011 at 6:23 PM, Thomas Fischer  wrote:
>> Hello,
>> 
>> I am testing solr 3.2 and have problems with wildcards.
>> I am indexing values like "IA 300; IC 330; IA 317; IA 318" in a field "GOK", 
>> and can't find a way to search with wildcards.
>> I want to use a wild card search to match something like "IA 31?" but cannot 
>> find a way to do so.
>> GOK:IA\ 38* doesn't work with the contents of GOK indexed as text.
>> Is there a way to index and search that would meet my requirements?
>> 
>> Thomas
>> 
>> 
>> 

Mit freundlichen Grüßen
Thomas Fischer




Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hi Upayavira,

Thanks for sharing insights and experience on this.

As we have 6 shards at the moment, it is pretty hard (=almost impossible) to
keep them on a single box, so that's why we decided to shard. On the other
hand, we have never tried multicore architecture, so that's a good point,
thanks.

On the indexing side, we do it rather straightforwardly, that is, by updating
the online shards. This should hopefully be improved with an [offline update /
HTTP swap] system, as already now, updating online 200GB shards at times
produces OOM, freezing and other issues.



Does someone have other experience / pointers to load balancer software that
was tried with SOLR?

Dmitry

On Wed, Jun 8, 2011 at 12:32 PM, Upayavira  wrote:

>
>
> On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
> wrote:
> > Hello list,
> >
> > Thanks for attending to my previous questions so far, have learnt a lot.
> > Here is another one, I hope it will be interesting to answer.
> >
> >
> >
> > We run our SOLR shards and front end SOLR on the Amazon high-end
> > machines.
> > Currently we have 6 shards with around 200GB in each. Currently we have
> > only
> > one front end SOLR which, given a client query, redirects it to all the
> > shards. Our shards are constantly growing, data is at times reindexed (in
> > batches, which is done by removing a decent chunk before replacing it
> > with
> > updated data), constant stream of new data is coming every hour (usually
> > hits the latest shard in time, but can also hit other shards, which have
> > older data). Since the front end SOLR has started to be a SPOF, we are
> > thinking about setting up some sort of load balancer.
> >
> > 1) do you think ELB from Amazon is a good solution for starters? We don't
> > need to maintain sessions between SOLR and client.
> > 2) What other load balancers have been used specifically with SOLR?
> >
> >
> > Overall: does SOLR scale to such size (200GB in an index) and what can be
> > recommended as next step -- resharding (cutting existing shards to
> > smaller
> > chunks), replication?
>
> Really, it is going to be up to you to work out what works in your
> situation. You may be reaching the limit of what a Lucene index can
> handle, don't know. If your query traffic is low, you might find that
> two 100Gb cores in a single instance performs better. But then, maybe
> not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
> :-)
>
> The principal issue with Amazon's load balancers (at least when I was
> using them last year) is that the ports that they balance need to be
> public. You can't use an Amazon load balancer as an internal service
> within a security group. For a service such as Solr, that can be a bit
> of a killer.
>
> If they've fixed that issue, then they'd work fine (I used them quite
> happily in another scenario).
>
> When looking at resolving single points of failure, handling search is
> pretty easy (as you say, stateless load balancer). You will need to give
> more attention though to how you handle it regarding indexing.
>
> Hope that helps a bit!
>
> Upayavira
>
>
>
>
>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


Re: Problem with boosting function

2011-06-08 Thread Denis Kuzmenok
Show your full request to solr (all params)

> Hi,
> I'm trying to use bf parameter in solr queries but I'm having some problems.

> The context is: I have some topics and an integer weight of popularity
> (number of users that follow the topic). I'd like to boost the documents
> according to this weight field, and it changes (users may start following or
> unfollowing that topic). I thought the best way to do that is adding a bf
> parameter to the query.

> First of all I was trying to include it in a query processed by a default
> SearchHandler. I debugged the results and the scores didn't change. So I
> tried to change the defType of the SearchHandler to dismax (I didn't add any
> other field in solrconfig), and queries didn't work anymore.

> What is the best way to achieve what I want? Do I really need to use a
> dismax SearchHandler (I read about it, and I don't want to search in multiple
> fields - I want to search in one field and boost in another one)?

> Thanks in advance

> Alex Grilo




Problem with boosting function

2011-06-08 Thread Alex Grilo
Hi,
I'm trying to use bf parameter in solr queries but I'm having some problems.

The context is: I have some topics and an integer weight of popularity
(number of users that follow the topic). I'd like to boost the documents
according to this weight field, and it changes (users may start following or
unfollowing that topic). I thought the best way to do that is adding a bf
parameter to the query.

First of all I was trying to include it in a query processed by a default
SearchHandler. I debugged the results and the scores didn't change. So I
tried to change the defType of the SearchHandler to dismax (I didn't add any
other field in solrconfig), and queries didn't work anymore.

What is the best way to achieve what I want? Do I really need to use a
dismax SearchHandler (I read about it, and I don't want to search in multiple
fields - I want to search in one field and boost in another one)?

Thanks in advance

Alex Grilo
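
As an illustration (hedged; the field names "topic_name" and "popularity" are
hypothetical), a request where bf actually takes effect, since bf is read by
the dismax parser and ignored by the default lucene query parser, and qf tells
dismax which field to search:

/solr/select?q=economy&defType=dismax&qf=topic_name&bf=log(popularity)&debugQuery=true

debugQuery=true shows the boost contribution in the score explanation, which
makes it easy to verify that the bf is being applied.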


Re: how to Index and Search non-English Text in solr

2011-06-08 Thread Erick Erickson
This page is a handy reference for individual languages...
http://wiki.apache.org/solr/LanguageAnalysis

But the usual approach, especially for Chinese/Japanese/Korean
(CJK), is to index the content in different fields with language-specific
analyzers, then spread your search across the language-specific
fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
particularly give "surprising" results if you put words from different
languages in the same field.
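
For instance, a minimal SolrJ sketch under the assumption that per-language
fields like title_en/text_zh exist in the schema (the field names here are
hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiLangSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("economy");
        query.set("defType", "dismax");
        // spread the one query string across the per-language fields
        query.set("qf", "title_en text_en title_zh text_zh title_ja text_ja");
        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}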

Best
Erick

On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq  wrote:
> Hi,
> I had set up solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in
> English, but my requirements extend to indexing news in other languages too.
>
> This is how my schema looks :
> <field name="..." type="text" ... required="false"/>
>
>
> And the "text" Field in schema.xml looks like :
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>    <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>        words="stopwords.txt" enablePositionIncrements="true"/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>        protected="protwords.txt"/>
>    </analyzer>
>    <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>        ignoreCase="true" expand="true"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>        words="stopwords.txt" enablePositionIncrements="true"/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>        protected="protwords.txt"/>
>    </analyzer>
> </fieldType>
>
>
> My Problem is :
> Now I want to index news articles in other languages too, e.g.
> Chinese and Japanese.
> How can I modify my text field so that I can index news in other languages
> and make it searchable ??
>
> Thanks
> Shariq
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Cloud and Range Facets

2011-06-08 Thread Jamie Johnson
Some more information

I am currently doing the following:

SolrQuery query = new SolrQuery();

query.setQuery("test");

query.setParam("distrib", true);

query.setFacet(true);

query.setParam(FacetParams.FACET_RANGE, "dateTime");
query.setParam("f.dateTime." + FacetParams.FACET_RANGE_GAP,
"+1MONTH");
query.setParam("f.dateTime." + FacetParams.FACET_RANGE_START,
"2011-06-01T00:00:00Z-1YEAR");
query.setParam("f.dateTime." + FacetParams.FACET_RANGE_END,
"2011-07-01T00:00:00Z");
query.setParam("f.dateTime." + FacetParams.FACET_MINCOUNT, "1");

System.out.println(query);
int failure = 0;
for (int x = 0; x < 1000; x++) {

    QueryResponse response = mainServer.query(query);

    List<RangeFacet> ranges = response.getFacetRanges();
    for (RangeFacet range : ranges) {
        if ("dateTime".equals(range.getName())) {
            if (range.getCounts().size() == 0) {
                failure++;
            }
        }
    }
}
System.out.println("Failed: " + failure);


After this has run I get anywhere from 30-40% failures (300-400).  If I set
distrib to false or take off the query it works fine.  Any insight would be
greatly appreciated.

On Tue, Jun 7, 2011 at 2:27 PM, Jamie Johnson  wrote:

> I have a solr cloud setup with 2 servers; when executing a query against
> them of the form:
>
>
> http://localhost:8983/solr/select/?distrib=true&q=*:*&facet=true&facet.mincount=1&facet.range=dateTime&f.dateTime.facet.range.gap=%2B1MONTH&f.dateTime.facet.range.start=2011-06-01T00%3A00%3A00Z-1YEAR&f.dateTime.facet.range.end=2011-07-01T00%3A00%3A00Z&f.dateTime.facet.mincount=1&start=0&rows=0
>
> I am seeing that sometimes the date facet has a count, and other times it
> does not.  Specifically I am seeing sometimes:
>
> <lst name="facet_ranges">
>   <lst name="dateTime">
>     <lst name="counts"/>
>     <str name="gap">+1MONTH</str>
>     <date name="start">2010-06-01T00:00:00Z</date>
>     <date name="end">2011-07-01T00:00:00Z</date>
>   </lst>
> </lst>
>
> and others
> <lst name="facet_ranges">
>   <lst name="dateTime">
>     <lst name="counts">
>       <int name="...">250</int>
>     </lst>
>     <str name="gap">+1MONTH</str>
>     <date name="start">2010-06-01T00:00:00Z</date>
>     <date name="end">2011-07-01T00:00:00Z</date>
>   </lst>
> </lst>
>
> What could be causing this inconsistency?
>


Re: Getting a query on an "fl" parameter value ?

2011-06-08 Thread Erick Erickson
Hmmm, fl is the list of fields to return, it has nothing to do with
what's searched. Are you looking for something like
q=keyword AND id:(12 OR 45 OR 32) ?
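
A minimal SolrJ sketch of the same idea, using a filter query so the id
restriction doesn't influence scoring (assuming a SolrServer instance named
server; the field names id, name and title are the ones from the original
mail):

SolrQuery query = new SolrQuery("keyword");
query.addFilterQuery("id:(12 OR 45 OR 32)"); // restricts the result set only
query.setFields("id", "name", "title");      // fl: which fields come back
QueryResponse rsp = server.query(query);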

Best
Erick

On Tue, Jun 7, 2011 at 11:14 AM, duddy67  wrote:
> Hi all,
>
> I'd like to know if it's possible to get a query on an "fl" value.
> For now my url query looks like that:
>
> /solr/select/?q=keyword&version=2.2&start=0&rows=10&indent=on&fl=id+name+title
>
> it works, but I also need to query on an "fl" field value.
> I'd like to add to my initial query something like: WHERE the "fl" id value
> is equal to 12 OR 45 OR 32.
>
> How can I do that ?
>
>
> Thanks in advance.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Getting-a-query-on-an-fl-parameter-value-tp3034887p3034887.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


how to Index and Search non-English Text in solr

2011-06-08 Thread Mohammad Shariq
Hi,
I had set up solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in
English, but my requirements extend to indexing news in other languages too.

This is how my schema looks :
<field name="..." type="text" ... required="false"/>


And the "text" Field in schema.xml looks like :

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
   </analyzer>
</fieldType>



My Problem is :
Now I want to index news articles in other languages too, e.g.
Chinese and Japanese.
How can I modify my text field so that I can index news in other languages
and make it searchable ??

Thanks
Shariq





--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr speed issues..

2011-06-08 Thread Mohammad Shariq
How frequently do you optimize your Solr index?
Optimization also helps in reducing search latency.
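
For example (hedged), an explicit optimize can be issued over HTTP:

curl 'http://localhost:8983/solr/update?optimize=true'

or from SolrJ, assuming a SolrServer instance named server:

server.optimize();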

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-speed-issues-tp2254823p3038794.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting result on query.

2011-06-08 Thread Koji Sekiguchi

(11/06/08 16:20), Denis Kuzmenok wrote:
>> If you could move to 3.x and your "linked item" boosts could be
>> calculated offline in batch periodically you could use an external
>> file field to store the doc boost.
>>
>> a few If's though
>
> I have 3.2 and external file field doesn't work without solr restart
> (on multicore instance).

Can you try ReloadCacheRequestHandler, which has been introduced in 3.2?

<requestHandler name="/reloadCache"
    class="org.apache.solr.search.function.FileFloatSource$ReloadCacheRequestHandler" />

When you change the external file, hit /reloadCache above instead of restarting
solr.

koji
--
http://www.rondhuit.com/en/


Re: Re: Can I update a specific field in solr?

2011-06-08 Thread Mohammad Shariq
Solr doesn't support partial updates.

On 8 June 2011 16:04, ZiLi  wrote:

>
> Thanks very much, I'll re-index the whole document : )
>
>
>
>
> From: Chandan Tamrakar
> Sent: 2011-06-08  18:25:37
> To: solr-user
> Cc:
> Subject: Re: Can I update a specific field in solr?
>
> I think you can do that but you need to re-index the whole document again.
> Note that there is nothing like "update"; it's usually delete and then
> add.
> thanks
> On Wed, Jun 8, 2011 at 4:00 PM, ZiLi  wrote:
> > Hi, I try to update a specific field in solr, but I didn't find any way
> to
> > implement this.
> > Anyone who knows how to ?
> > Any suggestions will be appreciated : )
> >
> >
> > 2011-06-08
> >
> >
> >
> > ZiLi
> >
> --
> Chandan Tamrakar
> *
> *
>



-- 
Thanks and Regards
Mohammad Shariq


Re: Re: Can I update a specific field in solr?

2011-06-08 Thread ZiLi

Thanks very much, I'll re-index the whole document : ) 




From: Chandan Tamrakar 
Sent: 2011-06-08  18:25:37 
To: solr-user 
Cc: 
Subject: Re: Can I update a specific field in solr? 
 
I think you can do that but you need to re-index the whole document again.
Note that there is nothing like "update"; it's usually delete and then add.
thanks
On Wed, Jun 8, 2011 at 4:00 PM, ZiLi  wrote:
> Hi, I try to update a specific field in solr, but I didn't find any way to
> implement this.
> Anyone who knows how to ?
> Any suggestions will be appreciated : )
>
>
> 2011-06-08
>
>
>
> ZiLi
>
-- 
Chandan Tamrakar
*
*


Re: Can I update a specific field in solr?

2011-06-08 Thread Chandan Tamrakar
I think you can do that but you need to re-index the whole document again.

Note that there is nothing like "update"; it's usually delete and then add.
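
A minimal SolrJ sketch of that "update by re-adding" (field names here are
hypothetical; the point is that the uniqueKey must match the existing document
and every field has to be supplied again, because the old document is replaced
wholesale):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReindexOneDoc {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");               // same uniqueKey as before
        doc.addField("title", "updated title"); // plus ALL the other fields again
        server.add(doc);                        // replaces the old document
        server.commit();
    }
}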

thanks

On Wed, Jun 8, 2011 at 4:00 PM, ZiLi  wrote:

> Hi, I try to update a specific field in solr, but I didn't find any way to
> implement this.
> Anyone who knows how to ?
> Any suggestions will be appreciated : )
>
>
> 2011-06-08
>
>
>
> ZiLi
>



-- 
Chandan Tamrakar
*
*


Re: How to deal with many files using solr external file field

2011-06-08 Thread Bohnsack, Sven
Hi,

I could not provide a stack trace, and IMHO it wouldn't provide any useful
information. But we've made good progress in the analysis.

We took a deeper look at what happens when an "external-file-field" request
is sent to SOLR:

* SOLR checks whether there is a file for the requested query, e.g. "trousers"
* If so, SOLR loads the "trousers" file and generates a HashMap entry
consisting of a FileFloatSource object and a float array sized to the
number of documents in the SOLR index. Every document matched by the query
gets the score value provided in the external score file. For every(!) other
document SOLR writes a zero into that float array
* if SOLR does not find a file for the query request, SOLR still generates
a HashMap entry with score zero for every document

In our case we have about 8.5 million documents in our index, and one of those
arrays occupies about 34MB of heap space (8.5 million floats at 4 bytes each).
With e.g. 100 different queries and external file fields used for sorting the
results, SOLR occupies about 3.4GB of heap space.

The problem might be the use of WeakHashMap [1]: entries are only removed once
their keys are no longer in ordinary use, so entries whose keys remain strongly
referenced elsewhere are never cleaned up.


What do you think could be a possible solution for this whole problem? (apart
from "don't use external file fields" ;)


Regards
Sven


[1]: "A hashtable-based Map implementation with weak keys. An entry in a 
WeakHashMap will automatically be removed when its key is no longer in ordinary 
use. More precisely, the presence of a mapping for a given key will not prevent 
the key from being discarded by the garbage collector, that is, made 
finalizable, finalized, and then reclaimed. When a key has been discarded its 
entry is effectively removed from the map, so this class behaves somewhat 
differently than other Map implementations."
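
A minimal sketch (not Solr code) of the behaviour described in [1]: an entry
only becomes collectable once its key is no longer strongly reachable
elsewhere, so values pinned by live keys, an open IndexReader for example,
keep occupying heap. Run with enough heap for two 34MB arrays, e.g. -Xmx256m:

import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, float[]> cache = new WeakHashMap<Object, float[]>();
        Object liveKey = new Object();               // stays strongly referenced
        cache.put(liveKey, new float[8500000]);      // ~34MB, like one score array
        cache.put(new Object(), new float[8500000]); // key unreachable from here on
        System.gc();                                 // only a hint; results may vary
        Thread.sleep(100);
        // Typically prints 1: the entry whose key is still reachable survives.
        System.out.println("entries still cached: " + cache.size());
        System.out.println(liveKey.hashCode());      // keeps liveKey live past the GC
    }
}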

-----Original Message-----
From: mtnes...@gmail.com [mailto:mtnes...@gmail.com] On Behalf Of Simon 
Rosenthal
Sent: Wednesday, June 8, 2011 03:56
To: solr-user@lucene.apache.org
Subject: Re: How to deal with many files using solr external file field

Can you provide a stack trace for the OOM exception?

On Tue, Jun 7, 2011 at 4:25 PM, Bohnsack, Sven
wrote:

> Hi all,
>
> we're using solr 1.4 and external file field ([1]) for sorting our
> search results. We have about 40,000 terms for which we use this sorting
> option.
> Currently we're running into massive OutOfMemory problems and are not
> quite sure what's the matter. It seems that the garbage collector stops
> working or some processes are going wild. However, solr starts to allocate
> more and more RAM until we experience this OutOfMemory-Exception.
>
>
> We noticed the following:
>
> For some terms one can see in the solr log some
> java.io.FileNotFoundExceptions, when solr tries to load an external file for
> a term for which there is no such file, e.g. solr tries to load the
> external score file for "trousers" but there ist none in the
> /solr/data-Folder.
>
> Question: is it possible that those exceptions are responsible for the
> OutOfMemory problem, or could it be due to the large(?) number of 40k terms
> for which we want to sort the result via external file field?
>
> I'm looking forward for your answers, suggestions and ideas :)
>
>
> Regards
> Sven
>
>
> [1]:
> http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
>


Can I update a specific field in solr?

2011-06-08 Thread ZiLi
Hi, I try to update a specific field in solr, but I didn't find any way to
implement this.
Anyone who knows how to ?
Any suggestions will be appreciated : )


2011-06-08 



ZiLi 


Re: Question about tokenizing, searching and retrieving results.

2011-06-08 Thread Luis Cappa Banda
Hello again!

Thank you very much for answering. The problem was the defaultOperator,
which was set to AND. Damn, I was blind :-/


Thank you again.


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Upayavira


On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
wrote:
> Hello list,
> 
> Thanks for attending to my previous questions so far, have learnt a lot.
> Here is another one, I hope it will be interesting to answer.
> 
> 
> 
> We run our SOLR shards and front end SOLR on the Amazon high-end
> machines.
> Currently we have 6 shards with around 200GB in each. Currently we have
> only
> one front end SOLR which, given a client query, redirects it to all the
> shards. Our shards are constantly growing, data is at times reindexed (in
> batches, which is done by removing a decent chunk before replacing it
> with
> updated data), constant stream of new data is coming every hour (usually
> hits the latest shard in time, but can also hit other shards, which have
> older data). Since the front end SOLR has started to be a SPOF, we are
> thinking about setting up some sort of load balancer.
> 
> 1) do you think ELB from Amazon is a good solution for starters? We don't
> need to maintain sessions between SOLR and client.
> 2) What other load balancers have been used specifically with SOLR?
> 
> 
> Overall: does SOLR scale to such size (200GB in an index) and what can be
> recommended as next step -- resharding (cutting existing shards to
> smaller
> chunks), replication?

Really, it is going to be up to you to work out what works in your
situation. You may be reaching the limit of what a Lucene index can
handle, don't know. If your query traffic is low, you might find that
two 100Gb cores in a single instance performs better. But then, maybe
not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
:-)

The principal issue with Amazon's load balancers (at least when I was
using them last year) is that the ports that they balance need to be
public. You can't use an Amazon load balancer as an internal service
within a security group. For a service such as Solr, that can be a bit
of a killer.

If they've fixed that issue, then they'd work fine (I used them quite
happily in another scenario).

When looking at resolving single points of failure, handling search is
pretty easy (as you say, stateless load balancer). You will need to give
more attention though to how you handle it regarding indexing.

Hope that helps a bit!

Upayavira





--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Re: 400 MB Fields

2011-06-08 Thread Alexander Kanarsky
Otis,

Not sure about Solr, but with Lucene it was certainly doable. I
saw fields way bigger than 400MB indexed, sometimes having a large set
of unique terms as well (think something like a log file with lots of
alphanumeric tokens, a couple of gigs in size). While indexing and
querying such fields, the I/O, naturally, could easily become a
bottleneck.

-Alexander


Re: tika integration exception and other related queries

2011-06-08 Thread Gary Taylor

Naveen,

For indexing Zip files with Tika, take a look at the following thread :

http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

I got it to work with the 3.1 source and a couple of patches.

Hope this helps.

Regards,
Gary.


On 08/06/2011 04:12, Naveen Gupta wrote:

Hi, can somebody answer this ...

3. can somebody give me an idea how to index a zip file ?

1. while sending docx, we are getting the following error.




huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hello list,

Thanks for attending to my previous questions so far, have learnt a lot.
Here is another one, I hope it will be interesting to answer.



We run our SOLR shards and front end SOLR on the Amazon high-end machines.
Currently we have 6 shards with around 200GB in each. Currently we have only
one front end SOLR which, given a client query, redirects it to all the
shards. Our shards are constantly growing, data is at times reindexed (in
batches, which is done by removing a decent chunk before replacing it with
updated data), constant stream of new data is coming every hour (usually
hits the latest shard in time, but can also hit other shards, which have
older data). Since the front end SOLR has started to be a SPOF, we are
thinking about setting up some sort of load balancer.

1) do you think ELB from Amazon is a good solution for starters? We don't
need to maintain sessions between SOLR and client.
2) What other load balancers have been used specifically with SOLR?


Overall: does SOLR scale to such size (200GB in an index) and what can be
recommended as next step -- resharding (cutting existing shards to smaller
chunks), replication?

Thanks for reading to this point.

-- 
Regards,

Dmitry Kan


Re: Getting query fields in a custom SearchHandler

2011-06-08 Thread Marc SCHNEIDER
Hi,

I reply to myself :-)
The solution is to use this utility class :
org.apache.solr.search.QueryParsing. Then you can do:

Query luceneQuery = QueryParsing.parseQuery(req.getParams().get("q"),
req.getSchema());

Then with luceneQuery you can use the extractTerms method.
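
Continuing from luceneQuery above, a hedged sketch of collecting the
field/value pairs (note that extractTerms() only works on primitive or
rewritten queries; wildcard-style queries would need Query.rewrite() against
an IndexReader first):

Set<Term> terms = new HashSet<Term>();
luceneQuery.extractTerms(terms);
Map<String, String> fieldToValue = new HashMap<String, String>();
for (Term t : terms) {
    fieldToValue.put(t.field(), t.text());
}

(imports: java.util.* and org.apache.lucene.index.Term)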

Marc.

On Fri, Jun 3, 2011 at 9:15 AM, Marc SCHNEIDER
wrote:

> Hi all,
>
> I wrote my own SearchHandler and therefore overrode the handleRequestBody
> method.
> This method takes two input parameters : SolrQueryRequest and
> SolrQueryResponse objects.
> The thing I'd like to do is to get the query fields that are used in my
> request.
> Of course I can use req.getParams().get("q") but it returns the complete
> query (which can be very complicated). I'd like to have a simple map with
> field:value.
> Is there a way to get it? Or do I have to write my own parser for the "q"
> parameter?
>
> Thanks in advance,
> Marc.
>


Re: Boosting result on query.

2011-06-08 Thread Denis Kuzmenok
> If you could move to 3.x and your "linked item" boosts could be
> calculated offline in batch periodically you could use an external
> file field to store the doc boost.

> a few If's though

I have 3.2 and external file field doesn't work without solr restart
(on multicore instance). 



Re: Getting a query on an "fl" parameter value ?

2011-06-08 Thread lee carroll
try
http://wiki.apache.org/solr/CommonQueryParameters#fq

On 7 June 2011 16:14, duddy67  wrote:
> Hi all,
>
> I'd like to know if it's possible to get a query on an "fl" value.
> For now my url query looks like that:
>
> /solr/select/?q=keyword&version=2.2&start=0&rows=10&indent=on&fl=id+name+title
>
> it works, but I also need to query on an "fl" field value.
> I'd like to add to my initial query something like: WHERE the "fl" id value
> is equal to 12 OR 45 OR 32.
>
> How can I do that ?
>
>
> Thanks in advance.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Getting-a-query-on-an-fl-parameter-value-tp3034887p3034887.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Boosting result on query.

2011-06-08 Thread lee carroll
If you could move to 3.x, and your "linked item" boosts could be
calculated offline in batch periodically, you could use an external
file field to store the doc boost.

a few If's though
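
A hedged sketch of one non-custom way to get close (using the field and item
names from the quoted mail below, and assuming SolrJ): run the query on the
other index first, then turn its ordered result into a dismax boost query with
decreasing weights, which could also be derived from the other query's scores:

SolrQuery query = new SolrQuery("the main query");
query.set("defType", "dismax");
query.set("qf", "name");
// item2 > item55 > item1 from the other index, mapped to decreasing boosts
query.set("bq", "linked_items:item2^8 linked_items:item55^4 linked_items:item1^2");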



On 8 June 2011 03:23, Jeff Boul  wrote:
> Hi,
>
> I am trying to figure out options for the following problem. I am on
> Solr 1.4.1 (Lucene 2.9.1).
>
> I need to perform a boost on a query related to the value of a multi-valued
> field.
>
> Lets say the result return the following documents:
>
> id   name    linked_items
> 3     doc3    (item1, item33, item55)
> 8     doc8    (item2, item55, item8)
> 0     doc0    (item7)
> 1     doc1    (item1)
> 
>
> I want the result to be boosted according to the following ordered list of
> linked_items values:
>
> item2 > item55 > item1 > ...
>
> So doc8 will receive the highest boost because its 'linked_items' contains
> 'item2',
> then doc3 will receive a lower boost because its 'linked_items' contains
> 'item55',
> then doc1 will receive a much lower boost because its 'linked_items'
> contains 'item1',
> and maybe doc0 will receive some boost if 'item7' is somewhere in the list.
>
> The tricky part is that the ordered list is obtained by querying another
> index. The query on the other index gives me a result set, and I will use
> the values of one field of those documents to construct the ordered list.
>
> It would be even better if the boost used not only the order but also the
> scores from the query on the other index.
>
> I'm not very experienced with Solr and Lucene, but from what I've read I
> think the solution revolves around a customization of the Query object.
>
> So the questions are:
>
> 1) Am I right about the Query customization assumption? (If so, could
> someone give me advice or point me to an example of something related?)
> 2) Does something already exist that I could use to do this?
> 3) Is using a separate index a good approach?
>
> Thanks for the help
>
> Jeff
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Boosting-result-on-query-tp3037649p3037649.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr 3.1 java.lang.NoClassDefFoundError org/carrot2/core/ControllerFactory

2011-06-08 Thread Stanislaw Osinski
Hi Bryan,

You'll also need to make sure that your ${solr.home}/contrib/clustering/lib
directory is in the classpath; that directory contains the Carrot2 JARs that
provide the classes you're missing. I think the example solrconfig.xml has
the relevant <lib> declarations.
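
For example (hedged; the dir value depends on where solr.home sits relative to
the install), a declaration along these lines in solrconfig.xml pulls the
Carrot2 JARs in:

<lib dir="./contrib/clustering/lib/" regex=".*\.jar" />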

Cheers,

S.

On Tue, Jun 7, 2011 at 13:48, bryan rasmussen wrote:

> As per the subject I am getting java.lang.NoClassDEfFoundError
> org/carrot2/core/ControllerFactory
> when I try to run clustering.
>
> I am using Solr 3.1:
>
> I get the following error:
>
> java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
>at
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:74)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>at java.lang.reflect.Constructor.newInstance(Unknown Source)
>at java.lang.Class.newInstance0(Unknown Source)
>at java.lang.Class.newInstance(Unknown Source)
>at
> org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:412)
>at
> org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:203)
>at
> org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:522)
>at org.apache.solr.core.SolrCore.<init>(SolrCore.java:594)
>at org.apache.solr.core.CoreContainer.create(CoreContainer.java:458)
>at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
>at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
>at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
>at
> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
>at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
>at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
>at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>at org.mortbay.jetty.Server.doStart(Server.java:224)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>at java.lang.reflect.Method.invoke(Unknown Source)
>at org.mortbay.start.Main.invokeMain(Main.java:194)
>at org.mortbay.start.Main.start(Main.java:534)
>at org.mortbay.start.Main.start(Main.java:441)
>at org.mortbay.start.Main.main(Main.java:119)
> Caused by: java.lang.ClassNotFoundException:
> org.carrot2.core.ControllerFactory
>at java.net.URLClassLoader$1.run(Unknown Source)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>
> using the following configuration
>
> <searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent"
> name="clustering">
>   <lst name="engine">
>     <str name="name">default</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>     <str name="...">20</str>
>   </lst>
> </searchComponent>
> <requestHandler name="..."
>   class="org.apache.solr.handler.component.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="...">title</str>
>     <str name="...">all_text</str>
>     <str name="...">all_text title</str>
>     <str name="...">150</str>
>   </lst>
>   <arr name="last-components">
>     <str>clustering</str>
>   </arr>
> </requestHandler>
>
>
>
> with the following command  to start solr
> java -Dsolr.clustering.enabled=true
> -Dsolr.solr.home="C:\projects\solrexample\solr" -jar start.jar
>
> Any idea as to why clustering is not working?
>
> Thanks,
> Bryan Rasmussen
>


Getting a query on an "fl" parameter value ?

2011-06-08 Thread duddy67
Hi all,

I'd like to know if it's possible to get a query on an "fl" value.
For now my url query looks like that:

/solr/select/?q=keyword&version=2.2&start=0&rows=10&indent=on&fl=id+name+title

it works, but I also need to query on an "fl" field value.
I'd like to add to my initial query something like: WHERE the "fl" id value is
equal to 12 OR 45 OR 32.

How can I do that ?


Thanks in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-a-query-on-an-fl-parameter-value-tp3034887p3034887.html
Sent from the Solr - User mailing list archive at Nabble.com.