Re: Detect term occurrences
Hi Francisco, >> I have many drug products leaflets, each corresponding to 1 product. On the other hand we have a medical dictionary with about 10^5 terms. I want to detect all the occurrences of those terms for any leaflet document. Take a look at SolrTextTagger for this use case. https://github.com/OpenSextant/SolrTextTagger 10^5 entries are not that large, I am using it for much larger dictionaries at the moment with very good results. It's a project built (at least originally) by David Smiley, who is also quite active in this group. -sujit On Fri, Sep 11, 2015 at 7:29 AM, Alexandre Rafalovitch wrote: > Assuming the medical dictionary is constant, I would do a copyField of > text into a separate field and have that separate field use: > > http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html > with words coming from the dictionary (normalized). > > That way that new field will ONLY have your dictionary terms from the > text. Then you can do facet against that field or anything else. Or > even search and just be a lot more efficient. > > The main issue would be a gigantic filter, which may mean speed and/or > memory issues. Solr has some ways to deal with such large set matches > by compiling them into a state machine (used for auto-complete), but I > don't know if that's exposed for your purpose. > > But could make a fun custom filter to build. > > Regards, > Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 10 September 2015 at 22:21, Francisco Andrés Fernández > wrote: > > Yes. > > I have many drug products leaflets, each corresponding to 1 product. On > the > > other hand we have a medical dictionary with about 10^5 terms. > > I want to detect all the occurrences of those terms for any leaflet > > document. > > Could you give me a clue about what is the best way to perform it? > > Perhaps, the best way is (as Walter suggests) to do all the queries every > > time, as needed. > > Regards, > > > > Francisco > > > > On Thu., Sep. 10, 2015 at 11:14 a.m., Alexandre Rafalovitch < > > arafa...@gmail.com> wrote: > > > >> Can you tell us a bit more about the business case? Not the current > >> technical one. Because it is entirely possible Solr can solve the > >> higher level problem out of the box without you doing manual term > >> comparisons. In which case, your problem scope is not quite right. > >> > >> Regards, > >> Alex. > >> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > >> http://www.solr-start.com/ > >> > >> > >> On 10 September 2015 at 09:58, Francisco Andrés Fernández > >> wrote: > >> > Hi all, I'm new to Solr. > >> > I want to detect all occurrences of terms existing in a thesaurus in 1 > >> or > >> > more documents. > >> > What's the best strategy to do it? > >> > Doing a query for each term doesn't seem to be the best way. > >> > Many thanks, > >> > > >> > Francisco > >> >
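As a rough illustration of the copyField + KeepWordFilterFactory idea above: once the dictionary-only field is populated, the matched terms can be read back with a plain facet call. A minimal SolrJ sketch, assuming a Solr 5.x client and made-up names for the core (leaflets) and the copyField target (dict_terms); the facet counts are the number of leaflets each dictionary term appears in:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DictionaryTermCounts {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/leaflets");
    // dict_terms only ever holds tokens kept by KeepWordFilterFactory, so faceting
    // on it lists exactly the dictionary terms that occur in the corpus
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(0);
    query.setFacet(true);
    query.addFacetField("dict_terms");
    query.setFacetMinCount(1);
    QueryResponse resp = client.query(query);
    for (FacetField.Count c : resp.getFacetField("dict_terms").getValues()) {
      System.out.println(c.getName() + " -> " + c.getCount());
    }
    client.close();
  }
}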
Re: Solr query which return only those docs whose all tokens are from given list
Hi Naresh, Couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), i.e.: tags:T1 tags:T2 tags:T3 -sujit On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote: Hi all, Also asked this here: http://stackoverflow.com/questions/30166116 For example I have Solr docs in which a tags field is indexed: Doc1 - tags:T1 T2 Doc2 - tags:T1 T3 Doc3 - tags:T1 T4 Doc4 - tags:T1 T2 T3 Query1: get all docs with tags:T1 AND tags:T3 - this works and will give Doc2 and Doc4. Query2: get all docs whose tags must all be from this list [T1, T2, T3]. Expected is: Doc1, Doc2, Doc4. How do I model Query2 in Solr? Please help me with this.
Re: Proximity Search
Hi Vijay, I haven't tried this myself, but perhaps you could build each phrase as a SpanNearQuery (a PhraseQuery is not a SpanQuery, so it can't be nested directly) and then connect the two up with an outer SpanNearQuery? Something like this (using your original example).

List<SpanQuery> words1 = new ArrayList<SpanQuery>();
for (String word : "this is phrase 1".split(" ")) {
  words1.add(new SpanTermQuery(new Term("my_field", word)));
}
SpanQuery p1 = new SpanNearQuery(words1.toArray(new SpanQuery[0]), 0, true);

List<SpanQuery> words2 = new ArrayList<SpanQuery>();
for (String word : "this is the second phrase".split(" ")) {
  words2.add(new SpanTermQuery(new Term("my_field", word)));
}
SpanQuery p2 = new SpanNearQuery(words2.toArray(new SpanQuery[0]), 0, true);

SpanQuery q = new SpanNearQuery(new SpanQuery[] {p1, p2}, 4, true);

-sujit On Thu, Apr 30, 2015 at 10:04 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks Rajani. I could get proximity search to work for individual words. However, I still could not make it work for two phrases, each containing more than a word. Also, results seem to be unexpected for proximity queries with wildcards. Thanks Regards Vijay On 30 April 2015 at 15:19, Rajani Maski rajani.ma...@lucidworks.com wrote: Hi Vijaya, I just quickly tried proximity search with the example set shipped with Solr 5 and it looked like it was working for me. Perhaps what you could do is debug the query by enabling debugQuery=true. Here are the steps that I tried. (Assuming you are on Solr 5, though this term proximity functionality should work for 4.x versions too.) 1. Go to the Solr 5.0 downloaded folder and navigate to bin. Rajanis-MacBook-Pro:solr-5.0.0 rajanishivarajmaski$ bin/solr -e techproducts 2. Execute the below query. The field name has the value "Test" with some GB18030 encoded characters and you search for name:"Test GB18030"~10 http://localhost:8983/solr/techproducts/select?q=name:"Test GB18030"~10&wt=json&indent=true Image: http://postimg.org/image/bjkbufsph/ On Thu, Apr 30, 2015 at 7:14 PM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: I just tried a simple proximity search like "word1 word2"~3 and it is not working. Just wondering whether I have to make any configuration changes to solrconfig.xml to make proximity search work? Thanks Vijay On 30 April 2015 at 14:32, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I have created my index with the default configurations. Now I am trying to use proximity search. However, I am a bit unsure about the results and where it's going wrong. For example, I want to find two phrases "this is phrase one" and another phrase "this is the second phrase" with not more than a proximity distance of 4 words in between them. The query syntax I am using is ("this is phrase one") ("this is the second phrase")~4 However, the results I am getting are similar to an OR operation. Can anyone please let me know whether the syntax is correct? Also, please let me know how to implement proximity search using the SolrJ Query API? Thanks Regards Vijay
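On the SolrJ question at the end of this thread: the proximity syntax is just part of the query string, so nothing special is needed on the client side. A minimal sketch using the Solr 5.x SolrJ client and the techproducts example from above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ProximityQueryExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    // the proximity syntax is just part of the q string; ~10 = within 10 positions
    SolrQuery query = new SolrQuery("name:\"Test GB18030\"~10");
    query.set("debugQuery", "true");   // handy for checking how the query was parsed
    QueryResponse resp = client.query(query);
    System.out.println(resp.getResults().getNumFound() + " matches");
    client.close();
  }
}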
Re: Enrich search results with external data
Hi Ha, Yes, I think if you want to facet on the external field, the custom component seems to be the best option IMO. -sujit On Fri, Apr 17, 2015 at 3:02 PM, ha.p...@arvatosystems.com wrote: Hi Sujit, Many thanks for your blog post, responding to my question, and suggesting the alternative option ☺ I think I prefer your approach because we can supply our own Comparator. The reason is that we need to meet some strict requirements: we can only call the external system once to retrieve extra fields (price, inventory, etc.) for probably a subset of the search result. Therefore we need to be able to sort and facet on a list of items, some of which may not have external fields. I think using the Comparator would help with the sorting, but let me know if you have different ideas. Do you have a suggestion on how we should deal with the facet requirement? I am thinking about adding another Facet Component that will be executed after the standard FacetComponent. Let me know if you think we should consider other options. Thanks, -Ha -Original Message- From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal Sent: Saturday, April 11, 2015 10:23 AM To: solr-user@lucene.apache.org; Ahmet Arslan Subject: Re: Enrich search results with external data Hi Ha, I am the author of the blog post you mention. To your question, I don't know if the code will work without change (since the Lucene/Solr API has evolved so much over the last few years), but a more preferred way, using Function Queries, may be found in the slides for Timothy Potter's talk here: http://www.slideshare.net/thelabdude/boosting-documents-in-solr-lucene-revolution-2011 Here he speaks of external fields stored in a database and accessed using a custom component (rather than from a flat file as in ExternalFileField), and of using function queries to influence the ranking based on the external field. However, per this document on function queries, you can use the output of a function query to sort as well, by passing the function to the sort parameter. https://wiki.apache.org/solr/FunctionQuery#Sort_By_Function Hope this helps, Sujit On Fri, Apr 10, 2015 at 10:38 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Why don't you include/add/index those additional fields, at least the one used in sorting? Also, you may find https://stanbol.apache.org/docs/trunk/components/enhancer/ relevant. Ahmet On Saturday, April 11, 2015 1:04 AM, ha.p...@arvatosystems.com ha.p...@arvatosystems.com wrote: This ticket seems to address the problem I have https://issues.apache.org/jira/browse/SOLR-1566 and as the result of that ticket, DocTransformer was added in Solr 4.0. I wrote a simple DocTransformer and found that the transformer is executed AFTER pagination. In our application, we need the external fields added before sorting/pagination. I've looked around for an option to change the execution order but haven't had any luck. Does anyone know the solution? The ticket also states it is not possible for components to add fields to outgoing documents which are not in the stored fields of the document. Does anyone know if this is still true? Thanks, -Ha -Original Message- From: Pham, Ha Sent: Thursday, April 09, 2015 11:41 PM To: solr-user@lucene.apache.org Subject: Enrich search results with external data Hi everyone, We have a requirement to append external data (e.g. price/inventory of a product, retrieved from an ERP via web services) to query results and support sorting and pagination based on those external fields.
For example if Solr returns 100 records and the page size the user selects is 20, the sorting on the external fields should still be over all 100 records. This limits us from enriching search results outside of Solr. I guess this is a common problem, so hopefully someone could share their experience. I am considering using a PostFilter and enriching documents in the collect() method as below:

@Override
public void collect(int docId) throws IOException {
  DoubleField price = new DoubleField("PRICE", 1.23, Field.Store.YES);
  Document currentDoc = context.reader().document(docId);
  currentDoc.add(price);
}

but the result documents don't have "PRICE" fields. Did I miss anything here? I also did some research and it seems the approach mentioned here http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html is close to what we need to achieve, but since that post is 4 years old, I don't know if there's a better approach for our problem (we are using Solr 5.0)? Thanks, -Ha
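To make the Sort_By_Function pointer above concrete: if the external value can be exposed to Solr as something usable in a function query (for example an ExternalFileField that is periodically refreshed from the ERP, or the custom component from the slides), the sort can then be pushed into Solr and pagination works over the full result set. A hedged SolrJ sketch; the core name products and field name price_ext are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SortByExternalPrice {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/products");
    SolrQuery query = new SolrQuery("category:books");
    // sort by a function over the externally maintained field, then paginate as usual
    query.addSort("field(price_ext)", SolrQuery.ORDER.desc);
    query.setStart(0);
    query.setRows(20);
    QueryResponse resp = client.query(query);
    resp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
    client.close();
  }
}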
Re: Enrich search results with external data
Hi Ha, I am the author of the blog post you mention. To your question, I don't know if the code will work without change (since the Lucene/Solr API has evolved so much over the last few years), but a more preferred way, using Function Queries, may be found in the slides for Timothy Potter's talk here: http://www.slideshare.net/thelabdude/boosting-documents-in-solr-lucene-revolution-2011 Here he speaks of external fields stored in a database and accessed using a custom component (rather than from a flat file as in ExternalFileField), and of using function queries to influence the ranking based on the external field. However, per this document on function queries, you can use the output of a function query to sort as well, by passing the function to the sort parameter. https://wiki.apache.org/solr/FunctionQuery#Sort_By_Function Hope this helps, Sujit On Fri, Apr 10, 2015 at 10:38 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Why don't you include/add/index those additional fields, at least the one used in sorting? Also, you may find https://stanbol.apache.org/docs/trunk/components/enhancer/ relevant. Ahmet On Saturday, April 11, 2015 1:04 AM, ha.p...@arvatosystems.com ha.p...@arvatosystems.com wrote: This ticket seems to address the problem I have https://issues.apache.org/jira/browse/SOLR-1566 and as the result of that ticket, DocTransformer was added in Solr 4.0. I wrote a simple DocTransformer and found that the transformer is executed AFTER pagination. In our application, we need the external fields added before sorting/pagination. I've looked around for an option to change the execution order but haven't had any luck. Does anyone know the solution? The ticket also states it is not possible for components to add fields to outgoing documents which are not in the stored fields of the document. Does anyone know if this is still true? Thanks, -Ha -Original Message- From: Pham, Ha Sent: Thursday, April 09, 2015 11:41 PM To: solr-user@lucene.apache.org Subject: Enrich search results with external data Hi everyone, We have a requirement to append external data (e.g. price/inventory of a product, retrieved from an ERP via web services) to query results and support sorting and pagination based on those external fields. For example if Solr returns 100 records and the page size the user selects is 20, the sorting on the external fields should still be over all 100 records. This limits us from enriching search results outside of Solr. I guess this is a common problem, so hopefully someone could share their experience. I am considering using a PostFilter and enriching documents in the collect() method as below:

@Override
public void collect(int docId) throws IOException {
  DoubleField price = new DoubleField("PRICE", 1.23, Field.Store.YES);
  Document currentDoc = context.reader().document(docId);
  currentDoc.add(price);
}

but the result documents don't have "PRICE" fields. Did I miss anything here? I also did some research and it seems the approach mentioned here http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html is close to what we need to achieve, but since that post is 4 years old, I don't know if there's a better approach for our problem (we are using Solr 5.0)? Thanks, -Ha
Re: Get the new terms of fields since last update
Hi Ludovic, A bit late to the party, sorry, but here is a bit of a riff off Eric's idea. Why not store the previous terms in a Bloom filter, and once you get the terms from this week, check to see which of them are not in the set - those are your new terms. Once you find the new terms, add them to the Bloom filter as well. Bloom filters are space efficient; by increasing the false positive rate you can make the filter consume even less space (more keys hash to the same element), and that is fine here since you are only concerned with finding whether something is definitely not in the set. -sujit On Fri, Dec 5, 2014 at 7:21 AM, lboutros boutr...@gmail.com wrote: The Apache Solr community is sooo great! Interesting problem with 3 interesting answers in less than 2 hours! Thank you all, really. Erik, I'm already saving the billion terms each week. It's hard to diff 1 billion terms. I'm already rebuilding the whole dictionaries each week in a custom distributed terms query handler. I'm saving the result in MongoDB in order to scroll through it quickly with the term position in the dictionary. It takes 3-4 hours each week. Now I would like to update the result in order to do it faster. Alex, I will check, this seems to be a good idea. Is it possible to filter terms with payloads in index readers? I did not see anything like that in my first investigation. I suppose it would take some additional disk space. Michael, this is the easiest way to do it. You are right. But I'm not sure that indexing twice and updating the dictionaries would be faster than the current process. But it is worth it to do some math ;) Ludovic. - Jouve France.
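A rough sketch of that idea using Guava's BloomFilter (Guava is an assumption here; any Bloom filter implementation would do). Note that a billion entries, even at a loose 3% false-positive rate, still needs on the order of a gigabyte of memory:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class NewTermDetector {
  // terms already seen in previous weeks, sized for ~1B entries with a loose
  // false-positive rate to keep the memory footprint down
  private final BloomFilter<CharSequence> seen =
      BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 1000000000, 0.03);

  public void addPrevious(Iterable<String> previousTerms) {
    for (String t : previousTerms) {
      seen.put(t);
    }
  }

  public void processThisWeek(Iterable<String> thisWeeksTerms) {
    for (String t : thisWeeksTerms) {
      if (!seen.mightContain(t)) {   // "definitely not seen before" -> a new term
        handleNewTerm(t);
        seen.put(t);
      }
    }
  }

  private void handleNewTerm(String term) {
    System.out.println("new term: " + term);
  }
}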
Re: What's the most efficient way to sort by number of terms matched?
Hi Trey, In an application I built a few years ago, I had a component that rewrote the input query into a Lucene BooleanQuery and we would set the minimumNumberShouldMatch value for the query. Worked well, but lately we are trying to move away from writing our own custom components, since maintaining them across releases becomes a bit of a chore. So lately we simulate this behavior in the client by constructing progressively smaller n-grams, OR'ing them, and then sending that to Solr. For your example, it becomes something like this: (python AND solr AND hadoop) OR (python AND solr) OR (solr AND hadoop) OR (python AND hadoop) OR (python) OR (solr) OR (hadoop). -sujit On Thu, Nov 6, 2014 at 7:25 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Trey, Not exactly the same, but we did something similar with (e)dismax's mm parameter, by auto-relaxing it. In your example, try with mm=3; if numFound < 20 then try with mm=2, etc. Ahmet On Thursday, November 6, 2014 8:41 AM, Trey Grainger solrt...@gmail.com wrote: Just curious if there are some suggestions here. The use case is fairly simple: Given a query like python OR solr OR hadoop, I want to sort results by number of keywords matched first, and by relevancy separately. I can think of ways to do this, but not efficiently. For example, I could do: q=python OR solr OR hadoop&p1=python&p2=solr&p3=hadoop&sort=sum(if(query($p1,0),1,0),if(query($p2,0),1,0),if(query($p3,0),1,0)) desc, score desc Other than the obvious downside that this requires me to pre-parse the user's query, it's also somewhat inefficient to run the query function once for each term in the original query, since it is re-executing multiple queries and looping through every document in the index during scoring. Ideally, I would be able to do something like the below, which could just pull the count of unique matched terms from the main query (q parameter) execution: q=python OR solr OR hadoop&sort=uniquematchedterms() desc,score desc. I don't think anything like this exists, but would love some suggestions if anyone else has solved this before. Thanks, -Trey
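A small sketch of the client-side rewrite described above - generate every AND group from largest to smallest and OR them together. Documents matching the larger groups match more clauses and therefore tend to score higher:

import java.util.ArrayList;
import java.util.List;

public class CombinationQueryBuilder {
  // For ["python", "solr", "hadoop"] this produces the progressively smaller
  // AND groups OR'ed together, largest groups first, as in the example above.
  public static String build(List<String> terms) {
    List<String> clauses = new ArrayList<>();
    int n = terms.size();
    for (int size = n; size >= 1; size--) {
      for (int mask = 0; mask < (1 << n); mask++) {
        if (Integer.bitCount(mask) != size) continue;
        List<String> subset = new ArrayList<>();
        for (int i = 0; i < n; i++) {
          if ((mask & (1 << i)) != 0) subset.add(terms.get(i));
        }
        clauses.add("(" + String.join(" AND ", subset) + ")");
      }
    }
    return String.join(" OR ", clauses);
  }
}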
Re: Query on Facet
Hi Smitha, Have you looked at facet queries? They allow you to attach Solr queries to facets. The problem with this is that you will need to know all possible combinations of language and binding (or make an initial query to find this information). https://wiki.apache.org/solr/SimpleFacetParameters#facet.query_:_Arbitrary_Query_Faceting Another alternative could be to bake language+binding pairs into a field in your index and facet on that. -sujit On Wed, Jul 30, 2014 at 7:01 AM, vamshi kiran mothevamshiki...@gmail.com wrote: Hi Alex, As you said, if we exclude the language facet field, it will get all the language facets with counts, right? It will not filter by the binding facet field of type 'paperback'; how can we do this? Thanks Regards, Vamshi. On Jul 30, 2014 4:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I am not sure I fully understood your question, but I would start by looking at Tagging and Excluding first: https://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On Wed, Jul 30, 2014 at 5:07 PM, Smitha Rajiv smitharaji...@gmail.com wrote: Hi, I need some help on Solr faceting. How do I facet on two fields at the same time to get combination facets and their counts? I'm using the below query to get facets with a combination of language and its binding. But right now I'm getting only the selected facet in the facet list of each field, with its count. For e.g. in language facets the query is returning English and its count. Instead I need to get the other language facets which satisfy the binding type of paperback: http://localhost:8080/solr/collection1/select?q=software%20testing&fq=language%3A(%22English%22)&fq=Binding%3A(%22paperback%22)&facet=true&facet.mincount=1&facet.field=Language&facet.field=latestArrivals&facet.field=Binding&wt=json&indent=true&defType=edismax&json.nl=map Please provide me your inputs. Thanks Regards, Smitha
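A quick SolrJ sketch of the facet.query option; the "Hindi" value is just a stand-in for whichever language+binding combinations you enumerate (or discover with an initial query):

import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CombinationFacets {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
    SolrQuery query = new SolrQuery("software testing");
    query.set("defType", "edismax");
    query.setFacet(true);
    query.setFacetMinCount(1);
    // one facet.query per language+binding combination you care about
    query.addFacetQuery("Language:\"English\" AND Binding:\"paperback\"");
    query.addFacetQuery("Language:\"Hindi\" AND Binding:\"paperback\"");
    QueryResponse resp = server.query(query);
    Map<String, Integer> counts = resp.getFacetQuery();   // facet query string -> count
    System.out.println(counts);
    server.shutdown();
  }
}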
Re: Implementing custom analyzer for multi-language stemming
Hi Eugene, In a system we built couple of years ago, we had a corpus of English and French mixed (and Spanish on the way but that was implemented by client after we handed off). We had different fields for each language. So (title, body) for English docs was (title_en, body_en), for French (title_fr, body_fr) and for Spanish (title_es, body_es) - each of these were associated with a different Analyzer (that was associated with the field types in schema.xml, in case of Lucene you can use PerFieldAnalyzerWrapper). Our pipeline used Google translate to detect the language and write the contents into the appropriate field set for the language. Our analyzers were custom - but Lucene/Solr provides analyzer chains for many major languages. You can find a list here: https://wiki.apache.org/solr/LanguageAnalysis -sujit On Wed, Jul 30, 2014 at 10:52 AM, Chris Morley ch...@depahelix.com wrote: I know BasisTech.com has a plugin for elasticsearch that extends stemming/lemmatization to work across 40 natural languages. I'm not sure what they have for Solr, but I think something like that may exist as well. Cheers, -Chris. From: Eugene beyondcomp...@gmail.com Sent: Wednesday, July 30, 2014 1:48 PM To: solr-user@lucene.apache.org Subject: Implementing custom analyzer for multi-language stemming Hello, fellow Solr and Lucene users and developers! In our project we receive text from users in different languages. We detect language automatically and use Google Translate APIs a lot (so having arbitrary number of languages in our system doesn't concern us). However we need to be able to search using stemming. Having nearly hundred of fields (several fields for each language with language-specific stemmers) listed in our search query is not an option. So we need a way to have a single index which has stemmed tokens for different languages. I have two questions: 1. Are there already (third-party) custom multi-language stemming analyzers? (I doubt that no one else ran into this issue) 2. If I'm going to implement such analyzer myself, could you please suggest a better way to 'pass' detected language value into such analyzer? Detecting language in analyzer itself is not an option, because: a) we already detect it in other place b) we do it based on combined values of many fields ('name', 'topic', 'description', etc.), while current field can be to short for reliable detection c) sometimes we just want to specify language explicitly. The obvious hack would be to prepend ISO 639-1 code to field value. But I'd like to believe that Solr allows for cleaner solution. I could think about either: a) custom query parameter (but I guess, it will require modifying request handlers, etc. which is highly undesirable) b) getting value from other field (we obviously have 'language' field and we do not have mixed-language records). If it is possible, could you please describe the mechanism for doing this or point to relevant code examples? Thank you very much and have a good day!
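For the Lucene route mentioned above, a minimal PerFieldAnalyzerWrapper sketch (assuming a recent Lucene 5.x where the Version-less analyzer constructors exist; the field names follow the _en/_fr convention described in this thread):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class PerLanguageFields {
  public static void main(String[] args) {
    Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
    perField.put("title_en", new EnglishAnalyzer());
    perField.put("body_en", new EnglishAnalyzer());
    perField.put("title_fr", new FrenchAnalyzer());
    perField.put("body_fr", new FrenchAnalyzer());
    // anything not listed falls back to the default analyzer
    Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    // pass iwc to an IndexWriter as usual
  }
}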
Re: Any Solrj API to obtain field list?
Have you looked at IndexSchema? That would offer you methods to query index metadata. http://lucene.apache.org/solr/4_7_2/solr-core/org/apache/solr/schema/IndexSchema.html -sujit On Tue, May 27, 2014 at 1:56 PM, T. Kuro Kurosaka k...@healthline.com wrote: I'd like to write Solr client code that writes text to a language-specific field, say, myfield_es for Spanish, if the field myfield_es is defined in schema.xml, and otherwise to a fall-back field myfield. To do this, I need to obtain a list of defined fields (and dynamic fields) from the server. But I cannot find a suitable SolrJ API. Is there any? I'm using Solr 4.6.1. I could write code to use the Schema REST API (https://wiki.apache.org/solr/SchemaRESTAPI) but I would much prefer to use existing code if it exists. -- T. Kuro Kurosaka • Senior Software Engineer
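Another option that already works with 4.x SolrJ (so it applies to 4.6.1) is the Luke request handler, which reports the fields the server knows about. A hedged sketch; /admin/luke reflects the index plus (optionally) the schema, and the core URL is an assumption:

import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class ListFields {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    LukeRequest request = new LukeRequest();
    request.setNumTerms(0);               // we only want field names, not term stats
    LukeResponse response = request.process(server);
    for (Map.Entry<String, LukeResponse.FieldInfo> e : response.getFieldInfo().entrySet()) {
      System.out.println(e.getKey() + " (" + e.getValue().getType() + ")");
    }
    server.shutdown();
  }
}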
Re: How to apply Semantic Search in Solr
Hi Sohan, Given you have 15 days and this looks like a class project, I would suggest going with John Berryman's approach - he also provides code which you can just apply to your data. Even if you don't get the exact expansions you desire, I think you will get results that will pleasantly surprise you :-). -sujit On Mon, Mar 10, 2014 at 11:07 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hey Sujit, thanks a lot. But what do you think about the Berryman blog post? Is it feasible to apply, or should I apply the synonym stuff? Which one is good? And the 3rd approach you told me about seems difficult and time-consuming for students like me, as I will have to submit this in the next 15 days. Please suggest me something. On Tue, Mar 11, 2014 at 5:12 AM, Sujit Pal sujit@comcast.net wrote: Hi Sohan, You would be the best person to answer your question of how to proceed :-). For your original query "musical events in New York" to be rewritten to "musical nights at ABC place" OR "concerts events" OR "classical music event", you would have to build into your knowledge base that ABC place is a synonym for New York, and that "musical event at New York" is a synonym for "concerts events" and "classical music event". You can do this using approach #1 (from the Berryman blog post) and approach #2 (my first suggestion), but these results are not guaranteed - because your corpus may not contain this relationship. Approach #3 (my second suggestion) involves lots of work and possibly domain knowledge, but gives much cleaner relationships. OTOH, you could get away with it for this one query by adding the three queries into your synonyms.txt and enabling synonym support in Solr. http://stackoverflow.com/questions/18790256/solr-synonym-not-working So how much effort you put into supporting this feature would be dictated by how important it is to your environment - that is a question only you can answer. -sujit On Sun, Mar 9, 2014 at 11:26 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Thanks Sujit and all for your views about semantic search in Solr. But how do I proceed, I mean how do I start things off to get on track? On Sat, Mar 8, 2014 at 10:50 PM, Sujit Pal sujit@comcast.net wrote: Thanks for sharing this link Sohan, it's an interesting approach. Since you have effectively defined what you mean by Semantic Search, there are a couple of other approaches I know of to do something like this: 1) preprocess your documents looking for terms that co-occur in the same document. The more such co-occurrences you find, the more strongly these terms are related (this can help with ordering related terms from most related to least related). At query time expand the query to include the /most/ related concepts and search. 2) use an external knowledge base such as a taxonomy that indicates relationships between concepts (this is the approach we use). At query time expand the query to include related concepts and search. -sujit On Sat, Mar 8, 2014 at 8:21 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Basically, when I searched it on Google I got this result: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ And I am working on this. So is this useful? On Sat, Mar 8, 2014 at 3:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And how would it know to give you those results? Obviously, you have some sort of magic/algorithm in your mind. Are you doing geographic location match, category match, synonyms match? We can't really help with generic questions.
You still need to figure out what semantic means for you specifically. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sat, Mar 8, 2014 at 4:27 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hello, I am working on an event listing and promotions website( http://allevents.in) and I want to apply semantic search on solr. For example, if someone search : Musical Events in New York So it would give me results such as : * Musical Night at ABC place * Concerts Events * Classical Music Event I mean all results should be Semantic to the Search_Query it should not give the results only based on tf-idf. So can you please make me understand how do i proceed to apply Semantic Search in Solr. ( allevents.in
Re: How to apply Semantic Search in Solr
Hi Sohan, You would be the best person to answer your question of how to proceed :-). For your original query "musical events in New York" to be rewritten to "musical nights at ABC place" OR "concerts events" OR "classical music event", you would have to build into your knowledge base that ABC place is a synonym for New York, and that "musical event at New York" is a synonym for "concerts events" and "classical music event". You can do this using approach #1 (from the Berryman blog post) and approach #2 (my first suggestion), but these results are not guaranteed - because your corpus may not contain this relationship. Approach #3 (my second suggestion) involves lots of work and possibly domain knowledge, but gives much cleaner relationships. OTOH, you could get away with it for this one query by adding the three queries into your synonyms.txt and enabling synonym support in Solr. http://stackoverflow.com/questions/18790256/solr-synonym-not-working So how much effort you put into supporting this feature would be dictated by how important it is to your environment - that is a question only you can answer. -sujit On Sun, Mar 9, 2014 at 11:26 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Thanks Sujit and all for your views about semantic search in Solr. But how do I proceed, I mean how do I start things off to get on track? On Sat, Mar 8, 2014 at 10:50 PM, Sujit Pal sujit@comcast.net wrote: Thanks for sharing this link Sohan, it's an interesting approach. Since you have effectively defined what you mean by Semantic Search, there are a couple of other approaches I know of to do something like this: 1) preprocess your documents looking for terms that co-occur in the same document. The more such co-occurrences you find, the more strongly these terms are related (this can help with ordering related terms from most related to least related). At query time expand the query to include the /most/ related concepts and search. 2) use an external knowledge base such as a taxonomy that indicates relationships between concepts (this is the approach we use). At query time expand the query to include related concepts and search. -sujit On Sat, Mar 8, 2014 at 8:21 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Basically, when I searched it on Google I got this result: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ And I am working on this. So is this useful? On Sat, Mar 8, 2014 at 3:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And how would it know to give you those results? Obviously, you have some sort of magic/algorithm in your mind. Are you doing geographic location match, category match, synonyms match? We can't really help with generic questions. You still need to figure out what semantic means for you specifically. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sat, Mar 8, 2014 at 4:27 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hello, I am working on an event listing and promotions website (http://allevents.in) and I want to apply semantic search on Solr. For example, if someone searches: Musical Events in New York it would give results such as: * Musical Night at ABC place * Concerts Events * Classical Music Event I mean all the results should be semantically related to the Search_Query; it should not give the results only based on tf-idf.
So can you please make me understand how do i proceed to apply Semantic Search in Solr. ( allevents.in) -- Regards, *Sohan Kalsariya* -- Regards, *Sohan Kalsariya* -- Regards, *Sohan Kalsariya*
Re: How to apply Semantic Search in Solr
Thanks for sharing this link Sohan, it's an interesting approach. Since you have effectively defined what you mean by Semantic Search, there are a couple of other approaches I know of to do something like this: 1) preprocess your documents looking for terms that co-occur in the same document. The more such co-occurrences you find, the more strongly these terms are related (this can help with ordering related terms from most related to least related). At query time expand the query to include the /most/ related concepts and search. 2) use an external knowledge base such as a taxonomy that indicates relationships between concepts (this is the approach we use). At query time expand the query to include related concepts and search. -sujit On Sat, Mar 8, 2014 at 8:21 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Basically, when I searched it on Google I got this result: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ And I am working on this. So is this useful? On Sat, Mar 8, 2014 at 3:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And how would it know to give you those results? Obviously, you have some sort of magic/algorithm in your mind. Are you doing geographic location match, category match, synonyms match? We can't really help with generic questions. You still need to figure out what semantic means for you specifically. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sat, Mar 8, 2014 at 4:27 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hello, I am working on an event listing and promotions website (http://allevents.in) and I want to apply semantic search on Solr. For example, if someone searches: Musical Events in New York it would give results such as: * Musical Night at ABC place * Concerts Events * Classical Music Event I mean all the results should be semantically related to the Search_Query; it should not give the results only based on tf-idf. So can you please make me understand how do I proceed to apply Semantic Search in Solr. (allevents.in) -- Regards, *Sohan Kalsariya* -- Regards, *Sohan Kalsariya*
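A small sketch of approach (1) above - count within-document term co-occurrence and use the top co-occurring terms of each query term as expansion candidates. The tokenization/normalization step is assumed to have happened already:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CooccurrenceCounter {
  // tokenizedDocs: one list of (normalized) terms per document
  public static Map<String, Map<String, Integer>> count(List<List<String>> tokenizedDocs) {
    Map<String, Map<String, Integer>> cooc = new HashMap<>();
    for (List<String> docTerms : tokenizedDocs) {
      Set<String> unique = new HashSet<>(docTerms);
      for (String a : unique) {
        for (String b : unique) {
          if (a.equals(b)) continue;
          cooc.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
        }
      }
    }
    // cooc.get(term) maps co-occurring terms to counts; sort by count to rank expansions
    return cooc;
  }
}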
Re: Multivalued true Error?
Hi Furkan, In the stock definition of the payload field: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup the analyzer for the payloads field type is a WhitespaceTokenizerFactory followed by a DelimitedPayloadTokenFilterFactory. So if you send it a string "foo$score1 bar$score2 ..." where foo and bar are string tokens, score1 and score2 are payload scores, and $ is your delimiter, the analyzer will tokenize it into multiple tokens with payloads, and you should be able to run the tests in the blog post. So you shouldn't make it multiValued, AFAIK. -sujit On Tue, Nov 26, 2013 at 8:44 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I've ported this example from Scala into Java: http://sujitpal.blogspot.com/2013/07/porting-payloads-to-solr4.html#! However, should the field be multiValued=true in that example? PS: I use Solr 4.5.1 Thanks; Furkan KAMACI
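A short SolrJ sketch of what the indexing side of that looks like (Solr 4.x era client; the field name payloads, the $ delimiter, and the token/score values are all just examples):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PayloadIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc1");
    // one single-valued string; the field's analyzer splits on whitespace and
    // peels the payload off each token at the '$' delimiter
    doc.addField("payloads", "strength$0.75 dosage$0.25");
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}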
Re: Why do people want to deploy to Tomcat?
In our case, it is because all our other applications are deployed on Tomcat and ops is familiar with the deployment process. We also had customizations that needed to go in, so we inserted our custom JAR into the solr.war's WEB-INF/lib directory, so to ops the process of deploying Solr was (almost, except for schema.xml or solrconfig.xml changes) identical to any of the other apps. But I think if Solr becomes a server with clearly defined extension points (such as dropping your custom JARs into lib/ and custom configuration in conf/solrconfig.xml or similar like it already is) then it will be treated as something other than a webapp and the expectation that it runs on Tomcat will not apply. Just my $0.02... Sujit On Tue, Nov 12, 2013 at 9:13 AM, Siegfried Goeschl sgoes...@gmx.at wrote: Hi ALex, in my case * ignorance that Tomcat is not fully supported * Tomcat configuration and operations know-how inhouse * could migrate to Jetty but need approved change request to do so Cheers, Siegfried Goeschl On 12.11.13 04:54, Alexandre Rafalovitch wrote: Hello, I keep seeing here and on Stack Overflow people trying to deploy Solr to Tomcat. We don't usually ask why, just help when where we can. But the question happens often enough that I am curious. What is the actual business case. Is that because Tomcat is well known? Is it because other apps are running under Tomcat and it is ops' requirement? Is it because Tomcat gives something - to Solr - that Jetty does not? It might be useful to know. Especially, since Solr team is considering making the server part into a black box component. What use cases will that break? So, if somebody runs Solr under Tomcat (or needed to and gave up), let's use this thread to collect this knowledge. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Solr language-dependent sort
Hi Lisheng, We did something similar in Solr using a custom handler (but I think you could just build a custom QueryParser to do this), but you could do this in your application as well, i.e., get the language and then rewrite your query to use the language-specific fields. Come to think of it, the QueryParser would probably be sufficiently general to qualify as a patch for custom functionality. -sujit On Apr 8, 2013, at 12:28 PM, Zhang, Lisheng wrote: Hi, I found that in Solr we need to define a special fieldType for each language (http://wiki.apache.org/solr/UnicodeCollation), then point a field to this type. But in our application one field (like 'title') can be used by various users for their own languages (user1 uses it for English, user2 uses it for Japanese, ...), so it is even difficult for us to use a dynamic field. We would prefer to pass in a parameter like language='en' at run time, and then the Solr API could use this parameter to call the Lucene API to sort a field. This approach would be much more flexible (we programmed this way when using Lucene directly)? Thanks very much for your help, Lisheng
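A tiny sketch of the application-side rewrite suggested above; the per-language collated sort fields (title_sort_en, title_sort_ja, ...) are assumed to be defined in schema.xml with the appropriate collation field types:

import org.apache.solr.client.solrj.SolrQuery;

public class LanguageAwareSort {
  // lang comes in as a request parameter ("en", "ja", ...)
  public static SolrQuery build(String userQuery, String lang) {
    SolrQuery query = new SolrQuery(userQuery);
    query.addSort("title_sort_" + lang, SolrQuery.ORDER.asc);
    return query;
  }
}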
Re: Solr Sorting is not working properly on long Fields
Hi ballusethuraman, I am sure you have done this already, but just to be sure, did you reindex your existing kilometer data after you changed the data type from string to long? If not, then you should. -sujit On Mar 23, 2013, at 11:21 PM, ballusethuraman wrote: Hi, I am having a column named 'Kilometers' and when I try to sort with that it is not working properly.The values in 'Kilometers' column are,Kilometers171119792365611Values in 'Kilometers' after soting are Kilometers979236561117111The Problem here is, when 97 is compared with 923 it is taking 97 as bigger number since 97 is greater than 923. Initially Kilometers column was having string as datatype and I thought the problem could be because of that and i changed the datatype of that column to 'long'. Even then i couldn't see any change in the results.But when I insert values which are having same number of digits say, 1, 2, 3,4,5Kilometers21452 when i try to sort now it is working perfectlyKilometers12345Datatypes that I have tries are, Can anyone helpme to get rid out of this problem... Thanks in Advance -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Sorting-is-not-working-properly-on-long-Fields-tp4050833.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Matching an exact word
You could also do this outside Solr, in your client. If your query is surrounded by quotes, then strip away the quotes and make q=text_exact_field:your_unquoted_query. Probably better to do it outside Solr in general, keeping in mind the upgrade path. -sujit On Feb 21, 2013, at 12:20 PM, Van Tassell, Kristian wrote: Thank you. So essentially I need to write a custom query parser (extending upon something like the QParser)? -Original Message- From: Upayavira [mailto:u...@odoko.co.uk] Sent: Thursday, February 21, 2013 12:22 PM To: solr-user@lucene.apache.org Subject: Re: Matching an exact word Solr will only match on the terms as they are in the index. If it is stemmed in the index, it will match that. If it isn't, it'll match that. All term matches are (by default at least) exact matches. Only with stemming, you are doing an exact match against the stemmed term. Therefore, there really is no way to do what you are looking for within Solr. I'd suggest you'll need to do some parsing at your side and, if you find quotes, do the query against a different field. Upayavira On Thu, Feb 21, 2013, at 06:17 PM, Van Tassell, Kristian wrote: I'm trying to match the word "created". Given that it is surrounded by quotes, I would expect an exact match to occur, but instead the entire set of stemming results shows up, for words such as create, creates, created, etc.: q="created"&wt=xml&rows=1000&qf=text&defType=edismax If I copy the text field to a new one that does not stem words, text_exact for example, I get the expected results: q="created"&wt=xml&rows=1000&qf=text_exact&defType=edismax I would like the decision whether to match exactly or not to be determined by the quotes rather than the qf parameter (e.g., not have to use it at all). What topic do I need to look into more to understand this? Thanks in advance!
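A small sketch of the client-side switch suggested at the top of this thread - route quoted input to the unstemmed field and everything else to the stemmed one (the text and text_exact names follow the field names used above):

import org.apache.solr.client.solrj.SolrQuery;

public class ExactMatchSwitch {
  public static SolrQuery build(String userInput) {
    String input = userInput.trim();
    SolrQuery query = new SolrQuery();
    query.set("defType", "edismax");
    if (input.length() >= 2 && input.startsWith("\"") && input.endsWith("\"")) {
      // quoted: strip the quotes and search the unstemmed field
      query.setQuery(input.substring(1, input.length() - 1));
      query.set("qf", "text_exact");
    } else {
      query.setQuery(input);
      query.set("qf", "text");
    }
    return query;
  }
}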
Re: Can Solr analyze content and find dates and places
Hi Bart, Like I said, I didn't actually hook my UIMA stuff into Solr, content and queries are annotated before they reach Solr. What you describe sounds like a classpath problem (but of course you already knew that :-)). Since I haven't actually done what you are trying to do, here are some suggestions, they may or may not work... 1) package up the XML files into your custom JAR at the top level, that way you don't need to specify it as /RoomNumberAnnotator.xml. 2) if you are using solr4, then you should drop your custom JAR into $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib. -sujit On Feb 11, 2013, at 9:40 AM, jazz wrote: Hi Sujit and others who answered my question, I have been working on the UIMA path which seems great with the available Eclipse tooling and this: http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html Now I worked through the UIMA tutorial of the RoomNumberAnnotator: http://uima.apache.org/doc-uima-annotator.html And I am able to test it using the UIMA CAS Visual Debugger. So far so good. But now I want to use the new RoomNumberAnnotator with Solr, and it cannot find the XML file and the Java class (they are in the correct lib directories, because the WhitespaceTokenizer works fine).

<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
      </lst>
      <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">org.apache.uima.tutorial.RoomNumber</str>
          <lst name="mapping">
            <str name="feature">building</str>
            <str name="field">UIMAname</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned, but it fails: Deploy new jars inside one of the lib directories. Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima path. Is it needed to deploy the new jar (RoomAnnotator.jar)? If yes, which branch can I check out? This is the stable release I am running: Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36 Regards, Bart On 8 Feb 2013, at 22:11, SUJIT PAL wrote: Hi Bart, I did some work with UIMA but this was to annotate the data before it goes to Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe you will have to set up your own aggregate analysis chain in place of the one currently configured. Writing UIMA annotators is very simple (there is a tutorial here: [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]). You provide the XML description for the annotation and let UIMA generate the annotation bean. You write Java code for the annotator and also the annotator XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run your annotator. Overall, it sounds really complicated but it's actually quite simple. The tutorial has quite a few examples that you will find useful, but in case you need more, I have some on this github repository: [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima] The dictionary and pattern annotators may be similar to what you are looking for (date and city annotators).
Best regards, Sujit On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote: Hi Alex, Indeed that is exactly what I am trying to achieve using wordcities. Date will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do I integrate the Java library as UIMA? The documentation about changing schema.xml and solr.xml is not very detailed. Regards, Bart On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote: Hi Bart, I haven't done any UIMA work (I used other stuff for my NLP phase), so not sure I can help much further. But in general, you are venturing into pure research territory here. Even for dates, what do you actually mean? Just fixed expression? Relative dates (e.g. last tuesday?). What about times (7pm?). Same with cities. If you want it offline, you need the gazetteer and disambiguation modules. Gazetteer for cities (worldwide) is huge and has a lot of duplicate names (Paris, Ontario is apparently a short drive from London, Ontario eh?). Something like http://www.maxmind.com/en/worldcities? And disambiguation
Re: Can Solr analyze content and find dates and places
Cool! Thanks for the update, this will help if I ever go all the way with UIMA and Solr. -sujit On Feb 11, 2013, at 12:13 PM, jazz wrote: Hi Sujit, Thanks for your help! I moved the RoomNumberAnnotator.xml to the top level of the jar and used the same solrconfig.xml (with the /). Now it works perfect. Best regards, Bart On 11 Feb 2013, at 20:13, SUJIT PAL wrote: Hi Bart, Like I said, I didn't actually hook my UIMA stuff into Solr, content and queries are annotated before they reach Solr. What you describe sounds like a classpath problem (but of course you already knew that :-)). Since I haven't actually done what you are trying to do, here are some suggestions, they may or may not work... 1) package up the XML files into your custom JAR at the top level, that way you don't need to specify it as /RoomNumberAnnotator.xml. 2) if you are using solr4, then you should drop your custom JAR into $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib. -sujit On Feb 11, 2013, at 9:40 AM, jazz wrote: Hi Sujit and others who answered my question, I have been working on the UIMA path which seems great with the available Eclipse tooling and this: http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html Now I worked through the UIMA tutorial of the RoomNumberAnnotator: http://uima.apache.org/doc-uima-annotator.html And I am able to test it using the UIMA CAS Virtuall Debugger. So far so good. But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot find the xml file and the Java class (they are in the correct lib directories, because the WhitespaceTokenizer works fine). updateRequestProcessorChain name=uima processor class=org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory lst name=uimaConfig lst name=runtimeParameters /lst str name=analysisEngine/RoomNumberAnnotator.xml/str bool name=ignoreErrorsfalse/bool lst name=analyzeFields bool name=mergefalse/bool arr name=fields strcontent/str /arr /lst lst name=fieldMappings lst name=type str name=nameorg.apache.uima.tutorial.RoomNumber/str lst name=mapping str name=featurebuilding/str str name=fieldUIMAname/str /lst /lst /lst /lst /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it fails: Deploy new jars inside one of the lib directories Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima path. Is it needed to deploy the new jar (RoomAnnotator.jar)? If yes, which branch can I checkout? This is the Stable release I am running: Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36 Regards, Bart On 8 Feb 2013, at 22:11, SUJIT PAL wrote: Hi Bart, I did some work with UIMA but this was to annotate the data before it goes to Lucene/Solr, ie not built as a UpdateRequestProcessor. I just looked through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe you will have to set up your own aggregate analysis chain in place of the one currently configured. Writing UIMA annotators is very simple (there is a tutorial here: [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]). You provide the XML description for the annotation and let UIMA generate the annotation bean. You write Java code for the annotator and also the annotator XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run your annotator. Overall, sounds really complicated but its actually quite simple. 
The tutorial has quite a few examples that you will find useful, but in case you need more, I have some on this github repository: [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima] The dictionary and pattern annotators may be similar to what you are looking for (date and city annotators). Best regards, Sujit On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote: Hi Alex, Indeed that is exactly what I am trying to achieve using wordcities. Date will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do I integrate the Java library as UIMA? The documentation about changing schema.xml and solr.xml is not very detailed. Regards, Bart On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote: Hi Bart, I haven't done any UIMA work (I used other stuff for my NLP phase), so not sure I can help much further. But in general, you are venturing into pure research territory here. Even for dates, what do you actually mean? Just fixed expression? Relative dates (e.g
Re: Crawl Anywhere -
Hi Siva, You will probably get a better reply if you head over to the Nutch mailing list [http://nutch.apache.org/mailing_lists.html] and ask there. Nutch 2.1 may be what you are looking for (it stores pages in a NoSQL database). Regards, Sujit On Feb 10, 2013, at 9:16 PM, SivaKarthik wrote: Dear Erick, Thanks for your reply. Yes, Nutch can meet my requirement, but the problem is, I want to store the crawled documents in HTML or XML format instead of the MapReduce format. I am not sure whether Nutch plugins are available to convert into XML files. Please share if you have any idea. Thank you
Re: Can Solr analyze content and find dates and places
Hi Bart, I did some work with UIMA but this was to annotate the data before it goes to Lucene/Solr, ie not built as a UpdateRequestProcessor. I just looked through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe you will have to set up your own aggregate analysis chain in place of the one currently configured. Writing UIMA annotators is very simple (there is a tutorial here: [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]). You provide the XML description for the annotation and let UIMA generate the annotation bean. You write Java code for the annotator and also the annotator XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run your annotator. Overall, sounds really complicated but its actually quite simple. The tutorial has quite a few examples that you will find useful, but in case you need more, I have some on this github repository: [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima] The dictionary and pattern annotators may be similar to what you are looking for (date and city annotators). Best regards, Sujit On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote: Hi Alex, Indeed that is exactly what I am trying to achieve using wordcities. Date will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do I integrate the Java library as UIMA? The documentation about changing schema.xml and solr.xml is not very detailed. Regards, Bart On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote: Hi Bart, I haven't done any UIMA work (I used other stuff for my NLP phase), so not sure I can help much further. But in general, you are venturing into pure research territory here. Even for dates, what do you actually mean? Just fixed expression? Relative dates (e.g. last tuesday?). What about times (7pm?). Same with cities. If you want it offline, you need the gazetteer and disambiguation modules. Gazetteer for cities (worldwide) is huge and has a lot of duplicate names (Paris, Ontario is apparently a short drive from London, Ontario eh?). Something like http://www.maxmind.com/en/worldcities? And disambiguation usually requires training corpus that is similar to what your text will look like. Online services like OpenCalais are backed by gigantic databases and some serious corpus-training Machine Language disambiguation algorithms. So, no plug-and-play solution here. If you really need to get this done, I would recommend narrowing down the specification of exactly what you will settle for and looking for software that can do it. Once you have that, integration with Solr is your next - and smaller - concern. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Feb 8, 2013 at 10:41 AM, jazz jazzsa...@me.com wrote: Thanks Alex, I checked the documentation but it seems there is only a webservice (OpenCalais) available to extract dates and places. http://uima.apache.org/sandbox.html Do you know is there is a Solr Compatible UIMA add-on which detects dates and places (cities) without a webservice? If not, how do you write one? Regards, Bart On 8 Feb 2013, at 15:29, Alexandre Rafalovitch wrote: Yes, it is possible. 
You are looking at UIMA or OpenNLP integration, most probably in Update Request Processor pipeline. Have a look here as a start: https://wiki.apache.org/solr/SolrUIMA You will have to put some serious work into this, it is not all tied together and packaged. Mostly because the Natural Language Processing (the field you are getting into) is kind of messy all of its own. Good luck, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Feb 8, 2013 at 9:24 AM, jazz jazzsa...@me.com wrote: Hi, I want to know if Solr can analyze text and recoginze dates and places. If yes, is it then possible to create new dynamic fields with these dates and places (e.g. city). Thanks, Bart
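Since the question of how such an annotator is written comes up throughout these UIMA threads, here is a rough sketch of a pattern-based annotator, loosely following the RoomNumberAnnotator from the UIMA tutorial referenced above (the RoomNumber JCas class is generated from the tutorial's type descriptor; a date or city annotator would have the same shape, with a date regex or a dictionary lookup in its place):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.tutorial.RoomNumber;   // JCas class generated from the type descriptor

public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  // Yorktown-style room numbers, roughly as in the UIMA tutorial
  private static final Pattern ROOM = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Matcher m = ROOM.matcher(text);
    while (m.find()) {
      RoomNumber ann = new RoomNumber(jcas, m.start(), m.end());
      ann.setBuilding("Yorktown");
      ann.addToIndexes();   // makes the annotation visible to downstream consumers (e.g. Solr's field mapping)
    }
  }
}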
Re: Per user document exclusions
Hi Christian, Since customization is not a problem in your case, how about writing out the userId and excluded document ids to the database when it is excluded, and then for each query from the user (possibly identified by a userid parameter), lookup the database by userid, construct a NOT filter out of the excluded docIds, then send to Solr as the fq? We are using a variant of this approach to allow database style wildcard search on document titles. -sujit On Nov 18, 2012, at 9:05 PM, Christian Jensen wrote: Hi, We have a need to allow each user to 'exclude' individual documents in the results. We can easily do this now within the RDBMS using a FTS index and a query with 'OUTER LEFT JOIN WHERE NULL' type of thing. Can Solr do this somehow? Heavy customization is not a problem - I would bet this has already been done. I would like to avoid multiple trips back and forth from either the DB or SOLR if possible. Thanks! Christian -- *Christian Jensen* 724 Ioco Rd Port Moody, BC V3H 2W8 +1 (778) 996-4283 christ...@jensenbox.com
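As a rough illustration of the approach above, here is a minimal SolrJ sketch; the id field name, the server URL and the loadExcludedDocIds helper are all made-up placeholders for the per-user database lookup, not anything from this thread:

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PerUserSearch {

  // Placeholder for the database lookup of excluded doc ids, keyed by userId.
  static List<String> loadExcludedDocIds(String userId) {
    return Arrays.asList("doc17", "doc42");
  }

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String userId = "user1";

    SolrQuery query = new SolrQuery("some user query");
    List<String> excluded = loadExcludedDocIds(userId);
    if (!excluded.isEmpty()) {
      // Build a single NOT filter, e.g. *:* -id:doc17 -id:doc42
      StringBuilder fq = new StringBuilder("*:*");
      for (String docId : excluded) {
        fq.append(" -id:").append(docId);
      }
      query.addFilterQuery(fq.toString());
    }
    QueryResponse rsp = server.query(query);
    System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
  }
}

One thing to watch with this pattern is that each distinct fq gets its own filterCache entry, so with many users the hit rate on those per-user filters will be low.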
Re: Query foreign language synonyms / words of equivalent meaning?
Hi, We are using google translate to do something like what you (onlinespending) want to do, so maybe it will help. During indexing, we store the searchable fields from documents into a fields named _en, _fr, _es, etc. So assuming we capture title and body from each document, the fields are (title_en, body_en), (title_fr, body_fr), etc, with their own analyzer chains. These documents come from a controlled source (ie not the web), so we know the language they are authored in. During searching, a custom component intercepts the client language and the query. The query is sent to google translate for language detection. The largest amount of docs in the corpus is english, so if the detected language is either english or the client language, then we call google translate again to find the translated query in the other (english or client) language. Another custom component constructs an OR query between the two languages one component of which is aimed at the _en field set and the other aimed at the _xx (client language) field set. -sujit On Oct 9, 2012, at 11:24 PM, Bernd Fehling wrote: As far as I know, there is no built-in functionality for language translation. I would propose to write one, but there are many many pitfalls. If you want to translate from one language to another you might have to know the starting language. Otherwise you get problems with translation. Not (german) - distress (english), affliction (english) - you might have words in one language which are stopwords in other language not - you don't have a one to one mapping, it's more like 1 to n+x toilette (french) - bathroom, rest room / restroom, powder room This are just two points which jump into my mind but there are tons of pitfalls. We use the solution of a multilingual thesaurus as synonym dictionary. http://en.wikipedia.org/wiki/Eurovoc It holds translations of 22 official languages of the European Union. So a search for europäischer währungsfonds gives also results with european monetary fund, fonds monétaire européen, ... Regards Bernd Am 10.10.2012 04:54, schrieb onlinespend...@gmail.com: Hi, English is going to be the predominant language used in my documents, but there may be a spattering of words in other languages (such as Spanish or French). What I'd like is to initiate a query for something like bathroom for example and for Solr to return documents that not only contain bathroom but also baño (Spanish). And the same goes when searching for baño. I'd like Solr to return documents that contain either bathroom or baño. One possibility is to pre-translate all indexed documents to a common language, in this case English. And if someone were to search using a foreign word, I'd need to translate that to English before issuing a query to Solr. This appears to be problematic, since I'd have to know whether the indexed words and the query are even in a foreign language, which is not trivial. Another possibility is to pre-build a list of foreign word synonyms. So baño would be listed as a synonym for bathroom. But I'd need to include other languages (such as toilette in French) and other words. This requires that I know in advance all possible words I'd need to include foreign language versions of (not to mention needing to know which languages to include). This isn't trivial either. I'm assuming there's no built-in functionality that supports the foreign language translation on the fly, so what do people propose? Thanks! -- * Bernd FehlingUniversitätsbibliothek Bielefeld Dipl.-Inform. 
(FH)LibTec - Bibliothekstechnologie Universitätsstr. 25 und Wissensmanagement 33615 Bielefeld Tel. +49 521 106-4060 bernd.fehling(at)uni-bielefeld.de BASE - Bielefeld Academic Search Engine - www.base-search.net *
Re: How to make SOLR manipulate the results?
Hi Srilatha, One way to do this would be by making two calls, one to your sponsored list where you pick two at random and a solr call where you pick all the search results and then stick them together in your client. Sujit On Oct 4, 2012, at 12:39 AM, srilatha wrote: For an E-commerce website, we have stored the products as SOLR documents with the following fields and weights: Title:5 Description:4 For some products, we need to ensure that they appear in the top ten results even if their relevance in the above two fields does not qualify them for being in top 10. For example: P1, P2, P10 are the legitimate products for a given search keyword iPhone. I have S1 ... S100 as sponsored products that want to appear in the top 10. My policy is that only 2 of these 100 sponsored products will be randomly chosen and shown in the top 10 so that the results will be: S5, S31, P1, P2, ... P8. In the next request, the sponsored products that gets slipped in may be S4, S99. The QueryElevationComponent lets us specify the docIDs for keywords but does not let us randomize the results such that only 2 of the complete set of sponsored docIDs is sent in the results. Any suggestions for implementing this would be appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-make-SOLR-manipulate-the-results-tp4011739.html Sent from the Solr - User mailing list archive at Nabble.com.
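If it helps, a bare-bones SolrJ version of the two-call idea might look like the sketch below; the sponsored flag field, the title and id fields, and the URL are assumptions for illustration, and the merge is deliberately naive (the two random sponsored picks first, then the organic results):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class SponsoredMixer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String keyword = "iphone";

    // Call 1: fetch the sponsored candidates for this keyword.
    SolrQuery sponsoredQ = new SolrQuery("title:" + keyword);
    sponsoredQ.addFilterQuery("sponsored:true");
    sponsoredQ.setRows(100);
    List<SolrDocument> sponsored =
        new ArrayList<SolrDocument>(server.query(sponsoredQ).getResults());

    // Pick two of the sponsored candidates at random.
    Collections.shuffle(sponsored);
    List<SolrDocument> picked = sponsored.subList(0, Math.min(2, sponsored.size()));

    // Call 2: the normal search, excluding sponsored docs so they don't appear twice.
    SolrQuery organicQ = new SolrQuery("title:" + keyword);
    organicQ.addFilterQuery("*:* -sponsored:true");
    organicQ.setRows(8);

    // Stitch the two lists together in the client.
    List<SolrDocument> results = new ArrayList<SolrDocument>(picked);
    results.addAll(server.query(organicQ).getResults());
    for (SolrDocument doc : results) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}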
Re: Synonym file for American-British words
Hi Alex, I implemented something similar using the rules described in this page: http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences The idea is to normalize the British spelling to the American form during indexing and query, using a tokenizer that takes in a word and, if it matches one of the rules, returns the converted form. My rules were modeled as a chain of transformations. Each transformation had a set of (pattern, action) pairs. The transformations were: a) word replacement (such as artefact = artifact) - in this case the source word would directly be normalized into the specified target word. b) prefix rules (eg anae = ane for anemic) - in this case the prefix characters of the word, if matched, would be transformed into the target. c) suffix rules (eg tre = ter for center) - similar to prefix rules except it works on the suffix. d) infix rules (eg moeb = meb for ameba) - replaces characters in the middle of the word. I cannot share the actual rules, but they should be relatively simple to figure out from the wiki page, if you want to go that route. HTH, Sujit On Aug 7, 2012, at 7:08 AM, Alexander Cougarman wrote: Dear friends, Is there a downloadable synonym file for American-British words? This page has some, for example the VarCon file, but it's not in the Solr synonym.txt file. We need something that can normalize words like center to centre. The VarCon file has it, but it's in the wrong format. Thank you in advance :) Sincerely, Alex
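For anyone wanting to experiment, a toy version of that rule chain (plain Java, with a tiny illustrative rule set rather than the actual rules, which as noted above can't be shared) could look like this; wrapping it into a TokenFilter so both the index and query analyzers see the normalized form is the remaining step:

import java.util.LinkedHashMap;
import java.util.Map;

public class BritishToAmericanNormalizer {

  // A tiny illustrative rule set; a real list would be derived from the wiki page.
  private static final Map<String, String> WORD_RULES = new LinkedHashMap<String, String>();
  private static final Map<String, String> PREFIX_RULES = new LinkedHashMap<String, String>();
  private static final Map<String, String> SUFFIX_RULES = new LinkedHashMap<String, String>();
  private static final Map<String, String> INFIX_RULES = new LinkedHashMap<String, String>();
  static {
    WORD_RULES.put("artefact", "artifact");
    PREFIX_RULES.put("anae", "ane");   // anaemic -> anemic
    SUFFIX_RULES.put("tre", "ter");    // centre -> center
    INFIX_RULES.put("moeb", "meb");    // amoeba -> ameba
  }

  public static String normalize(String word) {
    String w = word.toLowerCase();
    if (WORD_RULES.containsKey(w)) return WORD_RULES.get(w);
    for (Map.Entry<String, String> e : PREFIX_RULES.entrySet()) {
      if (w.startsWith(e.getKey())) return e.getValue() + w.substring(e.getKey().length());
    }
    for (Map.Entry<String, String> e : SUFFIX_RULES.entrySet()) {
      if (w.endsWith(e.getKey())) return w.substring(0, w.length() - e.getKey().length()) + e.getValue();
    }
    for (Map.Entry<String, String> e : INFIX_RULES.entrySet()) {
      if (w.contains(e.getKey())) return w.replace(e.getKey(), e.getValue());
    }
    return w;
  }

  public static void main(String[] args) {
    System.out.println(normalize("centre"));   // center
    System.out.println(normalize("anaemic"));  // anemic
  }
}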
Re: First query to find meta data, second to search. How to group into one?
Hi Samarendra, This does look like a candidate for a custom query component if you want to do this inside Solr. You can of course continue to do this at the client. -sujit On May 15, 2012, at 12:26 PM, Samarendra Pratap wrote: Hi, I need a suggestion for improving relevance of search results. Any help/pointers are appreciated. We have following fields (plus a lot more) in our schema title description category_id (multivalued) We are using mm=70% in solrconfig.xml We are using qf=title description We are not doing phrase query in q In case of a multi-word search text, mostly the end results are the junk ones. Because the words, mentioned in search text, are written in different fields and in different contexts. For example searching for water proof (without double quotes) brings a record where title = rose water and description = ... no proof of contamination ... Our priority is to remove irrelevant results, as much as possible. Increasing mm will not solve this completely because user input may not be always correct to be benefited by high mm. To remove irrelevant records we worked on following solution (or work-around) - We are firing first query to get top n results. We assume that first n results are mostly good results. n is dynamic within a predefined minimum and maximum value. - We are calculating frequency of category ids in these top results. We are not using facets because that gives count for all, relevant or irrelevant, results. - Based on category frequencies within top matching results we are trying to find a few most frequent categories by simple calculation. Now we are very confident that these categories are the ones which best suit to our query. - Finally we are firing a second query with top categories, calculated above, in filter query (fq). The quality of results really increased very much so I thought to try it the standard way. Does it require writing a plugin if I want to move above logic into Solr? Which component do I need to modify - QueryComponent? Or is there any better or even equivalent method in Solr of doing this or similar thing? Thanks -- Regards, Samar
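Done at the client, the two-pass idea described above might look roughly like this SolrJ sketch; it assumes category_id is stored (so it comes back with each result), picks a single top category rather than "a few", and uses a made-up server URL:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class TwoPassCategorySearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String userQuery = "water proof";

    // Pass 1: take the top n results and count category_id frequencies by hand
    // (not facets, so only the top-ranked docs contribute).
    SolrQuery first = new SolrQuery(userQuery);
    first.setRows(50);
    SolrDocumentList top = server.query(first).getResults();
    Map<Object, Integer> counts = new HashMap<Object, Integer>();
    for (SolrDocument doc : top) {
      Collection<Object> cats = doc.getFieldValues("category_id");
      if (cats == null) continue;
      for (Object cat : cats) {
        Integer c = counts.get(cat);
        counts.put(cat, c == null ? 1 : c + 1);
      }
    }

    // Pick the most frequent category among the top results.
    Object best = null;
    int bestCount = -1;
    for (Map.Entry<Object, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
    }

    // Pass 2: re-run the query restricted to the winning category.
    SolrQuery second = new SolrQuery(userQuery);
    second.addFilterQuery("category_id:" + best);
    System.out.println(server.query(second).getResults().getNumFound() + " filtered results");
  }
}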
Re: Faceting on a date field multiple times
Hi Ian, I believe you may be able to use a bunch of facet.query parameters, something like this: facet.query=yourfield:[NOW-1DAY TO NOW] facet.query=yourfield:[NOW-2DAY TO NOW-1DAY] ... and so on. -sujit On May 3, 2012, at 10:41 PM, Ian Holsman wrote: Hi. I would like to be able to do a facet on a date field, but with different ranges (in a single query). for example, I would like to show #documents by day for the last week, #documents by week for the last couple of months, and #documents by year for the last several years. is there a way to do this without hitting solr 3 times? thanks Ian
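In SolrJ the same thing would look something like the sketch below; the pubdate field name and the particular buckets are just examples:

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateBucketFacets {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    query.setRows(0);
    query.setFacet(true);
    // Mixed granularities in a single request: days, weeks, months, years.
    query.addFacetQuery("pubdate:[NOW-1DAY TO NOW]");
    query.addFacetQuery("pubdate:[NOW-7DAY TO NOW]");
    query.addFacetQuery("pubdate:[NOW-1MONTH TO NOW]");
    query.addFacetQuery("pubdate:[NOW-1YEAR TO NOW]");

    QueryResponse rsp = server.query(query);
    for (Map.Entry<String, Integer> e : rsp.getFacetQuery().entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}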
Re: Any way to get reference to original request object from within Solr component?
Hi Hoss, Thanks for the pointers, and sorry, it was a bug in my code (was some dead code which was alphabetizing the facet link text and also the parameters themselves indirectly by reference). I actually ended up building a servlet and a component to print out the multi-valued parameters using HttpServletRequest.getParameterValues(myparam) and ResponseBuilder.req.getParams().getParams(myparam) respectively to isolate the problem. Both of them returned the parameters in the correct order. So I went trolling through the code with a debugger, to observe exactly at what point the order got messed up, and found the bug. FWIW, I am using Tomcat 5.5. Thanks to everybody for their help, and sorry for the noise, guess I should have done the debugger thing before I threw up my hands :-). -sujit On Mar 19, 2012, at 6:55 PM, Chris Hostetter wrote: : I have a custom component which depends on the ordering of a : multi-valued parameter. Unfortunately it looks like the values do not : come back in the same order as they were put in the URL. Here is some : code to explain the behavior: ... : and I notice that the values are ordered differently than [foo, bar, : baz] that I would have expected. I am guessing its because the : SolrParams is a MultiMap structure, so order is destroyed on its way in. a) MultiMapSolrParams does not destroy order on the way in b) when dealing with HTTP requests, the request params actaully use an instance of ServletSolrParams which is backed directly by the ServletRequest.getParameterMap() -- you should get the values returned in the exact order as ServletRequest.getParameterMap().get(myparam) : 1) is there a setting in Solr can use to enforce ordering of : multi-valued parameters? I suppose I could use a single parameter with : comma-separated values, but its a bit late to do that now... Should already be enforced in MultiMapSolrParams and ServletSolrParams : 2) is it possible to use a specific SolrParams object that preserves order? If so how? see above. : 3) is it possible to get a reference to the HTTP request object from within a component? If so how? not out of the box, because there is no garuntee that solr is even running in a servlet container. you can subclass SolrDispatchFilter to do this if you wish (note the comment in the execute() method). My questions to you... 1) what servlet container are you using? 2) have you tested your servlet container with a simple servlet (ie: eliminate solr from the equation) to verify that the ServletRequest.getParameterMap() contains your request values in order? if you debug this and find evidence that something in solr is re-ordering the values in a MultiMapSolrParams or ServletSolrParams *PLEASE* open a jira with a reproducable example .. that would definitley be an anoying bug we should get to the bottom of. -Hoss
Re: Any way to get reference to original request object from within Solr component?
Thanks Russel, thats a good idea, I think this would work too... I will try this and update the thread with details once. -sujit On Mar 18, 2012, at 7:11 AM, Russell Black wrote: One way to do this is to register a servlet filter that places the current request in a global static ThreadLocal variable, thereby making it available to your Solr component. It's kind of a hack but would work. Sent from my phone On Mar 17, 2012, at 6:53 PM, SUJIT PAL sujit@comcast.net wrote: Thanks Pravesh, Yes, converting the myparam to a single (comma-separated) field is probably the best approach, but as I mentioned, this is probably a bit too late for this to be practical in my case... The myparam parameters are facet filter queries, and so far order did not matter, since the filters were just AND-ed together and applied to the result set and facets were being returned in count order. But now the requirement is to bubble up the selected facets so the one is most currently selected is on the top. This was uncovered during user-acceptance testing (since the client shows only the top N facets, and the currently selected facet to disappear since its no longer within the top N facets). Asking the client to switch to a single comma-separated field is an option, but its the last option at this point, so I was wondering if it was possible to switch to some other data structure, or at least get a handle to the original HTTP servlet request from within the component so I could grab the parameters from there. I noticed that the /select call does preserve the order of the parameters, but that is because its probably being executed by SolrServlet, which gets its parameters from the HttpServletRequest. I guess I will have to just run the request through a debugger and see where exactly the parameter order gets messed up...I'll update this thread if I find out. Meanwhile, if any of you have simpler alternatives, would really appreciate knowing... Thanks, -sujit On Mar 17, 2012, at 12:01 AM, pravesh wrote: Hi Sujit, The Http parameters ordering is above the SOLR level. Don't think this could be controlled at SOLR level. You can append all required values in a single Http param at then break at your component level. Regds Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-reference-to-original-request-object-from-within-Solr-component-tp3833703p3834082.html Sent from the Solr - User mailing list archive at Nabble.com.
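For reference, the ThreadLocal filter Russell describes could be as small as the sketch below; the class name is made up, and it has to be mapped in web.xml ahead of SolrDispatchFilter so it wraps the Solr request on the same thread:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Stashes the current request in a ThreadLocal so a Solr component running on the
// same thread can read the raw parameter order directly from the servlet request.
public class RequestHolderFilter implements Filter {

  private static final ThreadLocal<HttpServletRequest> CURRENT =
      new ThreadLocal<HttpServletRequest>();

  public static HttpServletRequest getCurrentRequest() {
    return CURRENT.get();
  }

  public void init(FilterConfig config) throws ServletException {
  }

  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    try {
      if (req instanceof HttpServletRequest) {
        CURRENT.set((HttpServletRequest) req);
      }
      chain.doFilter(req, res);
    } finally {
      CURRENT.remove();  // avoid leaking requests across pooled threads
    }
  }

  public void destroy() {
  }
}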
Re: Any way to get reference to original request object from within Solr component?
Thanks Pravesh, Yes, converting the myparam to a single (comma-separated) field is probably the best approach, but as I mentioned, this is probably a bit too late for this to be practical in my case... The myparam parameters are facet filter queries, and so far order did not matter, since the filters were just AND-ed together and applied to the result set and facets were being returned in count order. But now the requirement is to bubble up the selected facets so the one is most currently selected is on the top. This was uncovered during user-acceptance testing (since the client shows only the top N facets, and the currently selected facet to disappear since its no longer within the top N facets). Asking the client to switch to a single comma-separated field is an option, but its the last option at this point, so I was wondering if it was possible to switch to some other data structure, or at least get a handle to the original HTTP servlet request from within the component so I could grab the parameters from there. I noticed that the /select call does preserve the order of the parameters, but that is because its probably being executed by SolrServlet, which gets its parameters from the HttpServletRequest. I guess I will have to just run the request through a debugger and see where exactly the parameter order gets messed up...I'll update this thread if I find out. Meanwhile, if any of you have simpler alternatives, would really appreciate knowing... Thanks, -sujit On Mar 17, 2012, at 12:01 AM, pravesh wrote: Hi Sujit, The Http parameters ordering is above the SOLR level. Don't think this could be controlled at SOLR level. You can append all required values in a single Http param at then break at your component level. Regds Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-reference-to-original-request-object-from-within-Solr-component-tp3833703p3834082.html Sent from the Solr - User mailing list archive at Nabble.com.
Any way to get reference to original request object from within Solr component?
Hello, I have a custom component which depends on the ordering of a multi-valued parameter. Unfortunately it looks like the values do not come back in the same order as they were put in the URL. Here is some code to explain the behavior: URL: /solr/my_custom_handler?q=something&myparam=foo&myparam=bar&myparam=baz Inside my component's process(ResponseBuilder) method, I do the following:

public void process(ResponseBuilder rb) throws IOException {
  String[] myparams = rb.req.getParams().getParams("myparam");
  System.out.println("myparams=" + ArrayUtils.toString(myparams));
  ...
}

and I notice that the values are ordered differently than [foo, bar, baz] that I would have expected. I am guessing it's because the SolrParams is a MultiMap structure, so order is destroyed on its way in. My questions are: 1) is there a setting in Solr I can use to enforce ordering of multi-valued parameters? I suppose I could use a single parameter with comma-separated values, but it's a bit late to do that now... 2) is it possible to use a specific SolrParams object that preserves order? If so how? 3) is it possible to get a reference to the HTTP request object from within a component? If so how? I am on Solr version 3.2.0. Thanks in advance for any help you can provide, Sujit
Re: How to check if a field is a multivalue field with java
Hi Thomas, With Java (from within a custom handler in Solr) you can get a handle to the IndexSchema from the request, like so: IndexSchema schema = req.getSchema(); SchemaField sf = schema.getField(fieldname); boolean isMultiValued = sf.multiValued(); From within SolrJ code, you can use SolrDocument.getFieldValue() which returns an Object, so you could do an instanceof check - if it's a Collection it's multivalued, else not. Object value = sdoc.getFieldValue(fieldname); boolean isMultiValued = value instanceof Collection; At least this is what I do, I don't think there is a way to get a handle to the IndexSchema object over solrj... -sujit On Feb 22, 2012, at 9:41 AM, tschiela wrote: Hello, is there any way to check if a field of a SolrDocument is a multivalued field with java (solrj)? Greets Thomas -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-check-if-a-field-is-a-multivalue-field-with-java-tp3767200p3767200.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to make search with special characters in keywords
Hi Tejinder, I had this problem yesterday (believe it or not :-)), and the fix for us was to make Tomcat UTF-8 compliant. In server.xml, there is a Connector tag; we added the attribute URIEncoding="UTF-8" to it and restarted Tomcat. Not sure what container you are using; if it's Tomcat this will solve it, else you could probably find a similar setting for your container. Here is a link that provides more specific info: http://struts.apache.org/2.0.6/docs/how-to-support-utf-8-uriencoding-with-tomcat.html -sujit On Feb 1, 2012, at 11:52 AM, Tejinder Rawat wrote: Hi all, In my implementation many fields in documents are having words with special characters like Company®, Time™. Index is created using these fields. However if I make search using these keywords in solr console, it does not work. i.e. entering Company® or Time™ in search field box does not return any document. Whereas entering Company or Time returns documents. Requirement is to be able to make search with special characters in keywords. Any pointers about how to index and search in case of special characters will be greatly appreciated. Thank you. Thanks, Tejinder
Re: How to make search with special characters in keywords
Well, sometimes people just copy-paste stuff into the search box probably because some words (at least in my world) are very hard to spell correctly. We noticed the problem because the query was getting mangled on its way in and not returning any search results even though it should have. Our analysis chain (both query and index) uses ASCIIFoldingFilter to downcast these special characters to equivalent ASCII, so a string such as Ångström for example will actually result in a search for angstrom. The indexing also does the same conversion. The mangling looked very similar to what happens when UTF-8 is passed through ISO-8859-1 encoding (and vice versa) which led us to the solution. -sujit On Feb 1, 2012, at 5:04 PM, Erick Erickson wrote: Sujit's comments are well taken, part of your problem will certainly be getting the special characters through your container... But another part of your problem will be having the characters in your index in the first place. The fact that you can find Time in the first place suggests that your index does NOT have the special characters, you need to look to your analysis chain to see what transformations occur, see the admin/analysis page... But I question why you need to search on special characters. Do you really expect the user to be happy with being required to enter Company®? A common approach is to remove such special characters during both index and query analyzing so a Company® and Company are equivalent. But your problem space may differ. Best Erick On Wed, Feb 1, 2012 at 6:55 PM, SUJIT PAL sujit@comcast.net wrote: Hi Tejinder, I had this problem yesterday (believe it or not :-)), and the fix for us was to make Tomcat UTF-8 compliant. In server.xml, there is a Controller tag, we added the attribute URIEncoding=UTF-8 and restarted Tomcat. Not sure what container you are using, if its Tomcat this will solve it, else you could probably find a similar setting for your container. Here is a link that provides more specific info: http://struts.apache.org/2.0.6/docs/how-to-support-utf-8-uriencoding-with-tomcat.html -sujit On Feb 1, 2012, at 11:52 AM, Tejinder Rawat wrote: Hi all, In my implementation many fields in documents are having words with special characters like Company® ,Time™. Index is created using these fields. However if I make search using these keywords in solr console, it does not work. i.e. entering Company® or Time™ in search field box does not return any document. Where as entering Company or Time returns documents. Requirement is to be able to make search with special characters in keywords. Any pointers about how to index and search in case of special characters will be greatly appreciated. Thank you. Thanks, Tejinder
Re: Solr, SQL Server's LIKE
Hi Devon, Have you considered using a permuterm index? Its workable, but depending on your requirements (size of fields that you want to create the index on), it may bloat your index. I've written about it here: http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html Another alternative which I've implemented is a custom mechanism that retrieves a list of matching unique ids from a database table using a SQL LIKE, then passes this list as a filter to the main query. Its hacky, but I was building a custom handler anyway, so it was quite simple to add in. -sujit On Thu, 2011-12-29 at 11:38 -0600, Devon Baumgarten wrote: I have been tinkering with Solr for a few weeks, and I am convinced that it could be very helpful in many of my upcoming projects. I am trying to decide whether Solr is appropriate for this one, and I haven't had luck looking for answers on Google. I need to search a list of names of companies and individuals pretty exactly. T-SQL's LIKE operator does this with decent performance, but I have a feeling there is a way to configure Solr to do this better. I've tried using an edge N-gram tokenizer, but it feels like it might be more complicated than necessary. What would you suggest? I know this sounds kind of 'Golden Hammer,' but there has been talk of other, more complicated (magic) searches that I don't think SQL Server can handle, since its tokens (as far as I know) can't be smaller than one word. Thanks, Devon Baumgarten
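The permuterm trick mentioned above boils down to indexing every rotation of the term plus an end marker; a tiny sketch of just the rotation step (generic, not taken from the blog post) is:

import java.util.ArrayList;
import java.util.List;

public class PermutermRotations {

  // For "clark" this produces clark$, lark$c, ark$cl, rk$cla, k$clar, $clark.
  // A leading-wildcard query like *lark can then be rewritten to the cheap
  // prefix query lark$*.
  public static List<String> rotations(String term) {
    String marked = term + "$";
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < marked.length(); i++) {
      out.add(marked.substring(i) + marked.substring(0, i));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(rotations("clark"));
  }
}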
Re: Dynamic rating based on Like feature
Hi Eugene, I proposed a solution for something similar, maybe it will help you. http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html -sujit On Sat, 2011-11-05 at 16:43 -0400, Eugene Strokin wrote: Hello, I have a task which seems trivial, but I couldn't find any related information from Solr documentation. So I'm asking the community for an advice. I have relatively big amount (about 25 Millions) of documents which are describing products. Those products could be rated by humans and/or machines. The rating is nothing more but just Like kind of points. So if someone or something likes a product it adds +1 to the total points of the product. I was thinking I could just have an integer field in the document, and increment it each time when Like event is fired, and just sort this field. But, because Like event could come from external systems, I could get literally thousands of such events in first few hours. And I'm not sure that updating the document that often would be good. This is the first question - May be there is another way to do such dynamic rating? So more Liked products will be first in a search result. The second problem, that the client is asking to have time based search results. For example those Likes should not boost the document if they are a week old, a month old, etc. Ideally, they want to set the expiration time dynamically, but if this is a problem, it is acceptable to have some predefined time of expiration of those Likes, but still we are going to need at least a week and a month thresholds. Second question, if this is possible at all to do using Solr, if so, how? If not, what could you suggest? Thanks in advance, any advice, information, anything are greatly appreciated. Eugene S.
Re: Find Documents with field = maxValue
Hi Alireza, Would this work? Sort the results by age desc, then loop through the results as long as age == age[0]. -sujit On Tue, 2011-10-18 at 15:23 -0700, Otis Gospodnetic wrote: Hi, Are you just looking for: age:target age This will return all documents/records where age field is equal to target age. But maybe you want age:[0 TO target age here] This will include people aged from 0 to target age. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Alireza Salimi alireza.sal...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, October 18, 2011 10:15 AM Subject: Re: Find Documents with field = maxValue Hi Ahmet, Thanks for your reply, but I want ALL documents with age = max_age. On Tue, Oct 18, 2011 at 9:59 AM, Ahmet Arslan iori...@yahoo.com wrote: --- On Tue, 10/18/11, Alireza Salimi alireza.sal...@gmail.com wrote: From: Alireza Salimi alireza.sal...@gmail.com Subject: Find Documents with field = maxValue To: solr-user@lucene.apache.org Date: Tuesday, October 18, 2011, 4:10 PM Hi, It might be a naive question. Assume we have a list of Document, each Document contains the information of a person, there is a numeric field named 'age', how can we find those Documents whose *age* field is *max(age) *in one query. May be http://wiki.apache.org/solr/StatsComponent? Or sort by age? q=*:*start=0rows=1sort=age desc -- Alireza Salimi Java EE Developer
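A quick SolrJ sketch of that sort-and-scan approach; the age field name, the URL and the 100-row page size are assumptions, and you would page further if more than 100 docs can share the max:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class MaxAgeDocs {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    query.addSortField("age", SolrQuery.ORDER.desc);
    query.setRows(100);

    SolrDocumentList docs = server.query(query).getResults();
    List<SolrDocument> oldest = new ArrayList<SolrDocument>();
    if (!docs.isEmpty()) {
      // Everything up to the first doc whose age differs from the top doc shares the max.
      Object maxAge = docs.get(0).getFieldValue("age");
      for (SolrDocument doc : docs) {
        if (!maxAge.equals(doc.getFieldValue("age"))) break;
        oldest.add(doc);
      }
    }
    System.out.println(oldest.size() + " docs share the max age");
  }
}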
Re: SolrJ + Post
If you use the CommonsHttpSolrServer from your client (not sure about the other types, this is the one I use), you can pass the method as an argument to its query() method, something like this: QueryResponse rsp = server.query(params, METHOD.POST); HTH Sujit On Fri, 2011-10-14 at 13:29 +, Rohit wrote: I want to use POST instead of GET while using solrj, but I am unable to find a clear example for it. If anyone has implemented the same it would be nice to get some insight. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg
Re: SolrJ + Post
Not the OP, but I put it in on /one/ of my solr custom handlers that acts as a proxy to itself (ie the server its part of). It basically rewrites the incoming query (usually short 50-250 chars at most) to a set of very long queries and passes them in parallel to the server, gathers up the results and returns a combo response. The logging is not an issue for me since the handler logs the expanded query before sending it off, but the caching is. Thank you for pointing it out. I was doing it because I was running afoul of the limit on the URL size (and the max boolean clauses as well, but I reset the max for that). But I just realized that we can probably reset that limit as well as this page shows: http://serverfault.com/questions/56691/whats-the-maximum-url-length-in-tomcat So perhaps if the URL length is the reason for the OP's question, increasing it may be a better option than using POST? -sujit On Fri, 2011-10-14 at 09:30 -0700, Walter Underwood wrote: Why do you want to use POST? It is the wrong HTTP request type for search results. GET is for retrieving information from the server, POST is for changing information on the server. POST responses cannot be cached (see HTTP spec). POST requests do not include the arguments in the log, which makes your HTTP logs nearly useless for diagnosing problems. wunder Walter Underwood On Oct 14, 2011, at 9:20 AM, Sujit Pal wrote: If you use the CommonsHttpSolrServer from your client (not sure about the other types, this is the one I use), you can pass the method as an argument to its query() method, something like this: QueryResponse rsp = server.query(params, METHOD.POST); HTH Sujit On Fri, 2011-10-14 at 13:29 +, Rohit wrote: I want to user POST instead of GET while using solrj, but I am unable to find a clear example for it. If anyone has implemented the same it would be nice to get some insight. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg
Re: Sort five random Top Offers to the top
Hi Mouli, I was looking at the code here, not sure why you even need to do the sort... After you get the DocList, couldn't you do something like this? ListInteger topofferDocIds = new ArrayListInteger(); for (DocIterator it = ergebnis.iterator(); it.hasNext();) { topofferDocIds.add(it.next()); } Collections.shuffle(topofferDocIds); rb.req.getContext().set(TOPOFFERS, topofferDocIds); so in first-component, you have identified the top 5 offers for the query and client, and stuffed them into the context. Then you define a last component which will take the topofferDocIds and place them at the top of the search results, and remove them if they exist from the main result. Would that not work? Alternatively (kind of a hybrid way) would be to define your own (single) component that takes the query, sends back two queries to the underlying solr, one with the topoffers and one without and merges the results before sending back. This would replace the component that does the search. -sujit On Wed, 2011-09-28 at 07:15 -0700, MOuli wrote: Hey Community. I write my first component and now i got a problem hear is my code: @Override public void prepare(ResponseBuilder rb) throws IOException { try { rb.req.getParams().getBool(topoffers.show, true); String client = rb.req.getParams().get(client, 1); BooleanQuery[] queries = new BooleanQuery[2]; queries[0] = (BooleanQuery) DisMaxQParser.getParser( rb.req.getParams().get(q), DisMaxQParserPlugin.NAME, rb.req) .getQuery(); queries[1] = new BooleanQuery(); Occur occur = BooleanClause.Occur.MUST; queries[1].add(QueryParsing.parseQuery(ups_topoffer_ + client + :true, rb.req.getSearcher().getSchema()), occur); Query q = Query.mergeBooleanQueries(queries[0], queries[1]); DocList ergebnis = rb.req.getSearcher().getDocList(q, null, null, 0, 5, 0); String[] machineIds = new String[5]; int position = 0; DocIterator iter = ergebnis.iterator(); while (iter.hasNext()) { int docID = iter.nextDoc(); Document doc = rb.req.getSearcher().getReader().document(docID); for (String value : doc.getValues(machine_id)) { machineIds[position++] = value; } } Sort sort = rb.getSortSpec().getSort(); if (sort == null) { rb.getSortSpec().setSort(new Sort()); sort = rb.getSortSpec().getSort(); } SortField[] newSortings = new SortField[sort.getSort().length + 5]; int count = 0; for (String machineId : machineIds) { SortField sortMachineId = new SortField(map(machine_id, + machineId + , + machineId + ,1,0) desc, SortField.DOUBLE); newSortings[count++] = sortMachineId; } SortField[] sortings = sort.getSort(); for (SortField sorting : sortings) { newSortings[count++] = sorting; } sort.setSort(newSortings); rb.getSortSpec().setSort(sort); } catch (ParseException e) { LoggerFactory.getLogger(Topoffers.class).error( Fehler bei den Topoffers!, this); LoggerFactory.getLogger(Topoffers.class).error(e.toString(), this); } } Why can't i manipulate the sort? Is there something i miss understand? This search component is added as a first-component in the solrconfig.xml. Please can anyone help me?? -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3376166.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sort five random Top Offers to the top
Not the OP, but this is /much/ simpler, although at the expense of making 2 calls to solr. But the upside is that no customization is required. On Thu, 2011-09-22 at 09:43 +0100, Doug McKenzie wrote: Could you not just do your normal search with and add a filter query on? fq=topoffer:true That would then return only results with top offer : true and then use whatever shuffling / randomising you like in your application. Alternately you could even add sorting on relevance to show the top 5 closest matches to the query rows=5sort=score desc On 21/09/2011 21:26, Sujit Pal wrote: Hi MOuli, AFAIK (and I don't know that much about Solr), this feature does not exist out of the box in Solr. One way to achieve this could be to construct a DocSet with topoffer:true and intersect it with your result DocSet, then select the first 5 off the intersection, randomly shuffle them, sublist [0:5], and move the sublist to the top of the results like QueryElevationComponent does. Actually you may want to take a look at QueryElevationComponent code for inspiration (this is where I would have looked if I had to implement something similar). -sujit On Wed, 2011-09-21 at 06:54 -0700, MOuli wrote: Hey Community. I got a Lucene/Solr Index with many offers. Some of them are marked by a flag field topoffer that they are top offers. Now I want so sort randomly 5 of this offers on the top. For Example HTC Sensation - topoffer = true HTC Desire - topoffer = false Samsung Galaxy S2 - topoffer = ture IPhone 4 - topoffer = true ... When i search for a Handy then i want that first 3 offers are HTC Sensation, Samsung Galaxy S2 and the iPhone 4. Does anyone have an idea? PS.: I hope my english is not to bad -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3355469.html Sent from the Solr - User mailing list archive at Nabble.com. -- Become a Firebox Fan on Facebook: http://facebook.com/firebox And Follow us on Twitter: http://twitter.com/firebox Firebox has been nominated for Retailer of the Year in the 2011 Stuff Awards. Who will win? It's up to you! Visit http://www.stuff.tv/awards and place your vote. We'll do a special dance if it's us. Firebox HQ is MOVING HOUSE! We're migrating from Streatham Hill to shiny new digs in Shoreditch. As of 3rd October please update your records to: Firebox.com, 6.10 The Tea Building, 56 Shoreditch High Street, London, E1 6JJ Global Head Office: Firebox House, Ardwell Road, London SW2 4RT Firebox.com Ltd is registered in England and Wales, company number 3874477 Registered Company Address: 41 Welbeck Street London W1G 8EA Firebox.com Any views expressed in this email are those of the individual sender, except where the sender expressly, and with authority, states them to be the views of Firebox.com Ltd.
Re: Sort five random Top Offers to the top
I have a few blog posts on this... http://sujitpal.blogspot.com/2011/04/custom-solr-search-components-2-dev.html http://sujitpal.blogspot.com/2011/04/more-fun-with-solr-component.html http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html but its quite simple, just look at some of the ones already in there. If you need books, check out the Apache Solr 3.1 Cookbook - it has a chapter on how to do this. -sujit On Thu, 2011-09-22 at 02:13 -0700, MOuli wrote: Hmm is it possible for me to write my own search component? I just downloaded the solr sources and need some informations how the search components work. Is there anything out there which can help me? -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3358152.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sort five random Top Offers to the top
Sorry hit send too soon. Personally, given the use case, I think I would still prefer the two query approach. It seems way too much work to do a handler (unless you want to learn how to do it) to support this. On Thu, 2011-09-22 at 12:31 -0700, Sujit Pal wrote: I have a few blog posts on this... http://sujitpal.blogspot.com/2011/04/custom-solr-search-components-2-dev.html http://sujitpal.blogspot.com/2011/04/more-fun-with-solr-component.html http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html but its quite simple, just look at some of the ones already in there. If you need books, check out the Apache Solr 3.1 Cookbook - it has a chapter on how to do this. -sujit On Thu, 2011-09-22 at 02:13 -0700, MOuli wrote: Hmm is it possible for me to write my own search component? I just downloaded the solr sources and need some informations how the search components work. Is there anything out there which can help me? -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3358152.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sort five random Top Offers to the top
Hi MOuli, AFAIK (and I don't know that much about Solr), this feature does not exist out of the box in Solr. One way to achieve this could be to construct a DocSet with topoffer:true and intersect it with your result DocSet, then select the first 5 off the intersection, randomly shuffle them, sublist [0:5], and move the sublist to the top of the results like QueryElevationComponent does. Actually you may want to take a look at QueryElevationComponent code for inspiration (this is where I would have looked if I had to implement something similar). -sujit On Wed, 2011-09-21 at 06:54 -0700, MOuli wrote: Hey Community. I got a Lucene/Solr Index with many offers. Some of them are marked by a flag field topoffer that they are top offers. Now I want so sort randomly 5 of this offers on the top. For Example HTC Sensation - topoffer = true HTC Desire - topoffer = false Samsung Galaxy S2 - topoffer = ture IPhone 4 - topoffer = true ... When i search for a Handy then i want that first 3 offers are HTC Sensation, Samsung Galaxy S2 and the iPhone 4. Does anyone have an idea? PS.: I hope my english is not to bad -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3355469.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Too many results in dismax queries with one word
Would it make sense to have a Did you mean? type of functionality for which you use the EdgeNGram and Metaphone filters /if/ you don't get appropriate results for the user query? So when user types cannon and the application notices that there are no cannons for sale in the index (0 results with standard analysis), it then makes another query with the EdgeNGram and/or Metaphone filters and come back with: Did you mean Canon, Canine? Clicking on Canon or Canine would fire off a query for these terms. That way your application doesn't guess what is right, it goes back and asks the user what he wants. -sujit On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote: Thanks for reply. I know that sometimes meeting all clients needs would be impossible but then client recalls that competitive (commercial) product already do that (but has other problems, like performance). And then I'm obligated to try more tricks. :/ I'm currently using Solr 3.1 but thinking about migrating to latest stable version - 3.3. You correct, to meet client needs I have also used some hacks with boosting queries (`bq` and `bf` parameters) but I omit that to make XMLs clearer. You mentioned faceting. This is also one of my(my client?) problems. In the user interface they want to have 5 categories for products. Those 5 should be most relevance ones. When I get those with highest counts for one word queries they are most of the time not that which should be there. For example with phrase ipad which actually has only 12 most relevant products in category tablets but phonetic APT matches also part of model name for hundreds of UPS power supplies and bath tubes . And these are on the list, not tablets. :/ But you mentioned autocomplete which is something what I haven't watched yet. I'll try with that and show it to my client. -- Rafał RaVbaker Piekarski. web: http://ja.ravbaker.net mail: ravba...@gmail.com jid/xmpp/aim: ravba...@gmail.com mobile: +48-663-808-481 On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson erickerick...@gmail.comwrote: The root problem here is This is unacceptable for my client. The first thing I'd suggest is that you work with your client and get them to define what is acceptable. You'll be forever changing things (to no good purpose) if all they can say is that's not right. For instance, you apparently have two competing requirements: 1 try to correct users input, which inevitably increases the results returned 2 narrow the search to the right results. You can't have both every time! So you could try something like going with a more-restrictive search (no metaphone comparison) first and, if the results returned weren't sufficient firing the broader query back, without showing the too-small results first. You could work with your client and see if what they really want is just the most relevant results at the top of the list, in which case you can play with the dismax field boosts (by the way, what version of Solr are you using?) You could work with the client to understand the user experience if you use autocomplete and/or faceting etc. to guide their explorations. You could... But none of that will help unless and until you and your client can agree what is the correct behavior ahead of time Best Erick On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker) ravba...@gmail.com wrote: Hi all, I have a database of e-commerce products (5M) and trying to build a search solution for it. I have used steemer, edgengram and doublemetaphone phonetic fields for omiting common typos in queries. 
It works quite good with dismax QParser for queries longer than one word: tv lc20, sny psp 3001, cannon 5d etc. For not having too many results I manipulated with `mm` parameter. But when user type a single word like ipad, cannon. I always having a lot of results (~6). This is unacceptable for my client. He would like to have then only the `good` results. That particulary match specific query. It's hard to acomplish for me cause of use doublemetaphone field which converts words like apt, opt and ipad and even ipod to the same phonetic word - APT. And then all of these words are matched fairly the same gives me huge amount of results. Similar problems I have with other words like canon, canine and cannon which are KNN in phonetic way. But lexically have different meanings: canon - camera, canine - cat food , cannon - may be a misspell for canon or part of book title about cannon weapons. My first idea was to make a second requestHandler without searching in *_phonetic fields. And use it for queries with only one word. But it didn't worked cause sometimes I want to correct user even if there is only one word and suggest him something better. Query cannon is a good example. I'm fairly sure that most
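A client-side sketch of the fallback idea suggested at the top of this message might look like the following; the qf field names (name_text, description_text, name_phonetic, name_ngram) and the URL are invented for illustration and would map to whatever strict and loose field sets the schema actually has:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class DidYouMeanFallback {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String userQuery = "cannon";

    // First pass: the strict field set, no phonetic/ngram matching.
    SolrQuery strict = new SolrQuery(userQuery);
    strict.set("defType", "dismax");
    strict.set("qf", "name_text description_text");
    long found = server.query(strict).getResults().getNumFound();
    if (found > 0) {
      System.out.println(found + " results, no suggestion needed");
      return;
    }

    // Fallback: the loose (phonetic + ngram) field set, top few matches only,
    // shown back to the user as "Did you mean ...?" choices instead of as results.
    SolrQuery loose = new SolrQuery(userQuery);
    loose.set("defType", "dismax");
    loose.set("qf", "name_phonetic name_ngram");
    loose.setRows(5);
    for (SolrDocument doc : server.query(loose).getResults()) {
      System.out.println("Did you mean: " + doc.getFieldValue("name") + "?");
    }
  }
}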
Re: Exact matching on names?
Hi Ron, There was a discussion about this some time back, which I implemented (with great success btw) in my own code...basically you store both the analyzed and non-analyzed versions (use string type) in the index, then send in a query like this: +name:clarke name_s:clarke^100 The name field is text so it will analyze down clarke to clark but it will match both clark and clarke and the second clause would boost the entry with clarke up to the top, which you then select with rows=1. -sujit On Tue, 2011-08-16 at 10:20 -0500, Olson, Ron wrote: Hi all- I'm missing something fundamental yet I've been unable to find the definitive answer for exact name matching. I'm indexing names using the standard text field type and my search is for the name clarke. My results include clark, which is incorrect, it needs to match clarke exactly (case insensitive). I tried textType but that doesn't work because I believe it needs to be *really* exact, whereas I'm looking for things like clark oil, bob, frank, and clark, etc. Thanks for any help, Ron DISCLAIMER: This electronic message, including any attachments, files or documents, is intended only for the addressee and may contain CONFIDENTIAL, PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying or distribution of this message or any of the information included in or with it is unauthorized and strictly prohibited. If you have received this message in error, please notify the sender immediately by reply e-mail and permanently delete and destroy this message and its attachments, along with any copies thereof. This message does not create any contractual obligation on behalf of the sender or Law Bulletin Publishing Company. Thank you.
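A small SolrJ illustration of that boost-the-exact-copy trick, assuming the analyzed field is called name and the string copyField is name_s (both names made up here):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ExactishNameSearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String name = "clarke";

    // name is the analyzed text field, name_s the unanalyzed string copy of the
    // same value; the boost pushes the exact spelling to the top of the results.
    SolrQuery query = new SolrQuery("+name:" + name + " name_s:" + name + "^100");
    query.setRows(10);

    for (SolrDocument doc : server.query(query).getResults()) {
      System.out.println(doc.getFieldValue("name"));
    }
  }
}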
Re: Problems generating war distribution using ant
FWIW, we have some custom classes on top of solr as well. The way we do it is using the following ant target:

<target name="war" depends="jar" description="Rebuild Solr WAR with custom code">
  <mkdir dir="${maven.webapps.output}"/>
  <!-- we unwar a copy of the 3.2.0 war file in source repo -->
  <unwar src="${prod.common.lib.external.solr}/apache-solr-3.2.0.war" dest="${maven.webapps.output}"/>
  <!-- add in some extra jar files our custom stuff needs -->
  <copy todir="${maven.webapps.output}/WEB-INF/lib">
    <fileset refid="..."/>
    <fileset refid="..."/>
    ...
  </copy>
  <!-- the jar target builds just our custom classes into a hl-solr.jar, which is copied over to the WEB-INF/lib of the exploded solr war -->
  <copy file="${maven.build.directory}/hl-solr.jar" todir="${maven.webapps.output}/WEB-INF/lib"/>
</target>

Seems to work fine...basically automates what you have described in your second paragraph, but allows us to keep our own code separately from solr code under source control. -sujit On Tue, 2011-08-16 at 16:09 -0700, arian487 wrote: So the way I generate war files now is by running an 'ant dist' in the solr folder. It generates the war fine and I get a build success, and then I deploy it to tomcat and once again the logs show it was successful (from the looks of it). However, when I go to 'myip:8080/solr/admin' I get an HTTP status 404. However, it works when I take a war from the nightly build, expand it, drop some new class files in there that I need, and close it up again. The solr I have checked out seems fine though and I can't find any differences between the war I'm generating and the one that has been generated. -- View this message in context: http://lucene.472066.n3.nabble.com/Problems-generating-war-distribution-using-ant-tp3260070p3260070.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Strip special chars like -
I have done this using a custom tokenfilter that (among other things) detects hyphenated words and converts it to the 3 variations, using a regex match on the incoming token: (\w+)-(\w+) that runs the following regex transform: s/(\w+)-(\w+)/$1$2__$1 $2/ and then splits by __ and passes the original token, the one word and two word versions through a SynonymFilter further down the chain (see Lucene in Action, 2nd Edition for code). -sujit On Tue, 2011-08-09 at 06:27 -0700, roySolr wrote: Hello, I have some terms in my index with specials characters. An example is manchester-united. I want that a user can search for manchester-united,manchester united and manchesterunited. What's the best way to fix this? i have used the patternReplaceFilter and some tokenizers but it couldn't fix the last situation(manchesterunited). Can someone helps me? -- View this message in context: http://lucene.472066.n3.nabble.com/Strip-special-chars-like-tp3238942p3238942.html Sent from the Solr - User mailing list archive at Nabble.com.
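The variant-generation half of that filter, stripped of the Lucene plumbing, is just a regex and a couple of string joins; a sketch (hypothetical class name) is below, and the three variants it returns are what would be fed through the SynonymFilter-style injection further down the chain:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenVariants {

  private static final Pattern HYPHENATED = Pattern.compile("(\\w+)-(\\w+)");

  // "manchester-united" -> [manchester-united, manchesterunited, manchester united]
  public static List<String> variants(String token) {
    Matcher m = HYPHENATED.matcher(token);
    if (!m.matches()) {
      return Arrays.asList(token);  // not hyphenated, pass through unchanged
    }
    String joined = m.group(1) + m.group(2);
    String spaced = m.group(1) + " " + m.group(2);
    return Arrays.asList(token, joined, spaced);
  }

  public static void main(String[] args) {
    System.out.println(variants("manchester-united"));
  }
}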
Re: (Solr-UIMA) Doubt regarding integrating UIMA in to solr - Configuration.
Hi Sowmya, I basically wrote an annotator and built a buffering tokenizer around it so I could include it in a Lucene analyzer pipeline. I've blogged about it, not sure if its good form to include links to blog posts in public forums, but here they are, apologies in advance if this is wrong (let me know and I won't do it again). http://sujitpal.blogspot.com/2011/06/uima-analysis-engine-for-keyword.html http://sujitpal.blogspot.com/2011/06/running-uima-analysis-engine-in-lucene.html Of course, this is in Lucene land. I haven't worked with the SOLR-UIMA stuff so this may not answer your question directly. But I think if you build an Tokenizer or TokenFilter then you can declare it as an analyzer chain in SOLR. HTH Sujit On Fri, 2011-07-08 at 09:19 +0200, Sowmya V.B. wrote: Hi Koji Thanks for the mail. Thanks for all the clarifications. I am now using the version 3.3.. But, another query that I have about this is: How can I add an annotator that I wrote myself, in to Solr-UIMA? Here is what I did before I moved to Solr: I wrote an annotator (which worked when I used plain vanilla lucene based indexer), which enriched the document with more fields (Some statistics about the document...all fields added were numeric fields). Those fields were added to the index by extending *JCasAnnotator_ImplBase* class. But, in Solr-UIMA, I am not exactly clear on where the above setup fits in. I thought I would get an idea looking at the annotators that came with the UIMA integration of Solr, but their source was not available. So, I do not understand how to actually integrate my own annotator in to UIMA. Can you please explain on how to go about this? Sowmya. On Fri, Jul 8, 2011 at 2:03 AM, Koji Sekiguchi k...@r.email.ne.jp wrote: (11/07/07 18:38), Sowmya V.B. wrote: Hi I am trying to add UIMA module in to Solr..and began with the readme file given here. https://svn.apache.org/repos/**asf/lucene/dev/tags/lucene_** solr_3_1/solr/contrib/uima/**README.txthttps://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_1/solr/contrib/uima/README.txt I would recommend you to use Solr 3.3 rather than 3.1, as we have changed some configuration in solrconfig.xml for UIMA. 2. modify your schema.xml adding the fields you want to be hold metadata specifying proper values for type, indexed, stored and multiValued options: -I understood this line as: adding to my schema.xml, the new fields that will come as a result of a UIMA pipeline. For example, in my UIMA pipeline, post-processing, I get fields A,B,C in addition to fields X,Y,Z that I already added to the SolrInputDocument. So, does this mean I should add A,B,C to the schema.xml? I think you got it. Have you tried it but you got some errors? 3. In SolrConfig.xml, inside, uimaConfig runtimeParameters The uimaConfig tag has been moved into update processor setting @ Solr 3.2. Please see the latest README.txt. if iam not using any of those alchemy api key... etc, I think I can remove those lines. However, I plan to use the openNLP tagger tokenizer, and an annotator I wrote for my task. Can I give my model file locations here as runtimeParameters? I don't have an idea of openNLP. 4. I did not understand what fieldMapping tag does. The description said: field mapping describes which features of which types should go in a field-- - For example, in this snippet from the link: type name=org.apache.uima.alchemy.**ts.concept.ConceptFS map feature=text field=concept/ /type -what does feature mean and what does field mean? 
This defines a map from a UIMA feature (http://uima.apache.org/d/uimaj-2.3.1/references.html#ugr.ref.xml.component_descriptor.type_system.features) to a Solr field. koji -- http://www.rondhuit.com/en/
Re: Results with and without whitespace (soccer club and soccerclub)
This may or may not help you; we solved something similar for hyphenated words - essentially when we encountered a hyphenated word (say word1-word2) we send in an OR query with the word (word1-word2) itself, a phrase "word1 word2"~3, and the word formed by removing the hyphen (word1word2). But in this case, soccerclub is not hyphenated, but if you have some kind of mapping of common conjunctions based on your search logs, you could write a custom QParser plugin to break it up like that. -sujit On Fri, 2011-05-20 at 05:52 -0700, roySolr wrote: Thanks for the help so far, I don't think this solves the problem. What if my data look like this: soccer club Manchester united if i search for soccerclub manchester and for soccer club manchester i want this result back. A copyfield that removes whitespaces is not an option. With the charfilter i get something like this: 1. Index time: soccer club Manchester united-- soccerclubManchesterunited indexed. 2. Search time: soccer club OR soccerclub -- soccerclub searched. In this situation i still get no result if i search soccerclub. The index is soccerclubManchesterunited. How can i fix it? -- View this message in context: http://lucene.472066.n3.nabble.com/Results-with-and-without-whitespace-soccer-club-and-soccerclub-tp2934742p2965389.html Sent from the Solr - User mailing list archive at Nabble.com.
Custom sorting based on external (database) data
Hi, Sorry for the possible double post; I wrote this up but had the incorrect sender address, so I am guessing that my previous one is going to be rejected by the list moderation daemon. I am trying to figure out options for the following problem. I am on Solr 1.4.1 (Lucene 2.9.1). I have search results which are going to be ranked by the user (using a thumbs up/down) and would translate to a score between -1 and +1. This data is stored in a database table (unique_id, thumbs_up, thumbs_down, num_calls) that is updated as the thumbs up/down component is clicked. We want to be able to sort the results by the following: score = (thumbs_up - thumbs_down) / num_calls. The unique_id field refers to the one referenced as uniqueId in the schema.xml. Based on the following conversation: http://www.mail-archive.com/solr-user@lucene.apache.org/msg06322.html ...my understanding is that I need to: 1) subclass FieldType to create my own RankFieldType. 2) In this class, override the getSortField() method to return my custom FieldSortComparatorSource object. 3) Build the custom FieldSortComparatorSource object which returns a custom FieldSortComparator object in newComparator(). 4) Configure a field type (rank_t) of class RankFieldType, and a field (called rank) of type rank_t, in schema.xml. 5) Use sort=rank+desc to do the sort. My question is: is there a simpler/more performant way? The number of database lookups seems like it's going to be pretty high with this approach. And it's hard to believe that my problem is new, so I am guessing this is either part of some Solr configuration I am missing, or there is some other (possibly simpler) approach I am overlooking. Pointers to documentation or code (or even keywords I could google) would be much appreciated. TIA for all your help, Sujit
Re: Custom sorting based on external (database) data
Thank you Ahmet, looks like we could use this. Basically we would do periodic dumps of the (unique_id|computed_score) sorted by score and write it out to this file followed by a commit. Found some more info here, for the benefit of others looking for something similar: http://dev.tailsweep.com/solr-external-scoring/ On Thu, 2011-05-05 at 13:12 -0700, Ahmet Arslan wrote: --- On Thu, 5/5/11, Sujit Pal sujit@comcast.net wrote: From: Sujit Pal sujit@comcast.net Subject: Custom sorting based on external (database) data To: solr-user solr-user@lucene.apache.org Date: Thursday, May 5, 2011, 11:03 PM Hi, Sorry for the possible double post, I wrote this up but had the incorrect sender address, so I am guessing that my previous one is going to be rejected by the list moderation daemon. I am trying to figure out options for the following problem. I am on Solr 1.4.1 (Lucene 2.9.1). I have search results which are going to be ranked by the user (using a thumbs up/down) and would translate to a score between -1 and +1. This data is stored in a database table ( unique_id thumbs_up thumbs_down num_calls as the thumbs up/down component is clicked. We want to be able to sort the results by the following score = (thumbs_up - thumbs_down) / (num_calls). The unique_id field refers to the one referenced as uniqueId in the schema.xml. Based on the following conversation: http://www.mail-archive.com/solr-user@lucene.apache.org/msg06322.html ...my understanding is that I need to: 1) subclass FieldType to create my own RankFieldType. 2) In this class I override the getSortField() method to return my custom FieldSortComparatorSource object. 3) Build the custom FieldSortComparatorSource object which returns a custom FieldSortComparator object in newComparator(). 4) Configure the field type of class RankFieldType (rank_t), and a field (called rank) of field type rank_t in schema.xml of type RankFieldType. 5) use sort=rank+desc to do the sort. My question is: is there a simpler/more performant way? The number of database lookups seems like its going to be pretty high with this approach. And its hard to believe that my problem is new, so I am guessing this is either part of some Solr configuration I am missing, or there is some other (possibly simpler) approach I am overlooking. Pointers to documentation or code (or even keywords I could google) would be much appreciated. Looks like it can be done with http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html and http://wiki.apache.org/solr/FunctionQuery You can dump your table into three text files. Issue a commit to load these changes. Sort by function query is available in Solr3.1 though.
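To make the periodic dump concrete, a sketch like the one below (not from the thread) would do it. It assumes a ratings table shaped like the earlier post, a hypothetical field named rank declared as an ExternalFileField in schema.xml, and an output file named external_rank placed where ExternalFileField looks for it; check the ExternalFileField javadocs for the exact directory and key=value file format in your version. A commit afterwards makes Solr re-read the file, and the function-query sort mentioned above can then reference the field.

import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExternalRankDumper {

  public static void main(String[] args) throws Exception {
    // Hypothetical locations; adjust to your environment.
    Path out = Path.of("/var/solr/data/mycore/data/external_rank");
    String jdbcUrl = "jdbc:postgresql://localhost/ratings";

    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "pass");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT unique_id, (thumbs_up - thumbs_down) / CAST(num_calls AS FLOAT) AS score "
             + "FROM ratings WHERE num_calls > 0");
         PrintWriter w = new PrintWriter(Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
      // ExternalFileField expects one key=value line per document key.
      while (rs.next()) {
        w.println(rs.getString("unique_id") + "=" + rs.getFloat("score"));
      }
    }
    // Then issue a commit so Solr picks up the new file, e.g.
    // curl "http://localhost:8983/solr/mycore/update?commit=true"
  }
}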
Hook to do stuff when searcher is reopened?
Hi, I am developing a SearchComponent that needs to build some initial DocSets and then intersect with the result DocSet during each query (in process()). When the searcher is reopened, I need to regenerate the initial DocSets. I am on Solr 1.4.1. My question is, which method in SearchComponent should I override to ensure that this regeneration happens whenever the searcher is reopened (for example in response to an update followed by a commit)? If no such hook method exists, how would this need to be done? Thanks Sujit
Re: Hook to do stuff when searcher is reopened?
I think I found the answer by looking through the code...specifically SpellCheckComponent. So my component would have to implement SolrCoreAware and in the inform() method, register a custom SolrEventListener which will execute the regeneration code in the postCommit and newSearcher methods. Would still appreciate knowing if there is a simpler way, or if I am wildly off the mark. Thanks Sujit On Thu, 2011-04-07 at 16:39 -0700, Sujit Pal wrote: Hi, I am developing a SearchComponent that needs to build some initial DocSets and then intersect with the result DocSet during each query (in process()). When the searcher is reopened, I need to regenerate the initial DocSets. I am on Solr 1.4.1. My question is, which method in SearchComponent should I override to ensure that this regeneration happens whenever the searcher is reopened (for example in response to an update followed by a commit)? If no such hook method exists, how would this need to be done? Thanks Sujit
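The SpellCheckComponent-style approach described above could be sketched roughly as follows. This targets a recent Solr API (on 1.4.1 the SearchComponent and listener interfaces differ slightly), and names like regenerate() and the match-all placeholder DocSet stand in for whatever the component really needs to build.

import java.io.IOException;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.plugin.SolrCoreAware;

public class DocSetFilterComponent extends SearchComponent implements SolrCoreAware {

  private volatile DocSet baseDocSet; // regenerated whenever a searcher is (re)opened

  @Override
  public void inform(SolrCore core) {
    SolrEventListener listener = new SolrEventListener() {
      public void init(NamedList args) { }
      @Override
      public void postCommit() { }      // a commit that reopens the searcher triggers newSearcher
      @Override
      public void postSoftCommit() { }
      @Override
      public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
        baseDocSet = regenerate(newSearcher);
      }
    };
    core.registerFirstSearcherListener(listener); // initial searcher at startup
    core.registerNewSearcherListener(listener);   // every reopen after a commit
  }

  private DocSet regenerate(SolrIndexSearcher searcher) {
    try {
      // Placeholder: replace with the component's real DocSet-building logic.
      return searcher.getDocSet(new MatchAllDocsQuery());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void prepare(ResponseBuilder rb) { }

  @Override
  public void process(ResponseBuilder rb) {
    // Intersect the query's result DocSet with baseDocSet here.
  }

  @Override
  public String getDescription() {
    return "Rebuilds cached DocSets when the searcher is reopened (sketch)";
  }
}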
Re: Hook to do stuff when searcher is reopened?
Thanks Erick. This looks like it would work... I sent out an update to my original query, there is another approach that would probably also work for my case that is being used by SpellCheckerComponent. I will check out both approaches. Thanks very much for your help. -sujit On Thu, 2011-04-07 at 20:58 -0400, Erick Erickson wrote: I haven't built one myself, but have you considered the Solr UserCache? See: http://wiki.apache.org/solr/SolrCaching#User.2BAC8-Generic_Caches It even receives warmup signals I believe... Best Erick On Thu, Apr 7, 2011 at 7:39 PM, Sujit Pal sujit@comcast.net wrote: Hi, I am developing a SearchComponent that needs to build some initial DocSets and then intersect with the result DocSet during each query (in process()). When the searcher is reopened, I need to regenerate the initial DocSets. I am on Solr 1.4.1. My question is, which method in SearchComponent should I override to ensure that this regeneration happens whenever the searcher is reopened (for example in response to an update followed by a commit)? If no such hook method exists, how would this need to be done? Thanks Sujit
Re: Solr and Permissions
Yes, there can be cases where a user is allowed a subset of a content type, or a combination of content-type groups and individual documents, where this would break down. And yes, AFAIK, if you want to update the permissions in the document (which seems slightly strange, since you would potentially have many more users than documents, so you may want to think this requirement through some more), you would need to update (re-index) the document. -sujit On Thu, 2011-03-10 at 21:24 -0800, go canal wrote: I have similar requirements. Content type is one solution; but there are also other use cases where this is not enough. Another requirement is, when the access permission is changed, we need to update the field - my understanding is we can not unless re-index the whole document again. Am I correct? thanks, canal From: Sujit Pal sujit@comcast.net To: solr-user@lucene.apache.org Sent: Fri, March 11, 2011 10:39:27 AM Subject: Re: Solr and Permissions How about assigning content types to documents in the index, and map users to a set of content types they are allowed to access? That way you will pass in fewer parameters in the fq. -sujit On Fri, 2011-03-11 at 11:53 +1100, Liam O'Boyle wrote: Morning, We use solr to index a range of content to which, within our application, access is restricted by a system of user groups and permissions. In order to ensure that search results don't reveal information about items which the user doesn't have access to, we need to somehow filter the results; this needs to be done within Solr itself, rather than after retrieval, so that the facet and result counts are correct. Currently we do this by creating a filter query which specifies all of the items which may be allowed to match (e.g. id: (foo OR bar OR blarg OR ...)), but this has definite scalability issues - we're starting to run into issues, as this can be a set of ORs of potentially unlimited size (and practically, we're hitting the low thousands sometimes). While we can adjust maxBooleanClauses upwards, I understand that this has performance implications... So, has anyone had to implement something similar in the past? Any suggestions for a more scalable approach? Any advice on safe and sensible limits on how far I can push maxBooleanClauses? Thanks for your advice, Liam
Any way to do payload queries in Luke?
Hello, I am denormalizing a map of (string, float) into a single Lucene document by storing it as key1|score1 key2|score2. In Solr, I pull this in using the following analyzer definition:

<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldtype>

I have my own PayloadSimilarity which overrides scorePayload. The index is created by POSTing Solr XML to Solr. In Solr, I have a custom QParser that converts any query containing a field of type payloads into a PayloadTermQuery instead of a TermQuery (multiple sub-queries are combined using a BooleanQuery). However, in Luke, when I put my custom PayloadSimilarity and a custom PayloadAnalyzer (equivalent to the chain above) in the classpath and enter the same field:value query, the results don't come back ordered by the payload score. I do set the analyzer to my payload analyzer and the similarity to my payload similarity. I guess this is expected, as there is no way (that I know of anyway) for me to tell Luke that this is a PayloadTermQuery rather than a TermQuery. So the question is - can I use some special syntax to indicate to Luke that the query should be converted to a PayloadTermQuery? I don't think Luke can figure it out based on the definition (in Luke I see the field defined as ITS). Thanks Sujit
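Outside of Luke, one way to verify the payload ordering is a few lines of Lucene code that build the PayloadTermQuery directly. The sketch below is against the Lucene 4.x-era payload API (PayloadTermQuery was removed in later releases); PayloadSimilarity is the poster's custom similarity, and the index path, field name, and term are placeholders.

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.FSDirectory;

public class PayloadQueryCheck {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new PayloadSimilarity()); // the custom similarity from this thread

    // includeSpanScore=false so the ordering is driven purely by the payload score.
    PayloadTermQuery q = new PayloadTermQuery(
        new Term("payloads", "key1"), new AveragePayloadFunction(), false);
    TopDocs hits = searcher.search(q, 10);
    for (ScoreDoc sd : hits.scoreDocs) {
      System.out.println(sd.doc + " " + sd.score);
    }
    reader.close();
  }
}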
Re: Solr and Permissions
How about assigning content types to documents in the index, and map users to a set of content types they are allowed to access? That way you will pass in fewer parameters in the fq. -sujit On Fri, 2011-03-11 at 11:53 +1100, Liam O'Boyle wrote: Morning, We use solr to index a range of content to which, within our application, access is restricted by a system of user groups and permissions. In order to ensure that search results don't reveal information about items which the user doesn't have access to, we need to somehow filter the results; this needs to be done within Solr itself, rather than after retrieval, so that the facet and result counts are correct. Currently we do this by creating a filter query which specifies all of the items which may be allowed to match (e.g. id: (foo OR bar OR blarg OR ...)), but this has definite scalability issues - we're starting to run into issues, as this can be a set of ORs of potentially unlimited size (and practically, we're hitting the low thousands sometimes). While we can adjust maxBooleanClauses upwards, I understand that this has performance implications... So, has anyone had to implement something similar in the past? Any suggestions for a more scalable approach? Any advice on safe and sensible limits on how far I can push maxBooleanClauses? Thanks for your advice, Liam
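To make that suggestion concrete, the filter query built from a user's allowed content types might look like the SolrJ sketch below. The field name content_type, the type names, and the way the allowed types are looked up are all hypothetical; the point is that one short fq replaces thousands of OR'ed ids, and Solr caches it independently of the main query.

import java.util.List;
import java.util.stream.Collectors;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PermissionFilteredSearch {

  public static QueryResponse search(String userQuery, List<String> allowedTypes) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
      SolrQuery q = new SolrQuery(userQuery);
      // e.g. content_type:("public" OR "memo") -- one cached filter per permission profile.
      String fq = allowedTypes.stream()
          .map(t -> "\"" + t + "\"")
          .collect(Collectors.joining(" OR ", "content_type:(", ")"));
      q.addFilterQuery(fq);
      q.setFacet(true);
      q.addFacetField("content_type");
      return solr.query(q);
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical: this user may only see public docs and internal memos.
    QueryResponse rsp = search("quarterly report", List.of("public", "memo"));
    System.out.println(rsp.getResults().getNumFound());
  }
}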
Re: Understanding multi-field queries with q and fq
This could probably be done using a custom QParser plugin? Define the pattern like this: String queryTemplate = "title:%Q%^2.0 body:%Q%"; then replace the %Q% with the value of the Q param, send it through QueryParser.parse() and return the query. -sujit On Wed, 2011-03-02 at 11:28 -0800, mrw wrote: Anyone understand how to do boolean logic across multiple fields? Dismax is nice for searching multiple fields, but doesn't necessarily support our syntax requirements. eDismax appears to be not available until Solr 3.1. In the meantime, it looks like we need to support applying the user's query to multiple fields, so if the user enters led zeppelin merle we need to be able to do the logical equivalent of fq=field1:led zeppelin merle OR field2:led zeppelin merle Any ideas? :) mrw wrote: After searching this list, Google, and looking through the Pugh book, I am a little confused about the right way to structure a query. The Packt book uses the example of the MusicBrainz DB full of song metadata. What if they also had the song lyrics in English and German as files on disk, and wanted to index them along with the metadata, so that each document would basically have song title, artist, publisher, date, ..., All_Metadata (copy field of all metadata fields), Text_English, and Text_German fields? There can only be one default field, correct? So if we want to search for all songs containing (zeppelin AND (dog OR merle)) do we repeat the entire query text for all three major fields in the 'q' clause (assuming we don't want to use the cache): q=(+All_Metadata:(zeppelin AND (dog OR merle)) +Text_English:(zeppelin AND (dog OR merle)) +Text_German:(zeppelin AND (dog OR merle))) or repeat the entire query text for all three major fields in the 'fq' clause (assuming we want to use the cache): q=*:*&fq=(+All_Metadata:(zeppelin AND (dog OR merle)) +Text_English:(zeppelin AND (dog OR merle)) +Text_German:(zeppelin AND (dog OR merle))) ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Understanding-multi-field-queries-with-q-and-fq-tp2528866p2619700.html Sent from the Solr - User mailing list archive at Nabble.com.
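A sketch of that template idea as a QParserPlugin follows, written against a recent Solr API (on 1.4/3.x the exception types differ, and from Solr 3.1 onward edismax covers this out of the box). The field names and boosts are illustrative, and parentheses are added around %Q% so a multi-word query stays bound to each field.

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class FieldTemplateQParserPlugin extends QParserPlugin {

  // Illustrative template: apply the user's query to both fields, boosting title.
  private static final String TEMPLATE = "title:(%Q%)^2.0 body:(%Q%)";

  public void init(NamedList args) { }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        String expanded = TEMPLATE.replace("%Q%", qstr);
        // Delegate the expanded string to the standard lucene query parser.
        return QParser.getParser(expanded, "lucene", getReq()).parse();
      }
    };
  }
}

Registered in solrconfig.xml and invoked as {!template}, it would expand "led zeppelin merle" into title:(led zeppelin merle)^2.0 body:(led zeppelin merle).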
Re: Solr Payloads retrieval
Yes, check out the field type payloads in the schema.xml file. If you set up one or more of your fields as type payloads (you would use the DelimitedPayloadTokenFilterFactory during indexing in your analyzer chain), you can then use the PayloadTermQuery to query it with; scoring can be done with a custom PayloadSimilarity implementation. Check out this (slightly dated) article for more information: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ -sujit On Mon, 2011-02-28 at 14:49 -0300, Fabiano Nunes wrote: Hi! I'm studying a migration from pure Lucene to Solr, but I need a crucial feature: Is it possible to retrieve payloads from Solr? I'm storing the coordinates of each term in its payload to highlight images client-side. Thank you,
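For the scoring side mentioned above, a custom PayloadSimilarity could look like the sketch below. It targets the Lucene 4.x DefaultSimilarity/TFIDFSimilarity API; the scorePayload signature is different on Solr 1.4-era Lucene, and the payload scoring classes changed again in later releases, so treat this as illustrative only.

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class PayloadSimilarity extends DefaultSimilarity {

  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    // Decode the float that DelimitedPayloadTokenFilter stored after the '|' delimiter.
    if (payload == null) {
      return 1.0f;
    }
    return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
  }
}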
Re: loading XML docbook files into solr
Hi Derek, The XML files you post to Solr need to be in the correct Solr-specific XML format. One way to preserve the original structure would be to flatten the document into field names indicating the position of the text, for example: book_titleabbrev: Advancing Return on Investment Analysis for Government IT: A Public Value Framework ... etc. But you will still have to parse your docbook XML into the appropriate schema that you want to use for Solr. I believe DIH also allows XSLT-based preprocessors so you don't have to write parsing code, but I haven't used them. -sujit On Sat, 2011-02-26 at 10:40 -0500, Derek Werthmuller wrote: I've been working on this for a while and seem to have hit a wall. The error messages aren't complete enough to give guidance on why importing a sample docbook document into Solr is not working. I'm using the curl tool to post the XML file and receive a non-error message, but the document count doesn't increase and the *:* query still returns no results. The docbook document has an id attribute and this is mapped to the uniqueKey in the schema.xml file. But it seems this may be the issue still. It's not clear how the field names map to the XML. Do they only map to attributes, or do they map to elements? How do you differentiate? Can field names in the schema.xml file have xpath statements? Are there other important sections of the solrconfig that could be keeping this from working? We want to maintain much of the document structure so we have more control over the searching. Here is what the docbook XML looks like (tried setting the uniqueKey to id and docid but no go either way):

<book label="issuebriefs" id="proi">
  <docid>245</docid>
  <titleabbrev>Advancing Return on Investment Analysis for Government IT: A Public Value Framework</titleabbrev>
  <chapter>
    <title>Advancing Return on Investment Analysis for Government IT: A Public Value Framework</title>
    <para>
      <mediaobject>
        <imageobject>
          <imagedata fileref="/publications/annualreports/ar2006/images/public-value.jpg" format="jpg" contentdepth="157" contentwidth="216" align="left"/>
        </imageobject>
        <textobject>
          <phrase>Public Value Illustration</phrase>
        </textobject>
      </mediaobject>
      ..

Here is the section of the schema.xml:

  <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
  <field name="titleabbrev" type="text" indexed="true" stored="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="para" type="text" indexed="true" stored="true" />
  <field name="ulink" type="string" indexed="true" stored="true" />
  <field name="listitem" type="text" indexed="true" stored="true" />
  <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
  <copyField source="title" dest="all_text" />
  <copyField source="para" dest="all_text" />
  <copyField source="listitem" dest="all_text" />
  <copyField source="titleabbrev" dest="all_text" />
</fields>

<!-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required=false, it will be a required field -->
<uniqueKey>id</uniqueKey>

<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>all_text</defaultSearchField>

<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
</schema>

Load command results:

$ ./postfile.sh
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">56</int></lst>
</response>
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">15</int></lst>
</response>

Thanks Derek
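If DIH/XSLT is not appealing, the flattening suggested above can also be done with a small SolrJ program that parses the DocBook file with XPath and posts a Solr document with the flattened field names. The field names, XPath expressions, and core URL below are illustrative only.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;

public class DocBookIndexer {

  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // Skip fetching the DocBook DTD so parsing works offline.
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    Document docbook = dbf.newDocumentBuilder().parse(new File(args[0]));
    XPath xpath = XPathFactory.newInstance().newXPath();

    // Flatten selected DocBook elements/attributes into Solr fields.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", xpath.evaluate("/book/@id", docbook));
    doc.addField("titleabbrev", xpath.evaluate("/book/titleabbrev", docbook).trim());
    doc.addField("title", xpath.evaluate("/book/chapter/title", docbook).trim());
    doc.addField("para", xpath.evaluate("string(/book/chapter/para)", docbook).trim());

    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docbook").build()) {
      solr.add(doc);
      solr.commit();
    }
  }
}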
Re: manually editing spellcheck dictionary
If the dictionary is a Lucene index, wouldn't it be as simple as deleting by term? Something like this (Lucene 2.9/3.x API; spellIndexDir is the path to the spellcheck dictionary index):

IndexReader sdreader = IndexReader.open(FSDirectory.open(new File(spellIndexDir)), false); // open read-write
sdreader.deleteDocuments(new Term("word", "sherri"));
...
sdreader.close();

I am guessing your dictionary is built dynamically using content words. If so, you may want to run the words through an aspell-like filter (jazzy.sf.net is a Java implementation of aspell that works quite well with single words) to determine if more of these should be removed, and whether they should be added in the first place. -sujit On Fri, 2011-02-25 at 10:41 -0700, Tanner Postert wrote: I'm using an index-based spellcheck dictionary and I was wondering if there were a way for me to manually remove certain words from the dictionary. Some of my content has some mis-spellings, and for example when I search for the word sherrif (which should be spelled sheriff), I get recommendations like sherriff or sherri instead. If I could remove those words, it would seem like the system would work a little better.
Re: boosting results by a query?
We are currently a Lucene shop; the way we do it is to have these results come from a database table (where they are available in rank order). We want to move to Solr, so what I plan on doing to replicate this functionality is to write a custom request handler that will do the database query and put the results at the top of the search results before the SolrIndexSearcher is invoked. -sujit On Fri, 2011-02-11 at 16:31 -0500, Ryan McKinley wrote: I have an odd need, and want to make sure I am not reinventing a wheel... Similar to the QueryElevationComponent, I need to be able to move documents that match a given query to the top of a list. If there were no sort, then this could be implemented easily with BooleanQuery (I think), but with sort it gets more complicated. Seems like I need: sortSpec.setSort(new Sort(new SortField[] { new SortField(/* something that only sorts results in the boost query */), new SortField(/* the regular sort */) })); Is there an existing FieldComparator I should look at? Any other pointers/ideas? Thanks ryan
Re: Architecture decisions with Solr
Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database, or an external cache that the database changes are published to periodically), then applying this access filter as a filter on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: "This application will be built to serve many users" If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app, through which they authenticate, that talks to Solr (i.e. all requests are filtered using their ID). -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes is independent of the others, right? Or would one index have access to the info of the other? My requirement is as you mention: a client has access only to his or her search data, based on their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (a search appliance that you would make) for each client? If there's no data sharing across clients, then using the same Solr server/index doesn't seem necessary. Solr will easily meet your needs though, it's the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and having one index per client account. The reason for this is that security is achieved by having a separate index for each client etc. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Is it better to go the faceted search capabilities route? Thanks for your help Greg