Re: Language support

2016-08-23 Thread Walter Underwood
Synonyms are also domain specific. A synonym set for one area may be completely 
wrong in another.

In cooking, arugula and rocket are the same thing. In military or aerospace, 
missile and rocket are very similar.

I would start with librarians. They maintain controlled vocabularies (called 
“thesauri”). Usually, a thesaurus has the official classification terms but 
also has “entry terms”. The entry terms are alternate terms that are used to 
get to the primary term.

For example, the category might be “electric vehicle”, but an entry term could 
be “zero emission vehicle”.
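
In Solr terms, entry terms map naturally onto one-way rules in a
synonyms.txt file used with SynonymFilterFactory. A sketch (the mappings
are only illustrations, and domain-specific, per the rocket example above):

   # entry term => preferred vocabulary term
   zero emission vehicle => electric vehicle
   # sensible only for a cooking index:
   rocket => arugula

The left-hand terms are rewritten to the right-hand term at analysis time.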

Good luck. I had a hard time finding thesauri online a few years ago.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 23, 2016, at 7:38 AM, Bradley Belyeu  
> wrote:
> 
> Hi, I’m trying to find a synonym list for any of the following languages:
> Catalan, Farsi, Hindi, Korean, Latvian, Dutch, Romanian, Thai, and Turkish.
> Does anyone know of resources where I can get a synonym list for these 
> languages?



Language support

2016-08-23 Thread Bradley Belyeu
Hi, I’m trying to find a synonym list for any of the following languages:
Catalan, Farsi, Hindi, Korean, Latvian, Dutch, Romanian, Thai, and Turkish.
Does anyone know of resources where I can get a synonym list for these 
languages?


Re: What are the best practices on Multiple Language support in Solr Cloud ?

2014-05-05 Thread shamik
Thanks Nicole. Leveraging dynamic field definitions is a great idea. It will
probably work for me, as I have a bunch of fields that are indexed as String.
Just curious about the sharding: are you using SolrCloud? I thought of taking
the dedicated shard / core route, but since I'm using a composite key (for
dedup), managing dedicated cores can cause issues at times.

As far as the single-field representation goes, thanks for validating my
concern. It's probably best used when you have to address multi-lingual search.





Re: What are the best practices on Multiple Language support in Solr Cloud ?

2014-05-02 Thread Nicole Lacoste
Hi Shamik,

I don't have an answer for you, just a couple of comments.

Why not use dynamic field definitions in the schema? As you say, most of
your fields are not analysed; you just add a language tag (_en, _fr, _de,
...) to the field name when you index or query.  Then you can add languages as
you need without having to touch the schema.  For fields that you do
analyse (stop words or synonyms) you'll have to explicitly define a
field type for them.  My experience with docs that are in two or three main
languages is that single core versus multi-core has not been that critical;
sharding and replication made a bigger difference for us.  You could put
English in one core and everything else in another.
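
A sketch of what those dynamic field definitions could look like in
schema.xml (the type names are assumptions; point them at whatever analysed
types you have defined):

   <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
   <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
   <dynamicField name="*_de" type="text_de" indexed="true" stored="true"/>

With these in place, title_en, title_fr and title_de all resolve to the
right per-language analysis without further schema changes.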

What we tried to do was just index everything to the same field, that is,
French and English both getting indexed into the contents or title field (we
have our own tokenizer and filter chain, so we did actually analyse them
differently), but we got into lots of problems with tf-idf, so I'd advise
against doing that. The motivation was that we wanted multi-lingual results.
Terry's approach here is much better, and as you thought it addresses the
multi-lingual requirement, but I still don't think it totally addresses the
tf-idf problem. So if you don't need multilingual results, don't go that route.

I am curious to see what other people think.

Niki


What are the best practices on Multiple Language support in Solr Cloud ?

2014-04-30 Thread Shamik Bandopadhyay
Hi,

  I'm trying to implement multiple language support in Solr Cloud (4.7).
Although we have different languages in the index, we were only supporting
English in terms of indexing and querying. To provide some context, our
current index size is 35 GB with close to 15 million documents. We have two
shards with two replicas per shard. I'm using a composite id to support
de-duplication, which routes documents having the same field (dedup) value
to a specific shard.
The language is known up front for every document being indexed, which saves
the need for runtime language detection. Similarly, during query, the
language will be known as well. Beyond that, there's no need for
multi-lingual support.

Based on my understanding so far, there are three approaches that are
widely adopted: multi-field indexing, multi-core indexing, and multiple
languages in one field (from Solr in Action).

The first option seems easy to implement. But I have around 40 fields being
indexed currently, though a majority of them are type=string and not
analyzed. I'm planning to support around 10 languages, which translates to
400 field definitions in the same schema, and this is poised to grow with
the addition of languages and fields. My apprehension is whether this
approach becomes a maintenance nightmare. Does it affect overall
scalability? Does it affect existing features like Suggester, Spellcheck,
etc.? I was thinking of including language as part of the id key. It would
look like Language!Dedup_id!url so that documents are spread across the two
shards.
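
For illustration, such ids would look something like this (the values are
made up):

   en!a1b2c3d4!http://example.com/docs/intro
   fr!a1b2c3d4!http://example.com/docs/intro

With the compositeId router the leading components drive the hash, so
documents sharing the same language and dedup value land on the same shard.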

The second option, a dedicated core per language, sounds easy in terms of
maintaining config files. Also, routing requests will be fairly easy as the
language will always be known up front, both at indexing and query time. But
looking at the documents, 60% of our total index will be in English, while
the other 40% will be spread over the remaining 10-14 languages. Some
languages have only a few thousand documents, which perhaps doesn't merit a
dedicated core. On top of that, this approach has the potential of growing
into a complex infrastructure, which might be hard to maintain.

I read about the use of multiple languages in a single field in Trey
Grainger's book. It looks like a great approach, but I'm not sure it is
meant to address my scenario. My first impression is that it's more geared
towards multi-lingual search within one query, but I may be completely
wrong. Also, this is not supported by Solr / Lucene out of the box.

I know there are a lot of people in this group who have excelled at
supporting multiple languages in Solr. I'm trying to gather their input and
experience on best practices to help me decide on the right approach. Any
pointers will be highly appreciated.

Thanks,
Shamik


Re: eDisMax, multiple language support and stopwords

2013-11-11 Thread Liu Bo
Happy to see someone with a solution similar to ours.

we have a similar multi-language search feature and we index different
language content into _fr, _en fields like you've done

but for search we need a language code as a parameter to specify which
language the client wants to search in, normally decided by the website
visited, such as: qf=name description&language=en

and in our search components we find the right fields, name_en and
description_en, to search on
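
In other words, a request like this (a sketch; the language parameter is
our own convention, not a stock Solr parameter) is rewritten before it
reaches the query parser:

   /select?q=shoes&qf=name description&language=en
     is rewritten to
   /select?q=shoes&qf=name_en description_en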

we used to support searching across all languages but removed that later:
the site tells the customer which languages are supported, and we also
don't think many users of our web sites know more than two languages and
need to search them at the same time


On 7 November 2013 23:01, Tom Mortimer tom.m.f...@gmail.com wrote:

 Ah, thanks Markus. I think I'll just add the Boolean operators to the
 stopwords list in that case.

 Tom



 On 7 November 2013 12:01, Markus Jelsma markus.jel...@openindex.io
 wrote:

  This is an ancient problem. The issue here is your mm-parameter, it gets
  confused because for separate fields different amount of tokens are
  filtered/emitted so it is never going to work just like this. The easiest
  option is not to use the stopfilter.
 
 
 
 http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
  https://issues.apache.org/jira/browse/SOLR-3085
 
  -Original message-
   From:Tom Mortimer tom.m.f...@gmail.com
   Sent: Thursday 7th November 2013 12:50
   To: solr-user@lucene.apache.org
   Subject: eDisMax, multiple language support and stopwords
  
   Hi all,
  
   Thanks for the help and advice I've got here so far!
  
   Another question - I want to support stopwords at search time, so that
  e.g.
   the query oscar and wilde is equivalent to oscar wilde (this is
 with
   lowercaseOperators=false). Fair enough, I have stopword and in the
  query
   analyser chain.
  
   However, I also need to support French as well as English, so I've got
  _en
   and _fr versions of the text fields, with appropriate stemming and
   stopwords. I index French content into the _fr fields and English into
  the
   _en fields. I'm searching with eDisMax over both versions, e.g.:
  
   <str name="qf">headline_en headline_fr</str>
  
   However, this means I get no results for oscar and wilde. The parsed
   query is:
  
   (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
   DisjunctionMaxQuery((headline_fr:and))
   DisjunctionMaxQuery((headline_fr:wild |
 headline_en:wild)))~3))/no_coord
  
   If I add and to the French stopwords list, I *do* get results, and
 the
   parsed query is:
  
   (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
   DisjunctionMaxQuery((headline_fr:wild |
 headline_en:wild)))~2))/no_coord
  
   This implies that the only solution is to have a minimal, shared
  stopwords
   list for all languages I want to support. Is this correct, or is there
 a
   way of supporting this kind of searching with per-language stopword
  lists?
  
   Thanks for any ideas!
  
   Tom
  
 




-- 
All the best

Liu Bo


eDisMax, multiple language support and stopwords

2013-11-07 Thread Tom Mortimer
Hi all,

Thanks for the help and advice I've got here so far!

Another question - I want to support stopwords at search time, so that e.g.
the query "oscar and wilde" is equivalent to "oscar wilde" (this is with
lowercaseOperators=false). Fair enough: I have the stopword "and" in the
query analyser chain.

However, I also need to support French as well as English, so I've got _en
and _fr versions of the text fields, with appropriate stemming and
stopwords. I index French content into the _fr fields and English into the
_en fields. I'm searching with eDisMax over both versions, e.g.:

<str name="qf">headline_en headline_fr</str>

However, this means I get no results for "oscar and wilde". The parsed
query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:and))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord

If I add "and" to the French stopwords list, I *do* get results, and the
parsed query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord

This implies that the only solution is to have a minimal, shared stopwords
list for all languages I want to support. Is this correct, or is there a
way of supporting this kind of searching with per-language stopword lists?

Thanks for any ideas!

Tom


RE: eDisMax, multiple language support and stopwords

2013-11-07 Thread Markus Jelsma
This is an ancient problem. The issue here is your mm parameter: it gets
confused because for the separate fields different numbers of tokens are
filtered/emitted, so it is never going to work just like this. The easiest
option is not to use the stop filter.

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
https://issues.apache.org/jira/browse/SOLR-3085
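
In schema terms, dropping the stop filter means a query analyzer along
these lines (a sketch, assuming an otherwise standard chain), so that
headline_en and headline_fr emit the same number of tokens and the mm
clause counts line up:

   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <!-- no StopFilterFactory here: token counts now match across fields -->
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>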
 
-Original message-
 From:Tom Mortimer tom.m.f...@gmail.com
 Sent: Thursday 7th November 2013 12:50
 To: solr-user@lucene.apache.org
 Subject: eDisMax, multiple language support and stopwords
 
 Hi all,
 
 Thanks for the help and advice I've got here so far!
 
 Another question - I want to support stopwords at search time, so that e.g.
 the query oscar and wilde is equivalent to oscar wilde (this is with
 lowercaseOperators=false). Fair enough, I have stopword and in the query
 analyser chain.
 
 However, I also need to support French as well as English, so I've got _en
 and _fr versions of the text fields, with appropriate stemming and
 stopwords. I index French content into the _fr fields and English into the
 _en fields. I'm searching with eDisMax over both versions, e.g.:
 
  <str name="qf">headline_en headline_fr</str>
 
 However, this means I get no results for oscar and wilde. The parsed
 query is:
 
 (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
 DisjunctionMaxQuery((headline_fr:and))
 DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord
 
 If I add and to the French stopwords list, I *do* get results, and the
 parsed query is:
 
 (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
 DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord
 
 This implies that the only solution is to have a minimal, shared stopwords
 list for all languages I want to support. Is this correct, or is there a
 way of supporting this kind of searching with per-language stopword lists?
 
 Thanks for any ideas!
 
 Tom
 


Re: eDisMax, multiple language support and stopwords

2013-11-07 Thread Tom Mortimer
Ah, thanks Markus. I think I'll just add the Boolean operators to the
stopwords list in that case.

Tom



On 7 November 2013 12:01, Markus Jelsma markus.jel...@openindex.io wrote:

 This is an ancient problem. The issue here is your mm-parameter, it gets
 confused because for separate fields different amount of tokens are
 filtered/emitted so it is never going to work just like this. The easiest
 option is not to use the stopfilter.


 http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
 https://issues.apache.org/jira/browse/SOLR-3085

 -Original message-
  From:Tom Mortimer tom.m.f...@gmail.com
  Sent: Thursday 7th November 2013 12:50
  To: solr-user@lucene.apache.org
  Subject: eDisMax, multiple language support and stopwords
 
  Hi all,
 
  Thanks for the help and advice I've got here so far!
 
  Another question - I want to support stopwords at search time, so that
 e.g.
  the query oscar and wilde is equivalent to oscar wilde (this is with
  lowercaseOperators=false). Fair enough, I have stopword and in the
 query
  analyser chain.
 
  However, I also need to support French as well as English, so I've got
 _en
  and _fr versions of the text fields, with appropriate stemming and
  stopwords. I index French content into the _fr fields and English into
 the
  _en fields. I'm searching with eDisMax over both versions, e.g.:
 
  <str name="qf">headline_en headline_fr</str>
 
  However, this means I get no results for oscar and wilde. The parsed
  query is:
 
  (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
  DisjunctionMaxQuery((headline_fr:and))
  DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord
 
  If I add and to the French stopwords list, I *do* get results, and the
  parsed query is:
 
  (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
  DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord
 
  This implies that the only solution is to have a minimal, shared
 stopwords
  list for all languages I want to support. Is this correct, or is there a
  way of supporting this kind of searching with per-language stopword
 lists?
 
  Thanks for any ideas!
 
  Tom
 



Re: copyField at search time / multi-language support

2011-03-29 Thread lboutros
Tom,

to solve this kind of problem, if I understand it well, you could extend the
query parser to support something like meta-fields. I'm currently developing
a QueryParser Plugin to support a specific syntax. The support of
meta-fields to search on different fields (multiple languages) is one of the
functionalities that this parser will contain.

Ludovic.

2011/3/29 Markus Jelsma-2 [via Lucene]
ml-node+2747011-315348515-383...@n3.nabble.com

 I haven't tried this as an UpdateProcessor but it relies on Tika and that
 LanguageIdentifier works well, except for short texts.

  Thanks Markus.

  Do you know if this patch is good enough for production use? Thanks.

  Andy

  --- On Tue, 3/29/11, Markus Jelsma markus.jel...@openindex.io wrote:
   From: Markus Jelsma markus.jel...@openindex.io
   Subject: Re: copyField at search time / multi-language support
   To: solr-user@lucene.apache.org
   Cc: Andy
   Date: Tuesday, March 29, 2011, 1:29 AM
   https://issues.apache.org/jira/browse/SOLR-1979

    Tom,

    Could you share the method you use to perform language detection? Any
    open source tools that do that?

    Thanks.

    --- On Mon, 3/28/11, Tom Mortimer t...@flax.co.uk wrote:
     From: Tom Mortimer t...@flax.co.uk
     Subject: copyField at search time / multi-language support
     To: solr-user@lucene.apache.org
     Date: Monday, March 28, 2011, 4:45 AM
     Hi,

     Here's my problem: I'm indexing a corpus with text in a variety of
     languages. I'm planning to detect these at index time and send the
     text to one of a suitably-configured field (e.g. mytext_de for
     German, mytext_cjk for Chinese/Japanese/Korean etc.)

     At search time I want to search all of these fields. However, there
     will be at least 12 of them, which could lead to a very long query
     string. (Also I need to use the standard query parser rather than
     dismax, for full query syntax.)

     Therefore I was wondering if there was a way to copy fields at search
     time, so I can have my mytext query in a single field and have it
     copied to mytext_de, mytext_cjk etc. Something like:

       <copyQueryField source="mytext" dest="mytext_de" />
       <copyQueryField source="mytext" dest="mytext_cjk" />
       ...

     If this is not currently possible, could someone give me some pointers
     for hacking Solr to support it? Should I subclass solr.SearchHandler?
     I know nothing about Solr internals at the moment...

     thanks,
     Tom




-
Jouve
France.

Re: copyField at search time / multi-language support

2011-03-29 Thread Erick Erickson
This may not be all that helpful, but have you looked at edismax?
https://issues.apache.org/jira/browse/SOLR-1553

It allows the full Solr query syntax while preserving the goodness of
dismax.
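
For this use case that would mean a request along these lines (a sketch;
the field names follow Tom's example):

   q=oscar wilde&defType=edismax&qf=mytext_en mytext_fr mytext_de mytext_cjk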

This is standard equipment on 3.1, which is being released even as we
speak, and I also know it's being used in production situations.

If going to 3.1 is not an option, I know people have applied that patch
to 1.4.1, but haven't done it myself.

Best
Erick

On Mon, Mar 28, 2011 at 4:45 AM, Tom Mortimer t...@flax.co.uk wrote:
 Hi,

 Here's my problem: I'm indexing a corpus with text in a variety of
 languages. I'm planning to detect these at index time and send the
 text to one of a suitably-configured field (e.g. mytext_de for
 German, mytext_cjk for Chinese/Japanese/Korean etc.)

 At search time I want to search all of these fields. However, there
 will be at least 12 of them, which could lead to a very long query
 string. (Also I need to use the standard query parser rather than
 dismax, for full query syntax.)

 Therefore I was wondering if there was a way to copy fields at search
 time, so I can have my mytext query in a single field and have it
 copied to mytext_de, mytext_cjk etc. Something like:

   <copyQueryField source="mytext" dest="mytext_de" />
   <copyQueryField source="mytext" dest="mytext_cjk" />
  ...

 If this is not currently possible, could someone give me some pointers
 for hacking Solr to support it? Should I subclass solr.SearchHandler?
 I know nothing about Solr internals at the moment...

 thanks,
 Tom



copyField at search time / multi-language support

2011-03-28 Thread Tom Mortimer
Hi,

Here's my problem: I'm indexing a corpus with text in a variety of
languages. I'm planning to detect these at index time and send the
text to one of a suitably-configured field (e.g. mytext_de for
German, mytext_cjk for Chinese/Japanese/Korean etc.)

At search time I want to search all of these fields. However, there
will be at least 12 of them, which could lead to a very long query
string. (Also I need to use the standard query parser rather than
dismax, for full query syntax.)

Therefore I was wondering if there was a way to copy fields at search
time, so I can have my mytext query in a single field and have it
copied to mytext_de, mytext_cjk etc. Something like:

   <copyQueryField source="mytext" dest="mytext_de" />
   <copyQueryField source="mytext" dest="mytext_cjk" />
  ...

If this is not currently possible, could someone give me some pointers
for hacking Solr to support it? Should I subclass solr.SearchHandler?
I know nothing about Solr internals at the moment...

thanks,
Tom


Re: copyField at search time / multi-language support

2011-03-28 Thread Gora Mohanty
On Mon, Mar 28, 2011 at 2:15 PM, Tom Mortimer t...@flax.co.uk wrote:
 Hi,

 Here's my problem: I'm indexing a corpus with text in a variety of
 languages. I'm planning to detect these at index time and send the
 text to one of a suitably-configured field (e.g. mytext_de for
 German, mytext_cjk for Chinese/Japanese/Korean etc.)


 At search time I want to search all of these fields. However, there
 will be at least 12 of them, which could lead to a very long query
 string. (Also I need to use the standard query parser rather than
 dismax, for full query syntax.)

Sorry, unable to understand this. Are you detecting the language,
and based on that, indexing to one of mytext_de, mytext_cjk, etc.,
or does each field have mixed languages? If the former, why could
you not also detect the language at query time (or, have separate
query sources for users of different languages), and query the
appropriate field based on the known language to be searched?
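
(For instance, a sketch: a French front end could just send
q=mytext_fr:(oscar wilde) instead of fanning the query out over all
twelve fields.)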

 Therefore I was wondering if there was a way to copy fields at search
 time, so I can have my mytext query in a single field and have it
 copied to mytext_de, mytext_cjk etc. Something like:

   <copyQueryField source="mytext" dest="mytext_de" />
   <copyQueryField source="mytext" dest="mytext_cjk" />
  ...

 If this is not currently possible, could someone give me some pointers
 for hacking Solr to support it? Should I subclass solr.SearchHandler?
 I know nothing about Solr internals at the moment...
[...]

This is not possible as far as I know, and would be quite inefficient.

Regards,
Gora


Re: copyField at search time / multi-language support

2011-03-28 Thread Andy
Tom,

Could you share the method you use to perform language detection? Any open 
source tools that do that?

Thanks.

--- On Mon, 3/28/11, Tom Mortimer t...@flax.co.uk wrote:

 From: Tom Mortimer t...@flax.co.uk
 Subject: copyField at search time / multi-language support
 To: solr-user@lucene.apache.org
 Date: Monday, March 28, 2011, 4:45 AM
 Hi,
 
 Here's my problem: I'm indexing a corpus with text in a
 variety of
 languages. I'm planning to detect these at index time and
 send the
 text to one of a suitably-configured field (e.g.
 mytext_de for
 German, mytext_cjk for Chinese/Japanese/Korean etc.)
 
 At search time I want to search all of these fields.
 However, there
 will be at least 12 of them, which could lead to a very
 long query
 string. (Also I need to use the standard query parser
 rather than
 dismax, for full query syntax.)
 
 Therefore I was wondering if there was a way to copy fields
 at search
 time, so I can have my mytext query in a single field and
 have it
 copied to mytext_de, mytext_cjk etc. Something like:
 
     <copyQueryField source="mytext" dest="mytext_de" />
     <copyQueryField source="mytext" dest="mytext_cjk" />
    ...
 
 If this is not currently possible, could someone give me
 some pointers
 for hacking Solr to support it? Should I subclass
 solr.SearchHandler?
 I know nothing about Solr internals at the moment...
 
 thanks,
 Tom
 


   


Re: copyField at search time / multi-language support

2011-03-28 Thread Markus Jelsma
https://issues.apache.org/jira/browse/SOLR-1979

 Tom,
 
 Could you share the method you use to perform language detection? Any open
 source tools that do that?
 
 Thanks.
 
 --- On Mon, 3/28/11, Tom Mortimer t...@flax.co.uk wrote:
  From: Tom Mortimer t...@flax.co.uk
  Subject: copyField at search time / multi-language support
  To: solr-user@lucene.apache.org
  Date: Monday, March 28, 2011, 4:45 AM
  Hi,
  
  Here's my problem: I'm indexing a corpus with text in a
  variety of
  languages. I'm planning to detect these at index time and
  send the
  text to one of a suitably-configured field (e.g.
  mytext_de for
  German, mytext_cjk for Chinese/Japanese/Korean etc.)
  
  At search time I want to search all of these fields.
  However, there
  will be at least 12 of them, which could lead to a very
  long query
  string. (Also I need to use the standard query parser
  rather than
  dismax, for full query syntax.)
  
  Therefore I was wondering if there was a way to copy fields
  at search
  time, so I can have my mytext query in a single field and
  have it
  copied to mytext_de, mytext_cjk etc. Something like:
  
  <copyQueryField source="mytext" dest="mytext_de" />
  <copyQueryField source="mytext" dest="mytext_cjk" />
    ...
  
  If this is not currently possible, could someone give me
  some pointers
  for hacking Solr to support it? Should I subclass
  solr.SearchHandler?
  I know nothing about Solr internals at the moment...
  
  thanks,
  Tom


Re: copyField at search time / multi-language support

2011-03-28 Thread Andy
Thanks Markus.

Do you know if this patch is good enough for production use? Thanks.

Andy

--- On Tue, 3/29/11, Markus Jelsma markus.jel...@openindex.io wrote:

 From: Markus Jelsma markus.jel...@openindex.io
 Subject: Re: copyField at search time / multi-language support
 To: solr-user@lucene.apache.org
 Cc: Andy angelf...@yahoo.com
 Date: Tuesday, March 29, 2011, 1:29 AM
 https://issues.apache.org/jira/browse/SOLR-1979
 
  Tom,

  Could you share the method you use to perform language detection? Any
  open source tools that do that?

  Thanks.

  --- On Mon, 3/28/11, Tom Mortimer t...@flax.co.uk wrote:
   From: Tom Mortimer t...@flax.co.uk
   Subject: copyField at search time / multi-language support
   To: solr-user@lucene.apache.org
   Date: Monday, March 28, 2011, 4:45 AM
   Hi,

   Here's my problem: I'm indexing a corpus with text in a variety of
   languages. I'm planning to detect these at index time and send the
   text to one of a suitably-configured field (e.g. mytext_de for
   German, mytext_cjk for Chinese/Japanese/Korean etc.)

   At search time I want to search all of these fields. However, there
   will be at least 12 of them, which could lead to a very long query
   string. (Also I need to use the standard query parser rather than
   dismax, for full query syntax.)

   Therefore I was wondering if there was a way to copy fields at search
   time, so I can have my mytext query in a single field and have it
   copied to mytext_de, mytext_cjk etc. Something like:

     <copyQueryField source="mytext" dest="mytext_de" />
     <copyQueryField source="mytext" dest="mytext_cjk" />
     ...

   If this is not currently possible, could someone give me some pointers
   for hacking Solr to support it? Should I subclass solr.SearchHandler?
   I know nothing about Solr internals at the moment...

   thanks,
   Tom
 





Re: copyField at search time / multi-language support

2011-03-28 Thread Markus Jelsma
I haven't tried this as an UpdateProcessor but it relies on Tika and that 
LanguageIdentifier works well, except for short texts.

 Thanks Markus.

 Do you know if this patch is good enough for production use? Thanks.

 Andy

 --- On Tue, 3/29/11, Markus Jelsma markus.jel...@openindex.io wrote:
  From: Markus Jelsma markus.jel...@openindex.io
  Subject: Re: copyField at search time / multi-language support
  To: solr-user@lucene.apache.org
  Cc: Andy angelf...@yahoo.com
  Date: Tuesday, March 29, 2011, 1:29 AM
  https://issues.apache.org/jira/browse/SOLR-1979

   Tom,

   Could you share the method you use to perform language detection? Any
   open source tools that do that?

   Thanks.

   --- On Mon, 3/28/11, Tom Mortimer t...@flax.co.uk wrote:
    From: Tom Mortimer t...@flax.co.uk
    Subject: copyField at search time / multi-language support
    To: solr-user@lucene.apache.org
    Date: Monday, March 28, 2011, 4:45 AM
    Hi,

    Here's my problem: I'm indexing a corpus with text in a variety of
    languages. I'm planning to detect these at index time and send the
    text to one of a suitably-configured field (e.g. mytext_de for
    German, mytext_cjk for Chinese/Japanese/Korean etc.)

    At search time I want to search all of these fields. However, there
    will be at least 12 of them, which could lead to a very long query
    string. (Also I need to use the standard query parser rather than
    dismax, for full query syntax.)

    Therefore I was wondering if there was a way to copy fields at search
    time, so I can have my mytext query in a single field and have it
    copied to mytext_de, mytext_cjk etc. Something like:

      <copyQueryField source="mytext" dest="mytext_de" />
      <copyQueryField source="mytext" dest="mytext_cjk" />
      ...

    If this is not currently possible, could someone give me some pointers
    for hacking Solr to support it? Should I subclass solr.SearchHandler?
    I know nothing about Solr internals at the moment...

    thanks,
    Tom


Re: Help on Multi-language support

2011-03-06 Thread Jan Høydahl
Go with the one-doc-per-title approach, with different fields for each
language (title_name_ch, title_name_es). Your application needs to handle
which fields to query and return.
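
A sketch of what that could look like in the schema (the type names are
illustrative):

   <field name="title_name_en" type="text_en" indexed="true" stored="true"/>
   <field name="title_name_es" type="text_es" indexed="true" stored="true"/>
   <field name="title_name_ch" type="text_cjk" indexed="true" stored="true"/>

The application then picks, say, title_name_es for both querying and
display when the user's preferred language is Spanish.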

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 5. mars 2011, at 00.44, cyang2010 wrote:

 Hi,
 
 I wonder how Solr can satisfy our multi-language requirement.

 For example, for movie/TV series titles, we require that, based on the
 user's preferred language, the user is able to get back title names (and
 actor and director names) in the selected language; for example,
 getTitlesByGenreId.  On the other hand, we also support search by title
 name, actor or director name.

 Therefore, how shall I design the Solr schema to accommodate the
 requirement?

 Here is my current schema, without consideration of the language:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 Someone recommended storing a new doc for the same title in each different
 language.  The Solr doc would then have a language field to denote what
 language the doc is for.  I don't think this can work, right?  All the
 title_name values in different languages would share the same index/query
 analyzer, since they are put into the same field.  For different languages
 (ASCII-based vs Asian languages), you will need different analyzers, right?

 The way I see this working is to add extra fields for the language-related
 content, such as title_name_ch (Chinese), title_name_es (Spanish), etc.
 Then my application logic needs to know which language-specific field to
 query based on the user's language preference.  What do you think?

 In summary, shall I go with a duplicate doc per title per language, or
 shall I just change my schema to accommodate those additional
 language-related fields?

 Thanks.  Your help is appreciated.
 
 
 



Help on Multi-language support

2011-03-04 Thread cyang2010
Hi,

I wonder how Solr can satisfy our multi-language requirement.

For example, for movie/TV series titles, we require that, based on the
user's preferred language, the user is able to get back title names (and
actor and director names) in the selected language; for example,
getTitlesByGenreId.  On the other hand, we also support search by title
name, actor or director name.

Therefore, how shall I design the Solr schema to accommodate the
requirement?

Here is my current schema, without consideration of the language:














Someone recommended storing a new doc for the same title in each different
language.  The Solr doc would then have a language field to denote what
language the doc is for.  I don't think this can work, right?  All the
title_name values in different languages would share the same index/query
analyzer, since they are put into the same field.  For different languages
(ASCII-based vs Asian languages), you will need different analyzers, right?

The way I see this working is to add extra fields for the language-related
content, such as title_name_ch (Chinese), title_name_es (Spanish), etc.
Then my application logic needs to know which language-specific field to
query based on the user's language preference.  What do you think?

In summary, shall I go with a duplicate doc per title per language, or
shall I just change my schema to accommodate those additional
language-related fields?

Thanks.  Your help is appreciated.





Re: Help on Multi-language support

2011-03-04 Thread cyang2010
This is the solr schema:
















slovene language support

2010-07-19 Thread Markus Goldbach
Hi,

I want to set up Solr with support for several languages.
The language list includes Slovene; unfortunately, I found nothing about it
in the wiki.
Has anyone experience with Solr 1.4 and Slovene?

thanks for help
Markus

Re: slovene language support

2010-07-19 Thread Robert Muir
Hello,

There is some information here (a prototype stemmer) about support in
Snowball. But Martin Porter had some unanswered questions/reservations, so
nothing ever got added to Snowball:

http://snowball.tartarus.org/archives/snowball-discuss/0725.html

Of course you could take that stemmer and generate Java code with the
Snowball code generator and use it, but it seems like it would be best for
those issues to get resolved and get it fixed/included in Snowball itself...

On Mon, Jul 19, 2010 at 10:42 AM, Markus Goldbach markus.goldb...@gmail.com
 wrote:

 Hi,

 I want to setup an solr with support for several languages.
 The language list includes slovene, unfortunately I found nothing about it
 in the wiki.
 Has some one experiences with solr 1.4 and slovene?

 thanks for help
 Markus




-- 
Robert Muir
rcm...@gmail.com


Polish language support?

2010-07-09 Thread Peter Wolanin
In IRC, trying to help someone find Polish-language support for Solr.

It seems Lucene has nothing to offer? I found one stemmer that looks to be
compatibly licensed, in case someone wants to take a shot at incorporating
it:  http://www.getopt.org/stempel/

-Peter
-Peter

-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Re: Polish language support?

2010-07-09 Thread Robert Muir
Hi Peter,

this stemmer is integrated into trunk and 3x.

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/stempel/


On Fri, Jul 9, 2010 at 2:38 PM, Peter Wolanin peter.wola...@acquia.comwrote:

 In IRC trying to help someone find Polish-language support for Solr.

 Seems lucene has nothing to offer?  Found one stemmer that looks to be
 compatibly licensed in case someone wants to take a shot at
 incorporating it:  http://www.getopt.org/stempel/

 -Peter

 --
 Peter M. Wolanin, Ph.D.
 Momentum Specialist,  Acquia. Inc.
 peter.wola...@acquia.com




-- 
Robert Muir
rcm...@gmail.com


Re: Hindi language support in solr

2010-01-22 Thread Ranveer kumar
Hi Robert,

Thanks for the reply.
As you suggested, I used textgen but am still not able to search Hindi text.
I might be missing some important configuration.
The following is my schema.xml configuration:

 <fieldType name="textgen" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
         words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="1" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
         ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory"
         ignoreCase="true"
         words="stopwords.txt"
         enablePositionIncrements="true"
         />
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="1" generateNumberParts="1" catenateWords="0"
         catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

<fields>
   <field name="id" type="string" indexed="true" stored="true"
       required="true" />

   <field name="cat_name" type="textgen" indexed="true" stored="true" />
   <field name="title" type="textgen" indexed="true" stored="true" />
   <field name="summary" type="textgen" indexed="true" stored="true" />

   <field name="textgen" type="textgen" indexed="true" stored="false"
       multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>textgen</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="title" dest="textgen"/>
<copyField source="cat_name" dest="textgen"/>
<copyField source="summary" dest="textgen"/>


In the summary field there are Hindi keywords.
Please help.

thanks
with regards
Ranveer K Kumar

On Thu, Jan 21, 2010 at 11:25 PM, Robert Muir rcm...@gmail.com wrote:

 hello, take a look at field type textgen (a general unstemmed text field)

 the whitespacetokenizer + worddelimiterfilter used by this type will
 work correctly for hindi tokenization and punctuation.

 On Thu, Jan 21, 2010 at 10:55 AM, Ranveer kumar
 ranveer.k.ku...@gmail.com wrote:
  Hi all,
 
  I am very new in solr.
  I download latest release 1.4 and install. For Indexing and Searching I
 am
  using SolrJ api.
  My Question is How to enable solr to search hindi language text ?.
  Please Help me..
 
  thanks
  with regards
  Ranveer K Kumar
 



 --
 Robert Muir
 rcm...@gmail.com



Hindi language support in solr

2010-01-21 Thread Ranveer kumar
Hi all,

I am very new to Solr.
I downloaded the latest release, 1.4, and installed it. For indexing and
searching I am using the SolrJ API.
My question is: how do I enable Solr to search Hindi-language text?
Please help me.

thanks
with regards
Ranveer K Kumar


Re: Hindi language support in solr

2010-01-21 Thread Robert Muir
hello, take a look at the field type textgen (a general unstemmed text field)

the WhitespaceTokenizer + WordDelimiterFilter used by this type will
work correctly for Hindi tokenization and punctuation.

On Thu, Jan 21, 2010 at 10:55 AM, Ranveer kumar
ranveer.k.ku...@gmail.com wrote:
 Hi all,

 I am very new in solr.
 I download latest release 1.4 and install. For Indexing and Searching I am
 using SolrJ api.
 My Question is How to enable solr to search hindi language text ?.
 Please Help me..

 thanks
 with regards
 Ranveer K Kumar




-- 
Robert Muir
rcm...@gmail.com


Re: Multi language support

2010-01-13 Thread Robert Muir
Right, but we should not encourage users to significantly degrade
overall relevance for all movies due to a few movies and a band (very
special cases, as I said).

In English, not using stopwords doesn't really degrade relevance that
much, so it's a reasonable decision to make. This is not true in other
languages!

Instead, systems that worry about all-stopword queries should use
CommonGrams. It will work better for these cases, without taking away
from overall relevance.

On Wed, Jan 13, 2010 at 1:08 AM, Walter Underwood wun...@wunderwood.org wrote:
 There is a band named The The. And a producer named Don Was. For a list 
 of all-stopword movie titles at Netflix, see this post:

 http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

 My favorite is To Be and To Have (Être et Avoir), which is all stopwords in 
 two languages. And a very good movie.

 wunder

 On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

 sorry, i forgot to include this 2009 paper comparing what stopwords do
 across 3 languages:

 http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

 in my opinion, if stopwords annoy your users for very special cases
 like 'the the' then, instead consider using commongrams +
 defaultsimilarity.discountOverlaps = true so that you still get the
 benefits.

 as you can see from the above paper, they can be extremely important
 depending on the language, they just don't matter so much for English.

 On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:
 There are a lot of projects that don't use stopwords any more. You
 might consider dropping them altogether.

 On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:
 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type can
 use an English stemmer, and list of stopwords and synonyms. We, however
 did not use specific stopwords, instead we used one list shared by both
 languages.

 So you would have a field type like:
 fieldType name=en_text class=solr.TextField ...
  analyzer type=
  filter class=solr.StopFilterFactory words=stopwords.en.txt
  filter class=solr.SynonymFilterFactory synonyms=synoyms.en.txt

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

 Hi Solr users.

 I'm trying to set up a site with Solr search integrated. And I use the
 SolJava API to feed the index with search documents. At the moment I
 have only activated search on the English portion of the site. I'm
 interested in using as many features of solr as possible. Synonyms,
 Stopwords and stems all sounds quite interesting and useful but how do
 I set up this in a good way for a multilingual site?

 The site don't have a huge text mass so performance issues don't
 really bother me but still I'd like to hear your suggestions before I
 try to implement an solution.

 Best regards

 Daniel





 --
 Lance Norskog
 goks...@gmail.com




 --
 Robert Muir
 rcm...@gmail.com






-- 
Robert Muir
rcm...@gmail.com


Re: Multi language support

2010-01-13 Thread Paul Libbrecht
Isn't the conclusion here that stopword-free and stemming-free matching
should be the best match, if any, and that we should then gently degrade
to weaker forms of matching?


paul


Le 13-janv.-10 à 07:08, Walter Underwood a écrit :

 There is a band named The The. And a producer named Don Was. For a
 list of all-stopword movie titles at Netflix, see this post:

 http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

 My favorite is To Be and To Have (Être et Avoir), which is all
 stopwords in two languages. And a very good movie.

 wunder

 On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

  sorry, i forgot to include this 2009 paper comparing what stopwords do
  across 3 languages:

  http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

  in my opinion, if stopwords annoy your users for very special cases
  like 'the the' then, instead consider using commongrams +
  defaultsimilarity.discountOverlaps = true so that you still get the
  benefits.

  as you can see from the above paper, they can be extremely important
  depending on the language, they just don't matter so much for English.

  On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:

   There are a lot of projects that don't use stopwords any more. You
   might consider dropping them altogether.

   On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:

    This is the way I've implemented multilingual search as well.

    2010/1/11 Markus Jelsma mar...@buyways.nl

     Hello,

     We have implemented language specific search in Solr using language
     specific fields and field types. For instance, an en_text field type
     can use an English stemmer, and list of stopwords and synonyms. We,
     however did not use specific stopwords, instead we used one list
     shared by both languages.

     So you would have a field type like:
     <fieldType name="en_text" class="solr.TextField" ...>
      <analyzer type="...">
      <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>

     etc etc.

     Cheers,

     -
     Markus Jelsma          Buyways B.V.
     Technisch Architect    Friesestraatweg 215c
     http://www.buyways.nl  9743 AD Groningen

     Alg. 050-853 6600      KvK  01074105
     Tel. 050-853 6620      Fax. 050-3118124
     Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17

     On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

      Hi Solr users.

      I'm trying to set up a site with Solr search integrated. And I use
      the SolJava API to feed the index with search documents. At the
      moment I have only activated search on the English portion of the
      site. I'm interested in using as many features of solr as possible.
      Synonyms, Stopwords and stems all sounds quite interesting and
      useful but how do I set up this in a good way for a multilingual
      site?

      The site don't have a huge text mass so performance issues don't
      really bother me but still I'd like to hear your suggestions before
      I try to implement an solution.

      Best regards

      Daniel

   --
   Lance Norskog
   goks...@gmail.com

  --
  Robert Muir
  rcm...@gmail.com







Re: Multi language support

2010-01-13 Thread Lance Norskog
Robert Muir: Thank you for the pointer to that paper!

On Wed, Jan 13, 2010 at 6:29 AM, Paul Libbrecht p...@activemath.org wrote:
 Isn't the conclusion here that some stopword and stemming free matching
 should be the best match if ever and to then gently degrade to  weaker forms
 of matching?

 paul


 Le 13-janv.-10 à 07:08, Walter Underwood a écrit :

 There is a band named The The. And a producer named Don Was. For a
 list of all-stopword movie titles at Netflix, see this post:

 http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

 My favorite is To Be and To Have (Être et Avoir), which is all stopwords
 in two languages. And a very good movie.

 wunder

 On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

 sorry, i forgot to include this 2009 paper comparing what stopwords do
 across 3 languages:


 http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

 in my opinion, if stopwords annoy your users for very special cases
 like 'the the' then, instead consider using commongrams +
 defaultsimilarity.discountOverlaps = true so that you still get the
 benefits.

 as you can see from the above paper, they can be extremely important
 depending on the language, they just don't matter so much for English.

 On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:

 There are a lot of projects that don't use stopwords any more. You
 might consider dropping them altogether.

 On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:

 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type
 can
 use an English stemmer, and list of stopwords and synonyms. We,
 however
 did not use specific stopwords, instead we used one list shared by
 both
 languages.

 So you would have a field type like:
 <fieldType name="en_text" class="solr.TextField" ...>
 <analyzer type="...">
 <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

 Hi Solr users.

 I'm trying to set up a site with Solr search integrated. And I use
 the
 SolJava API to feed the index with search documents. At the moment I
 have only activated search on the English portion of the site. I'm
 interested in using as many features of solr as possible. Synonyms,
 Stopwords and stems all sounds quite interesting and useful but how
 do
 I set up this in a good way for a multilingual site?

 The site don't have a huge text mass so performance issues don't
 really bother me but still I'd like to hear your suggestions before I
 try to implement an solution.

 Best regards

 Daniel





 --
 Lance Norskog
 goks...@gmail.com




 --
 Robert Muir
 rcm...@gmail.com







-- 
Lance Norskog
goks...@gmail.com


Re: Multi language support

2010-01-12 Thread Lance Norskog
There are a lot of projects that don't use stopwords any more. You
might consider dropping them altogether.

On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:
 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type can
 use an English stemmer, and list of stopwords and synonyms. We, however
 did not use specific stopwords, instead we used one list shared by both
 languages.

 So you would have a field type like:
 <fieldType name="en_text" class="solr.TextField" ...>
  <analyzer type="...">
  <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

  Hi Solr users.
 
  I'm trying to set up a site with Solr search integrated. And I use the
  SolJava API to feed the index with search documents. At the moment I
  have only activated search on the English portion of the site. I'm
  interested in using as many features of solr as possible. Synonyms,
  Stopwords and stems all sounds quite interesting and useful but how do
  I set up this in a good way for a multilingual site?
 
  The site don't have a huge text mass so performance issues don't
  really bother me but still I'd like to hear your suggestions before I
  try to implement an solution.
 
  Best regards
 
  Daniel





-- 
Lance Norskog
goks...@gmail.com


Re: Multi language support

2010-01-12 Thread Robert Muir
I don't think this is something to consider across the board for all
languages. The same grammatical units that are part of a word in one
language (and removed by stemmers) are independent morphemes in others
(and should be stopwords)

so please take this advice on a case-by-case basis for each language.

On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:
 There are a lot of projects that don't use stopwords any more. You
 might consider dropping them altogether.

 On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:
 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type can
 use an English stemmer, and list of stopwords and synonyms. We, however
 did not use specific stopwords, instead we used one list shared by both
 languages.

 So you would have a field type like:
 <fieldType name="en_text" class="solr.TextField" ...>
  <analyzer type="...">
  <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

  Hi Solr users.
 
  I'm trying to set up a site with Solr search integrated. And I use the
  SolJava API to feed the index with search documents. At the moment I
  have only activated search on the English portion of the site. I'm
  interested in using as many features of solr as possible. Synonyms,
  Stopwords and stems all sounds quite interesting and useful but how do
  I set up this in a good way for a multilingual site?
 
  The site don't have a huge text mass so performance issues don't
  really bother me but still I'd like to hear your suggestions before I
  try to implement an solution.
 
  Best regards
 
  Daniel





 --
 Lance Norskog
 goks...@gmail.com




-- 
Robert Muir
rcm...@gmail.com


Re: Multi language support

2010-01-12 Thread Robert Muir
sorry, i forgot to include this 2009 paper comparing what stopwords do
across 3 languages:

http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

in my opinion, if stopwords annoy your users for very special cases
like 'the the' then, instead consider using commongrams +
defaultsimilarity.discountOverlaps = true so that you still get the
benefits.

as you can see from the above paper, they can be extremely important
depending on the language, they just don't matter so much for English.
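
In schema terms that is roughly (a sketch; the query side usually uses
CommonGramsQueryFilterFactory instead):

   <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
       ignoreCase="true"/>

It emits bigrams such as the_the alongside the original tokens, so
all-stopword titles stay searchable without removing stopwords from
everything else.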

On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:
 There are a lot of projects that don't use stopwords any more. You
 might consider dropping them altogether.

 On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:
 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type can
 use an English stemmer, and list of stopwords and synonyms. We, however
 did not use specific stopwords, instead we used one list shared by both
 languages.

 So you would have a field type like:
 <fieldType name="en_text" class="solr.TextField" ...>
  <analyzer type="...">
  <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

  Hi Solr users.
 
  I'm trying to set up a site with Solr search integrated. And I use the
  SolJava API to feed the index with search documents. At the moment I
  have only activated search on the English portion of the site. I'm
  interested in using as many features of solr as possible. Synonyms,
  Stopwords and stems all sounds quite interesting and useful but how do
  I set up this in a good way for a multilingual site?
 
  The site don't have a huge text mass so performance issues don't
  really bother me but still I'd like to hear your suggestions before I
  try to implement an solution.
 
  Best regards
 
  Daniel





 --
 Lance Norskog
 goks...@gmail.com




-- 
Robert Muir
rcm...@gmail.com
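
For reference, the commongrams approach suggested above can be wired into a
field type roughly like this (a sketch; the type name is illustrative,
CommonGramsFilterFactory ships with Solr, and the discountOverlaps similarity
setting is configured separately and varies by Solr version):

  <fieldType name="text_cg" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- emits pairs like "the_the" alongside the plain tokens, so
           stopword-only phrases such as 'the the' stay searchable -->
      <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
              ignoreCase="true"/>
    </analyzer>
  </fieldType>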


Re: Multi language support

2010-01-12 Thread Walter Underwood
There is a band named The The. And a producer named Don Was. For a list of 
all-stopword movie titles at Netflix, see this post:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

My favorite is To Be and To Have (Être et Avoir), which is all stopwords in 
two languages. And a very good movie.

wunder

On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

 in my opinion, if stopwords annoy your users in very special cases
 like 'the the', then consider instead using commongrams +
 DefaultSimilarity.discountOverlaps = true so that you still get the
 benefits.
 



Multi language support

2010-01-11 Thread Daniel Persson
Hi Solr users.

I'm trying to set up a site with Solr search integrated, and I use the
Solr Java API to feed the index with search documents. At the moment I
have only activated search on the English portion of the site. I'm
interested in using as many features of Solr as possible. Synonyms,
stopwords and stems all sound quite interesting and useful, but how do
I set this up in a good way for a multilingual site?

The site doesn't have a huge text mass, so performance issues don't
really bother me, but still I'd like to hear your suggestions before I
try to implement a solution.

Best regards

Daniel


Re: Multi language support

2010-01-11 Thread Markus Jelsma
Hello,


We have implemented language-specific search in Solr using language-specific
fields and field types. For instance, an en_text field type can use an
English stemmer and a list of stopwords and synonyms. We, however, did not
use language-specific stopwords; instead we used one list shared by both
languages.

So you would have a field type like:

<fieldType name="en_text" class="solr.TextField" ...>
  <analyzer type="...">
    <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>

etc etc.



Cheers,

-  
Markus Jelsma  Buyways B.V.
Technisch ArchitectFriesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen   


Alg. 050-853 6600  KvK  01074105
Tel. 050-853 6620  Fax. 050-3118124
Mob. 06-5025 8350  In: http://www.linkedin.com/in/markus17


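
Spelled out, the per-language setup Markus describes might look like this in
schema.xml (a sketch; the file names, positionIncrementGap, and the added
Snowball stemmer are illustrative, not his exact configuration):

  <fieldType name="en_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- one stopword/synonym file per language -->
      <filter class="solr.StopFilterFactory" words="stopwords.en.txt"
              ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"
              ignoreCase="true" expand="true"/>
      <!-- English stemmer -->
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>

  <field name="title_en" type="en_text" indexed="true" stored="true"/>

A nl_text or de_text type would differ only in the stopword/synonym files and
the stemmer language.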


Re: Multi language support

2010-01-11 Thread Don Werve
This is the way I've implemented multilingual search as well.




Re: Multi-language support

2009-04-14 Thread Grant Ingersoll


On Apr 9, 2009, at 7:09 AM, revas wrote:


Hi,

To reframe my earlier question:

Some languages have just an analyzer but no stemmer from Snowball/Porter;
does the analyzer then take care of stemming as well?

Some languages only have the stemmer from Snowball but no analyzer.

Some have both.

Can we say then that Solr supports all the above languages? Will search
behave the same across all the above cases?


I just responded to the earlier question, but it didn't contain this
question.  No, I wouldn't say that search would be the same.  Stemmed
vs. non-stemmed fields may give different results, just as one stemmer
implementation's results will differ from another's.



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search
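
For a language that has a Snowball stemmer but no dedicated analyzer, the
usual answer is to assemble a field type from a generic tokenizer plus the
stemmer filter (a sketch; Portuguese is just an example of such a language):

  <fieldType name="text_pt" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- the language attribute takes the Snowball stemmer name -->
      <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>
    </analyzer>
  </fieldType>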



Multi-language support

2009-04-09 Thread revas
Hi,

To reframe my earlier question:

Some languages have just an analyzer but no stemmer from Snowball/Porter;
does the analyzer then take care of stemming as well?

Some languages only have the stemmer from Snowball but no analyzer.

Some have both.

Can we say then that Solr supports all the above languages? Will search
behave the same across all the above cases?

thanks
revas


Re: Multiple language support

2008-12-29 Thread Otis Gospodnetic
Hi,

The problem is that a single document (and even a field, in your case) is
multilingual. Ideally you'd detect the different languages within a document
and apply a different tokenizer/filter chain to each part of the field. So the
first part would be handled as English, and the second part as Chinese. At
search time you would have to find the language of the query one way or
another, and again apply the appropriate analyzer. If the right analyzer is
applied, you could match even this multilingual field. None of the existing
analyzers/tokenizers/filters are capable of handling a single piece of text in
multiple languages, so you will have to create a custom analyzer that is smart
enough to do that.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Deshpande, Mukta mudes...@ptc.com
 To: solr-user@lucene.apache.org
 Sent: Monday, December 29, 2008 4:52:19 AM
 Subject: Multiple language support 
 
 Hi All,
 
 I have a schema supporting multiple languages, in which there is a separate
 field for every language.
 
 I have a field product_name to store a product name and its description,
 which can be in any user-preferred language.
 This can be stored in product_name_EN if the user prefers English, or in
 product_name_SCH if the user prefers Simplified Chinese.
 The WhitespaceTokenizerFactory and the EnglishPorterFilterFactory filter
 are applied to product_name_EN.
 The CJKAnalyzer and CJKTokenizer are applied to product_name_SCH.
 
 e.g. a value can be: ElectrolyticCapacitor - 被对立的电容器以价值220µF
 
 Now my problem is: in which field do I store the above value?
 product_name_EN, product_name_SCH, or should it be something else?
 
 How do I find out which analyzers should get applied for this field?
 
 Did anyone face a similar situation before?
 Please help ASAP.
 
 Thanks,
 ~Mukta
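
If you do write such a custom analyzer, plugging it into the schema is the
easy part (a sketch; com.example.MultilingualAnalyzer is a hypothetical class
you would have to implement yourself):

  <fieldType name="text_multi" class="solr.TextField">
    <!-- hypothetical analyzer that detects the language of each run of
         text and applies the matching tokenizer/filter chain -->
    <analyzer class="com.example.MultilingualAnalyzer"/>
  </fieldType>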



RE: Language support

2008-03-20 Thread nicolas . dessaigne
You may be interested in a recent discussion that took place on a similar
subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html

Nicolas



Re: Language support

2008-03-20 Thread David King
 You may be interested in a recent discussion that took place on a similar
 subject:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html


Interesting, yes. But since it doesn't actually exist, it's not much help.


I guess what I'm asking is: if my approach seems convoluted, I'm probably
doing it wrong, so how *are* people solving the problem of searching over
multiple languages? What is the canonical way to do this?










Re: Language support

2008-03-20 Thread Benson Margulies
Unless you can come up with language-neutral tokenization and stemming, you
need to:

a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.
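
In Solr terms, (c) and (d) usually reduce to pointing the query at the field
that was analyzed for the query's language, for example with per-language
fields and dismax (parameter values are illustrative, URL-encoding omitted):

  English query:  q=boot&defType=dismax&qf=title_en body_en
  German query:   q=Boot&defType=dismax&qf=title_de body_de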



On Thu, Mar 20, 2008 at 12:17 PM, David King [EMAIL PROTECTED] wrote:

 I guess what I'm asking is: if my approach seems convoluted, I'm
 probably doing it wrong, so how *are* people solving the problem of
 searching over multiple languages? What is the canonical way to do this?


 
 




Re: Language support

2008-03-20 Thread David King
 Unless you can come up with language-neutral tokenization and
 stemming, you need to:
 a) know the language of each document.
 b) run a different analyzer depending on the language.
 c) force the user to tell you the language of the query.
 d) run the query through the same analyzer.


I can do all of those. This implies storing all of the different
languages in different fields, right? Then changing the default search
field to the language of the query for every query?















Re: Language support

2008-03-20 Thread Benson Margulies
You can store everything in one field if you manage to hide a language code
with the text. XML is overkill but effective for this. At one point we
investigated how to allow a Lucene analyzer to see more than one field (the
language code as well as the text), but I don't think we came up with
anything.
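
One way to hide a language code with the text is to store the field value as
tagged runs, for example (a hypothetical markup; nothing in Lucene or Solr
parses this for you, so a custom analyzer would have to read the lang
attribute and switch tokenizers per run):

  <t lang="en">boot sector</t><t lang="de">Boot fahren</t>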


On Thu, Mar 20, 2008 at 12:39 PM, David King [EMAIL PROTECTED] wrote:


 I can do all of those. This implies storing all of the different
 languages in different fields, right? Then changing the default search
 field to the language of the query for every query?


 
 
 
 
 
 




Re: Language support

2008-03-20 Thread Benson Margulies
Token-by-token seems a bit extreme. Are you concerned with macaronic
documents?

On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood [EMAIL PROTECTED]
wrote:

 Nice list.

 You may still need to mark the language of each document. There are
 plenty of cross-language collisions: "die" and "boot" have different
 meanings in German and English. Proper nouns (Laserjet) may be the
 same in all languages, a different problem if you are trying to get
 answers in one language.

 At one point, I considered using Unicode language tagging on each
 token to keep it all straight. Effectively, index de/Boot or
 en/Laserjet.

 wunder

 On 3/20/08 9:20 AM, Benson Margulies [EMAIL PROTECTED] wrote:
 
  Unless you can come up with language-neutral tokenization and stemming,
  you need to:
 
  a) know the language of each document.
  b) run a different analyzer depending on the language.
  c) force the user to tell you the language of the query.
  d) run the query through the same analyzer.





Re: Language support

2008-03-20 Thread Walter Underwood
Extreme, but guaranteed to work and it avoids bad IDF when there are
inter-language collisions. In Ultraseek, we only stored the hash, so
the size of the source token didn't matter.

Trademarks are a bad source of collisions and anomalous IDF. If you have
LaserJet support docs in 20 languages, the term LaserJet will have
a document frequency 20X higher than the terms in a single language
and will score too low.

Ultraseek handles macaronic documents when the script makes it possible:
for example, Roman text is sent to the English stemmer in a Japanese
document, and Hangul always goes to the Korean segmenter/stemmer.

A simpler approach is to tag each document with a language, like lang:de,
then use a filter query to restrict the documents to the query language.

Per-token tagging still strikes me as the right approach. It makes
all sorts of things work, like keeping fuzzy matches within the same
language. We didn't do it in Ultraseek because it would have been an
incompatible index change and the benefit didn't justify that.
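
The document-level tagging is the easy variant to reproduce in Solr (a
sketch; the lang field name is illustrative):

  <field name="lang" type="string" indexed="true" stored="true"/>

and at query time, restrict results to the query's language with a filter
query:

  q=Boot&fq=lang:de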

wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department

On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:

 Token-by-token seems a bit extreme. Are you concerned with macaronic
 documents?
 
 
 
 



Re: Language support

2008-03-20 Thread Benson Margulies
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis.
All that makes sense.

 
 
 




Language support

2008-03-19 Thread David King
This has probably been asked before, but I'm having trouble finding  
it. Basically, we want to be able to search for content across several  
languages, given that we know what language a datum and a query are  
in. Is there an obvious way to do this?


Here's the longer version: I am trying to index content that occurs in  
multiple languages, including Asian languages. I'm in the process of  
moving from PyLucene to Solr. In PyLucene, I would have a list of  
analysers:


analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                 cs = pyluc.CzechAnalyzer(),
                 pt = pyluc.SnowballAnalyzer("Portuguese"),
                 ...)

Then when I want to index something, I do

   writer = pyluc.IndexWriter(store, analyzer, create)
   writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser  
to use when writing out the field. Then when I want to search against  
it, I do


analyzer = LanguageAnalyzer.getanal(lang)
q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language  
before sending it off to PyLucene. (off-topic: getanal() is perhaps my  
favourite function-name ever). So the language of a given datum is  
attached to the datum itself. In Solr, however, this appears to be  
attached to the field, not to the individual data in it:


<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

Does this mean there's no way to have a single contents field that has
content in multiple languages, and still have the queries be parsed and
stemmed correctly? How are other people handling this? Does it make sense
to write a tokeniser factory and a query factory that look at, say, the
'lang' field and return the correct tokenisers? Does this already exist?


The other alternative is to have a text_zh field, a text_en field,  
etc, and to modify the query to search on that field depending on the  
language of the query, but that seems kind of hacky to me, especially  
if a query may be against more than one language. Is this the accepted  
way to go about it? Is there a benefit to this method over writing a  
detecting tokeniser factory?