Re: How to search special characters in LUcene

Erick Erickson Fri, 24 Apr 2009 06:48:19 -0700

I'm puzzled why you say

"By the above out put we can say that StandardAnalyzer is
enough to get rid  of danish elements."


It does NOT get rid of the accents, according to your own output.
If your goal is to go ahead and index multiple language documents
in a single index then search it, I'd recommend using the
ISOLatin1AccentFilter. Compare the output of an analyzer like:

public class MyAnalyzer extends StandardAnalyzer {
    public TokenStream tokenStream(String s, Reader reader) {
        return new ISOLatin1AccentFilter(super.tokenStream(s, reader));
    }
}
 and you'll see that all the accents disappear, which will play much
more nicely with queries that don't have accents.

Best
Erick

On Fri, Apr 24, 2009 at 3:53 AM, uday kumar maddigatla <[email protected]>wrote:

>
> Hi Thanks for your reply.
>
> After gone threw with the site which you given... i understood that
> StandardAnalyzer is enough to handle these special characters.
>
> i'm attaching one class called AnalysisDemo.java. By executing that class
> i'm able to say the above sentance(i.e StandardAnalyzer is enough).
>
> Here is the out put when i ran the above java file.
> Analzying "Vedr.           :   Amtsgården Århus, Lyseng Allé 1, 8270
> Højbjerg"
>        org.apache.lucene.analysis.WhitespaceAnalyzer:
>                [Vedr.] [:] [Amtsgården] [Århus,] [Lyseng] [Allé] [1,]
> [8270] [Højbjerg]
>
>        org.apache.lucene.analysis.SimpleAnalyzer:
>                [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]
>
>        org.apache.lucene.analysis.StopAnalyzer:
>                [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]
>
>        org.apache.lucene.analysis.standard.StandardAnalyzer:
>                [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270]
> [højbjerg]
>
>        org.apache.lucene.analysis.snowball.SnowballAnalyzer:
>                [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270]
> [højbjerg]
>
> By the above out put we can say that StandardAnalyzer is enough to get rid
> of danish elements.
>
> But only problem is when i'm searching the any term which includes the
> danish elements(like højbjerg...)
>
> it is unable to find out.
>
> Even i checked with LUKE. In that  i given my sample text which contains
> the
> danish elements and selected the StandardAnalyzer as analyser. when i click
> analyze in that it cleary making index of danish words.
>
> and also i givne one try on luke by loading my index directory in to luke.
> after loading my index i searched for a word which contains the danish
> element, But this time it was failed. It was shown nothing(i.e o resluts).
>
> As in my sense the problem might be making the indexes or in searching the
> item.
>
>
> I gone threw the site which you given. From that i'm able to do this kind
> of
> reaserch work.
>
> Please help me in this.
>
>
> Erick Erickson wrote:
> >
> > OK, this is a much different problem than you were originally
> > asking about, effectively "how to index/search mixed language
> > documents".
> >
> > This topic has been discussed multiple times on the user list, I
> > think your first step should be to search the archive. I *was*
> > going to find the old searchable mail archive, but those clever folks
> > at Lucid Imagination have something new, see:
> >
> > http://www.lucidimagination.com/search/p:lucene?q=multiple+languages
> >
> > Once you've had a chance to look that over I think you'll be off and
> > running.
> >
> > Best
> > Erick
> >
> > On Thu, Apr 23, 2009 at 1:43 AM, uday kumar maddigatla
> > <[email protected]>wrote:
> >
> >>
> >> HI
> >>
> >> Here are the details about my goals.
> >> 1. I want to use this lucene for mixed languages.
> >> 2. I want to make indexes of the documents which are either english or
> >> danish etc.
> >> I'm attaching my IndexFiles.java file.
> >>
> >> When i'm searching i'm giving the index path location  as well as
> >> doucmets
> >> folder.
> >>
> >> If i use StandardAnalyzer as an argument to IndexWriter's method it is
> >> able
> >> to search the english characters.
> >>
> >> How can i use DutchAnalyzer in order to make this IndexFiles.java to
> >> index
> >> the danish elements.
> >>
> >> In my Code which i attached, you can see 'C:\test3'. This is my location
> >> where i want to store my indexes.
> >>
> >> I'm giving documents folder location as comand line argument.
> >>
> >> In my document the content will be like this
> >>
> >> <com:Note><![CDATA[Kreditnota til udligning af faktura nr. 13927 pga
> skal
> >> opsplittes
> >> hhv. byggeplads og skat
> >> Vedr.           :   Amtsgården Århus, Lyseng Allé 1, 8270 Højbjerg
> >> Bygning B
> >> SES Journal nr. :   42895-0001
> >> SES Navision nr.:   Navision 9800124
> >> SES Ansvarlig   :   Martin Krøldrup Nielsen
> >> SES rådgiver    :   Friis & Moltke A/S
> >> Hermed fremsendes faktura på ekstra tømrerarbejde.
> >> Byggeplads Amtsgården B-4
> >> jvf. vedlagte specifikation - aftaleseddel nr. 12.]]></com:Note>
> >>
> >> i"m searching the word like rådgiver . When i see the result it is
> >> clearly
> >> searching for r dgiver. It is omitting the danish element.
> >>
> >> Please help me in this.
> >>
> >>
> >>
> >> Erick Erickson wrote:
> >> >
> >> > Are you *also* using the DutchAnalyzer for your *query*?
> >> >
> >> > Please show us the index and search code (simplified as much
> >> > as possible), then we'll be able to provide better suggestions.
> >> >
> >> > Also, tell us a bit more about your goals here. Is this an
> >> > index entirely of Dutch documents? Or is it a mixed-language
> >> > index?
> >> >
> >> > Think about getting a copy of Luke and
> >> > 1> examining your index to see what's *really* there
> >> > 2> examining the effects of using different parsers on
> >> >      your *query*.
> >> >
> >> > Best
> >> > Erick
> >> >
> >> > On Wed, Apr 22, 2009 at 2:57 AM, uday kumar maddigatla
> >> > <[email protected]>wrote:
> >> >
> >> >>
> >> >> Hi
> >> >>
> >> >> Thanks for your reply.
> >> >>
> >> >> I'm able to see the DutchAnalyzer.
> >> >>
> >> >> When i'm indexing my documents i given instace of DutchAnalyzer as an
> >> >> argument to IndexWriter Class.
> >> >>
> >> >> After this when i search for the
> >> >> http://www.nabble.com/file/p23170710/IndexFiles.java IndexFiles.java
> >> >> contains the danish elements .. Still it is not able to identify.
> >> >>
> >> >> Please tell me how to use DutchAnalzer in my application. Sample
> >> example
> >> >> or
> >> >> series of steps helps me.
> >> >>
> >> >> I also attached my index file(.java file).
> >> >>
> >> >> Please help me in this. please..
> >> >>
> >> >> Erick Erickson wrote:
> >> >> >
> >> >> > Take a look at DutchAnalyzer. The problem you'll have is if you're
> >> >> > indexing
> >> >> > this document along with a bunch of documents from other languages.
> >> >> > You could search the mail archive for extensive discussions of
> >> >> indexing/
> >> >> > searching documents from several languages.
> >> >> >
> >> >> > Best
> >> >> > Erick
> >> >> >
> >> >> > On Tue, Apr 21, 2009 at 2:40 AM, Uday Kumar Maddigatla
> >> >> > <[email protected]>wrote:
> >> >> >
> >> >> >> HI,
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> I'm new to the lucene. I downloaded lucene 2.4.1.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> I have one xml file which contains few special characters like
> 'å',
> >> >> 'ø,'
> >> >> >> °'
> >> >> >> etc.(these are Danish language elements).
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> How can I search these things.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Uday Kumar  Reddy Maddigatla
> >> >> >>
> >> >> >> Software Engineer(Progrator|gatetrade)
> >> >> >>
> >> >> >> MACH India(Operations)
> >> >> >>
> >> >> >> Mobile: + 91-9963000377
> >> >> >>
> >> >> >> [email protected] <mailto:[email protected]>
> >> >> >>
> >> >> >> [email protected] <mailto:[email protected]>
> >> >> >>
> >> >> >> www.ness.com
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >>
> >>
> http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23170710.html
> >> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [email protected]
> >> >> For additional commands, e-mail: [email protected]
> >> >>
> >> >>
> >> >
> >> >
> >> http://www.nabble.com/file/p23190583/IndexFiles.java IndexFiles.java
> >> http://www.nabble.com/file/p23190583/SearchFiles.java SearchFiles.java
> >> http://www.nabble.com/file/p23190583/IndexFiles.java IndexFiles.java
> >> http://www.nabble.com/file/p23190583/IndexFiles.java IndexFiles.java
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23190583.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
> >
> http://www.nabble.com/file/p23211629/AnalysisDemo.java AnalysisDemo.java
> --
> View this message in context:
> http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23211629.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: How to search special characters in LUcene

Reply via email to