Re: Best practices for multiple languages?
But for this, you need a skillfully designed:
- set of fields
- multiplexing analyzer
- query expansion

In one of my projects, we do not split languages by field and it's a pain... I keep running into issues in one direction or the other:
- the "die" example that Otis mentioned is a good one: a stop word in German, an essential verb in English
- I recently had issues with a contribution containing the word Fourier (as in the name of the series): in English it stays fourier, in French it becomes fouri. So: if the resource is contributed in French, the indexed value is fouri and English seekers won't find it; if the resource is contributed in English, French seekers won't find it.

So my latest lesson: always keep a whitespace-tokenized, lowercased, unstemmed field at hand as well, and prefer it over the others in your query expansion.

A wiki page should probably be made.

paul

On 19 Jan 2011, at 07:53, Vinaya Kumar Thimmappa wrote:
> I think we should be using lucene with snowball jar's which means one index
> for all languages (ofcourse size of index is always a matter of concerns).
>
> Hope this helps.
> -vinaya
>
> On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
>> What is the "best practice" to support multiple languages, i.e.
>> Lucene-Documents that have multiple language content/fields?
>> Should
>> a) each language be indexed in a seperate index/directory or should
>> b) the Documents (in a single directory) hold the diverse localized fields?
>>
>> We most often will be searching "language dependent" which (at least
>> performance wise) mandates one-directory-per-language...
>>
>> Any (lucene specific) white papers on this topic?
>>
>> Thx in advance
>> Clemens

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
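Paul's advice to prefer the unstemmed field during query expansion could look roughly like the sketch below. It only builds a query string in Lucene's query-parser syntax (`field:term^boost`); the field names `text_exact`, `text_en` and `text_fr` and the boost values are assumptions made up for illustration, and the input is assumed to be a single plain word that needs no escaping.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

class QueryExpansionSketch {

    // Hypothetical fields: one unstemmed field plus one stemmed field per language.
    static final Map<String, Float> FIELD_BOOSTS = new LinkedHashMap<>();
    static {
        FIELD_BOOSTS.put("text_exact", 4.0f); // whitespace-tokenized, lowercased, unstemmed
        FIELD_BOOSTS.put("text_en", 1.0f);    // English-stemmed
        FIELD_BOOSTS.put("text_fr", 1.0f);    // French-stemmed
    }

    /** Expands one plain word into a boosted OR query over all fields. */
    static String expand(String word) {
        String term = word.toLowerCase(Locale.ROOT);
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Float> field : FIELD_BOOSTS.entrySet()) {
            query.add(field.getKey() + ":" + term + "^" + field.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("Fourier"));
    }
}
```

Each per-language clause would still go through that language's analyzer when the string is parsed, so `text_fr:fourier` would end up as fouri there, while the higher-boosted `text_exact` clause matches the unstemmed form regardless of the language the resource was contributed in.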
Re: Best practices for multiple languages?
> 1) Docs in different languages -- every document is one language
> 2) Each document has fields in different languages

We mainly have type 1) models.

Clemens

> -----Original Message-----
> From: Shai Erera [mailto:ser...@gmail.com]
> Sent: Tuesday, 18 January 2011 20:28
> To: java-user@lucene.apache.org
> Subject: Re: Best practices for multiple languages?
>
> Hi
>
> There are two types of multi-language docs:
> 1) Docs in different languages -- every document is one language
> 2) Each document has fields in different languages
>
> I've dealt with both, and there are different solutions to each. Which of them
> is yours?
>
> Shai
>
> On Tue, Jan 18, 2011 at 7:53 PM, Clemens Wyss wrote:
>
> > What is the "best practice" to support multiple languages, i.e.
> > Lucene-Documents that have multiple language content/fields?
> > Should
> > a) each language be indexed in a seperate index/directory or should
> > b) the Documents (in a single directory) hold the diverse localized fields?
> >
> > We most often will be searching "language dependent" which (at least
> > performance wise) mandates one-directory-per-language...
> >
> > Any (lucene specific) white papers on this topic?
> >
> > Thx in advance
> > Clemens
Re: Best practices for multiple languages?
I think we should be using Lucene with the Snowball JARs, which means one index for all languages (of course, the size of the index is always a matter of concern).

Hope this helps.
-vinaya

On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
> What is the "best practice" to support multiple languages, i.e.
> Lucene-Documents that have multiple language content/fields?
> Should
> a) each language be indexed in a seperate index/directory or should
> b) the Documents (in a single directory) hold the diverse localized fields?
>
> We most often will be searching "language dependent" which (at least
> performance wise) mandates one-directory-per-language...
>
> Any (lucene specific) white papers on this topic?
>
> Thx in advance
> Clemens
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
Where do you get your Lucene/Solr downloads from?

[X] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[X] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)

--
Anshum Gupta
http://ai-cafe.blogspot.com
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [X] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
On Jan 18, 2011, at 2:24 PM, Glen Newton wrote:
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Lucene Revolution 2011 is Coming - May 25 & 26 - Save The Date and Call For Papers
Mark your calendars today! The largest worldwide conference dedicated to Lucene and Solr will take place in the San Francisco/Bay Area May 25-26. The 2011 conference will build on the success of last year's Lucene Revolution in Boston. Sponsored by Lucid Imagination with additional support from community and other commercial co-sponsors, we'll be adding new sessions, new speakers, and new training sessions to the agenda. Lucid Imagination is the commercial entity exclusively dedicated to Apache Lucene/Solr open source search technology.

Registration will begin shortly - so make sure to save-the-date. In the meantime, the Call For Participation (CFP) is now open for Lucene Revolution 2011. If you have a great Solr or Lucene talk, this is a fantastic opportunity to share it with the community. To submit a proposal for a 45-minute presentation, please complete the form at: http://www.lucidimagination.com/revolution/2011/cfp

Topics of interest include:
- Lucene and Solr in the Enterprise (case studies, implementation, return on investment, etc.)
- Use of LucidWorks Enterprise
- “How We Did It” development case studies
- Lucene/Solr technology deep dives: features, how to use, etc.
- Spatial/Geo/local search
- Lucene and Solr in the Cloud
- Scalability and performance tuning
- Large Scale Search
- Real Time Search (or NRT search)
- Data Integration/Data Management
- Lucene & Solr for Mobile Applications
- Associated technologies: Mahout, Nutch, NLP, etc.

All accepted speakers will get complimentary conference passes. Financial assistance is available for speakers that qualify. Submissions must be received by Wednesday, March 2, 2011, 12 Midnight PST.

Key Dates:
January 18, 2011: Call For Participation open; form available for completion at: http://www.lucidimagination.com/revolution/2011/cfp
March 2, 2011: Call For Participation closes
March 9, 2011: Speaker Acceptance Notification
May 23-24, 2011: Lucene and Solr Training Sessions
May 25-26, 2011: Lucene Revolution Conference Sessions

If you have more than one topic that you would like to propose, please complete an additional online form. To be considered, proposals must be received by 12 Midnight PST, March 2, 2011.

Interested in registration or other conference news? Want to be added to the conference mailing list? Is your organization interested in sponsorship opportunities? Please send an email to: i...@lucenerevolution.org

We look forward to seeing you in the San Francisco/Bay Area!

Regards,
Mike

Michael Bohlig | Lucid Imagination
Enterprise Marketing
p +1 650 353 4057 x132
m +1 650 703 8383
www.lucidimagination.com
[POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
Sincerely,
Sithu D Sudarsan

Grant Ingersoll wrote:
> Where do you get your Lucene/Solr downloads from?
>
> [x] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [x] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[] ASF Mirrors (linked in our release announcements or via the Lucene website)
[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
Grant Ingersoll wrote:
> Where do you get your Lucene/Solr downloads from?
>
> [x] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [x] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[X] ASF Mirrors (linked in our release announcements or via the Lucene website)
[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)

2011/1/18 Grant Ingersoll
> As devs of Lucene/Solr, due to the way ASF mirrors, etc. works, we really
> don't have a good sense of how people get Lucene and Solr for use in their
> application. Because of this, there has been some talk of dropping Maven
> support for Lucene artifacts (or at least make them external). Before we do
> that, I'd like to conduct an informal poll of actual users out there and see
> how you get Lucene or Solr.
>
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
>
> Please put an X in the box that applies to you. Multiple selections are OK
> (for instance, if one project uses a mirror and another uses Maven)
>
> Please do not turn this thread into a discussion on Maven and it's
> (de)merits, I simply want to know, informally, where people get their JARs
> from. In other words, no discussion is necessary (we already have that
> going on d...@lucene.apache.org which you are welcome to join.)
>
> Thanks,
> Grant
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
On Tue, Jan 18, 2011 at 3:04 PM, Grant Ingersoll wrote:
>
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
On 18.01.2011, at 22:04, Grant Ingersoll wrote:
> As devs of Lucene/Solr, due to the way ASF mirrors, etc. works, we really
> don't have a good sense of how people get Lucene and Solr for use in their
> application. Because of this, there has been some talk of dropping Maven
> support for Lucene artifacts (or at least make them external). Before we do
> that, I'd like to conduct an informal poll of actual users out there and see
> how you get Lucene or Solr.
>
> Where do you get your Lucene/Solr downloads from?
>
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [X] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [X] I/we build them from source via an SVN/Git checkout.
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> Where do you get your Lucene/Solr downloads from?
>
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
>
> Please put an X in the box that applies to you. Multiple selections are OK
> (for instance, if one project uses a mirror and another uses Maven)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
Where do you get your Lucene/Solr downloads from?

[] ASF Mirrors (linked in our release announcements or via the Lucene website)
[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)

--
Beatriz Nombela Escobar
bea...@gmail.com
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[x] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.

On Tue, Jan 18, 2011 at 1:24 PM, Glen Newton wrote:
> Where do you get your Lucene/Solr downloads from?
>
> [x] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> -Glen Newton
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
On Tue, 18 Jan 2011 22:04:01 +0100, Grant Ingersoll wrote:

[] ASF Mirrors (linked in our release announcements or via the Lucene website)
[x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> Where do you get your Lucene/Solr downloads from?
>
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)

--ewh
RE: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> [x] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [x] I/we build them from source via an SVN/Git checkout.
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
>
> Please put an X in the box that applies to you. Multiple selections are OK
> (for instance, if one project uses a mirror and another uses Maven)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
> [X] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [X] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
Where do you get your Lucene/Solr downloads from?

[x] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.

-Glen Newton
RE: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[] ASF Mirrors (linked in our release announcements or via the Lucene website)
[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[X] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
And here's mine:

On Jan 18, 2011, at 4:04 PM, Grant Ingersoll wrote:
>
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene website)
>
> [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [x] I/we build them from source via an SVN/Git checkout.
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
Where do you get your Lucene/Solr downloads from?

[] ASF Mirrors (linked in our release announcements or via the Lucene website)
[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)

--
Luka Stojanovic
lu...@vast.com
Platform Engineering
[POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
As devs of Lucene/Solr, due to the way ASF mirrors, etc. work, we really don't have a good sense of how people get Lucene and Solr for use in their applications. Because of this, there has been some talk of dropping Maven support for Lucene artifacts (or at least making them external). Before we do that, I'd like to conduct an informal poll of actual users out there and see how you get Lucene or Solr.

Where do you get your Lucene/Solr downloads from?

[] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
[] Other (someone in your company mirrors them internally or via a downstream project)

Please put an X in the box that applies to you. Multiple selections are OK (for instance, if one project uses a mirror and another uses Maven).

Please do not turn this thread into a discussion on Maven and its (de)merits; I simply want to know, informally, where people get their JARs from. In other words, no discussion is necessary (we already have that going on at d...@lucene.apache.org, which you are welcome to join).

Thanks,
Grant
Re: Best practices for multiple languages?
Hi

There are two types of multi-language docs:
1) Docs in different languages -- every document is one language
2) Each document has fields in different languages

I've dealt with both, and there are different solutions to each. Which of them is yours?

Shai

On Tue, Jan 18, 2011 at 7:53 PM, Clemens Wyss wrote:
> What is the "best practice" to support multiple languages, i.e.
> Lucene-Documents that have multiple language content/fields?
> Should
> a) each language be indexed in a seperate index/directory or should
> b) the Documents (in a single directory) hold the diverse localized fields?
>
> We most often will be searching "language dependent" which (at least
> performance wise) mandates one-directory-per-language...
>
> Any (lucene specific) white papers on this topic?
>
> Thx in advance
> Clemens
Re: Best practices for multiple languages?
Hi Clemens,

If you will be searching individual languages, go with language-specific indices. Wunder likes to give an example of "die" in German vs. English. :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Clemens Wyss
> To: "java-user@lucene.apache.org"
> Sent: Tue, January 18, 2011 12:53:57 PM
> Subject: Best practices for multiple languages?
>
> What is the "best practice" to support multiple languages, i.e.
> Lucene-Documents that have multiple language content/fields?
>
> Should
> a) each language be indexed in a seperate index/directory or should
> b) the Documents (in a single directory) hold the diverse localized fields?
>
> We most often will be searching "language dependent" which (at least
> performance wise) mandates one-directory-per-language...
>
> Any (lucene specific) white papers on this topic?
>
> Thx in advance
> Clemens
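The "die" example can be made concrete with a toy analysis step. This is only a sketch: the stop lists below are tiny made-up samples, not the ones shipped with Lucene, and `analyze` is a hypothetical helper rather than a Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PerLanguageStopWords {

    // Tiny illustrative stop lists keyed by language code (assumed, not Lucene's).
    static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "und")));
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "a", "and", "of")));
    }

    /** Lowercases, splits on whitespace and drops the stop words of the given language. */
    static List<String> analyze(String text, String lang) {
        Set<String> stops = STOP_WORDS.getOrDefault(lang, Collections.<String>emptySet());
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !stops.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("heroes never die", "en"));
        System.out.println(analyze("heroes never die", "de"));
    }
}
```

With the English list the verb survives; with the German list it is dropped as an article. That is the reason a single analyzer over mixed-language content tends to go wrong one way or the other.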
Best practices for multiple languages?
What is the "best practice" to support multiple languages, i.e. Lucene-Documents that have multiple language content/fields?

Should
a) each language be indexed in a separate index/directory or should
b) the Documents (in a single directory) hold the diverse localized fields?

We most often will be searching "language dependent" which (at least performance wise) mandates one-directory-per-language...

Any (lucene specific) white papers on this topic?

Thx in advance
Clemens
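The two options differ mainly in where the language ends up: in the directory name for (a), in the field name for (b). A minimal sketch of both conventions follows; the helper names and the `<root>/<lang>` and `<field>_<lang>` naming schemes are assumptions for illustration, not anything Lucene prescribes.

```java
import java.io.File;

class MultiLanguageLayout {

    /** Option (a): one index directory per language, e.g. indexes/de. */
    static File indexDirFor(File root, String lang) {
        return new File(root, lang);
    }

    /** Option (b): one shared index with one field per language, e.g. body_de. */
    static String fieldFor(String baseField, String lang) {
        return baseField + "_" + lang;
    }

    public static void main(String[] args) {
        System.out.println(indexDirFor(new File("indexes"), "de"));
        System.out.println(fieldFor("body", "de"));
    }
}
```

Either way, the language has to be known at both index time and query time so that the same analyzer is applied on both sides.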
Re: Large .frq file
Hi Shai,

What I really wanted to do was reduce the .frq file size.

Oddly (when tokenizing 3 separate fields), the WhitespaceTokenizer produces more terms than the CJK analyzer, yet the CJK .frq file is much larger ... examples below:

with WhitespaceTokenizer:
89M   _0.tis
1.4M  _0.tii
71    _0.fnm
5.8M  _0.fdx
741K  _0.fdt
20    segments.gen
293   segments_2
119M  _0.frq

with CJKTokenizer:
31M   _0.tis
633K  _0.tii
71    _0.fnm
5.8M  _0.fdx
741K  _0.fdt
20    segments.gen
293   segments_2
166M  _0.frq

Also, I believe Solr calls addDocument with payloads turned off.

I'm not sure why the size is much larger.

Cheers,
Dan

On Tue, Jan 18, 2011 at 12:41 PM, Shai Erera wrote:
> If I understand correctly, you compare the size of the .frq when
> WhitespaceTokenizer is used, vs the CJK ones?
>
> I'd bet this is because WhitespaceTokenizer creates far less terms than the
> CJK one. Whitespace tokenizes the text by separating on whitespace, while
> CJK does sort of N-Gram tokenization, which usually leads to much more terms
> created. This affects the .frq file in that there are much more posting
> lists created, which are stored in the .frq file.
>
> See if the .tii and .tis files differ and if their difference is the same
> order of the .frq differences (e.g. if they are 2x larger w/ CJK, so .frq
> should be of the same order of difference), then I believe this is the
> reason.
>
> Shai
>
> On Tue, Jan 18, 2011 at 2:13 PM, dan sutton wrote:
>
>> Hi,
>>
>> We're trying to create a large index via solr for trends and notice
>> that we have a large '.frq' file after doing the following:
>>
>> make all text fields index="true", stored="false",
>> omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
>> termOffsets="false" termVectors="false"
>>
>> We are using a variation on org.apache.lucene.analysis.cjk and notice
>> that the .frq is about 4 time larger than, for example, the
>> WhiteSpaceTokenizer.
>>
>> Considering that with omitTermFreqAndPositions="true" for the text
>> fields I'd have thought this should be : "If omitTf were true it would
>> be this sequence of VInts instead:"
>> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>>
>> Can anyone suggest how I can reduce the size of this file?
>>
>> Many thanks,
>> Dan
>>
>> Lucene Specification Version: 2.9.1
>> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
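One possible reading of these numbers, offered as an assumption rather than a diagnosis: the term dictionary (.tis) grows with the number of unique terms, while the postings in the .frq file grow with the number of (term, document) pairs. Overlapping bigrams can draw from a smaller vocabulary than whole whitespace-delimited tokens and still put many more distinct terms into each document. The toy sketch below counts both quantities; `bigramTokens` is a simplified stand-in and not Lucene's CJKTokenizer, and the Latin-letter corpus is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PostingsCountSketch {

    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Overlapping character bigrams within each whitespace-separated chunk. */
    static List<String> bigramTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String chunk : whitespaceTokens(text)) {
            if (chunk.length() < 2) {
                out.add(chunk);
                continue;
            }
            for (int i = 0; i + 2 <= chunk.length(); i++) {
                out.add(chunk.substring(i, i + 2));
            }
        }
        return out;
    }

    /** Returns {unique terms in the corpus, number of (term, document) pairs}. */
    static int[] countTermsAndPostings(List<List<String>> docs) {
        Set<String> terms = new HashSet<>();
        int postings = 0;
        for (List<String> doc : docs) {
            Set<String> distinct = new HashSet<>(doc);
            terms.addAll(distinct);
            postings += distinct.size();
        }
        return new int[] { terms.size(), postings };
    }

    public static void main(String[] args) {
        List<List<String>> ws = new ArrayList<>();
        List<List<String>> bi = new ArrayList<>();
        for (String doc : new String[] { "abab", "baba", "abba", "bbab" }) {
            ws.add(whitespaceTokens(doc));
            bi.add(bigramTokens(doc));
        }
        System.out.println("whitespace: " + Arrays.toString(countTermsAndPostings(ws)));
        System.out.println("bigrams:    " + Arrays.toString(countTermsAndPostings(bi)));
    }
}
```

On this toy corpus the whitespace tokens give 4 unique terms and 4 postings, while the bigrams give 3 unique terms and 10 postings, which is the same direction as the .tis and .frq sizes in the listing.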
Re: Lucene Ranking Problem
Hi Ian & Umesh,

This is what I was looking for. Thanks a lot.

Regards,
Lahiru
Re: Large .frq file
If I understand correctly, you are comparing the size of the .frq when WhitespaceTokenizer is used vs. the CJK one?

I'd bet this is because WhitespaceTokenizer creates far fewer terms than the CJK one. Whitespace tokenizes the text by separating on whitespace, while CJK does a sort of N-gram tokenization, which usually leads to many more terms being created. This affects the .frq file in that many more posting lists are created, and those are stored in the .frq file.

See if the .tii and .tis files differ, and if their difference is of the same order as the .frq difference (e.g. if they are 2x larger w/ CJK, the .frq should show the same order of difference), then I believe this is the reason.

Shai

On Tue, Jan 18, 2011 at 2:13 PM, dan sutton wrote:
> Hi,
>
> We're trying to create a large index via solr for trends and notice
> that we have a large '.frq' file after doing the following:
>
> make all text fields index="true", stored="false",
> omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
> termOffsets="false" termVectors="false"
>
> We are using a variation on org.apache.lucene.analysis.cjk and notice
> that the .frq is about 4 time larger than, for example, the
> WhiteSpaceTokenizer.
>
> Considering that with omitTermFreqAndPositions="true" for the text
> fields I'd have thought this should be : "If omitTf were true it would
> be this sequence of VInts instead:"
> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>
> Can anyone suggest how I can reduce the size of this file?
>
> Many thanks,
> Dan
>
> Lucene Specification Version: 2.9.1
> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
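For reference, the sentence Dan quotes from the file-formats page is about how each posting list is laid out. As I read that page, with frequencies stored each entry holds the document gap shifted left by one bit, with the low bit set when the frequency is 1 and an extra integer otherwise; with omitTf only the plain gap is written. The sketch below is my own reconstruction of that scheme and not Lucene code: `encode` is a hypothetical helper, and it returns the integers that would then each be written as a VInt.

```java
import java.util.ArrayList;
import java.util.List;

class TermFreqsSketch {

    /**
     * Builds the integer sequence for one term's postings.
     * docs must be in ascending order; freqs[i] is the term frequency in docs[i].
     */
    static List<Integer> encode(int[] docs, int[] freqs, boolean omitTf) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int gap = docs[i] - previousDoc;
            previousDoc = docs[i];
            if (omitTf) {
                out.add(gap);          // gap only, no frequency information
            } else if (freqs[i] == 1) {
                out.add(gap * 2 + 1);  // odd value: frequency is one
            } else {
                out.add(gap * 2);      // even value: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docs = { 7, 11 };
        int[] freqs = { 1, 3 };
        System.out.println(encode(docs, freqs, false)); // [15, 8, 3]
        System.out.println(encode(docs, freqs, true));  // [7, 4]
    }
}
```

Under this reading, omitting frequencies removes at most one extra integer per posting and changes nothing about how many postings there are, so it would not be expected to shrink a .frq file whose size is dominated by the number of (term, document) entries.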
Re: Lucene Ranking Problem
Hi Lahiru,

Comments are inline:

On Tue, Jan 18, 2011 at 5:42 PM, Lahiru Samarakoon wrote:
> Dear All,
>
> I have two documents. The analyzed and the tokenized contents are
> mentioned below.
>
> *Document 1 :*
> *when*, null_1, *my*, null_1, money,
> fund, amount, payment, creditcard, credit,
> card, *bank, account*, debit, deduct,
> *charge*, null_1, my, mobile, usage,
> *service*, connection
>
> *Document 2:*
> *when*, what, time, what, day,
> null_1, money, fund, cash, payment,
> null_1, i, do, you, i,
> null_1, deduct, *charge*, reduce, debit,
> from, *my*, *bank, account*, credit,
> card, null_1, *adsl*, adsl1, adsl-2,
> adsl-1, adsl2, adsl, 1, adsl,
> 2, usage, connection, *service*
>
> Then, I searched for the following text.
>
> *Query:* when my bank account charge adsl service
>
> *Scores*
> Document 1 = 0.74406385
> Document 2 = Score = 0.66067594

Please read the Lucene scoring documentation: http://lucene.apache.org/java/2_9_1/scoring.html. It will help you understand the bigger picture.

> I was expecting to have Document 2 as the top ranked document. But I get
> Document 1 as the top ranked even it does not contains the term “adsl”.
>
> The word order of the Document 1 matches with the query very well. Can it
> be the reason ?

Word order doesn't matter. However, tf/idf, norms and other factors do matter, as described in the link above. You can see how each document was assigned its score by calling IndexSearcher.explain(query, docId), as described in http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Searcher.html#explain%28org.apache.lucene.search.Query,%20int%29

> If it is, how can I neglect the word order when searching. (I am not using
> phase queries).
>
> My searching code look like below and it is very simple.
>
> QueryParser parser = new QueryParser(Version.LUCENE_30,
>         "pattern",
>         new StandardAnalyzer(Version.LUCENE_30));
> org.apache.lucene.search.Query query1 = parser.parse(this.query.getQuestion());
> TopDocs hits = is.search(query1, 10);
>
> Please advice
>
> Thanks,
> Lahiru

--
Thanks & Regards
Umesh Prasad
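As a rough illustration of the factors named in the reply above (tf, idf, norms, coordination), the following plain-Java sketch combines them the way I understand the default Similarity of that Lucene generation to do. It is not Lucene code: `ScoringSketch` is a hypothetical name, the document frequencies and index size are invented, and the sketch leaves out queryNorm, boosts and the lossy one-byte encoding of norms. It is not meant to reproduce the scores reported in this thread; only the explain() output can show which factor dominated there.

```java
// Simplified sketch of classic tf/idf-style scoring factors (not Lucene code).
class ScoringSketch {

    // Term-frequency factor: grows with the square root of the in-document count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Length normalization: the more tokens a field has, the smaller the factor.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // Coordination factor: fraction of the query terms that the document matches.
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    // Combine the factors; freqs[i] and docFreqs[i] both refer to query term i.
    static double score(int[] freqs, int[] docFreqs, int numDocs, int numTokens) {
        double sum = 0.0;
        int overlap = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] == 0) {
                continue; // this query term does not occur in the document
            }
            overlap++;
            double weight = idf(docFreqs[i], numDocs);
            sum += tf(freqs[i]) * weight * weight * lengthNorm(numTokens);
        }
        return coord(overlap, freqs.length) * sum;
    }

    public static void main(String[] args) {
        int[] docFreqs = {40, 40, 5}; // hypothetical document frequencies
        int numDocs = 100;            // hypothetical index size
        int[] matchesAll = {1, 1, 1}; // each query term occurs once
        System.out.println("20-token field: " + score(matchesAll, docFreqs, numDocs, 20));
        System.out.println("80-token field: " + score(matchesAll, docFreqs, numDocs, 80));
    }
}
```

With identical matches, the shorter field scores higher in this sketch, which is one way a shorter document can outrank a longer one that matches more query terms.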
Re: Lucene Ranking Problem
See what Searcher.explain() says for each hit. I don't think that word order will matter with the query you give. There are several factors in scoring: see org.apache.lucene.search.Similarity, or search the web for "lucene scoring". Or have a play with Luke: it is invaluable for investigating things with Lucene and will tell you everything about your index.

--
Ian.

On Tue, Jan 18, 2011 at 12:12 PM, Lahiru Samarakoon wrote:
> Dear All,
>
> I have two documents. The analyzed and the tokenized contents are mentioned
> below.
>
> *Document 1 :*
> *when*, null_1, *my*, null_1, money,
> fund, amount, payment, creditcard, credit,
> card, *bank, account*, debit, deduct,
> *charge*, null_1, my, mobile, usage,
> *service*, connection
>
> *Document 2:*
> *when*, what, time, what, day,
> null_1, money, fund, cash, payment,
> null_1, i, do, you, i,
> null_1, deduct, *charge*, reduce, debit,
> from, *my*, *bank, account*, credit,
> card, null_1, *adsl*, adsl1, adsl-2,
> adsl-1, adsl2, adsl, 1, adsl,
> 2, usage, connection, *service*
>
> Then, I searched for the following text.
>
> *Query:* when my bank account charge adsl service
>
> *Scores*
> Document 1 = 0.74406385
> Document 2 = Score = 0.66067594
>
> I was expecting to have Document 2 as the top ranked document. But I get
> Document 1 as the top ranked even it does not contains the term “adsl”.
>
> The word order of the Document 1 matches with the query very well. Can it
> be the reason ?
>
> If it is, how can I neglect the word order when searching. (I am not using
> phase queries).
>
> My searching code look like below and it is very simple.
>
> QueryParser parser = new QueryParser(Version.LUCENE_30,
>         "pattern",
>         new StandardAnalyzer(Version.LUCENE_30));
> org.apache.lucene.search.Query query1 = parser.parse(this.query.getQuestion());
> TopDocs hits = is.search(query1, 10);
>
> Please advice
>
> Thanks,
> Lahiru
Large .frq file
Hi,

We're trying to create a large index via Solr for trends, and we notice that we get a large '.frq' file after doing the following:

make all text fields index="true", stored="false", omitTermFreqAndPositions="true" omitNorms="true" termPositions="false" termOffsets="false" termVectors="false"

We are using a variation on org.apache.lucene.analysis.cjk and notice that the .frq is about 4 times larger than with, for example, the WhitespaceTokenizer.

Considering that omitTermFreqAndPositions="true" is set for the text fields, I'd have thought the file would use the compact layout described as "If omitTf were true it would be this sequence of VInts instead:" (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)

Can anyone suggest how I can reduce the size of this file?

Many thanks,
Dan

Lucene Specification Version: 2.9.1
Solr Specification Version: 1.4.0.2010.09.10.17.10.36
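For readers following the file-format reference in the message above, here is a small plain-Java sketch of how one posting list is laid out with and without term frequencies, based on my reading of the 2.9.1 file-formats page. `FrqSizeSketch` and its methods are hypothetical names, the document numbers are invented, and skip data is ignored, so treat the byte counts as an approximation.

```java
// Rough size model for one posting list in a .frq-style file (skip data ignored).
class FrqSizeSketch {

    // Bytes needed to write a non-negative int as a VInt (7 payload bits per byte).
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // With frequencies: the doc delta is shifted left by one bit; the low bit set
    // means "freq is 1", otherwise the frequency follows as a separate VInt.
    static int bytesWithFreqs(int[] docIds, int[] freqs) {
        int total = 0;
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - prev;
            prev = docIds[i];
            if (freqs[i] == 1) {
                total += vIntBytes((delta << 1) | 1);
            } else {
                total += vIntBytes(delta << 1) + vIntBytes(freqs[i]);
            }
        }
        return total;
    }

    // With omitTf: only the plain doc delta is written for each posting.
    static int bytesOmitTf(int[] docIds) {
        int total = 0;
        int prev = 0;
        for (int docId : docIds) {
            total += vIntBytes(docId - prev);
            prev = docId;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docIds = {5, 70, 200}; // hypothetical ascending document numbers
        int[] freqs = {1, 3, 1};     // hypothetical term frequencies
        System.out.println("with freqs: " + bytesWithFreqs(docIds, freqs) + " bytes");
        System.out.println("omitTf:     " + bytesOmitTf(docIds) + " bytes");
    }
}
```

Even with omitTf, every (term, document) pair still costs at least one byte, so an analyzer that emits many more terms per document will still produce a larger file.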
Lucene Ranking Problem
Dear All,

I have two documents. Their analyzed and tokenized contents are listed below.

*Document 1:*
*when*, null_1, *my*, null_1, money,
fund, amount, payment, creditcard, credit,
card, *bank, account*, debit, deduct,
*charge*, null_1, my, mobile, usage,
*service*, connection

*Document 2:*
*when*, what, time, what, day,
null_1, money, fund, cash, payment,
null_1, i, do, you, i,
null_1, deduct, *charge*, reduce, debit,
from, *my*, *bank, account*, credit,
card, null_1, *adsl*, adsl1, adsl-2,
adsl-1, adsl2, adsl, 1, adsl,
2, usage, connection, *service*

Then I searched for the following text.

*Query:* when my bank account charge adsl service

*Scores*
Document 1 = 0.74406385
Document 2 = 0.66067594

I was expecting Document 2 to be the top-ranked document, but I get Document 1 at the top even though it does not contain the term “adsl”.

The word order of Document 1 matches the query very well. Can that be the reason?

If it is, how can I ignore the word order when searching? (I am not using phrase queries.)

My search code is very simple and looks like this:

QueryParser parser = new QueryParser(Version.LUCENE_30,
        "pattern",
        new StandardAnalyzer(Version.LUCENE_30));
org.apache.lucene.search.Query query1 = parser.parse(this.query.getQuestion());
TopDocs hits = is.search(query1, 10);

Please advise.

Thanks,
Lahiru