Re: Designing a multilingual index

2010-04-01 Thread David Vergnaud
Hi, thanks Paul for your input. I'm gonna try the "localized field" variant and see how it works for me. I think your idea of automatically boosting the user language is neat, but it should definitely be possible to disable this boosting... Most users have no idea about the language settings

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Chris Lu
Hi, Michel, This has already been implemented in DBSight. Check it out! http://www.dbsight.net You can get sum, avg for Facet searches. And count is included in Facet search directly. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: htt

Re: query: order of search

2010-04-01 Thread Ian Lea
> Query I > Does the order of query play role in searching > example:doc has fields > rollno(pk), name, marks > > Query : marks=90&rollno=2&name=abc > > Query :rollno=2&name=abc&marks=90 > > which query processing will be more efficient. > is it work like search doc field by field , it will look fo

Re: query: order of search

2010-04-01 Thread suman.holani
Query I: it's written that "to do a "search within search", so that the second search is constrained by the results of the first query", we can use a boolean query. So doesn't that mean the order of the query will be preserved? Please give me a simple example of how the docs get searched in Lucene. 10 docs with 3 fields

indexing an object

2010-04-01 Thread Bujji
Hi all, I want to index an object once instead of indexing its strings separately. Is there an existing facility or an analyzer for that? Please help me. Thanks, Bujji

Re: indexing an object

2010-04-01 Thread Ian Lea
There is no way to just index an object as such, but there are ways of creating Fields out of byte arrays or with Readers so you could use them with serialization or something. -- Ian. On Thu, Apr 1, 2010 at 10:46 AM, Bujji wrote: > hi all, > > i want to index an object once instead of indexing
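The serialization route mentioned above could look roughly like the sketch below. It uses only the JDK; `ObjectBytes` and `Student` are hypothetical names made up for illustration, and the resulting byte array is what you would hand to a byte-array Field. As far as I know such a field is stored rather than searched, so anything you want to query on would still need its own fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ObjectBytes {

    // Hypothetical record type, present only to have something to serialize.
    public static class Student implements Serializable {
        private static final long serialVersionUID = 1L;
        public final int rollNo;
        public final String name;

        public Student(int rollNo, String name) {
            this.rollNo = rollNo;
            this.name = name;
        }
    }

    // Turns any Serializable object into a byte array using JDK serialization.
    public static byte[] toBytes(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds the object from the bytes read back out of the index.
    public static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class of serialized object not on classpath", e);
        }
    }
}
```

JDK serialization is just the simplest thing to show here; any format that round-trips through a byte array would do the same job.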

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Michel Nadeau
@Ken: yeah we thought about it - but we have a HUGE amount of data (sales, affiliates, etc.) - so pre-calculating everything isn't really an option. Plus I don't know how we would sort.. let's say I get the totals for affiliate X, loop totals from day 1 to X (range), sum up, great: I can do this fo

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Michel Nadeau
Are you planning to be able to sort by these SUMs? A SpanQuery would work great to get the integers... then you would loop and sum up... but what about "joining" with your other data and sorting? - Mike aka...@gmail.com On Wed, Mar 31, 2010 at 9:23 PM, prasenjit mukherjee wrote: > I too am tryi

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
Not sure what you mean by "joining" in Lucene, since conceptually there is only one table (with many fields, aka columns) in Lucene. A representative query would help to understand the use case. Again, I didn't get the "sorting" part. SUM() will return only 1 aggregated value, so what do you want to sor

RE: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Darren Hartford
If you are going to end up either copying or moving all the data to Lucene (which, even when you hook it up to the existing MySQL data, will still create its own copy of the data), you might really want to look at other options: *column-oriented databases (analytical databases). If ope

Re: Designing a multilingual index

2010-04-01 Thread henrib
Hi, I worked some time ago on a similar system (using Solr) and used the multiple indices route (the multicore feature in Solr). In our case, the "same" document could exist in different languages; different localized versions of the same information (same Solr unique id for each l10n version).

Re: Designing a multilingual index

2010-04-01 Thread Paul Libbrecht
How? paul On 1 Apr 2010, at 14:19, henrib wrote: Finally, query expansion can also be used in the multiple indices case and might even use automated/guided translation.

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
> Lucene is great at searching for data, but just because it is awesome in one > area doesn't mean it would excel in something it wasn't designed for ;-) I think lucene is probably one of the better data structures for computing "conditional aggregated stats". Even for straight search lucene has

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Will Johnson
Hi Michel, You can do all of this with Lucene, though not with the standard index and query operators. At Attivio we have a custom Lucene index structure + custom query operators that support relational joins across records in an index. You can write the queries in our standard query language or run

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Michel Nadeau
Hi, Here's an example of raw data that would be in my Sales index: *Affiliate / SaleDate / SaleAmount* * mike / 2010-03-01 / 10.00 * john / 2010-03-01 / 10.00 * mike / 2010-03-02 / 15.00 * john / 2010-03-02 / 5.00 * mike / 2010-03-03 / 20.00 * john / 2010-03-03 / 1.00 * mike / 2010-03-04 / 10.0
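To make the question concrete, here is a plain-Java sketch of what "sum per affiliate, then sort by the sum" means on the rows above. It is not Lucene code and not anyone's actual implementation; `AffiliateTotals` and its methods are hypothetical names, and only the six fully visible rows are used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AffiliateTotals {

    // Each row is {affiliate, saleDate, saleAmount}; sums saleAmount per affiliate.
    public static Map<String, Double> sumByAffiliate(String[][] rows) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            totals.merge(row[0], Double.parseDouble(row[2]), Double::sum);
        }
        return totals;
    }

    // Sorts the (much smaller) set of per-affiliate totals, largest first.
    public static List<Map.Entry<String, Double>> sortedByTotalDesc(Map<String, Double> totals) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(totals.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"mike", "2010-03-01", "10.00"},
            {"john", "2010-03-01", "10.00"},
            {"mike", "2010-03-02", "15.00"},
            {"john", "2010-03-02", "5.00"},
            {"mike", "2010-03-03", "20.00"},
            {"john", "2010-03-03", "1.00"},
        };
        for (Map.Entry<String, Double> entry : sortedByTotalDesc(sumByAffiliate(rows))) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

The sort only ever sees one entry per affiliate, so its cost depends on the number of affiliates rather than the number of sales rows; the summing pass is the part that has to touch every matching row.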

Re: Designing a multilingual index

2010-04-01 Thread David Vergnaud
Hi, thx for sharing your experience with us. I'm happy to see that both methods I've thought of are apparently sensible ;-) However, it might be due to my lack of experience in that domain, but some of your arguments in favor of a multi-index solution seem to me to be also compatible with a si

Re: query: order of search

2010-04-01 Thread Erick Erickson
Why do you care? By that I mean "what problem are you trying to solve" (see "The XY problem" at http://people.apache.org/~hossman/). The reason I'm asking here is that very often, when people ask this kind of question without providing background, they're trying the wrong approach to solve a problem

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Grant Ingersoll
Have you looked at Solr's StatsComponent? On Mar 31, 2010, at 9:17 PM, Michel Nadeau wrote: > Hi, > > We're currently in the process of switching many of our screens from MySQL > to Lucene because MySQL simply dies because we have too much data and it's > becoming too long to generate the stats

Re: query: order of search

2010-04-01 Thread Chris Hostetter
: Subject: query: order of search : In-Reply-To: <8d42dcc0-4e03-4f8b-a6cc-c53890910...@transpac.com> : References: : <8d42dcc0-4e03-4f8b-a6cc-c53890910...@transpac.com> http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mail

Re: query: order of search

2010-04-01 Thread Karl Wettin
On 1 Apr 2010 at 11:21, > wrote: it's written "to do a "search within search", so that the second search is constrained by the results of the first query" If I understand your needs, you could, while collecting search results, populate a new filter with all matching documents and use that filt
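The idea of collecting the hits of one search into a filter and reusing it for the next can be sketched without any Lucene classes, using a `java.util.BitSet` keyed by document number. `SearchWithinSearch` and the predicate-based `search` method are hypothetical stand-ins for a real query and collector; the field arrays in the usage below are made-up sample data.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

public class SearchWithinSearch {

    // Runs a "query" (a predicate over doc ids) and records every match in a bit set.
    // If a filter is given, only documents already set in it are considered at all.
    public static BitSet search(int maxDoc, BitSet filter, IntPredicate matches) {
        BitSet hits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean allowed = (filter == null) || filter.get(doc);
            if (allowed && matches.test(doc)) {
                hits.set(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] marks = {90, 80, 90, 70};
        String[] names = {"abc", "abc", "xyz", "abc"};

        // First search: marks == 90. Its hits become the filter for the second search.
        BitSet first = search(marks.length, null, doc -> marks[doc] == 90);
        BitSet second = search(marks.length, first, doc -> names[doc].equals("abc"));

        System.out.println(first + " then " + second);
    }
}
```

The second search can never return a document the first one rejected, which is the "constrained by the results of the first query" behaviour quoted above.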

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Chris Lu
Hi, Michel, You can use the free version of DBSight to test it out. However, it's a whole solution, so you will need to configure it first, for example by specifying which column you want to count on before the actual search. BTW: DBSight also supports MIN and MAX, in addition to SUM and AVG. -- Chris Lu

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
If the number of documents (in this case "Affiliates") isn't huge, sorting can probably be done as a post-process. I still don't see any need for joins here. On Thu, Apr 1, 2010 at 7:16 PM, Michel Nadeau wrote: > Hi, > > Here's an example of raw data that would be in my Sales index: > > *Affili

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Michel Nadeau
Well that's my problem: we have a lot of records of all types (affiliates, sales) so looping tons of records each time isn't possible. - Mike aka...@gmail.com On Thu, Apr 1, 2010 at 2:11 PM, prasenjit mukherjee wrote: > If the number of documents ( in this case "Affiliates" ) aren't huge, > so

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
This looks like a use case better suited to Pig (over Hadoop). It could be difficult for Lucene to sort and sum simultaneously, as the sorting itself depends on the summed value. On Thu, Apr 1, 2010 at 11:47 PM, Michel Nadeau wrote: > Well that's my problem: we have a lot of records of all types (

Articles not found

2010-04-01 Thread rohit dholakia
Hi, I am trying to access the articles in the resources part of the Lucene wiki but all of them say "Page not found". Why is that? Are all the articles hosted at another page now? Rohit

Re: Articles not found

2010-04-01 Thread rohit dholakia
Hey, I found them by googling and searching within the website, but it would be better to update the links in that wiki. On Fri, Apr 2, 2010 at 12:43 AM, rohit dholakia wrote: > Hi, > >I am trying to access the articles in the resources part of the lucene > wiki but all of them say

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Chris Lu
No need for Hadoop. It's even slower. Lucene can do it easily. This has been implemented in DBSight. The implementation is very similar to Facet search. You just need a way to load the field quickly, like putting it in memory or in some data structure, and to compute the sum/min/max during searching. --
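A rough sketch of that approach, assuming the numeric field has already been loaded into an array indexed by document number: the stats are accumulated in a single pass over the matching doc ids. This is not DBSight's code (which is not shown in this thread); `FieldStats` and `over` are hypothetical names.

```java
public final class FieldStats {
    public final int count;
    public final double sum;
    public final double min;
    public final double max;

    private FieldStats(int count, double sum, double min, double max) {
        this.count = count;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    // fieldValues[doc] is the in-memory value of the field for document number doc;
    // matchingDocs holds the document numbers the search matched.
    public static FieldStats over(double[] fieldValues, int[] matchingDocs) {
        int count = 0;
        double sum = 0.0;
        double min = Double.POSITIVE_INFINITY; // stays at +infinity when nothing matches
        double max = Double.NEGATIVE_INFINITY; // stays at -infinity when nothing matches
        for (int doc : matchingDocs) {
            double value = fieldValues[doc];
            count++;
            sum += value;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        return new FieldStats(count, sum, min, max);
    }

    // Average of the matched values, or NaN when nothing matched.
    public double average() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

For example, `FieldStats.over(new double[] {10, 15, 20, 5}, new int[] {0, 2})` covers documents 0 and 2 only. The array lookup is what makes this cheap: no stored document is loaded per hit.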

[ANN] Luke - The Lucene Index Toolbox - 1.0.1 release

2010-04-01 Thread Andrzej Bialecki
Hi all, I'm happy to announce the release of Luke - the Lucene Index Toolbox. You can get an executable self-contained jar here: http://luke.googlecode.com/files/lukeall-1.0.1.jar The Downloads section also contains the source code and a minimal Luke-only jar. This release upgrades to Lucene 3

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
On Fri, Apr 2, 2010 at 12:54 AM, Chris Lu wrote: > No need for Hadoop. It's even more slower. Lucene can do it easily. > > This has been implemented in DBSight. > The implementation is very similar to Facet search. Just need a way to load > the field quickly, like put it in memory or some data str

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Chris Lu
For DBSight, the aggregated values are computed at run time. And the sorting on the computed aggregated values is done when displaying the results. The idea is that, after the aggregation, the number of aggregated values is much, much smaller. -- Chris Lu - Instant Sc

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Michel Nadeau
I'm sure the DBSight feature is great, but we already have a system in place and we're not throwing it away -- it's closely integrated with our whole platform. We're way past the point of switching our solution to DBSight. We'd be more than happy to use the DBSight feature if it would be opensource b

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Chris Lu
Thanks. Not really trying to sell DBSight here since most people here are Lucene experts. Just to confirm that this "challenge" has been done via Lucene for quite a while. The technique for it is very similar to how facet search is done, which can also be done in several ways. Millions of rows are not r

Re: Designing a multilingual index

2010-04-01 Thread henrib
By issuing multiple queries, one against each localized index, results being clustered by locale. You can further refine by translating the end-user input query terms for each locale and issue "translated" queries against the respective indices. I've seen satisfying results with "key" terms dicti
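Here is a toy sketch of that fan-out, with each localized index reduced to a map from term to document ids and each dictionary to a map from the user's term to its translation. Everything in it (`LocaleFanOut`, `searchAll`, the map-based "indices") is a hypothetical simplification, not the Solr multicore setup described above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocaleFanOut {

    // Issues one lookup per localized index and returns the hits clustered by locale.
    // dictionaries: locale -> (user term -> translated term); a missing entry means
    //               the user's term is sent to that index untranslated.
    // indices:      locale -> (term -> ids of documents containing the term).
    public static Map<String, List<String>> searchAll(
            String userTerm,
            Map<String, Map<String, String>> dictionaries,
            Map<String, Map<String, List<String>>> indices) {
        Map<String, List<String>> clustered = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, List<String>>> index : indices.entrySet()) {
            String locale = index.getKey();
            Map<String, String> dictionary =
                    dictionaries.getOrDefault(locale, Collections.emptyMap());
            String term = dictionary.getOrDefault(userTerm, userTerm);
            List<String> hits = index.getValue().getOrDefault(term, Collections.emptyList());
            clustered.put(locale, hits);
        }
        return clustered;
    }
}
```

Keeping the results in a per-locale map, rather than merging them into one ranked list, sidesteps the question of how to compare scores that come from different indices.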

Re: Designing a multilingual index

2010-04-01 Thread henrib
Hi David, pagod wrote: > > ... apply only in a particular situation: > Very true, as often in the IR field :-) ; in our case, the "same" document existed in different locales; these were localized technical docs which also meant the dictionary (of important) terms was limited and used to influ

IndexWriter and memory usage

2010-04-01 Thread Woolf, Ross
We are seeing a situation where the IndexWriter is using up the Java heap space and only releases memory for garbage collection upon a commit. We are using the default RAMBufferSize of 16 MB. We are using Lucene 2.9.1. Our heap size is set to 512 MB. We have a large number of documents th

Re: IndexWriter and memory usage

2010-04-01 Thread Michael McCandless
Hmm, not good. Can you post a heap dump? Also, can you turn on infoStream, index up to the OOM @ 512 MB, and post the output? IndexWriter should not hang onto much beyond the RAM buffer. But, it does allocate and then recycle this RAM buffer, so even in an idle state (having indexed enough docs

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Michel Nadeau
My big question is how do you loop 1M records, sum up field(s), and then sort on that field... all in memory (could use too much ram) ? In a temporary index (could take a while to re-write a lot of documents in a new index) ? - Mike aka...@gmail.com On Thu, Apr 1, 2010 at 5:31 PM, Chris Lu wro

Re: Designing a multilingual index

2010-04-01 Thread henrib
pagod wrote: > > ... apply only in a particular situation: > Very true, as often in the IR field :-) ; in our case, the "same" document existed in different locales; these were localized technical docs which also meant the dictionary (of important) terms was limited and used to influence scorin

Memory use and Lucene

2010-04-01 Thread John Viviano
All - I have a question about memory use and Lucene. I'm not sure if I'm dealing with a leak, or if I'm seeing expected behavior. I'll preface this by acknowledging that the "error" could be in my understanding of things. I've included a lot of information below. There is a demo progr

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Jason Eacott
Thanks for the ref - didn't know about Pig before. The language and approach look useful, so now I'm wondering if it couldn't be used across Lucene over Hadoop too. If data was indexed in Lucene and Pig knew that, then it could make for an interesting alternate Lucene query language. could this w

Re: query: order of search

2010-04-01 Thread suman.holani
Hello Erick, I was trying to optimise the searching. Basically, in my data field1 has fewer matching docs than field2, which has larger sets. So if the search goes in order, then I can make field1 be searched first (by ordering the boolean query accordingly) and from there the