access payload from HitCollector.collect()

2010-03-22 Thread prasenjit mukherjee
I am trying to implement Oracle-style SQL aggregation ( e.g. SUM(col3) where col1='foo' and col2='bar' ) using Lucene's payload feature. I can add the integer value ( of col3 ) as a payload to my searchable fields ( col1 and col2 ). I can probably extend DefaultSimilarity's scorePayload()
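
The aggregation described in this post can be pictured with a small toy model. This is only a sketch under assumptions: it is plain Python rather than Lucene code, it does not use the payload or HitCollector APIs, and the index layout, the helper names (`build_index`, `sum_payloads`) and the sample rows are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical toy index: (field, value) -> {doc_id: payload}, where the
# payload is the integer col3 value attached to each searchable term.
Postings = Dict[int, int]
Index = Dict[Tuple[str, str], Postings]


def build_index(rows) -> Index:
    """Index col1 and col2 of each row, storing col3 as the payload."""
    index: Index = {}
    for doc_id, (col1, col2, col3) in enumerate(rows):
        for term in (("col1", col1), ("col2", col2)):
            index.setdefault(term, {})[doc_id] = col3
    return index


def sum_payloads(index: Index, col1: str, col2: str) -> int:
    """SUM(col3) WHERE col1 = ... AND col2 = ... via postings intersection."""
    first = index.get(("col1", col1), {})
    second = index.get(("col2", col2), {})
    matching = first.keys() & second.keys()  # docs that satisfy both clauses
    return sum(first[doc_id] for doc_id in matching)


# Made-up sample data, for illustration only.
rows = [("foo", "bar", 10), ("foo", "baz", 5), ("foo", "bar", 7), ("qux", "bar", 3)]
index = build_index(rows)
print(sum_payloads(index, "foo", "bar"))
```

The point of the sketch is that the value to be summed travels with the postings, so the sum can be computed while the matching documents are being collected.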

Re: Lucene Challenge - sum, count, avg, etc.

2010-03-31 Thread prasenjit mukherjee
I too am trying to achieve something. I am thinking of storing the integer values in payloads and then using spanquery classes to compute the respective SUMs -Prasen On Thu, Apr 1, 2010 at 6:47 AM, Michel Nadeau wrote: > Hi, > > We're currently in the process of switching many of our screens f

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
our other data and sorting? > > - Mike > aka...@gmail.com > > > On Wed, Mar 31, 2010 at 9:23 PM, prasenjit mukherjee > wrote: > >> I too am trying to achieve something. >> >> I am thinking of storing the integer values in  payloads and then >> using spanque

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
d excel in something it wasn't designed for ;-) > > -D > > > -Original Message- > From: prasenjit mukherjee [mailto:prasen@gmail.com] > Sent: Thursday, April 01, 2010 8:11 AM > To: java-user@lucene.apache.org > Subject: Re: Lucene Challenge - sum, count,

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
>= '2010-03-06' > GROUP BY Affiliate > ORDER BY TotalSales DESC; > > - Mike > aka...@gmail.com > > > On Thu, Apr 1, 2010 at 8:11 AM, prasenjit mukherjee > wrote: > >> Not sure what you mean by "joining" in lucene , since conceptually >

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
ll types (affiliates, > sales) so looping tons of records each time isn't possible. > > - Mike > aka...@gmail.com > > > On Thu, Apr 1, 2010 at 2:11 PM, prasenjit mukherjee > wrote:

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread prasenjit mukherjee
eate_Lucene_Database_Search_in_3_minutes > DBSight customer, a shopping comparison site, (anonymous per request) got > 2.6 Million Euro funding! > > > prasenjit mukherjee wrote: >> >> This looks like a use case more suited  for Pig ( over Hadoop ). >> >> It could be

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-02 Thread prasenjit mukherjee
interesting alternate lucene query language. > > could this work? > > > prasenjit mukherjee wrote: >> This looks like a use case more suited  for Pig ( over Hadoop ). >> >> It could be difficult for lucene to do sort and sum simultaneously as >> sorting itself dep

Re: Document on Indexing in Lucene

2006-10-12 Thread Prasenjit Mukherjee
Did someone delete the shared doc? [EMAIL PROTECTED] wrote: Hello, I have got a lot of personal emails asking me to share the "Lucene Investigation" document. It is not possible to reply to each of the emails. So I am putting this document inside my briefcase. Anyone interested please go to following sit

using lucene to find neighbouring points in an n-dimensional space

2011-10-22 Thread prasenjit mukherjee
My use case is the following: given an n-dimensional vector ( positive coordinates only ), find its closest neighbours. I would like to try this out with Lucene's default ranking. Here is what a typical document will look like : ( or same thing ) doc1 = 1245:15 3490:20 8856:20 etc. As reflected in th
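
For illustration, the `dimension:weight` notation in this post can be treated as a sparse vector and compared with cosine similarity. This is a sketch under assumptions: plain cosine is used here as a stand-in and is not the same as Lucene's default scoring, and the helper names and the second document are hypothetical.

```python
import math
from typing import Dict, List


def parse_vector(text: str) -> Dict[int, float]:
    """Parse 'dim:weight dim:weight ...' into a sparse vector."""
    vector: Dict[int, float] = {}
    for pair in text.split():
        dim, weight = pair.split(":")
        vector[int(dim)] = float(weight)
    return vector


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(dim, 0.0) for dim, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: Dict[int, float], docs: Dict[str, Dict[int, float]], k: int = 1) -> List[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
    return ranked[:k]


docs = {
    "doc1": parse_vector("1245:15 3490:20 8856:20"),
    "doc2": parse_vector("7:1 8856:5"),  # hypothetical second document
}
print(nearest(parse_vector("1245:15 3490:20"), docs))
```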

reusing the term-frequency count while indexing

2011-10-23 Thread prasenjit mukherjee
I already have the term-frequency count for all the terms in a document. Is there a way I can reuse that info while indexing? I would like to use Solr for this.

Re: reusing the term-frequency count while indexing

2011-10-23 Thread prasenjit mukherjee
my point of view, it's meaningless, since the analysis process has > to be performed to collect such as prox, offset, or syno, payload and so on. > > On Sun, Oct 23, 2011 at 11:22 PM, prasenjit mukherjee > wrote: > >> I already have the term-frequency-count for all the term

Re: using lucene to find neighbouring points in an n-dimensional space

2011-10-23 Thread prasenjit mukherjee
Any pointers/suggestions on my approach ? On 10/22/11, prasenjit mukherjee wrote: > My use case is the following : > Given an n-dimensional vector ( only +ve quadrants/points ) find its > closest neighbours. I would like to try out with lucene's default > ranking. Here is how a

Re: reusing the term-frequency count while indexing

2011-10-24 Thread prasenjit mukherjee
e this directly? I think the easiest way is to write a > simple tokenFilter that emit the term X times where X is the term > frequency. There is no easy way to pass these tuples to lucene > directly. > > simon > > On Mon, Oct 24, 2011 at 3:28 AM, prasenjit mukherjee > wrote: >
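
The suggestion quoted in this reply, emitting each term as many times as its frequency, can be sketched as a plain generator. This is not a Lucene TokenFilter; the function name `expand_term_frequencies` and the sample tuples are hypothetical and only illustrate the expansion step.

```python
from typing import Iterable, Iterator, Tuple


def expand_term_frequencies(pairs: Iterable[Tuple[str, int]]) -> Iterator[str]:
    """Yield each term `freq` times, so a downstream indexer that counts
    occurrences ends up with the precomputed term frequency."""
    for term, freq in pairs:
        for _ in range(freq):
            yield term


tokens = list(expand_term_frequencies([("solr", 3), ("rocks", 2), ("apache", 1)]))
print(tokens)
```

The cost of this approach is that the token stream grows with the sum of the frequencies, which is the overhead discussed in the rest of the thread.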

Re: reusing the term-frequency count while indexing

2011-10-25 Thread prasenjit mukherjee
On Tue, Oct 25, 2011 at 1:17 PM, Simon Willnauer wrote: > On Tue, Oct 25, 2011 at 5:08 AM, prasenjit mukherjee > wrote: >> Thats exactly I was trying to avoid :( >> >> I can afford to do that during indexing time, but it will be >> time-consuming to do that a

Re: reusing the term-frequency count while indexing

2011-10-25 Thread prasenjit mukherjee
ryparsersyntax.html#Boosting%20a%20Term > > Am 25.10.2011 11:19, schrieb prasenjit mukherjee: >> >> During search time I get the following input ( only for 1 field ) = >> "solr:3 rocks:2 apache:1" . For this I have to create the lucene query >> in the
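
As a rough illustration of the term-boosting approach referenced in this reply, the `term:weight` input can be rewritten into the query parser's `term^boost` form. This is a sketch under assumptions: the function name `to_boosted_query` is hypothetical, and it assumes simple terms that need no escaping.

```python
def to_boosted_query(weighted_terms: str) -> str:
    """Turn 'solr:3 rocks:2 apache:1' into 'solr^3 rocks^2 apache^1'."""
    clauses = []
    for pair in weighted_terms.split():
        # Split on the last colon so the weight is always the final part.
        term, _, weight = pair.rpartition(":")
        clauses.append(f"{term}^{weight}")
    return " ".join(clauses)


print(to_boosted_query("solr:3 rocks:2 apache:1"))
```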

Re: using lucene to find neighbouring points in an n-dimensional space

2011-10-27 Thread prasenjit mukherjee
njit > > > Felipe Hummel > > > On Sun, Oct 23, 2011 at 9:33 PM, prasenjit mukherjee > wrote: > >> Any pointers/suggestions on my approach ? >> >> >> On 10/22/11, prasenjit mukherjee wrote: >> > My use case is the following : >> > Give

frequent keyword computation within a search ( and timeinterval )

2012-01-03 Thread prasenjit mukherjee
I have a requirement where reads and writes are quite high ( @ 100-500 per-sec ). A document has the following fields : timestamp, unique-docid, content-text, keyword. Average content-text length is ~ 20 bytes, there is only 1 keyword for a given docid. At runtime, given a query-term ( which coul

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread prasenjit mukherjee
s are pretty high, so you'll need > some experimentation to size your site > correctly. > > Best > Erick > > On Wed, Jan 4, 2012 at 12:17 AM, prasenjit mukherjee > wrote: >> I have a requirement where reads and writes are quite high ( @ 100-500 >> per-sec ). A

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread prasenjit mukherjee
eplacement for >>>> an RDBMS. It is a *text search engine*. >>>> Whenever you start asking "how do I implement >>>> a SQL statement in Solr", you have to stop >>>> and reconsider *why* you are trying to do that. >>>> Then recast the questio

Problem using custom-separator in UpdateCSV ( in solr )

2012-01-08 Thread prasenjit mukherjee
I am trying to add a document to a Solr index via : $> curl "http://localhost:8983/solr/update/csv?commit=true&fieldnames=id,title_s&separator=%09" --data "Doc1\tTitle1" -H 'Content-type:text/plain; charset=utf-8' Solr doesn't seem to recognize the \t in the content, and is failing with followin
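
One possible cause, which the truncated error message neither confirms nor rules out, is that the shell passes `\t` inside double quotes as a literal backslash followed by `t`, so the request body never contains a real tab. The sketch below builds the same request with an actual tab character. It only constructs the request and does not send it; the helper name `build_csv_update` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_csv_update(base_url, fields, rows, separator="\t"):
    """Build (but do not send) a CSV update request whose body really
    contains the separator character."""
    query = urlencode({
        "commit": "true",
        "fieldnames": ",".join(fields),
        "separator": separator,  # a tab is percent-encoded as %09
    })
    body = "\n".join(separator.join(row) for row in rows).encode("utf-8")
    return Request(
        f"{base_url}?{query}",
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


request = build_csv_update(
    "http://localhost:8983/solr/update/csv",
    fields=["id", "title_s"],
    rows=[["Doc1", "Title1"]],
)
print(request.full_url)
print(request.data)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a Solr instance.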

Distributed Lucene..

2006-03-05 Thread Prasenjit Mukherjee
I already have an implementation of a distributed crawler farm, where crawler instances are running on different boxes. I want to come up with a distributed indexing scheme using Lucene that takes advantage of the distributed nature of my crawlers. Here is what I am thinking.

Re: Distributed Lucene..

2006-03-06 Thread Prasenjit Mukherjee
I think Nutch has a distributed Lucene implementation. I could have used Nutch straightaway, but I have a different crawler, and also don't want to use NDFS (which is used by Nutch). What I proposed earlier is basically based on the MapReduce paradigm, which is used by Nutch as well. It would

Data structure of a Lucene Index

2006-03-28 Thread Prasenjit Mukherjee
It seems to me that Lucene doesn't use a B-tree for its index storage. Is there any paper/article that explains the theory behind the data structure of a single index (segment)? I am not referring to the merge algorithm; I am curious to know the storage structure of a single optimized Lucene index. Any po

Re: Data structure of a Lucene Index

2006-03-29 Thread Prasenjit Mukherjee
I have already gone through the file format. What I was looking for is the underlying theory behind the chosen file formats. I am sure those file formats were decided based on some theoretical axioms. --prasen [EMAIL PROTECTED] wrote: On Mar 28, 2006, at 11:57 PM, Prasenjit Mukherjee wrote

Re: Data structure of a Lucene Index

2006-04-10 Thread Prasenjit Mukherjee
Dmitry Goldenberg wrote: Ideally, I'd love to see an article explaining both in detail: the index structure as well as the merge algorithm... ________ From: Prasenjit Mukherjee [mailto:[EMAIL PROTECTED] Sent: Tue 3/28/2006 11:57 PM To: java-user@lucene.apache.org S

Re: Distributed Lucene.. - clustering as a requirement

2006-04-10 Thread Prasenjit Mukherjee
Agreed, an inverted index cannot be efficiently maintained in a B-tree (hence an RDBMS). But I think we can (or should) have the option of B-tree based storage for unindexed fields, whereas for indexed fields we can use Lucene's existing architecture. prasen [EMAIL PROTECTED] wrote: Dmi

Any storage initiatives for optimized indexing/searching

2006-04-18 Thread Prasenjit Mukherjee
It seems that the performance of any indexing/searching algorithm depends heavily on disk-access technology. Just curious: does anybody know of any company (mostly storage companies) working on improving its storage/disk-access technology to make indexing/searching effici

Search algo for the postings ( or TermFreqs)

2006-04-25 Thread Prasenjit Mukherjee
Given a term "myterm", what kind of search algorithm does Lucene use to get to the postings list (i.e. the term-frequency location in the .frq file)? From what I understood by looking into the Lucene file format, it keeps the whole of the .tii file in memory and does a skipped linear search o
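
The kind of lookup this question asks about can be sketched as a sparse term index: a small in-memory sample of every Nth term is searched first, and only one block of the full sorted term list is then scanned linearly. This is a toy model under assumptions, not Lucene's actual reader code: the class name `SparseTermIndex` is hypothetical, the sampled list stands in for the role of the .tii file, the full list for the on-disk term dictionary, and the interval of 128 is an assumed default.

```python
from bisect import bisect_right
from typing import List


class SparseTermIndex:
    """Every `interval`-th term is kept in memory; lookups binary-search
    that sample, then scan at most `interval` terms of the full list."""

    def __init__(self, sorted_terms: List[str], interval: int = 128):
        self.terms = sorted_terms                # stands in for the full term dictionary
        self.interval = interval
        self.sampled = sorted_terms[::interval]  # stands in for the in-memory index

    def find(self, term: str) -> int:
        """Return the position of `term` in the full list, or -1 if absent."""
        block = bisect_right(self.sampled, term) - 1
        if block < 0:
            return -1  # term sorts before the first indexed term
        start = block * self.interval
        end = min(start + self.interval, len(self.terms))
        for position in range(start, end):
            if self.terms[position] == term:
                return position
            if self.terms[position] > term:
                break  # passed the slot where the term would be
        return -1


terms = sorted(f"term{i:04d}" for i in range(1000))
term_index = SparseTermIndex(terms, interval=16)
print(term_index.find("term0500"))
```

In a real index the position found this way would lead to the term's entry, which in turn points at its postings; the sketch stops at the dictionary lookup.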

Enforcing Primary key uniqueness in lucene index

2006-05-30 Thread Prasenjit Mukherjee
I want to enforce the concept of a unique primary key in a Lucene index by having a field whose values have to be unique across all Lucene documents. One way is to do a search just before indexing, but that seems to consume a lot of time, as you have to create a new IndexSearcher every time you want to

Document clustering using lucene

2006-06-15 Thread Prasenjit Mukherjee
I want to do some document clustering on a corpus of ~100,000 documents, with the average doc size being ~7k. I have looked into carrot2, but it seems to work only for relatively short documents and has some scaling issues for a large corpus. Certainly for this kind of corpus size, one cannot us

Will storing docs affect Lucene's search performance?

2006-08-11 Thread Prasenjit Mukherjee
I have a requirement (to use the highlighter) to store the doc content somewhere, and I am not allowed to use an RDBMS. I am thinking of using Lucene's Field with (Field.Store.YES and Field.Index.NO) to store the doc content. Will it have any negative effect on my search performance? I think I hav