Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
On Mon, Nov 16, 2009 at 9:44 AM, Earwin Burrfoot  wrote:
> This algo is strictly tied to sort-by-score, if I understand it correctly.
> Lucene has queries and sorting decoupled (except for allowOutOfOrder
> mess), so implementing it would require some really fat hacks.
>

According to the paper on Indexing Boolean Expressions (which uses the WAND
algo), sorting can be done based on scores that are determined by
weight assignments to key-value pairs:

http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

So I believe this can be generalized to sorting by any doc attribute,
given a proper weight-assignment model.

Of course, the devil-is-in-the-details :-(

-- Joaquin


> On Mon, Nov 16, 2009 at 20:26, J. Delgado  wrote:
>> As I understand it, setMinimumNumberShouldMatch(int min) is used to
>> specify the minimum number of optional BooleanClauses which must be
>> satisfied.
>>
>> I haven't seen the implementation of setMinimumNumberShouldMatch but
>> it seems a bit different from what is intended with the WAND operator,
>> which can take any real number as the threshold θ.
>>
>> As stated in the paper:
>>
>> WAND(X1, w1, . . ., Xk, wk, θ) is true iff SUM over 1 ≤ i ≤ k of (xi * wi) ≥ θ
>>
>> where xi is the indicator variable for Xi, that is, xi = 1 if Xi is
>> true and xi = 0 otherwise.
>>
>> Observe that WAND can be used to implement AND
>> and OR via
>> AND(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, k),
>> and
>> OR(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, 1).
>>
>> What I find interesting is the idea of a first pass using the
>> upper-bound (maximal) contribution of each term to any document score, and
>> the dynamic setting of the threshold θ to decide whether to skip or to fully
>> evaluate a document.
>>
>> As stated in the paper:
>>
>> "Given this setup our preliminary scoring consists of evaluating
>> for each document d
>> WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
>> where Xi is an indicator variable for the presence of query term i in
>> document d and the threshold θ is varied during
>> the algorithm as explained below. If WAND evaluates to true, then the
>> document d undergoes a full evaluation.
>> The threshold θ is set dynamically by the algorithm based on the
>> minimum score m among the top n results found so
>> far, where n is the number of requested documents. The larger the
>> threshold, the more documents will be skipped
>> and thus we will need to compute full scores for fewer documents."
>>
>> I think it's worth a try...
>>
>> -- Joaquin
>>
>> On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki  wrote:
>>>
>>> J. Delgado wrote:
>>>>
>>>> Here is the link to the paper.
>>>> http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf
>>>>
>>>> A more recent application and extension of the WAND operator, for
>>>> indexing Boolean expressions:
>>>> http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf
>>>>
>>>> -- Joaquin
>>>>
>>>>
>>>> On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar <
>>>> shalinman...@gmail.com> wrote:
>>>>
>>>>> Hey Joaquin,
>>>>>
>>>>> The mailing list strips off attachments. Can you please upload it 
>>>>> somewhere
>>>>> and give us the link?
>>>>>
>>>>> On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado wrote:
>>>>>> Please find attached the paper on "Efficient Query Evaluation using a
>>>>>> Two-Level Retrieval Process". I believe that such approach may improve
>>>>>
>>>>> the
>>>>>>
>>>>>> way Lucene/Solr evaluates queries today.
>>>
>>> The functionality of WAND (weak AND) is already implemented in Lucene, if I 
>>> understand it correctly - this is BooleanQuery.setMinimumNumberShouldMatch(int). 
>>> Lucene probably implements this differently from the algorithm described in 
>>> the paper, so there may still be some benefit in comparing the 
>>> algorithms in Lucene's BooleanScorer[2] with this one ...
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
As I understand it, setMinimumNumberShouldMatch(int min) is used to
specify the minimum number of optional BooleanClauses which must be
satisfied.

I haven't seen the implementation of setMinimumNumberShouldMatch but
it seems a bit different from what is intended with the WAND operator,
which can take any real number as the threshold θ.
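For comparison, a minimal sketch of the Lucene side (the field name and
terms are made up for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Three optional (SHOULD) clauses; a document must match at least two.
// Note the threshold is an integer clause count, not a real-valued
// weighted sum like WAND's θ.
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("body", "efficient")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("body", "retrieval")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("body", "wand")), BooleanClause.Occur.SHOULD);
bq.setMinimumNumberShouldMatch(2);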

As stated in the paper:

WAND(X1, w1, . . ., Xk, wk, θ) is true iff SUM over 1 ≤ i ≤ k of (xi * wi) ≥ θ

where xi is the indicator variable for Xi, that is, xi = 1 if Xi is
true and xi = 0 otherwise.

Observe that WAND can be used to implement AND
and OR via
AND(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, k),
and
OR(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, 1).

What I find interesting is the idea of a first pass using the
upper-bound (maximal) contribution of each term to any document score, and
the dynamic setting of the threshold θ to decide whether to skip or to fully
evaluate a document.

As stated in the paper:

"Given this setup our preliminary scoring consists of evaluating
for each document d
WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
where Xi is an indicator variable for the presence of query term i in
document d and the threshold θ is varied during
the algorithm as explained below. If WAND evaluates to true, then the
document d undergoes a full evaluation.
The threshold θ is set dynamically by the algorithm based on the
minimum score m among the top n results found so
far, where n is the number of requested documents. The larger the
threshold, the more documents will be skipped
and thus we will need to compute full scores for fewer documents."

I think it's worth a try...
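To make the two-level idea concrete, here is a minimal sketch in plain Java
(not Lucene's scorer; the Index interface, the upperBound array and
fullScore() stand in for the real postings and scoring machinery):

import java.util.PriorityQueue;

/** Sketch of the paper's two-level evaluation; not Lucene code. */
public class WandSketch {
    interface Index {
        boolean termPresent(int doc, int term); // xi for query term i in doc d
        float fullScore(int doc);               // expensive exact score
    }

    /** Scores of the top n docs among maxDoc candidates, ascending. */
    static float[] topScores(Index idx, float[] upperBound, int maxDoc, int n) {
        PriorityQueue<Float> top = new PriorityQueue<Float>(n); // min-heap
        float theta = 0f; // threshold, raised as better results are found
        for (int d = 0; d < maxDoc; d++) {
            // First level: cheap test of WAND(X1,UB1, ..., Xk,UBk, theta)
            float prelim = 0f;
            for (int t = 0; t < upperBound.length; t++)
                if (idx.termPresent(d, t)) prelim += upperBound[t];
            if (prelim < theta) continue;        // skip full evaluation
            // Second level: full scoring only for promising docs
            float score = idx.fullScore(d);
            if (top.size() < n) top.add(score);
            else if (score > top.peek()) { top.poll(); top.add(score); }
            if (top.size() == n) theta = top.peek(); // min score among top n
        }
        float[] out = new float[top.size()];
        for (int i = 0; i < out.length; i++) out[i] = top.poll();
        return out;
    }
}

The larger theta grows, the more documents fail the cheap first-level test,
so fewer full evaluations are needed, which is exactly the effect the paper
describes.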

-- Joaquin

On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki  wrote:
>
> J. Delgado wrote:
>>
>> Here is the link to the paper.
>> http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf
>>
>> A more recent application and extension of the WAND operator, for
>> indexing Boolean expressions:
>> http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf
>>
>> -- Joaquin
>>
>>
>> On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar <
>> shalinman...@gmail.com> wrote:
>>
>>> Hey Joaquin,
>>>
>>> The mailing list strips off attachments. Can you please upload it somewhere
>>> and give us the link?
>>>
>>> On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado wrote:
>>>> Please find attached the paper on "Efficient Query Evaluation using a
>>>> Two-Level Retrieval Process". I believe that such approach may improve
>>>
>>> the
>>>>
>>>> way Lucene/Solr evaluates queries today.
>
> The functionality of WAND (weak AND) is already implemented in Lucene, if I 
> understand it correctly - this is BooleanQuery.setMinimumNumberShouldMatch(int). 
> Lucene probably implements this differently from the algorithm described in 
> the paper, so there may still be some benefit in comparing the algorithms 
> in Lucene's BooleanScorer[2] with this one ...
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
Here is the link to the paper.
http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

A more recent application and extension of the WAND operator, for
indexing Boolean expressions:
http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

-- Joaquin

On Sun, Nov 15, 2009 at 11:15 PM, Uwe Schindler  wrote:

>  I see the attachment... (in java-dev)
>
>
>
> Uwe
>
>
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>   --
>
> *From:* Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
> *Sent:* Monday, November 16, 2009 8:13 AM
> *To:* solr-...@lucene.apache.org
> *Cc:* java-dev@lucene.apache.org
> *Subject:* Re: Efficient Query Evaluation using a Two-Level Retrieval
> Process
>
>
>
> Hey Joaquin,
>
>
>
> The mailing list strips off attachments. Can you please upload it somewhere
> and give us the link?
>
> On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado 
> wrote:
>
> Please find attached the paper on "Efficient Query Evaluation using a
> Two-Level Retrieval Process". I believe that such approach may improve the
> way Lucene/Solr evaluates queries today.
>
> Cheers,
>
> -- Joaquin
>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Grouping Lucene search results and calculating frequency by category

2009-04-10 Thread J. Delgado
Have you looked at SOLR?
http://lucene.apache.org/solr/

It pretty much has what you are looking for.
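For the City,State counts in particular, Solr's field faceting computes them
in a single request; a sketch, assuming one indexed field named city_state
holding values like "Boston, MA":

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=city_state&facet.limit=-1

Each facet value comes back with its document count, which is exactly the
"(450)"-style output described below.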

-- Joaquin

On Fri, Apr 10, 2009 at 9:39 PM, mitu2009  wrote:

>
> Am working on a store search API using Lucene.
>
> I need to show store search results for each City,State combination with
> its
> frequency in brackets, for example:
>
> Los Angeles, CA (450) Atlanta, GA (212) Boston, MA (78) . . .
>
> As of now, my search results return around 7000 lucene documents on an
> average if the user says "Show me all the stores". In this use case, I end
> up showing around 800 unique City,State records as shown above.
>
> Am overriding HitCollector class's Collect method and retrieving vectors as
> follows: var vectors = _reader.GetTermFreqVectors(doc); Then I iterate
> through this collection and calculate the frequency for each unique
> City,State combination.
>
> But this is turning out to be very, very slow in performance... is there any
> better way of grouping search results and calculating frequency in Lucene?
> A code snippet would be very helpful.
>
> Also, please suggest if I can optimize my Lucene search code using any
> other techniques/tips.
>
> Thanks for reading!
>
> --
> View this message in context:
> http://www.nabble.com/Grouping-Lucene-search-results-and-calculating-frequency-by-category-tp22997958p22997958.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: Realtime Search

2008-12-26 Thread J. Delgado
One thing that I forgot to mention is that in our implementation the
real-time indexing took place with many "folder-based" listeners writing to
many tiny in-memory indexes partitioned by "sub-sources", with fewer
long-term and archive indexes per box. Overall distributed search across
the various Lucene-based search services was done using a federator component,
very much like shard-based search is done today (I believe).

-- Joaquin.


On Fri, Dec 26, 2008 at 10:48 AM, J. Delgado wrote:

> The addition of docs into tiny segments using the current data structures
> seems the right way to go. Some time back one of my engineers implemented
> pseudo real-time using MultiSearcher by having an in-memory (RAM-based)
> "short-term" index that auto-merged into a disk-based "long-term" index that
> eventually got merged into "archive" indexes. Index optimization would take
> place during these merges. The search we required was very time-sensitive
> (searching last-minute breaking news wires). The advantage of having an
> archive index is that very old documents in our applications were not
> usually searched on unless archives were explicitly selected.
> -- Joaquin
>
>
> On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting  wrote:
>
>> Michael McCandless wrote:
>>
>>> So then I think we should start with approach #2 (build real-time on
>>> top of the Lucene core) and iterate from there.  Newly added docs go
>>> into tiny segments, which IndexReader.reopen pulls in.  Replaced or
>>> deleted docs record the delete against the right SegmentReader (and
>>> LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).
>>>
>>> I would take the simple approach first: use ordinary SegmentReader on
>>> a RAMDirectory for the tiny segments.  If that proves too slow, swap
>>> in Memory/InstantiatedIndex for the tiny segments.  If that proves too
>>> slow, build a reader impl that reads from DocumentsWriter RAM buffer.
>>>
>>
>> +1 This sounds like a good approach to me.  I don't see any fundamental
>> reasons why we need different representations, and fewer implementations of
>> IndexWriter and IndexReader is generally better, unless they get way too
>> hairy.  Mostly it seems that real-time can be done with our existing toolbox
>> of data structures, but with some slightly different control structures.
>>  Once we have the control structure in place then we should look at
>> optimizing data structures as needed.
>>
>> Doug
>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>


Re: Realtime Search

2008-12-26 Thread J. Delgado
The addition of docs into tiny segments using the current data structures
seems the right way to go. Some time back one of my engineers implemented
pseudo real-time using MultiSearcher by having an in-memory (RAM-based)
"short-term" index that auto-merged into a disk-based "long-term" index that
eventually got merged into "archive" indexes. Index optimization would take
place during these merges. The search we required was very time-sensitive
(searching last-minute breaking news wires). The advantage of having an
archive index is that very old documents in our applications were not
usually searched on unless archives were explicitly selected.
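A minimal sketch of that layout against the Lucene 2.x API (the paths, the
query and the merge timing are made up, and both indexes are assumed to
already exist and be populated):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class ShortLongTermDemo {
    public static void main(String[] args) throws Exception {
        // RAM "short-term" index for fresh docs, disk "long-term" index.
        Directory shortTerm = new RAMDirectory();
        Directory longTerm = FSDirectory.getDirectory("/index/longterm");

        // Search both through one MultiSearcher, newest index first.
        Searcher searcher = new MultiSearcher(new Searchable[] {
            new IndexSearcher(shortTerm), new IndexSearcher(longTerm) });
        Hits hits = searcher.search(new TermQuery(new Term("body", "breaking")));
        System.out.println(hits.length() + " hits");

        // Periodically fold the RAM index into the disk index ("auto-merge").
        IndexWriter writer = new IndexWriter(longTerm, new StandardAnalyzer(), false);
        writer.addIndexes(new Directory[] { shortTerm });
        writer.optimize();
        writer.close();
    }
}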

-- Joaquin

On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting  wrote:

> Michael McCandless wrote:
>
>> So then I think we should start with approach #2 (build real-time on
>> top of the Lucene core) and iterate from there.  Newly added docs go
>> into tiny segments, which IndexReader.reopen pulls in.  Replaced or
>> deleted docs record the delete against the right SegmentReader (and
>> LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).
>>
>> I would take the simple approach first: use ordinary SegmentReader on
>> a RAMDirectory for the tiny segments.  If that proves too slow, swap
>> in Memory/InstantiatedIndex for the tiny segments.  If that proves too
>> slow, build a reader impl that reads from DocumentsWriter RAM buffer.
>>
>
> +1 This sounds like a good approach to me.  I don't see any fundamental
> reasons why we need different representations, and fewer implementations of
> IndexWriter and IndexReader is generally better, unless they get way too
> hairy.  Mostly it seems that real-time can be done with our existing toolbox
> of data structures, but with some slightly different control structures.
>  Once we have the control structure in place then we should look at
> optimizing data structures as needed.
>
> Doug
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: Ocean and GData

2008-09-27 Thread J. Delgado
On Sat, Sep 27, 2008 at 5:03 AM, Jason Rutherglen <
[EMAIL PROTECTED]> wrote:

> Unlike MapReduce, there are no infrastructure whitepapers on
> how GData/Base works so I had to make a broad comparison rather than a
> specific one.

My understanding is that GBase is based on the infrastructure that Google is
building for large-scale distributed computing (Google File System,
MapReduce, BigTable, GData, etc.), and more specifically on BigTable, the
column-storage "database", which delivers extremely high performance and
reliability but provides only weak guarantees on data consistency. There is
plenty of documentation on these technologies.

I agree with Otis that it is worth mentioning the RDBMS characteristics
that real-time search displays, such as atomicity and transactionality.

-- Joaquin


Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
Please ignore the correction... "lose" is fine :-)

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]>wrote:

> Sorry, I meant "loose" (replacing "lose")
>
>
> On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]>wrote:
>
>> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Moving back to the RDBMS model will be a big step backwards, where we miss
>>> multivalued fields and arbitrary fields.
>>
>>
>>  No one is suggesting to "lose" any of the virtues of the field-based
>> indexing that Lucene provides. Quite the contrary: by extending the RDBMS
>> model with Lucene-based indexes one can map relational rows to documents and
>> columns to fields. Note that one relational field can be mapped to one or
>> more text-based fields, and multi-valued fields will still be allowed.
>>
>> Please check the Lucene OJVM implementation for details on the implementation
>> and philosophy of the RDBMS-Lucene converged model:
>>
>> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>>
>> More discussion at Marcelo's blog; he will be presenting at Oracle World
>> 2008 this week.
>> http://marceloochoa.blogspot.com/
>>
>> BTW, it just happens that this was implemented using Oracle, but a similar
>> implementation in H2 seems not only feasible but desirable.
>>
>> -- Joaquin
>>
>>
>>
>>>
>>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>>> <[EMAIL PROTECTED]> wrote:
>>> > Cool.  I mention H2 because it does have some Lucene code in it yes.
>>> > Also according to some benchmarks it's the fastest of the open source
>>> > databases.  I think it's possible to integrate realtime search for H2.
>>> >  I suppose there is no need to store the data in Lucene in this case?
>>> > One loses the multiple values per field Lucene offers, and the schema
>>> > becomes static.  Perhaps it's a trade-off?
>>> >
>>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]>
>>> wrote:
>>> >> Yes, both Marcelo and I would be interested.
>>> >>
>>> >> We looked into H2 and it looks like something similar to Oracle's ODCI
>>> can
>>> >> be implemented. Plus the primitive full-text implementation is based
>>> on
>>> >> Lucene.
>>> >> I say primitive because looking at the code I saw that one cannot
>>> define an
>>> >> Analyzer and for each scan corresponding to a where clause a searcher
>>> is
>>> >> opened and closed, instead of having a pool, plus it does not have any
>>> way to
>>> >> queue changes to reduce the use of the IndexWriter, etc.
>>> >>
>>> >> But it's open source and that is a great starting point!
>>> >>
>>> >> -- Joaquin
>>> >>
>>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>>> >> <[EMAIL PROTECTED]> wrote:
>>> >>>
>>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>>> >>> www.h2database.com to take advantage of both models.  I'm not sure
>>> how
>>> >>> exactly that would work, but it seems like it would not be too
>>> >>> difficult.  Perhaps this would solve being able to perform faster
>>> >>> hierarchical queries and perhaps other types of queries that Lucene
>>> is
>>> >>> not capable of.
>>> >>>
>>> >>> Is this something Joaquin you are interested in collaborating on?  I
>>> >>> am definitely interested in it.
>>> >>>
>>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <
>>> [EMAIL PROTECTED]>
>>> >>> wrote:
>>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>> >>> > <[EMAIL PROTECTED]> wrote:
>>> >>> >>
>>> >>> >> Regarding real-time search and Solr, my feeling is the focus
>>> should be
>>> >>> >> on
>>> >>> >> first adding real-time search to Lucene, and then we'll figure out
>>> how
>>> >>> >> to
>>> >>> >> incorporate that into Solr later.
>>> >>> >
>>> >>> >
>>> >>> > Otis, wha

Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
Sorry, I meant "loose" (replacing "lose")

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]>wrote:

> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> [EMAIL PROTECTED]> wrote:
>
>> Moving back to the RDBMS model will be a big step backwards, where we miss
>> multivalued fields and arbitrary fields.
>
>
>  No one is suggesting to "lose" any of the virtues of the field-based
> indexing that Lucene provides. Quite the contrary: by extending the RDBMS
> model with Lucene-based indexes one can map relational rows to documents and
> columns to fields. Note that one relational field can be mapped to one or
> more text-based fields, and multi-valued fields will still be allowed.
>
> Please check the Lucene OJVM implementation for details on the implementation
> and philosophy of the RDBMS-Lucene converged model:
>
> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>
> More discussion at Marcelo's blog; he will be presenting at Oracle World
> 2008 this week.
> http://marceloochoa.blogspot.com/
>
> BTW, it just happens that this was implemented using Oracle, but a similar
> implementation in H2 seems not only feasible but desirable.
>
> -- Joaquin
>
>
>
>>
>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>> <[EMAIL PROTECTED]> wrote:
>> > Cool.  I mention H2 because it does have some Lucene code in it yes.
>> > Also according to some benchmarks it's the fastest of the open source
>> > databases.  I think it's possible to integrate realtime search for H2.
>> >  I suppose there is no need to store the data in Lucene in this case?
>> > One loses the multiple values per field Lucene offers, and the schema
>> > becomes static.  Perhaps it's a trade-off?
>> >
>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]>
>> wrote:
>> >> Yes, both Marcelo and I would be interested.
>> >>
>> >> We looked into H2 and it looks like something similar to Oracle's ODCI
>> can
>> >> be implemented. Plus the primitive full-text implementation is based on
>> >> Lucene.
>> >> I say primitive because looking at the code I saw that one cannot
>> define an
>> >> Analyzer and for each scan corresponding to a where clause a searcher
>> is
>> >> opened and closed, instead of having a pool, plus it does not have any
>> way to
>> >> queue changes to reduce the use of the IndexWriter, etc.
>> >>
>> >> But it's open source and that is a great starting point!
>> >>
>> >> -- Joaquin
>> >>
>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> >> <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>> >>> www.h2database.com to take advantage of both models.  I'm not sure
>> how
>> >>> exactly that would work, but it seems like it would not be too
>> >>> difficult.  Perhaps this would solve being able to perform faster
>> >>> hierarchical queries and perhaps other types of queries that Lucene is
>> >>> not capable of.
>> >>>
>> >>> Is this something Joaquin you are interested in collaborating on?  I
>> >>> am definitely interested in it.
>> >>>
>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]
>> >
>> >>> wrote:
>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>> >>> > <[EMAIL PROTECTED]> wrote:
>> >>> >>
>> >>> >> Regarding real-time search and Solr, my feeling is the focus should
>> be
>> >>> >> on
>> >>> >> first adding real-time search to Lucene, and then we'll figure out
>> how
>> >>> >> to
>> >>> >> incorporate that into Solr later.
>> >>> >
>> >>> >
>> >>> > Otis, what do you mean exactly by "adding real-time search to
>> Lucene"?
>> >>> >  Note
>> >>> > that Lucene, being an indexing/search library (and not a full blown
>> >>> > search
>> >>> > engine), is by definition "real-time": once you add/write a document
>> to
>> >>> > the
>> >>> > index it becomes immediately searchable and if a document is
>> logically
>> >>> > deleted and no lon

Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <
[EMAIL PROTECTED]> wrote:

> Moving back to the RDBMS model will be a big step backwards, where we miss
> multivalued fields and arbitrary fields.


 No one is suggesting to "lose" any of the virtues of the field-based
indexing that Lucene provides. Quite the contrary: by extending the RDBMS
model with Lucene-based indexes one can map relational rows to documents and
columns to fields. Note that one relational field can be mapped to one or
more text-based fields, and multi-valued fields will still be allowed.
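As a small illustration of that mapping with the Lucene 2.x Field API (the
column names and values are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RowMapper {
    // One relational row becomes one Lucene Document.
    static Document toDocument(String id, String title, String[] tags) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // A multi-valued field: the same field name added repeatedly.
        for (int i = 0; i < tags.length; i++)
            doc.add(new Field("tag", tags[i], Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}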

Please check the Lucene OJVM implementation for details on the implementation
and philosophy of the RDBMS-Lucene converged model:

http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

More discussion at Marcelo's blog; he will be presenting at Oracle World
2008 this week.
http://marceloochoa.blogspot.com/

BTW, it just happens that this was implemented using Oracle, but a similar
implementation in H2 seems not only feasible but desirable.

-- Joaquin



>
> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
> > Cool.  I mention H2 because it does have some Lucene code in it yes.
> > Also according to some benchmarks it's the fastest of the open source
> > databases.  I think it's possible to integrate realtime search for H2.
> >  I suppose there is no need to store the data in Lucene in this case?
> > One loses the multiple values per field Lucene offers, and the schema
> > becomes static.  Perhaps it's a trade-off?
> >
> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]>
> wrote:
> >> Yes, both Marcelo and I would be interested.
> >>
> >> We looked into H2 and it looks like something similar to Oracle's ODCI
> can
> >> be implemented. Plus the primitive full-text implementation is based on
> >> Lucene.
> >> I say primitive because looking at the code I saw that one cannot define
> an
> >> Analyzer and for each scan corresponding to a where clause a searcher is
> >> opened and closed, instead of having a pool, plus it does not have any way
> to
> >> queue changes to reduce the use of the IndexWriter, etc.
> >>
> >> But it's open source and that is a great starting point!
> >>
> >> -- Joaquin
> >>
> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
> >> <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Perhaps an interesting project would be to integrate Ocean with H2
> >>> www.h2database.com to take advantage of both models.  I'm not sure how
> >>> exactly that would work, but it seems like it would not be too
> >>> difficult.  Perhaps this would solve being able to perform faster
> >>> hierarchical queries and perhaps other types of queries that Lucene is
> >>> not capable of.
> >>>
> >>> Is this something Joaquin you are interested in collaborating on?  I
> >>> am definitely interested in it.
> >>>
> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]>
> >>> wrote:
> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
> >>> > <[EMAIL PROTECTED]> wrote:
> >>> >>
> >>> >> Regarding real-time search and Solr, my feeling is the focus should
> be
> >>> >> on
> >>> >> first adding real-time search to Lucene, and then we'll figure out
> how
> >>> >> to
> >>> >> incorporate that into Solr later.
> >>> >
> >>> >
> >>> > Otis, what do you mean exactly by "adding real-time search to
> Lucene"?
> >>> >  Note
> >>> > that Lucene, being an indexing/search library (and not a full blown
> >>> > search
> >>> > engine), is by definition "real-time": once you add/write a document
> to
> >>> > the
> >>> > index it becomes immediately searchable and if a document is
> logically
> >>> > deleted it is no longer returned in a search, though physical deletion
> >>> > happens
> >>> > during an index optimization.
> >>> >
> >>> > Now, the problem of adding/deleting documents in bulk, as part of a
> >>> > transaction and making these documents available for search
> immediately
> >>> > after the transaction is committed sounds more like a search engine
> >>> > problem
> >>> > (i.e. SOLR, Nutch, Ocean), especially if these transactions are known
>

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread J. Delgado
Yes, both Marcelo and I would be interested.

We looked into H2 and it looks like something similar to Oracle's ODCI can
be implemented. Plus, the primitive full-text implementation is based on
Lucene.
I say primitive because, looking at the code, I saw that one cannot define an
Analyzer, and for each scan corresponding to a WHERE clause a searcher is
opened and closed instead of taken from a pool; plus it does not have any way to
queue changes to reduce the use of the IndexWriter, etc.

But it's open source and that is a great starting point!
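A sketch of the two fixes suggested here (one long-lived searcher instead of
open/close per scan, and writes queued and batched through a single
IndexWriter); the class and its names are illustrative, not H2's actual code:

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class FullTextService {
    private final Directory dir;
    private final IndexWriter writer;        // single long-lived writer
    private volatile IndexSearcher searcher; // shared, not per-scan
    private final BlockingQueue<Document> pending =
        new LinkedBlockingQueue<Document>();

    public FullTextService(Directory dir, Analyzer analyzer) throws IOException {
        this.dir = dir;
        this.writer = new IndexWriter(dir, analyzer, true); // caller's Analyzer
        this.searcher = new IndexSearcher(dir);
    }

    public void add(Document doc) { pending.add(doc); } // cheap, no I/O

    /** Drain queued docs in one batch, then refresh the shared searcher. */
    public synchronized void flush() throws IOException {
        Document doc;
        while ((doc = pending.poll()) != null) writer.addDocument(doc);
        writer.flush(); // make the batch visible (Lucene 2.3-style API)
        // Reopen once per batch, not per scan; real code would also close
        // the old searcher once no query is using it anymore.
        searcher = new IndexSearcher(dir);
    }

    public IndexSearcher searcher() { return searcher; }
}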

-- Joaquin

On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen <[EMAIL PROTECTED]
> wrote:

> Perhaps an interesting project would be to integrate Ocean with H2
> www.h2database.com to take advantage of both models.  I'm not sure how
> exactly that would work, but it seems like it would not be too
> difficult.  Perhaps this would solve being able to perform faster
> hierarchical queries and perhaps other types of queries that Lucene is
> not capable of.
>
> Is this something Joaquin you are interested in collaborating on?  I
> am definitely interested in it.
>
> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]>
> wrote:
> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> Regarding real-time search and Solr, my feeling is the focus should be
> on
> >> first adding real-time search to Lucene, and then we'll figure out how
> to
> >> incorporate that into Solr later.
> >
> >
> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>  Note
> > that Lucene, being an indexing/search library (and not a full blown search
> > engine), is by definition "real-time": once you add/write a document to
> the
> > index it becomes immediately searchable and if a document is logically
> > deleted and no longer returned in a search, though physical deletion
> happens
> > during an index optimization.
> >
> > Now, the problem of adding/deleting documents in bulk, as part of a
> > transaction and making these documents available for search immediately
> > after the transaction is commited sounds more like a search engine
> problem
> > (i.e. SOLR, Nutch, Ocean), specially if these transactions are known to
> be
> > I/O expensive and thus are usually implemented bached proceeses with some
> > kind of sync mechanism, which makes them non real-time.
> >
> > For example, in my previous life, I designed and helped implement a
> > quasi-realtime enterprise search engine using Lucene, having a set of
> > multi-threaded indexers hitting a set of multiple indexes allocated
> across
> > different search services which powered a broker-based distributed search
> > interface. The most recent documents provided to the indexers were always
> > added to the smaller in-memory (RAM) indexes which usually could absorb
> the
> > load of a bulk "add" transaction and later would be merged into larger
> disk
> > based indexes and then flushed to make them ready to absorb new fresh
> docs.
> > We even had further partitioning of the indexes that reflected time
> periods
> > with caps on size for them to be merged into older more archive based
> > indexes which were used less (yes the search engine default search was on
> > data no more than 1 month old, though the user could open the time window by
> > including archives).
> >
> > As for SOLR and OCEAN,  I would argue that these semi-structured search
> > engines are becoming more and more like relational databases with
> full-text
> > search capabilities (without the benefit of full relational algebra -- for
> > example joins are not possible using SOLR). Notice that "real-time" CRUD
> > operations and transactionality are core DB concepts and have been
> studied
> > and developed by database communities for quite a long time. There have
> been
> > recent efforts to efficiently integrate Lucene into relational
> > databases (see the Lucene JVM Oracle integration:
> >
> http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
> )
> >
> > I think we should seriously look at joining efforts with open-source
> > Database engine projects, written in Java (see
> > http://java-source.net/open-source/database-engines) in order to blend
> IR
> > and ORM once and for all.
> >
> > -- Joaquin
> >
> >
> >>
> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
> >> times to understand bits and pieces of it.  I have to admit ther

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene
implementation), the three minimal features a transactional DB should support
for Lucene integration are:

  1) The ability to define new functions (e.g. lcontains(), lscore()) which
would allow binding queries to Lucene and obtaining document/row scores (see
the sketch below).
  2) An API that would allow DML intercepts, like Oracle's ODCI.
  3) The ability to extend and/or implement new types of "domain" indexes
that the engine's query evaluation and execution/optimization planner can
use efficiently.
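For (1), here is a sketch of what the SQL side could look like, modeled on
Oracle Text's contains()/score() pattern; the exact lcontains/lscore
signatures should be checked against the project's documentation, and the
table and column names here are made up:

-- Bind a Lucene query to the indexed column and fetch each row's score.
SELECT id, lscore(1) AS score
  FROM docs
 WHERE lcontains(body, 'efficient AND retrieval', 1) > 0
 ORDER BY score DESC;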

Thanks Marcelo.

-- Joaquin

On Sun, Sep 7, 2008 at 8:16 AM, J. Delgado <[EMAIL PROTECTED]>wrote:

> On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[EMAIL PROTECTED]>wrote:
>
>  >>for example joins are not possible using SOLR).
>>
>> It's largely *because* Lucene doesn't do joins that it can be made to
>> scale out. I've replaced two large-scale database systems this year with
>> distributed Lucene solutions because this scale-out architecture provided
>> significantly better performance. These were "semi-structured" systems too.
>> Lucene's comparitively simplistic data model/query model is both a weakness
>> and a strength in this regard.
>>
>
>  Hey, maybe the right way to go for a truly scalable and high-performance
> semi-structured database is to marry HBase (Bigtable-like data storage)
> with SOLR/Lucene. I concur with you in the sense that simplistic data models
> coupled with high performance are the killer combination.
>
> Let me quote this from the original Bigtable paper from Google:
>
> " Bigtable does not support a full relational data model; instead, it
> provides clients with a simple data model that supports dynamic control over
> data layout and format, and allows clients to reason about the locality
> properties of the data represented in the underlying storage. Data is
> indexed using row and column names that can be arbitrary strings. Bigtable
> also treats data as uninterpreted strings, although clients often serialize
> various forms of structured and semi-structured data into these strings.
> Clients can control the locality of their data through careful choices in
> their schemas. Finally, Bigtable schema parameters let clients dynamically
> control whether to serve data out of memory or from disk."
>
>


Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[EMAIL PROTECTED]>wrote:

>>for example joins are not possible using SOLR).
>
> It's largely *because* Lucene doesn't do joins that it can be made to scale
> out. I've replaced two large-scale database systems this year with
> distributed Lucene solutions because this scale-out architecture provided
> significantly better performance. These were "semi-structured" systems too.
> Lucene's comparitively simplistic data model/query model is both a weakness
> and a strength in this regard.
>

 Hey, maybe the right way to go for a truly scalable and high-performance
semi-structured database is to marry HBase (Bigtable-like data storage)
with SOLR/Lucene. I concur with you in the sense that simplistic data models
coupled with high performance are the killer combination.

Let me quote this from the original Bigtable paper from Google:

" Bigtable does not support a full relational data model; instead, it
provides clients with a simple data model that supports dynamic control over
data layout and format, and allows clients to reason about the locality
properties of the data represented in the underlying storage. Data is
indexed using row and column names that can be arbitrary strings. Bigtable
also treats data as uninterpreted strings, although clients often serialize
various forms of structured and semi-structured data into these strings.
Clients can control the locality of their data through careful choices in
their schemas. Finally, Bigtable schema parameters let clients dynamically
control whether to serve data out of memory or from disk."


Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Regarding real-time search and Solr, my feeling is the focus should be on
> first adding real-time search to Lucene, and then we'll figure out how to
> incorporate that into Solr later.


Otis, what do you mean exactly by "adding real-time search to Lucene"?  Note
that Lucene, being an indexing/search library (and not a full blown search
engine), is by definition "real-time": once you add/write a document to the
index it becomes immediately searchable, and if a document is logically
deleted it is no longer returned in a search, though physical deletion happens
during an index optimization.

Now, the problem of adding/deleting documents in bulk, as part of a
transaction, and making these documents available for search immediately
after the transaction is committed sounds more like a search engine problem
(i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be
I/O expensive and thus are usually implemented as batched processes with some
kind of sync mechanism, which makes them non-real-time.

For example, in my previous life, I designed and helped implement a
quasi-realtime enterprise search engine using Lucene, having a set of
multi-threaded indexers hitting a set of multiple indexes allocated across
different search services which powered a broker-based distributed search
interface. The most recent documents provided to the indexers were always
added to the smaller in-memory (RAM) indexes, which usually could absorb the
load of a bulk "add" transaction and later would be merged into larger
disk-based indexes and then flushed to make them ready to absorb new fresh
docs. We even had further partitioning of the indexes that reflected time
periods, with caps on size, for them to be merged into older, more
archive-oriented indexes which were used less (yes, the search engine's
default search was on data no more than 1 month old, though the user could
open the time window by including archives).

As for SOLR and OCEAN, I would argue that these semi-structured search
engines are becoming more and more like relational databases with full-text
search capabilities (without the benefit of full relational algebra -- for
example, joins are not possible using SOLR). Notice that "real-time" CRUD
operations and transactionality are core DB concepts and have been studied
and developed by database communities for quite a long time. There have been
recent efforts to efficiently integrate Lucene into relational
databases (see the Lucene JVM Oracle integration:
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
)

I think we should seriously look at joining efforts with open-source
database engine projects written in Java (see
http://java-source.net/open-source/database-engines) in order to blend IR
and ORM once and for all.

-- Joaquin



>
>
> I've read Jason's Wiki as well.  Actually, I had to read it a number of
> times to understand bits and pieces of it.  I have to admit there is still
> some fuzziness about the whole thing in my head - is "Ocean" something that
> already works, a separate project on googlecode.com?  I think so.  If so,
> and if you are working on getting it integrated into Lucene, would it make
> it less confusing to just refer to it as "real-time search", so there is no
> confusion?
>
> If this is to be initially integrated into Lucene, why are things like
> replication, crowding/field collapsing, locallucene, name service, tag
> index, etc. all mentioned there on the Wiki and bundled with description of
> how real-time search works and is to be implemented?  I suppose mentioning
> replication kind-of makes sense because the replication approach is closely
> tied to real-time search - all query nodes need to see index changes fast.
>  But Lucene itself offers no replication mechanism, so maybe the replication
> is something to figure out separately, say on the Solr level, later on "once
> we get there".  I think even just the essential real-time search requires
> substantial changes to Lucene (I remember seeing large patches in JIRA),
> which makes it hard to digest, understand, comment on, and ultimately commit
> (hence the lukewarm response, I think).  Bringing other non-essential
> elements into discussion at the same time makes it more difficult to
>  process all this new stuff, at least for me.  Am I the only one who finds
> this hard?
>
> That said, it sounds like we have some discussion going (Karl...), so I
> look forward to understanding more! :)
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Yonik Seeley <[EMAIL PROTECTED]>
> > To: java-dev@lucene.apache.org
> > Sent: Thursday, September 4, 2008 10:13:32 AM
> > Subject: Re: Realtime Search for Social Networks Collaboration
> >
> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
> > wrote:
> > > I also think it's got a
> > > lot of things now which m

Re: Moving SweetSpotSimilarity out of contrib

2008-09-06 Thread J. Delgado
I cannot agree more with Otis. It's all about exposure! Without references
from the main Javadocs, some cool things in contrib just remain in obscurity.

-- Joaquin

On Sat, Sep 6, 2008 at 1:08 AM, Otis Gospodnetic <[EMAIL PROTECTED]
> wrote:

> Regarding SSS (and any other contrib visibility).
> Perhaps we should get into habit of referencing contrib goodies from highly
> visible (to developers) spots (no pun intended), like Javadocs.  Concretely,
> if SSS is so good or if it is simply one possible alternative Similarity
> that's available and that we (Lucene developers) know about, why are we not
> mentioning it in Javadocs for (Default)Similarity?
>
>
>
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/Similarity.html
>
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/DefaultSimilarity.html
>
> Javadocs have a lot of visibility, esp. in modern IDEs.  We can also have
> this mentioned on the Wiki, but Wiki is documentation that I think most
> people don't really like to read.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Michael McCandless <[EMAIL PROTECTED]>
> > To: java-dev@lucene.apache.org
> > Sent: Friday, September 5, 2008 6:41:48 AM
> > Subject: Re: Moving SweetSpotSimilarity out of contrib
> >
> >
> > Chris Hostetter wrote:
> >
> > > : Another important driver is the "out-of-the-box experience".
> > >
> > > I honestly have no idea what an OOTB experience for Lucene-Java
> > > means ...
> > > For Solr i understand, For Nutch i understand ... for a java
> > > library
> >
> > Well... even though it's a "java library", Lucene still has many
> > defaults.
> >
> > Sure, Solr has even more, so this is important for Solr too.
> >
> > Most non-Solr apps built on Lucene will simply use Lucene's defaults,
> > for lack of knowing any better.
> >
> > How well such apps then work is what I'm calling the OOTB experience
> > for Lucene, and I think it's well-defined and important.
> >
> > Especially spooky is when a publication does an eval of search
> > libraries because typically they will eval only the OOTB experience and
> > won't go looking on our wiki to discover all the tricks.
> >
> > With IndexWriter we default to flushing by RAM usage (16 MB) not by
> > buffered doc count, to ConcurrentMergeScheduler, to
> > LogByteSizeMergePolicy, to compound file format, mergeFactor is 10,
> > etc.
> >
> > IndexSearcher (and also IndexWriter, for lengthNorm) uses
> > Similarity.getDefault().
> >
> > QueryParser uses a number of defaults when translating the end user's
> > search text into all sorts of Query instances.
> >
> > In 2.3 we made great improvements to OOTB indexing speed, and that's
> > important.
> >
> > I think making improvements to OOTB relevance is also important, but I
> > agree this is much harder to do "in general" since there are so many
> > differences between the content in apps.
> >
> > That all being said... I also agree (on closer inspection) it's not
> > cut and dry that SSS is a good choice for default (what would be the
> > right default for its "curve"?).
> >
> > If other OOTB relevance improvements surface with time (eg a good way
> > to do passage scoring/retrieval or proximity scoring or lexical
> > affinity) then we should strongly consider them.  Such things always
> > come with a performance cost, though, so it'll be an interesting
> > discussion...
> >
> > > But then we get into that back-compat concern issue.
> >
> > Well...is Lucene's precise scoring formula guaranteed not to change
> > between releases?  I assume and hope not.
> >
> > Just like with indexing, where the precise choice of when committing
> > and merging and flushing happens was never "promised", that lack of
> > API promise gave us the freedom to drastically improve the OOTB
> > indexing speed without breaking any promises.  We need to keep that
> > same freedom on the search side.
> >
> > From our last discussion on back compat, our most powerful weapon is
> > to NOT make promises when they aren't necessary or could limit future
> > back compat.
> >
> > And, if we have a back compat situation that's holding back Lucene's
> > OOTB adoption by new users, we should think hard about switching the
> > default to favor new users and making an option to quickly get back to
> > the old behavior to accomodate existing users.  The recent bug fixes
> > to StandardTokenizer are such examples.
> >
> > Mike
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Re[4]: lucene scoring

2008-08-08 Thread J. Delgado
The only scores that I can think of that can measure "quality" across
different queries are invariant scores such as PageRank, that is, scoring
the document on its general information value and then using that as a filter
regardless of the query. This is very different from the problem of trying
to normalize the scores of the same query over different shards (indexes) in a
federated query setting, which has been researched extensively.

The reason why two queries have different "scales" for their scores is the
probabilistic nature of the algorithms, which view word occurrences as
independent random variables. Thus the occurrence of each word in a document
is treated as an independent event. Joint and conditional probabilities can be
estimated by looking at word co-occurrence, which could be used to compare two
specific results (i.e. how relevant is document X to both "baby kittens" and
"death metal", or if "baby kittens" is present in a doc how likely is it that
"death metal" is present too), but to use the TF-IDF based score as an
absolute measure is like trying to compare pears with apples. Trying to
normalize it is an ill-defined task.

-- J.D.



2008/8/8 Александр Аристов <[EMAIL PROTECTED]>

> Relevance ranking is an option but we still won't be able to compare results.
> Let's say we have distributed searching - in this case the top 10 from one server
> is not the same as the top 10 from another. Even worse, we may find that
> in the resulting set the document with the top score is worse than others.
>
> What if we disable normalization or make it constant - will the results be
> completely meaningless?
>
> And another approach: can we calculate the maximum possible top value? Or
> maybe just an approximation of it? We would then be able to compare results with it.
>
> Alex
>
>
> -Original Message-
> From: Grant Ingersoll <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Date: Thu, 7 Aug 2008 15:54:41 -0400
> Subject: Re: Re[2]: lucene scoring
>
>
> On Aug 7, 2008, at 3:05 PM, Александр Аристов wrote:
>
> > I want to implement searching with the ability to set a so-called
> > confidence level below which I would treat documents as garbage. I
> > cannot define the level per query, as the level should be relevant
> > for all documents.
> >
> > With the current scoring implementation the level would mean nothing. I
> > don't believe that since that time (the thread is from 2005)
> > nothing has been done towards resolving the issue.
>
> That's because there is no resolution to be had, as far as I know, but
> I'm open to suggestions (patches are even better.)  What would it mean
> to say that a score of 0.5 for "baby kittens" is comparable to a score
> of 0.5 for "death metal"?  Like I said, I don't think that 0.5 for
> "baby kittens" is even comparable later if you added other documents
> that contain any of the query terms.
>
> >
> >
> > Do you think any workarounds like implementing more sophisticated
> > queries so that we have approximately the same normalization values?
>
> I just don't think you will be successful with this, and I don't
> believe it is a Lucene issue alone, but one that applies to all search
> engines, but I could be wrong.
>
> I get what you are trying to do, though, I've wanted to do it from
> time to time.   Another approach may be to look for significant
> differences between scores w/in a result set.   For example, if doc 1
> is 0.8, doc 2 is 0.79 and then doc 3 is 0.2, then maybe one could
> argue that doc 3 is garbage, but even that is somewhat of a stretch.
> Garbage truly is in the eye of the beholder.
>
> Another option is to do more relevance tuning to make sure your top 10
> are as good as possible so that your garbage is minimized.
>
> -Grant
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: My understanding about lucene internals.

2008-06-30 Thread J. Delgado
Prasen,

Great summary!

On Mon, Jun 30, 2008 at 4:27 AM, Mukherjee, Prasenjit
<[EMAIL PROTECTED]> wrote:
> Hi,
>  I have tried to consolidate my understanding of Lucene with the
> following ppt slides. I would really appreciate your comments
> (especially where I am incorrect), specifically on slide 16, which talks
> about the segment layout (aka file format):
>
> http://docs.google.com/Presentation?docid=dmsxgtg_98dbh529dn
>
>
> Thanks,
> Prasen
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to do a query using less than or greater than

2008-06-24 Thread J. Delgado
I do not believe that the operators "<" and ">" are supported by
Lucene, but you can use RANGE SEARCH to achieve what you want. Just
put an unreachable upper boundary for "greater than" or an unreachable
lower boundary for "less than".

J.D.
On Tue, Jun 24, 2008 at 3:31 PM, Kyle Miller <[EMAIL PROTECTED]> wrote:
> Hi all,
>   I've been looking at the lucene documentation and the source code
> and I can't seem to find a greater than or less than operator in the
> default query syntax for Lucene.  Does anyone know if they exist
> and how to use them?  For a concrete example I'm looking to do a query
> on a date field to find documents earlier than a specified date or
> later than a specified date.  Ex: date:( >20070101)  or date:
> (<20070101).  I looked at the range query feature but it didn't appear
> to cover this case. Anyone have any suggestions?
> Thanks,
> Kyle

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fwd: New binary distribution of Oracle-Lucene integration

2008-04-13 Thread J. Delgado
Here is the latest on the Oracle-Lucene Integration.

J.D.

-- Forwarded message --
From: Marcelo Ochoa <[EMAIL PROTECTED]>
Date: Mon, Apr 7, 2008 at 10:01 AM
Subject: New binary distribution of Oracle-Lucene integration
To: [EMAIL PROTECTED]


Hi all:
 I just released a new version of Oracle-Lucene integration
 implemented as a Domain Index.
 The binary distribution has a very straightforward installation and
 testing procedure; downloads are at the SF.net web site:

http://sourceforge.net/project/showfiles.php?group_id=56183&package_id=255524&release_id=589900
 Updated documentation is available as Google Document at:
 http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
 Source is available with public CVS access at:
 http://dbprism.cvs.sourceforge.net/dbprism/ojvm/
 As a consequence of reading many mails with feedback and development
 tips from this list, this new version has significant performance
 improvements: a rowid<->lucene doc id cache, and usage of the
 LoadFirstFieldSelector class to prevent Lucene from loading a complete
 doc when we only need the rowid.
 Many thanks to all for sharing the experience.
 A complete list of changes is at:

http://dbprism.cvs.sourceforge.net/dbprism/ojvm/ChangeLog.txt?revision=1.3&view=markup
 Best regards, Marcelo.

 PD: I plan to make a new version of Oracle-Lucene integration
 synchronized with Lucene 2.3.1 ASAP.
 --
 Marcelo F. Ochoa
 http://marceloochoa.blogspot.com/
 http://marcelo.ochoa.googlepages.com/home
 __
 Do you Know DBPrism? Look @ DB Prism's Web Site
 http://www.dbprism.com.ar/index.html
 More info?
 Chapter 17 of the book "Programming the Oracle Database using Java &
 Web Services"
 http://www.amazon.com/gp/product/183296/
 Chapter 21 of the book "Professional XML Databases" - Wrox Press
 http://www.amazon.com/gp/product/1861003587/
 Chapter 8 of the book "Oracle & Open Source" - O'Reilly
 http://www.oreilly.com/catalog/oracleopen/



--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
__
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/183296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/


Re: an API for synonym in Lucene-core

2008-03-13 Thread J. Delgado
Mathieu,

Have you thought about incorporating a standard thesaurus format, and thus
standardizing query/index expansion? Here is the recommendation from NISO:
http://www.niso.org/committees/MT-info.html

Beyond synonyms, having the capability to specify the use of BT (broader
terms, or hypernyms) or NT (narrower terms, or hyponyms) is very useful for
giving more general or more specific context to the query.

There are other tricks, such as weighting terms from a thesaurus based on
their number of occurrences in the index, as well as extracting potential
"used-for" terms by looking at patterns such as a word followed by a
parenthesis containing a small number of tokens (i.e. "term (...)").
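A minimal sketch of such an expansion filter against the Lucene 2.x
TokenFilter API; the Thesaurus interface is an assumption standing in for
whatever thesaurus reader (NISO, OO.o, WordNet...) one implements:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Injects thesaurus terms (synonyms, BT/NT) at the same position as the
 *  original token. */
public class ThesaurusFilter extends TokenFilter {
    public interface Thesaurus { String[] expand(String term); }

    private final Thesaurus thesaurus;
    private final LinkedList<Token> pending = new LinkedList<Token>();

    public ThesaurusFilter(TokenStream in, Thesaurus thesaurus) {
        super(in);
        this.thesaurus = thesaurus;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) return pending.removeFirst();
        Token t = input.next();
        if (t == null) return null;
        for (String syn : thesaurus.expand(t.termText())) {
            Token s = new Token(syn, t.startOffset(), t.endOffset());
            s.setPositionIncrement(0); // stack at the same position as t
            pending.add(s);
        }
        return t;
    }
}

Weighting (for BT/NT vs. exact synonyms) would then be applied at
query-expansion time as term boosts rather than inside the filter.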

J.D.


On Thu, Mar 13, 2008 at 2:52 AM, Mathieu Lecarme <[EMAIL PROTECTED]>
wrote:

> I'll slice my contrib into small parts
>
> Synonyms
> 1) Synonym (Token + a weight)
> 2) Synonym provider from OO.o thesaurus
> 3) SynonymTokenFilter
> 4) Query expander which applies a filter (and a boost) on each of its
> TermQuery
> 5) a Synonym filter for the query expander
> 6) to be efficient, a Synonym can be excluded if it doesn't exist in the Index.
> 7) Stemming can be used as a dynamic Synonym
>
> Spell checking or the "do you mean?" pattern
> 1) The main concept is in the SpellCheck contrib, but in a
> non-extensible implementation
> 2) In some languages, like French, homophony is very important in
> misspelling: "there is more than one way to write it"
> 3) Homophony rules is provided by Aspell in a neutral language (just
> like SnowBall for stemming), I implemented a translator to build Java
> class from aspell file (it's the same format in aspell evolution :
> myspell and hunspell, wich are used in OO.o and firefox family)
> https://issues.apache.org/jira/browse/LUCENE-956
>
> Storing information about words found in an index
> 1) It's the Dictionary used in the SpellCheck contrib, in a more open way:
> a lexicon. It's a plain old Lucene index; a word becomes a Document, and
> Fields store computed information like size, n-gram tokens and homophony.
> All of them use filters taken from TokenFilter, so code duplication is avoided.
> 2) This information may be left unsynchronized with the index, in order
> not to slow down the indexing process, so some information needs to be
> checked lazily (does this synonym already exist in the index?), and lexicon
> correction can be done on the fly (if the synonym doesn't exist, write
> it in the lexicon for the next time). There is some work to do here to find
> the best and fastest way to keep information synchronized between index
> and lexicon (hard links, a log for nightly replay, complete iteration over
> the index to find deleted and new stuff ...)
> 3) Similar (more than only Synonym) and Near (misspelled) words use the
> Lexicon.
> https://issues.apache.org/jira/browse/LUCENE-1190
>
> Extending it
> 1) The Lexicon can be used to store Nouns, i.e. words that work better
> together, like "New York", "Apple II" or "Alexander the Great".
> Extracting nouns from a thesaurus is very hard, but the Wikipedia people
> have done part of the work: article titles can be a good start for building
> a noun list. And it works in many languages.
> Nouns can be used as an intuitive PhraseQuery, or as a suggestion for
> refining results.
>
> Implementing it well in Lucene
> The SpellCheck and WordNet contribs do part of it, but in a specific and
> non-extensible way. I think it's better when the foundation is checked by
> the Lucene maintainers, and afterwards the contrib is built on top of
> this foundation.
>
> M.
>
>
> Otis Gospodnetic wrote:
> > Grant, I think Mathieu is hinting at his JIRA contribution (I looked at
> it briefly the other day, but haven't had the chance to really understand
> it).
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > - Original Message 
> > From: Mathieu Lecarme <[EMAIL PROTECTED]>
> > To: java-dev@lucene.apache.org
> > Sent: Wednesday, March 12, 2008 5:47:40 AM
> > Subject: an API for synonym in Lucene-core
> >
> > Why doesn't Lucene have a clean synonym API?
> > The WordNet contrib is not an answer; it provides an interface for its own
> > needs, and most of the world doesn't speak English.
> > Compass provides a tool, just like Solr. Lucene is the framework for
> > applications like Solr, Nutch or Compass; why not backport the low-level
> > features of these projects?
> > A synonym API should provide a TokenFilter, an abstract storage that
> > maps a token -> similar tokens with weights, and tools for expanding a
> > query.
> > The OpenOffice dictionary project can provide data in different
> > languages, with compatible licences, I presume.
> >
> > M.
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
>
> --

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I'm pretty sure that what you describe is the case, especially considering
that PageRank (which drives their search results) is a per-document value
that is probably recomputed after some long time interval. I did see a
MapReduce algorithm to compute PageRank as well. However, I do think they
must be distributing the query load across many, many machines.

I also think that limiting flat results to the top 10 and then doing paging
is optimized for performance. Yet another reason why Google has not
implemented facet browsing or real-time clustering over their result set.

J.D.

On Feb 6, 2008 4:22 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> (trimming excessive cc-s)
>
> Ning Li wrote:
> > No. I'm curious too. :)
> >
> > On Feb 6, 2008 11:44 AM, J. Delgado <[EMAIL PROTECTED]> wrote:
> >
> >> I assume that Google also has distributed index over their
> >> GFS/MapReduce implementation. Any idea how they achieve this?
>
> I'm pretty sure that MapReduce/GFS/BigTable is used only for creating
> the index (as well as crawling, data mining, web graph analysis, static
> scoring etc). The overhead of MR jobs is just too high.
>
> Their impressive search response times are most likely the result of
> extensive caching of pre-computed partial hit lists for frequent terms
> and phrases - at least that's what I suspect after reading this paper
> (not by Google folks, but very enlightening):
> http://citeseer.ist.psu.edu/724464.html
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I assume that Google also has distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this?

J.D.



On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote:
>
> There seem to be a few other players in this space too.
>
> Are you from Rackspace?
> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-
> query-terabytes-data)
>
> AOL also has a Hadoop/Solr project going on.
>
> CNET does not have much brewing there.  Although Yonik and I had
> talked about it a bunch -- but that was long ago.
>
> --cw
>
> Clay Webster   tel:1.908.541.3724
> Associate VP, Platform Infrastructure http://www.cnet.com
> CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]
>
>
> > -Original Message-
> > From: Ning Li [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 06, 2008 1:57 PM
> > To: [EMAIL PROTECTED]; java-dev@lucene.apache.org; solr-
> > [EMAIL PROTECTED]
> > Subject: Lucene-based Distributed Index Leveraging Hadoop
> >
> > There have been several proposals for a Lucene-based distributed index
> > architecture.
> >  1) Doug Cutting's "Index Server Project Proposal" at
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
> >  2) Solr's "Distributed Search" at
> > http://wiki.apache.org/solr/DistributedSearch
> >  3) Mark Butler's "Distributed Lucene" at
> > http://wiki.apache.org/hadoop/DistributedLucene
> >
> > We have also been working on a Lucene-based distributed index
> > architecture. Our design differs from the above proposals in the way it
> > leverages Hadoop as much as possible. In particular, HDFS is used to
> > reliably store Lucene instances, Map/Reduce is used to analyze documents
> > and update Lucene instances in parallel, and Hadoop's IPC framework is
> > used. Our design is geared for applications that require a highly
> > scalable index and where batch updates to each Lucene instance are
> > acceptable (versus finer-grained document-at-a-time updates).
> >
> > We have a working implementation of our design and are in the process
> > of evaluating its performance. An overview of our design is provided
> > below. We welcome feedback and would like to know if you are interested
> > in working on it. If so, we would be happy to make the code publicly
> > available. At the same time, we would like to collaborate with people
> > working on existing proposals and see if we can consolidate our efforts.
> >
> > TERMINOLOGY
> > A distributed "index" is partitioned into "shards". Each shard
> > corresponds to a Lucene instance and contains a disjoint subset of the
> > documents in the index. Each shard is stored in HDFS and served by one
> > or more "shard servers". Here we only talk about a single distributed
> > index, but in practice multiple indexes can be supported.
> >
> > A "master" keeps track of the shard servers and the shards being
> served
> > by
> > them. An "application" updates and queries the global index through an
> > "index client". An index client communicates with the shard servers to
> > execute a query.
> >
> > KEY RPC METHODS
> > This section lists the key RPC methods in our design. To simplify the
> > discussion, some of their parameters have been omitted.
> >
> >   On the Shard Servers
> > // Execute a query on this shard server's Lucene instance.
> > // This method is called by an index client.
> > SearchResults search(Query query);
> >
> >   On the Master
> > // Tell the master to update the shards, i.e., Lucene instances.
> > // This method is called by an index client.
> > boolean updateShards(Configuration conf);
> >
> > // Ask the master where the shards are located.
> > // This method is called by an index client.
> > LocatedShards getShardLocations();
> >
> > // Send a heartbeat to the master. This method is called by a
> > // shard server. In the response, the master informs the
> > // shard server when to switch to a newer version of the index.
> > ShardServerCommand sendHeartbeat();
> >
> > QUERYING THE INDEX
> > To query the index, an application sends a search request to an index
> > client.
> > The index client then calls the shard server search() method for each
> > shard
> > of the index, merges the results and returns them to the application.
> > The
> > index client caches the mapping between shards and shard servers by
> > periodically calling the master's getShardLocations() method.
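
To illustrate the scatter/gather step just described, here is a rough
sketch of the merge an index client might perform. The Hit type and the
method signature are assumptions for illustration, not the proposal's
actual SearchResults API:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class IndexClientMerge {
    /** Minimal stand-in for a per-shard search result (an assumption). */
    public static class Hit {
        public final String shard;
        public final int doc;
        public final float score;
        public Hit(String shard, int doc, float score) {
            this.shard = shard; this.doc = doc; this.score = score;
        }
    }

    // Scatter/gather: the index client calls search() on every shard
    // server, then merges the per-shard hits into one ranked list.
    public static List<Hit> merge(List<List<Hit>> perShard, int n) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> hits : perShard) all.addAll(hits);
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score); // descending by score
            }
        });
        return all.subList(0, Math.min(n, all.size()));
    }
}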
> >
> > UPDATING THE INDEX USING MAP/REDUCE
> > To update the index, an application sends an update request to an
> > index client. The index client then calls the master's updateShards()
> > method, which schedules a Map/Reduce job to update the index. The
> > Map/Reduce job updates the shards in parallel and copies the new index
> > files of each shard (i.e., Lucene instance) to HDFS.
> >
> > The upd

Oracle-Lucene Domain Index (New Release)

2007-12-13 Thread J. Delgado
Once again, LendingClub.com, a social lending network that today
announced nation-wide expansion (see TechCrunch), is pleased to
contribute to the open source community a new release (2.2.0.2.0) of
the Oracle-Lucene Domain Index, a fast implementation of text indexing
and search using Lucene within the Oracle relational database. Many
thanks to Marcelo Ochoa, the developer that made it all happen!

Among the goodies you will find in the new release are:

* LuceneDomainIndex.countHits() function to replace the "select count(*)
from .. where lcontains(..) > 0" syntax (see the JDBC sketch after this list)
* support for inline pagination via the lcontains(col, 'rownum:[n TO m] AND ...') syntax
* rounding and padding support for columns of type DATE, TIMESTAMP, NUMBER,
FLOAT, VARCHAR2 and CHAR
* ODCI API array DML support
* BLOB parameter support
* sort by a column passed via the
lcontains(col, query_parser_str, sort_str, corr_id) syntax
* Logging support using Java Util Logging package
* JUnit test suites emulating middle tier environment
* Support for rebuild and optimize online for SyncMode:OnLine index
* XMLDB Export which allows inspecting the Lucene index using Luke or
other tools
* AutoTuneMemory parameter for replacing MaxBufferedDocs parameter
* Functional column support
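
As an illustration of the first two items above, here is a hedged JDBC
sketch of countHits() plus inline pagination. The connection details, the
schema/index names, and the exact countHits() argument conventions are
assumptions for illustration; see the documentation linked below for the
authoritative syntax.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LuceneDomainIndexDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:orcl", "scott", "tiger");

        // Hit count without a full "select count(*) .. lcontains(..) > 0" scan.
        PreparedStatement count = conn.prepareStatement(
                "select LuceneDomainIndex.countHits('SCOTT.IT1', ?) from dual");
        count.setString(1, "lucene AND oracle");
        ResultSet rs = count.executeQuery();
        if (rs.next()) System.out.println("hits: " + rs.getLong(1));

        // Second page of ten results via the inline rownum:[n TO m] syntax.
        PreparedStatement page = conn.prepareStatement(
                "select f1 from t1 where lcontains(f2, "
                        + "'rownum:[11 TO 20] AND (lucene AND oracle)') > 0");
        ResultSet prs = page.executeQuery();
        while (prs.next()) System.out.println(prs.getString(1));
        conn.close();
    }
}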

Here are the pointers:

Full Documentation:
http://docs.google.com/Doc?docid=ddgw7sjp_54fgj9kg&hl=en

New Binaries
http://sourceforge.net/project/showfiles.php?group_id=56183&package_id=255524

Release Notes:
http://sourceforge.net/project/shownotes.php?release_id=561159&group_id=56183

Cheers!

Joaquin Delgado, PhD
CTO, Lending Club

About Lending Club (TM)
LendingClub.com is an online social lending network where people can
borrow and lend money among themselves based upon their affinities
and/or social connections. Across all 50 states, members can borrow
money at a better interest rate than they would get from a bank or
credit card, and invest in diversified portfolios of loans with higher
rates of return than those offered by savings accounts, CDs or other
online lending services. LendingMatch (TM) technology helps match
lenders and borrowers using connections established through social
networks, associations and online communities, and builds diversified
portfolios based on lender preferences. Lending Club is headquartered
in Sunnyvale, CA. More information is available at www.lendingclub.com.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Analyzers

2007-10-28 Thread J. Delgado
If you don't want to start from scratch you may look at what is available in
the GATE framework, also written in Java:
http://gate.ac.uk/gate/doc/plugins.html#hindi

2007/10/28, Grant Ingersoll <[EMAIL PROTECTED]>:
>
> A Google search reveals:
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200408.mbox/[EMAIL PROTECTED]
>
> Which leads to
> http://ltrc.iiit.net/showfile.php?filename=onlineServices/morph/index.htm
>
> However, I don't see one contributed to contrib/analyzers, so feel
> free to take it on.  Sounds like a welcome addition to me.
>
> You might also try asking others on the Lucene User mailing list
> concerning their experience.
>
> Cheers,
> Grant
>
> On Oct 28, 2007, at 8:49 PM, Sandeep Mahendru wrote:
>
> > Hi All,
> >
> > My name is Sandeep Mahendru.
> >
> > I have been working at Wachovia Bank, Charlotte, North Carolina.
> >
> > I have been involved in a project where I am designing a Report/Log
> > tracker, which supports English-like queries.
> > I have been using Lucene indexing/searching a lot.
> >
> > I have gone through the concepts of Analyzers, Filters and Tokens. I
> > have also done some lexical analysis in the past on some projects.
> >
> > I am very interested in writing a Lucene analyzer for the HINDI
> > language.
> > Has this work been done? If not, I would like to work on it and add
> > it to the
> > Lucene API.
> >
> > I know that first I would have to work on defining the grammar for
> > the Hindi
> > language.
> >
> > Please let me know your comments on the same.
> >
> > Regards,
>
> --
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Boot Camp Training:
> ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Geographical indexing in Lucene

2007-10-01 Thread J. Delgado
Quadtrees and R-trees have been used as special "domain" indexes in Oracle
RDBMS for Spatial:
http://www.oracle.com/technology/products/spatial/htdocs/data_sheet_9i/9iR2_spatial_ds.html


Some lectures and papers:

http://csiweb.ucd.ie/staff/mbertolotto/home/lecture-notes4025-07-08.htm
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/9217/29235/01320042.pdf
http://www.espatial.com/pdf/TWP_Spatial_Oracle_Spatial_and_Oracle_Locator_10gR2_0513.pdf

This is yet more evidence of the RDBMS/IR integration trend.

-- Joaquin


2007/10/1, markharw00d <[EMAIL PROTECTED]>:
>
> Great work, Evgeny!
>
> I'm certainly interested in this area and will be dissecting this in
> some detail.
>
> I've done similar work before but making use of JTS (Java Topology
> Suite), using the OpenGIS standards for spatial features/queries and
> 2-pass spatial queries (first rough pass is MBB only, 2nd pass does full
> geometry tests but only for results that satisfied any Lucene-text
> queries). What I haven't addressed (which you have here) is disk-based
> spatial indexes, which are obviously the key to scalability.
>
> Should have more to discuss once I've dug deeper in...
> I for one would be interested in making this part of Lucene.
>
> Thanks again,
> Mark
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Oracle-Lucene integration (OJVMDirectory and Lucene Domain Index) - LONG

2007-09-13 Thread J. Delgado
I'm very happy to announce the partial rework and extension of LUCENE-724
(Oracle-Lucene Integration), primarily based on new requirements from
LendingClub.com, who commissioned the work to Marcelo Ochoa, the contributor
of the original patch (great job Marcelo!). As a contribution of
LendingClub.com to the Lucene community, we have posted the code on a public
CVS (SourceForge), as explained below.

Here at Lending Club (www.lendingclub.com) we have very specific needs
regarding the indexing of both structured and unstructured data, most of it
transactional in nature and sitting in our Oracle 10gR2 DB, with a highly
complex schema. Our "ranking" of loans in the inventory includes components
of exact, textual and hardcore mathematical calculations, including time,
amount and spatial constraints. This integration of Lucene into Oracle as a
Domain Index will now allow us to query this inventory in real time. Going
against the Lucene index, created on "synthetic documents" whose fields are
populated from diverse tables (a user data store), eliminates the need to
create very complex joins to link data from different tables at query time.
This, along with the support of the full Lucene query language, makes this a
great alternative to:

   1. Using Lucene outside the database, which requires "crawling" the
   data and storing the index outside the database, losing all the benefits
   of a fully transactional system and a secure environment.
   2. Using Oracle Text, which is very powerful but lacks the
   extensibility and flexibility that Lucene offers (for example, being able
   to query the index directly from the Java layer or to implement our own
   ranking algorithm), though to be completely fair some of this is
   addressed in the new Oracle DB 11g version.

If anyone is interested in learning more about how we are going to use this
within Lending Club, please drop me a line. BTW, please make sure you check
us out: "Lending Club (http://www.lendingclub.com/), the rapidly growing
people-to-people (P2P) lending service that launched as a Facebook
application in May 2007, today announced the public availability of its
services with the launch of LendingClub.com. Lending Club connects lenders
and borrowers based upon shared affinities, enabling them to bypass banks to
secure better interest rates on loans"... more about the announcement here:
http://www.sys-con.com/read/428678.htm. We have seen many entrepreneurs
applying for loans and being helped by regular people to build their
businesses with money obtained at very low interest.

OK, without further marketing stuff (sorry for that), here is the original
note sent to me by Marcelo that summarizes all the new cool functionalities:

OJVMDirectory, a Lucene integration running inside the Oracle JVM, is going
one step further.

This new release includes:

   - Synchronized with the latest Lucene 2.2.0 production release.
   - Replaced the in-memory storage based on a Vector implementation with
   direct BLOB IO, reducing memory usage for large indexes.
   - Support for user data stores: you are no longer limited to indexing
   one column at a time (a Data Cartridge API limitation on 10g); you can
   now index multiple columns of the base table together with columns of
   related, joined tables.
   - User data stores can be customized: by writing a simple Java class,
   users can control which columns are indexed, the padding used, or any
   other functionality prior to the document-adding step.
   - There is a DefaultUserDataStore which takes all the columns of the
   query and builds a Lucene Document with Fields representing each database
   column; these fields are automatically padded if they hold NUMBER data or
   rounded if they hold DATE data, for example.
   - The lcontains() SQL operator supports the full Lucene QueryParser
   syntax to provide access to all the indexed columns; see the examples
   below.
   - Support for the DOMAIN_INDEX_SORT and FIRST_ROWS hints: if you want
   rows ordered by the lscore() operator (ascending or descending), the
   optimizer hint will assume that the Lucene Domain Index returns rowids in
   the proper order, avoiding an inline view to sort them.
   - Automatic index synchronization by using AQ's callback.
   - Lucene Domain Index creates an extra table named IndexName$T and an
   Oracle AQ named IndexName$Q with its storage table IndexName$QT in the
   user's schema, so you can alter the storage preferences if you want.
   - The ojvm project is in SourceForge.net CVS, so anybody can get it and
   collaborate ;)
   - Tested against 10gR2 and 11g databases.


Some sample usages:

create table t2 (
 f4 number primary key,
 f5 VARCHAR2(200));
create table t1 (
 f1 number,
 f2 CLOB,
 f3 number,
 CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
 REFERENCES t2(f4) ON DELETE cascade);
create index it1 on t1(f3) indextype is lucene.LuceneIndex
 parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;ExtraCols:f2');

alter index it1
parameters('ExtraCols:f2,t2.f5;ExtraTabs:t2;WhereCondition:t1.f3=

Re: [jira] Commented: (LUCENE-724) Oracle JVM implementation for Lucene DataStore also a preliminary implementation for an Oracle Domain index using Lucene

2007-08-08 Thread J. Delgado
Michael, are you still working on this replacement of the BLOB I/O?

I'm looking into parameterizing the option of lazy syncs of DML
operations (via calls to LuceneDomainIndex.sync, potentially queued
using dbms_aq), which is convenient for bulk inserts, vs. real-time
syncs for non-bulked operations and transactional data retrieval.

-- Joaquin

2007/7/12, Michael Goddard (JIRA) <[EMAIL PROTECTED]>:
>
> [ 
> https://issues.apache.org/jira/browse/LUCENE-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512169
>  ]
>
> Michael Goddard commented on LUCENE-724:
> 
>
> Marcelo,
>
> Are you still working on this?  I have been experimenting with it recently -- 
> thank you for creating it.  Do you think that the I/O might be faster if the 
> Vector was replaced with BLOB I/O via InputStream, OutputStream directly?  
> That is what I am working with right now, and I did observe my indexing time 
> for a sample data set go from 22 seconds to 13 seconds.  I do currently have 
> the problem that the resulting index is not behaving correctly and am working 
> on that.
>
>
> > Oracle JVM implementation for Lucene DataStore also a preliminary 
> > implementation for an Oracle Domain index using Lucene
> > 
> >
> > Key: LUCENE-724
> > URL: https://issues.apache.org/jira/browse/LUCENE-724
> > Project: Lucene - Java
> >  Issue Type: New Feature
> >  Components: Store
> >Affects Versions: 2.0.0
> > Environment: Oracle 10g R2 with latest patchset, there is a txt 
> > file into the lib directory with the required libraries to compile this 
> > extension, which for legal issues I can't redistribute. All these libraries 
> > are include into the Oracle home directory,
> >Reporter: Marcelo F. Ochoa
> >Priority: Minor
> > Attachments: ojvm-01-09-07.tar.gz, ojvm-11-28-06.tar.gz, 
> > ojvm-12-20-06.tar.gz, ojvm.tar.gz
> >
> >
> > Here is a preliminary implementation of the Oracle JVM Directory data store,
> > which replaces the file system with BLOB data storage.
> > The reasons to do this are:
> >   - Using a traditional file system for storing the inverted index is not a
> > good option for some users.
> >   - Using BLOBs for storing the inverted index with Lucene running outside the
> > Oracle database has bad performance, because there are a lot of network
> > round trips and data marshalling.
> >   - Indexing relational data stores such as tables with VARCHAR2, CLOB or 
> > XMLType with Lucene running outside the database has the same problem as 
> > the previous point.
> >   - The JVM included inside the Oracle database can scale up to 10,000+
> > concurrent threads without memory leaks or deadlocks, and all the operations
> > on tables are in the same memory space!!
> >   With these points in mind, I uploaded the complete Lucene framework
> > inside the Oracle JVM and ran the complete JUnit test suite successfully,
> > except for some tests such as the RMI test, which requires special grants to
> > open ports inside the database.
> >   The Lucene test cases run faster inside the Oracle database (11g) than on
> > the Sun JDK 1.5, because the classes are automatically JITed after some
> > executions.
> >   I have implemented an OJVMDirectory Lucene Store which replaces the file
> > system storage with BLOB-based storage; compared with a RAMDirectory
> > implementation it is a bit slower, but we get all the benefits of the BLOB
> > storage (backup, concurrency control, and so on).
> >  The OJVMDirectory is cloned from the source at
> > http://issues.apache.org/jira/browse/LUCENE-150 (DBDirectory), but with some
> > changes to run faster inside the Oracle JVM.
> >  At this moment, I am working on a full integration with the SQL engine
> > using the Data Cartridge API; it means using Lucene as a new Oracle Domain
> > Index.
> >  With this extension we can create a Lucene inverted index on a table using:
> > create index it1 on t1(f2) indextype is LuceneIndex parameters('test');
> >  assuming that the table t1 has a column f2 of type VARCHAR2, CLOB or
> > XMLType. After this, a query against the Lucene inverted index can be
> > made using a new Oracle operator:
> >  the important point here is that this query is integrated with the 
> > execution plan of the Oracle database, so in this simple example the Oracle 
> > optimizer see that the column "f2" is indexed with the Lucene Domain index, 
> > then using the Data Cartridge API a Java code running inside the Oracle JVM 
> > is executed to open the search, a fetch all the ROWID that match with 
> > "Marcelo" and get the rows using the pointer,
> > here the output:
> > SELECT STATEMENT  ALL_ROWS  3

Re: for a better spellchecker

2007-07-06 Thread J. Delgado

Instead of "overriding" the trigram approach you may want to do a
combination. That is create trigrams out of the list of words from the
dictionary and weigh the matches much higher than those coming from the
index or even have a first dictionary exact lookup and then a trigram/index
based lookup if it fails.
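
A minimal sketch of that "dictionary first, trigrams second" fallback,
using the contrib SpellChecker as both the exact-lookup and the n-gram
stage (merging and re-weighting the two suggestion sources is left out):

import java.io.IOException;

import org.apache.lucene.search.spell.SpellChecker;

public class TwoStageSpelling {
    // Try an exact dictionary lookup first; only if the word is unknown,
    // fall back to the SpellChecker's n-gram similarity search.
    public static String[] suggest(SpellChecker dictionary, String word,
                                   int numSuggestions) throws IOException {
        if (dictionary.exist(word)) {
            return new String[] { word }; // known word: no correction needed
        }
        return dictionary.suggestSimilar(word, numSuggestions);
    }
}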

J.D.

2007/7/6, Mathieu Lecarme <[EMAIL PROTECTED]>:


Now, SpellChecker uses the trigram algorithm to find similar words. It
works well for keyboard fumbles, but not well enough for short words
and for languages like French, where the same sound can be written
differently.
Spellchecking is a classical computing task, and aspell provides some
nice and free (it's GNU) sound dictionaries. Lots of dictionaries are
available.
I wrote a Python parser which emits translation code in different
languages: Python, PHP and Java. A bit like the Snowball stuff.
A little more work is needed to generate Lucene-compliant code. But is
the Python generator good enough for Lucene, or must a translation be
done in Java to put it in the Lucene source?

I'll soon start a PhonemeSpellChecker which overrides the trigram
SpellChecker.

The next step is to implement a word cutter, just like Google Suggest.

Any suggestions?

M.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Various Ideas from ApacheCon

2007-05-10 Thread J. Delgado

The ever-growing presence of mingled structured and unstructured data is a
fact of life and of the modern systems we have to deal with. Clearly, the
tendency is that full-text indexing is moving towards DB functionality, i.e.
fields for projection/filtering, sorting, faceted queries, transactional
CRUD operations, etc. Though set manipulation is not Lucene's or Solr's
forte, the document-object model maps very well to rows of relational sets
or tables, even more so since CLOBs and TEXT fields were introduced.
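
To make that DB-like usage concrete, here is a minimal Lucene 2.x sketch
combining a full-text clause with a structured range filter and a field
sort; the field names and the zero-padding scheme are illustrative
assumptions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;

public class MixedQueryDemo {
    // Full-text match on "body", structured filter on a zero-padded
    // "price" field, results sorted by a "date" field (newest first).
    public static TopFieldDocs search(IndexSearcher searcher) throws Exception {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("body", "mortgage")),
              BooleanClause.Occur.MUST);
        RangeFilter price =
                new RangeFilter("price", "0000000100", "0000005000", true, true);
        Sort byDate = new Sort(new SortField("date", SortField.STRING, true));
        return searcher.search(q, price, 10, byDate);
    }
}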

On the other hand, relational databases with XML and OO extensions, as well
as native XML repositories, still have to deal with the problem of RANKING
unstructured text and combinations of text fragments and structured
conditions, thus no longer dealing just with a set/relational model that
yields binary answers, but extending their query languages to handle the
concepts of fuzziness, relevance, etc. (e.g. SQL/MM, XQuery Full-Text).

I would like once again to open this can of worms, and perhaps think out of
the box, without classifying DB and full-text as simply different, as we
analyze concepts to further understand the real path for the evolution of
Lucene/Solr.

Here is a very interesting attempt by Marcelo Ochoa to create a special
type of "index", called a Domain Index, to query unstructured data within
Oracle:
https://issues.apache.org/jira/browse/LUCENE-724

Other interesting articles:

XQuery 1.0 - Full-Text:
http://www.w3.org/TR/xquery-full-text/
SQL/MM Full-Text
http://www.wiscorp.com/2CD1R1-02-fulltext-2001-12.pdf

Discussions on *XML data model vs. relational model*
http://www.xml.com/cs/user/view/cs_msg/2645

http://www.w3.org/TR/xpath-datamodel/
http://en.wikipedia.org/wiki/Relational_model

2007/5/9, James liu <[EMAIL PROTECTED]>:


I think the top things lucene/solr should do are:
1: easier use and less code
2: distributed index and search
3: managing these index and search servers
4: test methods or tools

i don't agree

2007/5/8, Grant Ingersoll <[EMAIL PROTECTED]>: Yep, my advice always is use
a db for what a db is designed for (set manipulation) and use Lucene for
what it is good for

maybe fs+lucene/solr is better


--
regards
jl



Re: Progressive Query Relaxation

2007-04-09 Thread J. Delgado

The idea is to efficiently get the desired result set (top N) at once,
without having to re-run different queries inside the application
logic. Query relaxation avoids several round trips, and could
conceivably be offered with and without deduplication. Maybe this is a
feature required for Solr rather than for Lucene.

Question: even if Lucene's score is not absolute, does it somewhat
determine a partial order among the results of different queries?
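
For what it's worth, here is a minimal application-level sketch of the
relaxation loop I have in mind. The tier queries, the de-duplication by
Lucene doc id, and the tier-based ordering are my assumptions, not Oracle
Text's actual mechanism:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ProgressiveRelaxation {
    // Run progressively relaxed variants of a query (strictest first) and
    // collect de-duplicated doc ids until n results are gathered. Results
    // are ordered by tier rather than by a global score, which sidesteps
    // the "scores are not absolute" problem mentioned below.
    public static List<Integer> search(IndexSearcher searcher,
                                       Query[] tiers, int n) throws Exception {
        List<Integer> results = new ArrayList<Integer>(n);
        Set<Integer> seen = new HashSet<Integer>();
        for (int i = 0; i < tiers.length && results.size() < n; i++) {
            TopDocs top = searcher.search(tiers[i], null, n);
            for (ScoreDoc sd : top.scoreDocs) {
                if (results.size() >= n) break;
                if (seen.add(sd.doc)) results.add(sd.doc);
            }
        }
        return results;
    }
}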

J.D.

2007/4/9, Otis Gospodnetic <[EMAIL PROTECTED]>:

Not that I know of.  One typically puts that in application logic and re-runs or offers 
to run alternative queries.  No de-duping there, unless you do it in your app.  I think 
one problem with the described approach and Lucene would be that Lucene's scores are not 
"absolute".

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: J. Delgado <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org
Sent: Monday, April 9, 2007 3:46:40 AM
Subject: Progressive Query Relaxation

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Progressive Query Relaxation

2007-04-09 Thread J. Delgado

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: LSI, Latent Semantic Indexing

2007-01-29 Thread J. Delgado

It all depends on what you need it for. BTW, Latent Semantic Analysis
(LSA) is a superset of LSI. LSI concentrates on just how to index and
search documents in a reduced-dimensional (latent) space, whereas LSA
includes a range of possible analyses that can be done on
representations in this space. There are other, equivalent techniques
(e.g. probabilistic LSI) that can be much more efficient.

Perhaps the original requester could give us more information about
how he intends to use LSI. For example, is this for plain "concept"
search, or for document classification, clustering, automatic query
expansion/suggestion, link/topology analysis, or for something else?

J.D.



2007/1/29, Mario Alejandro M. <[EMAIL PROTECTED]>:

I have also researched the use of LSA.

My interest is simply to cluster the information. I found that LSA is one
way, but I'm not convinced it is the best (it is also very heavy in CPU
and RAM consumption).

--
Mario Alejandro Montoya
MCP
www.paradondevamos.com
The best restaurant and entertainment site in Colombia!




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado

This sounds very interesting... I'll definitely have a look into it.
However, I have the feeling that, like the use of Oracle Text, this
keeps separate the underlying data structures used for evaluating
full-text conditions and conditions over other data types, which brings
up other issues when trying to do full-blown mixed queries. Things get
worse when doing joins and other relational algebra operations.

I'm still wondering if the basic data structures should be revised to
achieve better performance...

-- Joaquin

2007/1/10, robert engels <[EMAIL PROTECTED]>:

There is a module in Lucene contrib that changes that! It loads
Lucene into the Oracle database (it has a JVM), and allows Lucene
syntax to perform full-text searching.

On Jan 10, 2007, at 2:37 PM, J. Delgado wrote:

> No, Oracle Text does not use Lucene. It has its own proprietary
> full-text engine. It represents documents, the inverted index and
> relationships in a DB schema and it depends heavily on the SQL layer.
> This has some severe limitations though...
>
> Of course, you can push structured data into full-text based indexes.
> We have seen how in Lucene we can represent some structured data types
> (e.g. dates, numbers) as fields and perform some type of mixed queries
> but the Lucene index, as some of you have pointed out, is not meant
> for this and does not scale like a DB would.
>
> I'm looking to hear new ideas people may have to solve this very
> hard problem.
>
> -- Joaquin
>
> 2007/1/10, robert engels <[EMAIL PROTECTED]>:
>> I think the contrib 'Oracle Full Text' does this (although in the
>> reverse).
>>
>> It uses Lucene for full text queries (embedded into the db), the
>> query analyzer works.
>>
>> It is really a great piece of software. Too bad it can't be done in a
>> standard way so that it would work with all DBs.
>>
>> I think it may be possible to embedded the Apache Derby to do
>> something like this, although this might be overkill. A simple b-tree
>> db might work best.
>>
>> It would be interesting if the documents could be stored in a btree,
>> and a GUID used to access them (since the lucene docid is constantly
>> changing). The only stored field in a lucene Document would be the
>> GUID.
>>
>> On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:
>>
>> > This is a more general question:
>> >
>> > Given the fact that most applications require querying a
>> combination
>> > of full-text and structured data has anyone looked into building
>> data
>> > structures at the most fundamental level  (e.g. combination of b-
>> tree
>> > and inverted lists) that would enable scalable and performant
>> > structured (e.g.SQL or XQuery) + Full-Text queries?
>> >
>> > Can Lucene be taken as basis for this or do you recommend exploring
>> > other routes?
>> >
>> > -- Joaquin
>> >
>> > 2007/1/10, Chris Hostetter <[EMAIL PROTECTED]>:
>> >>
>> >> : So you mean lucene can't do better than this ?
>> >>
>> >> robert's point is that based on what you've told us, there is no
>> >> reason to think Lucene makes sense for you -- if *all* you are doing
>> >> is finding documents based on numeric ranges, then a relational
>> >> database is better suited to your task.  if you actually care about
>> >> the textual IR features of Lucene, then there are probably ways to
>> >> make your searches faster, but you aren't giving us enough
>> >> information.
>> >>
>> >> you said the example code you gave was in a loop ... but a loop over
>> >> what? .. what changes with each iteration of the loop? ... if there
>> >> are RangeFilters that get reused more than once, CachingWrapperFilter
>> >> can come in handy to ensure that work isn't done more often than it
>> >> needs to be.
>> >>
>> >> it's also not clear whether your query on "type:0" is just a
>> >> placeholder, or indicative of what you actually want to do in the
>> >> long run ... if all of your queries are this simple, and all you care
>> >> about is getting a count of things that have type:0 and are in your
>> >> numeric ranges, then don't use the "search" method at all, just put
>> >> "type:0" in your ChainedFilter and
>> >> c

Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado

No, Oracle Text does not use Lucene. It has its own proprietary
full-text engine. It represents documents, the inverted index and
relationships in a DB schema and it depends heavily on the SQL layer.
This has some severe limitations though...

Of course, you can push structured data into full-text based indexes.
We have seen how in Lucene we can represent some structured data types
(e.g. dates, numbers) as fields and perform some types of mixed queries,
but the Lucene index, as some of you have pointed out, is not meant
for this and does not scale like a DB would.
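
As a reminder of the trick involved, here is a minimal sketch of the
padding/rounding encoding that makes range queries over numbers and dates
work on top of Lucene's lexicographic term order; the field names and
widths are arbitrary assumptions:

import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StructuredFields {
    // Encode structured values so that lexicographic term order matches
    // numeric/chronological order, making RangeFilter usable over them.
    public static Document toDoc(long price, Date modified, String body) {
        Document doc = new Document();
        // Zero-pad numbers to a fixed width: 42 -> "0000000042".
        String padded = String.format("%010d", price);
        doc.add(new Field("price", padded,
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        // Round dates to day resolution: fewer terms, faster range scans.
        String day = DateTools.dateToString(modified, DateTools.Resolution.DAY);
        doc.add(new Field("date", day,
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", body,
                Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}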

I'm looking to hear new ideas people may have to solve this very hard problem.

-- Joaquin

2007/1/10, robert engels <[EMAIL PROTECTED]>:

I think the contrib 'Oracle Full Text' does this (although in the
reverse).

It uses Lucene for full text queries (embedded into the db), the
query analyzer works.

It is really a great piece of software. Too bad it can't be done in a
standard way so that it would work with all DBs.

I think it may be possible to embedded the Apache Derby to do
something like this, although this might be overkill. A simple b-tree
db might work best.

It would be interesting if the documents could be stored in a btree,
and a GUID used to access them (since the lucene docid is constantly
changing). The only stored field in a lucene Document would be the GUID.

On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:

> This is a more general question:
>
> Given the fact that most applications require querying a combination
> of full-text and structured data has anyone looked into building data
> structures at the most fundamental level  (e.g. combination of b-tree
> and inverted lists) that would enable scalable and performant
> structured (e.g.SQL or XQuery) + Full-Text queries?
>
> Can Lucene be taken as basis for this or do you recommend exploring
> other routes?
>
> -- Joaquin
>
> 2007/1/10, Chris Hostetter <[EMAIL PROTECTED]>:
>>
>> : So you mean lucene can't do better than this ?
>>
>> robert's point is that based on what you've told us, there is no reason
>> to think Lucene makes sense for you -- if *all* you are doing is finding
>> documents based on numeric ranges, then a relational database is better
>> suited to your task.  if you actually care about the textual IR features
>> of Lucene, then there are probably ways to make your searches faster,
>> but you aren't giving us enough information.
>>
>> you said the example code you gave was in a loop ... but a loop over
>> what? .. what changes with each iteration of the loop? ... if there are
>> RangeFilters that get reused more than once, CachingWrapperFilter can
>> come in handy to ensure that work isn't done more often than it needs
>> to be.
>>
>> it's also not clear whether your query on "type:0" is just a
>> placeholder, or indicative of what you actually want to do in the long
>> run ... if all of your queries are this simple, and all you care about
>> is getting a count of things that have type:0 and are in your numeric
>> ranges, then don't use the "search" method at all, just put "type:0" in
>> your ChainedFilter and call the "bits" method directly.
>>
>> you also haven't given us any information about whether or not you are
>> opening a new IndexSearcher/IndexReader every time you execute a query,
>> or reusing the same instance -- reuse makes the performance much better
>> because it can reuse underlying resources.
>>
>> In short: if you state some performance numbers from timing some code,
>> and want to know how to make that code faster, you have to actually
>> show people *all* of the code for them to be able to help you.
>>
>>
>> : >>  I still have the search problem I had before, now search
>> takes around
>> : >> 750
>> : >> msecs for a small set of documents.
>> : >>
>> : >> [java] Total Query Processing time (msec) : 38745
>> : >> [java] Total No. of Documents : 7,500,000
>> : >> [java] Total No. of Executed queries : 50.0
>> : >> [java] Execution time per query : 774.9 msec
>> : >>
>> : >>  The index is optimized and its size is 830 MB.
>> : >>  Each document has the following terms :
>> : >> VSID(integer), data(float), type(short int) , precision
>> (byte).
>> : >>   The queries are generated in a loop similar to the one below :
>> : >> loop ...
>> : >

Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado

This is a more general question:

Given the fact that most applications require querying a combination
of full-text and structured data, has anyone looked into building data
structures at the most fundamental level (e.g. a combination of b-tree
and inverted lists) that would enable scalable and performant
structured (e.g. SQL or XQuery) + full-text queries?

Can Lucene be taken as basis for this or do you recommend exploring
other routes?

-- Joaquin

2007/1/10, Chris Hostetter <[EMAIL PROTECTED]>:


: So you mean lucene can't do better than this ?

robert's point is that based on what you've told us, there is no reason to
think Lucene makes sense for you -- if *all* you are doing is finding
documents based on numeric ranges, then a relational database is better
suited to your task.  if you actually care about the textual IR features
of Lucene, then there are probably ways to make your searches faster, but
you aren't giving us enough information.

you said the example code you gave was in a loop ... but a loop over what?
.. what changes with each iteration of the loop? ... if there are
RangeFilters that get reused more than once, CachingWrapperFilter can come
in handy to ensure that work isn't done more often than it needs to be.

it's also not clear whether your query on "type:0" is just a placeholder,
or indicative of what you actually want to do in the long run ... if all
of your queries are this simple, and all you care about is getting a count
of things that have type:0 and are in your numeric ranges, then don't use
the "search" method at all, just put "type:0" in your ChainedFilter and
call the "bits" method directly.

you also haven't given us any information about whether or not you are
opening a new IndexSearcher/IndexReader every time you execute a query, or
reusing the same instance -- reuse makes the performance much better
because it can reuse underlying resources.

In short: if you state some performance numbers from timing some code, and
want to know how to make that code faster, you have to actually show
people *all* of the code for them to be able to help you.
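
A minimal sketch of that bits-based counting on the Lucene 2.x API. The
QueryFilter wrapping of type:0 and the example range values follow the
code quoted below; ChainedFilter lives in contrib, and Filter.bits() was
deprecated in later Lucene versions:

import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.misc.ChainedFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.TermQuery;

public class CountWithFilters {
    // Count matches without scoring: build the chained filter, get its
    // bits directly, and count set bits -- no call to search() needed.
    public static int count(IndexReader reader) throws Exception {
        Filter type0 = new QueryFilter(new TermQuery(new Term("type", "0")));
        Filter data = new RangeFilter("data", "+5.4324324344",
                                      "+5.4324324344", true, true);
        ChainedFilter cf =
                new ChainedFilter(new Filter[] { type0, data }, ChainedFilter.AND);
        BitSet bits = cf.bits(reader);
        return bits.cardinality();
    }
}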


: >>  I still have the search problem I had before, now search takes around
: >> 750
: >> msecs for a small set of documents.
: >>
: >> [java] Total Query Processing time (msec) : 38745
: >> [java] Total No. of Documents : 7,500,000
: >> [java] Total No. of Executed queries : 50.0
: >> [java] Execution time per query : 774.9 msec
: >>
: >>  The index is optimized and its size is 830 MB.
: >>  Each document has the following terms :
: >> VSID(integer), data(float), type(short int) , precision (byte).
: >>   The queries are generated in a loop similar to the one below :
: >> loop ...
: >> RangeFilter rq1 = new
: >> RangeFilter("data","+5.4324324344","+5.4324324344",true,true);
: >> RangeFilter rq2 = new RangeFilter
: >> ("precision","+0001","+0002",true,true);
: >> ChainedFilter cf = new ChainedFilter(new
: >> Filter[]{rq2,rq1},ChainedFilter.AND);
: >> Query query = qp.parse("type:0");
: >> Hits hits = searcher.search(query,cf);
: >> end loop
: >>
: >>  I would like to know if there exists any solution to improve the search
: >> time ?  (I need to insert more than 500 million of these data pages into
: >> lucene)




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Job Opportunity (Sunnyvale, CA)

2007-01-09 Thread J. Delgado

(Sorry for the cross-posting)

This is a full-time position with an exciting New Venture (now in
stealth mode) and will be based out of Sunnyvale, CA.

We are looking for Java Developer with search, social networks and/or
payment processing related experience.

Required Skills:

2+ yrs of industrial experience with search technologies/engines like
Lucene/Nutch/Solr, Oracle, FAST, Endeca, etc., as well as XML and
relational database technologies, and/or with the development of
transactional payment systems (e.g. PayPal).

- Experience with classification, attribute matching and/or
collaborative filtering
- Some exposure to P2P technologies (transactions, communication and
social networks) is highly desirable.
- Understanding of ontologies/taxonomies, keyword libraries, and other
databases to assist search query interpretation and formulation.
- Prefer MS or Computer Science graduate with specialization in
Information Retrieval or Data Mining.
- Willing to train a junior candidate.
- Must be *hands-on*.
- Ability to work quickly and accurately in a high-volume work environment.
- Excellent analytical skills.
- Creativity, intelligence, and integrity.
- Strong work ethic and a high level of professionalism.
- Hands-on design and development skills in Java and J2EE technologies
- Experience in development of large scale Web Portals is a plus.

If interested, please send the resume with contact info and salary
expectations at the earliest.

Less experienced AJAX/Web 2.0 Java developers are also welcome to
submit their resumes.

Joaquin Delgado, PhD.
[EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]