Re: Lucene OSGi Bundle
On 19 Sept. 08, at 23:08, Gunnar Wagenknecht wrote:

Hi Lucene Developers, Issue 1344 requests to make the Lucene JAR an OSGi bundle. The approach proposed is to add the OSGi-specific metadata into the MANIFEST.MF of the existing Maven artifacts. I prepared a patch which proposes a different approach. Instead of adding the headers into the Maven JARs, it will create a new set of OSGi JARs. This is basically driven by the following advantages.

1. In OSGi the naming convention for a bundle JAR file is <symbolic name>_<version>.jar. The Maven JARs are not named that way. Therefore, it's not possible to use them out of the box, i.e. one has to download and manually rename them before they can be used. A separate set of OSGi bundles could be consumed directly without any modifications.

Even though it is strongly recommended just for sanity's sake, there is no direct relation between the name of the jar and its OSGi metadata. This hard binding only exists in an Eclipse environment.

2. Maven repositories cannot be consumed directly by OSGi frameworks. It's better to have the OSGi bundle jars in one folder which can be downloaded from mirrors and consumed by frameworks directly.

Actually, it is possible. A Maven plugin exists to manage an OSGi bundle with Maven: http://felix.apache.org/site/apache-felix-maven-osgi-plugin.html And there is a Maven repository which contains OSGi bundles: http://www.springsource.com/repository In fact, that repository is mainly a repository of jars, with different kinds of metadata around it so that different dependency management tools (Maven, Ivy, OBR) can use it.

3. In addition to the OSGi bundle JAR I was able to generate a source jar for Eclipse PDE. Thus, whenever you are developing with Eclipse for *any* OSGi framework, one would simply throw the Lucene OSGi bundle JARs together with the source bundles into the target platform. Eclipse PDE then configures the classpath automatically to attach the source code to the class files. This is very developer friendly.

This could also be done in the source jars in the Maven artifacts. So I think there is no hard requirement for a completely different build to get OSGi metadata into the Lucene jars. The question here is just about the naming convention of the jars. I don't have any objection to having a third distribution layout, but it will somewhat increase the work of the Lucene developers when releasing, because it adds some extra sign/deploy/check work. So if the Lucene developers are up for an OSGi jar naming convention, I think a good build would include the OSGi headers in the manifest of the actual jar (as the patch I provided does), plus an extra task to copy the Maven jars into an OSGi layout. Then there can be a debate on whether or not to use the bnd tool (used as an Ant task in Gunnar's patch, it tries to ease the maintenance of the manifest file). I didn't use it in my patch because the classpath of Lucene is so simple (no dependencies at all) that I thought it would be simpler for Lucene developers to maintain a MANIFEST.MF than a lucene.bnd. Here again, this can be a question of taste for the Lucene developers.

Nicolas
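For illustration, the OSGi metadata in question is just a handful of manifest headers; a minimal sketch for the core jar (the symbolic name, version, and package list here are assumptions for illustration, not what either patch actually generates):

    Manifest-Version: 1.0
    Bundle-ManifestVersion: 2
    Bundle-SymbolicName: org.apache.lucene
    Bundle-Name: Apache Lucene Java
    Bundle-Version: 2.4.0
    Export-Package: org.apache.lucene.analysis;version="2.4.0",
     org.apache.lucene.document;version="2.4.0",
     org.apache.lucene.index;version="2.4.0",
     org.apache.lucene.search;version="2.4.0"

With headers like these in place, the Eclipse-style file name referred to in point 1 would be org.apache.lucene_2.4.0.jar.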
[jira] Resolved: (LUCENE-1396) Improve PhraseQuery.toString()
[ https://issues.apache.org/jira/browse/LUCENE-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1396.

Resolution: Fixed
Fix Version/s: 2.9
Assignee: Michael McCandless

Committed revision 697469 (trunk) and 697470 (2.4). Thanks Andrzej!

Improve PhraseQuery.toString()

Key: LUCENE-1396
URL: https://issues.apache.org/jira/browse/LUCENE-1396
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.4, 2.9
Reporter: Andrzej Bialecki
Assignee: Michael McCandless
Fix For: 2.4, 2.9
Attachments: phraseQuery.patch

PhraseQuery.toString() is overly simplistic, in that it doesn't correctly show phrases with gaps or overlapping terms. This may be misleading when presenting phrase queries built using complex analyzers and filters.
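To see the kind of query the issue is about, consider a phrase built with an explicit position gap, as an analyzer that drops stop words might produce (a sketch; the field and terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // "new york" with a gap at position 1, e.g. from the input "new of york"
    // after stop-word removal:
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("body", "new"), 0);
    pq.add(new Term("body", "york"), 2); // position 2 leaves a gap at 1
    System.out.println(pq.toString("body"));

Per the issue description, before the patch the printed form gave no hint of the gap, nor of terms sharing a position.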
Re: Lucene OSGi Bundle
Nicolas Lalevée wrote:
the classpath of Lucene is so simple (no dependencies at all),

Unfortunately, it's not. Some exported packages are split across bundles. This makes it tough to manage. I wish it were as easy as using BND to simply generate the manifests for the existing jars. But it doesn't work without the BND descriptors to get the split packages and the version dependencies on exported/imported packages right.

BTW, the Maven OSGi plug-in is deprecated and has been replaced by this one, which is also based on BND: http://felix.apache.org/site/apache-felix-maven-bundle-plugin-bnd.html

-Gunnar

--
Gunnar Wagenknecht
[EMAIL PROTECTED]
http://wagenknecht.org/
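For reference, bnd expresses the split-package case with a per-clause directive in its descriptor; a hypothetical one-line fragment (the package named here is only an example of a split package, not taken from the actual patch):

    Export-Package: org.apache.lucene.analysis;-split-package:=merge-first;version="2.4.0"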
Re: Could positions/payloads in SegmentMerger be copied directly?
This part is indeed quite tricky... I'll try to take a stab at it.

Paul Elschot wrote:
On Friday 19 September 2008 17:05:29, Michael McCandless wrote:
Not quite, because how positions are encoded depends on whether any payload appeared in that segment. However, if 1) the input is a SegmentReader (since in general we can merge any IndexReader), and 2) its format is congruent with the format we are writing (i.e. both don't or do use the payloads format), which ought to be true the vast majority of the time, then I think we could simply copy bytes. Since the next TermInfo tells us the proxPointer where it begins, we know exactly how many bytes to copy. I think this'd be a nice optimization!

I tried to find a way to do this, but I'm stuck at the point where the proxPointer is needed from a TermInfo. I got this far (uncompiled code; smi is the SegmentMergeInfo that is currently merged):

if (smi.reader instanceof SegmentReader) {
  SegmentReader inputReader = (SegmentReader) smi.reader;
  boolean readerStorePayloads =
      inputReader.fieldInfos.fieldInfo(smi.term.field).storePayloads;
  if (storePayloads == readerStorePayloads) {
    // take the difference of the two prox pointers:
    int positionsLength = inputReader.tis. ... - ...;
    // do a direct byte copy from inputReader to proxOutput:
    ... ;
  }
}

but I could not find out how to get from the TermInfosReader at inputReader.tis to the next prox pointer. SegmentMerger never needs to index the positions by using a proxPointer itself, as it accesses all positions serially. This leaves me without an example of how to use proxPointer from a TermInfo. Any tips on how to continue?

Regards, Paul Elschot

Mike

Paul Elschot wrote:
I'm looking at the for loop in SegmentMerger.java at line 666, which completely interprets the input positions/payloads for an input term at a document. The positions/payloads don't change when they are merged, is that correct? I'm wondering whether this loop could be replaced by a direct copy from the input postings to proxOutput.

Regards, Paul Elschot
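Assuming the next term's prox pointer can be obtained somehow (the open question above), the copy itself could be a plain buffered transfer between the segment's prox streams using Lucene's IndexInput/IndexOutput; a sketch, where nextProxPointer, proxInput, and termInfo are assumed names, not existing accessors:

    long positionsLength = nextProxPointer - termInfo.proxPointer;
    byte[] buffer = new byte[4096];
    proxInput.seek(termInfo.proxPointer);
    while (positionsLength > 0) {
      int chunk = (int) Math.min(buffer.length, positionsLength);
      proxInput.readBytes(buffer, 0, chunk); // raw prox bytes, left uninterpreted
      proxOutput.writeBytes(buffer, chunk);  // written unchanged to the merged segment
      positionsLength -= chunk;
    }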
Re: Realtime Search for Social Networks Collaboration
Agreed, it's a system that is of value to a subset of cases.

On Sat, Sep 20, 2008 at 4:04 PM, Noble Paul നോബിള് नोब्ळ् [EMAIL PROTECTED] wrote:
Moving back to the RDBMS model will be a big step backwards, where we miss multivalued fields and arbitrary fields.

On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen [EMAIL PROTECTED] wrote:
Cool. I mention H2 because it does have some Lucene code in it, yes. Also, according to some benchmarks it's the fastest of the open source databases. I think it's possible to integrate realtime search for H2. I suppose there is no need to store the data in Lucene in this case? One loses the multiple values per field Lucene offers, and the schema becomes static. Perhaps it's a trade off?

On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado [EMAIL PROTECTED] wrote:
Yes, both Marcelo and I would be interested. We looked into H2 and it looks like something similar to Oracle's ODCI can be implemented. Plus, the primitive full-text implementation is based on Lucene. I say primitive because, looking at the code, I saw that one cannot define an Analyzer, and for each scan corresponding to a where clause a searcher is opened and closed instead of having a pool; plus it does not have any way to queue changes to reduce the use of the IndexWriter, etc. But it's open source and that is a great starting point!
-- Joaquin

On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen [EMAIL PROTECTED] wrote:
Perhaps an interesting project would be to integrate Ocean with H2 (www.h2database.com) to take advantage of both models. I'm not sure how exactly that would work, but it seems like it would not be too difficult. Perhaps this would solve being able to perform faster hierarchical queries, and perhaps other types of queries that Lucene is not capable of. Is this something, Joaquin, you are interested in collaborating on? I am definitely interested in it.

On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado [EMAIL PROTECTED] wrote:
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later.

Otis, what do you mean exactly by "adding real-time search to Lucene"? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition real-time: once you add/write a document to the index it becomes immediately searchable, and a logically deleted document is immediately no longer returned in searches, even though physical deletion happens during an index optimization.

Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making these documents available for search immediately after the transaction is committed sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non-real-time. For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, having a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services, which powered a broker-based distributed search interface.
The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk add transaction, and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb fresh new docs. We even had further partitioning of the indexes that reflected time periods, with caps on size, for them to be merged into older, more archive-like indexes which were used less (yes, the search engine's default search was on data no more than 1 month old, though the user could open the time window by including archives).

As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by the database communities for quite a long time. There have been recent efforts to efficiently integrate Lucene into relational databases (see the Lucene JVM Oracle integration: http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html).

I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all.

-- Joaquin
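The RAM-buffer-then-merge pattern described above can be approximated with stock Lucene; a minimal sketch against the 2.x API (the index path is illustrative and exception handling is omitted):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Buffer fresh documents in a small RAM index...
    RAMDirectory ram = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("body", "a fresh document", Field.Store.NO, Field.Index.TOKENIZED));
    ramWriter.addDocument(doc);
    ramWriter.close();

    // ...then periodically fold the RAM index into the large disk index.
    Directory disk = FSDirectory.getDirectory("/path/to/index");
    IndexWriter diskWriter = new IndexWriter(disk, new StandardAnalyzer(), false);
    diskWriter.addIndexes(new Directory[] { ram });
    diskWriter.close();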
[jira] Commented: (LUCENE-1385) IndexReader.isIndexCurrent()==false - IndexReader.reopen() - still index not current
[ https://issues.apache.org/jira/browse/LUCENE-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633083#action_12633083 ]

Michael McCandless... sorry, Uwe Schindler commented on LUCENE-1385:

By the way: the index was optimized after the change by the other process modifying the index; maybe this is the problem.

IndexReader.isIndexCurrent()==false - IndexReader.reopen() - still index not current

Key: LUCENE-1385
URL: https://issues.apache.org/jira/browse/LUCENE-1385
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.3.2
Environment: Linux, Solaris, Windows XP
Reporter: Uwe Schindler
Attachments: LUCENE-1385.patch

I found a strange error occurring with IndexReader.reopen. It is not always reproducible; it only happens sometimes, but strangely on all my computers with different platforms at the same time. Maybe it has something to do with the timestamp used in index versions. I have a search server using an IndexReader that is opened at webapp startup and should stay open. Every half an hour this web application checks if the index is still current using IndexReader.isCurrent(). When a parallel indexing job (in another virtual machine) modifies the index, isCurrent() returns FALSE. The half-hourly cron job then uses IndexReader.reopen() to reopen the index. But sometimes, directly after reopen(), the index is still not current (and no updates occur). Calling reopen again does not change it either. Searching on the index shows all new/updated documents, but isCurrent() still returns false. The problem with this is that the index is now reopened all the time, because the detection of a current index no longer works. I now have a workaround in my code to handle this: after calling IndexReader.reopen(), I test IndexReader.isCurrent(), and if it is not current, I close the reader hard and open a new instance. Most times IndexReader.reopen works correctly, but sometimes this error occurs. Looking into the code of reopen(), I realized that there is an extra check for whether the index has modifications, and if so the reopen call returns the original reader (this may be the problem I have). But the IndexReader is only used for searching; no updates occur. My questions: Why is there this check for modifications in reopen()? Why does this happen only at certain times, on all my servers with different platforms? I want to use reopen because in the future, when the new FieldCache is reopen-aware and does not rebuild the full cache every time, it will be very important to have this fixed. At the moment, I have no problem with the case that reopen may fail and I have to do a rough reopen.
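The workaround described amounts to something like the following (a sketch; the reader and directory fields are assumed):

    IndexReader newReader = reader.reopen();
    if (newReader == reader && !newReader.isCurrent()) {
      // reopen() handed back the same instance even though the index
      // version changed on disk: fall back to a hard close/open.
      reader.close();
      newReader = IndexReader.open(directory);
    }
    reader = newReader;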
Re: Multi Field search without Multifieldqueryparser
Now I can have two types of queries:
Structured query: name: George Bush AND Occupation: President

please don't remind us!

Try asking this question on the [EMAIL PROTECTED] list; that list is for usage-related questions.

ryan
Re: Multi Field search without Multifieldqueryparser
Hi Ryan,

Apparently it's not, because as far as I know Lucene doesn't support this function and I am planning to develop it.

Anshul

On Sun, Sep 21, 2008 at 8:04 PM, Ryan McKinley [EMAIL PROTECTED] wrote:
Now I can have two types of queries:
Structured query: name: George Bush AND Occupation: President
please don't remind us!
Try asking this question on the [EMAIL PROTECTED] list; that list is for usage-related questions.
ryan

--
Anshul Jain
[jira] Commented: (LUCENE-1385) IndexReader.isIndexCurrent()==false - IndexReader.reopen() - still index not current
[ https://issues.apache.org/jira/browse/LUCENE-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633089#action_12633089 ]

Michael McCandless commented on LUCENE-1385:

OK, I think I found the bug. From those prints above I can see your current IndexReader was opened when the index had a single segment (so, it's a SegmentReader). And the changed index also has a single segment by the same name... so we call SegmentReader.reopenSegment to do the reopening, which has logic to return itself if it detects no changes (to norms or deletions). You are somehow hitting that logic. The bug seems to boil down to: somehow, IndexWriter is writing a new segments_N file for a single-segment index even though no actual changes were made to the segment. The bug is rather harmless: the reopen call does no real work (it just returns your current IndexReader instance), and it's doing that because there were in fact no actual changes to the index; just, somehow, a new segments_N file was written. I found one case where IndexWriter can do this, which is if you open the writer, call deleteDocuments where no docs actually match the Term, then close the writer. Is it possible that your indexing job wakes up and only makes calls to deleteDocuments, yet no documents match the deleted terms? If not... can you capture the details of exactly what your indexing job did just before you hit the reopen failure? It could be another no-op action in IndexWriter that then writes a segments_N file.
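The no-op case Mike describes would look roughly like this (a hypothetical reproduction against the 2.3.x API; dir is the index Directory, and the field/term values are made up):

    IndexReader reader = IndexReader.open(dir);

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    writer.deleteDocuments(new Term("id", "no-such-value")); // matches nothing
    writer.close(); // still writes a new segments_N

    // The reader now reports the index as changed, but reopen() sees no
    // real difference in the single segment and returns the same instance:
    IndexReader reopened = reader.reopen();
    // reopened == reader, yet reopened.isCurrent() remains false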
[jira] Commented: (LUCENE-1385) IndexReader.isIndexCurrent()==false - IndexReader.reopen() - still index not current
[ https://issues.apache.org/jira/browse/LUCENE-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633091#action_12633091 ]

Michael McCandless commented on LUCENE-1385:

I have a test case that shows the above failure. But on 2.4 it does not fail -- the bug was already fixed as a byproduct of LUCENE-1194 (adding delete-by-query to IndexWriter).
Re: 2.4 release candidate 1
OK, so I wrote yet another way to do the signing, in Python (which I'll happily find any excuse to use ;) -- it prompts for your passphrase and then recurses through the dist directory looking for artifacts to sign:

import sys
import os
import subprocess
import getpass

def signFile(pwd, fileName):
    print '\nSIGN %s' % fileName
    command = 'gpg --passphrase-fd 0 --batch --armor --detach-sig %s' % fileName
    print '  command %s' % command
    ascFileName = fileName + '.asc'
    if os.path.exists(ascFileName):
        os.remove(ascFileName)
    p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE)
    p.stdin.write(pwd)
    p.stdin.close()
    result = p.wait()
    if result != 0:
        raise RuntimeError('command failed: exit code %s' % result)

def isArtifact(fileName):
    for suffix in ('.tar.gz', '.jar', '.zip', '.pom'):
        if fileName.endswith(suffix):
            return True
    else:
        return False

def main(argv):
    if len(argv) != 2:
        print '\nUsage: python %s distRootDirName\n' % argv[0]
        return
    pwd = getpass.unix_getpass(prompt='\nPlease enter your GPG private key passphrase:')
    for dirPath, dirNames, fileNames in os.walk(argv[1]):
        for fileName in fileNames:
            if isArtifact(fileName):
                signFile(pwd, os.path.join(dirPath, fileName))

if __name__ == '__main__':
    main(sys.argv)

Mike

Nicolas Lalevée wrote:
On 19 Sept. 08, at 15:21, Grant Ingersoll wrote:
FWIW, here's a simple bash function to do it too:

function sign-artifacts() {
    gpg --armor --output $1-$2.pom.asc --detach-sig $1-$2.pom
    if [ -f $1-$2-javadoc.jar ]; then
        gpg --armor --output $1-$2-javadoc.jar.asc --detach-sig $1-$2-javadoc.jar
    fi
    if [ -f $1-$2-sources.jar ]; then
        gpg --armor --output $1-$2-sources.jar.asc --detach-sig $1-$2-sources.jar
    fi
    if [ -f $1-$2.jar ]; then
        gpg --armor --output $1-$2.jar.asc --detach-sig $1-$2.jar
    fi
}

I call it as sign-artifacts <artifact id> <version number>, e.g. sign-artifacts solr-common 1.3.0. I suppose it could be put into a loop that recurses through sub-dirs.

You might also be interested in the read function, which avoids entering the passphrase for every artifact: https://svn.apache.org/repos/asf/ant/ivy/ivyde/trunk/signArtifacts.sh

Nicolas

-Grant

On Sep 18, 2008, at 7:16 PM, Michael McCandless wrote:
Yeah, I was afraid of this :) I'll look at SOLR-776. Thanks for the pointer!
Mike

Grant Ingersoll wrote:
FYI, Mike, you might be interested in https://issues.apache.org/jira/browse/SOLR-776 for signing the Maven artifacts (what a PITA). I know Michael B. has a batch script, but this does it in an Ant-friendly way and is available for all RMs.
Cheers, Grant

On Sep 18, 2008, at 2:29 PM, Michael McCandless wrote:
Hi, I just created the first release candidate for 2.4, here: http://people.apache.org/~mikemccand/staging-area/lucene2.4rc1 Please download the release candidate, kick the tires, and report back on any issues you encounter. The plan is to make only serious bug fixes, or build/doc fixes, to 2.4 for ~10 days, after which, if there are no blockers, I'll call a vote for the actual release. Happy testing, and thanks!
Mike
[jira] Commented: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633102#action_12633102 ]

Karl Wettin commented on LUCENE-1387:

bq. I'm struggling to get two of the existing tests to pass...

I don't think it is from my modifications, since they don't pass on the original either. On my box the test fails with different results due to the writer not being committed in setUp, giving me 0 results. After adding a commit it fails with the results you are reporting here. Is it possible that you are getting one sort of result in the original due to the non-committed writer, and another error in this version due to your changes to the distance measurement? All points in the list are rather close to each other, so very small changes to the algorithm might be the problem. I have a hard time tracing the code, and I'm sort of hoping this might be the problem.

Add LocalLucene

Key: LUCENE-1387
URL: https://issues.apache.org/jira/browse/LUCENE-1387
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Grant Ingersoll
Priority: Minor
Attachments: spatial.zip

Local Lucene (Geo-search) has been donated to the Lucene project, per https://issues.apache.org/jira/browse/INCUBATOR-77. This issue is to handle the Lucene portion of integration. See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene
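The missing commit Karl mentions would be a one-line addition to the test's setUp; a sketch (the field names and the addTestPoints helper are assumptions, not the actual LocalLucene test code):

    protected void setUp() throws Exception {
      super.setUp();
      directory = new RAMDirectory();
      writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
      addTestPoints(writer); // hypothetical helper that indexes the sample points
      writer.commit();       // without this, a reader opened later sees 0 documents
    }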
Re: Realtime Search for Social Networks Collaboration
On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള് नोब्ळ् [EMAIL PROTECTED] wrote:
Moving back to the RDBMS model will be a big step backwards, where we miss multivalued fields and arbitrary fields.

No one is suggesting to lose any of the virtues of the field-based indexing that Lucene provides. Quite the contrary: by extending the RDBMS model with Lucene-based indexes one can map relational rows to documents and columns to fields. Note that one relational field can be mapped to one or more text-based fields, and multi-valued fields will still be allowed. Please check the Lucene OJVM implementation for details on the implementation and philosophy of the RDBMS-Lucene converged model: http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

More discussion is at Marcelo's blog; he will be presenting at Oracle World 2008 this week: http://marceloochoa.blogspot.com/

BTW, it just so happens that this was implemented using Oracle, but a similar implementation in H2 seems not only feasible but desirable.

-- Joaquin
Re: Realtime Search for Social Networks Collaboration
Sorry, I meant loose (replacing lose).

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado [EMAIL PROTECTED] wrote:
No one is suggesting to lose any of the virtues of the field-based indexing that Lucene provides. [...]