RE: Does lucene support distributed indexing?

2008-04-28 Thread Fang_Li
Solr does not do distributed indexing, only index replication; all copies are 
identical.
Lucene has some built-in support for distributed search; take a look at 
RemoteSearchable. For indexing, you can put a load balancer in front of several 
independent indexers in a naïve way.
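As a rough illustration of both points, here is a minimal sketch against the
Lucene 2.x RMI classes. The index paths, RMI names, and the class name
DistributedSearchSketch are purely illustrative, and error handling and RMI
registry setup are omitted.

    import java.rmi.Naming;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.RemoteSearchable;
    import org.apache.lucene.search.Searchable;

    public class DistributedSearchSketch {

      // On each search node: expose a local index over RMI.
      public static void exportShard(String indexDir, String rmiName) throws Exception {
        Searchable local = new IndexSearcher(indexDir);
        RemoteSearchable remote = new RemoteSearchable(local);
        // e.g. rmiName = "//shard1.example.com/ShardSearchable" (illustrative)
        Naming.rebind(rmiName, remote);
      }

      // On the query node: combine the remote shards into one Searchable.
      public static Searchable combineShards(String[] rmiNames) throws Exception {
        Searchable[] shards = new Searchable[rmiNames.length];
        for (int i = 0; i < rmiNames.length; i++) {
          shards[i] = (Searchable) Naming.lookup(rmiNames[i]);
        }
        return new MultiSearcher(shards);
      }
    }

For the indexing side, the naïve approach is simply to route each incoming
document to one of the shard indexers (round-robin, or by hashing a document
id) before it ever reaches Lucene; Lucene itself does not coordinate this.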

Regards,

-Original Message-
From: Samuel Guo [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 27, 2008 4:22 PM
To: java-user@lucene.apache.org
Subject: Re: Does lucene support distributed indexing?

Thanks a lot :)

2008/4/26 Grant Ingersoll <[EMAIL PROTECTED]>:

>
> On Apr 26, 2008, at 2:33 AM, Samuel Guo wrote:
>
>  Hi all,
> >
> > I am a Lucene newbie :)
> >
> > It seems that Lucene doesn't support distributed indexing :(
> > As some IR research papers mention, when the document collection becomes
> > large, the index becomes large as well. When a single machine can't hold
> > the whole index, strategies such as partitioning the collection into
> > several smaller sub-collections are used. Depending on how the partition
> > is done, we get different strategies: document-partitioning and
> > term-partitioning. I don't know why Lucene doesn't support these
> > approaches :( Can anyone explain it?
> >
>
> Because no one has donated the code to do it.  You can do distributed
> indexing via Nutch, and there is some (albeit non-fault-tolerant) distributed
> search in Lucene.  Solr also now has distributed search.
>
> -Grant
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: TrecDocMaker

2008-04-28 Thread Grant Ingersoll
Yeah, these classes are a bit weird in that they are configured via  
properties, not setters.  They are really designed to run inside the  
benchmarker, and not much attention was paid to using them elsewhere.


However, you can co-opt them for what you are doing, with something like:
TrecDocMaker docMaker = new TrecDocMaker();
Properties properties = new Properties();

properties.setProperty("doc.maker.forever", "false");
...
docMaker.setConfig(new Config(properties));

(Note: I was using the EnWikiDocMaker in the example above, but it should  
work for Trec, too.)


I often also do something like:

while ((doc = docMaker.makeDocument()) != null && i < numDocs) {
...

where numDocs is the max. docs I want.
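
Putting those pieces together, a standalone sketch could look like the
following. It assumes the Lucene 2.3-era benchmark classes (TrecDocMaker and
Config from the byTask packages); the "docs.dir" property, the numDocs cap,
and the NoMoreDataException guard are assumptions for illustration, and only
"doc.maker.forever" comes from the snippet above.

    import java.util.Properties;

    import org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException;
    import org.apache.lucene.benchmark.byTask.feeds.TrecDocMaker;
    import org.apache.lucene.benchmark.byTask.utils.Config;
    import org.apache.lucene.document.Document;

    public class TrecFeedSketch {
      public static void main(String[] args) throws Exception {
        TrecDocMaker docMaker = new TrecDocMaker();

        Properties properties = new Properties();
        properties.setProperty("doc.maker.forever", "false");
        // "docs.dir" is an assumption: point it at the directory holding the TREC files.
        properties.setProperty("docs.dir", args[0]);
        docMaker.setConfig(new Config(properties));

        int numDocs = 1000;   // cap on how many documents to pull
        int i = 0;
        Document doc;
        try {
          while ((doc = docMaker.makeDocument()) != null && i < numDocs) {
            // index or otherwise consume the document here
            i++;
          }
        } catch (NoMoreDataException e) {
          // some versions signal exhaustion this way instead of returning null
        }
      }
    }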


HTH,
Grant

On Apr 27, 2008, at 2:31 PM, DanaWhite wrote:



Greetings,

I am trying to use TrecDocMaker so I can index and evaluate Lucene on a TREC
collection.

It seems like I would just repeatedly call makeDocument() until all the
Documents have been created, but makeDocument appears to just read forever.
In general TrecDocMaker seems like an odd class and I just can't figure out
how to use it right.  I have been changing the class so it works with an
uncompressed collection, and trying to modify it so makeDocument doesn't read
endlessly, but no matter what I have done it just causes a different error.
Clearly I am trying too hard.

In short, what I want to know is how I am supposed to use TrecDocMaker to
parse my collection, because the current Lucene implementation doesn't seem
to work right, or I am using it wrong.

Thanks
Dana
--
View this message in context: 
http://www.nabble.com/TrecDocMaker-tp16926877p16926877.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



search performance & caching

2008-04-28 Thread Beard, Brian

I'm using Lucene 2.2.0 and have two questions:

1) Should search times be linear with respect to the number of concurrent
queries hitting a single searcher? I've run multiple search threads against a
single searcher, and the search times are very linear - 10x slower for 10
threads vs. 1 thread, etc. I'm using a parallel multi-searcher with a custom
hit collector.

2) I'm performing some field caching during search warmup. For an index
of 3.4 million docs and 7GB, it takes up to 30 minutes to execute
the code snippet below. Most of this time is spent in the
multiReader.document call (where it says "THIS TAKES THE MOST TIME").

I want to know if anyone has any ideas for speeding this up. There are
multiple documents containing the same recordId. I want to figure out
which two documents with the same recordId also have a documentName of
CORE or WL.
Then for each document in the index I store three pieces of information:
- its associated recordId
- the CORE doc number for this recordId
- the WL doc number for this recordId

Ideally, since the multiReader.document call is taking the most time,
I'd like to avoid it altogether, although I can't figure out how to
get around needing to read in the recordId.

What I really need is something like a two-dimensional TermEnum I could
iterate over - for the recordId and documentName fields.
Any ideas are appreciated.

// Now loop through all documents in the indexes and set the cache values.
TermDocs termDocs = multiReader.termDocs();
TermEnum termEnum = multiReader.terms(new Term("RECORD_ID", ""));
try {
    FieldSelector fieldSelector = getFieldSelector();
    List<Integer> docList = new ArrayList<Integer>();
    int regularCoreDocId = -1;
    int wlCoreDocId = -1;
    int docId = -1;
    Document document = null;
    String documentName = null;

    // Loop through each RECORD_ID term with the TermEnum.
    do {
        docList.clear();
        regularCoreDocId = -1;
        wlCoreDocId = -1;

        Term term = termEnum.term();
        if (term == null || !"RECORD_ID".equals(term.field())) {
            break;
        }
        String recordId = term.text();

        // Now loop through all documents with the same recordId
        // using the TermDocs.
        termDocs.seek(termEnum);
        while (termDocs.next()) {
            docId = termDocs.doc();
            docList.add(Integer.valueOf(docId));
            // THIS TAKES THE MOST TIME
            document = multiReader.document(docId, fieldSelector);
            documentName = document.get("DOCUMENT_NAME");
            if ("CORE".equals(documentName)) {
                regularCoreDocId = docId;
            } else if ("WL".equals(documentName)) {
                wlCoreDocId = docId;
            }
        }

        // Map all docIds associated with this recordId.
        for (Integer i : docList) {
            doc2RecordId[i] = recordId;
        }

        // Map from the docId to the coreData docId for
        // regular core and WL core documents.
        for (Integer i : docList) {
            doc2RegularCoreDoc[i] = regularCoreDocId;
            doc2WlCoreDoc[i] = wlCoreDocId;
        }
    } while (termEnum.next());
} finally {
    termDocs.close();
    termEnum.close();
}
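
One way to drop the multiReader.document call entirely, sketched below purely
as an idea rather than tested code, is to pre-compute a docId-to-documentName
lookup by walking the postings of the DOCUMENT_NAME field instead of loading
stored fields. This assumes DOCUMENT_NAME is indexed (not only stored) and
takes only a handful of values such as CORE and WL; the docNames array is an
illustrative name, and the fragment reuses the multiReader from the snippet
above.

    // Classify every doc as CORE, WL, or other by reading postings.
    // The RECORD_ID loop above can then consult docNames[docId]
    // instead of calling multiReader.document(...).
    byte[] docNames = new byte[multiReader.maxDoc()];   // 0 = other, 1 = CORE, 2 = WL

    TermDocs nameDocs = multiReader.termDocs(new Term("DOCUMENT_NAME", "CORE"));
    while (nameDocs.next()) {
        docNames[nameDocs.doc()] = 1;
    }
    nameDocs.seek(new Term("DOCUMENT_NAME", "WL"));
    while (nameDocs.next()) {
        docNames[nameDocs.doc()] = 2;
    }
    nameDocs.close();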


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Does lucene support distributed indexing?

2008-04-28 Thread Stu Hood
Solr does not do distributed indexing, but the development version _does_ do 
distributed search, in addition to replication. Currently, you can manually 
shard your data across a set of Solr instances, and then query them all by 
adding a 'shards=localhost:8080/solr_1,localhost:8080/solr_2' parameter.

See https://issues.apache.org/jira/browse/SOLR-303
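
For example, with SolrJ the same request might look roughly like the sketch
below; the aggregator URL, the shard addresses, and the query string are all
placeholders, and the exact parameter handling depends on which SOLR-303
version you are running.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardedQuerySketch {
      public static void main(String[] args) throws Exception {
        // Any one of the shards can act as the aggregator for the request.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8080/solr_1");

        SolrQuery query = new SolrQuery("title:lucene");
        // Tell the aggregator which shards to fan the request out to.
        query.set("shards", "localhost:8080/solr_1,localhost:8080/solr_2");

        QueryResponse response = server.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
      }
    }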

Thanks,
Stu


-Original Message-
From: [EMAIL PROTECTED]
Sent: Monday, April 28, 2008 5:04am
To: java-user@lucene.apache.org
Subject: RE: Does lucene support distributed indexing?

Solr does not do distributed indexing, only index replication; all copies are 
identical.
Lucene has some built-in support for distributed search; take a look at 
RemoteSearchable. For indexing, you can put a load balancer in front of several 
independent indexers in a naïve way.

Regards,

-Original Message-
From: Samuel Guo [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 27, 2008 4:22 PM
To: java-user@lucene.apache.org
Subject: Re: Does lucene support distributed indexing?

Thanks a lot :)

2008/4/26 Grant Ingersoll <[EMAIL PROTECTED]>:

>
> On Apr 26, 2008, at 2:33 AM, Samuel Guo wrote:
>
>  Hi all,
> >
> > I am a Lucene newbie :)
> >
> > It seems that Lucene doesn't support distributed indexing :(
> > As some IR research papers mention, when the document collection becomes
> > large, the index becomes large as well. When a single machine can't hold
> > the whole index, strategies such as partitioning the collection into
> > several smaller sub-collections are used. Depending on how the partition
> > is done, we get different strategies: document-partitioning and
> > term-partitioning. I don't know why Lucene doesn't support these
> > approaches :( Can anyone explain it?
> >
>
> Because no one has donated the code to do it.  You can do distributed
> indexing via Nutch, and there is some (albeit non-fault-tolerant) distributed
> search in Lucene.  Solr also now has distributed search.
>
> -Grant
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does lucene support distributed indexing?

2008-04-28 Thread Chris Hostetter

: There are actually several distributed indexing or searching projects in 
: Lucene (the top-level ASF Lucene project, not Lucene Java), and it's 
: time to start thinking about the possibility of bringing them together, 
: finding commonalities, etc.

I would actually argue that almost all of the examples you listed describe 
"distributed searching", i.e. querying multiple shards.

As far as I know, none of them address the "distributed indexing" aspect: 
throw some raw data at the system and trust that it will be indexed by one 
(or more) shard(s) in a way that "evenly" distributes the indexing "load".



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does lucene support distributed indexing?

2008-04-28 Thread Otis Gospodnetic
That's right - most of them are about distributed searching (hence my notes 
about sharding being up to the app).  Hadoop's contrib/index is about 
distributed indexing:

"This contrib package provides a utility to build or update an index
using Map/Reduce.

A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance."

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Chris Hostetter <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, April 28, 2008 7:53:43 PM
> Subject: Re: Does lucene support distributed indexing?
> 
> 
> : There are actually several distributed indexing or searching projects in 
> : Lucene (the top-level ASF Lucene project, not Lucene Java), and it's 
> : time to start thinking about the possibility of bringing them together, 
> : finding commonalities, etc.
> 
> I would actually argue that almost all of the examples you listed describe 
> "distributed searching", i.e. querying multiple shards.
> 
> As far as I know, none of them address the "distributed indexing" aspect: 
> throw some raw data at the system and trust that it will be indexed by one 
> (or more) shard(s) in a way that "evenly" distributes the indexing "load".
> 
> 
> 
> -Hoss
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does lucene support distributed indexing?

2008-04-28 Thread Vaijanath N. Rao

Hi all,

How about adding Hadoop support for distributed indexing? If required, I can 
start working on this, if Hadoop is a feasible option.


Also, what other techniques can one think of for doing distributed indexing? 
Currently I am planning on extending SolrJ to keep a map of where each 
document has gone, to get distributed indexing.
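
As one possible shape for that idea, here is a minimal sketch using SolrJ's 
CommonsHttpSolrServer: documents are routed to a shard by hashing an id, and 
the shard choice is remembered in a map. The shard URLs, the id parameter, 
and the hash-based routing rule are all assumptions for illustration, not an 
existing implementation.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class NaiveShardedIndexer {
      private final SolrServer[] shards;
      private final Map<String, Integer> docToShard = new HashMap<String, Integer>();

      public NaiveShardedIndexer(String[] shardUrls) throws Exception {
        shards = new SolrServer[shardUrls.length];
        for (int i = 0; i < shardUrls.length; i++) {
          shards[i] = new CommonsHttpSolrServer(shardUrls[i]);
        }
      }

      /** Route the document to a shard by hashing its id and remember where it went. */
      public void index(String id, SolrInputDocument doc) throws Exception {
        int shard = (id.hashCode() & Integer.MAX_VALUE) % shards.length;
        docToShard.put(id, shard);
        shards[shard].add(doc);
      }

      /** Commit on every shard once a batch has been routed. */
      public void commitAll() throws Exception {
        for (SolrServer shard : shards) {
          shard.commit();
        }
      }
    }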


--Thanks and Regards
Vaijanath


Otis Gospodnetic wrote:

That's right - most of them are about distributed searching (hence my notes 
about sharding being up to the app).  Hadoop's contrib/index is about 
distributed indexing:

"This contrib package provides a utility to build or update an index
using Map/Reduce.

A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance."

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
  

From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, April 28, 2008 7:53:43 PM
Subject: Re: Does lucene support distributed indexing?


: There are actually several distributed indexing or searching projects in 
: Lucene (the top-level ASF Lucene project, not Lucene Java), and it's 
: time to start thinking about the possibility of bringing them together, 
: finding commonalities, etc.


I would actually argue that almost all of the examples you listed describe 
"distributed searching", i.e. querying multiple shards.


As far as I know, none of them address the "distributed indexing" aspect: 
throw some raw data at the system and trust that it will be indexed by one 
(or more) shard(s) in a way that "evenly" distributes the indexing "load".




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]