Re: Wikipedia Index

Michael McCandless Tue, 19 Jun 2012 16:49:13 -0700

I have the index locally ... but it's really impractical to send it,
especially if you already have the source text locally.


Maybe index directly from the source text instead of via a database?
Lucene's benchmark contrib/module has code to decode the XML into
documents...
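
A minimal sketch of the "index directly from the source text" idea, assuming an uncompressed MediaWiki XML export with <page>, <title> and <text> elements. This is not the benchmark module's code; WikiDumpReader, PageHandler and readPages are hypothetical names used only for illustration, and the handler is where a Lucene Document would be built and passed to IndexWriter.addDocument.

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams pages out of a Wikipedia XML export using the JDK's StAX parser.
final class WikiDumpReader {

    // Hypothetical callback; an indexer would create and add a Lucene Document here.
    interface PageHandler {
        void handle(String title, String text);
    }

    // Reads every <page> that has both a <title> and a <text>, and returns how many were handled.
    static int readPages(Reader in, PageHandler handler) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String title = null;
        String text = null;
        int count = 0;
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("title")) {
                        title = xml.getElementText();   // text-only element
                    } else if (name.equals("text")) {
                        text = xml.getElementText();    // article body (wiki markup)
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("page")) {
                    if (title != null && text != null) {
                        handler.handle(title, text);
                        count++;
                    }
                    title = null;   // reset state for the next page
                    text = null;
                }
            }
        } finally {
            xml.close();
        }
        return count;
    }

    // Usage: java WikiDumpReader /path/to/export.xml (the path is supplied by the caller)
    public static void main(String[] args) throws Exception {
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            int n = readPages(in, (title, text) -> System.out.println(title));
            System.out.println(n + " pages read");
        }
    }
}
```

Because the parser streams, memory use stays flat regardless of dump size, and timing this loop with a do-nothing handler gives a rough baseline for how fast the source can be read before any indexing cost is added.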

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 6:27 PM, Elshaimaa Ali
<[email protected]> wrote:
>
> Thanks, Mike, for the prompt reply. Do you have a fully indexed version of
> Wikipedia? I mainly need two fields for each document: the indexed content of
> the Wikipedia articles and the title. If there is any place where I can get
> the index, that will save me a great deal of time.
> Regards,
> Shaimaa
>
>> From: [email protected]
>> Date: Tue, 19 Jun 2012 16:29:39 -0400
>> Subject: Re: Wikipedia Index
>> To: [email protected]
>>
>> Likely the bottleneck is pulling content from the database?  Maybe
>> test just that and see how long it takes?
>>
>> 24 hours is way too long to index all of Wikipedia.  For example, we
>> index Wikipedia every night for our trunk/4.0 performance tests, here:
>>
>>     http://people.apache.org/~mikemccand/lucenebench/indexing.html
>>
>> The export is a bit old now (01/15/2011) but it takes just under 6
>> minutes to fully index it.  This is on a fairly beefy machine (24
>> cores)... and trunk/4.0 has substantial concurrency improvements over
>> 3.x.
>>
>> You can also try the ideas here:
>>
>>     http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
>> <[email protected]> wrote:
>> >
>> > Hi everybody,
>> > I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million
>> > articles. The data is in a MySQL database, and indexing has taken more than
>> > 24 hours so far. Do you know any tips that can speed up the indexing process?
>> > Here is my code:
>> > public static void main(String[] args) {
>> >     String indexPath = INDEXPATH;
>> >     IndexWriter writer = null;
>> >     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
>> >     dbConfig.setHost(host);
>> >     dbConfig.setDatabase(data);
>> >     dbConfig.setUser(user);
>> >     dbConfig.setPassword(password);
>> >     dbConfig.setLanguage(Language.english);
>> >
>> >     try {
>> >         Directory dir = FSDirectory.open(new File(indexPath));
>> >         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
>> >         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
>> >         iwc.setOpenMode(OpenMode.CREATE);
>> >         writer = new IndexWriter(dir, iwc);
>> >     } catch (IOException e) {
>> >         System.out.println(" caught a " + e.getClass() +
>> >             "\n with message: " + e.getMessage());
>> >     }
>> >
>> >     try {
>> >         Wikipedia wiki = new Wikipedia(dbConfig);
>> >         Iterable<Page> wikipages = wiki.getPages(); // get wikipedia articles from the database
>> >         Iterator iter = wikipages.iterator();
>> >         while (iter.hasNext()) {
>> >             Page p = (Page) iter.next();
>> >             System.out.println(p.getTitle().getPlainTitle());
>> >             Document doc = new Document();
>> >             Field contentField = new Field("contents", p.getPlainText(),
>> >                 Field.Store.NO, Field.Index.ANALYZED);
>> >             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
>> >                 Field.Store.YES, Field.Index.NOT_ANALYZED);
>> >             doc.add(contentField); // wiki page text
>> >             doc.add(titleField); // wiki page title
>> >             writer.addDocument(doc);
>> >         }
>> >     } catch (Exception e) {
>> >         e.printStackTrace();
>> >     }
>> > }
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

