Re: Wikipedia Index

Michael McCandless Tue, 19 Jun 2012 16:46:36 -0700

3 GB of RAM is plenty for indexing Wikipedia (e.g., that nightly benchmark
uses a 2 GB heap).

Having only 2 cores just means it'll take longer than it would with more
cores... just use 2 indexing threads.
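
A minimal sketch of the two-thread suggestion, assuming Java 8 or later. IndexWriter is thread-safe, so one writer can be shared by both threads. To keep the example runnable without Lucene or a database, the indexing step is hidden behind a hypothetical DocSink interface (not a Lucene type), and the class and method names here are made up for illustration. In real code the sink would wrap something like writer.addDocument(doc).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoThreadIndexing {

    /** Hypothetical stand-in for the indexing step (e.g. a wrapper around IndexWriter.addDocument). */
    interface DocSink<T> {
        void add(T item) throws Exception;
    }

    /**
     * Drains one shared iterator from `threads` workers and returns how many
     * items were handed to the sink successfully. The iterator is guarded by
     * a lock because iterators are generally not thread-safe; the sink itself
     * must be safe to call from several threads at once.
     */
    static <T> int indexAll(Iterator<T> source, DocSink<T> sink, int threads)
            throws InterruptedException {
        AtomicInteger count = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                while (true) {
                    T item;
                    // Only fetching the next item is serialized; the
                    // (expensive) indexing step runs outside the lock.
                    synchronized (source) {
                        if (!source.hasNext()) {
                            return;
                        }
                        item = source.next();
                    }
                    try {
                        sink.add(item);
                        count.incrementAndGet();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return count.get();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder data standing in for pages read from the database.
        List<String> pages = Arrays.asList("Lucene", "Wikipedia", "MySQL", "Java");
        AtomicInteger seen = new AtomicInteger();
        int indexed = indexAll(pages.iterator(), page -> seen.incrementAndGet(), 2);
        System.out.println("indexed " + indexed + " pages");
    }
}
```

If reading from the source is itself the slow part, the lock around the iterator will serialize most of the work and a second thread will not help much.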

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara <reynamel...@gmail.com> wrote:
> Would it be possible to index Wikipedia on a 2-core machine with 3 GB of
> RAM? I have had the same problem trying to index it.
>
> I've tried with a dump from April 2011.
>
> Thanks
> Reyna
> CIC-IPN
> Mexico
>
> 2012/6/19 Michael McCandless <luc...@mikemccandless.com>
>
>> Likely the bottleneck is pulling content from the database?  Maybe
>> test just that and see how long it takes?
>>
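One way to act on that suggestion is to time a pass over the page source with the indexing taken out entirely. The sketch below assumes Java 8 or later; SourceTimer, Result and drain are hypothetical names, not part of Lucene or of the poster's Wikipedia library. It walks any Iterable, pulls the text out of each item so that lazy loading really happens, and reports the throughput. With the poster's code, the arguments would presumably be wiki.getPages() and p -> p.getPlainText().

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SourceTimer {

    /** Outcome of one pass over the source. */
    static final class Result {
        final long items;  // number of items read
        final long chars;  // total characters of text touched
        final long nanos;  // elapsed wall-clock time

        Result(long items, long chars, long nanos) {
            this.items = items;
            this.chars = chars;
            this.nanos = nanos;
        }

        double itemsPerSecond() {
            return nanos == 0 ? 0.0 : items * 1e9 / nanos;
        }
    }

    /**
     * Iterates the whole source and extracts the text of every item,
     * without indexing anything, so the time measured is fetch time only.
     */
    static <T> Result drain(Iterable<T> source, Function<T, String> text) {
        long start = System.nanoTime();
        long items = 0;
        long chars = 0;
        for (T item : source) {
            String s = text.apply(item);
            chars += (s == null) ? 0 : s.length();
            items++;
        }
        return new Result(items, chars, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder data; replace with the real page source.
        List<String> pages = Arrays.asList("alpha", "beta", "gamma");
        Result r = drain(pages, p -> p);
        System.out.println(r.items + " items, " + r.chars + " chars, "
                + r.itemsPerSecond() + " items/s");
    }
}
```

If this pass alone accounts for most of the 24 hours, the database read is the bottleneck and tuning the IndexWriter will not change much.
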
>> 24 hours is way too long to index all of Wikipedia.  For example, we
>> index Wikipedia every night for our trunk/4.0 performance tests, here:
>>
>>    http://people.apache.org/~mikemccand/lucenebench/indexing.html
>>
>> The export is a bit old now (01/15/2011) but it takes just under 6
>> minutes to fully index it.  This is on a fairly beefy machine (24
>> cores)... and trunk/4.0 has substantial concurrency improvements over
>> 3.x.
>>
>> You can also try the ideas here:
>>
>>    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
>> <elshaimaa....@hotmail.com> wrote:
>> >
>> > Hi everybody
>> > I'm using Lucene 3.6 to index Wikipedia, which is over 3 million
>> > articles. The data is in a MySQL database, and indexing has taken more
>> > than 24 hours so far. Do you know any tips that could speed up the
>> > indexing process?
>> > Here is my code:
>> > public static void main(String[] args) {
>> >     String indexPath = INDEXPATH;
>> >     IndexWriter writer = null;
>> >     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
>> >     dbConfig.setHost(host);
>> >     dbConfig.setDatabase(data);
>> >     dbConfig.setUser(user);
>> >     dbConfig.setPassword(password);
>> >     dbConfig.setLanguage(Language.english);
>> >
>> >     try {
>> >         Directory dir = FSDirectory.open(new File(indexPath));
>> >         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
>> >         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
>> >         iwc.setOpenMode(OpenMode.CREATE);
>> >         writer = new IndexWriter(dir, iwc);
>> >     } catch (IOException e) {
>> >         System.out.println(" caught a " + e.getClass() +
>> >                 "\n with message: " + e.getMessage());
>> >     }
>> >
>> >     try {
>> >         Wikipedia wiki = new Wikipedia(dbConfig);
>> >         Iterable<Page> wikipages = wiki.getPages(); // get wikipedia articles from the database
>> >         Iterator iter = wikipages.iterator();
>> >         while (iter.hasNext()) {
>> >             Page p = (Page) iter.next();
>> >             System.out.println(p.getTitle().getPlainTitle());
>> >             Document doc = new Document();
>> >             Field contentField = new Field("contents", p.getPlainText(),
>> >                     Field.Store.NO, Field.Index.ANALYZED);
>> >             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
>> >                     Field.Store.YES, Field.Index.NOT_ANALYZED);
>> >             doc.add(contentField); // wiki page text
>> >             doc.add(titleField);   // wiki page title
>> >             writer.addDocument(doc);
>> >         }
>> >     } catch (Exception e) {
>> >         e.printStackTrace();
>> >     }
>> > }
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Reyna
