3 GB RAM is plenty for indexing Wikipedia (e.g., that nightly benchmark
uses a 2 GB heap).

2 cores just means it'll take longer than with more cores... just use 2
indexing threads.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara <reynamel...@gmail.com> wrote:
> Could it be possible to index Wikipedia on a 2-core machine with 3 GB of
> RAM? I have had the same problem trying to index it.
>
> I've tried with a dump from April 2011.
>
> Thanks
> Reyna
> CIC-IPN
> Mexico
>
> 2012/6/19 Michael McCandless <luc...@mikemccandless.com>
>
>> Likely the bottleneck is pulling content from the database? Maybe
>> test just that and see how long it takes?
>>
>> 24 hours is way too long to index all of Wikipedia. For example, we
>> index Wikipedia every night for our trunk/4.0 performance tests, here:
>>
>> http://people.apache.org/~mikemccand/lucenebench/indexing.html
>>
>> The export is a bit old now (01/15/2011) but it takes just under 6
>> minutes to fully index it. This is on a fairly beefy machine (24
>> cores)... and trunk/4.0 has substantial concurrency improvements over
>> 3.x.
>>
>> You can also try the ideas here:
>>
>> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
>> <elshaimaa....@hotmail.com> wrote:
>> >
>> > Hi everybody
>> > I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million
>> > articles. The data is in a MySQL database and it is taking more than 24
>> > hours so far. Do you know any tips that can speed up the indexing process?
>> > Here is my code:
>> >
>> > public static void main(String[] args) {
>> >     String indexPath = INDEXPATH;
>> >     IndexWriter writer = null;
>> >     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
>> >     dbConfig.setHost(host);
>> >     dbConfig.setDatabase(data);
>> >     dbConfig.setUser(user);
>> >     dbConfig.setPassword(password);
>> >     dbConfig.setLanguage(Language.english);
>> >
>> >     try {
>> >         Directory dir = FSDirectory.open(new File(indexPath));
>> >         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
>> >         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
>> >         iwc.setOpenMode(OpenMode.CREATE);
>> >         writer = new IndexWriter(dir, iwc);
>> >     } catch (IOException e) {
>> >         System.out.println(" caught a " + e.getClass() +
>> >             "\n with message: " + e.getMessage());
>> >     }
>> >
>> >     try {
>> >         Wikipedia wiki = new Wikipedia(dbConfig);
>> >         Iterable<Page> wikipages = wiki.getPages(); //get wikipedia articles from the database
>> >         Iterator iter = wikipages.iterator();
>> >         while(iter.hasNext()){
>> >             Page p = (Page)iter.next();
>> >             System.out.println(p.getTitle().getPlainTitle());
>> >             Document doc = new Document();
>> >             Field contentField = new Field("contents", p.getPlainText(),
>> >                 Field.Store.NO, Field.Index.ANALYZED);
>> >             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
>> >                 Field.Store.YES, Field.Index.NOT_ANALYZED );
>> >             doc.add(contentField); // wiki page text
>> >             doc.add(titleField); // wiki page title
>> >             writer.addDocument(doc);
>> >         }
>> >     } catch (Exception e) {
>> >         e.printStackTrace();
>> >     }
>> > }
>> >
>>
>
> --
> Reyna

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
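
As a rough illustration of the two suggestions in this thread (first time the content source on its own, then index with 2 threads), here is a minimal Java sketch. It uses only the JDK. PageSink, TwoThreadIndexingSketch, the queue size of 1000 and the in-memory page list are hypothetical names and values picked for illustration; none of them come from the original code. In a real run the sink would presumably build a Document and call writer.addDocument(doc) on one shared IndexWriter, which can be called from several threads at once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the indexing call, e.g. writer.addDocument(doc). */
interface PageSink {
    void accept(String pageText) throws Exception;
}

class TwoThreadIndexingSketch {
    // Unique sentinel object (compared by identity) that tells a worker to stop.
    private static final String POISON = new String("<<end-of-input>>");

    /** Step 1: iterate the source alone, with no indexing, and report the elapsed time. */
    static long timeSourceOnlyMillis(Iterable<String> pages) {
        long start = System.nanoTime();
        long count = 0;
        for (String p : pages) {
            if (p != null) count++; // touch each page so it is actually fetched
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read " + count + " pages in " + ms + " ms");
        return ms;
    }

    /** Step 2: the calling thread reads pages into a bounded queue; `threads` workers drain it. */
    static long indexAll(Iterable<String> pages, int threads, PageSink sink)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000); // assumed capacity
        AtomicLong indexed = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (String page = queue.take(); page != POISON; page = queue.take()) {
                        try {
                            sink.accept(page);
                            indexed.incrementAndGet();
                        } catch (Exception e) {
                            // Keep the worker alive so the reader never blocks forever.
                            System.err.println("failed to index a page: " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (String p : pages) queue.put(p);                  // blocks when workers fall behind
        for (int i = 0; i < threads; i++) queue.put(POISON);  // one stop signal per worker
        for (Thread w : workers) w.join();
        return indexed.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = Arrays.asList("page one", "page two", "page three");
        timeSourceOnlyMillis(pages);
        long n = indexAll(pages, 2, text -> { /* build a Document and add it here */ });
        System.out.println("indexed " + n + " pages");
    }
}
```

With a database-backed Iterable in place of the list, comparing the source-only timing against the total indexing time should show whether the database is the bottleneck, as suggested above.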