RE: Wikipedia Index

Elshaimaa Ali Tue, 19 Jun 2012 17:04:11 -0700
I only have the source text in a MySQL database.
Do you know where I can download it as XML, and is it possible to split the
documents into content and title?

Thanks,
Shaimaa
> From: luc...@mikemccandless.com
> Date: Tue, 19 Jun 2012 19:48:24 -0400
> Subject: Re: Wikipedia Index
> To: java-user@lucene.apache.org
> 
> I have the index locally ... but it's really impractical to send it
> especially if you already have the source text locally.
> 
> Maybe index directly from the source text instead of via a database?
> Lucene's benchmark contrib/module has code to decode the XML into
> documents...
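
A minimal sketch of splitting a Wikipedia XML export into title and content without going through a database. It assumes the standard MediaWiki export layout, where each <page> element holds a <title> and a <revision> with a <text> child. The names WikiDumpSplitter and Article are hypothetical, and this is plain JDK StAX code, not the parser from Lucene's benchmark module. The content it yields is raw wiki markup, so it would still need cleaning before being indexed as plain text.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class WikiDumpSplitter {

    /** One article: its title and its raw (wiki markup) text. */
    public static final class Article {
        public final String title;
        public final String text;

        Article(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /**
     * Streams the export and hands each page to the sink as soon as its
     * closing tag is seen, so the whole dump is never held in memory.
     */
    public static void parse(Reader in, Consumer<Article> sink) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Return each element's text as a single chunk, and ignore DTDs.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader xml = factory.createXMLStreamReader(in);

            String title = null;
            String text = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("page".equals(name)) {
                        title = null;
                        text = null;
                    } else if ("title".equals(name)) {
                        title = xml.getElementText();
                    } else if ("text".equals(name)) {
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(xml.getLocalName())) {
                    sink.accept(new Article(title, text == null ? "" : text));
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML export", e);
        }
    }

    public static void main(String[] args) {
        // Tiny stand-in for a real export file.
        String sample = "<mediawiki><page><title>Example</title>"
                + "<revision><text>Some wiki text</text></revision></page></mediawiki>";
        parse(new StringReader(sample),
                a -> System.out.println(a.title + " (" + a.text.length() + " chars)"));
    }
}
```

In an indexing loop, the sink would build one Document per Article with the same "title" and "contents" fields used in the code quoted further down.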
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Tue, Jun 19, 2012 at 6:27 PM, Elshaimaa Ali
> <elshaimaa....@hotmail.com> wrote:
> >
> > Thanks, Mike, for the prompt reply. Do you have a fully indexed version of
> > Wikipedia? I mainly need two fields for each document: the indexed content
> > of the Wikipedia articles and the title. If there is any place where I can
> > get the index, that would save me a great deal of time.
> >
> > Regards,
> > Shaimaa
> >
> >> From: luc...@mikemccandless.com
> >> Date: Tue, 19 Jun 2012 16:29:39 -0400
> >> Subject: Re: Wikipedia Index
> >> To: java-user@lucene.apache.org
> >>
> >> Likely the bottleneck is pulling content from the database?  Maybe
> >> test just that and see how long it takes?
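
One way to test just the database side, sketched with plain JDK code: run the same loop with the IndexWriter calls removed and time it. SourceTimer and its drain method are hypothetical names. The extractor argument stands in for whatever call produces the text (p.getPlainText() in the code quoted below), since that call may do its own fetching or parsing and should be part of the measurement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SourceTimer {

    /** What one pass over the source cost: items seen, characters read, elapsed time. */
    public static final class Result {
        public final long items;
        public final long chars;
        public final long nanos;

        Result(long items, long chars, long nanos) {
            this.items = items;
            this.chars = chars;
            this.nanos = nanos;
        }
    }

    /**
     * Walks the whole source and extracts the text of every item,
     * but does no indexing, so the elapsed time is the source's cost alone.
     */
    public static <T> Result drain(Iterable<T> source,
                                   Function<? super T, String> extractText) {
        long start = System.nanoTime();
        long items = 0;
        long chars = 0;
        for (T item : source) {
            String text = extractText.apply(item);
            items++;
            if (text != null) {
                chars += text.length();
            }
        }
        return new Result(items, chars, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data; in the real test the source would be the page iterable.
        List<String> standIn = Arrays.asList("first article", "second article");
        Result r = drain(standIn, s -> s);
        System.out.printf("%d items, %d chars, %.1f ms%n",
                r.items, r.chars, r.nanos / 1e6);
    }
}
```

If this pass alone takes most of the 24 hours, the time is going to the database and text extraction rather than to Lucene.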
> >>
> >> 24 hours is way too long to index all of Wikipedia.  For example, we
> >> index Wikipedia every night for our trunk/4.0 performance tests, here:
> >>
> >>     http://people.apache.org/~mikemccand/lucenebench/indexing.html
> >>
> >> The export is a bit old now (01/15/2011) but it takes just under 6
> >> minutes to fully index it.  This is on a fairly beefy machine (24
> >> cores)... and trunk/4.0 has substantial concurrency improvements over
> >> 3.x.
> >>
> >> You can also try the ideas here:
> >>
> >>     http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
> >> <elshaimaa....@hotmail.com> wrote:
> >> >
> >> > Hi everybody,
> >> > I'm using Lucene 3.6 to index Wikipedia documents (over 3 million
> >> > articles). The data is in a MySQL database, and indexing has taken more
> >> > than 24 hours so far. Do you know any tips that can speed up the indexing
> >> > process? Here is my code:
> >> > public static void main(String[] args) {
> >> >     String indexPath = INDEXPATH;
> >> >     IndexWriter writer = null;
> >> >     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
> >> >     dbConfig.setHost(host);
> >> >     dbConfig.setDatabase(data);
> >> >     dbConfig.setUser(user);
> >> >     dbConfig.setPassword(password);
> >> >     dbConfig.setLanguage(Language.english);
> >> >
> >> >     try {
> >> >         Directory dir = FSDirectory.open(new File(indexPath));
> >> >         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
> >> >         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
> >> >         iwc.setOpenMode(OpenMode.CREATE);
> >> >         writer = new IndexWriter(dir, iwc);
> >> >     } catch (IOException e) {
> >> >         System.out.println(" caught a " + e.getClass() +
> >> >             "\n with message: " + e.getMessage());
> >> >     }
> >> >
> >> >     try {
> >> >         Wikipedia wiki = new Wikipedia(dbConfig);
> >> >         Iterable<Page> wikipages = wiki.getPages(); // get wikipedia articles from the database
> >> >         Iterator iter = wikipages.iterator();
> >> >         while (iter.hasNext()) {
> >> >             Page p = (Page) iter.next();
> >> >             System.out.println(p.getTitle().getPlainTitle());
> >> >             Document doc = new Document();
> >> >             Field contentField = new Field("contents", p.getPlainText(),
> >> >                 Field.Store.NO, Field.Index.ANALYZED);
> >> >             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
> >> >                 Field.Store.YES, Field.Index.NOT_ANALYZED);
> >> >             doc.add(contentField); // wiki page text
> >> >             doc.add(titleField);   // wiki page title
> >> >             writer.addDocument(doc);
> >> >         }
> >> >     } catch (Exception e) {
> >> >         e.printStackTrace();
> >> >     }
> >> > }
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> 