Re: Wikia search goes live today

Dennis Kubes Tue, 08 Jan 2008 14:54:06 -0800

Sorry about not responding to this before now, been a little busy :).

For those of you who don't know me, I am a committer on the Nutch project. I have been working with Wikia since early July and more actively since the beginning of November. Before Wikia I helped start another search engine based on Nutch called Visvo.com.

For the record, yes, Search Wikia is using and will be supporting Nutch/Hadoop/Lucene/Solr/HBase. It is the intention of Search Wikia to help develop these projects and their communities. We have no intention of keeping the changes we make "proprietary". Everything that Search Wikia develops (barring any user or personal data) will be considered open source and freely available. Any improvements made to the Apache projects will be immediately donated back to the community through the respective project.

Making search open and transparent is not just limited to source code. It is our intention to make the Search Wikia data freely open and available as well. This means that people will be able to download the crawl data, link data, content shards, and completed indexes. Also the social networking functionality, named foowi, will become its own open source project (probably with an Apache license), and will be available to download, use, and improve.

And Search Wikia is not alone in this. Visvo.com, in coordination with Wikia, will be releasing all of its data and source code improvements to the community under an OSI approved license, including a Python framework for managing Hadoop configurations on distributed machines, automating the fetching and indexing process, and for managing search shards.

In terms of the Nutch logo: there are two standard Nutch installations and index farms at the following URLs. One is an index hosted at the ISC and the other is Visvo's open index. The ISC index has approximately 35M pages while Visvo's index has a little over 50M pages.


http://search.isc.swlabs.org
http://open-index.visvo.com

The main Search Wikia site is hosted in a secure underground hosting facility in a bunker in Iowa (http://usshc.com/) and makes calls to these indexes. So when showing cached pages and explain plans, those requests go to their respective indexes.

Both indexes are available for search by either browser based or web 2.0 based clients. We are currently using NUTCH-594 to serve results from these indexes in both XML and JSON formats. An example request searching for java would be:


http://search.isc.swlabs.org/nutchsearch?query=java&hitsPerSite=1&lang=en&hitsPerPage=10&type=json
http://open-index.visvo.com/nutchsearch?query=java&hitsPerSite=1&lang=en&hitsPerPage=10&type=json
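The example requests above can be assembled programmatically. The following is a minimal Python sketch, assuming only the parameter names visible in those URLs (query, hitsPerSite, lang, hitsPerPage, type). The helper names build_search_url and fetch_results are hypothetical, and since the layout of the JSON response is not described in this message, the decoded object is returned unchanged rather than interpreted.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, hits_per_site=1, lang="en",
                     hits_per_page=10, result_type="json"):
    """Build a /nutchsearch URL with the parameters shown in the examples."""
    # Parameter order mirrors the example requests; dicts keep insertion order.
    params = {
        "query": query,
        "hitsPerSite": hits_per_site,
        "lang": lang,
        "hitsPerPage": hits_per_page,
        "type": result_type,
    }
    return base_url.rstrip("/") + "/nutchsearch?" + urlencode(params)


def fetch_results(url, timeout=10):
    """Fetch the URL and decode the body as JSON.

    Assumption: the server answers with UTF-8 encoded JSON when type=json.
    The structure of the decoded object is not documented here, so it is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URLs; call fetch_results(url) to actually query a host.
    for base in ("http://search.isc.swlabs.org", "http://open-index.visvo.com"):
        print(build_search_url(base, "java"))
```

Keeping URL construction separate from the network call makes it easy to check the query string without depending on either host being reachable.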

So we are busy working on getting the data available for download. Hopefully we should have a site set up within the next day or so. If anybody has any questions or would like to get some specific data, feel free to send me an email.


Dennis Kubes

Lukas Vlcek wrote:

I should note that this technique is probably not easily applicable to
the current Lucene scoring mechanism without additional development.

On 1/8/08, Lukas Vlcek <[EMAIL PROTECTED]> wrote:

After checking the Lucene API of ParallelReader, it seems that the star
score could be stored in a different index which shares the same identifiers
for the documents. Such an index could be small (partitioned into many small
indices?) so the updates can be fast. Is that what you meant, Andrzej? ;-)
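The idea in the paragraph above can be modelled in a few lines. This is a conceptual Python sketch only, not Lucene code: plain lists stand in for the two indexes, and the class names, the term-frequency score and the star_weight parameter are all assumptions made for illustration. The one property it tries to show is that the star data lives in a separate, small structure aligned with the main index by document id, so a star can be added without touching the main content.

```python
class MainIndex:
    """Stand-in for the large content index; list position is the doc id."""

    def __init__(self, docs):
        self.docs = list(docs)

    def search(self, term):
        """Return (doc_id, text_score) pairs; term frequency is a toy score."""
        hits = []
        for doc_id, text in enumerate(self.docs):
            tf = text.lower().split().count(term.lower())
            if tf:
                hits.append((doc_id, float(tf)))
        return hits


class StarIndex:
    """Stand-in for the small parallel index; same doc ids as MainIndex."""

    def __init__(self, size):
        self.stars = [0] * size

    def add_star(self, doc_id):
        # Cheap update: only this small structure changes.
        self.stars[doc_id] += 1


def search_with_stars(main, stars, term, star_weight=0.5):
    """Combine the text score with the star count looked up by doc id."""
    ranked = [
        (doc_id, score + star_weight * stars.stars[doc_id])
        for doc_id, score in main.search(term)
    ]
    return sorted(ranked, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    main = MainIndex(["java java tutorial", "java guide", "python guide"])
    stars = StarIndex(len(main.docs))
    print(search_with_stars(main, stars, "java"))
    for _ in range(3):
        stars.add_star(1)  # no reindexing of the main content
    print(search_with_stars(main, stars, "java"))
```

How the combined score would plug into the actual Lucene scoring mechanism is left open here, as it is in the thread.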

Anyway, I remember a different technique which I once mentioned on the Lucene
mailing list, taking inspiration from the book Programming Collective
Intelligence <http://www.oreilly.com/catalog/9780596529321/>. The idea is
not to store the score (maybe I should call it user preference) in the index but
in a neural net. One useful side effect is that this technique could score
reasonably even documents without any stars (meaning documents "similar" to
highly starred documents could score better even if they haven't been starred
by any user yet).

Regards,
Lukas

On 1/8/08, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Lukas Vlcek wrote:

So starring will be accommodated only during the indexing phase. Does it mean it
will be a pretty static value, not a dynamically changing variable... correct?

In other words, if I add my stars to some document, it won't affect the
scoring immediately but only after an indexing cycle. Correct?

(I'm not involved in Wikia development). There are some ways to go about
it even in the pure Lucene-land, so that the updates are fast without
reindexing the main content. Hint: ParallelReader.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
http://blog.lukas-vlcek.com/


