Hey Lukas,

You can get a basic demo of this working in Lucene
first then make a more advanced and efficient version.

First, give each document in your index a score field
encoded with NumberTools so it sorts correctly. When users
perform a search, log the unique document_id, IP address,
and result position for the next step.
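The trick behind NumberTools is just fixed-width, zero-padded encoding so that string order matches numeric order. A minimal sketch of that idea in plain Java (the 19-digit width matches a non-negative long; this is an illustration, not Lucene's actual implementation):

```java
import java.util.Arrays;

public class SortableScore {
    // NumberTools-style encoding: zero-pad to a fixed width so
    // lexicographic (string) order matches numeric order for
    // non-negative scores.
    static String encode(long score) {
        return String.format("%019d", score);
    }

    public static void main(String[] args) {
        String[] encoded = { encode(12), encode(3), encode(100) };
        Arrays.sort(encoded); // plain string sort
        // After sorting, the values come out in numeric order: 3, 12, 100
        System.out.println(Arrays.toString(encoded));
    }
}
```

Without the padding, "100" would sort before "3" as a string, which is why a raw numeric string is not safe for a sortable Lucene field.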

Use Hadoop to simplify your logs: map on the document_id
and emit IPs as intermediate values, then have reduce
collect the unique IP addresses for each document_id.
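The map/reduce step above boils down to grouping clicks by document_id and deduplicating IPs. Here is a toy in-process sketch of that logic in plain Java (not actual Hadoop code; the doc IDs and IPs are made up for illustration):

```java
import java.util.*;

public class ClickAggregator {
    // The "map" phase would emit (docId, ip) pairs; the "reduce"
    // phase collects the unique IPs per docId. Both are simulated
    // here in a single pass over the click log.
    static Map<String, Set<String>> reduce(List<String[]> clicks) {
        Map<String, Set<String>> byDoc = new HashMap<>();
        for (String[] click : clicks) {
            String docId = click[0], ip = click[1];
            // A HashSet drops repeat clicks from the same IP.
            byDoc.computeIfAbsent(docId, k -> new HashSet<>()).add(ip);
        }
        return byDoc;
    }

    public static void main(String[] args) {
        List<String[]> clicks = Arrays.asList(
            new String[]{"doc1", "10.0.0.1"},
            new String[]{"doc1", "10.0.0.1"}, // duplicate click, same IP
            new String[]{"doc1", "10.0.0.2"},
            new String[]{"doc2", "10.0.0.3"});
        // doc1 ends up with 2 unique IPs, doc2 with 1
        System.out.println(ClickAggregator.reduce(clicks));
    }
}
```

In real Hadoop you would get the same effect by making document_id the map output key and letting the reducer build the set, so all clicks for one document land in one reduce call.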

Read through the final output file, increment the score
value once for each unique IP that clicked on the document_id,
re-index in Lucene, and sort results (in reverse order) by
the score field.
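Continuing the toy sketch: scoring each document by its count of unique clicking IPs and ranking in reverse (highest first) looks like this in plain Java. The reverse string-sort on the score field in Lucene would produce the same ordering:

```java
import java.util.*;

public class ScoreUpdate {
    // One point per unique IP that clicked the document; results
    // are then sorted by score descending, mirroring a reverse
    // sort on the Lucene score field.
    static List<Map.Entry<String, Integer>> rank(Map<String, Set<String>> byDoc) {
        List<Map.Entry<String, Integer>> scores = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : byDoc.entrySet()) {
            scores.add(Map.entry(e.getKey(), e.getValue().size()));
        }
        scores.sort((a, b) -> b.getValue() - a.getValue()); // descending
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> byDoc = new HashMap<>();
        byDoc.put("doc1", new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2")));
        byDoc.put("doc2", new HashSet<>(Arrays.asList("10.0.0.3")));
        // doc1 (score 2) ranks ahead of doc2 (score 1)
        System.out.println(ScoreUpdate.rank(byDoc));
    }
}
```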

A more advanced version could store previous result
positions as Payloads, but I don't yet understand that
new Lucene concept.

Regards,

Peter W.

On Aug 10, 2007, at 5:56 AM, Lukas Vlcek wrote:

Enis,

Thanks for your time.
I gave Pig a quick glance and it seems good (it appears to be built directly
on Hadoop, which I am starting to play with :-). It is obvious that a huge
amount of data (like user queries or access logs) should be stored in flat
files, which makes it convenient for further analysis by Pig (or directly by
Hadoop-based tasks) or other tools. And I agree with you that the size of the
index can be tracked journal-style in a separate log rather than with every
single user query. That covers the easier part of my original
question :-)

The true art starts with the mining tasks themselves: how to efficiently use
such data to improve the user experience of the search engine...

On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
...

Web server log analysis is a very popular topic nowadays; you can check
the literature, especially on clickthrough data analysis. All the
major search engines have to interpret this data to improve their
algorithms and to learn from the latent "collective knowledge" hidden
in web server logs.
...

...
You do not have to implement this from scratch. You just have to specify
your data mining tasks, then write scripts (in Pig Latin) or
map-reduce programs (in Hadoop). Neither of these is that hard. I do not
think there is any tool that can satisfy all your information needs,
so at the risk of repeating myself I suggest you look at Pig
and write some scripts to mine the data...
