Hey Lukas,
You can get a basic demo of this working in Lucene
first, then build a more advanced and efficient version.
First, give each document in your index a score field
encoded with NumberTools so it's sortable. When users
perform a search, log the unique document_id, IP address,
and result position for the next step.
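To see why the score field needs encoding at all: Lucene sorts string fields lexicographically, so plain numbers would order "9" after "40". NumberTools handles this with a radix-36 encoding; the zero-padded decimal sketch below (my own simplification, not Lucene's actual format) illustrates the same idea for non-negative scores.

```java
// Sketch of a lexicographically sortable score encoding.
// Lucene's NumberTools uses radix-36 with negative-number support;
// this simpler zero-padded form shows the idea for scores >= 0.
public class ScoreCodec {
    private static final int WIDTH = 12; // assumed max score width

    public static String encode(long score) {
        if (score < 0) throw new IllegalArgumentException("negative score");
        String s = Long.toString(score);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < WIDTH; i++) sb.append('0');
        return sb.append(s).toString();
    }

    public static long decode(String encoded) {
        return Long.parseLong(encoded);
    }
}
```

With the padding, string comparison agrees with numeric comparison, so a sort on the field behaves as expected.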
Use Hadoop to simplify your logs by mapping the
document_id and emitting IPs as intermediate values. Have
reduce collect the unique IP addresses for each document_id.
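The map/reduce step above can be sketched in plain Java (the Hadoop job setup and the assumed tab-separated log format are omitted/invented here for illustration): map each log line to a (document_id, IP) pair, then reduce to the set of unique IPs per document.

```java
import java.util.*;

// Plain-Java sketch of the Hadoop job: extract (doc_id, ip) pairs
// from the click log, then collapse them into unique IP sets per
// document_id, as the reduce phase would.
public class ClickReducer {
    // Each log line is assumed to be "doc_id\tip\tposition".
    public static Map<String, Set<String>> uniqueIpsPerDoc(List<String> logLines) {
        Map<String, Set<String>> result = new HashMap<String, Set<String>>();
        for (String line : logLines) {
            String[] parts = line.split("\t");
            String docId = parts[0];
            String ip = parts[1];
            Set<String> ips = result.get(docId);
            if (ips == null) {
                ips = new HashSet<String>();
                result.put(docId, ips);
            }
            ips.add(ip); // repeat clicks from the same IP collapse here
        }
        return result;
    }
}
```

Using a Set per document_id is what makes the counts unique per IP, so one user clicking a result ten times only counts once.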
Read through the final output file, increment the score
value for each unique IP that clicked on the document_id,
re-index in Lucene, and sort results (in reverse order)
by the score field.
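The re-scoring pass could look something like the sketch below: one point added per unique clicking IP. The field names and update mechanism are placeholders; in Lucene you would actually delete and re-add each affected document with the new encoded score field.

```java
import java.util.*;

// Sketch of the re-scoring step: given the number of unique clicking
// IPs per document_id (the reduce output), bump each document's score.
public class ScoreUpdater {
    public static Map<String, Long> bumpScores(Map<String, Long> currentScores,
                                               Map<String, Integer> uniqueClicks) {
        Map<String, Long> updated = new HashMap<String, Long>(currentScores);
        for (Map.Entry<String, Integer> e : uniqueClicks.entrySet()) {
            Long old = updated.get(e.getKey());
            long base = (old == null) ? 0L : old.longValue();
            updated.put(e.getKey(), base + e.getValue()); // one point per unique IP
        }
        return updated;
    }
}
```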
A more advanced version could store previous result
positions as Payloads, but I don't fully understand
this new Lucene concept yet.
Regards,
Peter W.
On Aug 10, 2007, at 5:56 AM, Lukas Vlcek wrote:
Enis,
Thanks for your time.
I gave a quick glance at Pig and it seems good (it appears
to be directly based on Hadoop, which I am starting to play
with :-). It's obvious that a huge amount of data (like user
queries or access logs) should be stored in flat files, which
makes it convenient for further analysis by Pig (or directly
by Hadoop-based tasks) or other tools. And I agree with you
that the size of the index can be tracked journal-style in a
separate log rather than with every single user query. That
covers the easier part of my original question :-)
The true art starts with the mining tasks themselves: how to
efficiently use such data to improve the user experience with
the search engine...
On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
...
Web server log analysis is a very popular topic nowadays, and
you can check the literature, especially on clickthrough data
analysis. All the major search engines have to interpret this
data to improve their algorithms and to learn from the latent
"collective knowledge" hidden in web server logs.
...
...
You do not have to implement this from scratch. You just have
to specify your data mining tasks, then write scripts (in Pig
Latin) or write map-reduce programs (in Hadoop). Neither of
these is that hard. I do not think there is any tool that will
satisfy all your information needs. So at the risk of repeating
myself, I suggest you look at Pig and write some scripts to
mine the data...