Enis,

Thanks for your time.
I gave a quick glance at Pig and it seems good (it seems to be built directly
on Hadoop, which I am starting to play with :-). It is obvious that a huge
amount of data (like user queries or access logs) should be stored in flat
files, which makes it convenient for further analysis by Pig (or directly by
Hadoop based tasks) or other tools. And I agree with you that the size of the
index can be tracked journal-style in a separate log rather than with every
single user query. So much for the easier part of my original question :-)
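
To make sure I understand the journal idea, I picture something like the
sketch below (untested, and the class/field names are just my own choice):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.time.Instant;

    /** Appends the index size to a separate journal only when it changes. */
    public class IndexSizeJournal {
        private final Path journal;
        private long lastSize = -1;

        public IndexSizeJournal(Path journal) {
            this.journal = journal;
        }

        /** Called on every query, but only writes when the size changed. */
        public synchronized void record(long indexSize) throws IOException {
            if (indexSize == lastSize) {
                return; // unchanged -> no journal entry
            }
            String line = Instant.now() + "\t" + indexSize + "\n";
            Files.write(journal, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            lastSize = indexSize;
        }
    }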

The true art starts with the mining tasks themselves. How to efficiently use
such data for improving the user experience with the search engine... one
option is to use the data for search engine tuning, which is a more
technically oriented application (speeding up slow queries, reshaping the
index, ...). But I am looking for a more IR-oriented application of this
information. I remember reading on the Lucene mailing list that somebody
suggested using previously issued user queries to suggest
similar/other/related queries or for typo checking. While there are well
discussed methods (moreLikeThis, did you mean, similarity, ... etc.) in the
Lucene community, I am still wondering whether one can use user search
history data for such a purpose and, if the answer is yes, then how
(practical examples are welcome).
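
To make my question more concrete, here is the kind of thing I have in mind:
index every past raw query as a tiny document and fuzzy-match new input
against it to get "did you mean" style suggestions. This is only a rough,
untested sketch against a recent Lucene API (class names and the "query"
field are my own choices):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class HistorySuggester {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            // index past user queries from the log, one document per query
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                for (String past : new String[] {
                        "hadoop map reduce", "lucene fuzzy query", "pig latin tutorial"}) {
                    Document doc = new Document();
                    doc.add(new TextField("query", past, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            // a misspelled new query is fuzzy-matched against the history index
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits =
                        searcher.search(new FuzzyQuery(new Term("query", "lucen")), 5).scoreDocs;
                for (ScoreDoc hit : hits) {
                    // searcher.doc(...) is the classic call; very new Lucene
                    // versions use storedFields().document(...) instead
                    System.out.println("did you mean: "
                            + searcher.doc(hit.doc).get("query"));
                }
            }
        }
    }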

Lukas

On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
>
>
>
> Lukas Vlcek wrote:
> > Hi Enis,
> >
> Hi again,
> > On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> Lukas Vlcek wrote:
> >>
> >>> Hi,
> >>>
> >>> I would like to keep user search history data and I am looking for
> >>> some ideas/advice/recommendations. In general I would like to talk
> >>> about methods of storing such data, its structure, and how to turn it
> >>> into valuable information.
> >>>
> >>> As for the structure:
> >>> ==============
> >>> For now I don't have an exact idea about what kind of information I
> >>> should keep. I know that this is application specific but I believe
> >>> there can be some common general patterns. As of now I think the
> >>> following can be useful to keep:
> >>>
> >>> 1) system time (time of issuing the query) and userid
> >>> 2) original user query in raw form (untokenized)
> >>> 3) expanded user query (both tokenized and untokenized can be useful)
> >>> 4) query execution time
> >>> 5) # of objects retrieved from index
> >>> 6) # of total object count in index (this can change during time)
> >>> 7) and possibly if the user clicked some result and if so then which
> >>> one (the hit number) and system time
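
(So one logged query might end up as a single tab-separated line like the
one below; the exact layout and values are only illustrative, of course:)

    1186754581000  user42  "hadop logs"  +hadoop +log  127ms  52  1203345  click=3@1186754589000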
> >>>
> >>>
> >>>
> >> Remember that you may not want to store all the information available
> >> at runtime of the query, since it may result in a great performance
> >> burden. For example you may want to store the raw form of the query,
> >> but not the parsed form, since you can later parse the query anyway
> >> (unless you have some architecture change). Similarly, 6 did not seem
> >> a good choice to me (again, you can store that info externally). You
> >> can look at the common and extended log formats stored by web servers.
> >>
> >
> >
> > The problem is that all this information does change over time. The
> > index is updated continuously, which means that the expanded queries
> > and the total number of documents in the index change as well. But you
> > are right that getting some of this info can cause extra performance
> > costs (then it would be a question of later optimization of the
> > architecture design).
> >
> Well I think you can at least store the size of the index in another
> file, and log the changes in the index size there. The motivation for
> this comes from storage efficiency. You may not want to store the same
> index size over and over again for the n queries issued before the index
> size changes, but rather store it once, with the time, per change.
> >
> >
> >>> As for the information I can get from this:
> >>> =============================
> >>> Such minimal data collection could show if the search engine serves
> >>> users well or not (generally speaking). I should note that for the
> >>> users in this case the only other option is not to use the search
> >>> engine at all (so the data should not be biased by the fact that users
> >>> are using an alternative search method). I should be able to learn if:
> >>>
> >>> 1) there are bottleneck queries (Prefix, Fuzzy, Proximity queries...)
> >>> 2) users are finding what they want (they can find it fast and the
> >>> results are ordered by a properly defined relevance [my model is well
> >>> tuned in terms of term weights], so the result they click is among the
> >>> first hits)
> >>> 3) users can formulate queries well (do they issue queries which
> >>> return all index documents, or can they issue queries which return
> >>> just a couple of documents)
> >>> 4) ...?... etc...
> >>>
> >>>
> >>>
> >> Web server log analysis is a very popular topic nowadays, and you can
> >> check the literature, especially on clickthrough data analysis. All the
> >> major search engines have to interpret this data to improve their
> >> algorithms, and to learn from the latent "collective knowledge" hidden
> >> in web server logs.
> >>
> >
> >
> > It seems I have to do my homework and check CiteSeer for some papers :-)
> > Is there any paper you can recommend? A good one to start with?
> > What I want to achieve is far beyond the scope of the project I am
> > working on right now, thus I cannot spend all my time on research (in
> > spite of the fact I would love to), so I can either a) use some tool
> > which is already available (open sourced) and directly fits my needs
> > (I don't think there is any tool I could use out of the box) or
> > b) implement something new from scratch but with just very limited
> > functionality.
> >
> You do not have to implement this from scratch. You just have to specify
> your data mining tasks, then write scripts (in Pig Latin) or write
> map-reduce programs (in Hadoop). Neither of these is that hard. I do
> not think that there is any tool which may satisfy all your information
> needs. So at the risk of repeating myself, I suggest you look at Pig
> and write some scripts to mine the data.
>
> Coming to literature, I can hardly suggest any specific paper, since I
> am not very into the subject either. But I suggest you skip this step,
> first build your data structures (log format), then start extracting
> some naive statistical information from the data. For example, initially
> you may want to know
> 1. average query execution time
> 2. average query execution time per query type (boolean, fuzzy, etc.)
> 3. histogram of query types (how many boolean queries, etc.)
> 4. average # of queries per user session.
> 5. etc.
>
> The list can go on and on depending on the data you have and the
> information you want. These kinds of simple statistical analyses are
> very easy to extract and relatively easy to interpret.
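
For example, I suppose 2 above could start out as nothing more than one pass
over the flat log file, long before any Pig or map-reduce is involved (an
untested sketch; the tab-separated column layout is just my guess at a
format):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    /** Average query execution time per query type from a tab-separated log. */
    public class QueryTimeStats {
        public static void main(String[] args) throws IOException {
            Map<String, long[]> perType = new HashMap<>(); // type -> {count, total ms}
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                String[] cols = line.split("\t");
                String type = cols[2];                 // assumed: query type column
                long millis = Long.parseLong(cols[3]); // assumed: execution time in ms
                long[] acc = perType.computeIfAbsent(type, t -> new long[2]);
                acc[0]++;
                acc[1] += millis;
            }
            for (Map.Entry<String, long[]> e : perType.entrySet()) {
                System.out.printf("%s: %d queries, avg %.1f ms%n", e.getKey(),
                        e.getValue()[0], (double) e.getValue()[1] / e.getValue()[0]);
            }
        }
    }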
>
> >
> >>> As for the storage method:
> >>> ===================
> >>> I was planning to keep such data in a database but now it seems to me
> >>> that it will be better to keep it directly in an index (Lucene index).
> >>> It seems to me that this approach would allow me better fuzzy searches
> >>> across the history and extracting relevant objects and their count
> >>> more efficiently (with the benefit of relevance based search on top of
> >>> the history search corpus).
> >>>
> >>> I think that a more scalable solution would be to keep such data in a
> >>> pure flat file and then periodically recreate the search history index
> >>> (or more indices) from it (for example by a Map-Reduce like task).
> >>> Even better, the flat file could be stored in a distributed file
> >>> system. However, for now I would like to start with something simple.
> >>>
> >>>
> >> I would rather suggest you keep the logs in rolling flat files.
> >> Accessing the database for each search will take a lot of time. Then
> >> you may want to flush those logs to the db once a day if you indeed
> >> want to store the data in a relational way.
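
A minimal version of the rolling flat file idea could be one file per day,
so yesterday's file is automatically "closed" and can be bulk-loaded into
the db or handed to Pig/Hadoop (again just a sketch, all names made up):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.time.LocalDate;

    /** Appends one tab-separated line per query to a daily rolling flat file. */
    public class RollingQueryLog {
        private final Path dir;

        public RollingQueryLog(Path dir) {
            this.dir = dir;
        }

        public synchronized void log(String userId, String rawQuery,
                                     long millis, int hits) throws IOException {
            // one file per day, e.g. queries-2007-08-10.log
            Path file = dir.resolve("queries-" + LocalDate.now() + ".log");
            String line = System.currentTimeMillis() + "\t" + userId + "\t"
                    + rawQuery.replace('\t', ' ') + "\t" + millis + "\t" + hits + "\n";
            Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }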
> >>
> >> I infer that you want to mine the data, but you do not know what to
> >> mine, right? I suggest you look at Hadoop and Pig. Pig is designed
> >> especially for this purpose.
> >>
> >
> >
> > You've hit the nail on the head! I am very curious about how one can use
> > such data to improve the user experience with the search engine (given
> > my project schedule time constraints).
> >
> >
> >> I know this is a complex topic...
> >>
> >>> Regards,
> >>> Lukas
> >>>
> >>>
> >>>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> > Anyway, thanks for your reply!
> >
> > BR
> > Lukas
> >
> >
>
