Hi Tonci

Public Data Sets - Check out infochimps.org/ or 
aws.amazon.com/publicdatasets/  

I find that a lot of the Hadoopified algorithms out there originate from
Linguistics departments (TF-IDF is one example), but have you considered
looking into Information Theory, i.e. entropy analytics using algorithms
like Pointwise Mutual Information? I'd imagine most government security
agencies would be interested in using Hadoop for signal processing and code
breaking, especially given the cost savings of using commodity machines. The
trick will be to find a dataset that suits your algorithm.
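
To make the PMI suggestion a bit more concrete, here is a rough sketch in
plain Python (not Hadoop code). It assumes the usual definition
PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), with the probabilities estimated
as the fraction of documents containing a word or a word pair. The function
name pmi_scores and the whitespace tokenizing are only illustrative
assumptions. The counting loop is the part that would become map and reduce
steps on a cluster.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return PMI (in bits) for each word pair that co-occurs in a document.

    Illustrative sketch: probabilities are document frequencies, and
    tokenizing is a plain lowercase whitespace split.
    """
    n_docs = len(documents)
    word_counts = Counter()   # number of documents containing each word
    pair_counts = Counter()   # number of documents containing each word pair

    for doc in documents:
        # Count each word at most once per document; sorting gives every
        # pair a canonical (alphabetical) order.
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Calling pmi_scores on a list of strings returns a dict keyed by
alphabetically ordered word pairs. Pairs that never co-occur are simply
absent, since their PMI would be minus infinity.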

Kind regards
Steve Watt




From: Tonci Buljan <tonci.bul...@gmail.com>
To: common-user@hadoop.apache.org
Date: 03/01/2010 08:27 AM
Subject: Re: Hadoop as master's thesis



Thank you for your reply.


 I didn't mention that I had already installed Hadoop on 2 machines back at home
(for an essay on Hadoop that I wrote), one as a namenode and datanode and one
as a datanode only. Everything worked perfectly. I would really like to install
it on more machines to see how a cluster works in more detail. So I was
thinking: "Now I have a cluster, where do I find a large dataset to work
with?"


 I like your idea about publicly available datasets. Do you have any links
for those?

The other idea, about student grades, is also great (thank you for that), and
I might just start with that.


 Thank you very much, you both really helped me.


On 1 March 2010 15:15, Mark Kerzner <markkerz...@gmail.com> wrote:

> Tonci,
>
> to start with, you can run Hadoop on one computer in pseudo-cluster mode.
> Installing and configuring will be enough of a headache on its own. Then you
> can think of a problem, such as processing student records and grades to find
> some statistics, or relating grades to students' future achievements. Or, you
> can look at some publicly available datasets and do something with them.
>
> Cheers,
> Mark
>
> On Mon, Mar 1, 2010 at 8:01 AM, Tonci Buljan <tonci.bul...@gmail.com>
> wrote:
>
> > Hello everyone,
> >
> >  I'm thinking of using Hadoop as a subject in my master's thesis in
> > Computer Science. I'm supposed to solve some kind of a problem with Hadoop,
> > but can't think of any :)).
> >
> >  We have a lab with 10-15 computers and I thought of installing Hadoop on
> > those computers, and now I should write some kind of a program to run on my
> > cluster.
> >
> >  I really hope you understood my problem :). I really need any kind of
> > suggestion.
> >
> >
> >  P.S. Sorry for my bad English, I'm from Croatia.
> >
>


