Hadoop as master's thesis

2010-03-01 Thread Tonci Buljan
Hello everyone,

 I'm thinking of using Hadoop as a subject in my master's thesis in Computer
Science. I'm supposed to solve some kind of a problem with Hadoop, but can't
think of any :)).

 We have a lab with 10-15 computers and I thought of installing Hadoop on those computers, and now I should write some kind of program to run on my cluster.

 I really hope you understood my problem :). I really need any kind of
suggestion.


 P.S. Sorry for my bad English, I'm from Croatia.


Re: Hadoop as master's thesis

2010-03-01 Thread Mark Kerzner
Tonci,

to start with, you can run Hadoop on one computer in pseudo-distributed mode. Installing and configuring it will be enough of a headache on its own. Then you can think of a problem, such as processing student records and grades to compute some statistics, or correlating grades with students' future achievements. Or, you can look at some publicly available datasets and do something with them.

Cheers,
Mark
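
As a rough illustration of the student-grades idea, here is a minimal MapReduce sketch that computes the average grade per course. It assumes the records are CSV lines of the form student,course,grade; the input layout and class names are illustrative, not something from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageGrade {

  // Emits (course, grade) for every well-formed "student,course,grade" line.
  public static class GradeMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 3) {
        ctx.write(new Text(fields[1]), new DoubleWritable(Double.parseDouble(fields[2])));
      }
    }
  }

  // Averages the grades seen for each course.
  public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    protected void reduce(Text course, Iterable<DoubleWritable> grades, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0;
      int count = 0;
      for (DoubleWritable g : grades) {
        sum += g.get();
        count++;
      }
      ctx.write(course, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "average grade per course");
    job.setJarByClass(AverageGrade.class);
    job.setMapperClass(GradeMapper.class);
    job.setReducerClass(AvgReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /grades/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /grades/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same job runs unchanged on a single pseudo-distributed node and on a 10-15 machine lab cluster; only the input size and the number of task slots differ.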



Re: Hadoop as master's thesis

2010-03-01 Thread Tonci Buljan
Thank you for your reply.


 I didn't mention that I already installed Hadoop on 2 machines back at home (for an essay on Hadoop that I wrote), one as a namenode and datanode and one as a datanode only. Everything worked perfectly. I would really like to try installing it on more machines to see how a cluster works in more detail. So I was thinking: “Now I have a cluster, where do I find a large dataset to work with?”


 I like your idea about publicly available datasets; do you have any links for that?

The other idea, about student grades, is also great (thank you for that) and I might just start with that.


 Thank you very much, you both really helped me.





Re: Hadoop as master's thesis

2010-03-01 Thread Mark Kerzner
Tonci,

here are the Enron email files used in the litigation against them:
http://edrm.net/resources/data-sets/enron-data-set-files

Here is much more stuff: http://infochimps.org/

Sincerely,
Mark





Re: Sun JVM 1.6.0u18

2010-03-01 Thread Edward Capriolo
On Mon, Mar 1, 2010 at 6:37 AM, Steve Loughran ste...@apache.org wrote:
 Todd Lipcon wrote:

 On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey
 sc...@richrelevance.comwrote:


 I have found some notes that suggest that -XX:-ReduceInitialCardMarks
 will work around some known crash problems with 6u18, but that may be
 unrelated.


 Yep, I think that is probably a likely workaround as well. For now I'm
 recommending downgrade to our clients, rather than introducing cryptic XX
 flags :)


 lots of bugreps come in once you search for ReduceInitialCardMarks

 Looks like a feature has been turned on :
 http://bugs.sun.com/view_bug.do?bug_id=6889757

 and now it is in wide-beta-test

 http://bugs.sun.com/view_bug.do?bug_id=698
 http://permalink.gmane.org/gmane.comp.lang.scala/19228

 Looks like the root cause is a new Garbage Collector, one that is still
 settling down. The ReduceInitialCardMarks flag is tuning the GC, but it is
 the GC itself that is possibly playing up, or it is an old GC + some new features. Either way: trouble.

 -steve


FYI. We are still running:
[r...@nyhadoopdata10 ~]# java -version
java version 1.6.0_15
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)

u14 added support for 64-bit compressed object pointers (compressed oops), which seemed important because Hadoop can be memory hungry. u15 has been stable in our deployments. Not saying you should not go newer, but I would not go older than u14.


Re: Hadoop as master's thesis

2010-03-01 Thread Otis Gospodnetic
Bok Tonci,

You'll find good dataset pointers here:

  http://www.simpy.com/user/otis/search/dataset


You may find inspiration for Hadoop usage here, assuming you have an ML background:

  http://cwiki.apache.org/MAHOUT/algorithms.html

Oh, and you may also want to look out for GSOC (Google Summer of Code).

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Tonci Buljan tonci.bul...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Mon, March 1, 2010 9:24:53 AM
 Subject: Re: Hadoop as master's thesis
 
 Thank you for your reply.
 
 
 I didn't mention that I already installed Hadoop on 2 machines back at home
 (for a essay on Hadoop which I did), one as a namenode and datanode and one
 as a datanode only. Everything worked perfect. I would really try to install
 it on more machines to see how cluster works in more detail. So I was
 thinking:” Now I have a cluster, where do I find a large dataset to work
 with?”.
 
 
 I like your idea about publicly available datasets, do you have any links
 on that?
 
 The other idea, about student grades is also great (thank you for that) and
 I might just start with that.
 
 
 Thank you very much, you both really helped me.
 
 
 On 1 March 2010 15:15, Mark Kerzner wrote:
 
  Tonci,
 
  to start with, you can run Hadoop on one computer in pseudo-cluster mode.
  Installing and configuring will be enough headache on its own. Then you can
  think of a problem, such as process student records and grades and find
  some
  statistics, or grade and their future achievements. Or, you can look at
  some
  publicly available datasets and so something with them.
 
  Cheers,
  Mark
 
  On Mon, Mar 1, 2010 at 8:01 AM, Tonci Buljan 
  wrote:
 
   Hello everyone,
  
I'm thinking of using Hadoop as a subject in my master's thesis in
   Computer
   Science. I'm supposed to solve some kind of a problem with Hadoop, but
   can't
   think of any :)).
  
We have a lab with 10-15 computers and I tough of installing Hadoop on
   those computers, and now I should write some kind of a program to run on
  my
   cluster.
  
I really hope you understood my problem :). I really need any kind of
   suggestion.
  
  
P.S. Sorry for my bad English, I'm from Croatia.
  
 



Re: Hadoop as master's thesis

2010-03-01 Thread Steve Loughran


Well, you need some interesting data, then mine it. So ask around; physicists often have stuff.


LocalDirAllocator error

2010-03-01 Thread Ted Yu
Hi,

We use Hadoop 0.20.1.


I saw the following in our log:

2010-02-27 10:05:09,808 WARN
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Failed to create
/disk2/opt/kindsight/hadoop/data/mapred/local


[r...@snv-qa-lin-cg ~]# df
Filesystem                       1K-blocks       Used  Available Use% Mounted on
/dev/sda3                          7936288    2671376    4855256  36% /
/dev/mapper/VolGroup00-LogVol01 1864008392  396885836 1370909156  23% /opt
/dev/mapper/VolGroup00-LogVol00    3935944     194560    3538224   6% /var
/dev/sda1                           497829      16649     455478   4% /boot
tmpfs                              8219276          0    8219276   0% /dev/shm


What should I do to get past the above warning?


Thanks


Re: cluster involvement trigger

2010-03-01 Thread Amogh Vasekar
Hi,
You mentioned you pass the files packed together using the -archives option. This will uncompress the archive on the compute node itself, so the namenode won't be hampered in this case. However, cleaning up the distributed cache is a tricky scenario (the user doesn't have explicit control over it); you may search this list for the many discussions pertaining to it. And while on the topic of archives: it may not be practical for you as of now, but Hadoop Archives (har) provide similar functionality.
Hope this helps.

Amogh


On 2/27/10 12:53 AM, Michael Kintzer michael.kint...@zerk.com wrote:

Amogh,

Thank you for the detailed information.   Our initial prototyping seems to 
agree with your statements below, i.e. a single large input file is performing 
better than an index file + an archive of small files.   I will take a look at 
the CombineFileInputFormat as you suggested.

One question. Since the many small input files are all in a single jar archive managed by the name node, does that still hamper name node performance? I was under the impression these archives are only unpacked into the temporary map-reduce file space (and I'm assuming cleaned up after the map-reduce completes). Does the name node need to store the metadata of each individual file during the unpacking in this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

 Hi,
 The number of mappers initialized depends largely on your input format (the getSplits of your input format). (Almost all) input formats available in Hadoop derive from FileInputFormat, hence the 1-mapper-per-file-block notion (this is actually 1 mapper per split).
 You say that you have too many small files. In general, each of these small files (< 64 MB) will be processed by a single mapper. However, I would suggest looking at CombineFileInputFormat, which does the job of packaging many small files together depending on data locality, for better performance (initialization time is a significant factor in Hadoop's performance).
 On the other side, many small files will hamper your namenode performance, since file metadata is stored in memory, and limit its overall capacity with respect to the number of files.

 Amogh
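
To make the CombineFileInputFormat suggestion above a bit more concrete, here is a rough sketch against the old 0.20 mapred API. The class names are made up, and the wiring follows the pattern the library expects (a delegate record reader whose constructor takes a CombineFileSplit, a Configuration, a Reporter and an Integer file index), so treat it as an outline to check against your Hadoop version rather than drop-in code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small text files into a few splits, so one mapper processes many files.
public class CombinedTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  @SuppressWarnings("unchecked")
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new CombineFileRecordReader<LongWritable, Text>(job, (CombineFileSplit) split,
        reporter, (Class) SingleFileLineReader.class);
  }

  // Reads one file of the combined split with the ordinary line reader.
  public static class SingleFileLineReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader reader;

    public SingleFileLineReader(CombineFileSplit split, Configuration conf, Reporter reporter,
        Integer idx) throws IOException {
      FileSplit fileSplit = new FileSplit(split.getPath(idx), split.getOffset(idx),
          split.getLength(idx), split.getLocations());
      reader = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException { return reader.next(key, value); }
    public LongWritable createKey() { return reader.createKey(); }
    public Text createValue() { return reader.createValue(); }
    public long getPos() throws IOException { return reader.getPos(); }
    public float getProgress() throws IOException { return reader.getProgress(); }
    public void close() throws IOException { reader.close(); }
  }
}

A Java job would select it with conf.setInputFormat(CombinedTextInputFormat.class); with streaming it could be passed via -inputformat and shipped with -libjars. The base class also has a protected setMaxSplitSize(...) to cap how much data one combined split (and therefore one mapper) receives.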


 On 2/25/10 11:15 PM, Michael Kintzer michael.kint...@zerk.com wrote:

 Hi,

 We are using the streaming API. We are trying to understand what Hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution.

 With default settings (64MB chunk size in HDFS), if the input file is less 
 than 64MB, will the data processing only occur on a single TaskTracker Node, 
 even if our cluster size is greater than 1?

 For example, we are trying to figure out if hadoop is more efficient at 
 processing:
 a) a single input file which is just an index file that refers to a jar 
 archive of 100K or 1M individual small files, where the jar file is passed as 
 the -archives argument, or
 b) a single input file containing all the raw data represented by the 100K or 
 1M small files.

 With (a), our input file is 64MB.   With (b) our input file is very large.

 Thanks for any insight,

 -Michael





Re: Hadoop as master's thesis

2010-03-01 Thread Tonci Buljan
Thank you all for your replies.

Matteo, I'm definitely interested in what you did, and I would be very happy to check it out in detail. Mark Kerzner's link http://infochimps.org/ was very useful. Thank you, Mark, for that. I'll probably download and work with some data from there.



For Marko (originally in Croatian):

I had no idea there were other people in Croatia working with Hadoop. I study at FESB in Split, and the whole department that deals with distributed computing is thin. My professor didn't even know what Hadoop was when I asked him for an idea. Java is an even bigger bogeyman; the same professor teaches it, so it will be a real struggle to write any kind of master's thesis on this topic.

In any case, thanks for the reply.






On 1 March 2010 22:35, Song Liu lamfeeli...@gmail.com wrote:

 Hi Tonci, actually I am doing my master's thesis by developing algorithms on Hadoop.

 My project is to recast algorithms in MapReduce fashion and to discover whether there is an optimal choice. Most of them belong to the machine learning area. Personally, I think this is a fresh area, and if you search the main academic databases, you will find little literature about it.

 I recently made a proposal about my study on Hadoop, and I would like to discuss it with you in depth if you wish.

 Another interesting topic is to discover the limits of Hadoop. We have a very large cluster at a very high rank in the TOP500, so I'm wondering whether Hadoop can perform as we expect.

 Hope this is helpful.

 Regards
 Song Liu


 On Mon, Mar 1, 2010 at 9:16 PM, Stephen Watt sw...@us.ibm.com wrote:

  Hi Tonci
 
  Public Data Sets - Check out infochimps.org/ or
  aws.amazon.com/publicdatasets/
 
  I find a lot of the Hadoopified algorithms out there originate from linguistics departments (TF-IDF is one example), but have you considered looking into information theory? I.e., entropy analytics using algorithms like pointwise mutual information. I'd imagine most government security agencies would be interested in using Hadoop for signal processing/code breaking, especially given the cost savings of using commodity machines. The trick will be to find a dataset that suits your algorithm.
 
  Kind regards
  Steve Watt
 
 
 
 
 



Re: Big-O Notation for Hadoop

2010-03-01 Thread Edward Capriolo
On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni dar...@ontrenet.com wrote:
 Theoretically. O(n)

 All other variables being equal across all nodes
 should...m.reduce to n.

 That part that really can't be measured is the cost of Hadoop's
 bookkeeping chores as the data set grows since some things in Hadoop
 involve synchronous/serial behavior.

 On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:

 A previous post to core-user mentioned some formula to determine job
 time. I was wondering if anyone out there is trying to tackle
 designing a formula that can calculate the job run time of a
 map/reduce program. Obviously there are many variables here including
 but not limited to Disk Speed ,Network Speed, Processor Speed, input
 data, many constants , data-skew, map complexity, reduce complexity, #
 of nodes..

 As an intellectual challenge, has anyone started trying to write a formula that can take into account all these factors and try to actually predict a job time in minutes/hours?




Understood, Big-O notation is really not what I am looking for.

Given all variables are the same, a Hadoop job on a finite set of data should run for a finite time. There are parts of the process that run linearly and parts that run in parallel, but there must be a way to express how long a job actually takes (although admittedly it is very involved to figure out).


Re: Big-O Notation for Hadoop

2010-03-01 Thread Darren Govoni
It's a Turing-class problem and thus non-deterministic by nature - a priori.

But given the uniform aspect of map/reduce, an estimate could be continually refined as the data is processed, noting that the farther from completion it is, the less accurate that calculation would be. And of course, once completed, the estimate is 100% accurate.

In order to do it beforehand, you would need an algorithm that can examine your map/reduce code and predict the running cost. Without data on prior runs, it's not -mathematically- possible.

As a function of cycle complexity over time (which is what big O is), map/reduce will scale somewhat linearly (maybe even log n) with regard to data - is my hunch. There's probably a quotient in there for the bookkeeping that no one has data on yet, though. But it's a good inquiry.
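
A crude version of that running approximation (nothing Hadoop-specific, just one assumed way of doing it): if p is the fraction of map and reduce work reported complete so far and t_elapsed the wall-clock time spent, then

\hat{T}_{total} \approx \frac{t_{elapsed}}{p}, \qquad \hat{T}_{remaining} \approx t_{elapsed} \cdot \frac{1 - p}{p}

which, as noted above, is unreliable early in the job (small p) and exact only at p = 1.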





Re: Sun JVM 1.6.0u18

2010-03-01 Thread Scott Carey

On Mar 1, 2010, at 10:46 AM, Allen Wittenauer wrote:

 
 
 
 On 3/1/10 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 u14 added support for the 64bit compressed memory pointers which
 seemed important due to the fact that hadoop can be memory hungry. u15
 has been stable in our deployments. Not saying you should not go
 newer, but I would not go older then u14.
 
 How are the compressed memory pointers working for you?  I've been debating
 turning them on here, so real world experience would be useful from those
 that have taken plunge.
 

Been using it since they came out, both for Hadoop where needed and in many 
other applications.  Performance gains and memory reduction in most places -- 
sometimes rather significant (25%).  GC times significantly lower for any heap 
that is reference heavy.  Heaps are still a little larger than a 32 bit one, 
but the benefits of native 64 bit code on x86 include improved computational 
performance as well.  6u18 introduces some performance enhancements to the 
feature that we might be able to use if 6u19 fixes the other bugs.  The next 
Hotspot version will make it the default setting, whenever that gets integrated into and tested in the JDK6 line. 6u14 and 6u18 are the last two JDK releases with updated Hotspot versions.

RE: Sun JVM 1.6.0u18

2010-03-01 Thread Zlatin.Balevsky
1.6.0_18 also claims to fix bug_id=5103988, which may or may not improve the performance of the transferTo code used in org.apache.hadoop.net.SocketOutputStream.



Re: Big-O Notation for Hadoop

2010-03-01 Thread Edward Capriolo
I am looking at this in many different ways.

For example: shuffle/sort might run faster if we have 12 disks per node instead of 8.

So shuffle/sort involves data size, disk speed, network speed, processor speed, and number of nodes.

Can we find a formula that takes these (and more factors) into account? Once we find it, we should be able to plug in 12 or 8 and get a result close to the shuffle/sort time.

I think it would be rather cool to have a long, drawn-out formula that even made reference to some constants, like the time to copy data to the distributed cache.

I am looking at source data size, map complexity, map output size, shuffle/sort time, reduce complexity, and number of nodes, and trying to arrive at a formula that will say how long a job will take.

From there we can factor in something like all nodes having 10G Ethernet and watch the entire thing fall apart :)
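
For what it's worth, one very rough shape such a formula could take is sketched below; it is only a back-of-the-envelope guess, and every symbol in it is an assumption rather than something measured on a real cluster:

T_{job} \approx T_{startup} + \frac{D_{in}}{N \cdot r_{map}} + \frac{D_{shuffle}}{N \cdot r_{net}} + \frac{D_{shuffle} + D_{out}}{N \cdot r_{reduce}} + C_{bookkeeping}(N)

where D_in, D_shuffle and D_out are the input, intermediate and output byte counts, N is the number of nodes, r_map, r_net and r_reduce are effective per-node scan, transfer and reduce rates (these fold in disk count, disk speed, CPU and map/reduce complexity), and C_bookkeeping(N) covers job setup, task scheduling and distributed-cache copies. Going from 8 to 12 disks per node would then show up as larger r_map and r_reduce. Data skew breaks the estimate quickly, since the job only finishes when its slowest task does.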




On 3/1/10, brien colwell xcolw...@gmail.com wrote:
 Map reduce should be a constant factor improvement for the algorithm
 complexity. I think you're asking for the overhead as a function of
 input/cluster size? If your algorithm has some complexity O(f(n)), and
 you spread it over M nodes (constant), with some merge complexity less
 than f(n), the total time will still be O(f(n)).

 I run a small job, measure the time, and then extrapolate based on the bigO.











bulk data transfer to HDFS remotely (e.g. via wan)

2010-03-01 Thread jiang licht
I am considering a basic task of loading data into a Hadoop cluster in this scenario: the Hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.

An example of this is moving data from Amazon S3 to EC2, which is supported in the latest Hadoop by specifying s3(n)://authority/path in distcp.

But generally speaking, what is the best way to load data into a Hadoop cluster from a remote box? Clearly, in this scenario, it is unreasonable to copy the data to the local name node and then issue some command like hadoop fs -copyFromLocal to put the data in the cluster (besides this, the desired data transfer tool is also a factor: scp or sftp, gridftp, ..., compression and encryption, ...).
 
I am not aware of any generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to Hadoop):
 
cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
 
There are pros (it is simple and will do the job without storing a local copy of each data file and then running a command like 'hadoop fs -copyFromLocal') and cons (it will obviously need many such pipelines running in parallel to speed up the job, at the cost of creating processes on the remote machines to read data and maintaining ssh connections; so if the data files are small, it is better to archive the small files into a tar file before calling 'cat'). As an alternative to using 'cat', a program can be written to keep reading data files and dumping them to stdin in parallel.
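
Along the same lines as the last point, a small client using the FileSystem API can push a file straight into HDFS from the remote box without piping through ssh at all, provided the Hadoop client jars and the cluster configuration are available there and the namenode/datanodes are reachable over the WAN. A minimal sketch (the host name, port and paths are placeholders, not from this thread):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RemotePut {
  public static void main(String[] args) throws Exception {
    String localFile = args[0];  // e.g. /data/part-001.tar (placeholder)
    String dst = args[1];        // e.g. hdfs://namenode.example.com:9000/incoming/part-001.tar (placeholder)

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);

    InputStream in = new BufferedInputStream(new FileInputStream(localFile));
    OutputStream out = fs.create(new Path(dst));  // blocks stream from this box directly to the datanodes

    IOUtils.copyBytes(in, out, 64 * 1024, true);  // copy and close both streams
  }
}

The same trade-offs apply as with the ssh pipeline: each copy opens its own connections, so for lots of small files it is still better to tar them up first and push the archives.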
 
Any comments about this or thoughts about a better solution?
 
Thanks,
--
Michael