Hadoop as master's thesis

2010-03-01 Thread Tonci Buljan
Hello everyone, I'm thinking of using Hadoop as a subject in my master's thesis in Computer Science. I'm supposed to solve some kind of a problem with Hadoop, but can't think of any :)). We have a lab with 10-15 computers and I thought of installing Hadoop on those computers, and now I should

Re: Hadoop as master's thesis

2010-03-01 Thread Mark Kerzner
Tonci, to start with, you can run Hadoop on one computer in pseudo-distributed mode. Installing and configuring will be enough of a headache on its own. Then you can think of a problem, such as processing student records and grades to find some statistics, or relating grades to future achievements. Or, you can

Re: Hadoop as master's thesis

2010-03-01 Thread Tonci Buljan
Thank you for your reply. I didn't mention that I already installed Hadoop on 2 machines back at home (for an essay on Hadoop which I did), one as a namenode and datanode and one as a datanode only. Everything worked perfectly. I would really try to install it on more machines to see how cluster

Re: Hadoop as master's thesis

2010-03-01 Thread Mark Kerzner
Tonci, here are the Enron email files used in the litigation that they had: http://edrm.net/resources/data-sets/enron-data-set-files Here is much more stuff: http://infochimps.org/ Sincerely, Mark On Mon, Mar 1, 2010 at 8:24 AM, Tonci

Re: Sun JVM 1.6.0u18

2010-03-01 Thread Edward Capriolo
On Mon, Mar 1, 2010 at 6:37 AM, Steve Loughran ste...@apache.org wrote: Todd Lipcon wrote: On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey sc...@richrelevance.com wrote: I have found some notes that suggest that -XX:-ReduceInitialCardMarks will work around some known crash problems with 6u18,

Re: Hadoop as master's thesis

2010-03-01 Thread Otis Gospodnetic
Bok Tonci, You'll find good dataset pointers here: http://www.simpy.com/user/otis/search/dataset You may find inspiration for Hadoop usage here, assuming you have an ML background: http://cwiki.apache.org/MAHOUT/algorithms.html Oh, and you may also want to look out for GSOC (Google Summer

Re: Hadoop as master's thesis

2010-03-01 Thread Steve Loughran
Tonci Buljan wrote: Hello everyone, I'm thinking of using Hadoop as a subject in my master's thesis in Computer Science. I'm supposed to solve some kind of a problem with Hadoop, but can't think of any :)). Well, you need some interesting data, then mine it. So ask around. Physicists often

LocalDirAllocator error

2010-03-01 Thread Ted Yu
Hi, We use Hadoop 0.20.1. I saw the following in our log: 2010-02-27 10:05:09,808 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Failed to create /disk2/opt/kindsight/hadoop/data/mapred/local [r...@snv-qa-lin-cg ~]# df Filesystem 1K-blocks Used Available Use%

Re: cluster involvement trigger

2010-03-01 Thread Amogh Vasekar
Hi, You mentioned you pass the files packed together using the -archives option. This will uncompress the archive on the compute node itself, so the namenode won't be hampered in this case. However, cleaning up the distributed cache is a tricky scenario (the user doesn't have explicit control over

Re: Hadoop as master's thesis

2010-03-01 Thread Tonci Buljan
Thank you all for your replies. Matteo, I'm definitely interested in what you did, and I would be very happy to check it out in detail. Mark Kerzner's link http://infochimps.org/ was very useful. Thank you Mark for that. I'll probably download and work with some data from there. For Marko (in

Re: Big-O Notation for Hadoop

2010-03-01 Thread Edward Capriolo
On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni dar...@ontrenet.com wrote: Theoretically, O(n). All other variables being equal across all nodes should...m.reduce to n. The part that really can't be measured is the cost of Hadoop's bookkeeping chores as the data set grows since some

Re: Big-O Notation for Hadoop

2010-03-01 Thread Darren Govoni
It's a Turing-class problem and thus non-deterministic by nature - a priori. But given the uniform aspect of map/reduce an estimate could continually be approximated - as the data is processed - noting that the farther from completion it is, the less accurate that calculation would be. And of

Re: Sun JVM 1.6.0u18

2010-03-01 Thread Scott Carey
On Mar 1, 2010, at 10:46 AM, Allen Wittenauer wrote: On 3/1/10 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote: u14 added support for the 64-bit compressed memory pointers which seemed important due to the fact that Hadoop can be memory hungry. u15 has been stable in our

RE: Sun JVM 1.6.0u18

2010-03-01 Thread Zlatin.Balevsky
1.6.0_u18 also claims to fix bug_id=5103988 which may or may not improve the performance of the transferTo code used in org.apache.hadoop.net.SocketOutputStream. -Original Message- From: Scott Carey [mailto:sc...@richrelevance.com] Sent: Monday, March 01, 2010 6:41 PM To:

Re: Big-O Notation for Hadoop

2010-03-01 Thread Edward Capriolo
I am looking at this in many different ways. For example, shuffle sort might run faster if we have 12 disks per node rather than 8. So shuffle sort involves data size, disk speed, network speed, processor speed, and number of nodes. Can we find a formula to take these (and more factors) into account? Once

bulk data transfer to HDFS remotely (e.g. via wan)

2010-03-01 Thread jiang licht
I am considering a basic task of loading data into a Hadoop cluster in this scenario: the Hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN. One example is moving data from Amazon S3 to EC2, which is supported in the latest Hadoop by specifying