Hello everyone,
I'm thinking of using Hadoop as a subject in my master's thesis in Computer
Science. I'm supposed to solve some kind of a problem with Hadoop, but can't
think of any :)).
We have a lab with 10-15 computers and I thought of installing Hadoop on
those computers, and now I should
Tonci,
to start with, you can run Hadoop on one computer in pseudo-distributed
mode. Installing and configuring will be enough of a headache on its own.
Then you can think of a problem, such as processing student records and
grades to compute some statistics, or relating grades to students' future
achievements. Or, you can
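For example, here is a quick, untested sketch of the grades idea. It
assumes tab-separated input lines of the form student / course / grade;
class and field names are just illustrative:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GradeStats {

  // Emits (course, grade) for every well-formed record.
  public static class GradeMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 3) {
        try {
          double grade = Double.parseDouble(fields[2]);
          context.write(new Text(fields[1]), new DoubleWritable(grade));
        } catch (NumberFormatException ignored) {
          // skip malformed records
        }
      }
    }
  }

  // Averages the grades seen for each course.
  public static class AverageReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text course, Iterable<DoubleWritable> grades,
        Context context) throws IOException, InterruptedException {
      double sum = 0;
      int count = 0;
      for (DoubleWritable g : grades) {
        sum += g.get();
        count++;
      }
      context.write(course, new DoubleWritable(sum / count));
    }
  }
}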
Thank you for your reply.
I didn't mention that I already installed Hadoop on 2 machines back at home
(for an essay on Hadoop which I did), one as a namenode and datanode and one
as a datanode only. Everything worked perfectly. I would really like to try
installing it on more machines to see how a cluster
Tonci,
here are the Enron email files that were used in their litigation:
http://edrm.net/resources/data-sets/enron-data-set-files
Here is much more stuff: http://infochimps.org/
Sincerely,
Mark
On Mon, Mar 1, 2010 at 8:24 AM, Tonci
On Mon, Mar 1, 2010 at 6:37 AM, Steve Loughran ste...@apache.org wrote:
Todd Lipcon wrote:
On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey
sc...@richrelevance.com wrote:
I have found some notes that suggest that -XX:-ReduceInitialCardMarks
will work around some known crash problems with 6u18,
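If anyone needs to push that flag down to the task JVMs, one untested way
on 0.20 is the mapred.child.java.opts property; the heap size below is an
arbitrary example, and for the daemons themselves you would add the flag to
HADOOP_OPTS in conf/hadoop-env.sh instead:

import org.apache.hadoop.conf.Configuration;

public class WorkaroundConf {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // -Xmx512m is just an example heap; keep whatever you use now.
    conf.set("mapred.child.java.opts",
             "-Xmx512m -XX:-ReduceInitialCardMarks");
    return conf;
  }
}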
Hi Tonci,
You'll find good dataset pointers here:
http://www.simpy.com/user/otis/search/dataset
You may find inspiration for Hadoop usage here, assuming you have an ML background:
http://cwiki.apache.org/MAHOUT/algorithms.html
Oh, and you may also want to look out for GSOC (Google Summer of Code).
Tonci Buljan wrote:
Hello everyone,
I'm thinking of using Hadoop as a subject in my master's thesis in Computer
Science. I'm supposed to solve some kind of a problem with Hadoop, but can't
think of any :)).
Well, you need some interesting data, then mine it. So ask around.
Physicists often
Hi,
We use Hadoop 0.20.1.
I saw the following in our log:
2010-02-27 10:05:09,808 WARN
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Failed to create
/disk2/opt/kindsight/hadoop/data/mapred/local
[r...@snv-qa-lin-cg ~]# df
Filesystem 1K-blocks Used Available Use%
Hi,
You mentioned you pass the files packed together using -archives option. This
will uncompress the archive on the compute node itself, so the namenode won't
be hampered in this case. However, cleaning up the distributed cache is a
tricky scenario (user doesn't have explicit control over
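For completeness, an untested sketch of how a task can find the unpacked
archive on 0.20; the mapper type parameters here are arbitrary:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ArchiveAwareMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // Archives shipped with -archives are already localized and unpacked
    // on the compute node; this only looks up where they landed.
    Path[] archives = DistributedCache.getLocalCacheArchives(conf);
    if (archives != null) {
      for (Path p : archives) {
        System.err.println("archive unpacked at: " + p); // local fs path
      }
    }
  }
}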
Thank you all for your reply.
Matteo, I'm definitely interested in what you did, and I would be very
happy to check it out in detail. Mark Kerzner's link
http://infochimps.org/ was very useful. Thank you Mark for that. I'll
probably download and work
with some data from there.
For Marko (in
On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni dar...@ontrenet.com wrote:
Theoretically, O(n).
All other variables being equal across all nodes, it
should...map/reduce to n.
The part that really can't be measured is the cost of Hadoop's
bookkeeping chores as the data set grows, since some
It's a Turing-class problem and thus non-deterministic by nature, a
priori.
But given the uniform aspect of map/reduce, an estimate could continually
be approximated as the data is processed, noting that the farther it is
from completion, the less accurate that calculation would be. And
of
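Along those lines, the naive running estimate would look something like
this untested sketch (pure illustration, no Hadoop specifics):

// Assume the remaining work behaves like the work seen so far. As noted,
// the farther the job is from completion, the less accurate this gets.
public class EtaEstimate {
  static double secondsRemaining(double fractionDone, double elapsedSecs) {
    if (fractionDone <= 0.0) return Double.POSITIVE_INFINITY;
    return elapsedSecs * (1.0 - fractionDone) / fractionDone;
  }
}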
On Mar 1, 2010, at 10:46 AM, Allen Wittenauer wrote:
On 3/1/10 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
u14 added support for 64-bit compressed object pointers (compressed
oops), which seemed important given that Hadoop can be memory hungry. u15
has been stable in our
1.6.0_u18 also claims to fix bug_id=5103988 which may or may not improve the
performance of the transferTo code used in
org.apache.hadoop.net.SocketOutputStream.
-----Original Message-----
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Monday, March 01, 2010 6:41 PM
To:
I am looking at this in many different ways.
For example: shuffle sort might run faster if we have 12 disks per node
instead of 8.
So shuffle sort involves data size, disk speed, network speed, processor
speed, and number of nodes.
Can we find a formula that takes these (and more) factors into account?
Once
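As a strawman to start from, here is an untested first-order model; every
parameter is an assumed input, not a measured constant, and it ignores
Hadoop's bookkeeping overhead entirely:

// The shuffle can move data no faster than the slower of aggregate disk
// bandwidth and aggregate network bandwidth across the cluster.
public class ShuffleModel {
  static double shuffleSeconds(double dataBytes, int nodes,
      int disksPerNode, double diskBytesPerSec, double netBytesPerSec) {
    double diskBW = nodes * disksPerNode * diskBytesPerSec;
    double netBW = nodes * netBytesPerSec;
    return dataBytes / Math.min(diskBW, netBW);
  }
}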
I am considering a basic task of loading data into a Hadoop cluster in this
scenario: the Hadoop cluster and the bulk data reside on different boxes, e.g.
connected via LAN or WAN.
An example is moving data from Amazon S3 to EC2, which is supported
in the latest Hadoop by specifying
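For a one-off copy, an untested programmatic sketch with the FileSystem API
is below; the bucket name, namenode host, and paths are placeholders. For
big transfers the parallel distcp tool is the usual route.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class S3ToHdfsCopy {
  public static void main(String[] args) throws Exception {
    // S3 credentials go in the configuration (fs.s3n.* properties).
    Configuration conf = new Configuration();
    Path src = new Path("s3n://my-bucket/input/");          // placeholder
    Path dst = new Path("hdfs://namenode:9000/user/me/input/"); // placeholder
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    // false = do not delete the source after copying
    FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
  }
}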