Re: Fastlz coming?

2009-06-04 Thread Johan Oskarsson
We're still using LZO; it works great for those big log files:
http://code.google.com/p/hadoop-gpl-compression/

/Johan

Kris Jirapinyo wrote:
 Hi all,
 In the "remove LZO" JIRA ticket,
 https://issues.apache.org/jira/browse/HADOOP-4874, Tatu mentioned he was
 going to port FastLZ from C to Java and provide a patch.  Have there been any
 updates on that?  Or is anyone working on any additional custom compression
 codecs?
 
 Thanks,
 Kris J.
 



Re: Splittable lzo files

2009-03-03 Thread Johan Oskarsson
We use it with Python (dumbo) and streaming, so it should certainly be 
possible. I haven't tried it myself though, so I can't give any pointers.


/Johan

Miles Osborne wrote:

That's very interesting.  For us poor souls using streaming, would we
be able to use it?

(Right now I'm looking at a 100+ GB gzipped file ...)

Miles

2009/3/3 Johan Oskarsson jo...@oskarsson.nu:

Hi,

Thought I'd pass on this blog post I just wrote about how we compress our
raw log data in Hadoop using LZO at Last.fm.

The essence of the post is that we're able to make the files splittable by
indexing where each compressed chunk starts in the file, similar to the gzip
input format being worked on.
This actually gives us a performance boost in certain jobs that read a lot
of data, while saving us disk space at the same time.

http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
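
For anyone curious, the indexing step boils down to something like the
following rough Java sketch (not the actual hadoop-gpl-compression code).
The 8-byte per-chunk header assumed here, an int uncompressed size followed
by an int compressed size, is borrowed from the lzop container layout, so
check the real format before relying on it:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Sketch only: walk a block-compressed file, record the byte offset where each
// compressed chunk starts, and write the offsets to a ".index" side file so
// MapReduce splits can later be aligned to chunk boundaries.
public class LzoChunkIndexer {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
             DataOutputStream index = new DataOutputStream(new FileOutputStream(args[0] + ".index"))) {
            long offset = 0;
            while (true) {
                int uncompressedSize;
                try {
                    uncompressedSize = in.readInt();     // assumed header field
                } catch (EOFException eof) {
                    break;                               // no more chunks
                }
                if (uncompressedSize == 0) {
                    break;                               // assumed end-of-stream marker
                }
                int compressedSize = in.readInt();       // assumed header field
                index.writeLong(offset);                 // this is where the chunk starts
                in.readFully(new byte[compressedSize]);  // skip over the compressed payload
                offset += 8 + compressedSize;            // 8 header bytes + payload
            }
        }
    }
}

A record reader can then seek to the nearest indexed offset at or after its
split start and decompress chunk by chunk from there.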

/Johan

Re: Hadoop User Group UK Meetup - April 14th

2009-02-18 Thread Johan Oskarsson
Registrations to the next Hadoop User Group UK meetup have now opened:
http://huguk.eventwax.com/hadoop-user-group-uk-2

The preliminary schedule:
10.00 – 10.15: Arriving and chatting
10.15 – 11.15: Practical MapReduce (Tom White, Cloudera)
11.15 – 12.15: Introducing Apache Mahout (Isabel Drost, ASF)
12.15 – 13.15: Lunch
13.15 – 14.15: Terrier (Iadh Ounis and Craig Macdonald, University of
Glasgow)
14.15 – 15.15: Having Fun with PageRank and MapReduce (Paolo Castagna, HP)
15.15 – 16.15: Apache HBase (Michael Stack, Powerset)
16.15 – 17.00: General chat, perhaps lightning talks (powered by Sun beer)
17.00 – 00.00: Discussion continues at a nearby pub

The event is hosted by Sun in London, near Monument station. For more
details see the event page or the blog: http://huguk.org/

/Johan

Johan Oskarsson wrote:
 I've started organizing the next Hadoop meetup in London, UK. The date
 is April 14th and the presentations so far include:
 
 Michael Stack (Powerset): Apache HBase
 Isabel Drost (Neofonie): Introducing Apache Mahout
 Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier
 Paolo Castagna (HP): Having Fun with PageRank and MapReduce
 
 Keep an eye on the blog for updates: http://huguk.org/
 
 Help in the form of sponsoring (venue, beer etc) would be much
 appreciated. Also let me know if you want to present. Personally I'd
 love to see presentations from other Hadoop related projects (pig, hive,
 hama etc).
 
 /Johan



Hadoop User Group UK Meetup - April 14th

2009-02-02 Thread Johan Oskarsson
I've started organizing the next Hadoop meetup in London, UK. The date
is April 14th and the presentations so far include:

Michael Stack (Powerset): Apache HBase
Isabel Drost (Neofonie): Introducing Apache Mahout
Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier
Paolo Castagna (HP): Having Fun with PageRank and MapReduce

Keep an eye on the blog for updates: http://huguk.org/

Help in the form of sponsoring (venue, beer etc) would be much
appreciated. Also let me know if you want to present. Personally I'd
love to see presentations from other Hadoop related projects (pig, hive,
hama etc).

/Johan


Re: Practical limits on number of blocks per datanode.

2008-11-21 Thread Johan Oskarsson
Hi Rick,

Unfortunately, 4,800,000 blocks per node is going to be too much. Ideally
you'd want to merge your files into as few as possible; even 1MB per
file is quite small for Hadoop. Would it be possible to merge them into
files of hundreds of MB, or preferably gigabytes?

In newer Hadoop versions there is an archive feature that can pack many
files into a single archive for you. This can then be processed transparently
by Hadoop. I haven't used it myself though, so I can't tell if it's worth the
effort.
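
As an illustration of the merging route, here is a rough sketch (paths are
made-up examples, not something from our setup) that packs many small HDFS
files into one SequenceFile, keyed by the original path:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: pack a directory of many small HDFS files into one SequenceFile,
// using the original path as the key and the file contents as the value.
public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // e.g. /data/small-files (example path)
        Path packed = new Path(args[1]);     // e.g. /data/packed.seq  (example path)

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, packed, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) {
                    continue;                    // only pack plain files
                }
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(0, contents);   // files are small, read them whole
                } finally {
                    in.close();
                }
                writer.append(new Text(status.getPath().toString()),
                              new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

A MapReduce job can then read the packed file with SequenceFileInputFormat
instead of opening millions of tiny files.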

I ran into issues with too many blocks per datanode before and it's not
fun, they start losing contact with the namenode with all kinds of
interesting side effects.

/Johan

Rick Hangartner wrote:
 
 Hi,
 
 We are in the midst of considering Hadoop as a prototype solution for a
 system we are building.  In the abstract, Hadoop and MapReduce are very
 well-suited to our computational problem.  However, this email exchange
 has caused us some concern that we are hoping the user community might
 allay.  We've searched JIRA for relevant issues but didn't turn up
 anything. (We probably aren't as adept as we might be at surfacing
 appropriate items though.)
 
 Here are the relevant numbers for the data we are using to prototype a
 system using Hadoop 0.18.1:
 
 We have 16,000,000 files that are 10K each, or about 160GB total.  We
 have 10 datanodes with the default replication factor of 3.  Each file
 will probably be stored as a single block, right?  This means we would
 be storing 48,000,000 block replicas on 10 datanodes, or 4,800,000 blocks per node.
 
 At 160GB, the total data is not particularly large.  Unfortunately, the
 attached email exchange suggests we could have a problem with the large
 number of blocks per node.  We have considered combining a number of
 small files into larger files (say, concatenating sets of 100 files into
 single larger files so we have 48,000 blocks that are 1MB in size per
 node).  This would not significantly affect our MapReduce algorithm, but
 it could undesirably complicate other components of the system that use
 this data.
 
 Thanks in advance for any insights on the match between Hadoop (0.18.x
 and later) and our particular system requirements.
 
 RDH
 
 Begin forwarded message:
 
 From: Konstantin Shvachko [EMAIL PROTECTED]
 Date: November 17, 2008 6:27:42 PM PST
 To: core-user@hadoop.apache.org
 Subject: Re: The Case of a Long Running Hadoop System
 Reply-To: core-user@hadoop.apache.org

 Bagri,

 According to the numbers you posted, your cluster has 6,000,000 block
 replicas and only 12 data-nodes. The blocks are small, on average about
 78KB according to fsck, so each node contains about 40GB worth of block
 data. But the number of blocks is really huge: 500,000 per node. Is my
 math correct? I haven't seen data-nodes that big yet.
 The problem here is that a data-node keeps a map of all its blocks in
 memory. The map is a HashMap; with 500,000 entries you can get long
 lookup times, I guess. Block reports can also take a long time.

 So I believe restarting the name-node will not help you.
 You should somehow pack your small files into larger ones.
 Alternatively, you can increase your cluster size, probably to 5 to 10
 times larger.
 I don't remember whether we have had any optimization patches related to
 the data-node block map since 0.15. Please advise if anybody remembers.

 Thanks,
 --Konstantin


 Abhijit Bagri wrote:
 We do not have a secondary namenode because 0.15.3 has a serious bug
 which truncates the namenode image if there is a failure while the
 namenode fetches the image from the secondary namenode. See HADOOP-3069.
 I have a patched version of 0.15.3 for this issue. From the patch for
 HADOOP-3069, the changes are on the namenode _and_ the secondary namenode,
 which means I can't just fire up a secondary namenode.
 - Bagri
 On Nov 15, 2008, at 11:36 PM, Billy Pearson wrote:
 If I understand correctly, the secondary namenode merges the edit log into
 the fsimage and reduces the edit log size.
 That is likely the root of your problems: 8.5G seems large and is
 likely putting a strain on your master server's memory and I/O bandwidth.
 Why do you not have a secondary namenode?

 If you do not have the memory on the master, I would look into
 stopping a datanode/tasktracker on one server and loading the
 secondary namenode on it.

 Let it run for a while and watch the secondary namenode's log;
 you should see your edit log get smaller.

 I am not an expert, but that would be my first action.

 Billy
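
 For reference, a typical checkpointing setup looks roughly like the
 hadoop-site.xml snippet below (property names taken from 0.18-era
 hadoop-default.xml and may differ slightly in 0.15; the path is a made-up
 example). The daemon itself is normally started on the chosen machine with
 bin/hadoop-daemon.sh start secondarynamenode.

 <!-- Sketch of hadoop-site.xml entries for secondary namenode checkpointing;
      verify the property names against your Hadoop version. -->
 <property>
   <name>fs.checkpoint.dir</name>
   <value>/hadoop/checkpoint</value> <!-- example path where the merged image is staged -->
 </property>
 <property>
   <name>fs.checkpoint.period</name>
   <value>3600</value> <!-- seconds between merges of the edit log into the fsimage -->
 </property>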


 



Re: Hadoop User Group (Bay Area) Oct 15th

2008-10-15 Thread Johan Oskarsson
Since I'm not based in the San Francisco area, I would love to see the slides
from this meetup uploaded somewhere. The database join techniques talk in
particular sounds very interesting to me.

/Johan

Ajay Anand wrote:
 The next Bay Area User Group meeting is scheduled for October 15th at
 Yahoo!, 2821 Mission College Blvd, Santa Clara, Building 1, Training
 Rooms 3 & 4, from 6:00-7:30 pm.
 
 Agenda:
 - Exploiting database join techniques for analytics with Hadoop: Jun
 Rao, IBM
 - Jaql Update: Kevin Beyer, IBM
 - Experiences moving a Petabyte Data Center: Sriram Rao, Quantcast
 
 Look forward to seeing you there!
 Ajay



Hadoop User Group UK

2008-07-14 Thread Johan Oskarsson

Update on the Hadoop user group in the UK:

It will be hosted at Skills Matter in Clerkenwell, London on August 19. 
We'll have presentations from both developers and users of Apache Hadoop.


The event is free and anyone is welcome, but we only have room for 60
people, so make sure you're on the attending list at
http://upcoming.yahoo.com/event/506444 if you're coming.
We're sponsored by the Yahoo! Developer Network (lunch+beer), Skills Matter
(beer) and Last.fm (room hire), thanks guys!


If you're interested in speaking, please let us know at
[EMAIL PROTECTED]; we can still squeeze in some interesting
presentations or lightning talks.


Preliminary times:
10.00 - 10.45: Doug Cutting (Project founder, Yahoo!) - Hadoop overview
10.45 - 11.30: Tom White (Lexemetech) - Hadoop on Amazon S3/EC2
11.30 - 12.15: Steve Loughran and Julio Guijarro (HP) - SmartFrog and 
Hadoop
12.15 - 13.15: Free lunch! (Sandwich, fruit, drink and crisps. Meat and 
veggie options available)
13.15 - 14.00: Martin Dittus and Johan Oskarsson (Last.fm) - Hadoop 
usage at Last.fm

14.00 - 15.00: Lightning talks (5-10 minutes each)
15.00 - 16.00: Panel discussion
16.00 - 17.00: Free beer!
17.00 - xx.xx: Wandering to a nearby pub

Lightning talks include:
Miles Osborne (University of Edinburgh) - Using Nutch and Hadoop for 
Natural Language Processing

Tim Sell (Last.fm intern) - PostgreSQL to HBase replication

For those of you who cannot attend, we'll try to put the presentations up on
the wiki and perhaps even record the event in some fashion.


/Johan