Re: hadoop jobs take long time to setup

2009-06-28 Thread tim robertson
How long does it take to start the code locally in a single thread? Can you reuse the JVM so it only starts once per node per job? conf.setNumTasksToExecutePerJvm(-1) Cheers, Tim On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com wrote: Hi. Wonder how one should
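A minimal sketch of the setting Tim mentions, assuming the old (org.apache.hadoop.mapred) API available from Hadoop 0.19; the driver class name is a placeholder:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseSetup {
        public static void main(String[] args) {
            JobConf conf = new JobConf(JvmReuseSetup.class);
            // -1 = reuse each child JVM for an unlimited number of tasks, so
            // JVM startup (and any expensive static setup) is paid once per
            // node per job; equivalent to mapred.job.reuse.jvm.num.tasks = -1
            conf.setNumTasksToExecutePerJvm(-1);
        }
    }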

Re: Help with EC2 using HDFS client (port open issue?)

2009-06-20 Thread tim robertson
/hadoop-core-user/200905.mbox/%3cdfd95197f3ae8c45b0a96c2f4ba3a2556bf123e...@sc-mbxc1.thefacebook.com%3e - Harish On Sat, Jun 20, 2009 at 6:12 PM, tim robertson timrobertson...@gmail.com wrote: Hi all, I am using Hadoop to build a read only store for voldemort on EC2 and for some reason can't

Re: how to transfer data from one reduce to another map

2009-06-15 Thread tim robertson
Hi I am not sure I understand the question correctly. If you mean you want to use the output of Job1 as the input of Job2, then you can set the input path to the second job as the output path (e.g. output directory) from the first job. Cheers Tim On Mon, Jun 15, 2009 at 3:30 PM,
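A minimal sketch of the chaining described above, assuming the old mapred API; paths and the driver class are placeholders, and mapper/reducer setup is omitted for brevity:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Path intermediate = new Path("/tmp/job1-output"); // placeholder

            JobConf job1 = new JobConf(ChainedJobs.class);
            FileInputFormat.setInputPaths(job1, new Path(args[0]));
            FileOutputFormat.setOutputPath(job1, intermediate);
            JobClient.runJob(job1); // blocks until job1 finishes

            JobConf job2 = new JobConf(ChainedJobs.class);
            FileInputFormat.setInputPaths(job2, intermediate); // reads what job1 wrote
            FileOutputFormat.setOutputPath(job2, new Path(args[1]));
            JobClient.runJob(job2);
        }
    }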

Re: Few Queries..!!!

2009-06-05 Thread tim robertson
Answers inline - Once I place my data in HDFS, it gets replicated and chunked automatically over the datanodes, right? Hadoop takes care of all those things. Yes it does. - Now, if there is some third party who is not participating in the Hadoop program, meaning he is not one of the nodes of

Re: Can I have Reducer with No Output?

2009-05-28 Thread tim robertson
Yes you can do this. It is complaining because you are not declaring the output types in the method signature, but you will not use them anyway. So please try: private static class Reducer extends MapReduceBase implements Reducer<Text, Writable, Text, Text> { ... The output format will be a
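A hedged sketch of what the completed reducer might look like, with the generic types declared as the compiler demands; pairing it with NullOutputFormat (an assumption, not from the original mail) suppresses the empty output files entirely:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class NoOutputReducer extends MapReduceBase
            implements Reducer<Text, Writable, Text, Text> {
        public void reduce(Text key, Iterator<Writable> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // side effects only; output.collect(...) is never called
        }
    }
    // optionally: conf.setOutputFormat(org.apache.hadoop.mapred.lib.NullOutputFormat.class);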

Re: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread tim robertson
So you are using a java program to execute a load data infile command on mysql through JDBC? If so I *think* you would have to copy it onto the mysql machine from HDFS first, or the machine running the command, and then try a 'load data local infile'. Or perhaps use the
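A hedged sketch of that approach (copy the output out of HDFS, then 'load data local infile' over JDBC); host, credentials, and paths are placeholders, and the MySQL server must permit LOCAL INFILE:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadFromHdfs {
        public static void main(String[] args) throws Exception {
            // 1) copy the job output out of HDFS onto this machine
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyToLocalFile(new Path("/output/part-00000"),
                               new Path("/tmp/part-00000"));

            // 2) bulk load it into MySQL from the local file
            Class.forName("com.mysql.jdbc.Driver"); // older connectors need explicit registration
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://dbhost/mydb", "user", "pass");
            Statement st = con.createStatement();
            st.execute("LOAD DATA LOCAL INFILE '/tmp/part-00000' "
                     + "INTO TABLE my_table");
            st.close();
            con.close();
        }
    }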

Re: hadoop MapReduce and stop words

2009-05-16 Thread tim robertson
Perhaps some kind of in memory index would be better than iterating an array? A binary tree or so. I did something similar with polygon indexes and point data. It requires careful memory planning on the nodes if the indexes are large (mine were several GB). Just a thought, Tim On Sat, May 16, 2009 at

Re: hadoop MapReduce and stop words

2009-05-16 Thread tim robertson
Try googling 'binary tree java' and you will get loads of hits... This is a simple implementation, but I am sure there are better ones that handle balancing. Cheers Tim public class BinaryTree { public static void main(String[] args) { BinaryTree bt = new
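The pasted code is cut off above; a hedged reconstruction in the same spirit (a plain, unbalanced binary search tree for stop-word lookup, where method names are assumptions):

    public class BinaryTree {
        private String value;
        private BinaryTree left, right;

        public void insert(String s) {
            if (value == null) { value = s; return; }
            int cmp = s.compareTo(value);
            if (cmp < 0) {
                if (left == null) left = new BinaryTree();
                left.insert(s);
            } else if (cmp > 0) {
                if (right == null) right = new BinaryTree();
                right.insert(s);
            } // duplicates ignored
        }

        public boolean contains(String s) {
            if (value == null) return false;
            int cmp = s.compareTo(value);
            if (cmp == 0) return true;
            if (cmp < 0) return left != null && left.contains(s);
            return right != null && right.contains(s);
        }

        public static void main(String[] args) {
            BinaryTree bt = new BinaryTree();
            bt.insert("the"); bt.insert("and"); bt.insert("of");
            System.out.println(bt.contains("and"));    // true
            System.out.println(bt.contains("hadoop")); // false
        }
    }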

Re: ClassNotFoundException

2009-05-08 Thread tim robertson
Can you post the entire error trace please? On Fri, May 8, 2009 at 9:40 AM, George Pang p09...@gmail.com wrote: Dear users, I got a ClassNotFoundException when running the WordCount example on hadoop using Eclipse. Does anyone know where the problem is? Thank you! George

Re: .gz input files having less output than uncompressed version

2009-05-07 Thread tim robertson
Hi, What input format are you using for the GZipped file? I don't believe there is a GZip input format although some people have discussed whether it is feasible... Cheers Tim On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka mmata...@millennialmedia.com wrote: Problem: I am comparing two

Re: TextInputFormat unique key across files

2009-05-04 Thread tim robertson
I don't think that you can with those classes. If you look at TextInputFormat and LineRecordReader, they should not be hard to use as a basis for copying into your own version that makes the IDs unique, but I presume you would need to make them Text and not LongWritable keys. Just a thought...
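An alternative hedged sketch that avoids copying the input format at all: keep the stock TextInputFormat and build a unique Text key inside the mapper from the input file name (map.input.file, set per split by the old-API framework) plus the byte offset; class and field names are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class UniqueKeyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private String fileName;

        @Override
        public void configure(JobConf job) {
            fileName = job.get("map.input.file"); // set by the framework per split
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // file name + offset is unique across all input files
            out.collect(new Text(fileName + ":" + offset.get()), line);
        }
    }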

Re: Hadoop / MySQL

2009-04-28 Thread tim robertson
Hi, [Ankur]: How can make sure this happens? -- show processlist is how we spot it... literally it takes hours on our set up so easy to find. So we ended up with 2 DBs - DB1 we insert to, prepare and do batch processing - DB2 serving the read only web app Periodically we dump the DB1, point the

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-23 Thread tim robertson
If anyone is interested I did finally get round to processing it all, and due to the sparsity of data we have, for all 23 zoom levels and all species we have information on, the result was 807 million PNGs, which is $8,000 to PUT to S3 - too much for me to pay. So like most things I will probably
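(The arithmetic, for reference: 807,000,000 PUT requests at the $0.01 per 1,000 rate quoted in the follow-up below works out to roughly $8,070, which matches the $8,000 figure; the storage itself is comparatively negligible.)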

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-16 Thread tim robertson
on the same section of the sequence file. Maybe you can elaborate further and I'll see if I can offer any thoughts. On Tue, Apr 14, 2009 at 7:10 AM, tim robertson timrobertson...@gmail.com wrote: Sorry Brian, can I just ask please... I have the PNGs in the Sequence file for my sample set

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-16 Thread tim robertson
However, do the math on the costs for S3. We were doing something similar, and found that we were spending a fortune on our put requests at $0.01 per 1000, and next to nothing on storage. I've since moved to a more complicated model where I pack many small items in each object and store an

Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Hi all, I am currently processing a lot of raw CSV data and producing a summary text file which I load into mysql. On top of this I have a PHP application to generate tiles for google mapping (sample tile: http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800). Here is a (dev

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
and places it onto S3.  (If my numbers are correct, you're looking at around 3TB of data; is this right?  With that much, you might want another separate Map task to unpack all the files in parallel ... really depends on the throughput you get to Amazon) Brian On Apr 14, 2009, at 4:35 AM, tim

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
missing something obvious (e.g. can I disable this behavior)? Cheers Tim On Tue, Apr 14, 2009 at 2:44 PM, tim robertson timrobertson...@gmail.com wrote: Thanks Brian, This is pretty much what I was looking for. Your calculations are correct but based on the assumption that at all zoom

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread tim robertson
Thanks for sharing this - I find these comparisons really interesting. I have a small comment after skimming this very quickly. [Please accept my apologies for commenting on such a trivial thing, but personal experience has shown this really influences performance] One thing not touched on in

Hardware - please sanity check?

2009-04-02 Thread tim robertson
Hi all, I am not a hardware guy but about to set up a 10 node cluster for some processing of (mostly) tab files, generating various indexes and researching HBase, Mahout, pig, hive etc. Could someone please sanity check that these specs look sensible? [I know 4 drives would be better but price

Re: Hardware - please sanity check?

2009-04-02 Thread tim robertson
with 50 -- 33 GB RAM and 8 x 1 TB disks on each one;  one box however just has 16 GB of RAM and it routinely falls over when we run jobs on it) Miles 2009/4/2 tim robertson timrobertson...@gmail.com: Hi all, I am not a hardware guy but about to set up a 10 node cluster for some processing

Re: Sorting data numerically

2009-03-23 Thread tim robertson
If Akira were to write his/her own Mappers, using types like IntWritable would result in the keys being numerically sorted, right? Cheers, Tim On Mon, Mar 23, 2009 at 5:04 PM, Aaron Kimball aa...@cloudera.com wrote: Simplest possible solution: zero-pad your keys to ten places? - Aaron On Sat,
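A minimal sketch of that point: IntWritable keys compare numerically in the shuffle, so no zero-padding is needed. Tab-delimited input with an integer first field is an assumption for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NumericSortMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, Text> {
        private final IntWritable outKey = new IntWritable();

        public void map(LongWritable offset, Text line,
                        OutputCollector<IntWritable, Text> out, Reporter reporter)
                throws IOException {
            // assumes each line starts with an integer key followed by a tab
            outKey.set(Integer.parseInt(line.toString().split("\t")[0]));
            out.collect(outKey, line); // sorted numerically, not lexicographically
        }
    }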

Re: Hadoop AMI for EC2

2009-03-05 Thread tim robertson
Yeps. A good starting read: http://wiki.apache.org/hadoop/AmazonEC2 These are the AMIs:
$ ec2-describe-images -a | grep hadoop
IMAGE   ami-245db94d   cloudbase-1.1-hadoop-fc64/image.manifest.xml   247610401714   available   public   x86_64   machine
IMAGE   ami-791ffb10

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread tim robertson
Hi Praveen, I think it is more equivalent to Hive than HBase - both offer joins and structured querying, whereas HBase is more a column-oriented data store with many-to-ones embedded in a single row and (currently) indexes only on the primary key, though secondary keys are coming. I anticipate using

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread tim robertson
warehouse layer on top of Hadoop and by means of its SQL interface makes it easier to mine logs. So instead of writing Map-Reduce jobs for analyzing data, one can use SQL to do the same, and the SQL to Map-Reduce job translation is handled by CloudBase. -Taran 2009/3/3 tim robertson timrobertson

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread tim robertson
first.
 *
 * @see http://en.wikipedia.org/wiki/Relational_algebra#Semijoin
 * @see http://en.wikipedia.org/wiki/Bloom_filter
 */
Thanks, Taran 2009/3/3 tim robertson timrobertson...@gmail.com Hi Taran, Have you a blog or something

Re: How-to in MapReduce

2009-01-23 Thread tim robertson
Hi, Sounds like you might want to look at the Nutch project architecture and then see the Nutch on Hadoop tutorial - http://wiki.apache.org/nutch/NutchHadoopTutorial It does web crawling, and indexing using Lucene. It would be a good place to start anyway for ideas, even if it doesn't end up

Re: streaming split sizes

2009-01-21 Thread tim robertson
Hi Dmitry, What version of hadoop are you using? Assuming your 3G DB is a read only lookup... can you load it into memory in the Map.configure and then use (0.19+ only...):
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
So that the Maps are reused for all time

Re: Maven repo for Hadoop

2009-01-18 Thread tim robertson
Nope. Here is a super simple little pom to install it locally, and change version easily (put it in the project root along with the hadoop jar and then run as per the comment at top). If you do put it in a repository yourself, are you able to share the URL? Ours is unfortunately on an intranet so I
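The pom itself did not survive the archive truncation; a hedged reconstruction of such an installer pom (run as 'mvn -f hadoop-installer.xml install -Dhadoop.version=0.19.0'), where the chosen groupId/artifactId are assumptions:

    <!-- installs hadoop-${hadoop.version}-core.jar into the local repository -->
    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core-installer</artifactId>
      <version>1.0</version>
      <packaging>pom</packaging>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-install-plugin</artifactId>
            <executions>
              <execution>
                <phase>install</phase>
                <goals><goal>install-file</goal></goals>
                <configuration>
                  <file>hadoop-${hadoop.version}-core.jar</file>
                  <groupId>org.apache.hadoop</groupId>
                  <artifactId>hadoop-core</artifactId>
                  <version>${hadoop.version}</version>
                  <packaging>jar</packaging>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>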

Re: Architecture question.

2008-12-24 Thread tim robertson
I would also consider a DB for this... 10M and 2 columns is not a lot of data so I would look to have it in memory with some DB index or memory hash for querying. (We are keeping the indexes of tables with 150M records, 30M and 10M and joining between them with around 25 indexes on the 150M table

Re: does hadoop support submit a new different job in map function?

2008-12-06 Thread tim robertson
I don't agree that this would be considered unconventional, as I have scenarios where this makes sense too - one file with a summary view, and others that are very detailed and a pass over the first one determines which ones to analyse properly in a second job. I am a novice, but it looks like

Re: does hadoop support submit a new different job in map function?

2008-12-06 Thread tim robertson
there is probably no benefit over just waiting for the first pass to finish. On Sat, Dec 6, 2008 at 6:41 PM, Devaraj Das [EMAIL PROTECTED] wrote: On 12/6/08 10:43 PM, tim robertson [EMAIL PROTECTED] wrote: I don't agree that this would be considered unconventional, as I have scenarios where

Re: Best practice for using third party libraries in MapReduce Jobs?

2008-12-03 Thread tim robertson
Can't answer your question exactly, but can let you know what I do. I build all dependencies into 1 jar, and by using Maven for my build environment, when I assemble my jar, I am 100% sure all my dependencies are collected together. This is working very nicely for me and I have used the same
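One hedged way to realise the 'single jar with all dependencies' build he describes is the Maven assembly plugin's jar-with-dependencies descriptor (an assumption; the original mail does not name the plugin):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <!-- unpacks every dependency and merges it into one jar -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </plugin>

Running 'mvn assembly:assembly' then produces a single merged jar suitable for 'hadoop jar'.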

Re: Best practice for using third party libraries in MapReduce Jobs?

2008-12-03 Thread tim robertson
are using Maven to take the dependencies, and package them in one large jar? Basically unjar the contents of the jar and use those with your code I'm assuming? On Dec 3, 2008, at 9:25 AM, tim robertson wrote: Can't answer your question exactly, but can let you know what I do. I build all

Controlling maps per Node on 0.19.0 working?

2008-11-30 Thread tim robertson
Hi, I am a newbie so please excuse if I am doing something wrong: in hadoop-site.xml I have the following since I have a very memory intensive map:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>

Re: Newbie: Problem splitting a tab file into many (20,000) files

2008-11-30 Thread tim robertson
number of multi-hundred-MB files? You can probably make your setup work eventually, but it'll be a bit like fighting the tide. Alternately, if you must have random-record access, try putting your results into HBase. Hope this helps! Brian On Nov 28, 2008, at 2:14 AM, tim robertson wrote: I

Re: Controlling maps per Node on 0.19.0 working?

2008-11-30 Thread tim robertson
Ok - apologies, it seems changes to the hadoop-site.xml are not automatically picked up after the cluster is running. Cheers Tim On Sun, Nov 30, 2008 at 12:48 PM, tim robertson [EMAIL PROTECTED] wrote: Hi, I am a newbie so please excuse if I am doing something wrong: in hadoop-site.xml I

Re: Lookup HashMap available within the Map

2008-11-30 Thread tim robertson
explain some of the differences between using: - setNumTasksToExecutePerJvm() and then having statically declared data initialised in Mapper.configure(); and - a MultithreadedMapRunner? Regards, Shane On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote

Re: Newbie: Problem splitting a tab file into many (20,000) files

2008-11-28 Thread tim robertson
)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Thu, Nov 27, 2008 at 10:55 AM, tim robertson

Newbie: Problem splitting a tab file into many (20,000) files

2008-11-27 Thread tim robertson
Hi all, I am really struggling with splitting a single file into many files using hadoop and would appreciate any help offered. The input file is 150,000,000 rows long today, but will grow to 1 billion+. My mapper simply emits a key that it determines from the data (key will be used for the

Re: Hadoop Internal Architecture writeup

2008-11-27 Thread tim robertson
Hi Ricky, As a newcomer to MR and Hadoop I think what you are doing is a great addition to the docs. One thing I would like to see in this overview is how JVMs are spawned in the process - e.g. is it 1 JVM per node per job, or per node per task etc. The reason being it has implications about

Memory allocation - please confirm

2008-11-26 Thread tim robertson
Hi, Could you please sanity check this: In hadoop-site.xml I add:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1G</value>
  <description>Increasing the size of the heap to allow for large in memory index of polygons</description>
</property>
Is this all required to increase the -Xmx for

Re: Memory allocation - please confirm

2008-11-26 Thread tim robertson
, -Xmx1024M versus -Xmx1G. Other than that I think it looks good. Dennis tim robertson wrote: Hi, Could you please sanity check this: In hadoop-site.xml I add:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1G</value>
  <description>Increasing the size of the heap to allow for large

Re: output in memory

2008-11-26 Thread tim robertson
I would still store the result in a file, and then write a user interface that renders the output file as required... How would you know the user is still on the other end waiting to view the result? If you are sure, then perhaps the thing that launches the job could block until it is finished,

java.lang.OutOfMemoryError: Direct buffer memory

2008-11-25 Thread tim robertson
Hi all, I am doing a very simple Map that determines an integer value to assign to an input (1-64000). The reduction does nothing, but I then use this output formatter to put the data in a file per Key. public class CellBasedOutputFormat extends MultipleTextOutputFormat<WritableComparable,
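The class is truncated above; a hedged reconstruction of what it plausibly contained, based on the MultipleTextOutputFormat contract (override generateFileNameForKeyValue to route each record to a file named after its key):

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class CellBasedOutputFormat
            extends MultipleTextOutputFormat<WritableComparable, Writable> {
        @Override
        protected String generateFileNameForKeyValue(WritableComparable key,
                                                     Writable value,
                                                     String name) {
            // one output file per key, e.g. "cell_12345"
            return "cell_" + key.toString();
        }
    }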

Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi all, If I want to have an in memory lookup Hashmap that is available in my Map class, where is the best place to initialise this please? I have a shapefile with polygons, and I wish to create the polygon objects in memory on each node's JVM and have the map able to pull back the objects by id
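A hedged sketch of the pattern the replies in this thread converge on: initialise the lookup once per JVM in Mapper.configure(), guarded by a static field so that JVM reuse (setNumTasksToExecutePerJvm(-1)) pays the load cost only once; class and field names are placeholders:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class PolygonLookupMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        // shared across all map tasks running in this JVM
        private static Map<String, Object> polygonsById;

        @Override
        public void configure(JobConf job) {
            synchronized (PolygonLookupMapper.class) {
                if (polygonsById == null) {
                    polygonsById = new HashMap<String, Object>();
                    // ... load the shapefile polygon objects into the map here ...
                }
            }
        }

        public void map(LongWritable offset, Text record,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // ... look up polygons per record via polygonsById.get(id) ...
        }
    }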

Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
! Alex On Tue, Nov 25, 2008 at 11:09 AM, tim robertson [EMAIL PROTECTED] wrote: Hi all, If I want to have an in memory lookup Hashmap that is available in my Map class, where is the best place to initialise this please? I have a shapefile with polygons, and I wish to create the polygon

Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
, or is a Mapper.configure() the best place for this? Can it be called multiple times per Job meaning I need to keep some static synchronised indicator flag? Thanks again, Tim On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote: Thanks Alex

Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
On Nov 25, 2008, at 11:46 AM, tim robertson wrote: Hi Doug, Thanks - it is not so much I want to run in a single JVM - I do want a bunch of machines doing the work, it is just I want them all to have this in-memory lookup index, that is configured once per job. Is there some hook somewhere

Re: Newbie: multiple output files

2008-11-23 Thread tim robertson
, tim robertson [EMAIL PROTECTED] wrote: Hi, Can someone please point me at the best way to create multiple output files based on the Key outputted from the Map? So I end up with no reduction, but a file per Key outputted in the Mapping phase, ideally with the Key as the file name. Many thanks

Newbie: error=24, Too many open files

2008-11-23 Thread tim robertson
Hi all, I am running MR which is scanning 130M records and then trying to group them into around 64,000 files. The Map does the grouping of the record by determining the key, and then I use a MultipleTextOutputFormat to write the file based on the key: @Override protected String

Re: Newbie: error=24, Too many open files

2008-11-23 Thread tim robertson
Thank you Jeremy I am on Mac (10.5.5) and it is 256 by default. I will change this and rerun before running on the cluster. Thanks again Tim On Mon, Nov 24, 2008 at 8:38 AM, Jeremy Chow [EMAIL PROTECTED] wrote: There is a limit on the number of files each process can open in unix/linux. The
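For reference, the usual check and session-local fix on unix-like systems, assuming a bash shell (the persistent limit lives in /etc/security/limits.conf on most Linux distributions; Mac OS X uses a different mechanism):

$ ulimit -n           # show the current per-process open-file limit
$ ulimit -n 65536     # raise it for this shell before starting the daemons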

Re: Hadoop EC2

2008-09-03 Thread tim robertson
PROTECTED] wrote: There's a case study with some numbers in it from a presentation I gave on Hadoop and AWS in London last month, which you may find interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf. tim robertson [EMAIL PROTECTED] wrote: For these small datasets, you might find

Re: Hadoop EC2

2008-09-02 Thread tim robertson
I have been processing only 100s GBs on EC2, not 1000's and using 20 nodes and really only in exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote: Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon

Re: Hadoop EC2

2008-09-02 Thread tim robertson
, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote: I have been processing only 100s GBs on EC2, not 1000's

Re: Hadoop 101

2008-09-01 Thread tim robertson
I suggest reading up on MapReduce first: http://labs.google.com/papers/mapreduce-osdi04.pdf Cheers On Mon, Sep 1, 2008 at 11:27 AM, HHB [EMAIL PROTECTED] wrote: Hey, I'm reading about Hadoop lately but I'm unable to understand it. Would you please explain it to me in easy words? How

Re: distinct count

2008-08-26 Thread tim robertson
Hi Shirley, If you mean the distinct words along with counts of their usage, for example: in the Map, output the word as the key and 1 as the value; in the Reduce, count up the values for the key. This is then one job. Cheers Tim On Tue, Aug 26, 2008 at 3:02 PM, Shirley Cohen [EMAIL PROTECTED]
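A minimal sketch of that single job in the old mapred API (the type choices and tokenisation are assumptions):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DistinctWordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, IntWritable> out, Reporter r)
                    throws IOException {
                for (String w : line.toString().split("\\s+")) {
                    word.set(w);
                    out.collect(word, ONE); // the word as key, 1 as value
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out, Reporter r)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) sum += values.next().get();
                out.collect(key, new IntWritable(sum)); // distinct word + usage count
            }
        }
    }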

Re: Hadoop also applicable in a web app environment?

2008-08-05 Thread tim robertson
I am a newbie also, so my answer is not an expert user's by any means. That said: This is not what the MR is designed for... If you have a reporting tool for example, which takes a database a very long time to answer - such a long time that you can't expect a user to hang around waiting for the

Re: Maven

2008-07-11 Thread tim robertson
No there isn't unfortunately... I use this, so I can quickly change versions:
<!-- mvn -f hadoop-installer.xml install -Dhadoop.version=0.16.4 -Dmaven.test.skip=true -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation=

Re: FW: [jira] Updated: (HADOOP-3601) Hive as a contrib project

2008-07-09 Thread tim robertson
Hi Ashish I am very excited to try this, having been evaluating Hadoop, HBase, Cascading etc recently to process 100 million biodiversity records (expecting billions soon), with a view to data mining purposes (species that are critically endangered and observed outside of protected areas

Re: Hadoop - is it good for me and performance question

2008-07-01 Thread tim robertson
MapReduce on Hadoop is for processing very large amounts of data, or else the overhead of the framework (job scheduling, failover etc.) does not justify it. If you are processing 10-100M / min = 14-140G a day, that probably justifies its use I would say. You can't get a performance estimate on a pseudo

Re: are there any large data set to test the map reduce program on hadoop?

2008-06-28 Thread tim robertson
Perhaps something like a RandomTextWriter to generate a file for input? http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/examples/RandomTextWriter.html Cheers Tim On Sat, Jun 28, 2008 at 4:42 AM, Richard Zhang [EMAIL PROTECTED] wrote: Hello Folks: I wrote a map reduce
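For reference, the examples driver can be invoked along these lines (the jar name and output path are placeholders for your installation):

$ bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter /user/me/random-input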

Hadoop on EC2 + S3 - best practice?

2008-06-28 Thread tim robertson
Hi all, I have data in a file (150 million lines at 100Gb or so) and have several MapReduce classes for my processing (custom index generation). Can someone please confirm the following is the best way to run on EC2 and S3 (both of which I am new to..):
1) load my 100Gb file into S3
2) create a

EC2 public AMI

2008-06-27 Thread tim robertson
Hi all, I have been battling EC2 all day and getting nowhere (see other message) Does anyone use the hadoop-ec2-images/hadoop-0.17.0 AMI for small instances successfully? Following http://wiki.apache.org/hadoop/AmazonEC2 unfortunately doesn't work as the slaves don't come up (details in my

Newbie reducer question

2008-03-09 Thread tim robertson
Hi all, I am a day one newbie investigating distributed work for the first time... I have run through the tutorials with ease (thanks for the nice documentation) and now have written my first map reduce. Is it accurate to say that the reduce is repeatedly called by the Hadoop framework until