DBInputFormat - alternative select strategy?

2009-10-12 Thread tim robertson
Hi all, I've been dumping tables from MySQL and loading them manually into HDFS, but decided to look at the DBInputFormat to better automate the process. I see it issuing the "select... from ... order by id limit..." which takes ages (several minutes) on my large tables since I use myisam and
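
The alternative the thread is after can be sketched in plain Java: instead of "ORDER BY id LIMIT offset,count" pagination (which MyISAM re-scans from the start for every split), carve the table's id range into bounded WHERE clauses, one per split. This is a hedged sketch of the idea only; the class, method, and table names are illustrative and not part of DBInputFormat.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitQueries {
    // Build one bounded-range query per split over [minId, maxId].
    public static List<String> buildQueries(String table, String idCol,
                                            long minId, long maxId, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = (maxId - minId + 1 + numSplits - 1) / numSplits; // ceiling
        for (int i = 0; i < numSplits; i++) {
            long lo = minId + i * span;
            long hi = Math.min(lo + span, maxId + 1);
            if (lo > maxId) break;
            // Each query scans only its own id range; no OFFSET re-scan.
            queries.add("SELECT * FROM " + table
                    + " WHERE " + idCol + " >= " + lo
                    + " AND " + idCol + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : buildQueries("occurrence", "id", 1, 1000, 4)) {
            System.out.println(q);
        }
    }
}
```

Later Hadoop versions ship essentially this strategy as DataDrivenDBInputFormat.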

Re: DBInputFormat - alternative select strategy?

2009-10-12 Thread tim robertson
able to generate > line number keys the way dbinputformat does. > > -Omer > > -Original Message- > From: tim robertson [mailto:timrobertson...@gmail.com] > Sent: Monday, October 12, 2009 10:44 AM > To: mapreduce-user@hadoop.apache.org > Subject: DBInputFormat

Re: DBInputFormat - alternative select strategy?

2009-10-15 Thread tim robertson
at automates the whole process for you, called "Sqoop"; see > www.cloudera.com/hadoop-sqoop > > - Aaron > > On Mon, Oct 12, 2009 at 8:11 AM, tim robertson > wrote: >> >> Thanks Omer! >> >> >> On Mon, Oct 12, 2009 at 5:01 PM, Omer Trajman wro
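
For reference, the Sqoop import that Aaron's reply points to looks roughly like this; flag spellings vary by Sqoop version and the host, database, and paths here are illustrative:

```shell
# Hedged sketch: import a MySQL table into HDFS, splitting on the id column
# so the work is spread across several mappers.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table occurrence \
  --split-by id \
  --num-mappers 4 \
  --target-dir /user/tim/occurrence
```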

Reducer output records = 0? (0.20.1)

2009-10-20 Thread tim robertson
Hi all, I have a Reducer with the following (using new API): public static class Transpose extends Reducer { @Override protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int c
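
A hedged reconstruction of the shape such a new-API (0.20+) reducer should take; the generics and the @Override annotation matter, because a reduce() whose signature does not match is silently ignored and the identity implementation runs instead. This is a sketch against the Hadoop API, not the poster's actual code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Transpose extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (Text ignored : values) {
            count++;
        }
        // Without this call the job emits nothing; and on 0.20.0/0.20.1 even
        // emitted records could display as 0 in the counter (MAPREDUCE-112,
        // per the reply below in this thread).
        context.write(key, new IntWritable(count));
    }
}
```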

Re: Reducer output records = 0? (0.20.1)

2009-10-21 Thread tim robertson
On Wed, Oct 21, 2009 at 10:10 PM, Amareshwari Sri Ramadasu wrote: > That was a bug in 0.20. It got fixed in 0.20.2 through MAPREDUCE-112 > > Thanks > Amareshwari > > tim robertson wrote: >> >> Hi all, >> >> I have a Reducer with the following (using new AP

How to use MultipleTextOutputFormat ?

2009-10-27 Thread tim robertson
Hi all, Using 0.20.1 I have a MultipleTextOutputFormat with the following: protected String generateFileNameForKeyValue(Object key, Object value, String name) { return BASE_FILE + "/resource-" + key.toString(); } But when I run this on a 9 node cluster with 9 reducers I get issues with n
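
A hedged sketch of the old-API (org.apache.hadoop.mapred) subclass being described; BASE_FILE's value is illustrative. Note that each distinct key opens its own HDFS output file, so many distinct keys per reducer means many simultaneous open writers, which is the likely source of trouble on a small cluster.

```java
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class PerResourceOutputFormat
        extends MultipleTextOutputFormat<Object, Object> {
    private static final String BASE_FILE = "output"; // illustrative value

    @Override
    protected String generateFileNameForKeyValue(Object key, Object value,
                                                 String name) {
        // One output file per distinct key, e.g. output/resource-123
        return BASE_FILE + "/resource-" + key.toString();
    }
}
```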

MultipleTextOutputFormat giving "Bad connect ack with firstBadLink"

2009-10-27 Thread tim robertson
Hi all, I am running a simple job working on an input tab file, running the following: - a simple Mapper which reads a field from the tab file row and emits it as the key and the line as the value - an Identity reducer - a MultipleTextOutputFormat emitting a filename based on the key like

Re: MultipleTextOutputFormat giving "Bad connect ack with firstBadLink"

2009-10-27 Thread tim robertson
10x. > > On Tue, Oct 27, 2009 at 8:24 AM, tim robertson > wrote: >> >> Hi all, >> >> I am running a simple job working on an input tab file, running the >> following: >> >> - a simple Mapper which reading a field from the tab file row and >> em

Newbie: Inner join - reduce side

2009-11-12 Thread Tim Robertson
Hi all, I have 2 KVP files of 200 million+ rows, and plan to do a reduce side join (my first...).

Input 1 -- KEY TC_ID
Input 2 -- KEY OCC_ID

I aim to produce an output of:

Output -- OCC_ID TC_ID

(if there are any many2many I would flag an error) My plan was to
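
The join being planned can be simulated in plain Java to show the logic: records from each input are grouped by KEY and each group is joined on that key. This sketch assumes at most one value per key per side (the many-to-many case the poster wants to flag would show up as a group with multiple ids from one side); all names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReduceSideJoinSim {
    // Inner join on KEY: returns KEY -> "OCC_ID\tTC_ID" for keys present
    // in both inputs. In real MR the grouping is done by the shuffle and
    // this logic lives in the reducer.
    public static Map<String, String> join(Map<String, String> keyToTc,
                                           Map<String, String> keyToOcc) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : keyToTc.entrySet()) {
            String occ = keyToOcc.get(e.getKey());
            if (occ != null) {
                out.put(e.getKey(), occ + "\t" + e.getValue());
            }
        }
        return out;
    }
}
```

In the MR version, each mapper tags its value with the source file (e.g. a one-character prefix) so the reducer can tell the two sides of a group apart.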

Re: Newbie: Inner join - reduce side

2009-11-12 Thread Tim Robertson
Ok, I missed the org.apache.hadoop.contrib.utils.join which obviously does this exact thing... Sorry, answering my own question Tim On Thu, Nov 12, 2009 at 4:14 PM, Tim Robertson wrote: > Hi all, > > I have 2 KVP files of 200million+ rows, and plan to do a reduce side > jo

Simple normalizing of data

2009-12-01 Thread Tim Robertson
Hi all, I am processing a large tab file to format it suitable for loading into a database with a predefined schema. I have a tab file with a column that I need to normalize out to another table and reference it with a foreign key from the original file. I would like to hear if my proposed proces
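
The core of the proposed normalization can be shown in plain Java: a first pass collects the distinct values of the column and assigns each a surrogate key, and a second pass rewrites every row, replacing the value with its key. In MR terms the first pass is a distinct-and-number job, and the lookup table is then shipped to the rewrite job (e.g. via the DistributedCache). This is a hedged sketch of the idea, not the poster's actual pipeline.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Normalizer {
    // Pass 1: assign a stable surrogate key (1, 2, 3, ...) to each
    // distinct column value, in first-seen order.
    public static Map<String, Integer> assignKeys(List<String> columnValues) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String v : columnValues) {
            if (!dict.containsKey(v)) {
                dict.put(v, dict.size() + 1);
            }
        }
        return dict;
    }
}
```

Pass 2 is then a map-only job that looks each row's value up in this dictionary and emits the foreign key in its place.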

Help tuning a cluster - COPY slow

2010-11-17 Thread Tim Robertson
Hi all, We have set up a small cluster (13 nodes) using CDH3. We have been tuning it using TeraSort and Hive queries on our data, and the copy phase is very slow, so I'd like to ask if anyone can look over our config. We have an unbalanced set of machines (all on a single switch): - 10 of Intel @

Re: Help tuning a cluster - COPY slow

2010-11-17 Thread Tim Robertson
, but overall throughput is low so > there's a lot of seeks going on). > > > > On 17 nov 2010, at 09:43, Tim Robertson wrote: > > Hi all, > > We have setup a small cluster (13 nodes) using CDH3 > > We have been tuning it using TeraSort and Hive queries

Re: Help tuning a cluster - COPY slow

2010-11-17 Thread Tim Robertson
ns between hosts from opening efficiently? > - Aaron > > On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson > wrote: >> >> Thanks Friso, >> >> We've been trying to diagnose all day and still did not find a solution. >> We're running cacti and IO w

Re: Help tuning a cluster - COPY slow

2010-11-18 Thread Tim Robertson
gt; wrong). > > You could try running something like strace (with the -T option, which shows > time spent in system calls) to see whether network related system calls take > a long time. > > > > Friso > > > > > On 17 nov 2010, at 22:20, Tim Robertson wrote:

Re: Help tuning a cluster - COPY slow

2010-11-18 Thread Tim Robertson
Just to close this thread. Turns out it all came down to a mapred.reduce.parallel.copies being overwritten to 5 on the Hive submission. Cranking that back up and everything is happy again. Thanks for the ideas, Tim On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson wrote: > Thanks ag
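
The fix described above corresponds to a mapred-site.xml entry along these lines; 5 was the old default, and the value 20 here is illustrative, not the thread's exact setting:

```xml
<!-- Number of parallel fetches each reducer makes during the shuffle/COPY
     phase; too low a value starves the copy phase on a fast cluster. -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
</property>
```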

Re: Locks in M/R framework

2012-08-13 Thread Tim Robertson
How about introducing a distributed coordination and locking mechanism? ZooKeeper would be a good candidate for that kind of thing. On Mon, Aug 13, 2012 at 12:52 PM, David Ginzburg wrote: > Hi, > > I have an HDFS folder and M/R job that periodically updates it by > replacing the data with newly
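
One way to realize the ZooKeeper suggestion is the lock recipe in Apache Curator (a separate library, not part of Hadoop): the job that replaces the HDFS folder takes the lock, and readers take it too, so a swap never races a read. Connection string, lock path, and retry settings here are all illustrative.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FolderLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/hdfs-folder");
        lock.acquire();
        try {
            // replace (or read) the HDFS folder while holding the lock
        } finally {
            lock.release();
        }
        client.close();
    }
}
```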

Re: Sending data to all reducers

2012-08-23 Thread Tim Robertson
So you are trying to run a single reducer on each machine, and all input data regardless of its location gets streamed to each reducer? On Thu, Aug 23, 2012 at 10:41 AM, Hamid Oliaei wrote: > Hi, > > I want to broadcast some data to all nodes under Hadoop 0.20.2. I tested > DistributedCache modu

Re: Sending data to all reducers

2012-08-23 Thread Tim Robertson
Sorry to ask too many questions, but it will help the user list best offer you advice, as this is not a typical MR use case. - Do you foresee the reducer storing the data on a file system local to the machine? - Do you need to use specific input formats for the job, or is it really just text files?

Re: Sending data to all reducers

2012-08-23 Thread Tim Robertson
Then I think you might be best exploring running a getmerge on each client. How you trigger that is up to you, but something like Fabric [1] might help. Others might propose different solutions, but it doesn't sound like MR is a natural choice to me. I would expect this is the very fastest way o
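
The getmerge suggested above is a single CLI call; it concatenates all the part files of a job's output directory into one local file on whichever client runs it. Paths here are illustrative:

```shell
hadoop fs -getmerge /user/hamid/job-output /tmp/merged-output.txt
```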

Re: splits and maps

2012-09-19 Thread Tim Robertson
I think the splitting recognises the end of line, so you might get 11, but otherwise that looks correct. On Wed, Sep 19, 2012 at 5:42 PM, Pedro Sá da Costa wrote: > > > If I've an input file of 640MB in size, and a split size of 64Mb, this > file will be partitioned in 10 splits, and each split
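
The arithmetic in question is just ceiling division: a 640 MB file with a 64 MB split size yields ceil(640/64) = 10 splits. The "you might get 11" caveat in the reply is about record boundaries: each record reader keeps reading past its split's end until the next newline, so records are never cut in half even though splits follow the block math. A minimal illustration:

```java
public class SplitMath {
    // Number of input splits for a file: ceiling of fileSize / splitSize.
    public static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        System.out.println(numSplits(640L << 20, 64L << 20)); // prints 10
    }
}
```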

Re: splits and maps

2012-09-19 Thread Tim Robertson
fault map numbers, I think a perfect > file of 10 blocks will spawn only 10 mappers. The mapper's record > reader is the one that reads until a newline (even after the end of > its block length bytes). > > On Wed, Sep 19, 2012 at 9:16 PM, Tim Robertson > wrote: > > I think

Re: replace separator in output.collect?

2013-06-11 Thread Tim Robertson
Assuming you are using a textfileoutputformat: http://stackoverflow.com/questions/11031785/hadoop-key-and-value-are-tab-separated-in-the-output-file-how-to-do-it-semicol So something like: conf.set("mapred.textoutputformat.separator", ":"); conf.set("mapreduce.textoutputformat.separator", ":

Re: Joining N DataSets????

2013-07-26 Thread Tim Robertson
Sounds like you might be interested in Hive. On Fri, Jul 26, 2013 at 9:11 PM, شجاع الرحمن بیگ wrote: > Hi > I am working on a problem where I need to join multiple datasets. The > problem is explained below. > > Given N number of datasets, having M relations in between them, i want > to merge

Re: Gathering Statistics for use by Mappers?

2013-09-01 Thread Tim Robertson
Hey Steve, If I recall correctly the total number of counters you have is limited. It's been a while since I looked at that code, but I seem to recall the counters get pushed to JT in heartbeat messaging and are held in JT memory. Anyway, 1) sounds like you'll hit limits, so I'd suggest starting
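
The counter cap alluded to here has historically defaulted to 120 per job and is configurable; the property name varies by version (mapreduce.job.counters.limit on older releases, mapreduce.job.counters.max on Hadoop 2), and the value below is illustrative:

```xml
<property>
  <name>mapreduce.job.counters.max</name>
  <value>500</value>
</property>
```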

Re: Increasing Java Heap Space in Slave Nodes

2013-09-07 Thread Tim Robertson
That's right. You can verify it when you run your job by looking at the "job file" link at the top. That shows you all the params used to start the job. Just be aware to make sure you don't put your cluster into an unstable state when you do that. E.g. look at how many mappers / reducers that ca
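
The property in question is the child-JVM options for task processes, set in mapred-site.xml; the heap value is illustrative, and as the reply warns, it must fit within each node's RAM times the number of concurrent task slots or the node becomes unstable:

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```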