Hi all,
I've been dumping tables from MySQL and loading them manually into
HDFS, but decided to look at DBInputFormat to better automate
the process.
I see it issuing the "select... from ... order by id limit..." which
takes ages (several minutes) on my large tables since I use myisam and
able to generate
> line number keys the way dbinputformat does.
>
> -Omer
>
> -----Original Message-----
> From: tim robertson [mailto:timrobertson...@gmail.com]
> Sent: Monday, October 12, 2009 10:44 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: DBInputFormat
> There is a tool that automates the whole process for you, called "Sqoop"; see
> www.cloudera.com/hadoop-sqoop
>
> - Aaron
>
> On Mon, Oct 12, 2009 at 8:11 AM, tim robertson
> wrote:
>>
>> Thanks Omer!
>>
>>
>> On Mon, Oct 12, 2009 at 5:01 PM, Omer Trajman wrote:
Hi all,
I have a Reducer with the following (using new API):
public static class Transpose extends Reducer {
@Override
protected void reduce(Text key, Iterable values,
Context context)
throws IOException, InterruptedException {
int c
On Wed, Oct 21, 2009 at 10:10 PM, Amareshwari Sri Ramadasu
wrote:
> That was a bug in 0.20. It got fixed in 0.20.2 through MAPREDUCE-112.
>
> Thanks
> Amareshwari
>
> tim robertson wrote:
>>
>> Hi all,
>>
>> I have a Reducer with the following (using new API):
Hi all,
Using 0.20.1 I have a MultipleTextOutputFormat with the following:
protected String generateFileNameForKeyValue(Object key, Object
value, String name) {
return BASE_FILE + "/resource-" + key.toString();
}
But when I run this on a 9 node cluster with 9 reducers I get issues
with n
Hi all,
I am running a simple job working on an input tab file, running the following:
- a simple Mapper which reads a field from the tab file row and
emits this as the key and the line as the value.
- an Identity reducer
- a MultipleTextOutputFormat emitting a filename based on the key like
10x.
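For anyone following along, the shape of that job can be sketched in Python (hypothetical column index and filenames, just to illustrate the per-key partitioning that MultipleTextOutputFormat ends up doing):

```python
from collections import defaultdict

def partition_by_key(lines, key_column=0):
    """Group tab-separated lines by one column, mimicking a mapper that
    emits the column as the key, an identity reducer, and an output
    format that writes each key's lines to its own file."""
    outputs = defaultdict(list)  # output filename -> lines
    for line in lines:
        key = line.split("\t")[key_column]
        outputs["resource-" + key].append(line)
    return dict(outputs)

rows = ["10x\ta\tb", "10y\tc\td", "10x\te\tf"]
files = partition_by_key(rows)
# files has one entry per distinct key: resource-10x, resource-10y
```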
>
> On Tue, Oct 27, 2009 at 8:24 AM, tim robertson
> wrote:
>>
>> Hi all,
>>
>> I am running a simple job working on an input tab file, running the
>> following:
>>
>> - a simple Mapper which reading a field from the tab file row and
>> em
Hi all,
I have 2 KVP files of 200million+ rows, and plan to do a reduce side
join (my first...).
Input 1
--
KEY TC_ID
Input 2
--
KEY OCC_ID
I aim to produce an output of:
Output
--
OCC_ID TC_ID (if there are any many2many I would flag an error)
My plan was to
Ok, I missed the org.apache.hadoop.contrib.utils.join which obviously
does this exact thing...
Sorry, answering my own question
Tim
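For the record, the plan can be sketched in Python (a simulation of a tagged reduce-side join, not the contrib code itself: the map phase tags each record with its source, the shuffle groups by KEY, and the reduce pairs the ids, flagging any many-to-many key):

```python
from collections import defaultdict

def reduce_side_join(input1, input2):
    """Join two (KEY, id) datasets on KEY the way a reduce-side join
    would: group both inputs by KEY, then pair OCC_ID with TC_ID.
    Keys that are many-to-many on both sides are flagged as errors."""
    grouped = defaultdict(lambda: ([], []))  # KEY -> (tc_ids, occ_ids)
    for key, tc_id in input1:                # records tagged as source 1
        grouped[key][0].append(tc_id)
    for key, occ_id in input2:               # records tagged as source 2
        grouped[key][1].append(occ_id)

    output, errors = [], []
    for key, (tc_ids, occ_ids) in grouped.items():
        if len(tc_ids) > 1 and len(occ_ids) > 1:
            errors.append(key)               # many2many: flag an error
        else:
            for occ in occ_ids:
                for tc in tc_ids:
                    output.append((occ, tc))
    return output, errors

pairs, bad = reduce_side_join([("k1", "TC1")], [("k1", "OCC1"), ("k2", "OCC2")])
```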
On Thu, Nov 12, 2009 at 4:14 PM, Tim Robertson
wrote:
> Hi all,
>
> I have 2 KVP files of 200million+ rows, and plan to do a reduce side
> jo
Hi all,
I am processing a large tab file to format it suitable for loading
into a database with a predefined schema.
I have a tab file with a column that I need to normalize out to
another table and reference it with a foreign key from the original
file. I would like to hear if my proposed proces
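The normalization step I have in mind looks roughly like this (a Python sketch with made-up surrogate keys, just to illustrate pulling one column out into a lookup table and referencing it by foreign key):

```python
def normalize_column(rows, col):
    """Replace one column's values with surrogate foreign keys,
    producing a lookup table (id -> value) plus the rewritten rows,
    as when normalizing a repeated text column out of a large tab file."""
    lookup, ids = {}, {}
    normalized = []
    for row in rows:
        value = row[col]
        if value not in ids:
            ids[value] = len(ids) + 1      # assign the next surrogate key
            lookup[ids[value]] = value
        new_row = list(row)
        new_row[col] = ids[value]          # reference via foreign key
        normalized.append(tuple(new_row))
    return lookup, normalized

lookup, rows = normalize_column([("a", "red"), ("b", "red"), ("c", "blue")], 1)
```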
Hi all,
We have set up a small cluster (13 nodes) using CDH3.
We have been tuning it using TeraSort and Hive queries on our data,
and the copy phase is very slow, so I'd like to ask if anyone can look
over our config.
We have an unbalanced set of machines (all on a single switch):
- 10 of Intel @
, but overall throughput is low so
> there's a lot of seeks going on).
>
>
>
> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>
> Hi all,
>
> We have setup a small cluster (13 nodes) using CDH3
>
> We have been tuning it using TeraSort and Hive queries
ns between hosts from opening efficiently?
> - Aaron
>
> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson
> wrote:
>>
>> Thanks Friso,
>>
>> We've been trying to diagnose all day and still did not find a solution.
>> We're running cacti and IO w
> wrong).
>
> You could try running something like strace (with the -T option, which shows
> time spent in system calls) to see whether network related system calls take
> a long time.
>
>
>
> Friso
>
>
>
>
> On 17 nov 2010, at 22:20, Tim Robertson wrote:
Just to close this thread.
Turns out it all came down to a mapred.reduce.parallel.copies being
overwritten to 5 on the Hive submission. Cranking that back up and
everything is happy again.
Thanks for the ideas,
Tim
On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson
wrote:
> Thanks ag
How about introducing a distributed coordination and locking mechanism?
ZooKeeper would be a good candidate for that kind of thing.
On Mon, Aug 13, 2012 at 12:52 PM, David Ginzburg wrote:
> Hi,
>
> I have an HDFS folder and M/R job that periodically updates it by
> replacing the data with newly
So you are trying to run a single reducer on each machine, and all input
data regardless of its location gets streamed to each reducer?
On Thu, Aug 23, 2012 at 10:41 AM, Hamid Oliaei wrote:
> Hi,
>
> I want to broadcast some data to all nodes under Hadoop 0.20.2. I tested
> DistributedCache modu
Sorry to ask so many questions, but it will help the user list offer
you the best advice, as this is not a typical MR use case.
- Do you foresee the reducer storing the data on a file system local
to the machine?
- Do you need to use specific input formats for the job, or is it really
just text files?
Then I think you might be best exploring running a getmerge on each
client. How you trigger that is up to you, but something like Fabric [1]
might help. Others might propose different solutions, but it doesn't sound
like MR is a natural choice to me.
I would expect this is the very fastest way o
I think the splitting recognises the end of line, so you might get 11 but
otherwise that looks correct.
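The arithmetic is just ceiling division; a quick Python check (an 11th split only appears if the file is even slightly larger than 10 blocks):

```python
def count_splits(file_size, split_size):
    """Number of input splits for a file: ceil(file_size / split_size).
    Records straddling a split boundary are not lost: the record reader
    owning the record's start reads past the boundary to the next
    newline, and the next reader skips that partial first line."""
    return -(-file_size // split_size)     # ceiling division

MB = 1024 * 1024
splits_exact = count_splits(640 * MB, 64 * MB)      # a perfect 10-block file
splits_plus = count_splits(640 * MB + 1, 64 * MB)   # one byte over
```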
On Wed, Sep 19, 2012 at 5:42 PM, Pedro Sá da Costa wrote:
>
>
> If I've an input file of 640MB in size, and a split size of 64Mb, this
> file will be partitioned in 10 splits, and each split
fault map numbers, I think a perfect
> file of 10 blocks will spawn only 10 mappers. The mapper's record
> reader is the one that reads until a newline (even after the end of
> its block length bytes).
>
> On Wed, Sep 19, 2012 at 9:16 PM, Tim Robertson
> wrote:
> > I think
Assuming you are using a TextOutputFormat:
http://stackoverflow.com/questions/11031785/hadoop-key-and-value-are-tab-separated-in-the-output-file-how-to-do-it-semicol
So something like:
conf.set("mapred.textoutputformat.separator", ":");
conf.set("mapreduce.textoutputformat.separator", ":");
Sounds like you might be interested in Hive.
On Fri, Jul 26, 2013 at 9:11 PM, شجاع الرحمن بیگ wrote:
> Hi
> I am working on a problem where I need to join multiple datasets. The
> problem is explained below.
>
> Given N number of datasets, having M relations in between them, i want
> to merge
Hey Steve,
If I recall correctly the total number of counters you have is limited.
It's been a while since I looked at that code, but I seem to recall the
counters get pushed to the JT in heartbeat messages and are held in JT memory.
Anyway, 1) sounds like you'll hit limits, so I'd suggest starting
That's right.
You can verify it when you run your job by looking at the "job file" link
at the top. That shows you all the params used to start the job.
Just be careful not to put your cluster into an unstable
state when you do that. E.g. look at how many mappers / reducers that ca