On Oct 19, 2016, at 11:00 AM, Michael Segel wrote:
Hi,
Since I am not on the ORC mailing list… and since the ORC Java code is in the
Hive APIs… this seems like a good place to start. ;-)
So…
Ran into a little problem…
One of my developers was writing a map/reduce job to read records from a source
and, after some filtering, write the result se
thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
Short answer: yes.
> On Mar 24, 2015, at 11:53 AM, Xuzhan Sun wrote:
>
> Hello,
>
> I want to do some test on my single node cluster for Speed. I know it is easy
> to set up the Pseudo-Distributed Mode, and Hadoop will start one Java process
> for each single map/reduce.
>
> My question is: i
WRT the capacity scheduler, it's not so much changing the priority of a job as
allowing for pre-emption. Note that I guess you could raise the one job's
priority, and then the other job's priority, so that when a task finishes the
other job gets the next slot. However, you're still stuck waiting an
Just a quick question...
Suppose you have a M/R job running.
How does the Mapper or Reducer task know or find out if it's running as an
M/R 1 or M/R 2 job?
I would suspect the job context would hold that information... but on first
glance I didn't see it.
So what am I missing?
Thx
-Mike
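FWIW, one low-tech sketch of that suspicion (property name per stock Hadoop 2
configs; on a 1.x cluster it simply isn't set, which tells you the same thing):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FrameworkCheckMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) {
    // "yarn" => MRv2/YARN; "classic" or "local" (or unset on 1.x) => MRv1-style.
    String framework = context.getConfiguration()
        .get("mapreduce.framework.name", "classic");
    System.err.println("mapreduce.framework.name = " + framework);
  }
}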
How do you know where the data exists when you begin?
Sent from a remote device. Please excuse any typos...
Mike Segel
> On Oct 28, 2013, at 8:57 PM, "ricky lee" wrote:
>
> Hi,
>
> I have a question about maintaining data locality in a MapReduce job launched
> through Yarn. Based on the Yarn
The opinions expressed here are mine, while they may reflect a cognitive
thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
Actually,
I am interested.
Lots of different Apache top level projects seem to overlap and it can be
confusing.
It's very easy for a good technology to get starved because no one asks how to
combine these features into the framework.
On Jul 29, 2013, at 9:58 AM, Tsuyoshi OZAWA wrote:
> I
Uhm...
You want to save the counters as in counts per job run or something? (Remember
HDFS == WORM)
Then you could do a sequence file and then use something like HBase to manage
the index.
(Every time you add a set of counters, you have a new file and a new index.)
Heck you could use HBase f
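To make the sequence file half of that concrete, a minimal sketch (Hadoop 2
style SequenceFile API; the per-run output path, and therefore the 'new file'
per set of counters, is up to you):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;

public class CounterDump {
  // Call this after job.waitForCompletion(); writes one record per counter.
  public static void dump(Job job, Path out, Configuration conf) throws Exception {
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(LongWritable.class))) {
      for (CounterGroup group : job.getCounters()) {
        for (Counter counter : group) {
          writer.append(new Text(group.getName() + ":" + counter.getName()),
              new LongWritable(counter.getValue()));
        }
      }
    }
  }
}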
Dave,
How did you lose power to the entire cluster?
I realize that this question goes beyond HBase, but it is really an Ops question. Do you
have redundant power sources and redundant power supplies to the racks and
machines in the cluster?
On Jul 2, 2013, at 7:42 AM, Dave Latham wrote:
> Hi Uma,
> I'll keep looking at Pig with ElephantBird.
> Thanks,
>
> -Jorge
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel
> wrote:
> Hi..
>
> Have you thought about HBase?
>
> I would suggest that if you're using Hive o
I could have sworn there was a thread on this already. (Maybe the HBase list?)
Andrew P. kinda nailed it when he talked about the fact that you had to write
the replication(s).
If you wanted improved performance, why not look at the hybrid drives that have
a small SSD buffer and a spinning di
Where was the pig script? On HDFS?
How often does your cluster clean up the trash?
(Deleted files don't get purged the moment they're deleted... they sit in the
trash first.) It's a configurable setting, so YMMV.
On Jun 12, 2013, at 8:58 PM, feng jiang wrote:
> Hi everyone,
>
> We have a pig script scheduled running
Hi..
Have you thought about HBase?
I would suggest, if you're using Hive or Pig, looking at taking these files and
putting the JSON records into a sequence file, or a set of sequence files.
(Then look at HBase to help index them...) 200KB is small.
That would be the same for either pi
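A minimal sketch of that packing step, in case it helps (paths are illustrative;
key = original file name, value = the raw JSON):

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallJsonFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory full of small JSON files
    Path packed = new Path(args[1]);     // the one big SequenceFile
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(packed),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (FSDataInputStream in = fs.open(stat.getPath())) {
          IOUtils.copyBytes(in, buf, conf, false);
        }
        // key = original file name, value = the raw JSON record(s)
        writer.append(new Text(stat.getPath().getName()),
            new Text(buf.toString("UTF-8")));
      }
    }
  }
}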
Silly question... then what's meant by the native libraries when you talk about
compression?
On Jun 3, 2013, at 5:27 AM, Harsh J wrote:
> Hi Xu,
>
> HDFS is data agnostic. It does not currently care about what form the
> data of the files are in - whether they are compressed, encrypted,
> ser
like (in mapper)
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But its not working :(
> It would be better to get this thing properly but I wouldnt mind using a hac
Yeah,
I have to agree with Russell. Pig is definitely the way to go on this.
If you want to do it as a Java program, you will have to do some work on the
input string, but it too should be trivial.
How formal do you want to go?
Do you want to strip it down or just find the quote after the text par
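If you do go the Java route, here's a sketch along the lines of the quoted
attempt (assumes one JSON object per input line and that org.json is on the
task classpath):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

public class JsonTextTokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      JSONObject jsn = new JSONObject(value.toString());
      // optString() returns "" instead of blowing up if "text" is missing
      StringTokenizer itr = new StringTokenizer(jsn.optString("text", ""));
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    } catch (org.json.JSONException bad) {
      // skip malformed lines rather than failing the whole task
      context.getCounter("JsonTextTokenizerMapper", "BAD_JSON").increment(1);
    }
  }
}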
I think what he's missing is to change the configurations to point to the new
name node.
It sounds like the new NN has a different IP address from the old NN so the DNs
don't know who to report to...
On May 21, 2013, at 11:23 AM, Todd Lipcon wrote:
> Hi David,
>
> You shouldn't need to do
Drink heavily?
Sorry.
Let me rephrase.
Part of the exercise is for you, the student, to come up with the idea, not to
solicit someone else for a suggestion. This is how you learn.
The exercise is to get you to think about the following:
1) What is Hadoop
2) How does it work
3) Why would you wa
Uhm... sort of...
Niels is essentially correct, and for most of us, just running an NTP daemon
(ntpd) on a server that syncs with a government time source, and then having
your local servers sync to that, will be enough. However... in more detail...
Time is relative. ;-)
Ok... being a bit more serious...
The
namespace.
I'm trying to understand an argument made against HDFS-3370.
Thx
-Mike
On May 16, 2013, at 12:14 AM, Harsh J wrote:
> Do you see viewfs mounts coming useful there (i.e. in place of
> hardlinks across NSes)?
>
> On Thu, May 16, 2013 at 3:49 AM, Michael Segel
>
oesn't sound logical
> - in such a case a person has to build a self failover of URIs for
> said file, which they can simply avoid by using HDFS HA for the
> hosting NN.
>
> On Wed, May 15, 2013 at 7:47 PM, Michael Segel
> wrote:
>> Quick question...
>> So whe
> file.
> To achieve file name redundancy, it is better to have NameNode HA, instead of
> copying it to another namespace. Since Datanodes serve blocks to multiple
> namespace, locality is not an issue and copying file to another namespace
> would not buy you much.
>
>
On May 15, 2013, at 9:24 AM, Lohit wrote:
>
>
> On May 15, 2013, at 7:17 AM, Michael Segel wrote:
>
>> Quick question...
>> So when we have a cluster which has multiple namespaces (multiple name
>> nodes) , why would you have a file in two different namespace
Quick question...
So when we have a cluster which has multiple namespaces (multiple name nodes),
why would you have a file in two different namespaces?
Not sure what you mean...
If you want to put up a small file to be used by each Task in your job (mapper
or reducer)... you could put it up on HDFS.
Or if you're launching your job from an edge node, you could read in the small
file and put it into the distributed cache.
It really depends o
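For what it's worth, a sketch of the distributed cache route with the Hadoop 2
Job API (paths and names are made up; on older MRv1 you'd use
DistributedCache.addCacheFile() plus createSymlink() instead). The '#lookup'
fragment asks the framework to symlink the localized copy into each task's
working directory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheFileExample {

  public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Set<String> keep = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
      // "lookup" is the symlink created from the #fragment below
      try (BufferedReader r = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = r.readLine()) != null) keep.add(line.trim());
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (keep.contains(value.toString().split("\t")[0])) {
        context.write(value, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-file-example");
    job.setJarByClass(CacheFileExample.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // small lookup file already sitting on HDFS (path is illustrative)
    job.addCacheFile(new URI("/shared/lookup.txt#lookup"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}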
I wouldn't.
You end up with a 'Frankencluster' which could become problematic down the
road.
Ever try to debug a port failure on a switch? (It does happen, and it's a bitch.)
Note that you say 'reliable'... older hardware may or may not be reliable
or under warranty.
(How many here build th
I wouldn't go the route of multiple NICs unless you are using MapR.
MapR allows you to do port bonding, or rather use both ports simultaneously.
When you port bond, 1+1 != 2, and then you have some other configuration issues
(unless they've fixed them).
If this is your first cluster... keep it s
Hi,
Your cross join is supported in both Pig and Hive (cross and theta joins).
So there must be code to do this.
Essentially in the reducer you would have your key and then the set of rows
that match the key. You would then perform the cross product on the key's
result set and output them t
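A bare-bones sketch of that reduce-side cross product (names are illustrative;
it assumes each group is small enough to buffer in memory):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossProductReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // The values iterator can only be walked once, so buffer the group first.
    // (Fine for modest groups; huge groups need secondary sort / spilling.)
    List<String> rows = new ArrayList<>();
    for (Text v : values) {
      rows.add(v.toString());
    }
    // Emit the cross product of the key's result set.
    // For a true two-table join you'd tag each row with its source relation
    // and only pair left-side rows with right-side rows.
    for (String left : rows) {
      for (String right : rows) {
        context.write(key, new Text(left + "\t" + right));
      }
    }
  }
}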
"A potential problem could be, that a reduce is going to write files >600MB and
our mapred.child.java.opts is set to ~380MB."
Isn't the minimum heap normally 512MB?
Why not just increase your child heap size, assuming you have enough memory on
the box...
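For example, a hedged sketch of bumping the child heap per job (MRv1 property
name; normally you'd set this in mapred-site.xml, and 1GB is just an
illustrative value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BiggerChildHeap {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // MRv1 knob; on MRv2 the split equivalents are mapreduce.map.java.opts
    // and mapreduce.reduce.java.opts.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    Job job = Job.getInstance(conf, "bigger-child-heap");
    job.setJarByClass(BiggerChildHeap.class);
    // ...wire up mapper/reducer/input/output as usual, then:
    // System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}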
On Mar 8, 2013, at 4:57 AM, Harsh J
I'm partial to using Java and JNI, and then using the distributed cache to push
the native libraries out to each node if not already there.
But that's just me... ;-)
HTH
-Mike
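Roughly what that looks like, as a sketch (library name and path are made up;
assumes the Hadoop 2 / YARN behavior where the '#fragment' becomes a symlink in
the task's working directory):

import java.io.File;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class NativeLibMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Driver side: ship the .so with the job.
  public static void shipNativeLib(Job job) throws Exception {
    job.addCacheFile(new URI("/libs/libimgproc.so#libimgproc.so"));
  }

  @Override
  protected void setup(Context context) {
    // The symlink lands in the task's working directory, so load it from
    // there before any map() call runs.
    System.load(new File("libimgproc.so").getAbsolutePath());
  }

  // private native byte[] convert(byte[] raw);  // implemented in libimgproc.so
}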
On Mar 3, 2013, at 6:02 PM, Julian Bui wrote:
> Hi hadoop users,
>
> Trying to figure out which interface would be b
Your job.xml file is kept for a set period of time.
I believe the others are automatically removed.
You can easily access the job.xml file from the JT webpage.
On Mar 1, 2013, at 4:14 AM, Ling Kun wrote:
> Dear all,
> In order to know more about the files creation and size when the job is
You can encrypt the splits separately.
The issue of key management is actually a layer above this.
Looks like the research is on the encryption process with a known key.
The layer above would handle key management which can be done a couple of
different ways...
On Feb 26, 2013, at 1:52 PM, jav
Nope, HBase wasn't mentioned.
The OP could be talking about using external tables and Hive.
The OP could still be stuck in the RDBMs world and hasn't flattened his data
yet.
2 million records? Kinda small dontcha think?
Not Enough Information ...
On Feb 18, 2013, at 8:58 AM, Hemanth Yamijal
cases.
Also how much memory do you have on each machine?
Tuning is going to be hardware-specific, and without really understanding what
each parameter does, you can hurt performance.
Michael Segel | (m) 312.755.9623
Segel and Associates
The opinions expressed here are mine, while they
se hadoop.
Michael Segel | (m) 312.755.9623
Segel and Associates
Why not?
Who said you had to parallelize anything?
On Feb 15, 2013, at 12:09 PM, Jay Vyas wrote:
> i don't think you can't do an embarassingly parallel sort of a randomly
> ordered file without merging results.
>
> However, if you know that the file is psudeoordered:
>
> 1123
> 1
the conversion.
>
> Thank you for your time and guidance.
>
> Regards,
>
> Jagat Singh
>
> 1)
> http://docs.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
> 2) http://sourceforge.net/projects/jrecord/
> 3) http://sourceforge.net/projects/cb2java/
>
>
Michael Segel | (m) 312.755.9623
Segel and Associates
>> limit connects and connection frequency).
>>
>>
>>
>> If this job runs from multiple reducers on the same node, those per-host
>> limits will be violated. Also, this is a shared environment and I don’t
>> want long running network bound jobs uselessly taking up all reduce slots.
>
>
>
> --
> Harsh J
>
Michael Segel | (m) 312.755.9623
Segel and Associates
y" for a mapreduce job ?
>
> I have a job running multiple tasks and I want them to be able to use both
> Text and IntWritable as output key classes.
>
> Any suggestions ?
>
> Thanks,
>
> Amit.
>
Michael Segel | (m) 312.755.9623
Segel and Associates
I'm thinking 'Downfall'
But I could be wrong.
On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wrote:
> Who can tell me what is the name of the original film? Thanks!
>
> Yongzhi
>
>
> On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq wrote:
> I am sure you will suffer from severe stomach ache after
t your job required 28 reducers and it was using the full
resources of the machines.
On Jan 11, 2013, at 5:53 PM, Roy Smith wrote:
> On Jan 11, 2013, at 6:20 PM, Michael Segel wrote:
>
>> Hi,
>>
>> First, not enough information.
>>
>> 1) EC2 got it.
>
Hi,
First, not enough information.
1) EC2 got it.
2) Which flavor of Hadoop? Is this EMR as well?
3) How many slots did you configure in your mapred-site.xml?
AWS EC2 cores aren't going to be hyperthreaded cores so 8 cores means that you
will probably have 6 cores for slots.
With 16 reduc
He's got two different queues.
1) a queue in the capacity scheduler, so he can have a set of M/R tasks running
in the background to pull data off of...
2) a durable queue that receives the inbound json files to be processed.
You can have a custom-written listener that pulls data from the queue and
Uhm...
Well, you can talk to Microsoft and Hortonworks about Microsoft as a platform.
Depending on the power of your laptop, you could create a VM and run Hadoop in
pseudo-distributed mode there.
You could also get an Amazon Web Services account and build a small cluster via
EMR...
In ter
You can't really say that.
Too many variables in terms of networking. (Like what other traffic is
occurring at the same time? Or who else is attached to the NAS?)
On Jan 3, 2013, at 5:09 PM, John Lilley wrote:
> Unless the Hadoop processing and the OneFS storage are co-located, MapReduce
> ca
Ed,
There are some who are of the opinion that these certifications are worthless.
I tend to disagree; however, I don't think that they are the best way to
demonstrate one's abilities.
IMHO they should provide a baseline.
We have seen these types of questions on the list and in the forums.
ners, battery backup... etc ... I was
only running 8 nodes. So YMMV.
On Dec 21, 2012, at 1:37 PM, Lance Norskog wrote:
> You will also be raided by the DEA- too much power for a residence.
>
> On 12/20/2012 07:56 AM, Ted Dunning wrote:
>>
>>
>>
>> On T
While Ted ignores that the world is going to end before X-Mas, he does hit the
crux of the matter head on.
If you don't have a place to put it, the cost of setting it up would kill you,
not to mention that you can get newer hardware which is better suited for less.
Having said that... if you
Hi,
Just a reminder... just because you can do something (or rather, in this case,
not do something) doesn't mean that it's a good idea.
The SN is there for a reason. Maybe if you're on an EMR cluster that will be
taken down at the end of the job or end of the day not having the SN running is
O
500 TB?
How many nodes in the cluster? Is this attached storage or is it in an array?
I mean if you have 4 nodes for a total of 2PB, what happens when you lose 1
node?
On Dec 12, 2012, at 9:02 AM, Mohammad Tariq wrote:
> Hello list,
>
> I don't know if this question makes any s
n IBM where you
need specific IBM security stuff.
Now I could be wrong but that's my first take on it.
On Dec 11, 2012, at 8:50 AM, "Emile Kao" wrote:
> No, this is the general available version...
>
> Original-Nachricht
>> Datum: Tue, 11 Dec 201
Well, on the surface...
It looks like it's either a missing class, or you don't have your classpath set
up right.
I'm assuming you got this version of Hadoop from IBM, so I would suggest
contacting their support and opening up a ticket.
On Dec 11, 2012, at 8:23 AM, Emile Kao wrote:
> He
locations by myself.
>>>>> But There needs to be one mapper running in each node in some cases,
>>>>> so I need a strict way to do it.
>>>>>
>>>>> So, locations is taken care of by JobTracker(scheduler), but it is not
>>>>>
So how many people here are old enough to remember the song 'Hotel California'
? :-P
On Nov 28, 2012, at 11:18 AM, Ted Dunning wrote:
> Also, the moderators don't seem to read anything that goes by.
>
>
> On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran
> wrote:
> In this group once anyo
Here's the simple thing to consider...
If you are running M/R jobs against the data... HBase hands down is the winner.
If you are looking at a stand-alone cluster... Cassandra wins. HBase is still
a fickle beast.
Of course I just bottom lined it. :-)
On Nov 29, 2012, at 10:51 PM, Lance N
On Nov 29, 2012, at 4:59 AM, a...@hsk.hk wrote:
> Hi
>
> Since NN and SNN are used in the same server:
>
> 1) If i use the default "dfs.secondary.http.address", i.e. 0.0.0.0:50090
> (commented out dfs.secondary.http.address property)
>
> I got : Exception in thread "main" java.lang.Ill
Not really the best tool. Fuse? (Forget the name.)
You do have other options. I saw one group took an open source FTP server and
then extended it to write to HDFS. YMMV, however the code to open a file on
HDFS and to write to it is pretty trivial and straightforward. Not sure why
Cloudera o
me, and I did not know what to answer. I will ask them your
> questions.
>
> Thank you.
> Mark
>
> On Wed, Nov 28, 2012 at 7:41 AM, Michael Segel
> wrote:
> Silly question, why are you worrying about this?
>
> In a production the odds of getting a replacement
Mappers? Uhm... yes you can do it.
Yes it is non-trivial.
Yes, it is not recommended.
I think we talk a bit about this in an InfoQ article written by Boris
Lublinsky.
It's kind of wild when your entire cluster map goes red in Ganglia... :-)
On Nov 28, 2012, at 2:41 AM, Harsh J wrote:
> Hi,
Silly question, why are you worrying about this?
In production, getting a replacement disk into service within 10 minutes after
a fault is detected is highly improbable.
Why do you care that the blocks are replicated to another node?
After you replace the disk, bounce the node (rest
To go back to the OP's initial position.
2 new nodes where data hasn't yet been 'balanced'.
First, that's a small window of time.
But to answer your question...
The JT will attempt to schedule work to where the data is. If you're using 3X
replication, there are 3 nodes where the block resid
> printout and few from mails and few from googling and few from sites and few
> from some of my friends...
>
> regards,
> Rams
>
> On Wed, Nov 7, 2012 at 10:57 PM, Michael Segel
> wrote:
> Ok...
> Where are you pulling these questions from?
>
> Seriously.
>
for this
> question)... If you know more detail on that please share..
>
> Note : I forgot from where this question I taken :)
>
> regards,
> Rams.
>
> On Wed, Nov 7, 2012 at 10:01 PM, Michael Segel
> wrote:
> 0 Custer didn't run. He got surrounded and then
Ok...
Where are you pulling these questions from?
Seriously.
On Nov 7, 2012, at 11:21 AM, Ramasubramanian Narayanan
wrote:
> Hi,
>
>I came across the following question in some sites and the answer that
> they provided seems to be wrong according to me... I might be wrong... Can
> s
0 Custer didn't run. He got surrounded and then massacred. :-P (See Custer's
last stand at Little Big Horn)
Ok... plain text files: 100 files of 2 blocks each would by default attempt to
schedule 200 mappers.
Is this one of those online Cert questions?
On Nov 7, 2012, at 10:20 AM, Ramasubraman
ning just Hadoop, you could have a little swap. Running HBase,
> fuggit about it." -- could you give a bit more information about what do you
> mean swap and why forget for HBase?
>
> regards,
> Lin
>
>
> On Tue, Nov 6, 2012 at 12:46 PM, Michael Segel
> wrote:
>
Mappers and Reducers are separate JVM processes.
And yes, you need to take into account the amount of memory on the machine(s)
when you configure the number of slots.
If you are running just Hadoop, you could have a little swap. Running HBase,
fuggit about it.
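A rough back-of-the-envelope, with purely illustrative numbers:

  48 GB box
  - 4 GB  OS and miscellaneous daemons
  - 1 GB  DataNode
  - 1 GB  TaskTracker
  - 8 GB  HBase RegionServer (if co-located)
  = 34 GB left for task slots

At roughly 1 GB of child heap per slot, that's ~34 slots total (say 22 map + 12
reduce); without HBase you'd have closer to 42 to play with.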
On Nov 5, 2012, at 7:12 PM, Lin Ma
You have other options.
You could create a secondary cluster.
You could also look into Cleversafe and what they are doing with Hadoop.
Here's the sad thing about backing up to tape... you can dump a couple of tens
of TB to tape.
You lose your system. How long will it take to recover?
And th
"However in production clustes the jvm size is marked final to prevent abuses
that may lead to OOMs."
Not necessarily.
On Nov 1, 2012, at 6:43 AM, Bejoy Ks wrote:
> However in production clustes the jvm size is marked final to prevent abuses
> that may lead to OOMs.
cleverscale
> Sent with Sparrow
>
> On Monday 29 October 2012 at 20:04, Michael Segel wrote:
>
>> how many times did you test it?
>>
>> need to rule out aberrations.
>>
>> On Oct 29, 2012, at 11:30 AM, Harsh J wrote:
>>
>>> On your s
How many times did you test it?
You need to rule out aberrations.
On Oct 29, 2012, at 11:30 AM, Harsh J wrote:
> On your second low-memory NM instance, did you ensure to lower the
> yarn.nodemanager.resource.memory-mb property specifically to avoid
> swapping due to excessive resource grants? The d
So...
Data locality only works when you actually have data on the cluster itself.
Otherwise, how can the data be local?
Assuming 3X replication, and you're not doing a custom split and your input
file is splittable...
You will split along the block delineation. So if your input file has 5
b
They were official, back around 2009, hence the first API was deprecated.
The reason that they removed the deprecation was that the 'new' API didn't have
all of the features/methods of the old APIs.
I learned using the new APIs, and ToolRunner is your friend.
So I would suggest using the new A
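For reference, a bare-bones driver skeleton with the new API and ToolRunner
(class name and job wiring are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already has any -D options ToolRunner parsed off the command line
    Job job = Job.getInstance(getConf(), "my-job");
    job.setJarByClass(MyJobDriver.class);
    // job.setMapperClass(...); job.setReducerClass(...); etc.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}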
unter in the middle of the process of a job is undefined and internal
> behavior, it is more reliable to read counter after the whole job completes?
> Agree?
>
> regards,
> Lin
>
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel
> wrote:
>
> On Oct 21, 2012, at 1:
Try upping the child to 1.5GB or more.
On Oct 21, 2012, at 8:18 AM, Subash D'Souza wrote:
> I'm running CDH 4 on a 4 node cluster each with 96 G of RAM. Up until last
> week the cluster was running until there was an error in the name node log
> file and I had to reformat it put the data back
a monitoring process and watch the
counters. If, let's say, an error counter hits a predetermined threshold, you
could then issue a 'hadoop job -kill ' command.
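A sketch of that client-side watchdog (counter group/name and the threshold are
made up; killJob() has the same effect as the CLI kill):

import org.apache.hadoop.mapreduce.Job;

public class JobWatchdog {
  public static void submitAndWatch(Job job, long maxErrors) throws Exception {
    job.submit();                              // non-blocking submit
    while (!job.isComplete()) {
      long errors = job.getCounters()
          .findCounter("MyCounters", "ERRORS").getValue();
      if (errors > maxErrors) {
        job.killJob();                         // same as 'hadoop job -kill <id>'
        break;
      }
      Thread.sleep(30000);                     // don't hammer the JT/RM
    }
  }
}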
>
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel
> wrote:
>
> On Oct 19, 2012,
here the car has
an RFID chip but doesn't trip the sensor.) Pushing the data in a map/reduce job
would require the use of counters.
Does that help?
-Mike
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel
> wrote:
> Yeah, sorry...
>
> I
e if you could help to clarify a bit.
>
> regards,
> Lin
>
> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel
> wrote:
>
> On Oct 19, 2012, at 11:27 AM, Lin Ma wrote:
>
>> Hi Mike,
>>
>> Thanks for the detailed reply. Two quick questions/comm
n example, if I want to count the number of quality errors and
then fail after X number of errors, I can't use Global counters to do this.
> regards,
> Lin
>
> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel
> wrote:
> As I understand it... each Task has its own counters and
As I understand it... each Task has its own counters, which are independently
updated. As the tasks report back to the JT, they update the counters' status.
The JT then aggregates them.
In terms of performance, counters take up some memory in the JT, so while it's
OK to use them, if you abuse them,
I haven't played with a NetApp box, but the way it has been explained to me is
that your SAN appears as if it's direct attached storage.
It's possible, based on drives and other hardware, plus it looks like they are
focusing on read times only.
I'd contact a NetApp rep for a better answer.
Act
When you create your jar using NetBeans, do you include the Hadoop libraries in
the jar you create?
This would increase the size of the jar and in this case, size does matter.
On Oct 18, 2012, at 5:06 AM, sudha sadhasivam wrote:
>
>
> Sir
>
> We are trying to combine Hadoop and CUDA. When
Please don't hijack a thread. Start your own discussion.
On Oct 16, 2012, at 1:34 AM, sudha sadhasivam wrote:
>
> The code executes, but time taken for execution is high
> Does not show any advantages in two levels of parallelism
> G Sudha
>
> --- On Tue, 10/16/12, Manoj Babu wrote:
>
> From
ck response.
>
> The idea is that we are selling the encryption product for customers who use
> HDFS. Hence, encryption is a requirement.
>
> Any other suggestions.
>
> Sam
>
> From: Michael Segel [michael_se...@hotmail.com]
>
You don't need a UDF...
You encrypt the string 'Ann' first, then use that encrypted value in the SELECT
statement.
That should make things a bit simpler.
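A sketch of what that looks like client-side (assumes the column was written
with a deterministic cipher so equality still matches; AES/ECB, the key, and
the table/column names here are purely illustrative):

import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class EncryptedPredicate {
  public static void main(String[] args) throws Exception {
    byte[] key = "0123456789abcdef".getBytes("UTF-8");   // 16-byte demo key
    Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
    c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
    // assumes enc_name was stored as the Base64 of the same AES/ECB output
    String encAnn = Base64.getEncoder()
        .encodeToString(c.doFinal("Ann".getBytes("UTF-8")));
    String hql = "SELECT * FROM customers WHERE enc_name = '" + encAnn + "'";
    System.out.println(hql);
  }
}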
On Oct 17, 2012, at 8:04 PM, Sam Mohamed wrote:
> I have some encrypted data in an HDFS csv, that I've created a Hive table
> for, an
Meh.
If you are worried about the memory constraints of a Linux system, I'd say go
with MapR and their CLDB.
I just did a quick look at Supermicro servers and found that on a 2U server
768GB was the max.
So how many blocks can you store in that much memory? I only have 10 fingers
and toes so
ache.org and my web browser
> say it's just you
>
>
> On Wed, Oct 17, 2012 at 9:02 AM, Michael Segel
> wrote:
> I'm having issues connecting to the API pages off the Apache site.
>
> Is it just me?
>
> Thx
>
> -Mike
>
>
>
I'm having issues connecting to the API pages off the Apache site.
Is it just me?
Thx
-Mike
Build and store the tree in some sort of globally accessible space?
Like HBase, or HDFS?
On Oct 13, 2012, at 9:46 AM, Kyle Moses wrote:
> Chris,
> Thanks for the suggestion on serializing the radix tree and your thoughts on
> the memory issue. I'm planning to test a few different solutions a
. the ratio could change over time as the CPUs become more
efficient and faster.
On Oct 12, 2012, at 9:52 PM, ranjith raghunath
wrote:
> Does hypertheading affect this ratio?
>
> On Oct 12, 2012 9:36 PM, "Michael Segel" wrote:
> First, the obvious caveat... YMMV
>
First, the obvious caveat... YMMV
Having said that.
The key here is to take a look across the various jobs that you will run. Some
may be more CPU intensive, others more I/O intensive.
If you monitor these jobs via Ganglia, when you have too few spindles you
should see the wait cpu rise on t
e whereas I'm looking
> for an optimization between reduce and map.
>
> Jim
>
> On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel
> wrote:
>> Well I was thinking ...
>>
>> Map -> Combiner -> Reducer -> Identity Mapper -> combiner -> reducer ->
't have the required functionality.
>
> If I'm missing anything and.or if there are folks who used Giraph or
> Hama and think that they might serve the purpose, I'd be glad to hear
> more.
>
> Jim
>
> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel
> wrote:
&
I don't believe that Hama would suffice.
In terms of M/R where you want to chain reducers...
Can you chain combiners? (I don't think so, but you never know)
If not, you end up with a series of M/R jobs and the Mappers are just identity
mappers.
Or you could use HBase, with a small caveat...
se
>> and you also need to consider sla for the users so the whole is not trivial.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu wrote:
>>>
>>> Very good explanation,
>>> If there is
Rack local means that while the data isn't local to the node running the task,
it is still on the same rack.
(It's meaningless unless you've set up rack awareness, because otherwise all of
the machines end up on the default rack.)
Data local means that the task is running local to the machine that contains
Yup, I hate that when it happens.
You tend to see this more with Avro than anything else.
The issue is that in Java, the first class loaded wins. So when Hadoop loads
1.4 first, you can't unload it and replace it with 1.7.
The only solution that we found to be workable is to replace the jars