Re: Bug in ORC file code? (OrcSerde)?

2016-10-19 Thread Michael Segel
On Oct 19, 2016, at 11:00 AM, Michael Segel wrote: > > Hi, > Since I am not on the ORC mailing list… and since the ORC java code is in the > hive APIs… this seems like a good place to start. ;-) > > > So… > > Ran into a little problem… > > One of my develo

Bug in ORC file code? (OrcSerde)?

2016-10-19 Thread Michael Segel
Hi, Since I am not on the ORC mailing list… and since the ORC java code is in the hive APIs… this seems like a good place to start. ;-) So… Ran into a little problem… One of my developers was writing a map/reduce job to read records from a source and after some filter, write the result se

Re: Trusted-realm vs default-realm kerberos issue

2015-03-24 Thread Michael Segel
thought, that is purely accidental. Use at your own risk. Michael Segel michael_segel (AT) hotmail.com

Re: Can Pseudo-Distributed Mode take advantage of multi-core structure?

2015-03-24 Thread Michael Segel
Short answer yes. > On Mar 24, 2015, at 11:53 AM, Xuzhan Sun wrote: > > Hello, > > I want to do some test on my single node cluster for Speed. I know it is easy > to set up the Pseudo-Distributed Mode, and Hadoop will start one Java process > for each single map/reduce. > > My question is: i

Re: change yarn application priority

2014-06-03 Thread Michael Segel
WRT capacity scheduler, it's not so much changing the priority of a job, but allowing for pre-emption. Note that I guess you could raise the one job's priority, and then the other job's priority so that when a task finishes the other job gets the next slot. However, you're still stuck waiting an

How can a task know if it's running as a MR1 or MR2 job?

2014-06-03 Thread Michael Segel
Just a quick question... Suppose you have a M/R job running. How does the Mapper or Reducer task know or find out if it's running as a M/R 1 or M/R 2 job? I would suspect the job context would hold that information... but on first glance I didn't see it. So what am I missing? Thx -Mike

Re: question about preserving data locality in MapReduce with Yarn

2013-10-28 Thread Michael Segel
How do you know where the data exists when you begin? Sent from a remote device. Please excuse any typos... Mike Segel > On Oct 28, 2013, at 8:57 PM, "ricky lee" wrote: > > Hi, > > I have a question about maintaining data locality in a MapReduce job launched > through Yarn. Based on the Yarn

Re: Multidata center support

2013-09-04 Thread Michael Segel
The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. Use at your own risk. Michael Segel michael_segel (AT) hotmail.com

Re: Abstraction layer to support both YARN and Mesos

2013-07-29 Thread Michael Segel
Actually, I am interested. Lots of different Apache top level projects seem to overlap and it can be confusing. It's very easy for a good technology to get starved because no one asks how to combine these features into the framework. On Jul 29, 2013, at 9:58 AM, Tsuyoshi OZAWA wrote: > I

Re: Abstraction layer to support both YARN and Mesos

2013-07-29 Thread Michael Segel
Actually, I am interested. Lots of different Apache top level projects seem to overlap and it can be confusing. It's very easy for a good technology to get starved because no one asks how to combine these features into the framework. On Jul 29, 2013, at 10:06 AM, Michael Segel wrote

Re: Saving counters in Mapfile

2013-07-23 Thread Michael Segel
Uhm... You want to save the counters as in counts per job run or something? (Remember HDFS == WORM) Then you could do a sequence file and then use something like HBase to manage the index. (Every time you add a set of counters, you have a new file and a new index.) Heck you could use HBase f

Re: data loss after cluster wide power loss

2013-07-03 Thread Michael Segel
Dave, How did you lose power to the entire cluster? I realize that this question goes beyond HBase, but is an Ops question. Do you have redundant power sources and redundant power supplies to the racks and machines in the cluster? On Jul 2, 2013, at 7:42 AM, Dave Latham wrote: > Hi Uma,

Re: Aggregating data nested into JSON documents

2013-06-13 Thread Michael Segel
> I'll keep looking at Pig with ElephantBird. > Thanks, > > -Jorge > > > > > > On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel > wrote: > Hi.. > > Have you thought about HBase? > > I would suggest that if you're using Hive o

Re: SSD support in HDFS

2013-06-12 Thread Michael Segel
I could have sworn there was a thread on this already. (Maybe the HBase list?) Andrew P. kinda nailed it when he talked about the fact that you had to write the replication(s). If you wanted improved performance, why not look at the hybrid drives that have a small SSD buffer and a spinning di

Re: recovery accidently deleted pig script

2013-06-12 Thread Michael Segel
Where was the pig script? On HDFS? How often does your cluster clean up the trash? (Deleted stuff doesn't get cleaned up when the file is deleted... ) It's a configurable setting so YMMV On Jun 12, 2013, at 8:58 PM, feng jiang wrote: > Hi everyone, > > We have a pig script scheduled running

Re: Aggregating data nested into JSON documents

2013-06-12 Thread Michael Segel
Hi.. Have you thought about HBase? I would suggest that if you're using Hive or Pig, to look at taking these files and putting the JSON records into a sequence file. Or set of sequence files (Then look at HBase to help index them...) 200KB is small. That would be the same for either pi

Re: Does libhdfs c/c++ api support read/write compressed file

2013-06-03 Thread Michael Segel
Silly question... then what's meant by the native libraries when you talk about compression? On Jun 3, 2013, at 5:27 AM, Harsh J wrote: > Hi Xu, > > HDFS is data agnostic. It does not currently care about what form the > data of the files are in - whether they are compressed, encrypted, > ser

Re: Reading json format input

2013-05-29 Thread Michael Segel
like (in mapper) > JSONObject jsn = new JSONObject(value.toString()); > > String text = (String) jsn.get("text"); > StringTokenizer itr = new StringTokenizer(text); > > But its not working :( > It would be better to get this thing properly but I wouldnt mind using a hac
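A minimal sketch of what the asker's snippet might look like as a working new-API mapper, assuming the org.json library is shipped with the job and each input line is exactly one JSON document with a "text" field (both assumptions, not confirmed in the thread):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.json.JSONObject;

    public class JsonWordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Parse the whole line as JSON, then tokenize just the "text" field.
        JSONObject jsn = new JSONObject(value.toString());
        String text = jsn.getString("text");
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }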

Re: Reading json format input

2013-05-29 Thread Michael Segel
Yeah, I have to agree w Russell. Pig is definitely the way to go on this. If you want to do it as a Java program you will have to do some work on the input string but it too should be trivial. How formal do you want to go? Do you want to strip it down or just find the quote after the text par

Re: Recovering the namenode from failure

2013-05-21 Thread Michael Segel
I think what he's missing is to change the configurations to point to the new name node. It sounds like the new NN has a different IP address from the old NN so the DNs don't know who to report to... On May 21, 2013, at 11:23 AM, Todd Lipcon wrote: > Hi David, > > You shouldn't need to do

Re: Project ideas

2013-05-21 Thread Michael Segel
Drink heavily? Sorry. Let me rephrase. Part of the exercise is for you, the student, to come up with the idea. Not to solicit someone else for a suggestion. This is how you learn. The exercise is to get you to think about the following: 1) What is Hadoop 2) How does it work 3) Why would you wa

Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-16 Thread Michael Segel
Uhm... sort of... Niels is essentially correct and for most of us, just starting an NTPd on a server that syncs with a government clock and then your local servers sync to that... will be enough. However... in more detail... Time is relative. ;-) Ok... being a bit more serious... The

Re: Question about Name Spaces…

2013-05-16 Thread Michael Segel
namespace. I'm trying to understand an argument made against HDFS-3370. Thx -Mike On May 16, 2013, at 12:14 AM, Harsh J wrote: > Do you see viewfs mounts coming useful there (i.e. in place of > hardlinks across NSes)? > > On Thu, May 16, 2013 at 3:49 AM, Michael Segel >

Re: Question about Name Spaces…

2013-05-15 Thread Michael Segel
oesn't sound logical > - in such a case a person has to build a self failover of URIs for > said file, which they can simply avoid by using HDFS HA for the > hosting NN. > > On Wed, May 15, 2013 at 7:47 PM, Michael Segel > wrote: >> Quick question... >> So whe

Hardlinkes (See HDFS-3370) wuz Re: Question about Name Spaces…

2013-05-15 Thread Michael Segel
> file. > To achieve file name redundancy, it is better to have NameNode HA, instead of > copying it to another namespace. Since Datanodes serve blocks to multiple > namespace, locality is not an issue and copying file to another namespace > would not buy you much. > >

Re: Question about Name Spaces…

2013-05-15 Thread Michael Segel
On May 15, 2013, at 9:24 AM, Lohit wrote: > > > On May 15, 2013, at 7:17 AM, Michael Segel wrote: > >> Quick question... >> So when we have a cluster which has multiple namespaces (multiple name >> nodes) , why would you have a file in two different namespace

Question about Name Spaces…

2013-05-15 Thread Michael Segel
Quick question... So when we have a cluster which has multiple namespaces (multiple name nodes) , why would you have a file in two different namespaces?

Re: Benefits of Hadoop Distributed Cache

2013-05-07 Thread Michael Segel
Not sure what you mean... If you want to put up a small file to be used by each Task in your job (mapper or reducer)... you could put it up on HDFS. Or if you're launching your job from an edge node, you could read in the small file and put it into the distributed cache. It really depends o

Re: Hardware Selection for Hadoop

2013-05-07 Thread Michael Segel
I wouldn't. You end up with a 'Frankencluster' which could become problematic down the road. Ever try to debug a port failure on a switch? (It does happen and it's a bitch.) Note that you say 'reliable'... older hardware may or may not be reliable or under warranty. (How many here build th

Re: Hardware Selection for Hadoop

2013-05-07 Thread Michael Segel
I wouldn't go the route of multiple nics unless you are using MapR. MapR allows you to do port bonding or rather use both ports simultaneously. When you port bond, 1+1 != 2, and then you have some other configuration issues. (Unless they've fixed them) If this is your first cluster... keep it s

Re: How can I record some position of context in Reduce()?

2013-04-09 Thread Michael Segel
Hi, Your cross join is supported in both pig and hive. (Cross, and Theta joins) So there must be code to do this. Essentially in the reducer you would have your key and then the set of rows that match the key. You would then perform the cross product on the key's result set and output them t
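A rough sketch of the per-key cross product he describes, with the usual caveat that each key's group has to fit in the reducer's heap (the class name and tab-joined output layout are illustrative, not from the thread):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CrossProductReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        // The iterable can only be walked once, so buffer the rows first.
        List<String> rows = new ArrayList<String>();
        for (Text v : values) {
          rows.add(v.toString());
        }
        // Emit every pairwise combination for this key.
        for (String left : rows) {
          for (String right : rows) {
            context.write(key, new Text(left + "\t" + right));
          }
        }
      }
    }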

Re: OutOfMemory during Plain Java MapReduce

2013-03-08 Thread Michael Segel
"A potential problem could be, that a reduce is going to write files >600MB and our mapred.child.java.opts is set to ~380MB." Isn't the minimum heap normally 512MB? Why not just increase your child heap size, assuming you have enough memory on the box... On Mar 8, 2013, at 4:57 AM, Harsh J

Re: hadoop pipes or java with jni?

2013-03-03 Thread Michael Segel
I'm partial to using Java and JNI and then use the distributed cache to push the native libraries out to each node if not already there. But that's just me... ;-) HTH -Mike On Mar 3, 2013, at 6:02 PM, Julian Bui wrote: > Hi hadoop users, > > Trying to figure out which interface would be b
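A sketch of that pattern, assuming the Hadoop 2 Job API: ship the native library through the distributed cache, symlink it into each task's working directory, and load it via JNI. The HDFS path and library name are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NativeLibDriver {
      public static Job createJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "native-lib-example");
        // The "#libmystuff.so" fragment symlinks the cached file into the
        // task's working directory under that name.
        job.addCacheFile(new URI("hdfs:///libs/libmystuff.so#libmystuff.so"));
        return job;
      }
    }

    // Then inside the Mapper, load the symlinked library once per JVM:
    //   static { System.load(new java.io.File("libmystuff.so").getAbsolutePath()); }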

Re: Is there a way to keep all intermediate files there after the MapReduce Job run?

2013-03-01 Thread Michael Segel
Your job.xml file is kept for a set period of time. I believe the others are automatically removed. You can easily access the job.xml file from the JT webpage. On Mar 1, 2013, at 4:14 AM, Ling Kun wrote: > Dear all, > In order to know more about the files creation and size when the job is

Re: Encryption in HDFS

2013-02-27 Thread Michael Segel
You can encrypt the splits separately. The issue of key management is actually a layer above this. Looks like the research is on the encryption process w a known key. The layer above would handle key management which can be done a couple of different ways... On Feb 26, 2013, at 1:52 PM, jav

Re: Database insertion by HAdoop

2013-02-18 Thread Michael Segel
Nope HBase wasn't mentioned. The OP could be talking about using external tables and Hive. The OP could still be stuck in the RDBMs world and hasn't flattened his data yet. 2 million records? Kinda small dontcha think? Not Enough Information ... On Feb 18, 2013, at 8:58 AM, Hemanth Yamijal

Re: why my test result on dfs short circuit read is slower?

2013-02-18 Thread Michael Segel
cases. Also how much memory do you have on each machine? Tuning is going to be hardware specific and without really understanding what each parameter does, you can hurt performance. Michael Segel | (m) 312.755.9623 Segel and Associates The opinions expressed here are mine, while they

Re: Sorting huge text files in Hadoop

2013-02-15 Thread Michael Segel
se hadoop. Michael Segel | (m) 312.755.9623 Segel and Associates

Re: Sorting huge text files in Hadoop

2013-02-15 Thread Michael Segel
Why not? Who said you had to parallelize anything? On Feb 15, 2013, at 12:09 PM, Jay Vyas wrote: > i don't think you can't do an embarassingly parallel sort of a randomly > ordered file without merging results. > > However, if you know that the file is psudeoordered: > > 1123 > 1

Re: Mainframe to ASCII conversion

2013-02-11 Thread Michael Segel
the conversion. > > Thank you for your time and guidance. > > Regards, > > Jagat Singh > > 1) > http://docs.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html > 2) http://sourceforge.net/projects/jrecord/ > 3) http://sourceforge.net/projects/cb2java/ > > Michael Segel | (m) 312.755.9623 Segel and Associates

Re: How can I limit reducers to one-per-node?

2013-02-10 Thread Michael Segel
>> limit connects and connection frequency). >> >> >> >> If this job runs from multiple reducers on the same node, those per-host >> limits will be violated. Also, this is a shared environment and I don’t >> want long running network bound jobs uselessly taking up all reduce slots. > > > > -- > Harsh J > Michael Segel | (m) 312.755.9623 Segel and Associates

Re: Generic output key class

2013-02-10 Thread Michael Segel
y" for a mapreduce job ? > > I have a job running multiple tasks and I want them to be able to use both > Text and IntWritable as output key classes. > > Any suggestions ? > > Thanks, > > Amit. > Michael Segel | (m) 312.755.9623 Segel and Associates

Re: On a lighter note

2013-01-17 Thread Michael Segel
I'm thinking 'Downfall'. But I could be wrong. On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wrote: > Who can tell me what is the name of the original film? Thanks! > > Yongzhi > > > On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq wrote: > I am sure you will suffer from severe stomach ache after

Re: How does hadoop decide how many reducers to run?

2013-01-12 Thread Michael Segel
t your job required 28 reducers and it was using the full resources of the machines. On Jan 11, 2013, at 5:53 PM, Roy Smith wrote: > On Jan 11, 2013, at 6:20 PM, Michael Segel wrote: > >> Hi, >> >> First, not enough information. >> >> 1) EC2 got it. >

Re: How does hadoop decide how many reducers to run?

2013-01-11 Thread Michael Segel
Hi, First, not enough information. 1) EC2 got it. 2) Which flavor of Hadoop? Is this EMR as well? 3) How many slots did you configure in your mapred-site.xml? AWS EC2 cores aren't going to be hyperthreaded cores so 8 cores means that you will probably have 6 cores for slots. With 16 reduc

Re: queues in haddop

2013-01-11 Thread Michael Segel
He's got two different queues. 1) queue in capacity scheduler so he can have a set of M/R tasks running in the background to pull data off of... 2) a durable queue that receives the inbound json files to be processed. You can have a custom-written listener that pulls data from the queue and

Re: Hello and request some advice.

2013-01-04 Thread Michael Segel
Uhm... Well, you can talk to Microsoft and Hortonworks about Microsoft as a platform. Depending on the power of your laptop, you could create a VM and run hadoop in a pseudo distributed mode there. You could also get an Amazon Web Services account and build a small cluster via EMR... In ter

Re: Hadoop throughput question

2013-01-03 Thread Michael Segel
You can't really say that. Too many variables in terms of networking. (Like what other traffic is occurring at the same time? Or who else is attached to the NAS?) On Jan 3, 2013, at 5:09 PM, John Lilley wrote: > Unless the Hadoop processing and the OneFS storage are co-located, MapReduce > ca

Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-30 Thread Michael Segel
Ed, There are some who are of the opinion that these certifications are worthless. I tend to disagree, however, I don't think that they are the best way to demonstrate one's abilities. IMHO they should provide a baseline. We have seen these types of questions on the list and in the forums.

Re: What should I do with a 48-node cluster

2012-12-22 Thread Michael Segel
ners, battery backup... etc ... I was only running 8 nodes. So YMMV. On Dec 21, 2012, at 1:37 PM, Lance Norskog wrote: > You will also be raided by the DEA- too much power for a residence. > > On 12/20/2012 07:56 AM, Ted Dunning wrote: >> >> >> >> On T

Re: What should I do with a 48-node cluster

2012-12-20 Thread Michael Segel
While Ted ignores that the world is going to end before X-Mas, he does hit the crux of the matter head on. If you don't have a place to put it, the cost of setting it up would kill you, not to mention that you can get newer hardware which is better suited for less. Having said that... if you

Re: Is it necessary to run secondary namenode when starting HDFS?

2012-12-17 Thread Michael Segel
Hi, Just a reminder... just because you can do something or rather in this case, not do something, doesn't mean that it's a good idea. The SN is there for a reason. Maybe if you're on an EMR cluster that will be taken down at the end of the job or end of the day not having the SN running is O

Re: Sane max storage size for DN

2012-12-12 Thread Michael Segel
500 TB? How many nodes in the cluster? Is this attached storage or is it in an array? I mean if you have 4 nodes for a total of 2PB, what happens when you lose 1 node? On Dec 12, 2012, at 9:02 AM, Mohammad Tariq wrote: > Hello list, > > I don't know if this question makes any s

Re: using hadoop on zLinux (Linux on S390)

2012-12-11 Thread Michael Segel
n IBM where you need specific IBM security stuff. Now I could be wrong but that's my first take on it. On Dec 11, 2012, at 8:50 AM, "Emile Kao" wrote: > No, this is the general available version... > > Original-Nachricht >> Datum: Tue, 11 Dec 201

Re: using hadoop on zLinux (Linux on S390)

2012-12-11 Thread Michael Segel
Well, on the surface it looks like it's either a missing class, or you don't have your class path set up right. I'm assuming you got this version of Hadoop from IBM, so I would suggest contacting their support and opening up a ticket. On Dec 11, 2012, at 8:23 AM, Emile Kao wrote: > He

Re: Assigning reduce tasks to specific nodes

2012-12-01 Thread Michael Segel
locations by myself. >>>>> But There needs to be one mapper running in each node in some cases, >>>>> so I need a strict way to do it. >>>>> >>>>> So, locations is taken care of by JobTracker(scheduler), but it is not >>>>>

Re: bounce message

2012-11-30 Thread Michael Segel
So how many people here are old enough to remember the song 'Hotel California' ? :-P On Nov 28, 2012, at 11:18 AM, Ted Dunning wrote: > Also, the moderators don't seem to read anything that goes by. > > > On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran > wrote: > In this group once anyo

Re: Best practice for storage of data that changes

2012-11-30 Thread Michael Segel
Here's the simple thing to consider... If you are running M/R jobs against the data... HBase hands down is the winner. If you are looking at a stand alone cluster ... Cassandra wins. HBase is still a fickle beast. Of course I just bottom lined it. :-) On Nov 29, 2012, at 10:51 PM, Lance N

Re: Failed To Start SecondaryNameNode in Secure Mode

2012-11-29 Thread Michael Segel
On Nov 29, 2012, at 4:59 AM, a...@hsk.hk wrote: > Hi > > Since NN and SNN are used in the same server: > > 1) If i use the default "dfs.secondary.http.address", i.e. 0.0.0.0:50090 > (commented out dfs.secondary.http.address property) > > I got : Exception in thread "main" java.lang.Ill

Re: Downloading data directly into HDFS

2012-11-29 Thread Michael Segel
Not really the best tool. 'Fuse'? (Forget the name) You do have other options. I saw one group took an open source FTP server and then extended it to write to HDFS. YMMV, however the code to open a file on HDFS and to write to it is pretty trivial and straight forward. Not sure why Cloudera o

Re: Replacing a hard drive on a slave

2012-11-28 Thread Michael Segel
me, and I did not know what to answer. I will ask them your > questions. > > Thank you. > Mark > > On Wed, Nov 28, 2012 at 7:41 AM, Michael Segel > wrote: > Silly question, why are you worrying about this? > > In a production the odds of getting a replacement

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Michael Segel
Mappers? Uhm... yes you can do it. Yes it is non-trivial. Yes, it is not recommended. I think we talk a bit about this in an InfoQ article written by Boris Lublinsky. It's kind of wild when your entire cluster map goes red in ganglia... :-) On Nov 28, 2012, at 2:41 AM, Harsh J wrote: > Hi,

Re: Replacing a hard drive on a slave

2012-11-28 Thread Michael Segel
Silly question, why are you worrying about this? In a production environment, the odds of getting a replacement disk in service within 10 minutes after a fault is detected are highly improbable. Why do you care that the blocks are replicated to another node? After you replace the disk, bounce the node (rest

Re: Hadoop processing

2012-11-08 Thread Michael Segel
To go back to the OP's initial position. 2 new nodes where data hasn't yet been 'balanced'. First, that's a small window of time. But to answer your question... The JT will attempt to schedule work to where the data is. If you're using 3X replication, there are 3 nodes where the block resid

Re: Please help on providing correct answers

2012-11-07 Thread Michael Segel
> printout and few from mails and few from googling and few from sites and few > from some of my friends... > > regards, > Rams > > On Wed, Nov 7, 2012 at 10:57 PM, Michael Segel > wrote: > Ok... > Where are you pulling these questions from? > > Seriously. >

Re: Question related to Number of Mapper

2012-11-07 Thread Michael Segel
for this > question)... If you know more detail on that please share.. > > Note : I forgot from where this question I taken :) > > regards, > Rams. > > On Wed, Nov 7, 2012 at 10:01 PM, Michael Segel > wrote: > 0 Custer didn't run. He got surrounded and then

Re: Please help on providing correct answers

2012-11-07 Thread Michael Segel
Ok... Where are you pulling these questions from? Seriously. On Nov 7, 2012, at 11:21 AM, Ramasubramanian Narayanan wrote: > Hi, > >I came across the following question in some sites and the answer that > they provided seems to be wrong according to me... I might be wrong... Can > s

Re: Question related to Number of Mapper

2012-11-07 Thread Michael Segel
0 Custer didn't run. He got surrounded and then massacred. :-P (See Custer's last stand at Little Big Horn) Ok... plain text files: 100 files, 2 blocks each, would by default attempt to schedule 200 mappers. Is this one of those online Cert questions? On Nov 7, 2012, at 10:20 AM, Ramasubraman

Re: One mapper/reducer runs on a single JVM

2012-11-06 Thread Michael Segel
ning just Hadoop, you could have a little swap. Running HBase, > fuggit about it." -- could you give a bit more information about what do you > mean swap and why forget for HBase? > > regards, > Lin > > > On Tue, Nov 6, 2012 at 12:46 PM, Michael Segel > wrote: >

Re: One mapper/reducer runs on a single JVM

2012-11-05 Thread Michael Segel
Mappers and Reducers are separate JVM processes. And yes, you need to take into account the amount of memory on the machine(s) when you configure the number of slots. If you are running just Hadoop, you could have a little swap. Running HBase, fuggit about it. On Nov 5, 2012, at 7:12 PM, Lin Ma

Re: backup of hdfs data

2012-11-05 Thread Michael Segel
You have other options. You could create a secondary cluster. You could also look into Cleversafe and what they are doing with Hadoop. Here's the sad thing about backing up to tape... you can dump a couple of 10's of TB to tape. You lose your system. How long will it take to recover? And th

Re: Map Reduce slot

2012-11-01 Thread Michael Segel
"However in production clustes the jvm size is marked final to prevent abuses that may lead to OOMs." Not necessarily. On Nov 1, 2012, at 6:43 AM, Bejoy Ks wrote: > However in production clustes the jvm size is marked final to prevent abuses > that may lead to OOMs.

Re: Insight on why distcp becomes slower when adding nodemanager

2012-10-31 Thread Michael Segel
cleverscale > Sent with Sparrow > > On Monday 29 October 2012 at 20:04, Michael Segel wrote: > >> how many times did you test it? >> >> need to rule out aberrations. >> >> On Oct 29, 2012, at 11:30 AM, Harsh J wrote: >> >>> On your s

Re: Insight on why distcp becomes slower when adding nodemanager

2012-10-29 Thread Michael Segel
how many times did you test it? need to rule out aberrations. On Oct 29, 2012, at 11:30 AM, Harsh J wrote: > On your second low-memory NM instance, did you ensure to lower the > yarn.nodemanager.resource.memory-mb property specifically to avoid > swapping due to excessive resource grants? The d

Re: How do map tasks get assigned efficiently?

2012-10-24 Thread Michael Segel
So... Data locality only works when you actually have data on the cluster itself. Otherwise how can the data be local? Assuming 3X replication, and you're not doing a custom split and your input file is splittable... You will split along the block delineation. So if your input file has 5 b

Re: Old vs New API

2012-10-24 Thread Michael Segel
They were official, back around 2009, hence the first API was deprecated. The reason that they removed the deprecation was that the 'new' API didn't have all of the features/methods of the old APIs. I learned using the new APIs and ToolRunner is your friend. So I would suggest using the new A

Re: Hadoop counter

2012-10-22 Thread Michael Segel
unter in the middle of the process of a job is undefined and internal > behavior, it is more reliable to read counter after the whole job completes? > Agree? > > regards, > Lin > > On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel > wrote: > > On Oct 21, 2012, at 1:

Re: Java heap space error

2012-10-21 Thread Michael Segel
Try upping the child to 1.5GB or more. On Oct 21, 2012, at 8:18 AM, Subash D'Souza wrote: > I'm running CDH 4 on a 4 node cluster each with 96 G of RAM. Up until last > week the cluster was running until there was an error in the name node log > file and I had to reformat it put the data back

Re: Hadoop counter

2012-10-21 Thread Michael Segel
a monitoring process and watch the counters. If, let's say, an error counter hits a predetermined threshold, you could then issue a 'hadoop job -kill ' command. > > regards, > Lin > > On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel > wrote: > > On Oct 19, 2012,
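A sketch of the watchdog he describes, using the classic mapred client API; the job ID, counter group, counter name, and threshold are all placeholders for illustration:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterWatchdog {
      public static void watch(JobConf conf, String jobId, long maxErrors)
          throws Exception {
        JobClient client = new JobClient(conf);
        RunningJob job = client.getJob(JobID.forName(jobId));
        while (!job.isComplete()) {
          // Counters read mid-job are a moving snapshot, which is fine here:
          // we only care whether the error count has crossed the line yet.
          Counters counters = job.getCounters();
          long errors = counters.findCounter("MyApp", "ERRORS").getCounter();
          if (errors > maxErrors) {
            job.killJob();  // same effect as 'hadoop job -kill <job-id>'
            break;
          }
          Thread.sleep(30000);  // poll every 30 seconds
        }
      }
    }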

Re: Hadoop counter

2012-10-20 Thread Michael Segel
here the car has an RFID chip but doesn't trip the sensor.) Pushing the data in a map/reduce job would require the use of counters. Does that help? -Mike > regards, > Lin > > On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel > wrote: > Yeah, sorry... > > I

Re: Hadoop counter

2012-10-19 Thread Michael Segel
e if you could help to clarify a bit. > > regards, > Lin > > On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel > wrote: > > On Oct 19, 2012, at 11:27 AM, Lin Ma wrote: > >> Hi Mike, >> >> Thanks for the detailed reply. Two quick questions/comm

Re: Hadoop counter

2012-10-19 Thread Michael Segel
n example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this. > regards, > Lin > > On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel > wrote: > As I understand it... each Task has its own counters and

Re: Hadoop counter

2012-10-19 Thread Michael Segel
As I understand it... each Task has its own counters, which are independently updated. As they report back to the JT, they update the counter(s)' status. The JT then will aggregate them. In terms of performance, Counters take up some memory in the JT so while it's OK to use them, if you abuse them,
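A small sketch of that flow: each task bumps its own counter locally, the framework aggregates as tasks report in, and the client reads the final total after the job completes. The enum and counter names are illustrative:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      public enum MyCounters { MALFORMED_RECORDS }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws java.io.IOException, InterruptedException {
        if (value.getLength() == 0) {
          // Incremented task-locally; the JT aggregates across all tasks.
          context.getCounter(MyCounters.MALFORMED_RECORDS).increment(1);
          return;
        }
        context.write(value, NullWritable.get());
      }
    }

    // After job.waitForCompletion(true), read the aggregated total:
    //   long bad = job.getCounters()
    //       .findCounter(CountingMapper.MyCounters.MALFORMED_RECORDS).getValue();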

Re: HDFS using SAN

2012-10-18 Thread Michael Segel
I haven't played with a NetApp box, but the way it has been explained to me is that your SAN appears as if it's direct attached storage. It's possible, based on drives and other hardware, plus it looks like they are focusing on read times only. I'd contact a NetApp rep for a better answer. Act

Re: Hadoop and CUDA

2012-10-18 Thread Michael Segel
When you create your jar using netbeans, do you include the Hadoop libraries in the jar you create? This would increase the size of the jar and in this case, size does matter. On Oct 18, 2012, at 5:06 AM, sudha sadhasivam wrote: > > > Sir > > We are trying to combine Hadoop and CUDA. When

Re: Hadoop and CUDA

2012-10-18 Thread Michael Segel
Please don't hijack a thread. Start your own discussion. On Oct 16, 2012, at 1:34 AM, sudha sadhasivam wrote: > > The code executes, but time taken for execution is high > Does not show any advantages in two levels of parallelism > G Sudha > > --- On Tue, 10/16/12, Manoj Babu wrote: > > From

Re: Hive Query with UDF

2012-10-17 Thread Michael Segel
ck response. > > The idea is that we are selling the encryption product for customers who use > HDFS. Hence, encryption is a requirement. > > Any other suggestions. > > Sam > > From: Michael Segel [michael_se...@hotmail.com] >

Re: Hive Query with UDF

2012-10-17 Thread Michael Segel
You don't need a UDF... You encrypt the string 'Ann' first, then use that encrypted value in the Select statement. That should make things a bit simpler. On Oct 17, 2012, at 8:04 PM, Sam Mohamed wrote: > I have some encrypted data in an HDFS csv, that I've created a Hive table > for, an
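A sketch of that suggestion in code: encrypt the literal on the client and compare against the stored ciphertext, so no UDF ever runs inside Hive. This only works if the column was encrypted deterministically; the AES/ECB scheme, key, table, and column names below are assumptions for illustration:

    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;

    public class EncryptedLiteral {
      // Deterministic encryption (same plaintext, same ciphertext) is what
      // makes a plain equality predicate usable.
      public static String encryptHex(String plaintext, byte[] key)
          throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] ct = cipher.doFinal(plaintext.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : ct) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }

      public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];  // placeholder AES-128 key
        // Embed the ciphertext directly in the HiveQL; no UDF involved.
        System.out.println("SELECT * FROM people WHERE name_enc = '"
            + encryptHex("Ann", key) + "'");
      }
    }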

Re: Using a hard drive instead of

2012-10-17 Thread Michael Segel
Meh. If you are worried about the memory constraints of a Linux system, I'd say go with MapR and their CLDB. I just did a quick look at Supermicro servers and found that on a 2u server 768GB was the max. So how many blocks can you store in that much memory? I only have 10 fingers and toes so

Re: Anyone else having problems hitting Apache's site?

2012-10-17 Thread Michael Segel
ache.org and my web browser > say it's just you > > > On Wed, Oct 17, 2012 at 9:02 AM, Michael Segel > wrote: > I'm having issues connecting to the API pages off the Apache site. > > Is it just me? > > Thx > > -Mike > > >

Anyone else having problems hitting Apache's site?

2012-10-17 Thread Michael Segel
I'm having issues connecting to the API pages off the Apache site. Is it just me? Thx -Mike

Re: Distributed Cache For 100MB+ Data Structure

2012-10-13 Thread Michael Segel
Build and store the tree in some sort of globally accessible space? Like HBase, or HDFS? On Oct 13, 2012, at 9:46 AM, Kyle Moses wrote: > Chris, > Thanks for the suggestion on serializing the radix tree and your thoughts on > the memory issue. I'm planning to test a few different solutions a

Re: Spindle per Cores

2012-10-12 Thread Michael Segel
. the ratio could change over time as the CPUs become more efficient and faster. On Oct 12, 2012, at 9:52 PM, ranjith raghunath wrote: > Does hypertheading affect this ratio? > > On Oct 12, 2012 9:36 PM, "Michael Segel" wrote: > First, the obvious caveat... YMMV >

Re: Spindle per Cores

2012-10-12 Thread Michael Segel
First, the obvious caveat... YMMV Having said that. The key here is to take a look across the various jobs that you will run. Some may be more CPU intensive, others more I/O intensive. If you monitor these jobs via Ganglia, when you have too few spindles you should see the wait cpu rise on t

Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce

2012-10-08 Thread Michael Segel
e whereas I'm looking > for an optimization between reduce and map. > > Jim > > On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel > wrote: >> Well I was thinking ... >> >> Map -> Combiner -> Reducer -> Identity Mapper -> combiner -> reducer ->

Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce

2012-10-08 Thread Michael Segel
't have the required functionality. > > If I'm missing anything and.or if there are folks who used Giraph or > Hama and think that they might serve the purpose, I'd be glad to hear > more. > > Jim > > On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel > wrote: &

Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce

2012-10-08 Thread Michael Segel
I don't believe that Hama would suffice. In terms of M/R where you want to chain reducers... Can you chain combiners? (I don't think so, but you never know) If not, you end up with a series of M/R jobs and the Mappers are just identity mappers. Or you could use HBase, with a small caveat...
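Since combiners can't be chained, the fallback he outlines is a series of jobs whose map phases after the first are identities. A rough sketch against the new API, in which the base Mapper and Reducer classes already act as identities (paths are placeholders, and the Reducer.class slots are where real reducers would go):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedReduces {
      static Job makePass(Configuration conf, String name, Path in, Path out)
          throws Exception {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(ChainedReduces.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(Mapper.class);    // identity map
        job.setReducerClass(Reducer.class);  // substitute a real reducer here
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path mid = new Path("/tmp/pass1");  // placeholder intermediate path
        Job first = makePass(conf, "reduce-pass-1", new Path(args[0]), mid);
        if (!first.waitForCompletion(true)) System.exit(1);
        Job second = makePass(conf, "reduce-pass-2", mid, new Path(args[1]));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
      }
    }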

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

2012-10-07 Thread Michael Segel
se >> and you also need to consider sla for the users so the whole is not trivial. >> >> Regards >> >> Bertrand >> >> >> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu wrote: >>> >>> Very good explanation, >>> If there is

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

2012-10-07 Thread Michael Segel
Rack local means that while the data isn't local to the node running the task, it is still on the same rack. (It's meaningless unless you've set up rack awareness because all of the machines are on the default rack. ) Data local means that the task is running local to the machine that contains

Re: Lib conflicts

2012-10-03 Thread Michael Segel
Yup, I hate that when it happens. You tend to see this more with Avro than anything else. The issue is that in Java, the first class loaded wins. So when Hadoop loads 1.4 first, you can't unload it and replace it with 1.7. The only solution that we found to be workable is to replace the jars
