Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Mohit Anchlia
Look at the locality delay parameter. Sent from my iPhone On Nov 28, 2012, at 8:44 PM, Harsh J wrote: > None of the current schedulers are "strict" in the sense of "do not > schedule the task if such a tasktracker is not available". That has > never been a requirement for Map/Reduce programs and no

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Harsh J
None of the current schedulers are "strict" in the sense of "do not schedule the task if such a tasktracker is not available". That has never been a requirement for Map/Reduce programs, nor should it be. I feel if you want some code to run individually on all nodes for whatever reason, you may as

Re: Downloading data directly into HDFS

2012-11-28 Thread Manoj Babu
You can take a look at this: http://wiki.apache.org/hadoop/MountableHDFS Cheers! Manoj. On Thu, Nov 29, 2012 at 1:33 AM, Uri Laserson wrote: > What is the best way to download data directly into HDFS from some remote > source? > > I used this command, which works: > curl | funzip | hadoop fs -

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Hiroyuki Yamada
Thank you all for the comments and advice. I know it is not recommended to assign mapper locations myself. But there needs to be one mapper running on each node in some cases, so I need a strict way to do it. So, locations are taken care of by the JobTracker (scheduler), but it is not strict. An

Re: Guidelines for production cluster

2012-11-28 Thread Gaurav Sharma
So, before getting any suggestions, you will have to explain a few core things: 1. do you know if there exist patterns in this data? 2. will the data be read and how? 3. does there exist a hot subset of the data - both read/write? 4. what makes you think HDFS is a good option? 5. how much do you inten

Re: notorious impersonation ERROR

2012-11-28 Thread Robert Molina
Just to add to Alejandro's information regarding the wildcard support, here is the reference to the fix: https://issues.apache.org/jira/browse/HADOOP-6995 On Tue, Nov 13, 2012 at 11:26 AM, Alejandro Abdelnur wrote: > Andy, Oleg, > > What versions of Oozie and Hadoop are you using? > > Some versio
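
For context, impersonation is governed by proxy-user properties in the cluster's core-site.xml; a minimal sketch, assuming a hypothetical "oozie" service user (the wildcard values are what HADOOP-6995 enabled):

    <property>
      <name>hadoop.proxyuser.oozie.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.oozie.groups</name>
      <value>*</value>
    </property>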

Re: Hadoop 1.0.4 Performance Problem

2012-11-28 Thread Jon Allen
Jie, Simple answer - I got lucky (though obviously there are things you need to have in place to allow you to be lucky). Before running the upgrade I ran a set of tests to baseline the cluster performance, e.g. terasort, gridmix, and some operational jobs. Terasort by itself isn't very realistic a

Guidelines for production cluster

2012-11-28 Thread Mohammad Tariq
Hello list, Although a lot of similar discussions have taken place here, I still seek some of your able guidance. Till now I have worked only on small or mid-sized clusters. But this time the situation is a bit different. I have to collect a lot of legacy data, stored over the last few decades. This d

Re: Downloading data directly into HDFS

2012-11-28 Thread Harsh J
That'd be a proper shell way to go about it, for one-time writes. On Thu, Nov 29, 2012 at 1:33 AM, Uri Laserson wrote: > What is the best way to download data directly into HDFS from some remote > source? > > I used this command, which works: > curl | funzip | hadoop fs -put - /path/filename > >

Re: discrepancy du in dfs and fs

2012-11-28 Thread Christoph Böhm
You're right. "du -b" returns the expected value. Thanks. Chris Original Message > Date: Wed, 28 Nov 2012 20:17:18 +0530 > From: Mahesh Balija > To: user@hadoop.apache.org > Subject: Re: discrepancy du in dfs and fs > Hi Chris, > > Can you try the following in yo

Downloading data directly into HDFS

2012-11-28 Thread Uri Laserson
What is the best way to download data directly into HDFS from some remote source? I used this command, which works: curl | funzip | hadoop fs -put - /path/filename Is this the recommended way to go? Uri -- Uri Laserson, PhD Data Scientist, Cloudera Twitter/GitHub: @laserson +1 617 910 0447 la
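
As a concrete sketch of the pipeline described above (the URL and HDFS path are placeholders, since the original command elides them):

    curl -sL http://example.com/data.zip | funzip | hadoop fs -put - /data/file.txt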

Re: Best practice for storage of data that changes

2012-11-28 Thread anil gupta
Hi Jeff, At my workplace, Intuit, we did a detailed study to evaluate HBase and Cassandra for our use case. I will see if I can post the comparative study on my public blog or on this mailing list. BTW, what is your use case? What bottleneck are you hitting with current solutions? If you can sh

Re: Best practice for storage of data that changes

2012-11-28 Thread jeff l
Hi, I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL) and MongoDB, but don't feel any are quite right for this problem. The amount of data being stored and the access requirements just don't match up well. I was hoping to keep the stack as simple as possible and just use HDFS b

Re: Get JobInProgress given jobId

2012-11-28 Thread Mahesh Balija
Hi Pedro, You can get the JobInProgress instance from JobTracker. JobInProgress getJob(JobID jobid); Best, Mahesh Balija, Calsoft Labs. On Wed, Nov 28, 2012 at 10:41 PM, Pedro Sá da Costa wrote: > I'm building a Java class and given a JobID, how can I get the > JobInP
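
A minimal sketch of the lookup Mahesh describes (note that JobTracker and JobInProgress are internal Hadoop 1.x classes, so code like this typically lives in the org.apache.hadoop.mapred package and runs inside the JobTracker's JVM, as the contrib schedulers do; the helper class and method names are hypothetical):

    package org.apache.hadoop.mapred;

    public class JobLookup {
        // Resolve a textual job id to the tracker's in-memory job state
        static JobInProgress lookup(JobTracker jobTracker, String idText) {
            JobID jobId = JobID.forName(idText);
            return jobTracker.getJob(jobId);
        }
    }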

Re: submitting a mapreduce job to remote cluster

2012-11-28 Thread Harsh J
Hi, This appears to be more of an environment or JRE config issue. Your Windows machine needs the kerberos configuration files on it for Java security APIs to be able to locate which KDC to talk to, for logging in. You can also manually specify the path to such a configuration - read http://docs.

Re: bounce message

2012-11-28 Thread Ted Dunning
Also, the moderators don't seem to read anything that goes by. On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran wrote: > In this group once anyone subscribes there is no exit route. > > -Original Message- > From: Tony Burton [mailto:tbur...@sportingindex.com] > Sent: 28 November 2012 1

RE: submitting a mapreduce job to remote cluster

2012-11-28 Thread Erravelli, Venkat
Tried the below: conf.set("hadoop.security.authentication", "kerberos"); (added this line) UserGroupInformation.setConfiguration(conf); <-- now it fails on this line with the below exception: Exception in thread "Main Thread" java.lang.ExceptionInInitial

Re: Replacing a hard drive on a slave

2012-11-28 Thread Michael Segel
I think we kind of talked about this on one of the LinkedIn discussion groups. Either Hard Core Hadoop, or Big Data Low Latency. What I heard from guys who manage very, very large clusters is that they don't replace the disks right away. In my own experience, you lose the drive late at night

Re: Replacing a hard drive on a slave

2012-11-28 Thread Harsh J
When added back, with blocks retained, the NN will detect that the affected files have over-replicated conditions, and will suitably delete any excess replicas while still adhering to the block placement policy (for rack-aware clusters), but not necessarily everything from the re-added DN will be

Re: submitting a mapreduce job to remote cluster

2012-11-28 Thread Harsh J
Are you positive that your cluster/client configuration files' directory is on the classpath when you run this job? Only then its values would get automatically read when you instantiate the Configuration class. Alternatively, you may try to set: "hadoop.security.authentication" to "kerberos" manu
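
A minimal sketch of the manual route suggested above, using only the property and calls that appear in this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    Configuration conf = new Configuration();
    // force kerberos instead of relying on config files being on the classpath
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);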

submitting a mapreduce job to remote cluster

2012-11-28 Thread Erravelli, Venkat
Hello: I see the below exception when I submit a MapReduce job from a standalone Java application to a remote Hadoop cluster. The cluster authentication mechanism is Kerberos. Below is the code. I am using user impersonation since I need to submit the job as a Hadoop cluster user (userx) from my ma
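
For the impersonation piece, a hedged sketch of the standard proxy-user pattern (the job-submission body is a placeholder; the real logged-in user must be whitelisted as a proxy user on the cluster):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.security.UserGroupInformation;

    UserGroupInformation proxy =
        UserGroupInformation.createProxyUser("userx", UserGroupInformation.getLoginUser());
    proxy.doAs(new PrivilegedExceptionAction<Void>() {
        public Void run() throws Exception {
            // build and submit the job here; it runs as userx
            return null;
        }
    });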

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Michael Segel
Mappers? Uhm... yes, you can do it. Yes, it is non-trivial. Yes, it is not recommended. I think we talk a bit about this in an InfoQ article written by Boris Lublinsky. It's kind of wild when your entire cluster map goes red in Ganglia... :-) On Nov 28, 2012, at 2:41 AM, Harsh J wrote: > Hi,

Re: Replacing a hard drive on a slave

2012-11-28 Thread Mark Kerzner
Somebody asked me, and I did not know what to answer. I will ask them your questions. Thank you. Mark On Wed, Nov 28, 2012 at 7:41 AM, Michael Segel wrote: > Silly question, why are you worrying about this? > > In a production the odds of getting a replacement disk in service within > 10 minutes

Re: Replacing a hard drive on a slave

2012-11-28 Thread Michael Segel
Silly question: why are you worrying about this? In production, the odds of getting a replacement disk into service within 10 minutes after a fault is detected are very low. Why do you care that the blocks are replicated to another node? After you replace the disk, bounce the node (rest

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread JAY
Seems like Hadoop is non-optimal for this, since it's designed to scale machines anonymously. On Nov 27, 2012, at 11:08 PM, Harsh J wrote: > This is not supported/available currently even in MR2, but take a look at > https://issues.apache.org/jira/browse/MAPREDUCE-199. > > > On Wed, Nov 28,

Re: Replacing a hard drive on a slave

2012-11-28 Thread Mark Kerzner
What happens if I stop the datanode, miss the 10 minute 30 second deadline, and restart the datanode, say, 30 minutes later? Will Hadoop re-use the data on this datanode, balancing it with HDFS? What happens to those blocks that correspond to files that have been updated in the meantime? Mark On Wed, Nov 28

Re: Replacing a hard drive on a slave

2012-11-28 Thread Mark Kerzner
Thank you. Mark On Wed, Nov 28, 2012 at 6:51 AM, Stephen Fritz wrote: > HDFS will not start re-replicating blocks from a dead DN for 10 minutes 30 > seconds by default. > > Right now there isn't a good way to replace a disk out from under a > running datanode, so the best way is: > - Stop the DN

Re: Failed To Start SecondaryNameNode in Secure Mode

2012-11-28 Thread a...@hsk.hk
Hi, I have 'dfs.secondary.namenode.kerberos.internal.spnego.principal' in hdfs-site.xml I used the following commands to add this principal: 1) kadmin: addprinc -randkey HTTP/m146 2) kadmin: ktadd -k /etc/hadoop/hadoop.keytab -norandkey HTTP/m146 kadmin: Principal -norandkey do
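
For reference, the hdfs-site.xml entry under discussion might look like the following (REALM is a placeholder; the principal matches the HTTP/m146 one created above):

    <property>
      <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
      <value>HTTP/m146@REALM</value>
    </property>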

Re: Replacing a hard drive on a slave

2012-11-28 Thread Stephen Fritz
HDFS will not start re-replicating blocks from a dead DN for 10 minutes 30 seconds by default. Right now there isn't a good way to replace a disk out from under a running datanode, so the best way is: - Stop the DN - Replace the disk - Restart the DN On Wed, Nov 28, 2012 at 9:14 AM, Mark Kerzne
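
In shell terms, the procedure is roughly the following (assuming a Hadoop 1.x tarball install; paths and service management vary by distribution):

    $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
    # physically swap the disk and restore the dfs.data.dir mount point
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode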

Re: discrepancy du in dfs are fs

2012-11-28 Thread Mahesh Balija
Hi Chris, Can you try the following on your local machine: du -b myfile.txt, and compare this with hadoop fs -du myfile.txt. Best, Mahesh Balija, Calsoft Labs. On Wed, Nov 28, 2012 at 7:43 PM, wrote: > > Hi all, > > I wonder why there is a difference betw
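
Spelled out, the comparison looks like this (GNU du reports 1 KiB disk-usage blocks by default, while du -b reports the apparent size in bytes, which is what hadoop fs -du reports):

    hadoop fs -du myfile.txt    # length in bytes as recorded by HDFS
    hadoop fs -get myfile.txt .
    du -b myfile.txt            # apparent size in bytes; should match the HDFS figure
    du myfile.txt               # disk usage in 1 KiB blocks, hence the discrepancy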

Replacing a hard drive on a slave

2012-11-28 Thread Mark Kerzner
Hi, can I remove one hard drive from a slave but tell Hadoop not to replicate the missing blocks for a few minutes, because I will put it back? Or will this not work at all, with Hadoop replicating anyway, since I removed blocks, even if only for a short time? Thank you. Sincerely, Mark

discrepancy du in dfs and fs

2012-11-28 Thread listenbruder
Hi all, I wonder why there is a difference between "du" on HDFS and "get" + "du" on my local machine. Here is an example: hadoop fs -du myfile.txt > 81355258 hadoop fs -get myfile.txt . du myfile.txt > 34919 --- nevertheless --- hadoop fs -cat myfile.txt | wc -l > 4789943 cat myfile.txt

Re: Hadoop storage file format

2012-11-28 Thread Mohammad Tariq
Good pointer by Dyuti. For an explanation of Sequence and HAR files, you can visit another great post on Cloudera's blog here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ Regards, Mohammad Tariq On Wed, Nov 28, 2012 at 6:52 PM, dyuti a wrote: > Hi Lin, > check t

Re: Hadoop storage file format

2012-11-28 Thread dyuti a
Hi Lin, check this link too http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/ Hope it helps! dti On Wed, Nov 28, 2012 at 6:42 PM, Lin Ma wrote: > Thanks Mohammad, > > I searched Hadoop file format, but only find sequence file format, so it > is why I have th

Re: Hadoop storage file format

2012-11-28 Thread Lin Ma
Thanks Mohammad, I searched for Hadoop file formats, but only found the sequence file format, which is why I was confused. 1. Are these file formats built on top of the sequence file format? 2. I'd appreciate it if you could point me to the official documentation for the file formats. regards,

Re: Hadoop storage file format

2012-11-28 Thread Mohammad Tariq
Hello Lin, Along with that: Hadoop MapFiles, SetFiles, IFiles, and HAR files. But each has its own significance and is used under different scenarios. Regards, Mohammad Tariq On Wed, Nov 28, 2012 at 6:29 PM, Lin Ma wrote: > Sorry I miss a question mark. I should say "are there any other b

Re: Hadoop storage file format

2012-11-28 Thread Lin Ma
Sorry, I missed a question mark. I should have said "are there any other built-in file formats supported by Hadoop?" :-) regards, Lin On Wed, Nov 28, 2012 at 8:58 PM, Lin Ma wrote: > Hello everyone, > > I have a very basic question. Besides sequence file format ( > http://hadoop.apache.org/docs/current/a

Hadoop storage file format

2012-11-28 Thread Lin Ma
Hello everyone, I have a very basic question. Besides the sequence file format (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html), are there any other built-in file formats supported by Hadoop? Thanks in advance, Lin
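
As a minimal sketch of using the one built-in format you cite, writing and closing a SequenceFile with the Hadoop 1.x API (path and key/value types are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.seq");
            // 1.x factory method: (fs, conf, path, keyClass, valueClass)
            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
            writer.append(new Text("key"), new IntWritable(1));
            writer.close();
        }
    }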

Re: ClassNotFoundException: org.jdom2.JDOMException

2012-11-28 Thread dyuti a
Hi All, Thank you for your suggestions; I resolved that, but after that ran into the errors below (it works fine when checked in the IDE). //Errors: java.lang.NullPointerException at com.ge.hadoop.test.xmlfileformat.MyParserMapper1.map(MyParserMapper1.java:52) at com.ge.hadoop.test.xml

Re: advice

2012-11-28 Thread Simone Leo
On 11/28/2012 06:17 AM, jamal sasha wrote: Lately, I have been writing a lot of algorithms in the MapReduce abstraction in Python (Hadoop streaming). By not using the Java libraries, what power of Hadoop am I missing? In the Pydoop docs we have a section where several approaches to Hadoop programmi

RE: bounce message

2012-11-28 Thread sathyavageeswaran
In this group once anyone subscribes there is no exit route. -Original Message- From: Tony Burton [mailto:tbur...@sportingindex.com] Sent: 28 November 2012 17:33 To: Subject: bounce message I'm getting the following every time I post to user@hadoop - can we unsubscribe Tianku? From: Po

bounce message

2012-11-28 Thread Tony Burton
I'm getting the following every time I post to user@hadoop - can we unsubscribe Tianku? From: Postmaster Thank you for your inquiry. Tiankai Tu is no longer with the firm. For immediate assistance, please contact Reception at +1-212-478-. Sincerely, The D. E. Shaw Group **

RE: Map output compression in Hadoop 1.0.3

2012-11-28 Thread Tony Burton
Got it - thanks Harsh. -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: 28 November 2012 11:41 To: Subject: Re: Map output compression in Hadoop 1.0.3 No, I see your point of confusion and I can think of others who may be confused that way, but the API changes did n

Re: Map output compression in Hadoop 1.0.3

2012-11-28 Thread Harsh J
No, I see your point of confusion and I can think of others who may be confused that way, but the API changes did not trigger the config naming change. The config naming changes could instead be viewed by you as a MR1 vs. MR2 thing, for simplification. So unless you move onto YARN-based MR2, keep

RE: Map output compression in Hadoop 1.0.3

2012-11-28 Thread Tony Burton
Also, another point that prompted my initial question: I'd come across "mapred.compress.map.output" in the documentation, but I wasn't 100% sure if there has been or will be any equivalence or correspondence between config settings like this one and the naming of the stable and new APIs. For exam

RE: Map output compression in Hadoop 1.0.3

2012-11-28 Thread Tony Burton
Sorry, my fault about "mapred.output.compress" - I meant "mapred.compress.map.output". Thanks Harsh for the speedy and comprehensive answer! Very useful. Tony -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: 28 November 2012 11:25 To: Subject: Re: Map output comp

Re: Map output compression in Hadoop 1.0.3

2012-11-28 Thread Harsh J
Hi, The property mapred.output.compress, as its name reads, controls job-output compression, not intermediate/transient data compression, which is what you mean by "Map output compression". Also note that this property is a per-job one and can be toggled, if a user wants, on/off for each job spe

Map output compression in Hadoop 1.0.3

2012-11-28 Thread Tony Burton
Hi, Quick question: What's the best way to turn on Map Output Compression in Hadoop 1.0.3? The tutorial at http://hadoop.apache.org/docs/r1.0.3/mapred_tutorial.html says to use JobConf.setCompressMapOutput(boolean), but I'm using o.a.h.mapreduce.Job rather than o.a.h.mapred.JobConf. Is it sim
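
A minimal sketch of one way to do it with the new API in 1.0.3, setting the MR1 property on the Configuration before constructing the Job (the job name is a placeholder; per Harsh's replies above, the MR1 property names still apply here):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true); // intermediate map output only
    Job job = new Job(conf, "compressed-map-output");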

Re: Could I authenticate hadoop manually using kerberos

2012-11-28 Thread Harsh J
If the cluster is secured, it will demand kerberos credentials. There is no way to bypass this requirement (and it wouldn't make sense to allow such a thing either). If you do have a keytab file, and are wishing to automate the login by knowing the keytab path, you can use the SecurityUtil.login(…
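
A minimal sketch of the keytab route (principal and keytab path are placeholders; UserGroupInformation.loginUserFromKeytab is shown here as a commonly used alternative to the SecurityUtil call Harsh mentions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // log in programmatically, with no cached ticket from kinit required
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/etc/security/keytabs/user.keytab");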

Could I authenticate hadoop manually using kerberos

2012-11-28 Thread Oh Seok Keun
Hi. I set my Hadoop cluster to security-enabled using Kerberos. Can I log in to the Hadoop cluster without executing the kinit command? I can't find a Hadoop API that lets me use a Kerberos principal (manually setting username and password) instead of a cached ticket. How can I use the Hadoop API for that? Thanks.

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Harsh J
Hi, Mapper scheduling is indeed influenced by the results returned by the InputSplit's getLocations(). The map task itself does not care about deserializing the location information, as it is of no use to it. The location information is vital to the scheduler (or in 0.20.2, the JobTracker), where i
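
To make the mechanism concrete, a hedged sketch of a split that reports preferred hosts through getLocations() (class name and host are illustrative; this only hints placement to the scheduler and does not enforce it):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class PinnedSplit extends InputSplit implements Writable {
        private String[] hosts = { "node1.example.com" };

        @Override
        public long getLength() { return 0; }

        @Override
        public String[] getLocations() { return hosts; } // a hint, not a guarantee

        // locations are deliberately not serialized; as noted above, the map task
        // itself never needs them, only the scheduler does
        public void write(DataOutput out) throws IOException { }
        public void readFields(DataInput in) throws IOException { }
    }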