Look at the locality delay parameter.
Sent from my iPhone
On Nov 28, 2012, at 8:44 PM, Harsh J wrote:
> None of the current schedulers are "strict" in the sense of "do not
> schedule the task if such a tasktracker is not available". That has
> never been a requirement for Map/Reduce programs and no
None of the current schedulers are "strict" in the sense of "do not
schedule the task if such a tasktracker is not available". That has
never been a requirement for Map/Reduce programs, nor should it be.
I feel if you want some code to run individually on all nodes for
whatever reason, you may as
You can take a look at this: http://wiki.apache.org/hadoop/MountableHDFS
Cheers!
Manoj.
On Thu, Nov 29, 2012 at 1:33 AM, Uri Laserson wrote:
> What is the best way to download data directly into HDFS from some remote
> source?
>
> I used this command, which works:
> curl | funzip | hadoop fs -
Thank you all for the comments and advice.
I know it is not recommended to assign mapper locations myself.
But there needs to be one mapper running on each node in some cases,
so I need a strict way to do it.
So, the locations are taken care of by the JobTracker (scheduler), but it is not strict.
An
So, before getting any suggestions, you will have to explain a few core things:
1. Do you know if there are patterns in this data?
2. Will the data be read, and how?
3. Is there a hot subset of the data - both read/write?
4. What makes you think HDFS is a good option?
5. How much do you inten
Just to add to Alejandro's information regarding the wildcard support, here is
the reference to the fix:
https://issues.apache.org/jira/browse/HADOOP-6995
On Tue, Nov 13, 2012 at 11:26 AM, Alejandro Abdelnur wrote:
> Andy, Oleg,
>
> What versions of Oozie and Hadoop are you using?
>
> Some versio
Jie,
Simple answer - I got lucky (though obviously there are things you need to
have in place to allow you to be lucky).
Before running the upgrade I ran a set of tests to baseline the cluster
performance, e.g. terasort, gridmix and some operational jobs. Terasort by
itself isn't very realistic a
Hello list,
Although a lot of similar discussions have taken place here, I still
seek some of your able guidance. Till now I have worked only on small or
mid-sized clusters. But this time the situation is a bit different. I have to
collect a lot of legacy data, stored over the last few decades. This d
That'd be a proper shell way to go about it, for one-time writes.
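If you need to do the same thing from Java rather than the shell, here is a rough sketch using the FileSystem API (the source URL and target path below are placeholders, and this skips the funzip step):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UrlToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Stream the remote file straight into HDFS without staging it on local disk.
    InputStream in = new URL("http://example.com/data.txt").openStream();
    OutputStream out = fs.create(new Path("/path/filename"));
    IOUtils.copyBytes(in, out, conf, true); // closes both streams when done
  }
}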
On Thu, Nov 29, 2012 at 1:33 AM, Uri Laserson wrote:
> What is the best way to download data directly into HDFS from some remote
> source?
>
> I used this command, which works:
> curl | funzip | hadoop fs -put - /path/filename
>
>
You're right.
"du -b" returns the expected value.
Thanks.
Chris
Original Message
> Date: Wed, 28 Nov 2012 20:17:18 +0530
> From: Mahesh Balija
> To: user@hadoop.apache.org
> Subject: Re: discrepancy du in dfs are fs
> Hi Chris,
>
> Can you try the following in yo
What is the best way to download data directly into HDFS from some remote
source?
I used this command, which works:
curl | funzip | hadoop fs -put - /path/filename
Is this the recommended way to go?
Uri
--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
la
Hi Jeff,
At my workplace, Intuit, we did a detailed study to evaluate HBase and
Cassandra for our use case. I will see if I can post the comparative study
on my public blog or on this mailing list.
BTW, what is your use case? What bottleneck are you hitting with your current
solutions? If you can sh
Hi,
I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL)
and MongoDB, but I don't feel any of them are quite right for this problem. The
amount of data being stored and the access requirements just don't match up
well.
I was hoping to keep the stack as simple as possible and just use hdfs b
Hi Pedro,
You can get the JobInProgress instance from JobTracker.
JobInProgress getJob(JobID jobid);
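A minimal sketch of how that could be used, assuming the code runs inside the JobTracker process (e.g. from a scheduler plugin) and sits in the org.apache.hadoop.mapred package, since JobInProgress is not a public client API; the class name is illustrative:

package org.apache.hadoop.mapred;

public class JobLookup {
  // Illustrative helper: JobInProgress lookups only make sense inside the
  // JobTracker's JVM, e.g. from a scheduler plugin.
  public static JobInProgress lookup(JobTracker tracker, JobID jobId) {
    // Returns null if the JobTracker no longer tracks this job.
    return tracker.getJob(jobId);
  }
}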
Best,
Mahesh Balija,
Calsoft Labs.
On Wed, Nov 28, 2012 at 10:41 PM, Pedro Sá da Costa wrote:
> I'm building a Java class and given a JobID, how can I get the
> JobInP
Hi,
This appears to be more of an environment or JRE config issue. Your
Windows machine needs the kerberos configuration files on it for Java
security APIs to be able to locate which KDC to talk to, for logging
in. You can also manually specify the path to such a configuration -
read
http://docs.
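For example, a small sketch (the path below is only a guess at where a krb5.ini might live on Windows; you can equally pass it with -Djava.security.krb5.conf=... on the command line):

public class KrbConfExample {
  public static void main(String[] args) {
    // Tell the JRE's Kerberos implementation where its configuration file lives.
    System.setProperty("java.security.krb5.conf", "C:\\kerberos\\krb5.ini");
    // ... then configure UserGroupInformation and submit the job as usual.
  }
}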
Also, the moderators don't seem to read anything that goes by.
On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran
wrote:
> In this group once anyone subscribes there is no exit route.
>
> -Original Message-
> From: Tony Burton [mailto:tbur...@sportingindex.com]
> Sent: 28 November 2012 1
Tried the below:
conf.set("hadoop.security.authentication", "kerberos"); <-- Added this line.
UserGroupInformation.setConfiguration(conf); <-- Now, it fails on this line with the below exception:
Exception in thread "Main Thread" java.lang.ExceptionInInitial
I think we kind of talked about this on one of the LinkedIn discussion groups.
Either Hard Core Hadoop, or Big Data Low Latency.
What I heard from guys who manage very, very large clusters is that they don't
replace the disks right away. In my own experience, you lose the drive late at
night
When added back, with blocks retained, the NN would detect that the
affected files have over-replicated conditions, and will suitably
delete any excess replicas while still adhering to the block placement
policy (for rack-aware clusters), but not necessarily everything from
the re-added DN will be
Are you positive that your cluster/client configuration files'
directory is on the classpath when you run this job? Only then will its
values get automatically read when you instantiate the
Configuration class.
Alternatively, you may try to set: "hadoop.security.authentication" to
"kerberos" manu
Hello :
I see the below exception when I submit a MapReduce job from a standalone Java
application to a remote Hadoop cluster. The cluster authentication mechanism is
Kerberos.
Below is the code. I am using user impersonation since I need to submit the job
as a Hadoop cluster user (userx) from my ma
Mappers? Uhm... yes you can do it.
Yes it is non-trivial.
Yes, it is not recommended.
I think we talk a bit about this in an InfoQ article written by Boris
Lublinsky.
It's kind of wild when your entire cluster map goes red in Ganglia... :-)
On Nov 28, 2012, at 2:41 AM, Harsh J wrote:
> Hi,
Somebody asked me, and I did not know what to answer. I will ask them your
questions.
Thank you.
Mark
On Wed, Nov 28, 2012 at 7:41 AM, Michael Segel wrote:
> Silly question, why are you worrying about this?
>
> In a production the odds of getting a replacement disk in service within
> 10 minutes
Silly question, but why are you worrying about this?
In production, it is highly improbable that you would get a replacement disk
into service within 10 minutes of a fault being detected.
Why do you care that the blocks are replicated to another node?
After you replace the disk, bounce the node (rest
Seems like Hadoop is non-optimal for this, since it's designed to scale machines
anonymously.
On Nov 27, 2012, at 11:08 PM, Harsh J wrote:
> This is not supported/available currently even in MR2, but take a look at
> https://issues.apache.org/jira/browse/MAPREDUCE-199.
>
>
> On Wed, Nov 28,
What happens if I stop the datanode, miss the 10 minute 30 second deadline,
and restart the datanode, say, 30 minutes later? Will Hadoop re-use the data
on this datanode, balancing it with HDFS? What happens to those blocks that
correspond to files that have been updated in the meantime?
Mark
On Wed, Nov 28
Thank you.
Mark
On Wed, Nov 28, 2012 at 6:51 AM, Stephen Fritz wrote:
> HDFS will not start re-replicating blocks from a dead DN for 10 minutes 30
> seconds by default.
>
> Right now there isn't a good way to replace a disk out from under a
> running datanode, so the best way is:
> - Stop the DN
Hi,
I have 'dfs.secondary.namenode.kerberos.internal.spnego.principal' in
hdfs-site.xml
I used the following commands to add this principal:
1) kadmin: addprinc -randkey HTTP/m146
2) kadmin: ktadd -k /etc/hadoop/hadoop.keytab -norandkey HTTP/m146
kadmin: Principal -norandkey do
HDFS will not start re-replicating blocks from a dead DN for 10 minutes 30
seconds by default.
Right now there isn't a good way to replace a disk out from under a running
datanode, so the best way is:
- Stop the DN
- Replace the disk
- Restart the DN
On Wed, Nov 28, 2012 at 9:14 AM, Mark Kerzne
Hi Chris,
Can you try the following in your local machine,
du -b myfile.txt
and compare this with the hadoop fs -du myfile.txt.
Best,
Mahesh Balija,
Calsoft Labs.
On Wed, Nov 28, 2012 at 7:43 PM, wrote:
>
> Hi all,
>
> I wonder wy there is a difference betw
Hi,
Can I remove one hard drive from a slave but tell Hadoop not to replicate the
missing blocks for a few minutes, because I will put it back? Or will
this not work at all, and will Hadoop continue replicating anyway, since I removed
blocks, even if only for a short time?
Thank you. Sincerely,
Mark
Hi all,
I wonder why there is a difference between "du" on HDFS and "get" + "du" on my
local machine.
Here is an example:
hadoop fs -du myfile.txt
> 81355258
hadoop fs -get myfile.txt .
du myfile.txt
> 34919
--- nevertheless ---
hadoop fs -cat myfile.txt | wc -l
> 4789943
cat myfile.txt
Good pointer by Dyuti. For an explanation of Sequence and HAR files, you can
visit another great post on Cloudera's blog here:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
Regards,
Mohammad Tariq
On Wed, Nov 28, 2012 at 6:52 PM, dyuti a wrote:
> Hi Lin,
> check t
Hi Lin,
check this link too
http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/
Hope it helps!
dti
On Wed, Nov 28, 2012 at 6:42 PM, Lin Ma wrote:
> Thanks Mohammad,
>
> I searched Hadoop file format, but only find sequence file format, so it
> is why I have th
Thanks Mohammad,
I searched for Hadoop file formats, but only found the sequence file format,
which is why I have the confusion.
1. Are these file formats built on top of the sequence file format?
2. I would appreciate it if you could kindly point me to the official
documentation for the file formats.
regards,
Hello Lin,
Along with that, Hadoop has MapFiles, SetFiles, IFiles, and HAR files. But
each has its own significance and is used in different scenarios.
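For example, a small SequenceFile-writing sketch (the path and the Text/IntWritable key/value types are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq");
    // Append a couple of key/value records, then close the writer.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("alpha"), new IntWritable(1));
      writer.append(new Text("beta"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}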
Regards,
Mohammad Tariq
On Wed, Nov 28, 2012 at 6:29 PM, Lin Ma wrote:
> Sorry I miss a question mark. I should say "are there any other b
Sorry, I missed a question mark. I should have said "are there any other built-in
file formats supported by Hadoop?" :-)
regards,
Lin
On Wed, Nov 28, 2012 at 8:58 PM, Lin Ma wrote:
> Hello everyone,
>
> I have a very basic question. Besides sequence file format (
> http://hadoop.apache.org/docs/current/a
Hello everyone,
I have a very basic question. Besides the sequence file format
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html),
are there any other built-in file formats supported by Hadoop?
Thanks in advance,
Lin
Hi All,
Thank you for your suggestions, I resolved it. But after that I ran
into the errors below (it works fine when checked in the IDE).
//Errors:
java.lang.NullPointerException
at
com.ge.hadoop.test.xmlfileformat.MyParserMapper1.map(MyParserMapper1.java:52)
at
com.ge.hadoop.test.xml
On 11/28/2012 06:17 AM, jamal sasha wrote:
Lately, I have been writing a lot of algorithms in the MapReduce
abstraction in Python (Hadoop Streaming).
By not using the Java libraries, what power of Hadoop am I missing?
In the Pydoop docs we have a section where several approaches to Hadoop
programmi
In this group once anyone subscribes there is no exit route.
-Original Message-
From: Tony Burton [mailto:tbur...@sportingindex.com]
Sent: 28 November 2012 17:33
To:
Subject: bounce message
I'm getting the following every time I post to user@hadoop - can we
unsubscribe Tianku?
From: Po
I'm getting the following every time I post to user@hadoop - can we unsubscribe
Tianku?
From: Postmaster
Thank you for your inquiry. Tiankai Tu is no longer with the firm. For
immediate assistance, please contact Reception at +1-212-478-.
Sincerely,
The D. E. Shaw Group
Got it - thanks Harsh.
-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: 28 November 2012 11:41
To:
Subject: Re: Map output compression in Hadoop 1.0.3
No, I see your point of confusion and I can think of others who may be confused
that way, but the API changes did n
No, I see your point of confusion and I can think of others who may be
confused that way, but the API changes did not trigger the config
naming change.
The config naming changes could instead be viewed by you as a MR1 vs.
MR2 thing, for simplification. So unless you move onto YARN-based MR2,
keep
Also, another point that prompted my initial question: I'd come across
"mapred.compress.map.output" in the documentation, but I wasn't 100% sure whether
there has been or will be any equivalence or correspondence between config
settings like this one and the naming of the stable and new APIs.
For exam
Sorry, my fault about "mapred.output.compress" - I meant
"mapred.compress.map.output".
Thanks Harsh for the speedy and comprehensive answer! Very useful.
Tony
-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: 28 November 2012 11:25
To:
Subject: Re: Map output comp
Hi,
The property mapred.output.compress, as its name reads, controls
job-output compression, not intermediate/transient data compression,
which is what you mean by "Map output compression".
Also note that this property is a per job one and can be toggled, if a
user wanted, on/off for each job spe
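To make the distinction concrete, a sketch against the new-API Job (property names as in Hadoop 1.x; GzipCodec is just an example codec, not a recommendation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressionDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "compression-demo");
    Configuration conf = job.getConfiguration();

    // Intermediate (map output) compression - what the question is about:
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
        GzipCodec.class, CompressionCodec.class);

    // Final job-output compression - what mapred.output.compress controls:
    conf.setBoolean("mapred.output.compress", true);
  }
}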
Hi,
Quick question: What's the best way to turn on Map Output Compression in Hadoop
1.0.3? The tutorial at
http://hadoop.apache.org/docs/r1.0.3/mapred_tutorial.html says to use
JobConf.setCompressMapOutput(boolean), but I'm using o.a.h.mapreduce.Job rather
than o.a.h.mapred.JobConf.
Is it sim
If the cluster is secured, it will demand Kerberos credentials. There
is no way to bypass this requirement (and it wouldn't make sense to
allow such a thing either).
If you do have a keytab file and wish to automate the login by
knowing the keytab path, you can use the SecurityUtil.login(…
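To make that concrete, a minimal keytab-login sketch (the principal and keytab path below are placeholders; loginUserFromKeytab takes them directly, while SecurityUtil.login() reads them from configuration keys):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // Log in from a keytab instead of relying on a kinit-cached ticket.
    UserGroupInformation.loginUserFromKeytab("userx@EXAMPLE.COM",
        "/etc/security/keytabs/userx.keytab");
    // Subsequent FileSystem / job-submission calls run as the logged-in user.
  }
}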
Hi.
I set up my Hadoop cluster with security enabled, using Kerberos.
Can I log in to the Hadoop cluster without executing the kinit command?
I can't find a Hadoop API that lets me use a Kerberos principal (manually
setting a username and password) instead of a cached ticket.
How can I use the Hadoop API for that?
Thanks.
--
--
Hi,
Mapper scheduling is indeed influenced by the results returned by the
InputSplit's getLocations() method.
The map task itself does not care about deserializing the location
information, as it is of no use to it. The location information is vital to
the scheduler (or in 0.20.2, the JobTracker), where i
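As an illustration, a hypothetical split that reports a single preferred host - the scheduler treats getLocations() purely as a hint, never as a strict placement constraint:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class HostHintSplit extends InputSplit implements Writable {
  private String host = "";

  public HostHintSplit() {}                        // required for deserialization
  public HostHintSplit(String host) { this.host = host; }

  @Override
  public long getLength() { return 0; }            // no meaningful length for this example

  @Override
  public String[] getLocations() {                 // the scheduling hint
    return new String[] { host };
  }

  @Override
  public void write(DataOutput out) throws IOException { out.writeUTF(host); }

  @Override
  public void readFields(DataInput in) throws IOException { host = in.readUTF(); }
}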