i was looking at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter's setupTask
method, but it does nothing and states:
// FileOutputCommitter's setupTask doesn't do anything. Because the
// temporary task directory is created on demand when the
// task is writing.
ok i got it.
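for reference, the method being discussed is essentially empty; a paraphrased sketch (not the verbatim hadoop source):

@Override
public void setupTask(TaskAttemptContext context) throws IOException {
  // intentionally a no-op: the task attempt directory under
  // <outputdir>/_temporary is only created lazily, the first time
  // the task actually writes output
}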
how do long-lived services such as the namenode or yarn resourcemanager
deal with kerberos ticket expiration for the user that runs the service?
do they periodically renew/refresh their tickets by calling
SecurityUtil.login(conf, keytab, user, host)?
where can i find an example of the code that does this, e.g.
UserGroupInformation.loginUserFromKeytab(String user, String path)?
Geoff
On Jun 29, 2014, at 5:50 PM, Koert Kuipers ko...@tresata.com wrote:
how do long-lived services such as the namenode or yarn resourcemanager
deal with kerberos ticket expiration for the user that runs the service?
do they periodically renew/refresh their tickets
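a minimal sketch of the relogin pattern such daemons use; the principal and keytab path below are made up, but UserGroupInformation.loginUserFromKeytab and checkTGTAndReloginFromKeytab are the relevant calls:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    UserGroupInformation.setConfiguration(conf);
    // initial login from the keytab (principal and path are placeholders)
    UserGroupInformation.loginUserFromKeytab(
        "hdfs/namenode.example.com@SOME-REALM",
        "/etc/hadoop/hdfs.keytab");
    // a long-lived daemon then re-logins as needed, e.g. before RPC calls
    // or from a background thread; this is cheap while the TGT is fresh
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    ugi.checkTGTAndReloginFromKeytab();
  }
}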
why does FileSystem.closeAll(UserGroupInformation ugi) not remove the
FileSystem from the cache?
for example, closeAll(boolean onlyAutomatic) does remove filesystems from
the cache...
thanks koert
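not an answer to the closeAll question itself, but a sketch of one way to sidestep the shared cache entirely, via the per-scheme fs.<scheme>.impl.disable.cache switch (the namenode URI is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CacheBypassSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // bypass the FileSystem cache for hdfs:// URIs; every get() now returns
    // a fresh instance that can be closed independently
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    try {
      // use fs
    } finally {
      fs.close();
    }
  }
}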
On Feb 2, 2014 at 3:14 PM, Koert Kuipers ko...@tresata.com wrote:
is it necessary to create a kerberos principal for hdfs on every node, as
in hdfs/some-host@SOME-REALM?
why not use one principal hdfs@SOME-REALM? that way i could distribute
the same keytab file to all nodes which makes things
is it necessary to create a kerberos principal for hdfs on every node, as
in hdfs/some-host@SOME-REALM?
why not use one principal hdfs@SOME-REALM? that way i could distribute the
same keytab file to all nodes which makes things a lot easier.
thanks! koert
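part of the answer is the _HOST placeholder: the config names one logical principal per service and each daemon substitutes its own hostname at startup, so per-host principals only mean per-host keytabs, not per-host config files. the usual argument for keeping them per-host is containment: a leaked keytab then compromises one node rather than the whole cluster. a sketch of the substitution call:

import org.apache.hadoop.security.SecurityUtil;

// _HOST in the configured principal is replaced with the local hostname;
// a null (or empty) hostname means "use this machine's fqdn"
String principal =
    SecurityUtil.getServerPrincipal("hdfs/_HOST@SOME-REALM", (String) null);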
i understand kerberos is used on hadoop to provide security in a multi-user
environment, and i can totally see its usage for a shared cluster within a
company to make sure sensitive data for one department is safe from prying
eyes of another department.
but for a hadoop cluster that sits behind a
i have a cluster with replication factor 2. with the following events in
this order, do i have data loss?
1) shut down a datanode for maintenance unrelated to hdfs. so now some
blocks only have replication factor 1
2) a disk dies in another datanode. let's assume some blocks now have
replication
to Hadoop.
Bertrand
On Sun, Oct 27, 2013 at 7:42 PM, Koert Kuipers ko...@tresata.com wrote:
i have a cluster with replication factor 2. with the following events in
this order, do i have data loss?
1) shut down a datanode for maintenance unrelated to hdfs. so now some
blocks only have
handle
usual R/W operations and just throws StandbyException; the hbase region
server may kill itself in some cases, I guess.
I think you can remove sshfence from the configuration if you are
using QJM-based HA.
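worth noting: the QJM HA docs still expect some dfs.ha.fencing.methods value even though the journal nodes themselves prevent split-brain writes; a commonly suggested stand-in (an hdfs-site.xml sketch) is a no-op shell fence:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>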
On Fri, Oct 11, 2013 at 4:51 PM, Koert Kuipers ko...@tresata.com wrote:
i have been
i have been playing with high availability using journalnodes and 2 masters
both running namenode and hbase master.
when i kill the namenode and hbase-master processes on the active master,
the failover is perfect. hbase never stops and a running map-reduce job
keeps going. this is impressive!
is it safe to upload UTF-16 encoded (unicode) text files to hadoop for
processing by map-reduce, hive, pig, etc?
thanks! koert
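TextInputFormat and the Text type assume UTF-8 bytes, so UTF-16 files generally need transcoding before (or during) ingest; a minimal standalone sketch, with made-up file names:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
  public static void main(String[] args) throws Exception {
    // decode UTF-16 (BOM-aware) and re-encode as UTF-8 before uploading
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
             new FileInputStream("in-utf16.txt"), StandardCharsets.UTF_16));
         Writer out = new OutputStreamWriter(
             new FileOutputStream("out-utf8.txt"), StandardCharsets.UTF_8)) {
      char[] buf = new char[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
  }
}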
.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
xmlenc-0.52.jar
On Sat, Nov 10, 2012 at 5:24 AM, Steve Loughran ste...@hortonworks.com wrote:
On 10 November 2012 07:43, Koert Kuipers ko...@tresata.com wrote:
i am porting a map-reduce library from CDH3 to apache hadoop 1.0.4. the
unit tests
i am looking at the code for RunJar.java which is behind hadoop jar for
hadoop 0.20.2 (from cdh3u5).
i see:
1) jar is unpacked to a temporary directory
2) the file URLs of all the jars found in the lib subdir of the unpacked
jar are gathered into a list called classPath
3) a new ClassLoader (a URLClassLoader) is created from this classPath and
used to run the jar's main class
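a compressed sketch of that mechanism (not RunJar's exact code; the unpack dir and main class are placeholders):

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class RunJarSketch {
  public static void main(String[] args) throws Exception {
    File workDir = new File("/tmp/unjar-placeholder"); // 1) unpacked jar
    List<URL> classPath = new ArrayList<URL>();
    classPath.add(new File(workDir, "classes/").toURI().toURL());
    classPath.add(workDir.toURI().toURL());
    File[] libs = new File(workDir, "lib").listFiles(); // 2) lib/*.jar
    if (libs != null) {
      for (File lib : libs) {
        classPath.add(lib.toURI().toURL());
      }
    }
    // 3) fresh classloader over the gathered classpath, used to run main()
    URLClassLoader loader = new URLClassLoader(classPath.toArray(new URL[0]));
    Thread.currentThread().setContextClassLoader(loader);
    Class<?> mainClass = Class.forName("com.example.Main", true, loader);
    Method main = mainClass.getMethod("main", String[].class);
    main.invoke(null, new Object[] { args });
  }
}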
currently i run a map-reduce job that reads from a single path with a glob:
/data/*
i am considering replacing this one glob path with an explicit list of all
the paths (so that i can check for _SUCCESS files in the subdirs and
exclude the subdirs that don't have this file, to avoid reading from
partially written output)
This is consistent with what Oozie does as well.
Since the listing of files happens post-submit() call, doing this will
just work :)
On Fri, Oct 12, 2012 at 8:00 PM, Koert Kuipers ko...@tresata.com
wrote:
We have a dataset that is heavily partitioned, like this
/data
partition1
We have a dataset that is heavily partitioned, like this
/data
  partition1/
    _SUCCESS
    part-0
    part-1
    ...
  partition2/
    _SUCCESS
    part-0
    part-1
    ...
We have loaders that use map-red jobs to add new partitions to this data
set at a regular interval.
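a sketch of the explicit-listing approach from this thread, using the hadoop 2 API (paths are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SuccessFilterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);
    List<Path> inputs = new ArrayList<Path>();
    // keep only partition dirs that were fully written (have _SUCCESS)
    for (FileStatus stat : fs.listStatus(new Path("/data"))) {
      if (stat.isDirectory()
          && fs.exists(new Path(stat.getPath(), "_SUCCESS"))) {
        inputs.add(stat.getPath());
      }
    }
    FileInputFormat.setInputPaths(job, inputs.toArray(new Path[0]));
  }
}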
With secure hadoop the user name is authenticated by the kerberos server.
But what about the groups that the user is a member of? Are these simply
the groups that the user is a member of on the namenode machine?
Is it viable to manage access to files on HDFS using groups on a secure
hadoop cluster?
Why does GenericOptionsParser also remove -Dprop=value options (without
setting the system properties)? Those are not hadoop options but java
options. And why does hadoop jar not accept -Dprop=value before the class
name, as java does? like hadoop jar -Dprop=value class
How do i set java system properties for my job then?
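the -D handling only applies when the program routes its args through GenericOptionsParser, which is what ToolRunner does, and that is why it must come after the class name; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// with ToolRunner, "hadoop jar my.jar MyTool -Dprop=value" puts prop into
// the Configuration (not into Java system properties)
public class MyTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    System.out.println(getConf().get("prop"));
    return 0;
  }
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
  }
}

for actual JVM system properties inside the task JVMs, the usual route is mapred.child.java.opts.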
, 2012 at 11:02 PM, Koert Kuipers ko...@tresata.com wrote:
We have smaller nodes (4 to 6 disks), and we used to write logs to the
same
disk as where the OS is. So if that disk goes then i don't really care
about tasktrackers failing. Also, the fact that logs were written to a
single
Harsh J ha...@cloudera.com wrote:
Hi Koert,
On Sun, Aug 26, 2012 at 11:20 PM, Koert Kuipers ko...@tresata.com wrote:
Hey Harsh,
Thanks for responding!
Would limiting the logging for each task via mapred.userlog.limit.kb be
strictly enforced (while the job is running)? That would solve my
The other alternative may be to switch down the log level on the task,
via mapred.map.child.log.level and/or mapred.reduce.child.log.level
set to WARN or ERROR.
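a minimal sketch of that suggestion in job-setup code:

Configuration conf = new Configuration();
conf.set("mapred.map.child.log.level", "WARN");
conf.set("mapred.reduce.child.log.level", "WARN");
// or per submission: hadoop jar my.jar MyTool -Dmapred.map.child.log.level=WARN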
On Sun, Aug 26, 2012 at 11:37 PM, Koert Kuipers ko...@tresata.com wrote:
Looks like mapred.userlog.limit.kb is implemented
Increase throughput
On Aug 26, 2012 3:21 PM, Raj Vishwanathan rajv...@yahoo.com wrote:
What do you want to use channel bonding for? Increasing throughput or
increasing availability?
Raj
--
*From:* Koert Kuipers ko...@tresata.com
*To:* user
nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable
writes code like this:

final FileSystem fs = FileSystem.get(conf);
try {
  // do something with fs
} finally {
  fs.close();
}
so i started out using hadoop
Since FileSystem is a Closeable i would expect code using it to be like
this:
FileSystem fs = path.getFileSystem(conf);
try {
  // do something with fs, such as read from the path
} finally {
  fs.close();
}
However i have repeatedly gotten into trouble with this approach. In one
situation it
at 10:34 AM, Koert Kuipers ko...@tresata.com wrote:
Since FileSystem is a Closeable i would expect code using it to be like
this:
FileSystem fs = path.getFileSystem(conf);
try {
  // do something with fs, such as read from the path
} finally {
  fs.close();
}
However i have
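the root cause is that FileSystem.get() returns a shared, cached instance, so close() closes it under every other user in the same JVM. a sketch of one workaround, FileSystem.newInstance(), which hands back a private instance (the path is a placeholder):

Path path = new Path("hdfs://namenode:8020/some/path"); // placeholder
// newInstance() bypasses the shared cache, so closing this is safe
FileSystem fs = FileSystem.newInstance(path.toUri(), conf);
try {
  // do something with fs, such as read from the path
} finally {
  fs.close();
}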
silly question, but i have our hadoop slave boxes configured with 7 mappers
each, yet i see 14 java processes for user mapred on each box. and each
process takes up about 2GB, which is equal to my memory allocation
(mapred.child.java.opts=-Xmx2048m). so it is using twice as much memory as
i
with kerberos enabled a mapreduce job runs as the user that submitted it.
does this mean the user that submitted the job needs to have linux accounts
on all machines on the cluster?
how does mapreduce do this (run jobs as the user)? do the tasktrackers use
secure impersonation to run-as the user?
Yes, a user submitting a job needs to have an account on all the nodes.
Sent from my iPhone
On Jun 7, 2012, at 6:20 AM, Koert Kuipers ko...@tresata.com wrote:
with kerberos enabled a mapreduce job runs as the user that submitted
it.
does this mean the user that submitted the job needs
do i understand it correctly that with kerberos enabled the mappers and
reducers will be run as the actual user that started them? as opposed to
the user that runs the tasktracker, which is mapred or hadoop or something
like that?
In secure mode, I guess your best bet is to use JobHistory file or
JobSummary log for the job (available _after_ job completion).
Arun
On Apr 20, 2012, at 12:35 AM, Koert Kuipers wrote:
is there a way to find out which user is running a mapred job from the
JobConf? and is this usable
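a sketch of the two usual places to look; in secure mode the authenticated user is what UserGroupInformation reports:

// who the job runs as, according to the security layer
String user = UserGroupInformation.getCurrentUser().getShortUserName();
// the submitting user is also recorded in the job configuration:
// conf.get("user.name")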
in the input data - it often
makes sense to disable splitting of the individual files by setting
the min split size to Integer.MAX_VALUE.
The description probably shouldn't use partitioned, since that
implies that the partitioner is sufficient. -C
On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers ko
I read about CompositeInputFormat and how it allows one to join two
datasets together as long as those datasets were sorted and partitioned the
same way.
Ok i think i get it, but something bothers me. It is suggested that two
datasets are sorted and partitioned the same way if they were both
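for concreteness, a sketch of the old-API setup (paths are placeholders); "sorted and partitioned the same way" means same key order, same partitioner, and the same number of partitions in both datasets:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

JobConf job = new JobConf();
job.setInputFormat(CompositeInputFormat.class);
// "inner" joins records that share a key across both (pre-sorted) inputs
job.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", SequenceFileInputFormat.class,
    new Path("/data/a"), new Path("/data/b")));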
Does anyone know of any work/ideas to encrypt data stored on hdfs?
Ideally both temporary files and final files would be encrypted. Or there
would have to be a mechanism in hdfs to securely wipe temporary files, like
shred in linux.
So far this is what i found:
to protect from, you can start thinking
about how to protect yourself.
- Tim.
--
*From:* Koert Kuipers [ko...@tresata.com]
*Sent:* Friday, January 20, 2012 1:09 PM
*To:* hdfs-user@hadoop.apache.org
*Subject:* encryption
Does anyone know of any work/ideas
what are the thoughts on running a hadoop cluster in a datacenter with
respect to power? should all the boxes have redundant power supplies and be
on dual power? or just dual power for the namenode, secondary namenode, and
hbase master, and then perhaps switch the power source per rack for the
can someone point me to the exact rules? thanks!
for example, i want to know if i can take down a slave for the afternoon to
fix something on the machine, without causing the cluster to start creating
extra copies of blocks that reside on that slave because the replication
count is down.
For a hadoop cluster that starts medium size (50 nodes) but could grow to
hundreds of nodes, what is the recommended network in the rack? 1gig or 10gig?
We have machines with 8 cores, 4 X 1tb drive (could grow to 8 X 1tb drive),
48 Gb ram per node.
We expect balanced usage of the cluster (both
are synchronous and asynchronous in c++, in java only
synchronous clients are supported)
On 21.09.2011 22:59, Koert Kuipers wrote:
i would love an IDL, plus the fact that modern serialization frameworks such
as protobuf/thrift support versioning (although i still have issues with
different versions of thrift
How do i take a machine down that is a datanode without causing the namenode
to start fixing the under-replication? Say i need to fix something on the
machine that will take 30 mins, and after that i plan to turn it back on.
Does anyone have any experience using HP Proliant SL hardware for hadoop? We
are currently using DL 160 and DL 180 servers, and the SL hardware seems to
fit the bill for our new servers in many ways. However the shared power on
chassis is not something i am fully comfortable with yet.
Koert
Hello all,
We are considering using low end HP proliant machines (DL160s and DL180s)
for cluster nodes. However with these machines if you want to do more than 4
hard drives then HP puts in a P410 raid controller. We would configure the
RAID controller to function as JBOD, by simply creating
create a RAID0 volume for each disk to
make it act as JBOD.
-Bharath
From: Koert Kuipers ko...@tresata.com
To: common-user@hadoop.apache.org
Sent: Thursday, August 11, 2011 2:50 PM
Subject: Question about RAID controllers and hadoop
Hello all,
We