where does the temporary task directory get created in mapred?

2014-09-27 Thread Koert Kuipers
i was looking at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter's setupTask method, but it does nothing and states: "FileOutputCommitter's setupTask doesn't do anything, because the temporary task directory is created on demand when the task is writing." ok i got
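
For anyone landing here: a minimal sketch (mine, not from the thread) of how a task can see where that on-demand directory will be, using FileOutputFormat.getWorkOutputPath from the new mapreduce API. The exact _temporary layout varies by Hadoop version:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WorkDirMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        // resolves to this attempt's directory under ${output}/_temporary/...;
        // the directory itself only appears once the task starts writing
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        System.err.println("task work dir: " + workDir);
      }
    }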

kerberos ticket renewal for hadoop services

2014-06-29 Thread Koert Kuipers
how do long-lived services such as the namenode or yarn resourcemanager deal with kerberos ticket expiration for the user that runs the service? do they periodically renew/refresh their tickets by calling SecurityUtil.login(conf, keytab, user, host)? where can i find an example of the code that

Re: kerberos ticket renewal for hadoop services

2014-06-29 Thread Koert Kuipers
, String path) Geoff On Jun 29, 2014, at 5:50 PM, Koert Kuipers ko...@tresata.com wrote: how do long lived services such as the namenode or yarn resourcemanager deal with kerberos ticket expiration for the user that runs the service? do they periodically renew/refresh their tickets
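
A minimal sketch of the keytab-based pattern long-lived services use, assuming the principal and keytab path shown (both hypothetical). loginUserFromKeytab does the initial login; checkTGTAndReloginFromKeytab is safe to call periodically and is a cheap no-op while the TGT is still valid:

    import java.io.IOException;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KeytabLogin {
      public static void main(String[] args) throws IOException {
        // one-time login at service startup
        UserGroupInformation.loginUserFromKeytab(
            "hdfs/some-host@SOME-REALM", "/etc/hadoop/conf/hdfs.keytab");
        // call periodically (e.g. from a timer thread) to refresh before expiry
        UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
      }
    }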

FileSystem.closeAll(UserGroupInformation ugi) does not remove filesystem from cache

2014-05-27 Thread Koert Kuipers
why does FileSystem.closeAll(UserGroupInformation ugi) not remove the FileSystem from the cache? for example, closeAll(boolean onlyAutomatic) does remove filesystems from the cache... thanks koert
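
For context, a small demo of the cache behavior in question, assuming a Hadoop version that has FileSystem.newInstance: get() hands out one shared instance per (scheme, authority, ugi) key, while newInstance bypasses the cache entirely:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FsCacheDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem a = FileSystem.get(conf);   // cached per (scheme, authority, ugi)
        FileSystem b = FileSystem.get(conf);
        System.out.println(a == b);            // true: same cached instance
        FileSystem c = FileSystem.newInstance(conf);  // not cached
        System.out.println(a == c);            // false
        c.close();                             // safe: closes only c
      }
    }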

Re: kerberos principals per node necessary?

2014-02-03 Thread Koert Kuipers
, Feb 2, 2014 at 3:14 PM, Koert Kuipers ko...@tresata.com wrote: is it necessary to create a kerberos principal for hdfs on every node, as in hdfs/some-host@SOME-REALM? why not use one principal hdfs@SOME-REALM? that way i could distribute the same keytab file to all nodes which makes things

kerberos principals per node necessary?

2014-02-02 Thread Koert Kuipers
is it necessary to create a kerberos principal for hdfs on every node, as in hdfs/some-host@SOME-REALM? why not use one principal hdfs@SOME-REALM? that way i could distribute the same keytab file to all nodes which makes things a lot easier. thanks! koert
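
The usual answer: kerberos service authentication is host-bound, and a single shared keytab would let any one compromised node impersonate the service everywhere. The _HOST placeholder is what keeps the config files identical across nodes anyway; each daemon substitutes its own hostname. A sketch (hostname hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.security.SecurityUtil;

    public class HostPrincipal {
      public static void main(String[] args) throws IOException {
        // one config value serves every node; resolved per-host at startup
        String p = SecurityUtil.getServerPrincipal(
            "hdfs/_HOST@SOME-REALM", "datanode3.example.com");
        System.out.println(p);  // hdfs/datanode3.example.com@SOME-REALM
      }
    }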

kerberos for outside threads

2014-01-21 Thread Koert Kuipers
i understand kerberos is used on hadoop to provide security in a multi-user environment, and i can totally see its usage for a shared cluster within a company to make sure sensitive data for one department is safe from prying eyes of another department. but for a hadoop cluster that sits behind a

question about hdfs data loss risk

2013-10-27 Thread Koert Kuipers
i have a cluster with replication factor 2. with the following events in this order, do i have data loss? 1) shut down a datanode for maintenance unrelated to hdfs. so now some blocks only have replication factor 1 2) a disk dies in another datanode. let's assume some blocks now have replication

Re: question about hdfs data loss risk

2013-10-27 Thread Koert Kuipers
to Hadoop. Bertrand On Sun, Oct 27, 2013 at 7:42 PM, Koert Kuipers ko...@tresata.com wrote: i have a cluster with replication factor 2. with the following events in this order, do i have data loss? 1) shut down a datanode for maintenance unrelated to hdfs. so now some blocks only have
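
A back-of-envelope sketch of the exposure (numbers made up, uniform block placement assumed), not a statement about any real cluster:

    public class LossEstimate {
      public static void main(String[] args) {
        // with replication factor 2, once the downed node's replicas are at
        // RF=1, a block is lost iff its surviving replica sat on the disk
        // that then died
        long blocksAtRf1    = 100000;  // blocks with their partner on the down node
        long remainingDisks = 200;     // disks that could hold the surviving replica
        double expectedLost = blocksAtRf1 * (1.0 / remainingDisks);
        System.out.printf("expected lost blocks: %.0f%n", expectedLost);
      }
    }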

Re: high availability

2013-10-15 Thread Koert Kuipers
handle usual R/W operations and just throws StandbyException, hbase region server may kill itself in some cases I guess. I think you can remove sshfence from the configuration if you are using QJM-based HA. On Fri, Oct 11, 2013 at 4:51 PM, Koert Kuipers ko...@tresata.com wrote: i have been

high availability

2013-10-11 Thread Koert Kuipers
i have been playing with high availability using journalnodes and 2 masters both running namenode and hbase master. when i kill the namenode and hbase-master processes on the active master, the failover is perfect. hbase never stops and a running map-reduce job keeps going. this is impressive!
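
Following up on the sshfence remark in the reply above: with QJM the journalnodes already enforce a single writer, so an always-succeeding shell fencer is a common stand-in. A sketch of the setting (normally it lives in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class QjmFencing {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // QJM guarantees a single writer, so a trivially-succeeding fencer
        // is a common substitute for sshfence
        conf.set("dfs.ha.fencing.methods", "shell(/bin/true)");
        System.out.println(conf.get("dfs.ha.fencing.methods"));
      }
    }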

UTF16

2013-01-24 Thread Koert Kuipers
is it safe to upload UTF16 encoded (unicode) text files to hadoop for processing by map-reduce, hive, pig, etc? thanks! koert
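
One caveat worth knowing: TextInputFormat splits records on the newline byte and Text assumes UTF-8, so UTF-16 files are safest transcoded before upload. A minimal standalone transcoder (input and output paths taken from args):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class Utf16ToUtf8 {
      public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream(args[0]), StandardCharsets.UTF_16));
             Writer out = new OutputStreamWriter(
                 new FileOutputStream(args[1]), StandardCharsets.UTF_8)) {
          // transcode char by char; the BOM is consumed by the UTF-16 decoder
          for (int c; (c = in.read()) != -1; ) out.write(c);
        }
      }
    }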

Re: MiniMRCluster not behaving in hadoop 1.0.4

2012-11-10 Thread Koert Kuipers
.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar xmlenc-0.52.jar On Sat, Nov 10, 2012 at 5:24 AM, Steve Loughran ste...@hortonworks.comwrote: On 10 November 2012 07:43, Koert Kuipers ko...@tresata.com wrote: i am porting a map-reduce library from CDH3 to apache hadoop 1.0.4. the unit tests

hadoop jar question

2012-11-03 Thread Koert Kuipers
i am looking at the code for RunJar.java which is behind hadoop jar for hadoop 0.20.2 (from cdh3u5). i see 1) jar is unpacked to a temporary directory 2) the file URLs of all the jars found in the lib subdir of the unpacked jar are gathered into a list called classPath 3) a new ClassLoader
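
A rough standalone imitation of steps 1 through 3 (a sketch, not the actual RunJar source; here the unpack dir comes from args rather than a generated temp dir):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    public class RunJarSketch {
      public static void main(String[] args) throws Exception {
        File workDir = new File(args[0]);  // dir the jar was unpacked into
        List<URL> cp = new ArrayList<URL>();
        cp.add(new File(workDir, "classes").toURI().toURL());
        cp.add(workDir.toURI().toURL());
        File[] libs = new File(workDir, "lib").listFiles();
        if (libs != null)
          for (File jar : libs) cp.add(jar.toURI().toURL());
        ClassLoader loader = new URLClassLoader(cp.toArray(new URL[0]),
            Thread.currentThread().getContextClassLoader());
        Thread.currentThread().setContextClassLoader(loader);
        // RunJar then reflectively invokes the main class under this loader
      }
    }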

map-red with many input paths

2012-10-16 Thread Koert Kuipers
currently i run a map-reduce job that reads from a single path with a glob: /data/* i am considering replacing this one glob path with an explicit list of all the paths (so that i can check for _SUCCESS files in the subdirs and exclude the subdirs that don't have this file, to avoid reading from
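
A sketch of the explicit-list approach being considered, using globStatus plus an _SUCCESS existence check (paths hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SuccessOnlyInputs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        List<Path> inputs = new ArrayList<Path>();
        FileStatus[] dirs = fs.globStatus(new Path("/data/*"));
        if (dirs != null)
          for (FileStatus dir : dirs)
            // keep only partitions whose writing job committed successfully
            if (fs.exists(new Path(dir.getPath(), "_SUCCESS")))
              inputs.add(dir.getPath());
        Job job = new Job(conf, "reader");
        FileInputFormat.setInputPaths(job, inputs.toArray(new Path[0]));
      }
    }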

Re: concurrency

2012-10-12 Thread Koert Kuipers
. This is consistent with what Oozie does as well. Since the listing of files happens post-submit() call, doing this will just work :) On Fri, Oct 12, 2012 at 8:00 PM, Koert Kuipers ko...@tresata.com wrote: We have a dataset that is heavily partitioned, like this /data partition1

concurrency

2012-10-12 Thread Koert Kuipers
We have a dataset that is heavily partitioned, like this /data partition1/ _SUCCESS part-0 part-1 ... partition2/ _SUCCESS part-0 part-1 ... We have loaders that use map-red jobs to add new partitions to this data set at a regular interval

Secure hadoop and group permission on HDFS

2012-10-08 Thread Koert Kuipers
With secure hadoop the user name is authenticated by the kerberos server. But what about the groups that the user is a member of? Are these simply the groups that the user is a member of on the namenode machine? Is it viable to manage access to files on HDFS using groups on a secure hadoop
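
As far as I know the groups come from the group mapping service evaluated namenode-side, which by default shells out on the namenode host, so yes, managing access via Unix (or LDAP-fed) groups there is viable. The default setting, shown programmatically:

    import org.apache.hadoop.conf.Configuration;

    public class GroupMapping {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // default: whatever Unix groups the user has on the namenode host
        conf.set("hadoop.security.group.mapping",
            "org.apache.hadoop.security.ShellBasedUnixGroupsMapping");
        System.out.println(conf.get("hadoop.security.group.mapping"));
      }
    }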

GenericOptionsParser

2012-10-03 Thread Koert Kuipers
Why does GenericOptionsParser also remove -Dprop=value options (without setting the system properties)? Those are not hadoop options but java options. And why does hadoop jar not accept -Dprop=value before the class name, as java does? like: hadoop jar -Dprop=value class. How do i set java system
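
To see where those options actually go, a small demo: GenericOptionsParser folds -Dprop=value into the job Configuration rather than JVM system properties, which is why hadoop jar -Dprop=value behaves nothing like java -Dprop=value (client JVM properties generally go through HADOOP_OPTS instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class ParseDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // -Dprop=value args are consumed into conf, retrievable as conf.get("prop"),
        // NOT into System.getProperty("prop")
        String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
        System.out.println("prop = " + conf.get("prop"));
        System.out.println("remaining args: " + remaining.length);
      }
    }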

Re: Is there a way to turn off MAPREDUCE-2415?

2012-08-26 Thread Koert Kuipers
, 2012 at 11:02 PM, Koert Kuipers ko...@tresata.com wrote: We have smaller nodes (4 to 6 disks), and we used to write logs to the same disk as where the OS is. So if that disks goes then i don't really care about tasktrackers failing. Also, the fact that logs were written to a single

Re: Is there a way to turn off MAPREDUCE-2415?

2012-08-26 Thread Koert Kuipers
, Harsh J ha...@cloudera.com wrote: Hi Koert, On Sun, Aug 26, 2012 at 11:20 PM, Koert Kuipers ko...@tresata.com wrote: Hey Harsh, Thanks for responding! Would limiting the logging for each task via mapred.userlog.limit.kb be strictly enforced (while the job is running)? That would solve my

Re: Is there a way to turn off MAPREDUCE-2415?

2012-08-26 Thread Koert Kuipers
). The other alternative may be to switch down the log level on the task, via mapred.map.child.log.level and/or mapred.reduce.child.log.level set to WARN or ERROR. On Sun, Aug 26, 2012 at 11:37 PM, Koert Kuipers ko...@tresata.com wrote: Looks like mapred.userlog.limit.kb is implemented
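
The two knobs discussed in this thread, set programmatically (values made up; note that per the messages above the kb limit governs what is kept of the log rather than being strictly enforced while the task runs):

    import org.apache.hadoop.mapred.JobConf;

    public class QuietTasks {
      public static void main(String[] args) {
        JobConf job = new JobConf();
        // cap each task attempt's userlog (the tail of the log is kept) ...
        job.set("mapred.userlog.limit.kb", "1024");
        // ... and/or have the child JVMs log less in the first place
        job.set("mapred.map.child.log.level", "WARN");
        job.set("mapred.reduce.child.log.level", "WARN");
      }
    }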

Re: ethernet bonding / 802.3ad / link aggregation

2012-08-26 Thread Koert Kuipers
Increase throughput On Aug 26, 2012 3:21 PM, Raj Vishwanathan rajv...@yahoo.com wrote: What do you want to use channel bonding for? Increasing throughput or increasing availlability? Raj -- *From:* Koert Kuipers ko...@tresata.com *To:* user

fs cache giving me headaches

2012-08-04 Thread Koert Kuipers
nothing has confused me as much in hadoop as FileSystem.close(). any decent java programmer that sees that an object implements Closeable writes code like this: final FileSystem fs = FileSystem.get(conf); try { // do something with fs } finally { fs.close(); } so i started out using hadoop
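
A minimal reproduction of the trap: because of the cache, closing "your" FileSystem closes everyone else's too. Era-appropriate escape hatches, where available, are FileSystem.newInstance or the fs.<scheme>.impl.disable.cache property:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SharedFsHazard {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs1 = FileSystem.get(conf);
        FileSystem fs2 = FileSystem.get(conf);  // same cached object as fs1
        fs1.close();
        // fs2 is now closed too; any other code holding the cached instance
        // (including the framework itself) will start failing
        fs2.exists(new Path("/"));  // throws IOException: "Filesystem closed"
      }
    }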

hadoop FileSystem.close()

2012-07-24 Thread Koert Kuipers
Since FileSystem is a Closeable i would expect code using it to be like this: FileSystem fs = path.getFileSystem(conf); try { // do something with fs, such as read from the path } finally { fs.close() } However i have repeatedly gotten into trouble with this approach. In one situation it

Re: hadoop FileSystem.close()

2012-07-24 Thread Koert Kuipers
at 10:34 AM, Koert Kuipers ko...@tresata.com wrote: Since FileSystem is a Closeable i would expect code using it to be like this: FileSystem fs = path.getFileSystem(conf); try { // do something with fs, such as read from the path } finally { fs.close() } However i have

memory usage tasks

2012-06-08 Thread Koert Kuipers
silly question, but i have our hadoop slave boxes configured with 7 mappers each, yet i see 14 java processes for user mapred on each box. and each process takes up about 2GB, which is equal to my memory allocation (mapred.child.java.opts=-Xmx2048m). so it is using twice as much memory as i
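
The likely arithmetic, assuming the boxes also have reduce slots configured (an assumption; check mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum on the nodes):

    public class SlotMemory {
      public static void main(String[] args) {
        // map and reduce slots are separate pools, and mapred.child.java.opts
        // applies to both kinds of child JVM
        int mapSlots = 7, reduceSlots = 7, xmxGb = 2;
        System.out.println("worst case: "
            + (mapSlots + reduceSlots) * xmxGb + " GB across "
            + (mapSlots + reduceSlots) + " child JVMs");
      }
    }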

kerberos mapreduce question

2012-06-07 Thread Koert Kuipers
with kerberos enabled a mapreduce job runs as the user that submitted it. does this mean the user that submitted the job needs to have linux accounts on all machines on the cluster? how does mapreduce do this (run jobs as the user)? do the tasktrackers use secure impersonation to run-as the user?

Re: kerberos mapreduce question

2012-06-07 Thread Koert Kuipers
: Yes, User submitting a job needs to have an account on all the nodes. Sent from my iPhone On Jun 7, 2012, at 6:20 AM, Koert Kuipers ko...@tresata.com wrote: with kerberos enabled a mapreduce job runs as the user that submitted it. does this mean the user that submitted the job needs
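
For reference, the piece of machinery behind this on 1.x-era secure clusters (sketch; normally configured in mapred-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class SecureTaskLaunch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // in secure mode the tasktracker execs the setuid task-controller
        // binary to launch each child JVM as the submitting user, hence that
        // user must resolve (local /etc/passwd, LDAP/sssd, etc.) on every node
        conf.set("mapred.task.tracker.task-controller",
            "org.apache.hadoop.mapred.LinuxTaskController");
      }
    }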

kerberos security enabled and hadoop/hdfs/mapred users

2012-05-03 Thread Koert Kuipers
do i understand it correctly that with kerberos enabled the mappers and reducers will be run as the actual user that started them? as opposed to the user that runs the tasktracker, which is mapred or hadoop or something like that?

Re: find out which user that is running the mapred job

2012-04-19 Thread Koert Kuipers
. In secure mode, I guess your best bet is to use JobHistory file or JobSummary log for the job (available _after_ job completion). Arun On Apr 20, 2012, at 12:35 AM, Koert Kuipers wrote: is there a way to find out which user is running a mapred job from the JobConf? and is this usable
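
Two era-appropriate lookups from the client side, as a sketch; per the reply, in secure mode the conf value is not authoritative and the JobHistory/JobSummary files are the reliable source after completion:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.security.UserGroupInformation;

    public class WhoRanIt {
      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf();
        String fromConf = job.getUser();  // "user.name" as recorded in the conf
        String fromUgi =
            UserGroupInformation.getCurrentUser().getShortUserName();
        System.out.println(fromConf + " / " + fromUgi);
      }
    }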

Re: question about org.apache.hadoop.mapred.join

2012-04-11 Thread Koert Kuipers
in the input data- it often makes sense to disable splitting of the individual files by setting the min split size to Integer.MAX_VALUE. The description probably shouldn't use partitioned, since that implies that the partitioner is sufficient. -C On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers ko

question about org.apache.hadoop.mapred.join

2012-04-10 Thread Koert Kuipers
I read about CompositeInputFormat and how it allows one to join two datasets together as long as those datasets were sorted and partitioned the same way. Ok i think i get it, but something bothers me. It is suggested that two datasets are sorted and partitioned the same way if they were both
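
For reference, a minimal wiring of the old-API join (paths hypothetical; "inner" can also be "outer" or "override"), including the min-split-size trick from the reply above:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoin {
      public static void main(String[] args) {
        JobConf job = new JobConf();
        job.setInputFormat(CompositeInputFormat.class);
        // both inputs must be sorted on the join key and partitioned the same
        // way, e.g. written by jobs with the same partitioner and reducer count
        job.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path("/data/left"), new Path("/data/right")));
        // keep individual files unsplittable so file i of one dataset lines up
        // with file i of the other
        job.setLong("mapred.min.split.size", Integer.MAX_VALUE);
      }
    }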

encryption

2012-01-20 Thread Koert Kuipers
Does anyone know of any work/ideas to encrypt data stored on hdfs? Ideally both temporary files and final files would be encrypted. Or there would have to be a mechanism in hdfs to securely wipe temporary files, like shred in linux. So far this is what i found:

Re: encryption

2012-01-20 Thread Koert Kuipers
to protect from, you can start thinking about how to protect yourself. - Tim. -- *From:* Koert Kuipers [ko...@tresata.com] *Sent:* Friday, January 20, 2012 1:09 PM *To:* hdfs-user@hadoop.apache.org *Subject:* encryption Does anyone know of any work/ideas

dual power for hadoop in datacenter?

2012-01-07 Thread Koert Kuipers
what are the thoughts on running a hadoop cluster in a datacenter with respect to power? should all the boxes have redundant power supplies and be on dual power? or just dual power for the namenode, secondary namenode, and hbase master, and then perhaps switch the power source per rack for the

when does hdfs start to fix under-replication of a block?

2012-01-07 Thread Koert Kuipers
can someone point me to the exact rules? thanks! for example, i want to know if i can take down a slave for the afternoon to fix something on the machine, without causing the cluster to start creating extra copies of blocks that reside on that slave because the replication count is down.
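
The rule of thumb I know for stock Hadoop of this era (worth verifying against your version's defaults): the namenode only schedules re-replication once it declares the datanode dead, which takes about 10.5 minutes by default:

    public class DeadNodeTimeout {
      public static void main(String[] args) {
        // dead after 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
        double recheckSec = 300, heartbeatSec = 3;  // stock defaults
        double deadAfter = 2 * recheckSec + 10 * heartbeatSec;
        System.out.println(deadAfter + "s, i.e. 10.5 minutes; a longer outage"
            + " triggers re-replication of that node's blocks");
      }
    }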

1gig or 10gig network for cluster?

2011-12-23 Thread Koert Kuipers
For a hadoop cluster that starts medium size (50 nodes) but could grow to hundreds of nodes, what is the recommended network in the rack? 1gig or 10gig? We have machines with 8 cores, 4 X 1tb drive (could grow to 8 X 1tb drive), 48 Gb ram per node. We expect balanced usage of the cluster (both

Re: A question about RPC

2011-09-23 Thread Koert Kuipers
are synchronous and asynchronous in c++, in java only synchronous clients are supported) On 21.09.2011 22:59, Koert Kuipers wrote: i would love an IDL, plus that modern serialization frameworks such as protobuf/thrift support versioning (although i still have issues with different versions of thrift

How to take a datanode down temporarily

2011-08-27 Thread Koert Kuipers
How do i take a machine down that is a datanode without causing the namenode to start fixing the under-replication? Say i need to fix something on the machine that will take 30 mins, and after that i plan to turn it back on.

hp sl servers with hadoop?

2011-08-24 Thread Koert Kuipers
Does anyone have any experience using HP Proliant SL hardware for hadoop? We are currently using DL 160 and DL 180 servers, and the SL hardware seems to fit the bill for our new servers in many ways. However the shared power on chassis is not something i am fully comfortable with yet. Koert

Question about RAID controllers and hadoop

2011-08-11 Thread Koert Kuipers
Hello all, We are considering using low end HP proliant machines (DL160s and DL180s) for cluster nodes. However with these machines, if you want more than 4 hard drives, HP puts in a P410 raid controller. We would configure the RAID controller to function as JBOD, by simply creating

Re: Question about RAID controllers and hadoop

2011-08-11 Thread Koert Kuipers
create a RAID0 for each disk to make it act as JBOD. -Bharath From: Koert Kuipers ko...@tresata.com To: common-user@hadoop.apache.org Sent: Thursday, August 11, 2011 2:50 PM Subject: Question about RAID controllers and hadoop Hello all, We