Long running Yarn Applications on a secured HA cluster?

2016-01-28 Thread Niels Basjes
ards / Met vriendelijke groeten, Niels Basjes

Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
(in either Hadoop or Flink) or am I doing something wrong? Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this? Niels Basjes 21:30:27,821 WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException

Re: Not able to run more than one map task

2015-04-10 Thread Niels Basjes
Just curious: what is the input for your job? If it is a single gzipped file then that is the cause of getting exactly 1 mapper. Niels On Fri, Apr 10, 2015, 09:21 Amit Kumar amiti...@msn.com wrote: Thanks a lot Harsha for replying. This problem has wasted at least the last week. We tried

Re: way to add custom udf jar in hadoop 2.x version

2015-01-04 Thread Niels Basjes
'] ]; Is this something for which there is already a JIRA (couldn't find it)? If not; Should I create one? (I.e. do you think this would make sense for others?) Niels Basjes On Fri, Jan 2, 2015 at 9:00 PM, Yakubovich, Alexey alexey.yakubov...@searshc.com wrote: Try to look here: http://stackoverflow.com

Re: way to add custom udf jar in hadoop 2.x version

2015-01-04 Thread Niels Basjes
I created https://issues.apache.org/jira/browse/HIVE-9252 for this improvement. On Sun, Jan 4, 2015 at 5:16 PM, Niels Basjes ni...@basjes.nl wrote: Hi, These options: - HIVE_HOME/auxlib - http://stackoverflow.com/questions/14032924/how-to-add-serde-jar - ADD JAR commands in your $HOME

Re: way to add custom udf jar in hadoop 2.x version

2014-12-31 Thread Niels Basjes
Thanks for the pointer. This seems to work for functions. Is there something similar for CREATE EXTERNAL TABLE ?? Niels On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote: Have you seen this thread ?

Re: to all this unsubscribe sender

2014-12-05 Thread Niels Basjes
or anyone who has used your email account. cheers Aleks -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: to all this unsubscribe sender

2014-12-05 Thread Niels Basjes
Dec 2014 18:05, Niels Basjes ni...@basjes.nl wrote: Yes, I agree. We should accept people as they are. So perhaps we should increase the hurdle to subscribe in the first place? Something like adding a question such as "What do you do if you want to unsubscribe from a mailing list?" That way

Re: Are these configuration parameters deprecated?

2014-11-14 Thread Niels Basjes
. Perhaps an issue indicating that the use of the deprecated parameters should be removed from the main code base is in order here. Niels Basjes On Fri, Nov 14, 2014 at 9:22 PM, Tianyin Xu t...@cs.ucsd.edu wrote: Hi, I'm very confused by some of the MapReduce configuration parameters which appear

Re: Spark vs Tez

2014-10-19 Thread Niels Basjes
Very interesting! What makes Tez more scalable than Spark? What architectural thing makes the difference? Niels Basjes On Oct 19, 2014 3:07 AM, Jeff Zhang zjf...@gmail.com wrote: Tez has a feature called pre-warm which will launch JVM before you use it and you can reuse the container

Re: Spark vs Tez

2014-10-18 Thread Niels Basjes
suitable. Did I understand correctly? Niels Basjes On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote: Spark and tez both make MR faster, this has no doubt. They also provide new features like DAG, which is quite important for interactive query processing. From this perspective, you

Re: Bzip2 files as an input to MR job

2014-09-22 Thread Niels Basjes
on the Internet. Georgi -- Best regards / Met vriendelijke groeten, Niels Basjes

Generating mysql or sqlite datafiles from Hadoop (Java)?

2013-09-17 Thread Niels Basjes
googling. Does anyone know where I can find such a thing? -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Why LineRecordWriter.write(..) is synchronized

2013-08-11 Thread Niels Basjes
I expect the impact on the IO speed to be almost 0 because waiting for a single disk seek is longer than many thousands of calls to a synchronized method. Niels On Aug 11, 2013 3:00 PM, Harsh J ha...@cloudera.com wrote: Yes, I feel we could discuss this over a JIRA to remove it if it hurts

Re: Why LineRecordWriter.write(..) is synchronized

2013-08-08 Thread Niels Basjes
-- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Is there any way to use a hdfs file as a Circular buffer?

2013-07-24 Thread Niels Basjes
A circular file on hdfs is not possible. Some of the ways around this limitation: - Create a series of files and delete the oldest file when you have too much. - Put the data into an hbase table and do something similar. - Use completely different technology like mongodb which has built in
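The first workaround (a series of files, deleting the oldest when there are too many) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the local filesystem through java.nio as a stand-in for HDFS, and the class name RollingSegments, the segment naming scheme and the maxSegments limit are all hypothetical, not taken from the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: emulates a circular buffer with a bounded series of segment files.
// Assumption: a local directory stands in for an HDFS directory.
class RollingSegments {
    private final Path dir;
    private final int maxSegments;
    private final Deque<Path> segments = new ArrayDeque<>();
    private long nextId = 0;

    RollingSegments(Path dir, int maxSegments) {
        this.dir = dir;
        this.maxSegments = maxSegments;
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes one new segment file, then deletes the oldest segments
    // until at most maxSegments remain.
    Path append(String data) {
        Path segment = dir.resolve(String.format("segment-%06d", nextId++));
        try {
            Files.write(segment, data.getBytes(StandardCharsets.UTF_8));
            segments.addLast(segment);
            while (segments.size() > maxSegments) {
                Files.deleteIfExists(segments.removeFirst());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return segment;
    }

    int segmentCount() {
        return segments.size();
    }

    Path oldest() {
        return segments.peekFirst();
    }
}
```

Readers of such a buffer would list the directory and process the segments in name order; the oldest data simply disappears once the limit is reached.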

Use a URL for the HADOOP_CONF_DIR?

2013-07-15 Thread Niels Basjes
) you need them all to update their config files. My question is: Can you set the HADOOP_CONF_DIR to be a URL on a webserver? A while ago I tried this and (back then) it didn't work. Would this be a useful enhancement? -- Best regards, Niels Basjes

Running a single cluster in multiple datacenters

2013-07-15 Thread Niels Basjes
fast. What things should we consider also? Has anyone any experience with such a setup? Is it a good idea to do this? What are better options for us to consider? Thanks for any input. -- Best regards, Niels Basjes

Re: Inputformat

2013-06-21 Thread Niels Basjes
If you try to hammer in a nail (json file) with a screwdriver (XMLInputReader) then perhaps the reason it won't work may be that you are using the wrong tool? On Jun 21, 2013 11:38 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, I am using one of the libraries which rely on InputFormat.

Re: gz containing null chars?

2013-06-10 Thread Niels Basjes
My best guess is that at a low level a string is often terminated by having a null byte at the end. Perhaps that's where the difference lies. Perhaps the gz decompressor simply stops at the null byte and the basic record reader that follows simply continues. In this situation your input file

Re: Reducer to output only json

2013-06-04 Thread Niels Basjes
Have you tried something like this (I do not have a PC here to check this code): context.write(NullWritable.get(), new Text(jsn.toString())); On Jun 4, 2013 8:10 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I have the following reducer class public static class TokenCounterReducer

Re: Experimental Hadoop Cluster - Linux Windows machines

2013-06-01 Thread Niels Basjes
on something identical to what you are describing here. Niels Basjes On Sat, Jun 1, 2013 at 9:47 PM, Rody BigData rodybigd...@gmail.com wrote: I have some old ( not very old - each of 4GB RAM with a decent processor etc., and working fine till now ) Dell Windows XP machines and want to convert

Re: Experimental Hadoop Cluster - Linux Windows machines

2013-06-01 Thread Niels Basjes
I've installed CentOS on several different types of old (originally Windows XP) Dell desktops over the last 4 years (including desktops up to 7 years old) and so far installing CentOS was as easy as booting from the installation CD/DVD and doing next, next, finish. The only thing that you may run

Re: Configuring SSH - is it required for a pseudo-distributed mode?

2013-05-19 Thread Niels Basjes
I never configure the ssh feature. Not for running on a single node and not for a full size cluster. I simply start all the required daemons (name/data/job/task) and configure the ports on which each can be reached. Niels Basjes On May 16, 2013 4:55 PM, Raj Hadoop hadoop...@yahoo.com wrote

Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-16 Thread Niels Basjes
cluster. On Tue, May 14, 2013 at 5:09 PM, Niels Basjes ni...@basjes.nl wrote: I made a typo. I meant API (instead of SPI). Have a look at this for more information: http://stackoverflow.com/questions/833768/java-code-for-getting-current-time If you have a client that is not under

Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-14 Thread Niels Basjes
the time from the namenode or jobtracker would suffice. i looked at JobClient but didn't see anything helpful. -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-14 Thread Niels Basjes
time is easy. Niels Basjes On Tue, May 14, 2013 at 5:46 PM, Jane Wayne jane.wayne2...@gmail.com wrote: niels, i'm not familiar with the native java spi. spi = service provider interface? could you let me know if this spi is part of the hadoop api? if so, which package/class? but yes, all

Re: How to process only input files containing 100% valid rows

2013-04-19 Thread Niels Basjes
How about a different approach: If you use the multiple output option you can process the valid lines in a normal way and put the invalid lines in a special separate output file. On Apr 18, 2013 9:36 PM, Matthias Scherer matthias.sche...@1und1.de wrote: Hi all, In my mapreduce job,
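The routing idea in this reply can be illustrated without any Hadoop dependency. The sketch below is an assumption-laden stand-in: two in-memory lists play the role of the normal output and the separate output file for invalid lines, and the class name LineRouter and the validity predicate are hypothetical. In a real job the "multiple output option" presumably refers to Hadoop's MultipleOutputs facility, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: valid lines go to the normal output,
// invalid lines go to a separate output instead of failing the job.
class LineRouter {
    final List<String> valid = new ArrayList<>();
    final List<String> invalid = new ArrayList<>();

    // Routes every line to exactly one of the two outputs.
    void route(Iterable<String> lines, Predicate<String> isValid) {
        for (String line : lines) {
            if (isValid.test(line)) {
                valid.add(line);
            } else {
                invalid.add(line);
            }
        }
    }
}
```

After the run, an empty invalid output tells you the input file contained only valid rows, which is one way to answer the original question without aborting halfway through.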

Re: Can I perform a MR on my local filesystem

2013-02-16 Thread Niels Basjes
Have a look at this: http://stackoverflow.com/questions/3546025/is-it-possible-to-run-hadoop-in-pseudo-distributed-operation-without-hdfs -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 17 Feb 2013 07:51, Agarwal, Nikhil nikhil.agar...@netapp.com wrote: Hi

Re: how to find top N values using map-reduce ?

2013-02-02 Thread Niels Basjes
My suggestion is to use secondary sort with a single reducer. That way you can easily extract the top N. If you want to get the top N% you'll need an additional phase to determine how many records this N% really is. -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 2 Feb
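A plain-Java sketch of the same reasoning, with no Hadoop involved: an in-memory descending sort plays the role of the secondary sort, and taking the first N elements is what the single reducer would do. The class and method names are hypothetical, and rounding the N% record count up is an assumption of this sketch, not something stated in the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of "sort descending, then keep the first N".
class TopN {
    // Stand-in for secondary sort plus a single reducer that stops after n records.
    static List<Integer> topN(List<Integer> values, int n) {
        List<Integer> sorted = new ArrayList<>(values);
        sorted.sort(Collections.reverseOrder());
        int limit = Math.max(0, Math.min(n, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }

    // Top N% needs the total record count first (the "additional phase").
    // Assumption: the resulting count is rounded up.
    static List<Integer> topPercent(List<Integer> values, double percent) {
        int n = (int) Math.ceil(values.size() * percent / 100.0);
        return topN(values, n);
    }
}
```

The extra phase shows up here as values.size(): in a MapReduce setting that count is not known until a first pass over the data has finished.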

Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-30 Thread Niels Basjes
F. put a mongodb replica set on all hadoop workernodes and let the tasks query the mongodb at localhost. (this is what I did recently with a multi GiB dataset) -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 30 Dec 2012 20:01, Jonathan Bishop jbishop@gmail.com wrote

Re: Doubts on compressed file

2012-11-07 Thread Niels Basjes
into blocks and stored in HDFS? Yes, and then the mapper will read the other parts of the file over the network. So what I do is I upload such files with a bigger HDFS blocksize so the mapper has the entire file locally. -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Hadoop Real time help

2012-08-22 Thread Niels Basjes
. Then you also have other solutions which will allow you to scale such as Storm. A few people have already considered using Storm for scalability and Esper to do the real computation. Regards Bertrand On Sun, Aug 19, 2012 at 9:44 PM, Niels Basjes ni...@basj.es wrote: Is there a complete

Re: output/input ratio 1 for map tasks?

2012-07-30 Thread Niels Basjes
: Have a look at the WordCount example. Input of a single map call is 1 record: "This is a line". Output is 4 records: (This, 1), (is, 1), (a, 1), (line, 1). -- Best regards / Met vriendelijke groeten, Niels Basjes
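The WordCount point above can be checked with a small stand-alone sketch. It is not the Hadoop WordCount mapper itself; the class name WordCountMap and the use of a plain list of key/value pairs instead of a Hadoop Context are assumptions made so the example runs without any Hadoop libraries.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the map step of WordCount:
// one input record (a line) yields one (word, 1) record per word.
class WordCountMap {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }
}
```

Calling WordCountMap.map("This is a line") produces four records from one input record, so an output/input record ratio above 1 for map tasks is entirely normal.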

Making gzip splittable for Hadoop

2012-03-30 Thread Niels Basjes
and up (I tested it with Cloudera CDH4b1). So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823). Running mvn package automatically generates an RPM on my CentOS system. Have fun with it and let me know what you think. -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Merge sorting reduce output files

2012-03-01 Thread Niels Basjes
that takes the output of run 1 and creates an aggregate that can be used to partition the dataset 2) Use the partitioning dataset from '1)' to distribute the processing for the next run. Thanks for your suggestions. -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Merge sorting reduce output files

2012-02-29 Thread Niels Basjes
and from there simply manually define the partitions based on the pattern we find. -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
each line in the very first mapper. Then we store the result in (snappy compressed) avro files. I don't disagree, I just want to have a solid argument in favor of it... :) -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
of concatenated gzipped files. (HADOOP-7909) -- Best regards / Met vriendelijke groeten, Niels Basjes

Should splittable Gzip be a core hadoop feature?

2012-02-28 Thread Niels Basjes
it by setting the right configuration? - a separate library? - a nice idea I had fun building but that no one needs? - ... ? -- Best regards / Met vriendelijke groeten, Niels Basjes

Merge sorting reduce output files

2012-02-28 Thread Niels Basjes
regards / Met vriendelijke groeten, Niels Basjes

Re: Merge sorting reduce output files

2012-02-28 Thread Niels Basjes
and base the partitioning on that (like the one used in terasort) wouldn't help. The data has a special distribution... Niels Basjes --Bobby Evans On 2/28/12 2:10 PM, Niels Basjes ni...@basjes.nl wrote: Hi, We have a job that outputs a set of files that are several hundred MB

Re: How to Create an effective chained MapReduce program.

2011-09-05 Thread Niels Basjes
groet, Niels Basjes On 6 Sep 2011 01:54, ilyal levin nipponil...@gmail.com wrote: OK, so now I'm using SequenceFileInputFormat and SequenceFileOutputFormat and it works fine, but the output of the reducer is now a binary file (not txt) so I can't understand the data. How can I

Re: Executing a shell script inside the HDFS

2011-08-16 Thread Niels Basjes
Yes, that way it could work. I'm just wondering ... Why would you want to have a script like this in HDFS? Met vriendelijk groet, Niels Basjes On 16 Aug 2011 06:49, Friso van Vollenhoven fvanvollenho...@xebia.com wrote: hadoop fs -cat /path/on/hdfs/script.sh | bash Should

Re: How to select random n records using mapreduce ?

2011-06-27 Thread Niels Basjes
/ mapper_number records. Does anyone have such experience? -- Best Regards Jeff Zhang -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: AW: How to split a big file in HDFS by size

2011-06-21 Thread Niels Basjes
blocks in parallel.) Note that you’ll need enough storage capacity. I don’t have example code, but I’m guessing Google can help. From: Mapred Learn [mailto:mapred.le...@gmail.com] Sent: Monday 20 June 2011 18:09 To: Niels Basjes; Evert Lammerts Subject: Re: AW: How to split a big file

Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Niels Basjes
mappers running in parallel. Isn't it so ? Yes, that is very true. -- Best regards / Met vriendelijke groeten, Niels Basjes

Re: Using df instead of du to calculate datanode space

2011-05-21 Thread Niels Basjes
*/ -- Met vriendelijke groeten, Niels Basjes

Including external libraries in my job.

2011-05-03 Thread Niels Basjes
) at org.apache.hadoop.mapred.Task.initialize(Task.java:486) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) ... So what is the correct way of doing this? -- Met vriendelijke groeten, Niels Basjes

Unsplittable files on HDFS

2011-04-27 Thread Niels Basjes
that a datanode that has blocks of this file must always have ALL blocks of this file? -- Best regards, Niels Basjes

Re: Unsplittable files on HDFS

2011-04-27 Thread Niels Basjes
at 1:25 PM, Niels Basjes ni...@basjes.nl wrote: Hi, In some scenarios you have gzipped files as input for your map reduce job (apache logfiles is a common example). Now some of those files are several hundred megabytes and as such will be split by HDFS in several blocks. When looking

Re: hadoop mr cluster mode on my laptop?

2011-04-18 Thread Niels Basjes
Set 64-bit I was thinking of running a Tasktracker in each Core, although I don't know how to do it. Any help to install Hadoop MR in cluster mode on my laptop? Thanks, -- Pedro -- Met vriendelijke groeten, Niels Basjes

Re: Small linux distros to run hadoop ?

2011-04-15 Thread Niels Basjes
scripting for anaconda (= Redhat installer). -- Met vriendelijke groeten, Niels Basjes

Re: Re-generate datanode storageID?

2011-03-24 Thread Niels Basjes
storageID (as defined in .../cache/hdfs/dfs/data/current/VERSION). So, my question is: how do I resolve the collision of the storageIDs? Thanks! -mgl -- Met vriendelijke groeten, Niels Basjes

Re: TextInputFormat and Gzip encoding - wordcount displaying binary data

2011-03-21 Thread Niels Basjes
hadoop code actually chooses the decompressor on the extension of the filename. -- Niels Basjes
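The behaviour described here (picking the decompressor from the filename extension) can be mimicked in a few lines. This is a simplified sketch, not Hadoop's actual lookup code: the class name CodecByExtension, the two-entry table and the codec names as plain strings are all assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: choose a decompressor purely from the filename suffix.
class CodecByExtension {
    private static final Map<String, String> CODECS = new LinkedHashMap<>();

    static {
        CODECS.put(".gz", "gzip");
        CODECS.put(".bz2", "bzip2");
    }

    // Returns the codec name for the file, or null when no suffix matches,
    // in which case the file would be read as uncompressed data.
    static String codecFor(String filename) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (filename.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

With a lookup like this, a gzipped file stored without the .gz suffix gets no decompressor at all, which would explain a word count that prints binary data.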

Re: File formats in Hadoop

2011-03-20 Thread Niels Basjes
solution of their own a while ago. Howl? -- Harsh J http://harshj.com -- Met vriendelijke groeten, Niels Basjes

Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Niels Basjes
. Disadvantage: Each run will show different results. Only works if the set of keys that needs to be chopped is small enough so you can have it in memory in the call to the second map. HTH Niels Basjes 2011/3/10 Luca Aiello alu...@yahoo-inc.com: Dear users, hope this is the right list to submit

Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Niels Basjes
the distribution. This can introduce some errors but it should produce an output which is quite uniformly distributed. Thanks again! You're welcome. Niels On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote: If I understand your problem correctly you actually need some way of knowing if you need

Re: Comparison between Gzip and LZO

2011-03-02 Thread Niels Basjes
. -- Met vriendelijke groeten, Niels Basjes

Re: How to make a CGI with HBase?

2011-03-01 Thread Niels Basjes
groeten, Niels Basjes

Re: When use hadoop mapreduce?

2011-02-18 Thread Niels Basjes
. HTH -- Met vriendelijke groeten, Niels Basjes

Re: Hadoop in Real time applications

2011-02-17 Thread Niels Basjes
). If you really need realtime (as in: I want a guarantee that I have an answer within 0.x seconds) the answer is: No, HDFS/HBase cannot guarantee that. Other components like MapReduce (and Hive, which runs on top of MapReduce) are purely batch oriented. -- Met vriendelijke groeten, Niels Basjes

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

2011-01-31 Thread Niels Basjes
of MR parallelism? AFAIK it should be splittable in the same blocks as the compression was done. How to control the size of block to be compressed in SequenceFile? Can't help you with that one. -- Met vriendelijke groeten, Niels Basjes

Re: Restricting number of records from map output

2011-01-14 Thread Niels Basjes
is stop reading the input iterator after N records and limit the output in that way. Doing it in the reducer also allows you to easily add a concept of Top N by using the Secondary Sort trick to sort the input before it arrives at the reducer. HTH Niels Basjes
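Stopping after N records, as suggested for the reducer, looks roughly like this in plain Java. The sketch works on an ordinary java.util.Iterator rather than the values iterator a Hadoop reducer receives, and the class name LimitN is hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: emit at most n records and stop reading the input after that.
class LimitN {
    static <T> List<T> firstN(Iterator<T> values, int n) {
        List<T> out = new ArrayList<>();
        // The size check comes first, so no extra element is consumed once n is reached.
        while (out.size() < n && values.hasNext()) {
            out.add(values.next());
        }
        return out;
    }
}
```

If the input arrives already sorted (the Secondary Sort trick mentioned above), the first N records read this way are the top N.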

Re: TeraSort question.

2011-01-11 Thread Niels Basjes
) I am using CDH3B3, even though I think this is not specific to CDH3B3. Sorry for the cross post. Raj -- Met vriendelijke groeten, Niels Basjes

Re: Help: How to increase amount maptasks per job ?

2011-01-07 Thread Niels Basjes
are being used, as we are not getting the full advantage of our cluster. -- Met vriendelijke groeten, Niels Basjes

Re: FILE_BYTES_WRITTEN and HDFS_BYTES_WRITTEN

2010-11-30 Thread Niels Basjes
? Thanks, Pedro -- Met vriendelijke groeten, Niels Basjes

Re: Control the number of Mappers

2010-11-25 Thread Niels Basjes
-site.xml using these two settings: mapred.tasktracker.{map|reduce}.tasks.maximum Have a look at this page for more information http://hadoop.apache.org/common/docs/current/cluster_setup.html -- Met vriendelijke groeten, Niels Basjes

Re: Control the number of Mappers

2010-11-25 Thread Niels Basjes
a (certain) job to invoke exactly N Mappers, where N is the number of cores in the cluster. Regardless of the size of the data. This is not critical if it can't be done, but it can improve the performance of my job if it can be done. Thanks Shai On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes

Re: Predicting how many values will I see in a call to reduce?

2010-11-08 Thread Niels Basjes
that you can't. The main limit is that the Iterator does not have a size or length method. -- Met vriendelijke groeten, Niels Basjes

Re: Duplicated entries with map job reading from HBase

2010-11-05 Thread Niels Basjes
times there is a design fault in the processing and the combiner disrupts the processing. HTH Niels Basjes 2010/11/5 Adam Phelps a...@opendns.com I've noticed an odd behavior with a map-reduce job I've written which is reading data out of an HBase table. After a couple days of poking

Understanding FileInputFormat and isSplittable.

2010-09-07 Thread Niels Basjes
would like to understand the logic behind the current implementation choice in relation to what I expected (mainly from the documentation). Thanks for explaining. -- Best regards, Niels Basjes