Re: map side Vs. Reduce side join

2009-07-17 Thread jason hadoop
I seem to be one of the mapside join champions. For jobs that fit that pattern there is usually a 100x speed improvement, compared to doing reduce side joins, for real (large) datasets. On Wed, Jul 15, 2009 at 12:05 PM, bonito perdo bonito.pe...@googlemail.com wrote: Thank you for your

Re: help with two column sort

2009-07-17 Thread jason hadoop
that let you log what is going on in the field comparator or field partitioner. On Thu, Jul 16, 2009 at 11:05 PM, jason hadoop jason.had...@gmail.com wrote: In the example code for Pro Hadoop there are some shims for the field comparator classes, that let you log what is going

Re: Compression issues!!

2009-07-15 Thread jason hadoop
Particularly for highly compressible data such as web log files, the loss in potential data locality is more than made up for by the increase in network transfer speed. The other somewhat unexpected side benefit is that there are fewer map tasks with less task startup overhead. If your data is not

Re: Relation between number of map reduce tasks per node and capacity of namenode

2009-07-14 Thread jason hadoop
The namenode is pretty much driven by the number of blocks and the number of files in your HDFS, and to a lesser extent, the rate of create/open/write/close of files. If you have any instability in your datanodes, there is a great increase in namenode loading. On Tue, Jul 14, 2009 at 4:16 AM,

Re: more than one reducer in standalone mode

2009-07-14 Thread jason hadoop
Thanks Tom. The single reducer is greatly limiting in local mode. On Tue, Jul 14, 2009 at 3:15 AM, Tom White t...@cloudera.com wrote: There's a Jira to fix this here: https://issues.apache.org/jira/browse/MAPREDUCE-434 Tom On Mon, Jul 13, 2009 at 12:34 AM, jason

Re: more than one reducer in standalone mode

2009-07-12 Thread jason hadoop
If the jobtracker is set to local, there is no way to have more than 1 reducer. On Sun, Jul 12, 2009 at 12:21 PM, Rares Vernica rvern...@gmail.com wrote: Hello, Is it possible to have more than one reducer in standalone mode? I am currently using 0.17.2.1 and I do:

Re: .tar.gz as input files

2009-07-11 Thread jason hadoop
There is already support for tar.gz, but it is buried. FileUtil provides a static unTar method. This is only used currently by the DistributedCache for unpacking archives. On Fri, Jul 10, 2009 at 2:58 AM, Andraz Tori and...@zemanta.com wrote: Has anyone written a TarGzipCodec decompressor for

Re: how to compress..!

2009-07-11 Thread jason hadoop
Here is the set of configuration parameters for compression from 0.19. You can enable mapred.compress.map.output and mapred.output.compress, as well as set mapred.output.compression.type to BLOCK, for a good set of defaults. The compression codecs vary substantially by release, so I won't go
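
A minimal sketch of those three settings with the old JobConf API (the wrapper class name is illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class CompressionDefaults {
      public static void configure(JobConf conf) {
        // Compress intermediate map output to cut shuffle volume.
        conf.setBoolean("mapred.compress.map.output", true);
        // Compress the final job output as well.
        conf.setBoolean("mapred.output.compress", true);
        // BLOCK compression amortizes codec overhead across many records.
        conf.set("mapred.output.compression.type", "BLOCK");
      }
    }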

Re: Sort by value

2009-07-09 Thread jason hadoop
The simplest way is to swap the key and value in your mapper's output, then swap them back afterward. On Thu, Jul 9, 2009 at 7:52 AM, Marcus Herou marcus.he...@tailsweep.com wrote: Hi many times I want to sort by value instead of key. For instance when counting the top used tags in blog posts
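
A minimal sketch of the swap with the old mapred API, assuming the values being sorted on are counts (class names and types are illustrative, not from the thread). Note the default ordering is ascending; a descending comparator, or negating the counts, would be needed for a top-N list.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Map phase: emit (count, tag) so the framework sorts on the count.
    class SwapMapper extends MapReduceBase
        implements Mapper<Text, LongWritable, LongWritable, Text> {
      public void map(Text tag, LongWritable count,
          OutputCollector<LongWritable, Text> out, Reporter r) throws IOException {
        out.collect(count, tag);
      }
    }

    // Reduce phase: swap back so the output is (tag, count), now ordered by count.
    class SwapBackReducer extends MapReduceBase
        implements Reducer<LongWritable, Text, Text, LongWritable> {
      public void reduce(LongWritable count, Iterator<Text> tags,
          OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
        while (tags.hasNext()) {
          out.collect(new Text(tags.next()), count);
        }
      }
    }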

Re: Accessing static variables in map function

2009-07-09 Thread jason hadoop
To clarify for all of the writers: Store the values you wish to share with your map tasks in the JobConf object. In the configure method of your mapper class, unpack the variables and store them in class fields of the mapper class. Then use them as needed in the map method of your mapper class. On
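
A minimal sketch of the pattern (the property name my.job.threshold and the field are made-up examples):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // In the driver, before submitting the job:
    //   conf.setInt("my.job.threshold", 42);

    abstract class SharedValueMapper extends MapReduceBase {
      protected int threshold;

      // Called once per task before any map() calls; unpack shared values here.
      public void configure(JobConf job) {
        threshold = job.getInt("my.job.threshold", 0);
      }
    }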

Re: permission denied on additional binaries

2009-07-08 Thread jason hadoop
Just out of curiosity, what happens when you run your script by hand? On Wed, Jul 8, 2009 at 8:09 AM, Rares Vernica rvern...@gmail.com wrote: On Tue, Jul 7, 2009 at 10:26 PM, jason hadoop jason.had...@gmail.com wrote: The mapper has no control at the point where your mymapper.sh script

Re: local directory

2009-07-01 Thread jason hadoop
you. On Wed, Jul 1, 2009 at 5:13 PM, jason hadoop jason.had...@gmail.com wrote: The parameter mapred.local.dir controls the directory used by the task tracker for map/reduce jobs local files. the dfs.data.dir paramter is for the datanode. On Wed, Jul 1, 2009 at 8:56 AM, bonito

Re: Hadoop auto-installation scripts

2009-07-01 Thread jason hadoop
try the cloudera distributions, they have one based on 18.3, and soon (perhaps already) on 20.0 www.cloudera.com On Wed, Jul 1, 2009 at 9:45 PM, akhil1988 akhilan...@gmail.com wrote: Hi All, Has anyone written Hadoop auto-installation script for a cluster? If yes, please let me know.

Re: combine two map tasks

2009-06-28 Thread jason hadoop
The ChainMapper class introduced in Hadoop 19 will provide you with the ability to have an arbitrary number of mappers run one after the other, in the context of a single job. The one issue to be aware of is that the chain of mappers only sees the output of the previous map in the chain. There
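
A sketch of chaining two mappers in one job with the 0.19 ChainMapper API; AMap and BMap are hypothetical mapper classes assumed to exist:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    public class ChainSetup {
      public static void configure(JobConf job) {
        // First mapper in the chain reads the job input...
        ChainMapper.addMapper(job, AMap.class,
            LongWritable.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));
        // ...the second sees only AMap's output, not the original input.
        ChainMapper.addMapper(job, BMap.class,
            Text.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));
      }
    }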

Re: Scaling out/up or a mix

2009-06-27 Thread jason hadoop
How about multi-threaded mappers? Multi-threaded mappers are ideal for map tasks that are I/O bound on non-local resources with many distinct endpoints. You can also control the thread count on a per job basis. On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou marcus.he...@tailsweep.com wrote: The argument

Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread jason hadoop
I believe the cloudera 18.3 supports bzip2 On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed usm...@opera.com wrote: Hi All, Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3? I tried but interestingly the output was not what i expected versus what i got when my data was in

Re: CompositeInputFormat scalbility

2009-06-24 Thread jason hadoop
The join package does a streaming merge sort between each part-X in your input directories: part-0000 will be handled in a single task, part-0001 will be handled in a single task, and so on. These jobs are essentially io bound, and hard to beat for performance. On Wed, Jun 24, 2009 at 2:09 PM, pmg

Re: CompositeInputFormat scalbility

2009-06-24 Thread jason hadoop
with 64m block size get 16 blocks mapped to different map tasks? jason hadoop wrote: The join package does a streaming merge sort between each part-X in your input directories: part-0000 will be handled in a single task, part-0001 will be handled in a single task, and so on. These jobs

Re: Does balancer ensure a file's replication is satisfied?

2009-06-23 Thread jason hadoop
The namenode is constantly receiving reports about what datanode has what blocks, and performing replication when a block becomes under replicated. On Tue, Jun 23, 2009 at 6:18 PM, Stuart White stuart.whi...@gmail.comwrote: In my Hadoop cluster, I've had several drives fail lately (and they've

Re: Determining input record directory using Streaming...

2009-06-23 Thread jason hadoop
I happened to have a copy of 18.1 lying about, and the JobConf is added to the per process runtime environment in 18.1. The entire configuration from the JobConf object is added to the environment, with the jobconf key names being transformed slightly. Any character in the key name that is not

Re: Strange Exeception

2009-06-22 Thread jason hadoop
The directory specified by the configuration parameter mapred.system.dir, defaulting to /tmp/hadoop/mapred/system, doesn't exist. Most likely your tmp cleaner task has removed it, and I am guessing it is only created at cluster start time. On Mon, Jun 22, 2009 at 6:19 PM, akhil1988

Re: When is configure and close run

2009-06-22 Thread jason hadoop
configure and close are run for each task, mapper and reducer. The configure and close are NOT run on the combiner class. On Mon, Jun 22, 2009 at 9:23 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote: Hello, In a mapreduce job, a given map JVM will run N map tasks. Are the configure and close

Re: Determining input record directory using Streaming...

2009-06-22 Thread jason hadoop
Check the process environment for your streaming tasks, generally the configuration variables are exported into the process environment. The Mapper input file is normally stored as some variant of mapred.input.file. The reducer's input is the mapper output for that reduce, so the input file is

Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread jason hadoop
HDFS/DFS client uses quite a few file descriptors for each open file. Many application developers (but not the hadoop core) rely on the JVM finalizer methods to close open files. This combination, especially when many HDFS files are open, can result in very large demands for file descriptors for

Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread jason hadoop
will get called, if ever. -brian -Original Message- From: ext jason hadoop [mailto:jason.had...@gmail.com] Sent: Sunday, June 21, 2009 11:19 AM To: core-user@hadoop.apache.org Subject: Re: Too many open files error, which gets resolved after some time HDFS/DFS client uses quite a few

Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread jason hadoop
and every file handle that I receive from HDFS? Regards. 2009/6/21 jason hadoop jason.had...@gmail.com Just to be clear, I second Brian's opinion. Relying on finalizers is a very good way to run out of file descriptors. On Sun, Jun 21, 2009 at 9:32 AM, brian.lev...@nokia.com wrote: IMHO

Re: Restrict output of mappers to reducers running on same node?

2009-06-19 Thread jason hadoop
the output of mapper only job so that we don't get a lot number of smaller files. Sometimes you just don't want to run reducers and unnecessarily transfer a whole lot of data across the network. Thanks, Tarandeep On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop jason.had...@gmail.com wrote

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-19 Thread jason hadoop
binaryRead. Please let me know if I am going wrong anywhere. Thanks, Akhil jason hadoop wrote: I have only ever used the distributed cache to add files, including binary files such as shared libraries. It looks like you are adding a directory. The DistributedCache

Re: JobControl for Pipes?

2009-06-18 Thread jason hadoop
:59 PM, jason hadoop jason.had...@gmail.com wrote: Job control is coming with the Hadoop WorkFlow manager, in the mean time there is cascade by chris wensel. I do not have any personal experience with either. I do not know how pipes interacts with either. On Wed, Jun 17, 2009 at 12:43

Re: Getting Task ID inside a Mapper

2009-06-18 Thread jason hadoop
The task id is readily available, if you override the configure method. The MapReduceBase class in the Pro Hadoop Book examples does this and makes the taskId available as a class field. On Thu, Jun 18, 2009 at 7:33 AM, Mark Desnoyer mdesno...@gmail.com wrote: Thanks! I'll try that. -Mark

Re: Practical limit on emitted map/reduce values

2009-06-18 Thread jason hadoop
In general if the values become very large, it becomes simpler to store them out-of-line in hdfs, and just pass the hdfs path for the item as the value in the map reduce task. This greatly reduces the amount of IO done, and doesn't blow up the sort space on the reducer. You lose the magic of data

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-17 Thread jason hadoop
to this? Thanks, Akhil jason hadoop wrote: Something is happening inside of your (Parameters. readConfigAndLoadExternalData(Config/allLayer1.config);) code, and the framework is killing the job for not heartbeating for 600 seconds On Tue, Jun 16, 2009 at 8:32 PM, akhil1988

Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon corrected code is LUCKYOU

2009-06-17 Thread jason hadoop
www.prohadoopbook.com ? 2009/6/17 zjffdu zjf...@gmail.com HI Jason, Where can I download your books' Alpha Chapters, I am very interested in your book about hadoop. And I cannot visit the link www.prohadoopbook.com -Original Message- From: jason hadoop [mailto:jason.had...@gmail.com

Re: Restrict output of mappers to reducers running on same node?

2009-06-17 Thread jason hadoop
You can open your sequence file in the mapper configure method, write to it in your map, and close it in the mapper close method. Then you end up with 1 sequence file per map. I am making an assumption that each key,value to your map somehow represents a single xml file/item. On Wed, Jun 17,
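
A sketch of that pattern; the key/value types, and naming the file after mapred.task.id inside the task work output directory, are illustrative assumptions:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class PerMapSequenceFileMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
      private SequenceFile.Writer writer;

      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          // One file per map task, placed in the task's work output directory
          // so the framework promotes it on success.
          Path out = new Path(FileOutputFormat.getWorkOutputPath(job),
              "xml-" + job.get("mapred.task.id") + ".seq");
          writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
        } catch (IOException e) {
          throw new RuntimeException("cannot open per-map sequence file", e);
        }
      }

      public void map(Text key, Text value,
          OutputCollector<Text, Text> collector, Reporter reporter) throws IOException {
        writer.append(key, value);   // write directly; nothing goes to the normal collector
      }

      public void close() throws IOException {
        writer.close();
      }
    }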

Re: JobControl for Pipes?

2009-06-17 Thread jason hadoop
Job control is coming with the Hadoop WorkFlow manager, in the meantime there is Cascading by Chris Wensel. I do not have any personal experience with either. I do not know how pipes interacts with either. On Wed, Jun 17, 2009 at 12:43 PM, Roshan James roshan.james.subscript...@gmail.com wrote:

Re: [ANN] HBase 0.20.0-alpha available for download

2009-06-17 Thread jason hadoop
Is there a requirement for hadoop 0.20 for HBase 0.20? On Wed, Jun 17, 2009 at 1:44 AM, Andrew Purtell apurt...@apache.org wrote: Minor correction/addition: Stargate is undergoing shared development in two github trees: http://github.com/macdiesel/stargate/tree/master

Re: Can I share datas for several map tasks?

2009-06-16 Thread jason hadoop
Among the examples for my book is a JVM-reuse example with static data shared between JVMs. On Tue, Jun 16, 2009 at 1:08 AM, Hello World snowlo...@gmail.com wrote: Thanks for your reply. Can you do me a favor to make a check? I modified mapred-default.xml as follows: 540 property 541

Re: Datanodes fail to start

2009-06-16 Thread jason hadoop
Pankil On Fri, May 15, 2009 at 2:25 AM, jason hadoop jason.had...@gmail.com wrote: There should be a few more lines at the end. We only want the part from last the STARTUP_MSG to the end On one of mine a successfull start looks like this: STARTUP_MSG: Starting DataNode STARTUP_MSG

Re: Debugging Map-Reduce programs

2009-06-16 Thread jason hadoop
When you are running in local mode you have 2 basic choices if you want to interact with a debugger. You can launch from within eclipse or other IDE, or you can setup a java debugger transport as part of the mapred.child.java.opts variable, and attach to the running jvm. By far the simplest is

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread jason hadoop
Is it possible that your map class is an inner class and not static? On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote: Hi All, I am running my mapred program in local mode by setting mapred.jobtracker.local to local mode so that I can debug my code. The mapred program

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread jason hadoop
, etc.) but it gets stuck there(while loading some classifier) and never reaches HI3. This program runs fine when executed normally(without mapreduce). Thanks, Akhil jason hadoop wrote: Is it possible that your map class is an inner class and not static? On Tue, Jun 16

Re: java.lang.ClassNotFoundException

2009-06-15 Thread jason hadoop
Your class is not in your jar, or your jar is not available in the hadoop class path. On Mon, Jun 15, 2009 at 2:39 AM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi all , When i try to run my own progam (jar file) i get the following error. java.lang.ClassNotFoundException :

Re: get number of values for a key

2009-06-14 Thread jason hadoop
It would be nice if there was an interface compliant way. Perhaps it becomes available in the 0.20 and beyond api's. On Sat, Jun 13, 2009 at 3:40 PM, Rares Vernica rvern...@gmail.com wrote: Hello, In Reduce, can I get the number of values for the current key without iterating over them? Does

Re: 2009 Hadoop Summit West - was wonderful

2009-06-14 Thread jason hadoop
. Thanks. Schubert On Thu, Jun 11, 2009 at 11:26 AM, jason hadoop jason.had...@gmail.com wrote: I had a great time, schmoozing with people, and enjoyed a couple of the talks. I would love to see more from Pria Narasimhan, hope their toolset for automated fault detection in hadoop clusters

Re: The behavior of HashPartitioner

2009-06-12 Thread jason hadoop
You can always write something simple to hand call the HashPartitioner. Jython works for quick tests. But the code in hash partitioner is essentially ((int) key.hashcode()) % num reduces. Since nothing else is in play, I suspect there is an incorrect assumption somewhere. On Fri, Jun 12, 2009
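
A quick way to hand-check it from Java (the key text and reduce count are made up):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class PartitionCheck {
      public static void main(String[] args) {
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
        Text key = new Text("2009-06-12");
        int numReduces = 4;
        // Internally this is (key.hashCode() & Integer.MAX_VALUE) % numReduces,
        // i.e. the hash with the sign bit masked off, modulo the reduce count.
        int partition = partitioner.getPartition(key, new Text(), numReduces);
        System.out.println(key + " -> reduce " + partition);
      }
    }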

Re: HDFS data transfer!

2009-06-12 Thread jason hadoop
Also check the IO wait time on your datanodes, if the io wait time is high, you can't win. On Fri, Jun 12, 2009 at 11:24 AM, Brian Bockelman bbock...@cse.unl.eduwrote: What's your replication factor? What aggregate I/O rates do you see in Ganglia? Is the I/O spikey, or has it plateaued? We

Re: Hadoop streaming - No room for reduce task error

2009-06-11 Thread jason hadoop
The reduce data may spill to disk during the sort, and if it is expected to be larger than the partition free space (unless the machine/jvm has a huge allowed memory space), there will be no room for the reduce task. If I did my math correctly, you are trying to push ~2TB through the single

Re: Windows installation

2009-06-11 Thread jason hadoop
My book has a small section on setting up under windows. The key piece is that you must have a cygwin installation on the machine, and include the cygwin installation's bin directory in your windows system PATH environment variable. (Control Panel|System|Advanced|Environment Variables|System

Re: Windows installation

2009-06-11 Thread jason hadoop
The hadoop scripts must be run from the cygwin bash shell also. It is MUCH simpler to just switch to linux :) On Thu, Jun 11, 2009 at 6:54 AM, jason hadoop jason.had...@gmail.com wrote: My book has a small section on setting up under windows. The key piece is that you must have a cygwin

Re: Windows installation

2009-06-11 Thread jason hadoop
more hbase than hadoop: is hbase well suited for very large applications like an auction website or a big community forum? thx 2009/6/11 Alexandre Jaquet alexjaq...@gmail.com Thanks I run yet to buy your ebook ! 2009/6/11 jason hadoop jason.had...@gmail.com My book has a small

Re: Windows installation

2009-06-11 Thread jason hadoop
the email I provided. One more question, does hbase provide a ConnectionFactory or SessionFactory that can be integrated within Spring ? Thanks 2009/6/11 jason hadoop jason.had...@gmail.com I don't know the password for that, you will need to contact apress support. On Thu, Jun 11, 2009

Re: Large size Text file split

2009-06-10 Thread jason hadoop
There is always NLineInputFormat. You specify the number of lines per split. The key is the position of the line start in the file, value is the line itself. The parameter mapred.line.input.format.linespermap controls the number of lines per split. On Wed, Jun 10, 2009 at 5:27 AM, Harish
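
A minimal sketch of wiring that up (the 10,000 lines per split is an arbitrary example):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class NLineSetup {
      public static void configure(JobConf conf) {
        conf.setInputFormat(NLineInputFormat.class);
        // Each map task receives this many lines; key = byte offset, value = the line.
        conf.setInt("mapred.line.input.format.linespermap", 10000);
      }
    }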

2009 Hadoop Summit West - was wonderful

2009-06-10 Thread jason hadoop
I had a great time, schmoozing with people, and enjoyed a couple of the talks. I would love to see more from Pria Narasimhan; hope their toolset for automated fault detection in hadoop clusters becomes generally available. Zookeeper rocks on! Hbase is starting to look really good, in 0.20 the

Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon,

2009-06-09 Thread jason hadoop
but it didn't work: ERROR: The promotional code 'LUCKYYOU' does not exist. Burt On Tuesday 09 June 2009 10:15:24 pm jason hadoop wrote: In honor of the Hadoop Summit on June 10th(tomorrow), Apress has agreed to provide some conference swag, in the form of a 50% off coupon Purchase

Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon,

2009-06-09 Thread jason hadoop
CORRECTED CODE, LUCKYOU. I misread the flyer. On Tue, Jun 9, 2009 at 8:45 PM, jason hadoop jason.had...@gmail.com wrote: I just sent a note to the publisher, hopefully they will fix it, especially since I just printed up 100 flyers to give out at the hadoop summit! On Tue, Jun 9, 2009

Re: What should I do to implements writable?

2009-06-09 Thread jason hadoop
A Writable basically needs to implement two methods: /** * Serialize the fields of this object to out. * @param out DataOutput to serialize this object into. * @throws IOException */ void write(DataOutput out) throws IOException; /** * Deserialize the
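
A minimal complete sketch of a custom Writable (field names are illustrative; the two methods must write and read the fields in the same order):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class PageHit implements Writable {
      private Text url = new Text();
      private long hits;

      // Serialize the fields of this object to out.
      public void write(DataOutput out) throws IOException {
        url.write(out);
        out.writeLong(hits);
      }

      // Deserialize the fields from in, in the same order they were written.
      public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        hits = in.readLong();
      }
    }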

Re: Map-Reduce!

2009-06-08 Thread jason hadoop
A very common one is processing large quantities of log files and producing summary data. Another use is simply as a way of distributing large jobs across multiple computers. In a previous job, we used Map/Reduce for distributed bulk web crawling, and for distributed media file processing. On

Re: Is there any way to debug the hadoop job in eclipse

2009-06-06 Thread jason hadoop
The chapters are available for download now. On Sat, Jun 6, 2009 at 3:33 AM, zhang jianfeng zjf...@gmail.com wrote: Is there any resource on internet that I can get as soon as possible ? On Fri, Jun 5, 2009 at 6:43 PM, jason hadoop jason.had...@gmail.com wrote: chapter 7 of my book goes

Re: Is there any way to debug the hadoop job in eclipse

2009-06-05 Thread jason hadoop
chapter 7 of my book goes into the details of how to debug with eclipse On Fri, Jun 5, 2009 at 3:40 AM, zhang jianfeng zjf...@gmail.com wrote: Hi all, Some jobs I submit to hadoop failed, but I can not see what's the problem. So is there any way to debug the hadoop job in eclipse, such as the

Re: Task files in _temporary not getting promoted out

2009-06-04 Thread jason hadoop
Are your tasks failing or completing successfully? Failed tasks have the output directory wiped; only successfully completed tasks have the files moved up. I don't recall if the FileOutputCommitter class appeared in 0.18. On Wed, Jun 3, 2009 at 6:43 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

Re: problem getting map input filename

2009-06-02 Thread jason hadoop
you can always dump the entire property space and work it out that way. I haven't used the 0.20 api's yet so I can't speak to them On Tue, Jun 2, 2009 at 10:52 AM, Rares Vernica rvern...@gmail.com wrote: On 6/2/09, randy...@comcast.net randy...@comcast.net wrote: Your Map class needs to

Re: Reduce() time takes ~4x Map()

2009-05-28 Thread jason hadoop
At the minimal level, enable map output compression (mapred.compress.map.output); it may make some difference. Sorting is very expensive when there are many keys and the values are large. Are you quite certain your keys are unique? Also, do you need them sorted by document id? On Thu, May 28,

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread jason hadoop
Use the mapside join stuff; if I understand your problem, it provides a good solution but requires getting over the learning hurdle. Well described in chapter 8 of my book :) On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel ch...@wensel.net wrote: I believe PIG, and I know Cascading use a kind

Re: avoid custom crawler getting blocked

2009-05-27 Thread jason hadoop
Random ordering helps, and per-thread delays based on domain recency also help. On Wed, May 27, 2009 at 6:47 AM, Ken Krugler kkrugler_li...@transpac.com wrote: My current project is to gather stats from a lot of different documents. We're are not indexing just getting quite specific stats for

Re: Specifying NameNode externally to hadoop-site.xml

2009-05-25 Thread jason hadoop
if you launch your jobs via bin/hadoop jar jar_file [main class] [options] you can simply specify -fs hdfs://host:port before the jar_file On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. I'm looking to move the Hadoop NameNode URL outside the hadoop-site.xml

Re: Specifying NameNode externally to hadoop-site.xml

2009-05-25 Thread jason hadoop
jason hadoop jason.had...@gmail.com if you launch your jobs via bin/hadoop jar jar_file [main class] [options] you can simply specify -fs hdfs://host:port before the jar_file On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. I'm looking to move

Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread jason hadoop
Can you give your machines multiple IP addresses, and bind the grid server to a different IP than the datanode? With Solaris you could put it in a different zone. On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman bbock...@math.unl.edu wrote: Hey all, Had a problem I wanted to ask advice on.

Re: Could only be replicated to 0 nodes, instead of 1

2009-05-21 Thread jason hadoop
It does not appear that any datanodes have connected to your namenode. on the datanode machines look in the hadoop logs directory at the datanode log files. There should be some information there that helps you diagnose the problem. chapter 4 of my book provides some detail on work with this

Re: Multipleoutput file

2009-05-21 Thread jason hadoop
setInputPaths will take an array, or variable arguments. Or you can simply provide the directory that the individual files reside in, and the individual files will be added. If there are other files in the directory, you may need to specify a custom input path filter via
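
A sketch of both forms with the old-API FileInputFormat (the paths are hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class InputPathSetup {
      public static void addInputs(JobConf conf) {
        // Variable-argument form: each Path may be a single file or a directory of files.
        FileInputFormat.setInputPaths(conf,
            new Path("/logs/2009/06"), new Path("/logs/2009/07"));
        // Incremental form: append one more path to whatever is already set.
        FileInputFormat.addInputPath(conf, new Path("/logs/extra"));
      }
    }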

Re: Randomize input file?

2009-05-21 Thread jason hadoop
The last time I had to do something like this, in the map phase, I made the key a random value, md5 of the key, and built a new value that had the real key embedded. Then in the reduce phase I received the records in random order and could do what I wanted. By using a stable but differently
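
A minimal sketch of such a mapper; using MD5Hash of the whole record as the shuffle key is an illustrative choice, not taken from the thread:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MD5Hash;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class RandomizeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text line,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        // MD5 of the record gives a stable but effectively random sort key,
        // so reducers see the records in shuffled order.
        Text shuffleKey = new Text(MD5Hash.digest(line.toString()).toString());
        out.collect(shuffleKey, line); // the original record travels in the value
      }
    }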

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-19 Thread jason hadoop
I always disable atime and its ilk. The deadline scheduler helps with the (non xfs hanging) du datanode timeout issues, but not much. Ultimately that is a caching failure in the kernel, due to the hadoop io patterns. Anshu, any luck getting off the PAE kernels? Is this the xfs lockup, or just

Re: FSDataOutputStream flush() not working?

2009-05-17 Thread jason hadoop
When you open a file you have the option of specifying the blockSize: /** Opens an FSDataOutputStream at the indicated Path with write-progress reporting. @param f the file name to open @param permission @param overwrite if a file with this name already exists, then if true, the
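
A sketch of the per-file block size override through that create overload (the path, replication factor, and 256 MB size are made-up values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class CreateWithBlockSize {
      public static FSDataOutputStream open(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        long blockSize = 256L * 1024 * 1024;           // per-file override of dfs.block.size
        return fs.create(new Path("/user/jason/bigfile.dat"),
            FsPermission.getDefault(),
            true,                                      // overwrite
            conf.getInt("io.file.buffer.size", 4096),  // buffer size
            (short) 3,                                 // replication
            blockSize,
            null);                                     // Progressable (none)
      }
    }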

Re: Datanodes fail to start

2009-05-15 Thread jason hadoop
= Slave1/127.0.1.1 On Thu, May 14, 2009 at 11:43 PM, jason hadoop jason.had...@gmail.com wrote: The data node logs are on the datanode machines in the log directory. You may wish to buy my book and read chapter 4 on hdfs management. On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan

Re: hadoop streaming binary input / image processing

2009-05-15 Thread jason hadoop
not impact any locality properties. Piotr 2009/5/15 jason hadoop jason.had...@gmail.com A downside of this approach is that you will not likely have data locality for the data on shared file systems, compared with data coming from an input split. That being said, from your

Re: Setting up another machine as secondary node

2009-05-15 Thread jason hadoop
master in the master file we have master and secondary node, *both *processes getting started on the two servers listed. Cant we have master and secondary node started seperately on two machines?? On Fri, May 15, 2009 at 9:39 AM, jason hadoop jason.had...@gmail.com wrote: I agree with billy

Re: Is Mapper's map method thread safe?

2009-05-14 Thread jason hadoop
Ultimately it depends on how you write the Mapper.map method. The framework supports a MultithreadedMapRunner which lets you set the number of threads running your map method simultaneously. Chapter 5 of my book covers this. On Wed, May 13, 2009 at 11:10 PM, Shengkai Zhu geniusj...@gmail.com
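
A minimal sketch of enabling it on a job (the thread count of 20 is an arbitrary example; the map method must be thread safe):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class ThreadedMapSetup {
      public static void enable(JobConf conf) {
        // Run the map method in multiple threads inside each map task.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        conf.setInt("mapred.map.multithreadedrunner.threads", 20);
      }
    }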

Re: Selective output based on keys

2009-05-14 Thread jason hadoop
The customary practice is to have your Reducer.reduce method handle the filtering if you are reducing your output. or the Mapper.map method if you are not. On Wed, May 13, 2009 at 1:57 PM, Asim linka...@gmail.com wrote: Hi, I wish to output only selective records to the output files based on

Re: Setting up another machine as secondary node

2009-05-14 Thread jason hadoop
any machine put in the conf/masters file becomes a secondary namenode. At some point there was confusion on the safety of more than one machine, which I believe was settled: more than one is safe. The secondary namenode takes a snapshot at 5 minute (configurable) intervals, rebuilds the fsimage and

Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
Sort order is preserved if your Mapper doesn't change the key ordering in output. Partition name is not preserved. What I have done is to manually work out what the partition number of the output file should be for each map task, by calling the partitioner on an input key, and then renaming the

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can decommission the datanode, and then un-decommission it. On Thu, May 14, 2009 at 7:44 AM, Alexandra Alecu alexandra.al...@gmail.comwrote: Hi, I want to test how Hadoop and HBase are performing. I have a cluster with 1 namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2.

Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
In the mapside join, the input file name is not visible, as the input is actually a composite of a large number of files. I have started answering in www.prohadoopbook.com On Thu, May 14, 2009 at 1:19 PM, Stuart White stuart.whi...@gmail.com wrote: On Thu, May 14, 2009 at 10:25 AM, jason hadoop

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can have separate configuration files for the different datanodes. If you are willing to deal with the complexity you can manually start them with altered properties from the command line. rsync or other means of sharing identical configs is simple and common. Raghu, your technique will

Re: hadoop streaming binary input / image processing

2009-05-14 Thread jason hadoop
A downside of this approach is that you will not likely have data locality for the data on shared file systems, compared with data coming from an input split. That being said, from your script, *hadoop dfs -get FILE -* will write the file to standard out. On Thu, May 14, 2009 at 10:01 AM, Piotr

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
You have to examine the datanode log files. The namenode does not start the datanodes; the start script does. The namenode passively waits for the datanodes to connect to it. On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote: Hello Everyone, Actually I had a cluster

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
-hadoopmaster.out hadoop-hadoop-secondarynamenode-hadoopmaster.out.1 history Thanks Pankil On Thu, May 14, 2009 at 11:27 PM, jason hadoop jason.had...@gmail.com wrote: You have to examine the datanode log files the namenode does not start the datanodes, the start script does. The name node

Re: Setting up another machine as secondary node

2009-05-14 Thread jason hadoop
I agree with billy. conf/masters is misleading as the place for secondary namenodes. On Thu, May 14, 2009 at 8:38 PM, Billy Pearson sa...@pearsonwholesale.com wrote: I think the secondary namenode being set in the masters file in the conf folder is misleading Billy Rakhi Khatwani

Re: hadoop streaming reducer values

2009-05-13 Thread jason hadoop
You may wish to set the separator to the string comma space ', ' for your example. chapter 7 of my book goes into this in some detail, and I posted a graphic that visually depicts the process and the values about a month ago. The original post was titled 'Changing key/value separator in hadoop

Re: How can I get the actual time for one write operation in HDFS?

2009-05-13 Thread jason hadoop
Close the file after you write one block, the close is synchronous. On Tue, May 12, 2009 at 11:50 PM, Xie, Tao xietao1...@gmail.com wrote: DFSOutputStream.writeChunk() enqueues packets into data queue and after that it returns. So write is asynchronous. I want to know the total actual time

Re: hadoop streaming reducer values

2009-05-13 Thread jason hadoop
Thanks chuck, I didn't read the post and focused on the commas On Wed, May 13, 2009 at 2:38 PM, Chuck Lam chuck@gmail.com wrote: The behavior you saw in Streaming (list of key,value instead of key, list of values) is indeed intentional, and it's part of the design differences between

Re: sub 60 second performance

2009-05-11 Thread jason hadoop
tried/tested solution? thanks again On Mon, May 11, 2009 at 12:41 AM, jason hadoop jason.had...@gmail.com wrote: You can cache the block in your task, in a pinned static variable, when you are reusing the jvms. On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer mattbowy...@googlemail.com wrote

Re: Re-Addressing a cluster

2009-05-11 Thread jason hadoop
Now that I think about it, the reverse lookups in my clusters work. On Mon, May 11, 2009 at 3:07 AM, Steve Loughran ste...@apache.org wrote: jason hadoop wrote: You should be able to relocate the cluster's IP space by stopping the cluster, modifying the configuration files, resetting the dns

Re: sub 60 second performance

2009-05-10 Thread jason hadoop
You can cache the block in your task, in a pinned static variable, when you are reusing the jvms. On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer mattbowy...@googlemail.comwrote: Hi, I am trying to do 'on demand map reduce' - something which will return in reasonable time (a few seconds). My

Re: large files vs many files

2009-05-09 Thread jason hadoop
AM, Sasha Dolgy sdo...@gmail.com wrote: yes, that is the problem. two or hundreds...data streams in very quickly. On Fri, May 8, 2009 at 8:42 PM, jason hadoop jason.had...@gmail.com wrote: Is it possible that two tasks are trying to write to the same file path? On Fri, May 8, 2009

Re: ClassNotFoundException

2009-05-09 Thread jason hadoop
for the reply, but do I need to include every supporting jar file to the application path? What is the -rel-? George jason hadoop wrote: 1) when running under windows, include the cygwin bin directory in your windows path environment variable 2) eclipse is not so good at submitting

Re: Error when start hadoop cluster.

2009-05-09 Thread jason hadoop
looks like you have different versions of the jars, or perhaps someone has run ant in one of your installation directories. On Fri, May 8, 2009 at 7:54 PM, nguyenhuynh.mr nguyenhuynh...@gmail.com wrote: Hi all! I cannot start hdfs successful. I checked log file and found following message:

Re: Most efficient way to support shared content among all mappers

2009-05-09 Thread jason hadoop
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdf . Regards, Jeff On Fri, May 8, 2009 at 2:49 PM, jason hadoop jason.had...@gmail.com wrote: Most of the people with this need are using some variant of memcached

Re: Re-Addressing a cluster

2009-05-09 Thread jason hadoop
You should be able to relocate the cluster's IP space by stopping the cluster, modifying the configuration files, resetting the dns and starting the cluster. Be best to verify connectivity with the new IP addresses before starting the cluster. to the best of my knowledge the namenode doesn't care

Re: large files vs many files

2009-05-08 Thread jason hadoop
Is it possible that two tasks are trying to write to the same file path? On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote: Hi Tom (or anyone else), Will SequenceFile allow me to avoid problems with concurrent writes to the file? I still continue to get the following

Re: Setting thread stack size for child JVM

2009-05-08 Thread jason hadoop
You can set the mapred.child.java.opts value on a per job basis, either via -D mapred.child.java.opts=java options or via conf.set(mapred.child.java.opts, java options). Note: the conf.set must be done before the job is submitted. On Fri, May 8, 2009 at 11:57 AM, Philip Zeyliger phi...@cloudera.com wrote:
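
A minimal sketch of the conf.set form (the -Xmx/-Xss values are illustrative; the -D form additionally requires the driver to use ToolRunner/GenericOptionsParser):

    import org.apache.hadoop.mapred.JobConf;

    public class ChildOptsSetup {
      public static void setChildOpts(JobConf conf) {
        // Applies only to this job's task JVMs; must be set before the job is submitted.
        conf.set("mapred.child.java.opts", "-Xmx512m -Xss256k");
      }
    }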

Re: Most efficient way to support shared content among all mappers

2009-05-08 Thread jason hadoop
Most of the people with this need are using some variant of memcached, or other distributed hash table. On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote: Hi, As a newcomer to Hadoop, I wonder any efficient way to support shared content among all mappers. For example, to implement
