Re: Backup Tasks in Hadoop MapReduce.

2008-04-16 Thread Ted Dunning
I believe that this is turned on by default (at least in 0.15). On 4/16/08 10:57 AM, Milind Bhandarkar [EMAIL PROTECTED] wrote: Yes. In Hadoop, you can enable backup tasks by setting mapred.speculative.execution to true. - milind On 4/16/08 8:07 AM, Chaman Singh Verma [EMAIL PROTECTED]
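A minimal sketch of setting it per-job with the old JobConf API (class name hypothetical; this assumes a 0.15-era release where a single mapred.speculative.execution property controls both map and reduce phases, before it was split into separate properties):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SpeculativeJob.class);
        // Backup (speculative) tasks: run duplicate attempts of straggler
        // tasks and keep whichever attempt finishes first.
        conf.setBoolean("mapred.speculative.execution", true);
        // ... set mapper/reducer/paths as usual, then JobClient.runJob(conf);
      }
    }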

Re: aborting reducer

2008-04-16 Thread Ted Dunning
Would it be better to have lots of records arrive at the same reducer? That has a simpler mechanism for ignoring data. You can just add a (trivial) partition function in addition to your sort. On 4/16/08 12:07 PM, Karl Wettin [EMAIL PROTECTED] wrote: I have a job that out of a list with
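A sketch of such a trivial partition function with the old API (the class name and the tab-separated key layout are assumptions): routing on a prefix of the key sends related records to the same reducer, where the unwanted ones can be ignored in one place.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class PrefixPartitioner implements Partitioner<Text, Text> {
      public void configure(JobConf job) {}

      // Route all keys sharing a prefix to the same reducer, so related
      // records arrive together and can be filtered in one place.
      public int getPartition(Text key, Text value, int numPartitions) {
        String prefix = key.toString().split("\t")[0];
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }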

Re: Map reduce classes

2008-04-16 Thread Ted Dunning
That design is fine. You should read your map in the configure method of the reducer. There is a MapFile format supported by Hadoop, but it tends to be pretty slow. I usually find it better to just load my hash table by hand. If you do this, you should use whatever format you like. On
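A sketch of that load-by-hand pattern (the side-file path and tab-separated layout are assumptions): the reducer fills a HashMap once in configure() and consults it in every reduce() call.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class LookupReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      private final Map<String, String> lookup = new HashMap<String, String>();

      public void configure(JobConf job) {
        try {
          // Load the side file by hand, in whatever format you like.
          FileSystem fs = FileSystem.get(job);
          BufferedReader in = new BufferedReader(new InputStreamReader(
              fs.open(new Path("/data/lookup.tsv"))));  // hypothetical path
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            lookup.put(parts[0], parts[1]);
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Join each incoming value against the in-memory table.
        String extra = lookup.get(key.toString());
        while (values.hasNext()) {
          output.collect(key, new Text(values.next() + "\t" + extra));
        }
      }
    }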

Re: Map reduce classes

2008-04-16 Thread Ted Dunning
it is called before the reduce job. I need to eliminate rows from the HashMap when all the keys are read. Also, my concern is: if the dataset is large, will this HashMap approach work? On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning [EMAIL PROTECTED] wrote: That design is fine. You should read your

Re: Large Weblink Graph

2008-04-15 Thread Ted Dunning
Please include the Mahout sub-project when you report what you find. This kind of dataset would be very helpful for that project as well, and you might find something helpful there in return. The goal is to support machine learning on Hadoop. On 4/15/08 8:29 AM, Chaman Singh Verma [EMAIL

Re: Reduce Output

2008-04-15 Thread Ted Dunning
script which will sequentially parse each line and Iterate. Thanks, Senthil -Original Message- From: Ted Dunning [mailto:[EMAIL PROTECTED] Sent: Monday, April 14, 2008 2:20 PM To: core-user@hadoop.apache.org Subject: Re: Reduce Output Try using Text, Text as the output type

Re: Page Ranking, Hadoop And MPI.

2008-04-15 Thread Ted Dunning
Power law algorithms are ideal for this kind of parallelized problem. The basic idea is that hub and authority style algorithms are intimately related to eigenvector or singular value decompositions (depending on whether the links are symmetrical). This also means that there is a close
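To make the eigenvector connection concrete (standard definitions, not from the thread): with a link matrix $A$ where $A_{ij}=1$ if page $i$ links to page $j$, the HITS authority and hub scores satisfy

    \begin{aligned}
    a \propto A^{\top} h,\quad h \propto A a
    \;\Rightarrow\;
    a \propto (A^{\top}A)\,a,\quad h \propto (AA^{\top})\,h,
    \end{aligned}

so the rankings are leading eigenvectors of $A^{\top}A$ and $AA^{\top}$, i.e. top singular vectors of $A$; if the links are symmetric the two coincide with an eigenvector of $A$ itself. Power iteration on these products is just repeated sparse matrix-vector multiplication, which parallelizes naturally, roughly one map-reduce pass per iteration.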

Re: Page Ranking, Hadoop And MPI.

2008-04-15 Thread Ted Dunning
On 4/15/08 11:59 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote: How Google handle such a large matrix and solve it ? Do they use MapReduce framework for these process or adopt standard and reliable Message Passing Interface/RPC etc for this task ? They use map-reduce. What about the

Re: multiple datanodes in the same machine

2008-04-15 Thread Ted Dunning
Why do you want to do this perverse thing? How does it help to have more than one datanode per machine? And what in the world is better when you have 10? On 4/15/08 12:53 PM, Cagdas Gerede [EMAIL PROTECTED] wrote: I have a follow-up question, Is there a way to programmatically configure

Re: multiple datanodes in the same machine

2008-04-15 Thread Ted Dunning
working on Distributed File System part. I do not use MR part, and I need to run multiple processes to test some scenarios on the file system. On Tue, Apr 15, 2008 at 1:37 PM, Ted Dunning [EMAIL PROTECTED] wrote: I have had no issues in scaling the number of datanodes. The location

Re: Reduce Output

2008-04-14 Thread Ted Dunning
Write an additional map-reduce step to join the data items together by treating different input files differently. OR Write an additional map-reduce step that reads in your string values in the map configuration method and keeps them in memory for looking up as you pass over the output of your

Re: Reduce Output

2008-04-14 Thread Ted Dunning
So do you know any class or method that I can use to have the values separated by space or any other separator. Thanks, Senthil -Original Message- From: Ted Dunning [mailto:[EMAIL PROTECTED] Sent: Monday, April 14, 2008 12:47 PM To: core-user@hadoop.apache.org Subject: Re

Re: Reduce Output

2008-04-14 Thread Ted Dunning
, new IntWritable(sum)); } } -Original Message- From: Ted Dunning [mailto:[EMAIL PROTECTED] Sent: Monday, April 14, 2008 1:49 PM To: core-user@hadoop.apache.org Subject: Re: Reduce Output The format of the reduce output is the responsibility of the reducer. You

Re: Hadoop performance on EC2?

2008-04-10 Thread Ted Dunning
Are you trying to read from MySQL? If so, it isn't very surprising that you could get lower performance with more readers. On 4/9/08 7:07 PM, Nate Carlson [EMAIL PROTECTED] wrote: Hey all, We've got a job that we're running in both a development environment, and out on EC2. I've been

Re: hdfs 100T?

2008-04-10 Thread Ted Dunning
Hadoop also does much better with spindles spread across many machines. Putting 16 TB on each of two nodes is distinctly sub-optimal on many fronts. Much better to put 0.5-2TB on 16-64 machines. With 2x1TB SATA drives, your cost and performance are both likely to be better than two machines with

Re: RAID-0 vs. JBOD?

2008-04-10 Thread Ted Dunning
I haven't done a detailed comparison, but I have seen some effects: A) raid doesn't usually work really well on low-end machines compared to independent drives. This would make me distrust raid. B) hadoop doesn't do very well, historically speaking, with more than one partition if the

Re: Reduce Sort

2008-04-08 Thread Ted Dunning
On 4/8/08 10:43 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote: I would like to try using Hadoop. That is good for education, probably bad for run time. It could take SECONDS longer to run (oh my). Do you mean to write another MapReduce program which takes the output of the first

Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Ted Dunning
Looks like it is up to me. On 4/8/08 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote: The wiki has been down for more than a day, any ETA? I was going to search the archives for the status, but I'm getting 403's for each of the Archive links on the mailing list page:

Re: on number of input files and split size

2008-04-06 Thread Ted Dunning
input files into a single file, so that at the end of the copy process, we will have as many files as there are machines in the cluster. Any thoughts on how I should proceed with this? Or if this is a good idea at all? Ted Dunning [EMAIL PROTECTED] wrote: The split will depend

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ted Dunning
Are you implementing this for instruction or production? If production, why not use Lucene? On 4/3/08 6:45 PM, Aayush Garg [EMAIL PROTECTED] wrote: Hi Amar, Theodore, Arun, Thanks for your reply. Actually I am new to hadoop so can't figure out much. I have written following code for

Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning
Take a look at the way that the text input format moves to the next line after a split point. There are a couple of possible causes of your 'input format not found' problem. First, is your input in a package? If so, you need to provide a complete name for the class. Secondly, you have to

Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning
On 4/4/08 10:18 AM, Francesco Tamberi [EMAIL PROTECTED] wrote: Thanks for your fast reply! Ted Dunning wrote: Take a look at the way that the text input format moves to the next line after a split point. I'm not sure I understand... is my way correct or are you suggesting

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ted Dunning
, Ted Dunning [EMAIL PROTECTED] wrote: Are you implementing this for instruction or production? If production, why not use Lucene? On 4/3/08 6:45 PM, Aayush Garg [EMAIL PROTECTED] wrote: Hi Amar, Theodore, Arun, Thanks for your reply. Actually I am new to hadoop so can't figure out

Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-03 Thread Ted Dunning
You can overwrite it, but you can't update it. Soon you will be able to append to it, but you won't be able to do any other updates. On 4/2/08 11:39 PM, Garri Santos [EMAIL PROTECTED] wrote: Hi! I'm starting to take a look at hadoop and the whole HDFS idea. I'm wondering if it's just fine

Re: If I wanna read a config file before map task, which class I should choose?

2008-04-03 Thread Ted Dunning
That depends on where the file is. If you are reading a file on a normal file system, you use normal Java functions. If you are reading a file from HDFS, you use hadoop functions. On 4/3/08 1:22 AM, Jeremy Chow [EMAIL PROTECTED] wrote: Hi list, If I define a method named configure in a

Re: Quick jar deployment question...

2008-04-03 Thread Ted Dunning
The easiest way is to package all of your code (classes and jars) into a single jar file which you then execute. When you instantiate a JobClient and run a job, your jar gets copied to all necessary nodes. The machine you use to launch the job need not even be in the cluster, just able to see
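A minimal sketch of that launch pattern (class and argument handling are hypothetical): constructing the JobConf from a class inside your jar is what tells Hadoop which jar to ship to the nodes.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LaunchJob {
      public static void main(String[] args) throws Exception {
        // The jar containing LaunchJob (and everything packaged into it)
        // is copied to the task nodes automatically at submission time.
        JobConf conf = new JobConf(LaunchJob.class);
        conf.setJobName("example");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // 0.15/0.16-era path setters; later releases moved these onto
        // the FileInputFormat/FileOutputFormat helpers.
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
      }
    }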

Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-03 Thread Ted Dunning
Interesting you should say this. I have been using this exact example (slightly modified) as an interview question lately. I have to admit I stole it from Doug's Hadoop slides. If you have a 1TB database with 100-byte records and you want to update 1% of them, how long will it take? Assume for
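A back-of-envelope version of the exercise, under assumed 2008-era figures of roughly 10 ms per random seek and 100 MB/s sequential transfer (these numbers are mine, not from the thread):

    \begin{aligned}
    \text{records} &= 10^{12}\,\mathrm{B} / 100\,\mathrm{B} = 10^{10},
    \qquad \text{updates} = 0.01 \times 10^{10} = 10^{8},\\
    \text{seek per update} &: 10^{8} \times 10\,\mathrm{ms}
      = 10^{6}\,\mathrm{s} \approx 12\ \text{days},\\
    \text{full sequential rewrite} &: 10^{12}\,\mathrm{B} / 10^{8}\,\mathrm{B/s}
      = 10^{4}\,\mathrm{s} \approx 3\ \text{hours per pass.}
    \end{aligned}

That is the point of the interview question: streaming over the whole dataset beats seeking by orders of magnitude, so the batch rewrite wins.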

Re: Performance impact of underlying file system?

2008-04-01 Thread Ted Dunning
I would expect that most file systems can saturate the disk bandwidth for the large sequential reads that hadoop does. We use ext3 with good results. On 4/1/08 8:08 AM, Colin Freas [EMAIL PROTECTED] wrote: Is the performance of Hadoop impacted by the underlying file system on the nodes at

Re: Hadoop input path - can it have subdirectories

2008-04-01 Thread Ted Dunning
But wildcards that match directories that contain files work well. On 4/1/08 10:41 AM, Peeyush Bishnoi [EMAIL PROTECTED] wrote: Hello, No, Hadoop can't traverse recursively inside subdirectories with a Java Map-Reduce program. It has to be just a directory containing files (and no

Re: distcp fails :Input source not found

2008-04-01 Thread Ted Dunning
Are you missing a colon on the first command? Probably just a transcription error when you composed your email (but I have made similar mistakes often enough and been unable to see them). On 4/1/08 1:18 PM, Prasan Ary [EMAIL PROTECTED] wrote: Just to make sure that I am specifying the

Re: one key per output part file

2008-04-01 Thread Ted Dunning
Try opening the desired output file in the reduce method. Make sure that the output files are relative to the correct task specific directory (look for side-effect files on the wiki). On 4/1/08 5:57 PM, Ashish Venugopal [EMAIL PROTECTED] wrote: Hi, I am using Hadoop streaming and I am

Re: one key per output part file

2008-04-01 Thread Ted Dunning
that indicates that you can... Ashish On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning [EMAIL PROTECTED] wrote: Try opening the desired output file in the reduce method. Make sure that the output files are relative to the correct task specific directory (look for side-effect files on the wiki

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
Hadoop can't split a gzipped file so you will only get as many maps as you have files. Why the obsession with hadoop streaming? It is at best a jury-rigged solution. On 3/31/08 3:12 PM, lin [EMAIL PROTECTED] wrote: Does Hadoop automatically decompress the gzipped file? I only have a single

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
This seems a bit surprising. In my experience well-written Java is generally just about as fast as C++, especially for I/O bound work. The exceptions are: - java startup is still slow. This shouldn't matter much here because you are using streaming anyway so you have java startup + C

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
++ and it is easier to migrate to hadoop streaming. Also we have very strict performance requirements. Java seems to be too slow. I rewrote the first program in Java and it runs 4 to 5 times slower than the C++ one. On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning [EMAIL PROTECTED] wrote

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
My experiences with Groovy are similar. Noticeable slowdown, but quite bearable (almost always better than 50% of best attainable speed). The highest virtue is that simple programs become simple again. Word count is 5 lines of code. On 3/31/08 6:10 PM, Colin Evans [EMAIL PROTECTED]

Re: Append data in hdfs_write

2008-03-27 Thread Ted Dunning
Yes. The present work-arounds for this are pretty complicated. Option 1) you can write small files relatively frequently and every time you write some number of them, you can concatenate them and delete them. These concatenations can receive the same treatment. If managed carefully in

Re: Using HDFS as native storage

2008-03-27 Thread Ted Dunning
We evaluated several options for just this problem and eventually settled on MogileFS. That said, Mogile needed several weeks of work to get it ready for prime time. It will work pretty well for modest sized collections, but for our stuff (many hundreds of millions of files, approaching PB of

Re: Using HDFS as native storage

2008-03-27 Thread Ted Dunning
PROTECTED] wrote: might be off-topic but how would you compare GlusterFS to HDFS and MogileFS for such an application? Did you look at that at all and decided against it? Ted Dunning wrote: We evaluated several options for just this problem and eventually settled on MogileFS. That said, Mogile

Re: HDFS: What happens when a harddrive fails

2008-03-26 Thread Ted Dunning
It depends on the failure. For some failure modes, the disk just becomes very slow. On 3/26/08 4:39 PM, Cagdas Gerede [EMAIL PROTECTED] wrote: I was wondering 1) what happens if a data node is alive but its harddrive fails? Does it throw an exception and dies? 2) If It continues to run

Re: [core] problems while coping files from local file system to dfs

2008-03-24 Thread Ted Dunning
Copy from a machine that is *not* running as a data node in order to get better balancing. Using distcp may also help because the nodes actually doing the copying will be spread across the cluster. You should probably be running a rebalancing script as well if your nodes have differing sizes.

Re: MapReduce with related data from disparate files

2008-03-24 Thread Ted Dunning
Map-reduce excels at gluing together files like this. The map phase selects the key and makes sure that you have some way of telling what the source of the record is. The reduce phase takes all of the records with the same key and glues them together. It can do your processing, but it is also
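A sketch of the gluing half (the "A"/"B" tag convention and tab-separated layout are assumptions): each mapper prefixes its output value with a marker naming the source file, and the reducer separates the two sets before joining them.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        List<String> left = new ArrayList<String>();
        List<String> right = new ArrayList<String>();
        while (values.hasNext()) {
          // Mappers prefixed each value with "A\t" or "B\t" to record
          // which input file the record came from.
          String[] tagged = values.next().toString().split("\t", 2);
          if ("A".equals(tagged[0])) left.add(tagged[1]);
          else right.add(tagged[1]);
        }
        // Glue together every pairing of the two sides sharing this key.
        for (String l : left)
          for (String r : right)
            output.collect(key, new Text(l + "\t" + r));
      }
    }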

Re: [core] problems while coping files from local file system to dfs

2008-03-24 Thread Ted Dunning
I hate to point this out, but losing *any* data node will decrease the replication of some blocks. On 3/24/08 4:53 PM, lohit [EMAIL PROTECTED] wrote: Improves performance on the basis that files are copied locally in that node, so there is no need network transmission. But isn't that policy

Re: Fastest way to do grep via hadoop streaming

2008-03-19 Thread Ted Dunning
Also, streaming is not likely to be the fastest way to solve your problem because it introduces quite a bit more copying and, even worse, context switches into the process (java moves the data, passes it to the mapper, reads the results). I have seen a comment that there were flushes being done

Re: Partitioning reduce output by date

2008-03-18 Thread Ted Dunning
I think that a custom partitioner is half of the answer. The other half is that the reducer can open and close output files as needed. With the partitioner, only one file need be kept open at a time. It is good practice to open the files relative to the task directory so that process failure

Re: Limiting Total # of TaskTracker threads

2008-03-18 Thread Ted Dunning
I think the original request was to limit the sum of maps and reduces rather than limiting the two parameters independently. Clearly, with a single job running at a time, this is a non-issue since reducers don't do much until the maps are done. With multiple jobs it is a bit more of an issue.

Re: Partitioning reduce output by date

2008-03-18 Thread Ted Dunning
Also see my comment about side effect files. Basically, if you partition on date, then each set of values in the reduce will have the same date. Thus the reducer can open a file, write the values, close the file (repeat). This gives precisely the effect you were seeking. On 3/18/08 6:17 PM,
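A sketch of that open-write-close pattern, hedged: the task-local work directory property name varied across releases, and mapred.work.output.dir is the one I recall; check the side-effect files wiki page for your version.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DateFileReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      private JobConf conf;

      public void configure(JobConf job) { this.conf = job; }

      // Each reduce call sees one date's values, so: open a file named
      // for the date, write the values, close it, repeat.
      public void reduce(Text date, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Task-specific directory, so failed/duplicate tasks clean up safely.
        Path dir = new Path(conf.get("mapred.work.output.dir"));
        FileSystem fs = dir.getFileSystem(conf);
        FSDataOutputStream out = fs.create(new Path(dir, date.toString()));
        while (values.hasNext()) {
          out.writeBytes(values.next().toString() + "\n");
        }
        out.close();
      }
    }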

Re: [core-user] Move application to Map/Reduce architecture with Hadoop

2008-03-17 Thread Ted Dunning
Replication is vital in large or even medium-sized clusters for reliability. Replication also helps distribution. On 3/17/08 2:48 AM, Alfonso Olias Sanz [EMAIL PROTECTED] wrote: But what I wanted to say was that we need to set up a cluster in a way that the data is distributed among all the

Re: [core-user] Processing binary files Howto??

2008-03-17 Thread Ted Dunning
This sounds very different from your earlier questions. If you have a moderate (10's to 1000's) number of binary files, then it is very easy to write a special purpose InputFormat that tells hadoop that the file is not splittable. This allows you to add all of the files as inputs to the map
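A minimal sketch of such an InputFormat, assuming line-oriented reading is otherwise acceptable and only the splitting behavior needs to change:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
      // Returning false tells the framework never to split this file,
      // so each whole file is handled by exactly one map task.
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }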

Re: Separate data-nodes from worker-nodes

2008-03-13 Thread Ted Dunning
it is not good one to separate them out. Just was wondering is it possible at all. Thanks! Ted Dunning wrote: It is quite possible to do this. It is also a bad idea. One of the great things about map-reduce architectures is that data is near the computation so that you don't have

Re: performance

2008-03-12 Thread Ted Dunning
Identity reduce is nice because the result values can be sorted. On 3/12/08 8:21 AM, Jason Rennie [EMAIL PROTECTED] wrote: Map could perform all the dot-products, which is the heavy lifting in what we're trying to do. Might want to do a reduce after that, not sure...

Re: reading input file only once for multiple map functions

2008-03-12 Thread Ted Dunning
Ahhh... There is an old saying for this. I think you are pulling fly specks out of pepper. Unless your input format is very, very strange, doing the split again for two jobs does, indeed, lead to some small inefficiency, but this cost should be so low compared to other inefficiencies that you

Re: scaling experiments on a static cluster?

2008-03-12 Thread Ted Dunning
factor using hadoop dfs? Chris On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning [EMAIL PROTECTED] wrote: What about just taking down half of the nodes and then loading your data into the remainder? Should take about 20 minutes each time you remove nodes but only a few seconds each time you

Re: performance

2008-03-11 Thread Ted Dunning
Yes. Each task is launching a JVM. Map reduce is not generally useful for real-time applications. It is VERY useful for large scale data reductions done in advance of real-time operations. The basic issue is that the major performance contribution of map-reduce architectures is large scale

Re: performance

2008-03-11 Thread Ted Dunning
Would you be interested in the grool extension to Groovy described in the attached README? I am looking for early collaborators/guinea pigs. On 3/11/08 1:43 PM, Jason Rennie [EMAIL PROTECTED] wrote: Have been working my way through the Map-Reduce tutorial. Just got the WordCount example

Re: File size and number of files considerations

2008-03-10 Thread Ted Dunning
Amar's comments are a little strange. Replication occurs at the block level, not the file level. Storing data in a small number of large files or a large number of small files will have less than a factor of two effect on the number of replicated blocks if the small files are at least 64MB. Files smaller

Re: File Per Column in Hadoop

2008-03-10 Thread Ted Dunning
Have you looked at HBase? It looks like you are trying to reimplement a bunch of it. On 3/10/08 11:01 AM, Richard K. Turner [EMAIL PROTECTED] wrote: ... [storing data in columns is nice] ... I would also do the same for dir csv_file2. Does anyone know how to do this in Hadoop?

Re: Nutch Extensions to MapReduce

2008-03-08 Thread Ted Dunning
for map? Thanks, Naama On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning [EMAIL PROTECTED] wrote: This is not difficult to do. Simply open an extra file in the reducer's configure method and close it in the close method. Make sure you make it relative to the map-reduce output directory so

Re: Equivalent of cmdline head or tail?

2008-03-07 Thread Ted Dunning
I thought so as well until I reflected for a moment. But if you include the top N from every combiner, then you are guaranteed to have the global top N in the output of all of the combiners. On 3/6/08 11:50 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Mar 6, 2008, at 5:02 PM, Ted Dunning

Re: Nutch Extensions to MapReduce

2008-03-06 Thread Ted Dunning
This is not difficult to do. Simply open an extra file in the reducer's configure method and close it in the close method. Make sure you make it relative to the map-reduce output directory so that you can take advantage of all of the machinery that handles lost jobs and such. Search the

Re: displaying intermediate results of map/reduce

2008-03-06 Thread Ted Dunning
You can use System.out if you like and then look at the results as each map or reduce completes via the web administration tool. Also, you can use counters via the reporter passed to your map and reduce classes to get immediate feedback. On 3/6/08 8:12 AM, Prasan Ary [EMAIL PROTECTED] wrote:

Re: Using Sorted Files For Filtering Input (File Index)

2008-03-05 Thread Ted Dunning
You can definitely use the approach that you suggest and you should have good results if you are looking for only a small fraction of the file. Basically, you should have the record reader check to see if any interesting records exist in the current split and if so, read them and if not, just

Re: Hardware Details for a Small Cluster

2008-03-05 Thread Ted Dunning
The right answer really depends on your workload and what your needs and goals are. You say that this is a research lab. If you are researching parallel algorithms, then I would recommend much higher parallelism. If you are working on problems where you want throughput, then the answer may be

Re: Reblance datablocks among multiple HDD's in a datanode

2008-03-05 Thread Ted Dunning
Just use a standard rebalancing script and the empty node will fill in quickly enough. The most common approach to rebalancing is to iterate through the files in your system and increase the replication substantially for about a minute and then drop it back down. It helps to overlap the time
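A sketch of that rebalancing loop in Java (the raise/sleep/restore structure is from the thread; the bump size, the one-minute sleep, and the flat directory walk are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Rebalance {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus f : fs.listStatus(new Path(args[0]))) {
          if (f.isDir()) continue;
          short normal = f.getReplication();
          // Raise replication well above normal so new copies land on the
          // emptier nodes, then drop back so excess replicas get culled.
          fs.setReplication(f.getPath(), (short) (normal + 3));
          Thread.sleep(60 * 1000L);
          fs.setReplication(f.getPath(), normal);
        }
      }
    }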

Re: Processing multiple files - need to identify in map

2008-03-04 Thread Ted Dunning
Yes. Use the configure method which is called each time a new file is used in the map. Save the file name in a field of the mapper. The other alternative is to derive a new InputFormat that remembers the input file name. On 3/4/08 5:38 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I
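A sketch of the first alternative (class name hypothetical): the framework sets the map.input.file property to the file backing the current split, and configure() is the natural place to capture it.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SourceAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private String inputFile;

      public void configure(JobConf job) {
        // Set by the framework to the file backing this task's split.
        inputFile = job.get("map.input.file");
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Tag each record with the file it came from.
        output.collect(new Text(inputFile), value);
      }
    }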

Re: Counting the input bytes of a reduce task

2008-03-02 Thread Ted Dunning
Just call reporter.incrCounter(specificEnumValueOfSomeKind, n) where the first argument is some enum value. The framework will work out that it is your enum and put it in a box of its own along with any other values. On 3/2/08 7:46 PM, dennis81 [EMAIL PROTECTED] wrote: Hi, I was
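For instance (the enum and its values are hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      // The framework groups counters by the enum's class name on its own.
      enum MyCounters { INPUT_BYTES, BAD_RECORDS }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        reporter.incrCounter(MyCounters.INPUT_BYTES, value.getLength());
        output.collect(new Text("line"), value);
      }
    }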

Re: long write operations and data recovery

2008-02-29 Thread Ted Dunning
In our case, we looked at the problem and decided that Hadoop wasn't feasible for our real-time needs in any case. There were several issues: first of all, map-reduce itself didn't seem very plausible for real-time applications. That left hbase and hdfs as the capabilities offered by hadoop

Re: long write operations and data recovery

2008-02-29 Thread Ted Dunning
are only writing 1MB/s. If you need a day of buffering (=100,000 seconds), then you need 100GB of buffer storage. These are very, very moderate requirements for your ingestion point. On 2/29/08 11:18 AM, Steve Sapovits [EMAIL PROTECTED] wrote: Ted Dunning wrote: In our case, we looked

Re: long write operations and data recovery

2008-02-28 Thread Ted Dunning
This is exactly what we do as well. We also have auto-detection for modifications and downstream processing so that back-filling in the presence of error correction is possible (the errors can be old processing code or file munging). On 2/28/08 6:06 PM, Joydeep Sen Sarma [EMAIL PROTECTED]

Re: Solving the hang problem in dfs -copyToLocal/-cat...

2008-02-27 Thread Ted Dunning
Have you tried using http to fetch the file instead? http://name-node-and-port/data/file-path This will get redirected to one of the datanodes to handle and should be pretty fast. It would be interesting to find out if this alternative path is subject to the same hangs that you are seeing.

Re: Solving the hang problem in dfs -copyToLocal/-cat...

2008-02-27 Thread Ted Dunning
Ooops. Should have read the rest of your posting. Sorry about the noise. On 2/27/08 12:05 PM, C G [EMAIL PROTECTED] wrote: Hi All: The following write-up is offered to help out anybody else who has seen performance problems and hangs while using dfs -copyToLocal/-cat. One

Re: Calculations involve large datasets

2008-02-22 Thread Ted Dunning
Joins are easy. Just reduce on a key composed of the stuff you want to join on. If the data you are joining is disparate, leave some kind of hint about what kind of record you have. The reducer will be iterating through sets of records that have the same key. This is similar to the results

Re: Sorting output data on value

2008-02-21 Thread Ted Dunning
But this only guarantees that the results will be sorted within each reducer's input. Thus, this won't result in getting the results sorted by the reducer's output value. On 2/21/08 8:40 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote: It may

Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Ted Dunning
Sorry to be picky about the math, but 1 Trillion = 10^12 = million million. At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9. At 100 links per page, this gives 10B pages. On 2/19/08 2:25 PM, Peter W. [EMAIL PROTECTED] wrote: Amazing milestone, Looks like Y! had

Re: the best way to kill a bad job?

2008-02-13 Thread Ted Dunning
There is a kill job link at the bottom of the map-reduce admin panel for the job. Did that not work? On 2/12/08 10:33 PM, Jim the Standing Bear [EMAIL PROTECTED] wrote: What is the best way to kill a bad job (e.g. an infinite loop)? The job I was running went into an infinite loop and I

Re: key/value after reduce

2008-02-12 Thread Ted Dunning
But that map will have to read the file again (and is likely to want a different key than the reduce produces). On 2/12/08 12:33 PM, Miles Osborne [EMAIL PROTECTED] wrote: You may well have another Map operation operate over the Reducer output, in which case you'd want key-value pairs.

Re: key/value after reduce

2008-02-12 Thread Ted Dunning
Welcome to the club. The good news is that if you give the output collector a null key, it will just output the data in the value argument and ignore the key entirely. Occasionally, the distinction is useful to avoid constructing yet another temporary data structure to hold a tuple. Word
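So a reducer that only wants the value in its output can do this (a minimal sketch; the null-key behavior of TextOutputFormat is the part confirmed in the thread):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ValueOnlyReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          // With a null key, the output format writes just the value,
          // with no key and no separator tab.
          output.collect(null, values.next());
        }
      }
    }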

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning
It isn't popular much anymore, but once upon a time, network topology for clustering was a big topic. Since then, switches have gotten pretty fast and worrying about these things has gone out of fashion a bit other than something on the level of the current rack-aware locality in Hadoop. With 4

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning
Doesn't the incremental CPU cost you as much as an entire extra box? On 2/12/08 12:19 PM, Colin Evans [EMAIL PROTECTED] wrote: The big question for me is how well a dual-CPU 4-core (8 cores per box) configuration will do. Has anyone tried out this configuration with Intel or AMD CPUs? Is

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning
I would concur that it is much better to have sufficient storage in the compute farm for DFS files to be local for the compute tasks. Also, a 16 disk machine typically costs a good bit more than a 6 disk machine + 10 disks because you usually require a second chassis. Sun's Thumper would be an

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning
Why not down-grade the CPU power and increase the number of chassis to get more disks (and controllers and network interfaces)? On 2/12/08 12:53 PM, Jason Venner [EMAIL PROTECTED] wrote: We have 3 types of machines we can get, 2 disk, 6 disk and 16 disk machines. *They all have 4 dual core

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning
I have had issues with machines that are highly disparate in terms of disk space. I expect that some of those issues have been mitigated in recent releases. On 2/12/08 11:51 AM, Jason Venner [EMAIL PROTECTED] wrote: We are starting to build larger clusters, and want to better understand

Re: Best Practice?

2008-02-11 Thread Ted Dunning
Jeff, Doesn't the reducer see all of the data points for each cluster (canopy) in a single list? If so, why the need to output during close? If not, why not? On 2/11/08 12:24 PM, Jeff Eastman [EMAIL PROTECTED] wrote: Hi Owen, Thanks for the information. I took Ted's advice and

Re: Caching frequently map input files

2008-02-11 Thread Ted Dunning
You should be looking at HDFS (part of hadoop) plus hbase or code that you write yourself. Hadoop is built in two parts. One part is the distributed file system that provides replication and similar functions. You can access this file system pretty easily from Java. Your requirements are
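A minimal sketch of that Java access (paths hypothetical; the cluster's fs.default.name is assumed to be on the classpath configuration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
      public static void main(String[] args) throws Exception {
        // Connects to whichever file system the configuration names.
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"));
        out.writeBytes("hello, hdfs\n");
        out.close();
        FSDataInputStream in = fs.open(new Path("/tmp/hello.txt"));
        System.out.println(in.readLine());
        in.close();
      }
    }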

Re: Caching frequently map input files

2008-02-11 Thread Ted Dunning
in parallel on many files as possible. This way I would be able to return a result faster then if I would have used one machine. Is there a way to tell which files are in memory? On Feb 10, 2008 10:33 PM, Ted Dunning [EMAIL PROTECTED] wrote: But if your files DO fit into memory

Re: Best Practice?

2008-02-10 Thread Ted Dunning
You got it exactly. On 2/10/08 5:08 PM, Jeff Eastman [EMAIL PROTECTED] wrote: mapper assigns points to clusters, combiner computes partial centroids, reducer computers final centroids... Using a combiner in this manner would avoid [outputting data in the mapper's close method]. Did I get

Re: Best Practice?

2008-02-09 Thread Ted Dunning
Hmmm I think that computing centroids in the mapper may not be the best idea. A different structure that would work well is to use the mapper to assign data records to centroids and use the centroid number as the reduce key. Then the reduce itself can compute the centroids.

Re: Namenode fails to replicate file

2008-02-08 Thread Ted Dunning
@hadoop.apache.org Subject: Re: Namenode fails to replicate file Doesn't the -setrep command force the replication to be increased immediately? ./hadoop dfs -setrep [replication] path (I may have misunderstood) On Thu, 2008-02-07 at 17:05 -0800, Ted Dunning wrote: Chris Kline reported

Re: Namenode fails to replicate file

2008-02-08 Thread Ted Dunning
-0800, Ted Dunning wrote: Chris Kline reported a problem in early January where a file which had too few replicated blocks did not get replicated until a DFS restart. I just saw a similar issue. I had a file that had a block with 1 replica (2 required) that did not get replicated. I

Re: Namenode fails to replicate file

2008-02-08 Thread Ted Dunning
I will see if I can replicate the problem and do as you suggest. On 2/8/08 4:29 PM, Raghu Angadi [EMAIL PROTECTED] wrote: Ted Dunning wrote: That makes it wait, but I don't think it increases the urgency on the part of the namenode. As an interesting experiment, I had a cluster

Re: Skip Reduce Phase

2008-02-07 Thread Ted Dunning
map task, or does your suggestion already do this and this is a moot point? David On Thu, 2008-02-07 at 09:39 -0800, Ted Dunning wrote: Set numReducers to 0. On 2/7/08 9:35 AM, David Alves [EMAIL PROTECTED] wrote: Hi All First of all since this is my first post I must say congrats

Re: Skip Reduce Phase

2008-02-07 Thread Ted Dunning
Set numReducers to 0. On 2/7/08 9:35 AM, David Alves [EMAIL PROTECTED] wrote: Hi All First of all since this is my first post I must say congrats for the great piece of software (both Hadoop and HBase). I've been using HadoopHBase for a while and I have a question, let me just explain a
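Concretely, with the old API (class name hypothetical):

    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyJob.class);
        // Zero reducers: map output goes straight to the output files,
        // one per map task, and the sort/shuffle phase is skipped.
        conf.setNumReduceTasks(0);
        // ... configure mapper and paths, then JobClient.runJob(conf);
      }
    }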

Namenode fails to replicate file

2008-02-07 Thread Ted Dunning
Chris Kline reported a problem in early January where a file which had too few replicated blocks did not get replicated until a DFS restart. I just saw a similar issue. I had a file that had a block with 1 replica (2 required) that did not get replicated. I changed the number of required

Re: Mahout Machine Learning Project Launches

2008-02-06 Thread Ted Dunning
I don't think anybody has figured out how to patent the Lanczos algorithm itself! On 2/6/08 10:03 AM, Peter W. [EMAIL PROTECTED] wrote: Hello, This Mahout project seems very interesting. Any problem that has reducibility components using mapreduce and can then be described as a

Re: Low complexity way to write a file to hdfs?

2008-02-06 Thread Ted Dunning
, C G Ted Dunning [EMAIL PROTECTED] wrote: I am looking for a way for scripts to write data to HDFS without having to install anything. The /data and /listPaths URL's on the nameserver are ideal for reading files, but I can't find anything comparable to write files. Am I missing

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ted Dunning
Very nice summary. One of the issues that we have had with multiple search servers is that on linux, there can be substantial contention for disk I/O. This means that as a new index is being written, access to the current index can be stalled for very long periods of time (sometimes tens of seconds). This

Re: sort by value

2008-02-06 Thread Ted Dunning
The method you describe is the standard approach. The benefit is that the data that arrives at the reducer might be larger than you want to store in memory (for sorting by the reduce). Also, reading the entire set of reduce values would increase the amount of data allocated and would mean that

Re: sort by value

2008-02-06 Thread Ted Dunning
On 2/6/08 11:58 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote: But it actually adds duplicate data (i.e., the value column which needs sorting) to the key. Why? You can always take it out of the value to remove the redundancy. Actually, you can't in most cases. Suppose you have
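A sketch of the standard trick being discussed (the "realkey\tvalue" composite layout is an assumption): put the value into the key so the framework's sort orders by it, but group reduce inputs on the real key alone.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;

    public class SecondarySort {
      // Compare only the part before the tab, so all composite keys
      // "realkey\tvalue" sharing a real key reach one reduce() call,
      // while the full-key sort still orders them by value.
      public static class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(Text.class, true); }

        public int compare(WritableComparable a, WritableComparable b) {
          String ka = a.toString().split("\t", 2)[0];
          String kb = b.toString().split("\t", 2)[0];
          return ka.compareTo(kb);
        }
      }

      public static void configure(JobConf conf) {
        // Text's default comparator already sorts the full composite key;
        // group values into reduce calls by the real key alone:
        conf.setOutputValueGroupingComparator(GroupComparator.class);
      }
    }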

Re: pig user meeting, Friday, February 8, 2008

2008-02-06 Thread Ted Dunning
If there is video recorded, please consider posting it on Veoh. :-) On 2/6/08 1:44 PM, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Otis, can you suggest a technology how we could do that? Skype? Ichat? Something that is free? I'm happy setup a video conf, however there are no big

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ted Dunning
We have quite a few serving the load, but if we are trying to update relatively often (say every 30 minutes), then having a server out of action for several minutes really hurts. The outage is that long because you have to A) turn off traffic B) wait for traffic to actually stop C) move the
