Re: Poor IO performance on a 10 node cluster.
It is also worth using dd to verify your raw disk speeds. Also, expressing disk transfer rates in bytes per second makes it a bit easier for most of the disk people I know to figure out what is large or small. Each of these disks should do about 100MB/s when driven well. Hadoop does OK, but not nearly full capacity, so I would expect more like 40-50MB/s per disk. Also, if one of your disks is doing double duty you may have some extra cost. 40 x 2 x 10 = 800MB per second. You are doing 100GB / 220 seconds = about 0.45 GB/s, which isn't so terribly bad. It is definitely less than the theoretically possible 100 x 2 x 10 = 2GB/second and I would expect you could tune this up a little, but not a massive amount. In general, blade servers do not make good Hadoop nodes exactly because the I/O performance tends to be low when you only have a few spindles. One other reason that this might be a bit below expectations is that your files may not be well distributed on your cluster. Can you say what you used to upload the files? On Wed, Jun 1, 2011 at 5:56 PM, hadoopman hadoop...@gmail.com wrote: Some things which helped us include setting your vm.swappiness to 0 and mounting your disks with the noatime,nodiratime options. Also make sure your disks aren't set up with RAID (JBOD is recommended). You might want to run terasort as you tweak your environment. It's very helpful when checking if a change helped (or hurt) your cluster. Hope that helps a bit. On 05/30/2011 06:27 AM, Gyuribácsi wrote: Hi, I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16 HT cores). I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming jar with 'wc -l' as mapper and 'cat' as reducer. I use 64MB block size and the default replication (3). The wc on the 100 GB took about 220 seconds, which translates to about 3.5 Gbit/sec processing speed. One disk can do sequential reads at 1Gbit/sec, so I would expect something around 20 Gbit/sec (minus some overhead), and I'm getting only 3.5. Is my expectation valid? I checked the jobtracker and it seems all nodes are working, each reading the right blocks. I have not played with the number of mappers and reducers yet. It seems the number of mappers is the same as the number of blocks and the number of reducers is 20 (there are 20 disks). This looks OK to me. We also did an experiment with TestDFSIO with similar results. Aggregated read IO speed is around 3.5Gbit/sec. It is just too far from my expectation :( Please help! Thank you, Gyorgy
Re: trying to select technology
To pile on, thousands or millions of documents are well within the range that is well addressed by Lucene. Solr may be an even better option than bare Lucene since it handles lots of the boilerplate problems like document parsing and index update scheduling. On Tue, May 31, 2011 at 11:56 AM, Matthew Foley ma...@yahoo-inc.com wrote: Sounds like you're looking for a full-text inverted index. Lucene is a good open-source implementation of that. I believe it has an option for storing the original full text as well as the indexes. --Matt On May 31, 2011, at 10:50 AM, cs230 wrote: Hello All, I am planning to start a project where I have to do extensive storage of xml and text files. On top of that I have to implement an efficient algorithm for searching over thousands or millions of files, and also build some indexes to make search faster next time. I looked into the Oracle database but it delivers very poor results. Can I use Hadoop for this? Which Hadoop project would be the best fit for this? Is there anything from Google I can use? Thanks a lot in advance.
Re: Simple change to WordCount either times out or runs 18+ hrs with little progress
itr.nextToken() is inside the if, so the tokenizer is never advanced when count != 5 and the loop spins without making progress. On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote: while (itr.hasMoreTokens()) { if(count == 5) { word.set(itr.nextToken()); output.collect(word, one); } count++; }
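For reference, a corrected version of that loop might look like the sketch below (it keeps the snippet's variable names and the old-API output.collect call; whether count should be reset for each input record is an assumption left to the original author):

    // Always consume a token so hasMoreTokens() eventually becomes false;
    // only emit the token at position 5.
    int count = 0;
    while (itr.hasMoreTokens()) {
        String token = itr.nextToken();   // advance the tokenizer on every iteration
        if (count == 5) {
            word.set(token);
            output.collect(word, one);
        }
        count++;
    }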
Re: Hadoop and WikiLeaks
ZK started as a sub-project of Hadoop. On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote: Interesting to note that Cassandra and ZK are now considered Hadoop projects. They were independent of Hadoop before the recent update. On Thu, May 19, 2011 at 4:18 AM, Steve Loughran ste...@apache.org wrote: On 18/05/11 18:05, javam...@cox.net wrote: Yes! -Pete Edward Capriolo edlinuxg...@gmail.com wrote: = http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards The Hadoop project won the innovator of the year award from the UK's Guardian newspaper, where it was described as having the potential to be a greater catalyst for innovation than other nominees including WikiLeaks and the iPad. Does this copy text bother anyone else? Sure, winning any award is great, but does hadoop want to be associated with innovation like WikiLeaks? Ian updated the page yesterday with changes I'd put in for trademarks, and I added this news quote directly from the paper. We could strip out the quote easily enough.
Re: matrix-vector multiply in hadoop
Try using the Apache Mahout code that solves exactly this problem. Mahout has a distributed row-wise matrix that is read one row at a time. Dot products with the vector are computed and the results are collected. This capability is used extensively in the large scale SVD's in Mahout. On Tue, May 17, 2011 at 1:13 PM, Alexandra Anghelescu axanghele...@gmail.com wrote: Hi all, I was wondering how to go about doing a matrix-vector multiplication using hadoop. I have my matrix in one file and my vector in another. All the map tasks will need the vector file... basically they need to share it. Basically I want my map function to output key-value pairs (i,m[i,j]*v(j)), where i is the row number, and j the column number; v(j) is the jth element in v. And the reduce function will sum up all the values with the same key - i, and that will be the ith element of my result vector. I don't know how to format the input to do this.. even if I do it in 2 MR iterations, first formatting the input and second the actual matrix-vector-multiply, I don't have a clear idea. If you have any ideas/suggestions I would appreciate it! Thanks in advance, Alexandra
Re: Suggestions for swapping issue
How is it that 36 processes are not expected if you have configured 48 + 12 = 60 slots available on the machine? On Wed, May 11, 2011 at 11:11 AM, Adi adi.pan...@gmail.com wrote: By our calculations hadoop should not exceed 70% of memory. Allocated per node - 48 map slots (24 GB), 12 reduce slots (6 GB), 1 GB each for DataNode/TaskTracker and one JobTracker, totalling 33/34 GB allocation. The queues are capped at using only 90% of the capacity allocated, so generally 10% of slots are always kept free. The cluster was running a total of 33 mappers and 1 reducer, so around 8-9 mappers per node with a 3 GB max limit, and they were utilizing around 2GB each. Top was showing 100% memory utilized, which our sys admin says is OK as the memory is used for file caching by linux if the processes are not using it. No swapping on 3 nodes. Then node4 just started swapping after the number of processes shot up unexpectedly. The main mystery is the excess number of processes on the node which went down: 36 as opposed to the expected 11. The other 3 nodes were successfully executing the mappers without any memory/swap issues. -Adi On Wed, May 11, 2011 at 1:40 PM, Michel Segel michael_se...@hotmail.com wrote: You have to do the math... If you have 2gb per mapper, and run 10 mappers per node... That means 20gb of memory. Then you have TT and DN running which also take memory... What did you set as the number of mappers/reducers per node? What do you see in ganglia or when you run top? Sent from a remote device. Please excuse any typos... Mike Segel On May 11, 2011, at 12:31 PM, Adi adi.pan...@gmail.com wrote: Hello Hadoop Gurus, We are running a 4-node cluster. We just upgraded the RAM to 48 GB. We have allocated around 33-34 GB per node for hadoop processes, leaving the remaining 14-15 GB of memory for the OS and as buffer. There are no other processes running on these nodes. Most of the lighter jobs run successfully but one big job is de-stabilizing the cluster. One node starts swapping and runs out of swap space and goes offline. We tracked the processes on that node and noticed that it ends up with more than the expected number of hadoop java processes. The other 3 nodes were running 10 or 11 processes and this node ends up with 36. After killing the job we find these processes still show up and we have to kill them manually. We have tried reducing the swappiness to 6 but saw the same results. It also looks like hadoop stays well within the memory limits allocated and still starts swapping. Some other suggestions we have seen are: 1) Increase swap size. Current size is 6 GB. The most quoted size is 'tons of swap' but I am not sure how much that translates to in numbers. Should it be 16 or 24 GB? 2) Increase the overcommit ratio. Not sure if this helps as a few blog comments mentioned it didn't help. Any other hadoop or linux config suggestions are welcome. Thanks. -Adi
Re: questions about hadoop map reduce and compute intensive related applications
On Sat, Apr 30, 2011 at 12:18 AM, elton sky eltonsky9...@gmail.com wrote: I got 2 questions: 1. I am wondering how hadoop MR performs when it runs compute intensive applications, e.g. Monte carlo method compute PI. There's a example in 0.21, QuasiMonteCarlo, but that example doesn't use random number and it generates psudo input upfront. If we use distributed random number generation, then I guess the performance of hadoop should be similar with some message passing framework, like MPI. So my guess is by using proper method hadoop would be good in compute intensive applications compared with MPI. Not quite sure what algorithms you mean here, but for trivial parallelism, map-reduce is a fine way to go. MPI supports node-to-node communications in ways that map-reduce does not, however, which requires that you iterate map-reduce steps for many algorithms. With Hadoop's current implementation, this is horrendously slow (minimum 20-30 seconds per iteration). Sometimes you can avoid this by clever tricks. For instance, random projection can compute the key step in an SVD decomposition with one map-reduce while the comparable Lanczos algorithm requires more than one step per eigenvector (and we often want 100 of them!). Sometimes, however, there are no known algorithms that avoid the need for repeated communication. For these problems, Hadoop as it stands may be a poor fit. Help is on the way, however, with the MapReduce 2.0 work because that will allow much more flexible models of computation. 2. I am looking for some applications, which has large data sets and requires intensive computation. An application can be divided into a workflow, including either map reduce operations, and message passing like operations. For example, in step 1 I use hadoop MR processes 10TB of data and generates small output, say, 10GB. This 10GB can be fit into memory and they are better be processed with some interprocess communication, which will boost the performance. So in step 2 I will use MPI, etc. Some machine learning algorithms require features that are much smaller than the original input. This leads to exactly the pattern you describe. Integrating MPI with map-reduce is currently difficult and/or very ugly, however. Not impossible and there are hackish ways to do the job, but they are hacks.
Re: Serving Media Streaming
Check out S4 http://s4.io/ On Fri, Apr 29, 2011 at 10:13 PM, Luiz Fernando Figueiredo luiz.figueir...@auctorita.com.br wrote: Hi guys. Hadoop is well known for processing large amounts of data, but we think we can do much more with it. Our goal is to try to serve pseudo-streaming media, close to what Akamai does with their platform. Does anyone know if any solution for this has already been implemented on top of Hadoop? Or are there any real problems with using it that way? So far I haven't seen any barrier to applying Hadoop this way, but some people suggest just using GlusterFS instead; however, beyond just storing the multimedia files, we want to be able to handle some data too. Cheers. Luiz Fernando
Re: Applications creates bigger output than input?
Cooccurrence analysis is commonly used in recommendations. These produce large intermediates. Come on over to the Mahout project if you would like to talk to a bunch of people who work on these problems. On Fri, Apr 29, 2011 at 9:31 PM, elton sky eltonsky9...@gmail.com wrote: Thank you for suggestions: Weblog analysis, market basket analysis and generating search index. I guess for these applications we need more reduces than maps, for handling large intermediate output, isn't it. Besides, the input split for map should be smaller than usual, because the workload for spill and merge on map's local disk is heavy. -Elton On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley omal...@apache.org wrote: On Fri, Apr 29, 2011 at 5:02 AM, elton sky eltonsky9...@gmail.com wrote: For my benchmark purpose, I am looking for some non-trivial, real life applications which creates *bigger* output than its input. Trivial example I can think about is cross join... As you say, almost all cross join jobs have that property. The other case that almost always fits into that category is generating an index. For example, if your input is a corpus of documents and you want to generate the list of documents that contain each word, the output (and especially the shuffle data) is much larger than the input. -- Owen
Re: providing the same input to more than one Map task
I would recommend taking this question to the Mahout mailing list. The short answer is that matrix multiplication by a column vector is pretty easy. Each mapper reads the vector in the configure method and then does a dot product for each row of the input matrix. Results are reassembled into a vector in the reducer. Mahout has special matrix structures to help with this. On Fri, Apr 22, 2011 at 2:59 PM, Mehmet Tepedelenlioglu mehmets...@gmail.com wrote: There is a way: http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache Are you working with a sparse matrix, or a full one? On Apr 22, 2011, at 2:33 PM, aanghelescu wrote: Hi all, I am trying to perform matrix-vector multiplication using Hadoop. So I have matrix M in a file, and vector v in another file. Obviously, the files are of different sizes. Is it possible to make it so that each Map task will get the whole vector v and a chunk of matrix M? I know what my map and reduce functions should look like, but I don't know how to format the input. Basically I want my map function to output key-value pairs (i,m[i,j]*v(j)), where i is the row number, and j the column number; v(j) is the jth element in v. And the reduce function will sum up all the values with the same key - i, and that will be the ith element of my result vector. Or can you suggest another way to do it? Thanks, Alexandra
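To make that concrete, here is a rough sketch of such a mapper using the old mapred API and the DistributedCache link above. The vector file name, the one-value-per-line vector format, and the dense "rowIndex v1 v2 ..." row layout are assumptions made for illustration, not anything Hadoop or Mahout prescribes:

    import java.io.*;
    import java.util.*;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Multiplies a row-major matrix by a vector shipped via DistributedCache.
    public class MatVecMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, DoubleWritable> {

      private double[] vector;

      @Override
      public void configure(JobConf job) {
        try {
          // The vector file was added to the job with DistributedCache.addCacheFile().
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          List<Double> v = new ArrayList<Double>();
          BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
          for (String line = in.readLine(); line != null; line = in.readLine()) {
            v.add(Double.parseDouble(line.trim()));
          }
          in.close();
          vector = new double[v.size()];
          for (int i = 0; i < v.size(); i++) vector[i] = v.get(i);
        } catch (IOException e) {
          throw new RuntimeException("could not read the cached vector", e);
        }
      }

      @Override
      public void map(LongWritable offset, Text value,
                      OutputCollector<IntWritable, DoubleWritable> out, Reporter reporter)
          throws IOException {
        // One full matrix row per input line: "rowIndex v1 v2 v3 ..."
        String[] parts = value.toString().trim().split("\\s+");
        int row = Integer.parseInt(parts[0]);
        double sum = 0;
        for (int j = 1; j < parts.length; j++) {
          sum += Double.parseDouble(parts[j]) * vector[j - 1];   // dot product with v
        }
        out.collect(new IntWritable(row), new DoubleWritable(sum));  // (i, sum_j m[i,j]*v[j])
      }
    }

With this layout the reducer has nothing to sum and just assembles the (row, value) pairs into the result vector; an identity reducer is enough.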
Re: Estimating Time required to compute M/Rjob
Turing completeness isn't the central question here, really. The truth is, map-reduce programs have considerable pressure to be written in a scalable fashion, which limits them to fairly simple behaviors that result in pretty linear dependence of run-time on input size for a given program. The cool thing about the paper that I linked to the other day is that there are enough cues about the expected runtime of the program available to make good predictions *without* looking at the details. No doubt the estimation facility could make good use of something as simple as the hash of the jar in question, but even without that it is possible to produce good estimates. I suppose that this means that all of us Hadoop programmers are really just kind of boring folk. On average, anyway. On Sun, Apr 17, 2011 at 12:07 PM, Matthew Foley ma...@yahoo-inc.com wrote: Since general M/R jobs vary over a huge (Turing problem equivalent!) range of behaviors, a more tractable problem might be to characterize the descriptive parameters needed to answer the question: if the following problem P runs in T0 amount of time on a certain benchmark platform B0, how long T1 will it take to run on a differently configured real-world platform B1?
Re: Estimating Time required to compute M/Rjob
Sounds like this paper might help you: Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning by Ganapathi, Archana, Harumi Kuno, Umeshwar Daval, Janet Wiener, Armando Fox, Michael Jordan, David Patterson http://radlab.cs.berkeley.edu/publication/187 On Sat, Apr 16, 2011 at 1:19 PM, Stephen Boesch java...@gmail.com wrote: some additional thoughts about the the 'variables' involved in characterizing the M/R application itself. - the configuration of the cluster for numbers of mappers vs reducers compared to the characteristics (amount of work/procesing) required in each of the map/shuffle/reduce stages - is the application using multiple chained M/R stages? Multi stage M/R's are more difficult to tune properly in terms of keeping all workers busy . That may be challenging to model. 2011/4/16 Stephen Boesch java...@gmail.com You could consider two scenarios / set of requirements for your estimator: 1. Allow it to 'learn' from certain input data and then project running times of similar (or moderately dissimilar) workloads. So the first steps could be to define a couple of relatively small control M/R jobs on a small-ish dataset and throw it at the unknown (cluster-under-test) hdfs/ M/R cluster. Try to design the control M/R job in a way that it will be able to completely load down all of the available DataNodes in the cluster-under-test for at least a brief period of time. Then you wlil have obtained a decent signal on the capabilities of the cluster under test and may allow a relatively high degree of predictive accuracy for even much larger jobs 2. If instead it were your goal to drive the predictions off of a purely mathematical model - in your terms the application and base file system - and without any empirical data - then here is an alternative approach. - Follow step (1) above against a variety of applications and base file systems - especially in configurations for which you wish your model to provide high quality predictions. - Save the results in structured data - Derive formulas for characterizing the curves of performance via those variables that you defined (application / base file system) Now you have a trained model. When it is applied to a new set of applications / base file systems it can use the curves you have already determined to provide the result without any runtime requirements. Obviously the value of this second approach is limited by the degree of similarity of the training data to the applications you attempt to model. If all of your training data is on a 50 node cluster against machines with IDE drives don't expect good results when asked to model a 1000 node cluster using SAN's / RAID's / SCSI's. 2011/4/16 Sonal Goyal sonalgoy...@gmail.com What is your MR job doing? What is the amount of data it is processing? What kind of a cluster do you have? Would you be able to share some details about what you are trying to do? If you are looking for metrics, you could look at the Terasort run .. Thanks and Regards, Sonal https://github.com/sonalgoyal/hihoHadoop ETL and Data Integrationhttps://github.com/sonalgoyal/hiho Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Apr 16, 2011 at 3:31 PM, real great.. greatness.hardn...@gmail.comwrote: Hi, As a part of my final year BE final project I want to estimate the time required by a M/R job given an application and a base file system. Can you folks please help me by posting some thoughts on this issue or posting some links here. -- Regards, R.V.
Re: Dynamic Data Sets
Hbase is very good at this kind of thing. Depending on your aggregation needs, OpenTSDB might be interesting since they store and query against large amounts of time-ordered data similar to what you want to do. It isn't clear whether your data is primarily about current state or about time-embedded state transitions. You can easily store both in hbase, but the arrangements will be a bit different. On Wed, Apr 13, 2011 at 6:12 PM, Sam Seigal selek...@yahoo.com wrote: I have a requirement where I have large sets of incoming data into a system I own. A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time. What is the best way to run analytical queries on data of such a nature? One way is to maintain this data in a separate store, take a point-in-time snapshot, and then import it into the HDFS filesystem for analysis using Hadoop Map-Reduce. I do not see this approach scaling, since moving data is obviously expensive. If I were to directly maintain this data as Sequence Files in HDFS, how would updates work? I am new to Hadoop/HDFS, so any suggestions/critique is welcome. I know that HBase works around this problem through multi-version concurrency control techniques. Is that the only option? Are there any alternatives? Also note that all the aggregation and analysis I want to do is time based, i.e. sum of x on pivot y over a day, 2 days, week, month etc. For such use cases, is it advisable to use HDFS directly or to use systems built on top of hadoop like Hive or Hbase?
Re: Memory mapped resources
Kevin, You present a good discussion of architectural alternatives here. But my comment really had more to do with whether a particular HDFS patch would provide what the original poster seemed to be asking about. This is especially pertinent since the patch was intended to scratch a different itch. On Tue, Apr 12, 2011 at 5:51 AM, kevin.le...@thomsonreuters.com wrote: This is the age old argument of what to share in a partitioned environment. IBM and Teradata have always used shared nothing which is what only getting one chunk of the file in each hadoop node is doing. Oracle has always used shared disk which is not an easy thing to do, especially in scale, and seems to have varying results depending on application, transaction or dss. Here are a couple of web references. http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Bhide88.html http://jhingran.typepad.com/anant_jhingrans_musings/2010/02/shared-nothi ng-vs-shared-disks-the-cloud-sequel.html Rather than say shared nothing isn't useful, hadoop should look to how others make this work. The two key problems to avoid are data skew where one node sees to much data and becomes the slow node and large intra-partition joins where large data is needed from more than one partition and potentially gets copied around. Rather than hybriding into shared disk, I think hadoop should hybrid into the shared data solutions others use, replication of select data, for solving intra-partition joins in a shared nothing architecture. This may be more database terminology that could be addressed by hbase, but I think it is good background for the questions of memory mapping files in hadoop. Kevin -Original Message- From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Tuesday, April 12, 2011 12:09 AM To: Jason Rutherglen Cc: common-user@hadoop.apache.org; Edward Capriolo Subject: Re: Memory mapped resources Yes. But only one such block. That is what I meant by chunk. That is fine if you want that chunk but if you want to mmap the entire file, it isn't real useful. On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: What do you mean by local chunk? I think it's providing access to the underlying file block? On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com wrote: Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can however it will require customization of HDFS. Take a look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch. I have been altering it for use with HBASE-3529. Note that the patch noted is for the -append branch which is mainly for HBase. On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com wrote: We have some very large files that we access via memory mapping in Java. Someone's asked us about how to make this conveniently deployable in Hadoop. If we tell them to put the files into hdfs, can we obtain a File for the underlying file on any given node? This features it not yet part of hadoop so doing this is not convenient.
Re: Memory mapped resources
Well, no. You could mmap all the blocks that are local to the node your program is on. The others you will have to read more conventionally. If all blocks are guaranteed local, this would work. I don't think that guarantee is possible on a non-trivial cluster. On Tue, Apr 12, 2011 at 6:32 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Then one could MMap the blocks pertaining to the HDFS file and piece them together. Lucene's MMapDirectory implementation does just this to avoid an obscure JVM bug. On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning tdunn...@maprtech.com wrote: Yes. But only one such block. That is what I meant by chunk. That is fine if you want that chunk but if you want to mmap the entire file, it isn't real useful. On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: What do you mean by local chunk? I think it's providing access to the underlying file block? On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com wrote: Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can however it will require customization of HDFS. Take a look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch. I have been altering it for use with HBASE-3529. Note that the patch noted is for the -append branch which is mainly for HBase. On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com wrote: We have some very large files that we access via memory mapping in Java. Someone's asked us about how to make this conveniently deployable in Hadoop. If we tell them to put the files into hdfs, can we obtain a File for the underlying file on any given node? This features it not yet part of hadoop so doing this is not convenient.
Re: Memory mapped resources
Blocks live where they land when first created. They can be moved due to node failure or rebalancing, but it is typically pretty expensive to do this. It certainly is slower than just reading the file. If you really, really want mmap to work, then you need to set up some native code that builds an mmap'ed region, but sets all pages to no access if the corresponding block is non-local and sets the block to access the local block if the block is local. Then you can intercept the segmentation violations that occur on page access to non-local data, read that block to local storage and mmap it into those pages. This is LOTS of work to get exactly right and must be done in C since Java can't really handle seg faults correctly. This pattern is fairly commonly used in garbage collected languages to allow magical remapping of memory without explicit tests. On Tue, Apr 12, 2011 at 8:24 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Interesting. I'm not familiar with how blocks go local, however I'm interested in how to make this occur via a manual oriented call. Eg, is there an option available that guarantees locality, and if not, perhaps there's work being done towards that path?
Re: Memory mapped resources
Actually, it doesn't become trivial. It just becomes total fail or total win instead of almost always being a partial win. It doesn't meet Benson's need. On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: To get around the chunks or blocks problem, I've been implementing a system that simply sets a max block size that is too large for a file to reach. In this way there will only be one block per HDFS file, and so MMap'ing or other single file ops become trivial. On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies bimargul...@gmail.com wrote: Here's the OP again. I want to make it clear that my question here has to do with the problem of distributing 'the program' around the cluster, not 'the data'. In the case at hand, the issue is a system that has a large data resource that it needs to do its work. Every instance of the code needs the entire model. Not just some blocks or pieces. Memory mapping is a very attractive tactic for this kind of data resource. The data is read-only. Memory-mapping it allows the operating system to ensure that only one copy of the thing ends up in physical memory. If we force the model into a conventional file (storable in HDFS) and read it into the JVM in a conventional way, then we get as many copies in memory as we have JVMs. On a big machine with a lot of cores, this begins to add up. For people who are running a cluster of relatively conventional systems, just putting copies on all the nodes in a conventional place is adequate.
Re: Memory mapped resources
Benson is actually a pretty sophisticated guy who knows a lot about mmap. I engaged with him yesterday on this since I know him from Apache. On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas mcsri...@gmail.com wrote: I am not sure if you realize, but HDFS is not VM integrated.
Re: Using global reverse lookup tables
Depending on the function that you want to use, it sounds like you want to use a self join to compute transposed cooccurrence. That is, it sounds like you want to find all the sets that share elements with X. If you have a binary matrix A that represents your set membership with one row per set and one column per element, then you want to compute A A'. It is common for A to be available in row-major form or in the form of pairs. With row-major form, the easiest way to compute your desired result is to transpose A in a first map-reduce. Either way, with the transposed matrix in hand, a second map-reduce can be used in which the mapper reads all of the sets with a particular element and emits pairs of sets that have a common element. The combiner and reducer are basically a pair counter. This implementation suffers in that it takes time quadratic in the density of the most dense row. It is common to downsample such rows to a reasonable level. Most uses of the cooccurrence matrix don't care about this downsampling, and it makes the algorithm much faster. As an alternative, you can compute a matrix decomposition and use that to compute A A'. This can be arranged so as to avoid the downsampling, but the program required is much more complex. The Apache Mahout project has several implementations of such decompositions tuned for different situations. Some implementations use map-reduce, some are sequential. On Mon, Apr 11, 2011 at 9:30 AM, W.P. McNeill bill...@gmail.com wrote: Are there general approaches to solving this problem?
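As an illustration of that second map-reduce, a rough sketch is below (old mapred API). The "element set1 set2 ..." input layout is an assumed output format for the transpose step, and the combiner/reducer is just a standard sum-of-counts reducer:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Second pass: for every element, emit a count of 1 for each ordered pair of
    // distinct sets that contain it. Summing these counts yields the entries of A A'.
    public class CooccurrenceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);

      public void map(LongWritable offset, Text value,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        // Assumed line layout from the transpose step: "element set1 set2 set3 ..."
        String[] tokens = value.toString().trim().split("\\s+");
        // This double loop is where the quadratic cost in row density comes from,
        // and it is what downsampling of very dense rows would tame.
        for (int i = 1; i < tokens.length; i++) {
          for (int j = 1; j < tokens.length; j++) {
            if (i != j) {
              out.collect(new Text(tokens[i] + "\t" + tokens[j]), ONE);
            }
          }
        }
      }
    }

The combiner and reducer simply add up the IntWritable counts per (set, set) key, which is the pair counter mentioned above.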
Re: Memory mapped resources
Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can, however it will require customization of HDFS. Take a look at HDFS-347, specifically the HDFS-347-branch-20-append.txt patch. I have been altering it for use with HBASE-3529. Note that the patch noted is for the -append branch which is mainly for HBase. On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com wrote: We have some very large files that we access via memory mapping in Java. Someone's asked us about how to make this conveniently deployable in Hadoop. If we tell them to put the files into hdfs, can we obtain a File for the underlying file on any given node? This feature is not yet part of hadoop so doing this is not convenient.
Re: Memory mapped resources
Yes. But only one such block. That is what I meant by chunk. That is fine if you want that chunk but if you want to mmap the entire file, it isn't real useful. On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: What do you mean by local chunk? I think it's providing access to the underlying file block? On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com wrote: Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can however it will require customization of HDFS. Take a look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch. I have been altering it for use with HBASE-3529. Note that the patch noted is for the -append branch which is mainly for HBase. On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com wrote: We have some very large files that we access via memory mapping in Java. Someone's asked us about how to make this conveniently deployable in Hadoop. If we tell them to put the files into hdfs, can we obtain a File for the underlying file on any given node? This features it not yet part of hadoop so doing this is not convenient.
Re: Architectural question
There are no subtle ways to deal with quadratic problems like this. They just don't scale. Your suggestions are roughly on course. When matching 10GB against 50GB, the choice of which input to use as input to the mapper depends a lot on how much you can buffer in memory and how long such a buffer takes to build. If you can't store the entire 10GB of data in memory at once, then consider a program like this: a) split the 50GB of data across as many mappers as you have using standard methods b) in the mapper, emit each record several times with keys of the form (i, j) where i cycles through [0,n) and is incremented once for each record read and j cycles through [0, m) and is incremented each time you emit a record. Choose m so that 1/m of your 10GB data will fit in your reducer's memory. Choose n so that n x m is as large as your desired number of reducers. c) in the reducer, you will get some key (i,j) and an iterator for a number of records. Read the j-th segment of your 10GB data and compare each of the records that the iterator gives you to that data. If you made n = 1 in step (b), then you will have at most m-way parallelism in this step. If n is large, however, your reducer may need to read the same segment of your 10GB data more than once. In such conditions you may want to sort the records and remember which segment you have already read. In general, though, as I mentioned this is not a scalable process and as your data grows it is likely to become untenable. If you can split your data into pieces and estimate which piece each record should be matched to then you might be able to make the process more scalable. Consider indexing techniques to do this rough targeting. For instance, if you are trying to find the closest few strings based on edit distance, you might be able to use n-grams to get approximate matches via a text retrieval index. This can substantially reduce the cost of your algorithm. On Sun, Apr 10, 2011 at 2:10 PM, oleksiy gayduk.a.s...@mail.ru wrote: ... Persistent data and input data don't have common keys. In my cluster I have 5 data nodes. The app does a simple match of every line of input data with every line of persistent data. ... And maybe there is a more subtle way in hadoop to do this work?
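A rough sketch of the mapper in step (b) follows (old mapred API). N_GROUPS and M_SEGMENTS are illustrative constants standing in for n and m; in practice they would come from the job configuration:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Step (b): replicate each record of the 50GB input to m reducer keys (i, j).
    public class ReplicatingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private static final int N_GROUPS = 4;     // n: groups of 50GB records
      private static final int M_SEGMENTS = 10;  // m: segments of the 10GB data; 1/m fits in memory
      private long recordCount = 0;

      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        int i = (int) (recordCount++ % N_GROUPS);     // i advances once per record read
        for (int j = 0; j < M_SEGMENTS; j++) {        // j cycles once per emitted copy
          out.collect(new Text(i + "," + j), record); // key (i, j), value = the record itself
        }
      }
    }

Each reducer then sees a key (i, j), loads the j-th segment of the 10GB data into memory, and compares it against every record the iterator hands it.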
Re: Architectural question
The original poster said that there was no common key. Your suggestion presupposes that such a key exists. On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu mehmets...@gmail.com wrote: My understanding is you have two sets of strings S1 and S2 and you want to mark all strings that belong to both sets. If this is correct, then: Mapper: for all strings K in Si (i is 1 or 2) emit: key K and value i. Reducer: For key K, if the list of values includes both 1 and 2, you have a match, emit: K MATCH, else emit: K NO_MATCH (or nothing). I assume that the load is not terribly unbalanced. The logic goes for intersection of any number of sets. Mark the members with their sets, reduce over them to see if they belong to every set. Good luck. On Apr 10, 2011, at 2:10 PM, oleksiy wrote: Hi all, I have an architectural question. For my app I have 50 GB of persistent data, which is stored in HDFS as a simple CSV format file. Also, for the app which should be run over this (50 GB) data I have 10 GB of input data, also in CSV format. Persistent data and input data don't have common keys. In my cluster I have 5 data nodes. The app does a simple match of every line of input data with every line of persistent data. For solving this task I see two different approaches: 1. Distribute the input file to every node using the -files attribute, and run the job. But in this case every map will go through 10 GB of input data. 2. Divide the input file (10 GB) into 5 parts (for instance), run 5 independent jobs (one per data node, for instance), and for every job we will put 2 GB of data. In this case every map should go through 2 GB of data. In other words I'll give every map node its own input data. But the drawback of this approach is the work I have to do before starting the job and after the job finishes. And maybe there is a more subtle way in hadoop to do this work?
Re: We are looking to the root of the problem that caused us IOException
Yes. At least periodically. You now have a situation where the age distribution of blocks in each datanode is quite different. This will lead to different evolution of which files are retained and that is likely to cause imbalances again. It will also cause the performance of your system to be degraded since any given program probably will have a non-uniform distribution of content in your cluster. Over time, this effect will probably decrease unless you keep your old files forever. On Tue, Apr 5, 2011 at 11:34 PM, Guy Doulberg guy.doulb...@conduit.com wrote: Do you think we should always run the balancer, with low bandwidth, and not only after adding new nodes?
Re: We are looking to the root of the problem that caused us IOException
You can configure the balancer to use higher bandwidth. That can speed it up by 10x. On Tue, Apr 5, 2011 at 2:54 AM, Guy Doulberg guy.doulb...@conduit.com wrote: We are running the balancer, but it takes a lot of time... during this time the cluster is not working
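For what it's worth, the knob in the 0.20/0.21-era configs is dfs.balance.bandwidthPerSec, set in hdfs-site.xml on the datanodes (the default is 1 MB/s); the value below is only an example of roughly a 10x increase:

    <!-- hdfs-site.xml on each datanode: per-node cap on balancer traffic -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <!-- bytes per second; 10485760 = 10 MB/s versus the default 1048576 (1 MB/s) -->
      <value>10485760</value>
    </property>

In these versions the datanodes read this setting at startup, so it has to be in place (and the datanodes restarted) before the balancer run.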
Re: HBase schema design
The hbase list would be more appropriate. See http://hbase.apache.org/mail-lists.html There is an active IRC channel, but your question fits the mailing list better so pop on over and I will give you some comments. In the meantime, take a look at OpenTSDB who are doing something very much like what you want to do. On Mon, Apr 4, 2011 at 8:43 AM, Miguel Costa miguel-co...@telecom.pt wrote: Hi, I need some help with a schema design on HBase. I have 5 dimensions (Time, Site, Referrer, Keyword, Country). My row key is Site+Time. Now I want to answer some questions like what is the top Referrer by Keyword for a site over a period of time. Basically I want to cross all the dimensions that I have. And if I have 30 dimensions? What is the best schema design? Please let me know if this isn’t the right mailing list. Thank you for your time. Miguel
Re: Reverse Indexing Programming Help
It would help to get a good book. There are several. For your program, there are several things that will trip you up: a) lots of little files is going to be slow. You want input that is 100MB per file if you want speed. b) That file format is a bit cheesy since it is hard to tell URL's from text if you concatenate lots of files. Better to use a format like protobufs or Avro or even sequence files to separate the key and the data unambiguously. c) I suspect that what you are asking for is to run a mapper so that each invocation of map gets the URL as key and the text as data. That map invocation can then tokenize the data and emit records with the URL as key and each word as data. That isn't much use since the reducer will get the URL and all the words that were emitted for that URL. If each URL appears exactly once, then the input already had that. Perhaps you mean to emit the word as key and URL as data. Then the reducer will see the word as key and an iterator over all the URLs that mentioned the word. On Thu, Mar 31, 2011 at 9:48 PM, DoomUs ddu...@nmt.edu wrote: I'm just starting out using Hadoop. I've looked through the java examples, and have an idea about what's going on, but don't really get it. I'd like to write a program that takes a directory of files. Contained in those files are a URL to a website on the first line, and the second line is the TEXT from that website. The mapper should create a map for each word in the text to that URL, so every word found on the website would map to the URL. The reducer then, would collect all of the URLs that are mapped to via a given word. Each Word-URL is then written to a file. So, it's simple as a program designed to run on a single system, but I want to be able to distribute the computation and whatnot using Hadoop. I'm extremely new to Hadoop, I'm not even sure how to ask all of the questions I'd like answers for, I have zero experience in MapReduce, and limited experience in functional programming at all. Any programming tips, or if I have my Mapper or Reducer defined incorrectly, corrections, etc would be greatly appreciated. Questions: How do I read (and write) files from hdfs? Once I've read them, How do I distribute the files to be mapped? I know I need a class to implement the mapper, and one to implement the reducer, but how does the class have a return type to output the map? Thanks a lot for your help. -- View this message in context: http://old.nabble.com/Reverse-Indexing-Programming-Help-tp31292449p31292449.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
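To make option (c) concrete, here is a rough sketch (old mapred API). It assumes an input format such as KeyValueTextInputFormat so that each map() call sees the URL as the key and the page text as the value, and each class would live in its own file:

    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Map: for every word on the page, emit (word, url).
    public class InvertedIndexMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
      public void map(Text url, Text pageText,
                      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        for (String word : pageText.toString().toLowerCase().split("\\W+")) {
          if (word.length() > 0) {
            out.collect(new Text(word), url);   // the word is the key, the URL is the data
          }
        }
      }
    }

    // Reduce: gather all URLs that mention a given word into one output line.
    public class InvertedIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text word, Iterator<Text> urls,
                         OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        Set<String> seen = new LinkedHashSet<String>();   // de-duplicate URLs for this word
        while (urls.hasNext()) {
          seen.add(urls.next().toString());
        }
        StringBuilder list = new StringBuilder();
        for (String u : seen) {
          if (list.length() > 0) list.append(' ');
          list.append(u);
        }
        out.collect(word, new Text(list.toString()));   // word -> all URLs containing it
      }
    }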
Re: Chukwa - Lightweight agents
Take a look at openTsDb at http://opentsdb.net/ It provides lots of the capability in a MUCH simpler package. On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote: Sorry but it doesn't look like Chukwa mailing list exists anymore? Is there an easy way to set up lightweight agents on cluster of machines instead of downloading the full Chukwa source (+50mb)? Has anyone build separate RPM's for the agents/collectors? Thanks
Re: Chukwa - Lightweight agents
OpenTSDB is purely a monitoring solution, which is also the primary mission of Chukwa. If you are looking for data import, what about Flume? On Sun, Mar 20, 2011 at 9:59 AM, Mark static.void@gmail.com wrote: Thanks but we need Chukwa to aggregate and store files from across our app servers into Hadoop. It doesn't really look like opentsdb is meant for that. I could be wrong though? On 3/20/11 9:49 AM, Ted Dunning wrote: Take a look at openTsDb at http://opentsdb.net/ It provides lots of the capability in a MUCH simpler package. On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote: Sorry but it doesn't look like the Chukwa mailing list exists anymore? Is there an easy way to set up lightweight agents on a cluster of machines instead of downloading the full Chukwa source (+50mb)? Has anyone built separate RPMs for the agents/collectors? Thanks
Re: Inserting many small files into HBase
Take a look at this: http://wiki.apache.org/hadoop/Hbase/DesignOverview then read the bigtable paper. On Sun, Mar 20, 2011 at 6:39 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl thousands of news rss feeds via MapReduce, and save each news article into HBase directly. My concern is that Hadoop does not work well with a large number of small-size files, and if I insert every single news article (which is small-size apparently) into HBase, (without separately storing it into HDFS) I might end up with millions of files that are only several kilobytes in size. Or does HBase somehow automatically append each news article into a single file, so that it would have only a few files of large-size? Ed
Re: decommissioning node woes
Unfortunately this doesn't help much because it is hard to get the ports to balance the load. On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel michael_se...@hotmail.comwrote: With a 1GBe port, you could go 100Mbs for the bandwidth limit. If you bond your ports, you could go higher.
Re: how to build kmeans
These java files are full of HTML. Are you sure that they are supposed to compile? How did you get these files? On Fri, Mar 18, 2011 at 3:12 AM, MANISH SINGLA coolmanishh...@gmail.comwrote: Hii everyone... I am trying to run kmeans on a single node... I have the attached files with me...I put them in a folder named kmeans...and place it in the src/examples directory... Then i go to hadoop home ant type ant examples on the command line... but I am getting some errors...which I am not able to understand...any help wpuld be appreciated... ON COMMAND LINE: ant examples //tHIS i TYPE ON THE COMMAND LINE OUTPUT: //AS A result I get all this Buildfile: build.xml clover.setup: clover.info: [echo] [echo] Clover not found. Code coverage reports disabled. [echo] clover: ivy-download: [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar [get] To: /home/master/Desktop/hadoop-0.20.2/ivy/ivy-2.0.0-rc2.jar [get] Not modified - so not downloaded ivy-init-dirs: ivy-probe-antlib: ivy-init-antlib: ivy-init: [ivy:configure] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ :: :: loading settings :: file = /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml ivy-resolve-common: [ivy:resolve] :: resolving dependencies :: org.apache.hadoop#Hadoop;working@master [ivy:resolve] confs: [common] [ivy:resolve] found commons-logging#commons-logging;1.0.4 in maven2 [ivy:resolve] found log4j#log4j;1.2.15 in maven2 [ivy:resolve] found commons-httpclient#commons-httpclient;3.0.1 in maven2 [ivy:resolve] found commons-codec#commons-codec;1.3 in maven2 [ivy:resolve] found commons-cli#commons-cli;1.2 in maven2 [ivy:resolve] found xmlenc#xmlenc;0.52 in maven2 [ivy:resolve] found net.java.dev.jets3t#jets3t;0.6.1 in maven2 [ivy:resolve] found commons-net#commons-net;1.4.1 in maven2 [ivy:resolve] found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2 [ivy:resolve] found oro#oro;2.0.8 in maven2 [ivy:resolve] found org.mortbay.jetty#jetty;6.1.14 in maven2 [ivy:resolve] found org.mortbay.jetty#jetty-util;6.1.14 in maven2 [ivy:resolve] found tomcat#jasper-runtime;5.5.12 in maven2 [ivy:resolve] found tomcat#jasper-compiler;5.5.12 in maven2 [ivy:resolve] found commons-el#commons-el;1.0 in maven2 [ivy:resolve] found junit#junit;3.8.1 in maven2 [ivy:resolve] found commons-logging#commons-logging-api;1.0.4 in maven2 [ivy:resolve] found org.slf4j#slf4j-api;1.4.3 in maven2 [ivy:resolve] found org.eclipse.jdt#core;3.1.1 in maven2 [ivy:resolve] found org.slf4j#slf4j-log4j12;1.4.3 in maven2 [ivy:resolve] found org.mockito#mockito-all;1.8.0 in maven2 [ivy:resolve] :: resolution report :: resolve 612ms :: artifacts dl 24ms - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | common | 21 | 0 | 0 | 0 || 21 | 0 | - ivy-retrieve-common: [ivy:retrieve] :: retrieving :: org.apache.hadoop#Hadoop [ivy:retrieve] confs: [common] [ivy:retrieve] 0 artifacts copied, 21 already retrieved (0kB/16ms) No ivy:settings found for the default reference 'ivy.instance'. 
A default instance will be used DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead :: loading settings :: file = /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml init: [touch] Creating /tmp/null1787905831 [delete] Deleting: /tmp/null1787905831 [exec] src/saveVersion.sh: 34: svn: not found [exec] src/saveVersion.sh: 34: svn: not found record-parser: compile-rcc-compiler: compile-core-classes: compile-mapred-classes: [javac] Compiling 1 source file to /home/master/Desktop/hadoop-0.20.2/build/classes compile-hdfs-classes: [javac] Compiling 4 source files to /home/master/Desktop/hadoop-0.20.2/build/classes compile-core-native: check-c++-makefiles: create-c++-pipes-makefile: create-c++-utils-makefile: compile-c++-utils: compile-c++-pipes: compile-c++: compile-core: jar: [tar] Nothing to do: /home/master/Desktop/hadoop-0.20.2/build/classes/bin.tgz is up to date. [jar] Building jar: /home/master/Desktop/hadoop-0.20.2/build/hadoop-0.20.3-dev-core.jar compile-tools: create-c++-examples-pipes-makefile: compile-c++-examples-pipes: compile-c++-examples: compile-examples: [javac] Compiling 9 source files to /home/master/Desktop/hadoop-0.20.2/build/examples [javac] /home/master/Desktop/hadoop-0.20.2/src/examples/org/apache/hadoop/examples/kmeans/KmeansUtil.java:5: class, interface, or enum expected [javac] !DOCTYPE html [javac] ^
Re: how to build kmeans
This looks like you took the code from http://code.google.com/p/kmeans-hadoop/ http://code.google.com/p/kmeans-hadoop/And it looks like you didn't actually download the code, but you cut and pasted the HTML rendition of the code. First, this code is not from a serious project. It is more of a status dump of somebody just trying some stuff out. If you really want k-means, go get Mahout. http://mahout.apache.org/ On Fri, Mar 18, 2011 at 3:12 AM, MANISH SINGLA coolmanishh...@gmail.comwrote: Hii everyone... I am trying to run kmeans on a single node... I have the attached files with me...I put them in a folder named kmeans...and place it in the src/examples directory... Then i go to hadoop home ant type ant examples on the command line... but I am getting some errors...which I am not able to understand...any help wpuld be appreciated... ON COMMAND LINE: ant examples //tHIS i TYPE ON THE COMMAND LINE OUTPUT: //AS A result I get all this Buildfile: build.xml clover.setup: clover.info: [echo] [echo] Clover not found. Code coverage reports disabled. [echo] clover: ivy-download: [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar [get] To: /home/master/Desktop/hadoop-0.20.2/ivy/ivy-2.0.0-rc2.jar [get] Not modified - so not downloaded ivy-init-dirs: ivy-probe-antlib: ivy-init-antlib: ivy-init: [ivy:configure] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ :: :: loading settings :: file = /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml ivy-resolve-common: [ivy:resolve] :: resolving dependencies :: org.apache.hadoop#Hadoop;working@master [ivy:resolve] confs: [common] [ivy:resolve] found commons-logging#commons-logging;1.0.4 in maven2 [ivy:resolve] found log4j#log4j;1.2.15 in maven2 [ivy:resolve] found commons-httpclient#commons-httpclient;3.0.1 in maven2 [ivy:resolve] found commons-codec#commons-codec;1.3 in maven2 [ivy:resolve] found commons-cli#commons-cli;1.2 in maven2 [ivy:resolve] found xmlenc#xmlenc;0.52 in maven2 [ivy:resolve] found net.java.dev.jets3t#jets3t;0.6.1 in maven2 [ivy:resolve] found commons-net#commons-net;1.4.1 in maven2 [ivy:resolve] found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2 [ivy:resolve] found oro#oro;2.0.8 in maven2 [ivy:resolve] found org.mortbay.jetty#jetty;6.1.14 in maven2 [ivy:resolve] found org.mortbay.jetty#jetty-util;6.1.14 in maven2 [ivy:resolve] found tomcat#jasper-runtime;5.5.12 in maven2 [ivy:resolve] found tomcat#jasper-compiler;5.5.12 in maven2 [ivy:resolve] found commons-el#commons-el;1.0 in maven2 [ivy:resolve] found junit#junit;3.8.1 in maven2 [ivy:resolve] found commons-logging#commons-logging-api;1.0.4 in maven2 [ivy:resolve] found org.slf4j#slf4j-api;1.4.3 in maven2 [ivy:resolve] found org.eclipse.jdt#core;3.1.1 in maven2 [ivy:resolve] found org.slf4j#slf4j-log4j12;1.4.3 in maven2 [ivy:resolve] found org.mockito#mockito-all;1.8.0 in maven2 [ivy:resolve] :: resolution report :: resolve 612ms :: artifacts dl 24ms - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | common | 21 | 0 | 0 | 0 || 21 | 0 | - ivy-retrieve-common: [ivy:retrieve] :: retrieving :: org.apache.hadoop#Hadoop [ivy:retrieve] confs: [common] [ivy:retrieve] 0 artifacts copied, 21 already retrieved (0kB/16ms) No ivy:settings found for the default reference 'ivy.instance'. 
A default instance will be used DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead :: loading settings :: file = /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml init: [touch] Creating /tmp/null1787905831 [delete] Deleting: /tmp/null1787905831 [exec] src/saveVersion.sh: 34: svn: not found [exec] src/saveVersion.sh: 34: svn: not found record-parser: compile-rcc-compiler: compile-core-classes: compile-mapred-classes: [javac] Compiling 1 source file to /home/master/Desktop/hadoop-0.20.2/build/classes compile-hdfs-classes: [javac] Compiling 4 source files to /home/master/Desktop/hadoop-0.20.2/build/classes compile-core-native: check-c++-makefiles: create-c++-pipes-makefile: create-c++-utils-makefile: compile-c++-utils: compile-c++-pipes: compile-c++: compile-core: jar: [tar] Nothing to do: /home/master/Desktop/hadoop-0.20.2/build/classes/bin.tgz is up to date. [jar] Building jar: /home/master/Desktop/hadoop-0.20.2/build/hadoop-0.20.3-dev-core.jar compile-tools: create-c++-examples-pipes-makefile: compile-c++-examples-pipes: compile-c++-examples:
Re: decommissioning node woes
If nobody else more qualified is willing to jump in, I can at least provide some pointers. What you describe is a bit surprising. I have zero experience with any 0.21 version, but decommissioning was working well in much older versions, so this would be a surprising regression. The observations you have aren't all inconsistent with how decommissioning should work. The fact that your nodes still show up after starting the decommissioning isn't so strange. The idea is that no new data will be put on the node, nor should it be counted as a replica, but it will help in reading data. So that isn't such a big worry. The fact that it takes forever and a day, however, is a big worry. I cannot provide any help there just off hand. What happens when a datanode goes down? Do you see under-replicated files? Does the number of such files decrease over time? On Fri, Mar 18, 2011 at 4:23 AM, Rita rmorgan...@gmail.com wrote: Any help? On Wed, Mar 16, 2011 at 9:36 PM, Rita rmorgan...@gmail.com wrote: Hello, I have been struggling with decommissioning data nodes. I have a 50+ data node cluster (no MR) with each server holding about 2TB of storage. I split the nodes into 2 racks. I edit the 'exclude' file and then do a -refreshNodes. I see the node immediately in 'Decommissioned nodes' and I also see it as a 'live' node! Even though I wait 24+ hours it's still like this. I am suspecting it's a bug in my version. The data node process is still running on the node I am trying to decommission. So, sometimes I kill -9 the process and I see the 'under replicated' blocks... this can't be the normal procedure. There were even times that I had corrupt blocks because I was impatient -- waited 24-34 hours. I am using the 23 August, 2010 release 0.21.0 http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available version. Is this a known bug? Is there anything else I need to do to decommission a node? -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: decommissioning node woes
Unless the last copy is on that node. Decommissioning is the only safe way to shut off 10 nodes at once. Doing them one at a time and waiting for replication to (asymptotically) recover is painful and error prone. On Fri, Mar 18, 2011 at 9:08 AM, James Seigel ja...@tynt.com wrote: Just a note. If you just shut the node off, the blocks will replicate faster. James. On 2011-03-18, at 10:03 AM, Ted Dunning wrote: If nobody else more qualified is willing to jump in, I can at least provide some pointers. What you describe is a bit surprising. I have zero experience with any 0.21 version, but decommissioning was working well in much older versions, so this would be a surprising regression. The observations you have aren't all inconsistent with how decommissioning should work. The fact that your nodes look up after starting the decommissioning isn't so strange. The idea is that no new data will be put on the node, nor should it be counted as a replica, but it will help in reading data. So that isn't such a big worry. The fact that it takes forever and a day, however, is a big worry. I cannot provide any help there just off hand. What happens when a datanode goes down? Do you see under-replicated files? Does the number of such files decrease over time? On Fri, Mar 18, 2011 at 4:23 AM, Rita rmorgan...@gmail.com wrote: Any help? On Wed, Mar 16, 2011 at 9:36 PM, Rita rmorgan...@gmail.com wrote: Hello, I have been struggling with decommissioning data nodes. I have a 50+ data node cluster (no MR) with each server holding about 2TB of storage. I split the nodes into 2 racks. I edit the 'exclude' file and then do a -refreshNodes. I see the node immediate in 'Decommiosied node' and I also see it as a 'live' node! Eventhough I wait 24+ hours its still like this. I am suspecting its a bug in my version. The data node process is still running on the node I am trying to decommission. So, sometimes I kill -9 the process and I see the 'under replicated' blocks...this can't be the normal procedure. There were even times that I had corrupt blocks because I was impatient -- waited 24-34 hours I am using 23 August, 2010: release 0.21.0 http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available version. Is this a known bug? Is there anything else I need to do to decommission a node? -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: hadoop fs -rmr /*?
W.P. is correct, however, that standard techniques like snapshots and mirrors and point-in-time backups do not exist in standard hadoop. This requires a variety of creative work-arounds if you use stock hadoop. It is not uncommon for people to have memories of either removing everything or somebody close to them doing the same thing. Few people have memories of doing it twice. On Wed, Mar 16, 2011 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote: On 03/16/2011 01:35 PM, W.P. McNeill wrote: On HDFS, anyone can run hadoop fs -rmr /* and delete everything. Not sure how you have your installation set up, but on ours (we installed Cloudera CDH), only user hadoop has full read/write access to HDFS. Since we rarely either log in as user hadoop, or run jobs as that user, this forces us to explicitly set and chown directory trees in HDFS that only specific users can access, thus enforcing file read/write restrictions. HTH, DR
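For anyone who wants to copy David's setup, the lockdown amounts to a handful of commands run as the HDFS superuser. The user and path names below are only examples, and this assumes dfs.permissions is left at its default of true:

hadoop fs -mkdir /user/alice
hadoop fs -chown alice:analysts /user/alice
hadoop fs -chmod 700 /user/alice
# only alice (and the superuser) can now write to or list /user/alice
hadoop fs -chmod 755 /
# the root stays readable, but ordinary users cannot remove things from it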
Re: Why hadoop is written in java?
Note that that comment is now 7 years old. See Mahout for a more modern take on numerics using Hadoop (and other tools) for scalable machine learning and data mining. On Wed, Mar 16, 2011 at 10:43 AM, baloodevil dukek...@hotmail.com wrote: See this for comment on java handling numeric calculations like sparse matrices... http://acs.lbl.gov/software/colt/ -- View this message in context: http://lucene.472066.n3.nabble.com/Why-hadoop-is-written-in-java-tp1673148p2688781.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: k-means
Since you asked so nicely: http://www.manning.com/owen/ On Fri, Mar 4, 2011 at 6:52 AM, Mike Nute mike.n...@gmail.com wrote: James, Do you know how to get a copy of this book in early access form? Amazon doesn't release it until may. Thanks! Mike Nute --Original Message-- From: James Seigel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Re: k-means Sent: Mar 4, 2011 9:46 AM I am not near a computer so I won't be able to give you specifics. So instead, I'd suggest Manning's mahout in action book which is in their early access form for some basic direction. Disclosure: I have no relation to the publisher or authors. Cheers James Sent from my mobile. Please excuse the typos. On 2011-03-04, at 7:37 AM, MANISH SINGLA coolmanishh...@gmail.com wrote: are u suggesting me that??? if yes can u plzzz tell me the steps to use that...because I havent used it yet...a quick reply will really be appreciated... Thanx Manish On Fri, Mar 4, 2011 at 7:39 PM, James Seigel ja...@tynt.com wrote: Mahout project? Sent from my mobile. Please excuse the typos. On 2011-03-04, at 6:41 AM, MANISH SINGLA coolmanishh...@gmail.com wrote: Hey ppl... I need some serious help...I m not able to run kmeans code in hadoop...does anyone have a running code...that they would have tried... Regards MANISH
Re: Digital Signal Processing Library + Hadoop
Come on over to the Apache Mahout mailing list for a warm welcome at least. We don't have a lot of time series stuff but would be very interested in hearing more about what you need and would like to see if there are some common issues that we might work on together. On Fri, Mar 4, 2011 at 9:05 PM, Roger Smith rogersmith1...@gmail.comwrote: All - I wonder if any of you have integrated a DSP library with Hadoop. We are considering using Hadoop to processing time series data, but don't want to write standard DSP functions. Roger.
Re: Performance Test
It will be very difficult to do. If you have n machines running 4 different things, you will probably get better results segregating tasks as much as possible. Interactions can be very subtle and can have major impact on performance in a few cases. Hadoop, in general, will use a lot of the resources if they appear to be available. The intent, after all, is to run batch jobs absolutely as fast as your hardware can handle. On Wed, Mar 2, 2011 at 7:31 PM, liupei liu...@xingcloud.com wrote: Hi, I'd like to tune params in hadoop config for my job. But my current cluster runs lot of other processes such as mongod, php gateways and some other routine hadoop jobs. It is impossible to stop all to get a clear environment for testing. Is there any way to get reliable results for my tuning in such a mixture environment? Thanks
Re: Advice for a new open-source project and a license
Bixo may have some useful components. The thrust is different, but some of the pieces are similar. http://bixo.101tec.com/ On Mon, Feb 28, 2011 at 7:57 PM, Mark Kerzner markkerz...@gmail.com wrote: Well, it's more complex than that. I packed all files (or selected directories) into zip files, and those zip files go into HDFS, and they are processed from there. Mark On Mon, Feb 28, 2011 at 9:53 PM, Greg Roelofs roel...@yahoo-inc.com wrote: Mark Kerzner markkerz...@gmail.com wrote: I am working on an open-source project that would be using Hadoop/HDFS/HBase/Tika/Lucene and would make all files on a hard drive searchable. _A_ hard drive? Hadoop? Seems like a bad match. Greg
Re: Advice for a new open-source project and a license
Check out http://www.elasticsearch.org/. Not what you are doing, but possibly a helpful bit of the pie. Also, Solr integrates Tika and Lucene pretty nicely these days. No HBase yet, but it isn't hard to add that. On Mon, Feb 28, 2011 at 1:01 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am working on an open-source project that would be using Hadoop/HDFS/HBase/Tika/Lucene and would make all files on a hard drive searchable. Like Nutch, only applied to hard drives, and like Google Desktop Search, only I want to output information about every file found. Not a big difference though. I am looking for advice on the following: 1. Have you heard of a similar project? 2. What license should I use? I am thinking of Apache V2.0, because it relies on other Apache V2.0 projects; 3. Any other advice? Thank you. Sincerely, Mark
Re: Hadoop Case Studies?
Ted, Greetings back at you. It has been a while. Check out Jimmy Lin and Chris Dyer's book about text processing with hadoop: http://www.umiacs.umd.edu/~jimmylin/book.html On Sun, Feb 27, 2011 at 4:34 PM, Ted Pedersen tpede...@d.umn.edu wrote: Greetings all, I'm teaching an undergraduate Computer Science class that is using Hadoop quite heavily, and would like to include some case studies at various points during this semester. We are using Tom White's Hadoop The Definitive Guide as a text, and that includes a very nice chapter of case studies which might even provide enough material for my purposes. But, I wanted to check and see if there were other case studies out there that might provide motivating and interesting examples of how Hadoop is currently being used. The idea is to find material that goes beyond simply saying X uses Hadoop to explaining in more detail how and why X are using Hadoop. Any hints would be very gratefully received. Cordially, Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse
Re: Hadoop Case Studies?
At any large company that makes heavy use of Hadoop, you aren't going to find any concise description of all the ways that hadoop is used. That said, here is a concise description of some of the ways that hadoop is (was) used at Yahoo: http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing On Sun, Feb 27, 2011 at 7:31 PM, Ted Pedersen tpede...@d.umn.edu wrote: Thanks for all these great ideas. These are really very helpful. What I'm also hoping to find are articles or papers that describe what particular companies or organizations have done with Hadoop. How does Facebook use Hadoop for example (that's one of the case studies in the White book), or how does last.fm use Hadoop (another of the case studies in the White book). One interesting resource is the list of powered by Hadoop projects available here: http://wiki.apache.org/hadoop/PoweredBy Some of these entries provide links to more detailed discussions of what an organization is doing, as in the following from Twitter http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 So any additional descriptions of what specific organizations are doing with Hadoop (to the extent they are willing to share) would be really helpful (these sorts of real world cases tend to be particularly motivating). Cordially, Ted On Sun, Feb 27, 2011 at 9:23 PM, Simon gsmst...@gmail.com wrote: I think you can also simulate PageRank Algorithm with hadoop. Simon - On Sun, Feb 27, 2011 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: This is an exercise that will appeal to undergrads: pull the Craiglist personals ads from several cities, and do text classification. Given a training set of all the cities, attempt to classify test ads by city. (If Peter Harrington is out there, I stole this from you.) Lance On Sun, Feb 27, 2011 at 4:55 PM, Ted Dunning tdunn...@maprtech.com wrote: Ted, Greetings back at you. It has been a while. Check out Jimmy Lin and Chris Dyer's book about text processing with hadoop: http://www.umiacs.umd.edu/~jimmylin/book.html On Sun, Feb 27, 2011 at 4:34 PM, Ted Pedersen tpede...@d.umn.edu wrote: Greetings all, I'm teaching an undergraduate Computer Science class that is using Hadoop quite heavily, and would like to include some case studies at various points during this semester. We are using Tom White's Hadoop The Definitive Guide as a text, and that includes a very nice chapter of case studies which might even provide enough material for my purposes. But, I wanted to check and see if there were other case studies out there that might provide motivating and interesting examples of how Hadoop is currently being used. The idea is to find material that goes beyond simply saying X uses Hadoop to explaining in more detail how and why X are using Hadoop. Any hints would be very gratefully received. Cordially, Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse -- Lance Norskog goks...@gmail.com -- Regards, Simon -- Ted Pedersen http://www.d.umn.edu/~tpederse
Re: Quick question
This is the most important thing that you have said. The map function is called once per unit of input but the mapper object persists for many units of input. You have a little bit of control over how many mapper objects there are and how many machines they are created on and how many pieces your input is broken into. That control is limited, however, unless you build your own input format. The standard input formats are optimized for very large inputs and may not give you the flexibility that you want for your experiments. That is unfortunate for the purpose of learning about hadoop, but hadoop is designed mostly for dealing with very large data and isn't primarily designed to be easy to understand. Where easy coincides with powerful, easy is good, but powerful isn't always easy. On Sunday, February 20, 2011, maha m...@umail.ucsb.edu wrote: So first question: is there a difference between Mappers and maps?
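To make the lifecycle concrete, here is a minimal sketch against the 0.20 new API. The class is invented for illustration; the point is that one object sees setup() once, map() once per record, and cleanup() once.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One CountingMapper instance is constructed per map task; map() then runs once
// for every record in that task's input split.
public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private long recordsSeen;   // persists across map() calls within this task

  @Override
  protected void setup(Context context) {
    recordsSeen = 0;          // called once, before the first map() call
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    recordsSeen++;            // called once per input record
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // called once, after the last map() call for this split
    context.write(new Text("records-in-this-split"), new LongWritable(recordsSeen));
  }
}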
Re: Quick question
The input is effectively split by lines, but under the covers, the actual splits are by byte. Each mapper will cleverly scan from the specified start to the next line after the start point. At the end, it will over-read to the end of the line that is at or after the end of its specified region. This can make the last split be a bit smaller than the others and the first be a bit larger. Practically speaking, however, your 2000 line file is extremely unlikely to be split at all because it is sooo small. On Fri, Feb 18, 2011 at 11:14 AM, maha m...@umail.ucsb.edu wrote: Hi all, I want to check if the following statement is right: If I use TextInputFormat to process a text file with 2000 lines (each ending with \n) with 20 mappers, then each map will have a sequence of COMPLETE LINES. In other words, the input is not split byte-wise but by lines. Is that right? Thank you, Maha
Re: benchmark choices
MalStone looks like a very narrow benchmark. Terasort is also a very narrow and somewhat idiosyncratic benchmark, but it has the characteristic that lots of people use it. You should add PigMix to your list. There are Java versions of the problems in PigMix that make a pretty good set of benchmarks independent of Pig itself. On Fri, Feb 18, 2011 at 1:32 PM, Shrinivas Joshi jshrini...@gmail.com wrote: Which workloads are used for serious benchmarking of Hadoop clusters? Do you care about any of the following workloads: TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench, sample apps shipped with the Hadoop distro like PiEstimator, dbcount etc. Thanks, -Shrinivas
Re: benchmark choices
I just read the malstone report. They report times for a Java version that is many (5x) times slower than for a streaming implementation. That single fact indicates that the Java code is so appallingly bad that this is a very bad benchmark. On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout jim.falg...@pervasive.comwrote: We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the data and the queries, if not the query generator. There is a Jira issue in Hive that discusses the TPC-H benchmark if you're interested. Sorry, I don't remember the issue number offhand. -Original Message- From: Shrinivas Joshi [mailto:jshrini...@gmail.com] Sent: Friday, February 18, 2011 3:32 PM To: common-user@hadoop.apache.org Subject: benchmark choices Which workloads are used for serious benchmarking of Hadoop clusters? Do you care about any of the following workloads : TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench, sample apps shipped with Hadoop distro like PiEstimator, dbcount etc. Thanks, -Shrinivas
Re: Hadoop in Real time applications
Remove the for at the end that got sucked in by the email editor. On Thu, Feb 17, 2011 at 5:56 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ted, thanks for the links, the yahoo.com one doesn't seem to exist? On Wed, Feb 16, 2011 at 11:48 PM, Ted Dunning tdunn...@maprtech.com wrote: Unless you go beyond the current standard semantics, this is true. See here: http://code.google.com/p/hop/ and http://labs.yahoo.com/node/476for alternatives. On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak phatak@gmail.com wrote: Hadoop is not suited for real time applications On Thu, Feb 17, 2011 at 9:47 AM, Karthik Kumar karthik84ku...@gmail.com wrote: Can Hadoop be used for Real time Applications such as banking solutions... -- With Regards, Karthik
Re: DataCreator
Sounds like Pig. Or Cascading. Or Hive. Seriously, isn't this already available? On Wed, Feb 16, 2011 at 7:06 AM, Guy Doulberg guy.doulb...@conduit.com wrote: Hey all, I want to consult with you hadoopers about a Map/Reduce application I want to build. I want to build a map/reduce job that reads files from HDFS, performs some sort of transformation on the file lines, and stores them to several partitions depending on the source of the file or its data. I want this application to be as configurable as possible, so I designed interfaces to Parse, Decorate and Partition (on HDFS) the Data. I want to be able to configure different data flows, with different parsers, decorators and partitioners, using a config file. Do you think you would use such an application? Does it fit an open-source project? Now, I have some technical questions: I was thinking of using reflection to load all the classes I would need according to the configuration during the setup process of the Mapper. Do you think it is a good idea? Is there a way to send the Mapper objects or interfaces from the Job declaration? Thanks,
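On the reflection question: loading the configured classes once per task in the mapper's setup() is a reasonable pattern, and ReflectionUtils does most of the work. A rough sketch, where the Parser interface and the config key are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical plug-in interface that the job wires up at runtime.
interface Parser {
  Text parse(Text line);
}

public class ConfigurableMapper extends Mapper<LongWritable, Text, Text, Text> {
  private Parser parser;

  @Override
  protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    // the driver would do something like conf.set("dataflow.parser.class", "com.example.LogParser")
    Class<? extends Parser> cls = conf.getClass("dataflow.parser.class", null, Parser.class);
    // real code should fail with a clear message when the key is unset
    parser = ReflectionUtils.newInstance(cls, conf);  // one instance per task, reused for every record
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    context.write(new Text("parsed"), parser.parse(value));
  }
}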
Re: Trying out AvatarNode
http://wiki.apache.org/hadoop/HowToContribute On Wed, Feb 16, 2011 at 8:23 AM, Mark Kerzner markkerz...@gmail.com wrote: if there is a place to learn about patch practices, please point me to it.
Re: Hadoop in Real time applications
Unless you go beyond the current standard semantics, this is true. See here: http://code.google.com/p/hop/ and http://labs.yahoo.com/node/476 for alternatives. On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak phatak@gmail.com wrote: Hadoop is not suited for real time applications On Thu, Feb 17, 2011 at 9:47 AM, Karthik Kumar karthik84ku...@gmail.com wrote: Can Hadoop be used for Real time Applications such as banking solutions... -- With Regards, Karthik
Re: recommendation on HDDs
Good idea! Would you like to create the nucleus of such a page? (there might already be something like that) On Tue, Feb 15, 2011 at 8:49 AM, Shrinivas Joshi jshrini...@gmail.comwrote: It would be nice to have a wiki page collecting all this good information.
Re: hadoop 0.20 append - some clarifications
HDFS definitely doesn't follow anything like POSIX file semantics. They may be a vague inspiration for what HDFS does, but generally the behavior of HDFS is not tightly specified. Even the unit tests have some real surprising behavior. On Mon, Feb 14, 2011 at 7:21 AM, Gokulakannan M gok...@huawei.com wrote: I think that in general, the behavior of any program reading data from an HDFS file before hsync or close is called is pretty much undefined. In Unix, users can parallelly read a file when another user is writing a file. And I suppose the sync feature design is based on that. So at any point of time during the file write, parallel users should be able to read the file. https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958 -- *From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* Friday, February 11, 2011 2:14 PM *To:* common-user@hadoop.apache.org; gok...@huawei.com *Cc:* hdfs-u...@hadoop.apache.org; dhr...@gmail.com *Subject:* Re: hadoop 0.20 append - some clarifications I think that in general, the behavior of any program reading data from an HDFS file before hsync or close is called is pretty much undefined. If you don't wait until some point were part of the file is defined, you can't expect any particular behavior. On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M gok...@huawei.com wrote: I am not concerned about the sync behavior. The thing is the reader reading non-flushed(non-synced) data from HDFS as you have explained in previous post.(in hadoop 0.20 append branch) I identified one specific scenario where the above statement is not holding true. Following is how you can reproduce the problem. 1. add debug point at createBlockOutputStream() method in DFSClient and run your HDFS write client in debug mode 2. allow client to write 1 block to HDFS 3. for the 2nd block, the flow will come to the debug point mentioned in 1(do not execute the createBlockOutputStream() method). hold here. 4. parallely, try to read the file from another client Now you will get an error saying that file cannot be read. _ From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Friday, February 11, 2011 11:04 AM To: gok...@huawei.com Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org; c...@boudnik.org Subject: Re: hadoop 0.20 append - some clarifications It is a bit confusing. SequenceFile.Writer#sync isn't really sync. There is SequenceFile.Writer#syncFs which is more what you might expect to be sync. Then there is HADOOP-6313 which specifies hflush and hsync. Generally, if you want portable code, you have to reflect a bit to figure out what can be done. On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote: Thanks Ted for clarifying. So the sync is to just flush the current buffers to datanode and persist the block info in namenode once per block, isn't it? Regarding reader able to see the unflushed data, I faced an issue in the following scneario: 1. a writer is writing a 10MB file(block size 2 MB) 2. wrote the file upto 4MB (2 finalized blocks in current and nothing in blocksBeingWritten directory in DN) . So 2 blocks are written 3. client calls addBlock for the 3rd block on namenode and not yet created outputstream to DN(or written anything to DN). At this point of time, the namenode knows about the 3rd block but the datanode doesn't. 4. 
at point 3, a reader is trying to read the file and he is getting exception and not able to read the file as the datanode's getBlockInfo returns null to the client(of course DN doesn't know about the 3rd block yet) In this situation the reader cannot see the file. But when the block writing is in progress , the read is successful. Is this a bug that needs to be handled in append branch? -Original Message- From: Konstantin Boudnik [mailto:c...@boudnik.org] Sent: Friday, February 11, 2011 4:09 AM To: common-user@hadoop.apache.org Subject: Re: hadoop 0.20 append - some clarifications You might also want to check append design doc published at HDFS-265 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's design doc won't apply to it. _ From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Thursday, February 10, 2011 9:29 PM To: common-user@hadoop.apache.org; gok...@huawei.com Cc: hdfs-u...@hadoop.apache.org Subject: Re: hadoop 0.20 append - some clarifications Correct is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. There is no guarantee that data is visible or invisible after writing, but there is a guarantee that it will become visible after
Re: Is this a fair summary of HDFS failover?
Note that the document purports to be from 2008 and, at best, was uploaded just about a year ago. That it is still pretty accurate is kind of a tribute to either the stability of hbase or the stagnation, depending on how you read it. On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner markkerz...@gmail.com wrote: As a conclusion, when building an HA HDFS cluster, one needs to follow the best practices outlined by Tom White http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf, and may still need to resort to specialized NFS filers for running the NameNode.
Re: recommendation on HDDs
The original poster also seemed somewhat interested in disk bandwidth. That is facilitated by having more than one disk in the box. On Sat, Feb 12, 2011 at 8:26 AM, Michael Segel michael_se...@hotmail.com wrote: Since the OP believes that their requirement is 1TB per node... a single 2TB drive would be the best choice. It allows for additional space and you really shouldn't be too worried about disk i/o being your bottleneck.
Re: Which strategy is proper to run an this enviroment?
This sounds like it will be very inefficient. There is considerable overhead in starting Hadoop jobs. As you describe it, you will be starting thousands of jobs and paying this penalty many times. Is there a way that you could process all of the directories in one map-reduce job? Can you combine these directories into a single directory with a few large files? On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kim juneng...@gmail.com wrote: Hi. I have a small cluster (9 nodes) to run hadoop here. On this cluster, hadoop will take thousands of directories sequentially. In each dir, there are two input files for m/r. Sizes of input files are from 1M to 5G bytes. In summary, each hadoop job will take one of these dirs. To get the best performance, which strategy is proper for us? Could you suggest something? Which configuration is best? Ps) physical memory size is 12G on each node.
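To fold all of the directories into a single job, assuming they share one input format, the driver can simply add every directory as an input path. The parent path here is invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class OneJobOverManyDirs {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "all-dirs-at-once");
    job.setJarByClass(OneJobOverManyDirs.class);

    FileSystem fs = FileSystem.get(job.getConfiguration());
    // add every directory under a common parent to the same job
    for (FileStatus stat : fs.listStatus(new Path("/data/incoming"))) {
      if (stat.isDir()) {
        FileInputFormat.addInputPath(job, stat.getPath());
      }
    }
    // ... set the mapper, reducer and output path as usual, then:
    // job.waitForCompletion(true);
  }
}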
Re: hadoop 0.20 append - some clarifications
I think that in general, the behavior of any program reading data from an HDFS file before hsync or close is called is pretty much undefined. If you don't wait until some point were part of the file is defined, you can't expect any particular behavior. On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M gok...@huawei.com wrote: I am not concerned about the sync behavior. The thing is the reader reading non-flushed(non-synced) data from HDFS as you have explained in previous post.(in hadoop 0.20 append branch) I identified one specific scenario where the above statement is not holding true. Following is how you can reproduce the problem. 1. add debug point at createBlockOutputStream() method in DFSClient and run your HDFS write client in debug mode 2. allow client to write 1 block to HDFS 3. for the 2nd block, the flow will come to the debug point mentioned in 1(do not execute the createBlockOutputStream() method). hold here. 4. parallely, try to read the file from another client Now you will get an error saying that file cannot be read. _ From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Friday, February 11, 2011 11:04 AM To: gok...@huawei.com Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org; c...@boudnik.org Subject: Re: hadoop 0.20 append - some clarifications It is a bit confusing. SequenceFile.Writer#sync isn't really sync. There is SequenceFile.Writer#syncFs which is more what you might expect to be sync. Then there is HADOOP-6313 which specifies hflush and hsync. Generally, if you want portable code, you have to reflect a bit to figure out what can be done. On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote: Thanks Ted for clarifying. So the sync is to just flush the current buffers to datanode and persist the block info in namenode once per block, isn't it? Regarding reader able to see the unflushed data, I faced an issue in the following scneario: 1. a writer is writing a 10MB file(block size 2 MB) 2. wrote the file upto 4MB (2 finalized blocks in current and nothing in blocksBeingWritten directory in DN) . So 2 blocks are written 3. client calls addBlock for the 3rd block on namenode and not yet created outputstream to DN(or written anything to DN). At this point of time, the namenode knows about the 3rd block but the datanode doesn't. 4. at point 3, a reader is trying to read the file and he is getting exception and not able to read the file as the datanode's getBlockInfo returns null to the client(of course DN doesn't know about the 3rd block yet) In this situation the reader cannot see the file. But when the block writing is in progress , the read is successful. Is this a bug that needs to be handled in append branch? -Original Message- From: Konstantin Boudnik [mailto:c...@boudnik.org] Sent: Friday, February 11, 2011 4:09 AM To: common-user@hadoop.apache.org Subject: Re: hadoop 0.20 append - some clarifications You might also want to check append design doc published at HDFS-265 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's design doc won't apply to it. _ From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Thursday, February 10, 2011 9:29 PM To: common-user@hadoop.apache.org; gok...@huawei.com Cc: hdfs-u...@hadoop.apache.org Subject: Re: hadoop 0.20 append - some clarifications Correct is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. 
There is no guarantee that data is visible or invisible after writing, but there is a guarantee that it will become visible after sync or close. On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote: Is this the correct behavior or my understanding is wrong?
Re: hadoop 0.20 append - some clarifications
Correct is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. There is no guarantee that data is visible or invisible after writing, but there is a guarantee that it will become visible after sync or close. On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote: Is this the correct behavior or my understanding is wrong?
Re: recommendation on HDDs
Get bigger disks. Data only grows and having extra is always good. You can get 2TB drives for $100 and 1TB for $75. As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be about the same (ish). Seek times will vary a bit with rotation speed, but with Hadoop, you will be doing long reads and writes. Your controller and backplane will have a MUCH bigger vote in getting acceptable performance. With only 4 or 5 drives, you don't have to worry about a super-duper backplane, but you can still kill performance with a lousy controller. On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi jshrini...@gmail.com wrote: What would be a good hard drive for a 7 node cluster which is targeted to run a mix of IO and CPU intensive Hadoop workloads? We are looking for around 1 TB of storage on each node distributed amongst 4 or 5 disks. So either 250GB * 4 disks or 160GB * 5 disks. Also it should be less than $100 each ;) I looked at HDD benchmark comparisons on tomshardware, storagereview etc. Got overwhelmed with the # of benchmarks and different aspects of HDD performance. Appreciate your help on this. -Shrinivas
Re: hadoop 0.20 append - some clarifications
It is a bit confusing. SequenceFile.Writer#sync isn't really sync. There is SequenceFile.Writer#syncFs which is more what you might expect to be sync. Then there is HADOOP-6313 which specifies hflush and hsync. Generally, if you want portable code, you have to reflect a bit to figure out what can be done. On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote: Thanks Ted for clarifying. So the *sync* is to just flush the current buffers to datanode and persist the block info in namenode once per block, isn't it? Regarding reader able to see the unflushed data, I faced an issue in the following scneario: 1. a writer is writing a *10MB* file(block size 2 MB) 2. wrote the file upto 4MB (2 finalized blocks in *current* and nothing in *blocksBeingWritten* directory in DN) . So 2 blocks are written 3. client calls addBlock for the 3rd block on namenode and not yet created outputstream to DN(or written anything to DN). At this point of time, the namenode knows about the 3rd block but the datanode doesn't. 4. at point 3, a reader is trying to read the file and he is getting exception and not able to read the file as the datanode's getBlockInfo returns null to the client(of course DN doesn't know about the 3rd block yet) In this situation the reader cannot see the file. But when the block writing is in progress , the read is successful. *Is this a bug that needs to be handled in append branch?* -Original Message- From: Konstantin Boudnik [mailto:c...@boudnik.org] Sent: Friday, February 11, 2011 4:09 AM To: common-user@hadoop.apache.org Subject: Re: hadoop 0.20 append - some clarifications You might also want to check append design doc published at HDFS-265 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's design doc won't apply to it. -- *From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* Thursday, February 10, 2011 9:29 PM *To:* common-user@hadoop.apache.org; gok...@huawei.com *Cc:* hdfs-u...@hadoop.apache.org *Subject:* Re: hadoop 0.20 append - some clarifications Correct is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. There is no guarantee that data is visible or invisible after writing, but there is a guarantee that it will become visible after sync or close. On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote: Is this the correct behavior or my understanding is wrong?
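A tiny sketch of that distinction, written against the 0.20-append era API. The path is made up, and on branches without syncFs you are back to reflecting for hflush/hsync as described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/append-demo.seq"), Text.class, LongWritable.class);

    writer.append(new Text("row"), new LongWritable(42));
    // writer.sync() would only write a sync marker into the stream; it is not a flush
    writer.syncFs();   // pushes buffered data down to the datanodes
    writer.close();    // after close, the data is guaranteed to become visible to readers
  }
}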
Re: Can single map-reduce solve this problem
mapper should produce (k,1), (1, v) for lines k,v in file1 and should produce (k,2), (2,v) for lines k,v in file2. Your partition function should look at only the first member of the key tuple, but should order on both members. Your reducer will get data like this: (k,1), [(1,v)] or like this (k, 1), [(1,v1),(2,v2)] In the first case, it should emit k, v. In the second, k,v2. More simply, it should simply emit the last value in the reduce group. In actual practice, you should probably use something fancier than an integer to tag the data. You will also have to find some kind of appropriate tuple structure. Pig, Cascading, Plume and Hive would make this easier than straight Java, but all techniques would work. On Tue, Feb 8, 2011 at 4:26 PM, Gururaj S Mayya gsma...@yahoo.com wrote: Any pointers as to how this could be done?
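Here is a compact Java sketch of the tagging idea. To keep it short it skips the secondary sort and simply prefers the file2 value inside the reducer, which is fine when each key occurs at most once per file; the tab-separated line format and the file-name convention are assumptions made up for this example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: tag each value with the file it came from. Lines are assumed to look
// like "key<TAB>value"; files from the second data set are assumed to be named file2*.
public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] kv = line.toString().split("\t", 2);
    String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
    String tag = file.startsWith("file2") ? "2" : "1";
    ctx.write(new Text(kv[0]), new Text(tag + "\t" + kv[1]));
  }
}

// Reducer: when both files contributed a value, the file2 value wins.
class OverrideReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    String chosen = null;
    for (Text v : values) {
      String[] tagAndValue = v.toString().split("\t", 2);
      if (chosen == null || tagAndValue[0].equals("2")) {
        chosen = tagAndValue[1];
      }
    }
    ctx.write(key, new Text(chosen));
  }
}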
Re: Hadoop XML Error
This is due to the security API not being available. You are crossing from a cluster with security to one without and that is causing confusion. Presumably your client assumes that it is available and your hadoop library doesn't provide it. Check your class path very carefully looking for version assumptions and confusions. On Mon, Feb 7, 2011 at 11:43 AM, Korb, Michael [USA] korb_mich...@bah.comwrote: We're migrating from CDH3b3 to a recent build of 0.20-append published by Ryan Rawson. This isn't something covered by normal upgrade scripts. I've tried several commands with different protocols and port numbers, but now keep getting the same error: 11/02/07 14:35:06 INFO tools.DistCp: srcPaths=[hftp://mc1:50070/] 11/02/07 14:35:06 INFO tools.DistCp: destPath=hdfs://mc0:55310/ Exception in thread main java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobConf.getCredentials()Lorg/apache/hadoop/security/Credentials; at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:632) at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656) at org.apache.hadoop.tools.DistCp.run(DistCp.java:881) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.tools.DistCp.main(DistCp.java:908) Has anyone seen this before? What might be causing it? Thanks, Mike From: Xavier Stevens [xstev...@mozilla.com] Sent: Monday, February 07, 2011 1:47 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop XML Error You don't need to distcp to upgrade a cluster. You just need to go through the upgrade process. Bumping from 0.20.2 to 0.20.3 you might not even need to do anything other than stop the cluster processes, and then restart them using the 0.20.3 install. Here's a link to the upgrade and rollback docs: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html#Upgrade+and+Rollback -Xavier On 2/7/11 10:22 AM, Korb, Michael [USA] wrote: Xavier, Yes, I'm trying to upgrade from 0.20.2 to 0.20.3. Both are running on the same cluster. I'm trying to distcp everything from the 0.20.2 instance over to the 0.20.3 instance, without any luck yet. Mike From: Xavier Stevens [xstev...@mozilla.com] Sent: Monday, February 07, 2011 1:20 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop XML Error Mike, Are you just trying to upgrade then? I've never heard of anyone trying to run two versions of hadoop on the same cluster. I'm don't think that's even possible, but maybe someone else knows. -Xavier On 2/7/11 10:03 AM, Korb, Michael [USA] wrote: Xavier, Both instances of Hadoop are running on the same cluster. I tried the command sudo -u hdfs ./hadoop distcp -update hftp://mc1:50070/ hdfs://mc0:55310 from the hadoop2 bin directory (the 0.20.3 install) on mc0 (the port 55310 is specified in core-site.xml). 
Now I'm getting this: 11/02/07 13:03:14 INFO tools.DistCp: srcPaths=[hftp://mc1:50070/] 11/02/07 13:03:14 INFO tools.DistCp: destPath=hdfs://mc0:55310 Exception in thread main java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobConf.getCredentials()Lorg/apache/hadoop/security/Credentials; at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:632) at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656) at org.apache.hadoop.tools.DistCp.run(DistCp.java:881) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.tools.DistCp.main(DistCp.java:908) Thanks, Mike From: Xavier Stevens [xstev...@mozilla.com] Sent: Monday, February 07, 2011 12:56 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop XML Error Mike, I've seen this when a directory has been removed or is missing from the time distcp starting stating the source files. You'll probably want to make sure that no code or person is messing with the filesystem during your copy. I would make sure you only have one version of hadoop installed on your destination cluster. Also you should use hdfs as the destination protocol and run the command as the hdfs user if you're using hadoop security. Example (Running on destination cluster): sudo -u hdfs /usr/lib/hadoop-0.20.3/bin/hadoop distcp -update hftp://mc1:50070/ hdfs://mc0:8020/ Cheers, -Xavier On 2/7/11 9:39 AM, Korb, Michael [USA] wrote: I'm trying to copy from 0.20.2 to 0.20.3. I was trying to follow the DistCp Guide but I think I know the problem. I'm trying to run the command on the destination cluster, but when I call hadoop, I think the path is set to run the hadoop1 executable. So I tried going to the hadoop2 install
Re: Quick Question: LineSplit or BlockSplit
Option (1) isn't the way that things normally work. Besides, the map method is called many times for each construction of a mapper object. On Mon, Feb 7, 2011 at 3:38 PM, maha m...@umail.ucsb.edu wrote: Hi, I would appreciate it if you could give me your thoughts on whether there is an effect on efficiency if: 1) Mappers were per line in a document or 2) Mappers were per block of lines in a document. I know the obvious difference I can see is that (1) has more mappers. Does that mean (1) will be slower because of scheduling time? Thank you, Maha
Re: Quick Question: LineSplit or BlockSplit
That is quite doable. One way to do it is to make the max split size quite small. On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner markkerz...@gmail.com wrote: Ted, I am also interested in this answer. I put the name of a zip file on a line in an input file, and I want one mapper to read this line, and start working on it (since it now knows the path in HDFS). Are you saying it's not doable? Thank you, Mark On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning tdunn...@maprtech.com wrote: Option (1) isn't the way that things normally work. Besides, mappers are called many times for each construction of a mapper. On Mon, Feb 7, 2011 at 3:38 PM, maha m...@umail.ucsb.edu wrote: Hi, I would appreciate it if you could give me your thoughts if there is affect on efficiency if: 1) Mappers were per line in a document or 2) Mappers were per block of lines in a document. I know the obvious difference I can see is that (1) has more mappers. Does that mean (1) will be slower because of scheduling time ? Thank you, Maha
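Another route, not what is suggested above but possibly simpler for a small manifest file, is NLineInputFormat, which hands each map task a fixed number of input lines. A sketch of the old-API driver side, with the manifest path invented and the property name as it stood in 0.20:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneLinePerMapper {
  public static void main(String[] args) {
    JobConf conf = new JobConf(OneLinePerMapper.class);
    conf.setJobName("one-zip-path-per-mapper");

    // every split (and therefore every map task) gets exactly one line of the manifest
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    FileInputFormat.setInputPaths(conf, new Path("/work/zip-manifest.txt"));
    // ... set the mapper, reducer and output path, then submit with JobClient.runJob(conf)
  }
}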
Re: retain state between mappers
Remember that mappers are not executed in a well defined order. They can be executed in a different order or even at the same time. One mapper can be run more than once. There are two ways to get something like what you want, but the question you asked is ill-posed. First, you can adapt the input format so that it gives a different integer to each split as a key. This does something like what you ask for since each mapper will get a different integer and multiply-executed mappers will get the same key each time they are run. Secondly, you could use a central coordination server to act as a global counter. **THIS IS A REALLY BAD IDEA** It is bad because it turns a parallel computation into a partially sequential one and because it doesn't account for the fact that mappers can be run multiple times. On Sat, Feb 5, 2011 at 4:03 AM, ANKITBHATNAGAR abhatna...@vantage.com wrote: is there a way I can retain the count between mappers and increment it?
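If the real goal is only a distinct integer per map task rather than a true global counter, a lighter-weight variant of the first suggestion is to read the task's own id in setup(). The same caveat applies: a re-executed task sees the same id again. A sketch with the 0.20 new API:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaskIdAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private int taskId;       // distinct per map task, stable across re-execution of that task
  private long localCount;  // local to this task only, not a global counter

  @Override
  protected void setup(Context ctx) {
    taskId = ctx.getTaskAttemptID().getTaskID().getId();
    localCount = 0;
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx) {
    localCount++;  // (taskId, localCount) gives a per-task sequence if that is all you need
  }
}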
Re: Hadoop is for whom? Data architect or Java Architect or All
Yes. There is still a mismatch. The degree of mismatch is decreasing as tools like Hive or Pig become more advanced. Better packaging of hadoop is also helping. But your data architect will definitely need support from a java aware person. They won't be able to do all of the tasks. On Wed, Jan 26, 2011 at 7:43 AM, manoranjand manoranj...@rediffmail.comwrote: Hi- I have a basic question. Appologies for my ignorance, but is hadoop a mis-fit for a data architect with zero java knowledge? -- View this message in context: http://old.nabble.com/Hadoop-is-for-whom--Data-architect-or-Java-Architect-or-All-tp30765860p30765860.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: the performance of HDFS
This is a bit lower than it should be, but it is not so far out of line with what is reasonable. Did you make sure that you have multiple separate disks for HDFS to use? With many disks, you should be able to get local disk write speeds up to a few hundred MB/s. Once you involve replication, you need to have data go out the network interface, back in to another machine, back out and back in to a third machine. There are lots of copies going on and if you are writing lots of files, you will typically be limited to 1/2 of your network bandwidth at most, but doing a bit less than that is to be expected. What you are seeing is lower than it should be, but only by a moderate factor. On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng zhengda1...@gmail.com wrote: Hello, I try to measure the performance of HDFS, but the writing rate is quite low. When the replication factor is 1, the rate of writing to HDFS is about 60MB/s. When the replication factor is 3, the rate drops significantly to about 15MB/s. Even though the actual rate of writing data to the disk is about 45MB/s, it's still much lower than when the replication factor is 1. The link between two nodes in the cluster is 1Gbps. CPU is Dual-Core AMD Opteron(tm) Processor 2212, so CPU isn't the bottleneck either. I thought I should be able to saturate the disk very easily. I wonder where the bottleneck is. What is the throughput for writing on a Hadoop cluster when the replication factor is 3? Thanks, Da
Re: the performance of HDFS
This is a really slow drive or controller. Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s. I would suspect in the absence of real information that your controller is more likely to be deficient than your drive. If this is on a laptop or something, then I withdraw my thought. On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng zhengda1...@gmail.com wrote: The aggregate write-rate can get much higher if you use more drives, but a single stream throughput is limited to the speed of one disk spindle. You are right. I measure the performance of the hard drive. It seems the bottleneck is the hard drive, but the hard drive is a little too slow. The average writing rate is 50MB/s.
Re: the performance of HDFS
Perhaps lshw would help you. ubuntu:~$ sudo lshw ... *-storage description: RAID bus controller product: SB700/SB800 SATA Controller [Non-RAID5 mode] vendor: ATI Technologies Inc physical id: 11 bus info: pci@:00:11.0 logical name: scsi0 logical name: scsi1 version: 00 width: 32 bits clock: 66MHz capabilities: storage pm bus_master cap_list emulated configuration: driver=ahci latency=64 resources: irq:22 ioport:b000(size=8) ioport:a000(size=4) ioport:9000(size=8) ioport:8000(size=4) ioport:7000(size=16) memory:fe7ffc00-fe7f *-disk description: ATA Disk product: ST3750528AS vendor: Seagate physical id: 0 bus info: scsi@0:0.0.0 ... On Tue, Jan 25, 2011 at 6:29 PM, Da Zheng zhengda1...@gmail.com wrote: No, each node in the cluster is powerful server. I was told the nodes are Dell Poweredge SC1435, but I cannot figure out the configuration of hard drives. Dell provides several possible hard drives for this model. On 1/25/11 7:59 PM, Ted Dunning wrote: This is a really slow drive or controller. Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s. I would suspect in the absence of real information that your controller is more likely to be deficient than your drive. If this is on a laptop or something, then I withdraw my thought. On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng zhengda1...@gmail.com wrote: The aggregate write-rate can get much higher if you use more drives, but a single stream throughput is limited to the speed of one disk spindle. You are right. I measure the performance of the hard drive. It seems the bottleneck is the hard drive, but the hard drive is a little too slow. The average writing rate is 50MB/s.
Re: Hadoop user event in Europe (interested ? )
If it occurs in early June, it might be possible for US attendees of Buzzwords to link in a visit to the hadoop meeting. I certainly would like to do that. On Thu, Jan 20, 2011 at 12:22 AM, Asif Jan asif@unige.ch wrote: Hi wondering if there is interest to organize a hadoop meet-up in Europe ( Geneva, Switzerland) ? It could be a 2 day event discussing use of Hadoop in industry/science project. If this interest you please let me know. cheers asif
Re: When applying a patch, which attachment should I use?
You may also be interested in the append branch: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/ On Tue, Jan 11, 2011 at 3:12 AM, edward choi mp2...@gmail.com wrote: Thanks for the info. I am currently using Hadoop 0.20.2, so I guess I only need apply hdfs-630-0.20-append.patch https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch . I wasn't familiar with the term trunk. I guess it means the latest development. Thanks again. Best Regards, Ed 2011/1/11 Konstantin Boudnik c...@apache.org Yeah, that's pretty crazy all right. In your case looks like that 3 patches on the top are the latest for 0.20-append branch, 0.21 branch and trunk (which perhaps 0.22 branch at the moment). It doesn't look like you need to apply all of them - just try the latest for your particular branch. The mess is caused by the fact the ppl are using different names for consequent patches (as in file.1.patch, file.2.patch etc) This is _very_ confusing indeed, especially when different contributors work on the same fix/feature. -- Take care, Konstantin (Cos) Boudnik On Mon, Jan 10, 2011 at 01:10, edward choi mp2...@gmail.com wrote: Hi, For the first time I am about to apply a patch to HDFS. https://issues.apache.org/jira/browse/HDFS-630 Above is the one that I am trying to do. But there are like 15 patches and I don't know which one to use. Could anyone tell me if I need to apply them all or just the one at the top? The whole patching process is just so confusing :-( Ed
Re: Import data from mysql
Yes. Hadoop can definitely help with this. On Mon, Jan 10, 2011 at 12:00 PM, Brian brian.mcswee...@gmail.com wrote: Thus, I would greatly appreciate your opinion on whether or not using hadoop for this would make sense in order to parallelize the task if it gets too slow.
Re: Import data from mysql
It is, of course, only quadratic, even if you compare all rows to all other rows. You can reduce this cost to O(n log n) by ordinary sorting and you can further reduce the cost to O(n) using radix sort on hashes. Practically speaking, in either the parallel or non-parallel setting, try sorting each batch of inputs and then doing a merge pass to find duplicated rows. Hashing your rows and doing the sort will make things faster if your rows are very long or if you use radix sort. Unless your data is vast, this would probably work on a single machine with no need for parallelism since sorting billions of items would require 10 passes through your data with a 2^16 way radix sort. To do this with hadoop, simply do the hashing as before and run a typical word count. Then the rows that duplicate are simply the ones with a count greater than 1 and these can be preferentially output by the reducer. On Sat, Jan 8, 2011 at 3:33 PM, Brian McSweeney brian.mcswee...@gmail.com wrote: I'm a TOTAL newbie on hadoop. I have an existing webapp that has a growing number of rows in a mysql database that I have to compare against one another once a day from a batch job. This is an exponential problem as every row must be compared against every other row. I was thinking of parallelizing this computation via hadoop.
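A word-count-shaped sketch of the hash-and-count step. The hex MD5 keys and the class names are just for illustration, and a real job would hash a canonicalised form of each row:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map each row to (md5 of row, 1); identical rows collide on the same key.
public class RowHashMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private MessageDigest md5;

  @Override
  protected void setup(Context ctx) throws IOException {
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void map(LongWritable offset, Text row, Context ctx)
      throws IOException, InterruptedException {
    byte[] digest = md5.digest(row.toString().getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    ctx.write(new Text(hex.toString()), ONE);
  }
}

// Keep only the hashes seen more than once; those are the duplicated rows.
class DuplicateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text hash, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable c : counts) total += c.get();
    if (total > 1) ctx.write(hash, new IntWritable(total));
  }
}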
Re: Import data from mysql
You still have to knock down the quadratic cost. Any equality checks you have in your problem can be used to limit the problem to growing quadratically in the number of records equal by that comparison. That may be enough to fix things (for now). Unfortunately heavily skewed data are very common so this smaller quadratic will be orders of magnitude smaller than the original, but still unscalable. Hadoop makes this grouping by equality much easier of course and the internal scan can be done by conventional techniques. Beyond that, you need to look at more interesting techniques to really make this a viable option. I would recommend: - if the multiplication is part of a cosine similarity measurement, then look at expressing it as a difference instead and bound the largest component of the different. - take a look at locality sensitive hashing. This gives you an approximate nearest neighbor solution that will allow good probabilistic bounds on the number of cases that you miss in return of a scalable solution. The error bounds can be made fairly tight. See http://www.mit.edu/~andoni/LSH/ - if you decide that LSH is the way to go, check out Mahout which has a minhash clustering implementation. - if you can't restate the problem as non-quadratic, then start over. Quadratic algorithms are not scalable as Michael Black has stated eloquently enough in another thread. - consider tell the group more about your problem. You get more if you give more. On Sun, Jan 9, 2011 at 5:26 AM, Brian McSweeney brian.mcswee...@gmail.comwrote: Thanks Ted, You're right but I suppose I was too brief in my initial statement. I should have said that I have to run an operation on all rows with respect to each other. It's not a case of just comparing them and thus sorting them so unfortunately I don't think this will help much. Some of the values in the rows have to be multiplied together, some have to be compared, some have to have a function run against them etc. cheers, Brian On Sun, Jan 9, 2011 at 8:55 AM, Ted Dunning tdunn...@maprtech.com wrote: It is, of course, only quadratic, even if you compare all rows to all other rows. You can reduce this cost to O(n log n) by ordinary sorting and you can reduce further reduce the cost to O(n) using radix sort on hashes. Practically speaking, in either the parallel or non parallel setting try sorting each batch of inputs and then doing a merge pass to find duplicated rows. Hashing your rows and doing the sort will make things faster if your rows are very long or if you use radix sort. Unless your data is vast, this would probably work on a single machine with no need for parallelism since sorting billions of items would require 10 passes through your data with a 2^16 way radix sort. To do this with hadoop, simply do the hashing as before and run a typical word count. Then the rows that duplicate are simply the ones with count 1 and these can be preferentially output by the reducer. On Sat, Jan 8, 2011 at 3:33 PM, Brian McSweeney brian.mcswee...@gmail.comwrote: I'm a TOTAL newbie on hadoop. I have an existing webapp that has a growing number of rows in a mysql database that I have to compare against one another once a day from a batch job. This is an exponential problem as every row must be compared against every other row. I was thinking of parallelizing this computation via hadoop. -- - Brian McSweeney Technology Director Smarter Technology web: http://www.smarter.ie phone: +353868578212 -
Re: Rngd
As it normally stands, rngd will only help (it appears) if you have a hardware RNG. You need to cheat and use entropy you don't really have. If you don't mind hacking your system, you could even do this: # mv /dev/random /dev/random.orig # ln /dev/urandom /dev/random This makes /dev/random behave as if it were /dev/urandom (which it, strictly speaking, is after you do this). Don't let your sysadmin see you do this, of course. On Tue, Jan 4, 2011 at 12:00 PM, Jon Lederman jon2...@mac.com wrote: Hi, I am trying to locate the source for rngd to build on my embedded processor in order to test whether my hadoop setup is stalled due to low entropy? Do u know where I can find this. I thought it was part of rng-tools but it's not. Thanks Jon Sent from my iPhone Sent from my iPhone
Re: What is the runtime efficiency of secondary sorting?
As a point of order, you would normally use a combiner with this problem and you wouldn't sort in either the combiner or the reducer. Instead, combiner and reducer would simply scan and keep the smallest item to emit at the end of the scan. As a point of information, most of the rank-based statistics like min, max, median and quantiles can be approximated in an on-line fashion with O(n) time and O(1) storage. Back to the question, though. Let's assume that you would still like to sort the elements. The Hadoop sorts are typically merge-sorts which can be helped if you provide data in order. Thus, you should consider (and probably test) providing a combiner that will sort partial results in order to make the framework sorts run faster. Another important consideration is whether the Hadoop shuffle and sort steps will have to deserialize your data in order to sort it. If so, you will almost certainly be better off doing the sort yourself. I don't know if your combiner would be subject to the round-trip serialization cost, but I wouldn't be surprised. On Mon, Jan 3, 2011 at 2:59 PM, W.P. McNeill bill...@gmail.com wrote: Say I have a set of unordered sets of integers: A: {2,5,7} B: {6,1,9} C: {3,8,2,1,6} I want to use map/reduce to emit the smallest integer in each set. If my input data looks like this: A2 A5 A7 B6 B1 ...etc... I could use an identity mapper and a reducer like the following Reduce(setID, [e0, e1, ... ]): a = sort [e0, e1, ... ] Emit(setID, a[0]) Using standard sort algorithms, this has runtime efficiency of O(N log N), where N is the length of [e0, e1, ... ]. I can write a custom partitioner and grouper to do a secondary sort http://sonerbalkir.blogspot.com/2010/01/simulating-secondary-sort-on-values.html , so that Hadoop sees to it that [e0, e1, ... ] comes into my reducer already sorted. When I do this my reducer becomes simply: Reduce(setID, [e0, e1, ... ]): Emit(setID, e0) I understand that this makes things faster because I'm parallelizing the sort work, but how much faster? Specifically is my runtime efficiency now O(1), amortized O(1), some function of the cluster size, or still O(N log N) but with smaller constant factors? I think (but am not 100% sure) that this is equivalent to the question, What is the runtime efficiency of map-reduce sort? Also, is there an academic paper with this information that I could cite? Usual Google searches and manual perusals were fruitless. Thanks for your help.
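For concreteness, the no-sort combiner/reducer looks something like this, assuming IntWritable set elements. Because min is associative and commutative, the same class can be registered as both the combiner and the reducer (job.setCombinerClass and job.setReducerClass):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits the smallest value per set ID with a single O(n) scan and O(1) extra memory.
public class MinReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text setId, Iterable<IntWritable> values, Context ctx)
      throws IOException, InterruptedException {
    int min = Integer.MAX_VALUE;
    for (IntWritable v : values) {
      if (v.get() < min) min = v.get();
    }
    ctx.write(setId, new IntWritable(min));
  }
}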
Re: Entropy Pool and HDFS FS Commands Hanging System
Yes. It is stuck as suggested. See the bolded lines. You can help avoid this by dumping additional entropy into the machine via network traffic. According to the man page for /dev/random you can cheat by writing goo into /dev/urandom, but I have been unable to verify that by experiment. Is it really necessary to use /dev/random here? Again from the man page, there is a strong feeling in the community that only very long lived, high value keys really need to read from /dev/random. Session keys from /dev/urandom are fine. I wrote an adaptation of the secure seed generator that doesn't block for Mahout. It is trivial, but might be useful to copy: http://svn.apache.org/repos/asf/mahout/trunk/math/src/main/java/org/apache/mahout/common/DevURandomSeedGenerator.java On Mon, Jan 3, 2011 at 3:13 PM, Jon Lederman jon2...@gmail.com wrote: I have attached the jstack pid of namenode output. Does it appear to be stuck in SecureRandom as you noted as a possibility? I am not sure whether this is indicated in the following output: ... main prio=10 tid=0x000583c8 nid=0xf3f runnable [0xb729d000] java.lang.Thread.State: RUNNABLE *at java.io.FileInputStream.readBytes(Native Method) *at java.io.FileInputStream.read(FileInputStream.java:236) at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) - locked 0x70e59ae8 (a java.io.BufferedInputStream) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read1(BufferedInputStream.java:275) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) - locked 0x70e59970 (a java.io.BufferedInputStream) at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:469) at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:140) at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:135) *at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:131) *at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:188)
Re: What is the runtime efficiency of secondary sorting?
On Mon, Jan 3, 2011 at 4:00 PM, W.P. McNeill bill...@gmail.com wrote: ... If I write a combiner like this, is there any advantage to also doing a secondary sort? The definitive answer is that it depends. As for deserialization, the value in my actual application is a Java object with a floating point rank field, and I will be sorting these objects by this rank. Does this make deserialization relatively costly? (I'm guessing it does, because it's not as simple as a single number.) If you can define a binary comparator for your value field that extracts this single number and compares it, then the framework can sort your data items without (fully) deserializing them. This can be a big, big performance win, if only because the serialized form is often smaller, so the merge sort needs fewer passes because more data fits into a given amount of memory. Without a binary comparator, the framework must deserialize your object completely and then compare it. That can't help but be slower than avoiding the deserialization. Some serialization frameworks like Avro even allow some values to be sorted without any deserialization. This is even better, of course, than partial deserialization.
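A hedged sketch of such a binary (raw) comparator, assuming a key type whose write() method puts the float rank first; the key class, field layout, and registration are illustrative, not from the original application:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative key type: the float rank is written first, so a raw
// comparator can order serialized keys by reading just those four bytes.
public class RankedKey implements WritableComparable<RankedKey> {
  private float rank;
  private long id;

  public void set(float rank, long id) { this.rank = rank; this.id = id; }

  public void write(DataOutput out) throws IOException {
    out.writeFloat(rank);   // rank first: the raw comparator depends on this
    out.writeLong(id);
  }

  public void readFields(DataInput in) throws IOException {
    rank = in.readFloat();
    id = in.readLong();
  }

  public int compareTo(RankedKey other) {
    return Float.compare(rank, other.rank);
  }

  // Compares serialized keys without deserializing them.
  public static class RankRawComparator extends WritableComparator {
    public RankRawComparator() {
      super(RankedKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      // readFloat decodes the four bytes writeFloat placed at the front
      // of each serialized key.
      return Float.compare(readFloat(b1, s1), readFloat(b2, s2));
    }
  }

  static {
    // Register so the framework uses the raw comparator for this key type.
    WritableComparator.define(RankedKey.class, new RankRawComparator());
  }
}

The comparator could instead be set for a single job with job.setSortComparatorClass(RankedKey.RankRawComparator.class) rather than registered globally.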
Re: Entropy Pool and HDFS FS Commands Hanging System
Try: dd if=/dev/random bs=1 count=100 of=/dev/null
This will likely hang for a long time. There is no way that I know of to change the behavior of /dev/random except by changing the file itself to point to a different minor device. That would be very bad form. One thing you may be able to do is to pour lots of entropy into the system via /dev/urandom. I was not able to demonstrate this, though, when I just tried it. It would be nice if there were a config variable to set that would change this behavior, but right now, a code change is required (AFAIK). Another thing to do is replace the use of SecureRandom with a version that uses /dev/urandom. That is the point of the code that I linked to. It provides a plug-in replacement that will not block. On Mon, Jan 3, 2011 at 4:31 PM, Jon Lederman jon2...@gmail.com wrote: Could you give me a bit more information on how I can overcome this issue? I am running Hadoop on an embedded processor, and networking is turned off to the embedded processor. Is there a quick way to check whether this is in fact blocking on my system? And are there some variables or configuration options I can set to avoid any potential blocking behavior?
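For the "is there a quick way to check" question, a hedged sketch (Linux-specific; the path is assumed to exist on the embedded system) that prints the kernel's current entropy estimate. A value near zero means reads from /dev/random will block:

import java.io.BufferedReader;
import java.io.FileReader;

public class EntropyCheck {
  public static void main(String[] args) throws Exception {
    // Linux exposes the current entropy estimate (in bits) here.
    BufferedReader r =
        new BufferedReader(new FileReader("/proc/sys/kernel/random/entropy_avail"));
    try {
      System.out.println("available entropy bits: " + r.readLine());
    } finally {
      r.close();
    }
  }
}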
Re: Entropy Pool and HDFS FS Commands Hanging System
On Mon, Jan 3, 2011 at 4:48 PM, Jon Lederman jon2...@gmail.com wrote: Thanks. Will try that. One final question: based on the jstack output I sent, is it obvious that the system is blocked due to the behavior of /dev/random? I tried to send you a highlighted markup of your jstack output. The key thing to look for is some thread reading bytes from within a call chain that leads back to SecureRandom. If I just let the FS command run (i.e., hadoop fs -ls), is there any guarantee it will eventually return in some relatively finite period of time such as hours, or could it potentially take days, weeks, years or eternity? It depends on how quiet your machine is. If it has stuff happening, then it will unwedge eventually.
Re: Help for the problem of running lucene on Hadoop
With even a dozen or two servers, it is very easy to flatten a MySQL server with a Hadoop cluster. Also, MySQL is typically a very poor storage system for an inverted index because it doesn't allow for compression of the posting vectors. Better to copy Katta in this regard and create many independent indexes. On Fri, Dec 31, 2010 at 9:56 PM, Jander g jande...@gmail.com wrote: Thanks for all the replies above. Now my idea is: running word segmentation on Hadoop and creating the inverted index in MySQL. As we know, Hadoop MR supports writing and reading to MySQL. Does this approach have any problems? On Sat, Jan 1, 2011 at 7:49 AM, James Seigel ja...@tynt.com wrote: Check out katta for an example
Re: Hadoop RPC call response post processing
Knowing the tenuring distribution will tell a lot about that exact issue. Ephemeral collections take on average less than one instruction per allocation and the allocation itself is generally only a single instruction. For ephemeral garbage, it is extremely unlikely that you can beat that. So the real question is whether you are actually creating so much garbage that you are overwhelming the collector or whether the data is much longer-lived than it should be. *That* can cause lots of collection costs. To tell how long data lives, you need to get the tenuring distribution:
-XX:+PrintTenuringDistribution
This prints details about the tenuring distribution to standard out. It can be used to show the tenuring threshold and the ages of objects in the new generation. It is also useful for observing the lifetime distribution of an application. On Tue, Dec 28, 2010 at 11:59 AM, Stefan Groschupf s...@101tec.com wrote: I don't think the problem is allocation but garbage collection.
Re: help for using mapreduce to run different code?
If you mean running different code in different mappers, I recommend using an if statement. On Tue, Dec 28, 2010 at 2:53 PM, Jander g jande...@gmail.com wrote: Does Hadoop support the map function running different code? If yes, how do I realize this?
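A hedged sketch of that suggestion: a single mapper class that branches per record. The tag-prefix convention, names, and output types are illustrative, not from the thread:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DispatchingMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.startsWith("A\t")) {
      // code path for type-A records
      context.write(new Text("typeA"), ONE);
    } else {
      // code path for everything else
      context.write(new Text("other"), ONE);
    }
  }
}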
Re: how to run jobs every 30 minutes?
Good quote. On Tue, Dec 28, 2010 at 3:46 PM, Chris K Wensel ch...@wensel.net wrote: deprecated is the new stable. https://issues.apache.org/jira/browse/MAPREDUCE-1734 ckw On Dec 28, 2010, at 2:56 PM, Jimmy Wan wrote: I've been using Cascading to act as make for my Hadoop processes for quite some time. Unfortunately, even the most recent distribution of Cascading was written against the deprecated Hadoop APIs (JobConf) that I'm looking to replace. Does anyone have an alternative? On Tue, Dec 14, 2010 at 18:02, Chris K Wensel ch...@wensel.net wrote: Cascading also has the ability to only run 'stale' processes. Think 'make' file. When re-running a job where only one file of many has changed, this is a big win. -- Chris K Wensel ch...@concurrentinc.com http://www.concurrentinc.com -- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
Re: Hadoop RPC call response post processing
I would be very surprised if allocation itself is the problem as opposed to good old-fashioned excess copying. It is very hard to write an allocator faster than the Java generational GC, especially if you are talking about objects that are ephemeral. Have you looked at the tenuring distribution? On Mon, Dec 27, 2010 at 8:07 PM, Stefan Groschupf s...@101tec.com wrote: Hi All, I've been browsing the RPC code for quite a while now, trying to find any entry point / interceptor slot that allows me to handle an RPC call response Writable after it has been sent over the wire. Does anybody have an idea how to break into the RPC code from outside? All the interesting methods are private. :( Background: Heavy use of the RPC allocates a huge number of Writable objects. We saw in multiple systems that the garbage collector can get so busy that the JVM almost freezes for seconds. Things like ZooKeeper sessions time out in those cases. My idea is to create an object pool for Writables. Borrowing an object from the pool is simple since this happens in our custom code, though we do not know when the Writable has been sent over the wire and can be returned to the pool. A dirty hack would be to override the write(out) method in the Writable, assuming that is the last thing done with the Writable, though it turns out that this method is called in other cases too, e.g. to measure throughput. Any ideas? Thanks, Stefan
Re: How to simulate network delay on 1 node
See also https://github.com/toddlipcon/gremlins On Sun, Dec 26, 2010 at 11:26 AM, Konstantin Boudnik c...@apache.org wrote: Hi there. What you are looking at is fault injection. I am not sure what version of Hadoop you're looking at, but here's what to take a look at in 0.21 and forward:
- Herriot system testing framework (which does code instrumentation to add special APIs) on real clusters. Here are some starting pointers:
  - source code is in src/test/system
  - http://wiki.apache.org/hadoop/HowToUseSystemTestFramework
- fault injection framework (should've been ported to 0.20 as well)
  - source code is under src/test/aop
  - http://hadoop.apache.org/hdfs/docs/r0.21.0/faultinject_framework.html
If you are running on simulated infrastructure you don't need to look further than the fault injection framework. There's a test in HDFS which does pretty much what you're looking for, but for pipelines (look under src/test/aop/org/apache/hadoop/hdfs/*). If you are on a physical cluster then you need to use a combination of the 1st and 2nd. The implementation of faults in system tests is coming into Hadoop at some point in the not-very-distant future, so you might want to wait a little bit. -- Take care, Konstantin (Cos) Boudnik On Sun, Dec 26, 2010 at 04:25, yipeng yip...@gmail.com wrote: Hi everyone, I would like to simulate network delay on 1 node in my cluster, perhaps by putting the thread to sleep every time it transfers data non-locally. I'm looking at the source but am not sure where to place the code. Is there a better way to do it... a tool perhaps? Or could someone point me in the right direction? Cheers, Yipeng
Re: Hadoop/Elastic MR on AWS
EMR instances are started near each other. This increases the bandwidth between nodes. There may also be some enhancements in terms of access to the SAN that supports EBS. On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: - Original Message From: Amandeep Khurana ama...@gmail.com To: common-user@hadoop.apache.org Sent: Fri, December 10, 2010 1:14:45 AM Subject: Re: Hadoop/Elastic MR on AWS Mark, Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice as compared to building your own cluster out. Could you please point out what optimizations you are referring to?
Re: How frequently can I set status?
It is reasonable to update counters often, but I think you are right to limit the number of status updates. On Thu, Dec 23, 2010 at 11:15 AM, W.P. McNeill bill...@gmail.com wrote: I have a loop that runs over a large number of iterations (order of 100,000) very quickly. It is nice to do context.setStatus() with an indication of where I am in the loop. Currently I'm only calling setStatus() every 10,000 iterations because I don't want to overwhelm the task trackers with lots of status messages. Is this something I should be worried about, or is Hadoop designed to handle a high volume of status messages? If so, I'll just call setStatus() every iteration.
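A hedged sketch of that split (a counter bumped every iteration, the status string refreshed only occasionally); the names, interval, and iteration count are illustrative, and the per-iteration work itself is elided:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatusThrottleMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  private static final int ITERATIONS = 100000;       // illustrative
  private static final int STATUS_INTERVAL = 10000;   // illustrative

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    for (int i = 0; i < ITERATIONS; i++) {
      // ... per-iteration work would go here ...
      // Counters are cheap enough to bump every iteration.
      context.getCounter("loop", "iterations").increment(1);
      // The human-readable status is only refreshed occasionally.
      if (i % STATUS_INTERVAL == 0) {
        context.setStatus("processed " + i + " of " + ITERATIONS + " iterations");
      }
    }
  }
}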
Re: breadth-first search
The Mahout math package has a number of basic algorithms that exploit sparsity when given sparse graphs. A number of other algorithms use only the product of a sparse matrix with another matrix or a vector. Since these algorithms never change the original sparse matrix, they are safe against fill-in problems. The random projection technique avoids O(v^3) algorithms for computing the SVD or related matrix decompositions. See http://arxiv.org/abs/0909.4061 and https://issues.apache.org/jira/browse/MAHOUT-376 None of these algorithms are specific to graph theory, but all deal with methods that are useful with sparse graphs. On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho rickyphyl...@yahoo.com wrote: Can you point me to matrix algorithms that are tuned for sparse graphs? What I mean is from O(v^3) to O(v*e) where v = number of vertices and e = number of edges.
Re: breadth-first search
Ahh... I see what you mean. This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 map-reduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less important. 2010/12/21 Peng, Wei wei.p...@xerox.com The graph that my BFS algorithm is running on only needs 4 levels to reach all nodes. The reason I say many iterations is that there are 100 source nodes, so 400 iterations in total. The algorithm should be right, and I cannot think of anything to reduce the number of iterations. Ted, I will check out the links that you sent to me. I really appreciate your help. Wei -Original Message- From: Edward J. Yoon [mailto:edw...@udanax.org] Sent: Tuesday, December 21, 2010 1:27 AM To: common-user@hadoop.apache.org Subject: Re: breadth-first search There's no release yet. But I had tested the BFS using Hama and HBase. Sent from my iPhone On 2010. 12. 21., at 11:30 AM, Peng, Wei wei.p...@xerox.com wrote: Yoon, Can I use HAMA now, or is it still in development? Thanks Wei -Original Message- From: Edward J. Yoon [mailto:edwardy...@apache.org] Sent: Monday, December 20, 2010 6:23 PM To: common-user@hadoop.apache.org Subject: Re: breadth-first search Check this slide out - http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei wei.p...@xerox.com wrote: I implemented an algorithm to run Hadoop on 25GB of graph data to calculate its average separation length. The input format is V1(tab)V2 (where V2 is the friend of V1). My purpose is to first randomly select some seed nodes, and then for each node, calculate the shortest paths from this node to all other nodes on the graph. To do this, I first run a simple Python script on a single machine to get some random seed nodes. Then I run a Hadoop job to generate an adjacency list for each node as the input for the second job. The second job takes the adjacency list as input and outputs the first-level breadth-first search result. The nodes which are the friends of the seed node have distance 1. Then this output is the input for the next Hadoop job, and so on, until all the nodes are reached. I generated a simulated graph for testing. This data has only 100 nodes. Normal Python code can find the separation length within 1 second (100 seed nodes). However, Hadoop took almost 3 hours to do that (pseudo-distributed mode on one machine)!! I wonder if there is a more efficient way to do breadth-first search in Hadoop? It is very inefficient to output so many intermediate results. In total there would be seedNodeNumber*levelNumber+1 jobs and seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow? Please help. Thanks! Wei -- Best Regards, Edward J. Yoon edwardy...@apache.org http://blog.udanax.org
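A hedged sketch of what one of those parallel iterations could look like, with every source's frontier carried along in the same record. The record layout, tag strings, and class name are illustrative, not from the thread; a matching reducer (not shown) would merge the FRONTIER updates by keeping the minimum distance per source and re-attach the adjacency list.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FrontierMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    // Illustrative record layout: nodeId<TAB>src1:d1,src2:d2<TAB>n1,n2,n3
    // A node not yet reached from any source carries "-" in the middle field.
    String[] parts = value.toString().split("\t");
    String node = parts[0];
    String distances = parts[1];
    String adjacency = parts[2];
    // Re-emit the node's own record so the reducer can rebuild it.
    context.write(new Text(node), new Text("NODE\t" + distances + "\t" + adjacency));
    if (distances.equals("-")) {
      return;   // nothing to propagate yet
    }
    // Propagate every known (source, distance) pair to every neighbour.
    for (String neighbour : adjacency.split(",")) {
      for (String srcDist : distances.split(",")) {
        String[] sd = srcDist.split(":");
        int next = Integer.parseInt(sd[1]) + 1;
        context.write(new Text(neighbour), new Text("FRONTIER\t" + sd[0] + ":" + next));
      }
    }
  }
}

Because every source advances one level per pass, the number of passes equals the depth of the graph (4 here), not sources times levels.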
Re: breadth-first search
Absolutely true. Nobody should pretend otherwise. On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei wei.p...@xerox.com wrote: Hadoop is useful when the data is huge and cannot fit into memory, but it does not seem to be a real-time solution.
Re: Friends of friends with MapReduce
On Mon, Dec 20, 2010 at 9:39 AM, Antonio Piccolboni anto...@piccolboni.info wrote: For an easy solution, use Hive. Let's say your record contains userid and friendid and the table is called friends. Then you would do:
select A.userid, B.friendid from friends A join friends B on (A.friendid = B.userid)
This is off the top of my head, sorry if some details are off, but I've done it in the past on large datasets (~100M rows). That's it. Do that in Java and tell me if it isn't at least 50 lines of code. In raw Java it will be a lot of code. In Plume, it should be just a few lines, most of which will have to do with reading the data. Pig and Hive will definitely be the most concise though.
Re: breadth-first search
On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei wei.p...@xerox.com wrote: ... My question is really about what the efficient way is to do graph computation, matrix computation, and algorithms that need many iterations to converge (with intermediate results). Large graph computations usually assume a sparse graph for historical reasons. A key property of scalable algorithms is that the time and space are linear in the input size. Most all-pairs path algorithms are not linear because the result is n x n and dense. Some graph path computations can be done indirectly by spectral methods. With good random projection algorithms for sparse matrix decomposition, approximate versions of some of these algorithms can be phrased in a scalable fashion. It isn't an easy task, however. HAMA looks like a very good solution, but can we use it now, and how do we use it? I don't think that Hama has produced any usable software yet.
Re: breadth-first search
On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei wei.p...@xerox.com wrote: ... Currently, most of the matrix data (graph matrix, document-word matrix) that we are dealing with is sparse. Good. The matrix decomposition often needs many iterations to converge, so intermediate results have to be saved to serve as the input for the next iteration. I think you are thinking of the wrong algorithms. The ones that I am talking about converge in a fixed and small number of steps. See https://issues.apache.org/jira/browse/MAHOUT-376 for the work in progress on this. This is super inefficient. As I mentioned, the BFS algorithm written in Python took 1 second to run; however, Hadoop took almost 3 hours. I would expect Hadoop to be slower, but not this slow. I think you have a combination of factors here. But, even accounting for having too many iterations in your BFS algorithm, iterations with stock Hadoop take tens of seconds even if they do nothing. If you arrange your computation to need many iterations, it will be slow. All the Hadoop applications that I have seen are very simple calculations; I wonder how it can be applied to machine learning/data mining algorithms. Check out Mahout. There is a lot of machine learning going on there, both on Hadoop and using other scalable methods. Is HAMA the only way to solve it? If it is not ready to use yet, then can I assume Hadoop is not a good solution for multiple-iteration algorithms right now? I don't see much evidence that HAMA will ever solve anything, so I wouldn't recommend pinning your hopes on that. For fast, iterative map-reduce, you really need to keep your mappers and reducers live between iterations. Check out Twister for that: http://www.iterativemapreduce.org/
Re: InputFormat for a big file
a) This is a small file by Hadoop standards. You should be able to process it by conventional methods on a single machine in about the same time it takes to start a Hadoop job that does nothing at all. b) Reading a single line at a time is not as inefficient as you might think. If you write a mapper that reads each line, converts it to an integer, and outputs a constant integer as the key with the number you read as the value, the mapper will process the data reasonably quickly. If you add a combiner and a reducer that add up the numbers in a list, then the amount of data spilled will be nearly zero. On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak phatak@gmail.com wrote: Hi I have a very large file of size 1.4 GB. Each line of the file is a number. I want to find the sum of all those numbers. I wanted to use NLineInputFormat as the InputFormat, but it sends only one line to the Mapper, which is very inefficient. So can you guide me in writing an InputFormat which splits the file into multiple splits, so that each mapper can read multiple lines from its split? Regards Madhukar
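A hedged sketch of that mapper/combiner/reducer arrangement, assuming one number per line emitted under a single constant key; class names and types are illustrative, not from the thread:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SumJob {
  private static final IntWritable CONSTANT_KEY = new IntWritable(1);

  public static class SumMapper
      extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String trimmed = line.toString().trim();
      if (!trimmed.isEmpty()) {
        // One number per line, all emitted under the same constant key.
        context.write(CONSTANT_KEY, new LongWritable(Long.parseLong(trimmed)));
      }
    }
  }

  // Used as both the combiner and the reducer, so each map task ships a
  // single partial sum and almost nothing is spilled or shuffled.
  public static class SumReducer
      extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<LongWritable> values,
                          Context context) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }
}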
Re: Please help with hadoop configuration parameter set and get
Statics won't work the way you might think because different mappers and different reducers are all running in different JVMs. It might work in local mode, but don't kid yourself about it working in a distributed mode. It won't. On Fri, Dec 17, 2010 at 8:31 AM, Peng, Wei wei.p...@xerox.com wrote: Arindam, how do I set this global static boolean variable? I tried to do something similar yesterday, along these lines:
public class BFSearch {
  private static boolean expansion;
  public static class MapperClass { if no nodes, expansion = false; }
  public static class ReducerClass
  public static void main { expansion = true; run job; print(expansion) }
}
In this case, expansion is still true. I will look at Hadoop counters and report back here later. Thank you for all your help Wei -Original Message- From: Arindam Khaled [mailto:akha...@utdallas.edu] Sent: Friday, December 17, 2010 10:35 AM To: common-user@hadoop.apache.org Subject: Re: Please help with hadoop configuration parameter set and get I did something like this using a global static boolean variable (flag) while I was implementing breadth-first IDA*. In my case, I set the flag to something else if a solution was found, which was examined in the reducer. I guess this works in your case too, since if the mappers don't produce anything the reducers won't have anything as input, if I am not wrong. And I had chained map-reduce jobs ( http://developer.yahoo.com/hadoop/tutorial/module4.html ) running until a solution was found. Kind regards, Arindam Khaled On Dec 17, 2010, at 12:58 AM, Peng, Wei wrote: Hi, I am a Hadoop newbie. Today I was struggling with a Hadoop problem for several hours. I initialize a parameter by setting the job configuration in main, e.g.
Configuration con = new Configuration();
con.set("test", "1");
Job job = new Job(con);
Then in the mapper class, I want to set "test" to 2. I did it by
context.getConfiguration().set("test", "2");
Finally in the main method, after the job is finished, I check "test" again by
job.getConfiguration().get("test");
However, the value of "test" is still 1. The reason why I want to change the parameter inside the Mapper class is that I want to determine when to stop an iteration in the main method. For example, when doing breadth-first search, when no new nodes are added for further expansion, the search iteration should stop. Your help will be deeply appreciated. Thank you Wei
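A hedged sketch of the counter-based alternative mentioned in the thread: the tasks bump a counter, and the driver reads it after each pass to decide whether to run another one. The enum, job names, and the idea of counting "expansions" are illustrative, not from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SearchDriver {
  // Incremented by the mapper (context.getCounter(Search.EXPANSIONS).increment(1))
  // whenever a node is expanded.
  public enum Search { EXPANSIONS }

  public static void main(String[] args) throws Exception {
    boolean moreWork = true;
    int iteration = 0;
    while (moreWork) {
      Job job = new Job(new Configuration(), "bfs-iteration-" + iteration);
      // ... set mapper, reducer, and input/output paths for this iteration ...
      job.waitForCompletion(true);
      // Counter values are aggregated from every task JVM, unlike a static field.
      long expanded = job.getCounters().findCounter(Search.EXPANSIONS).getValue();
      moreWork = expanded > 0;
      iteration++;
    }
  }
}

Because the counter totals travel back to the job client with the job status, the driver sees the cluster-wide count even though each task ran in its own JVM.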
Re: Needs a simple answer
Maha, Remember that the mapper is not running on the same machine as the main class. Thus local files aren't where you think. On Thu, Dec 16, 2010 at 1:06 PM, maha m...@umail.ucsb.edu wrote: Hi all, Why would the following lines work in the main class (WordCount) and not in the Mapper, even though myconf is set in WordCount to point to the object returned by getConf()?
try {
  FileSystem hdfs = FileSystem.get(wc.WordCount.myconf);
  hdfs.copyFromLocalFile(new Path("/Users/file"), new Path("/tmp/file"));
} catch (Exception e) {
  System.err.print("\nError");
}
Also, the print statement will never print to the console unless it's in my run function. Appreciate it :) Maha
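To make that concrete, a hedged sketch of doing the copy on the machine that actually has the local file, before the job is submitted; the paths and class name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyInDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    // Run this on the submitting machine, which is where /Users/file exists;
    // inside a task, the same call would look for /Users/file on whatever
    // node happens to run that task.
    hdfs.copyFromLocalFile(new Path("/Users/file"), new Path("/tmp/file"));
    // ... then configure and submit the Job; tasks can read /tmp/file from HDFS.
  }
}

If a file needs to travel the other way, from the submitting machine to every task, DistributedCache is the usual tool rather than copyFromLocalFile inside the mapper.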
Re: how to run jobs every 30 minutes?
Or even simpler, try Azkaban: http://sna-projects.com/azkaban/ On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur t...@cloudera.com Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Of course. It is just a set of Hadoop programs. 2010/12/11 edward choi mp2...@gmail.com Can I operate Bixo on a cluster other than Amazon EC2?