How to preserve original line order?
The task should be simple: I want to put all the words of a (large) file in uppercase. I tried the following:
- streaming mode
- the mapper is a Perl script that puts each line in uppercase (number of mappers > 1)
- no reducer (number of reducers set to zero)
It works fine, except that line order is not preserved. How can I preserve the original line order? I would appreciate any suggestion. Roldano
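A mapper like the one described can be sketched as follows (a Python stand-in for the Perl script; the helper name is illustrative, not from the post). Note that with zero reducers each map task preserves order within its own input split, but the part files are not guaranteed to come out in split order; one common workaround is to prefix each line with its line number before the job and sort the concatenated output on that prefix afterwards.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: uppercase every input line.
# Order is preserved only within this mapper's own input split.
import sys

def upper_lines(lines):
    """Uppercase each line, keeping the order in which lines arrive."""
    return [line.rstrip("\n").upper() for line in lines]

if __name__ == "__main__":
    for out in upper_lines(sys.stdin):
        print(out)
```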
Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
Hey TCK, We operate a large cluster in which we run both HDFS and KFS on the same nodes. We run two instances of KFS and one instance of HDFS in the cluster: - Our logs are in KFS, and we have that KFS instance set up in WORM mode (a mode in which deletions/renames on files/dirs are permitted only on files with a .tmp extension). - Map/reduce jobs read from the WORM instance and can write to HDFS or to the KFS instance set up in r/w mode. - For archival purposes, we back data up between the two different DFS implementations. The throughput you get depends on your cluster setup: in our case, we have 4 1-TB disks on each node, each of which we can push at 100MB/s. In JBOD mode, in theory, we can get 400MB/s. With a 1 Gbps NIC, the theoretical limit is 125MB/s. Sriram

On Feb 4, 2009, TCK wrote:
> What are the advantages/disadvantages of keeping a persistent storage
> (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
> [...]
Re: Creating Lucene index in Hadoop
You can look at the Nutch code. 2009/3/13 Mark Kerzner > Hi, > > How do I allow multiple nodes to write to the same index file in HDFS? > > Thank you, > Mark >
Re: tuning performance
On 3/12/09 7:13 PM, "Vadim Zaliva" wrote: > The machines have 4 disks each, striped. > However I do not see disks being a bottleneck. When you stripe, you automatically make every disk in the system run at the speed of the slowest disk. In our experience, systems are more likely to have a 'slow' disk than a dead one, and detecting that is really, really hard. In a distributed system, that multiplier effect can have significant consequences on the whole grid's performance. Also: >>> 16Gb RAM each. They both run mapreduce and dfs nodes. Currently >>> I've set up each of them to run 32 map and 8 reduce tasks. >>> Also, HADOOP_HEAPSIZE=2048. IIRC, HEAPSIZE is in MB, so... 32*2048=65536 8*2048=16384 81920MB of VM *just in heap* ... >>> I see CPU is under-utilized. Is there a guideline for how I can find the optimal >>> number of tasks and memory settings for this kind of hardware? If you actually have all 40 task slots going, the system is likely spending quite a bit of time paging... We generally use cores/2-1 for map and reduce slots. This leaves some cores and memory for the OS to use for monitoring, the data node, etc. So 8 cores = 3 maps, 3 reduces per node. Going above that has been done for extremely lightweight processes, but you're more likely headed for heartache if you aren't careful.
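The heap arithmetic above can be checked mechanically (the numbers come from the thread; the helper function is just illustrative):

```python
# Every task slot gets a JVM sized by HADOOP_HEAPSIZE (in MB), so the
# worst-case heap commitment on a node is slots * heap.
def total_heap_mb(map_slots, reduce_slots, heap_mb):
    return (map_slots + reduce_slots) * heap_mb

# 32 map + 8 reduce slots at 2048 MB each, on a 16 GB machine:
print(total_heap_mb(32, 8, 2048))  # 81920 MB, five times the physical RAM
```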
Creating Lucene index in Hadoop
Hi, How do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark
Child Nodes processing jobs?
Hi, I am running a cluster of map/reduce jobs. How do I confirm that the slaves are actually executing the map/reduce tasks spawned by the JobTracker at the master? All the slaves are running the datanodes and tasktrackers fine. Thanks, Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-7763
Re: Reducers spawned when mapred.reduce.tasks=0
Are you seeing reducers getting spawned from the web UI? Then it is a bug. If not, there won't be reducers spawned; it could be the job-setup/job-cleanup task that is running on a reduce slot. See HADOOP-3150 and HADOOP-4261. -Amareshwari Chris K Wensel wrote: May have found the answer; waiting on confirmation from users. Turns out 0.19.0 and .1 instantiate the reducer class when the task is actually intended for job/task cleanup. branch-0.19 looks like it resolves this issue by not instantiating the reducer class in this case. I've got a workaround in the next maint release: http://github.com/cwensel/cascading/tree/wip-1.0.5 ckw On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote: Hey all, Have some users reporting intermittent spawning of Reducers when the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1. This is also confirmed when jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the default reducer is IdentityReducer, but since it should be ignored in the Mapper-only case, we don't bother unsetting the value, and it subsequently comes to one's attention rather abruptly. Am happy to open a JIRA, but wanted to see if anyone else is experiencing this issue. Note the issue seems to manifest with or without spec exec. ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: tuning performance
For a simple test, set the replication on your entire cluster to 6:

hadoop dfs -setrep -R -w 6 /

This will triple your disk usage and probably take a while, but then you are guaranteed that all data is local. You can also get a rough idea of the number of data-local map tasks from the Job Counters' 'Data-local map tasks' total field.

-setrep [-R] [-w] <rep> <path/file>: Set the replication level of a file. The -R flag requests a recursive change of replication level for an entire tree.

On Thu, Mar 12, 2009 at 7:13 PM, Vadim Zaliva wrote: > The machines have 4 disks each, striped. > However I do not see disks being a bottleneck. Monitoring system activity > shows that CPU is utilized 2-70%, disk usage is moderate, while network > activity seems to be quite high. In this particular cluster we have 6 > machines and the replication factor is 2. I was wondering if increasing the > replication factor would help, so there is a better chance that a data block > is available locally. > > Sincerely, > Vadim > > > On Thu, Mar 12, 2009 at 13:27, Aaron Kimball wrote: > > Xeon vs. Opteron is likely not going to be a major factor. More important > > than this is the number of disks you have per machine. Task performance is > > proportional to both the number of CPUs and the number of disks. > > > > You are probably using way too many tasks. Adding more tasks/node isn't > > necessarily going to increase utilization if they're waiting on data from > > the disks; you'll just increase the IO pressure and probably make things > > seek more. You've configured up to 40 simultaneous tasks/node, to run on 8 > > cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better > > usage since they don't compete for IO as much. The correct value for you > > might lie somewhere in between, too, depending on your workload. > > > > How many hdds does your machine have? You should stripe dfs.data.dir and > > mapred.local.dir across all of them. Having a 2:1 core:disk (and thus a 2:1 > > map task:disk) ratio is probably a good starting point, so for eight cores, > > you should have at least 4 disks. > > > > - Aaron > > > > On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva wrote: > > > >> Hi! > >> > >> I have a question about fine-tuning hadoop performance on 8-core > >> machines. I have 2 machines I am testing. One is an 8-core Xeon and the other an 8-core > >> Opteron, 16Gb RAM each. They both run mapreduce and dfs nodes. Currently > >> I've set up each of them to run 32 map and 8 reduce tasks. > >> Also, HADOOP_HEAPSIZE=2048. > >> > >> I see CPU is under-utilized. Is there a guideline for how I can find the optimal > >> number of tasks and memory settings for this kind of hardware? > >> > >> Also, since we are going to buy more machines like this, I need to decide > >> whether to buy Xeons or Opterons. Any advice on that? > >> > >> Sincerely, > >> Vadim > >> > >> P.S. I am using Hadoop 19 and java version "1.6.0_12": > >> Java(TM) SE Runtime Environment (build 1.6.0_12-b04) > >> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode) > >> > > > -- Alpha Chapters of my book on Hadoop are available: http://www.apress.com/book/view/9781430219422
Re: tuning performance
The machines have 4 disks each, striped. However I do not see disks being a bottleneck. Monitoring system activity shows that CPU is utilized 2-70%, disk usage is moderate, while network activity seems to be quite high. In this particular cluster we have 6 machines and the replication factor is 2. I was wondering if increasing the replication factor would help, so there is a better chance that a data block is available locally. Sincerely, Vadim On Thu, Mar 12, 2009 at 13:27, Aaron Kimball wrote: > Xeon vs. Opteron is likely not going to be a major factor. More important > than this is the number of disks you have per machine. Task performance is > proportional to both the number of CPUs and the number of disks. > > You are probably using way too many tasks. Adding more tasks/node isn't > necessarily going to increase utilization if they're waiting on data from > the disks; you'll just increase the IO pressure and probably make things > seek more. You've configured up to 40 simultaneous tasks/node, to run on 8 > cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better > usage since they don't compete for IO as much. The correct value for you > might lie somewhere in between, too, depending on your workload. > > How many hdds does your machine have? You should stripe dfs.data.dir and > mapred.local.dir across all of them. Having a 2:1 core:disk (and thus a 2:1 > map task:disk) ratio is probably a good starting point, so for eight cores, > you should have at least 4 disks. > > - Aaron > > On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva wrote: > >> Hi! >> >> I have a question about fine-tuning hadoop performance on 8-core >> machines. I have 2 machines I am testing. One is an 8-core Xeon and the other an 8-core >> Opteron, 16Gb RAM each. They both run mapreduce and dfs nodes. Currently >> I've set up each of them to run 32 map and 8 reduce tasks. >> Also, HADOOP_HEAPSIZE=2048. >> >> I see CPU is under-utilized. Is there a guideline for how I can find the optimal >> number of tasks and memory settings for this kind of hardware? >> >> Also, since we are going to buy more machines like this, I need to decide >> whether to buy Xeons or Opterons. Any advice on that? >> >> Sincerely, >> Vadim >> >> P.S. I am using Hadoop 19 and java version "1.6.0_12": >> Java(TM) SE Runtime Environment (build 1.6.0_12-b04) >> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode) >>
Hadoop Streaming throws an exception with wget as the mapper
Hi All, I am trying to use Hadoop streaming with "wget" to simulate a distributed downloader. The command line I use is:

./bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar -D mapred.reduce.tasks=0 -input urli -output urlo -mapper /usr/bin/wget -outputformat org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

But it throws an exception:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:295) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:519) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332) at org.apache.hadoop.mapred.Child.main(Child.java:155)

Can somebody point out why this happens? Thanks. -- http://daily.appspot.com/food/
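One likely cause (a guess, not confirmed in the thread): streaming delivers each input line to the mapper's stdin, but bare wget expects URLs as command-line arguments (or via `-i -`), so it exits non-zero and streaming surfaces that as "subprocess failed with code 1". A hypothetical wrapper mapper that reads URLs from stdin might look like this:

```python
#!/usr/bin/env python
# Hypothetical streaming mapper: read one URL per stdin line and invoke
# wget on it, since bare wget ignores stdin and exits with an error.
import subprocess
import sys

def fetch(url, runner=subprocess.call):
    """Download one URL; returns the subprocess exit code (0 on success)."""
    return runner(["wget", "-q", url])

if __name__ == "__main__":
    for line in sys.stdin:
        url = line.strip()
        if url:
            fetch(url)
```

Using `-mapper wrapper.py` (shipped to the nodes with `-file`) instead of `/usr/bin/wget` should then let each map task drive its own downloads.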
Re: How to get keys sorted in the final output file
For your information - http://wiki.apache.org/hama/MatMult On Wed, Nov 12, 2008 at 2:05 AM, He Chen wrote: > Hi everyone, > > I use Hadoop to do matrix multiplication. I let the key store the row > information, and let the value be the whole row, like this: > > 0 (this is the key) (0,537040) (1,217656) (the string > after the key is the value, in Text) > 8 (0,641356) (1,287908) > > Now I want to keep the key (LongWritable) in order from low to high. I > used > > "config.setOutputKeyComparatorClass(LongWritable.Comparator.class);" > > to realize my intent. However, when I multiply two 10x10 matrices, the keys > do not stay in order. What is the reason? Should I rewrite the comparator? > > Chen > -- Best Regards, Edward J. Yoon edwardy...@apache.org http://blog.udanax.org
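One possible explanation (an assumption on my part, not confirmed in the thread): if the row index reaches the sort as text rather than as a numeric LongWritable, it is compared lexicographically, so "10" sorts before "8". The effect is easy to see:

```python
# Lexicographic vs. numeric ordering of row indices written as text.
rows = ["0", "10", "8", "2"]

text_order = sorted(rows)              # compares character by character
numeric_order = sorted(rows, key=int)  # compares as integers

print(text_order)     # ['0', '10', '2', '8']  -- "10" lands before "8"
print(numeric_order)  # ['0', '2', '8', '10']  -- the order the poster wants
```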
Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
TCK wrote: How well does the read throughput from HDFS scale with the number of data nodes? For example, if I had a large file (say 10GB) on a 10-data-node cluster, would the time taken to read this whole file in parallel (i.e., with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20-data-node cluster? It depends: yes, if whatever was the bottleneck with 10 nodes continues to be the bottleneck (i.e., you are able to saturate it in both cases) and that resource is scaled (disk or network). Is this not possible because HDFS doesn't support random seeks? HDFS does support random seeks for reading... your case should work. Raghu. What about if the file was split up into multiple smaller files before placing it in HDFS? Thanks for your input. -TCK --- On Wed, 2/4/09, Brian Bockelman wrote: From: Brian Bockelman Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads? To: core-user@hadoop.apache.org Date: Wednesday, February 4, 2009, 1:50 PM Sounds overly complicated. Complicated usually leads to mistakes :) What about just having a single cluster and only running the tasktrackers on the fast CPUs? No messy cross-cluster transferring. Brian On Feb 4, 2009, at 12:46 PM, TCK wrote: Thanks, Brian. This sounds encouraging for us. What are the advantages/disadvantages of keeping a persistent storage (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster ? The advantage I can think of is that a permanent storage cluster has different requirements from a map-reduce processing cluster -- the permanent storage cluster would need faster, bigger hard disks, and would need to grow as the total volume of all collected logs grows, whereas the processing cluster would need fast CPUs and would only need to grow with the rate of incoming data. 
So it seems to make sense to me to copy a piece of data from the permanent storage cluster to the processing cluster only when it needs to be processed. Is my line of thinking reasonable? How would this compare to running the map-reduce processing on the same cluster as the data is stored in? Which approach is used by most people? Best Regards, TCK --- On Wed, 2/4/09, Brian Bockelman wrote: From: Brian Bockelman Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads? To: core-user@hadoop.apache.org Date: Wednesday, February 4, 2009, 1:06 PM Hey TCK, We use HDFS+FUSE solely as a storage solution for an application which doesn't understand MapReduce. We've scaled this solution to around 80Gbps. For 300 processes reading from the same file, we get about 20Gbps. Do consider your data retention policies -- I would say that Hadoop as a storage system is thus far about 99% reliable for storage and is not a backup solution. If you're scared of losing more than 1% of your logs, have a good backup solution. I would also add that while you are learning your operational staff's abilities, expect even more data loss. As you gain experience, data loss goes down. I don't believe we've lost a single block in the last month, but it took us 2-3 months of 1%-level losses to get here. Brian On Feb 4, 2009, at 11:51 AM, TCK wrote: Hey guys, We have been using Hadoop to do batch processing of logs. The logs get written and stored on a NAS. Our Hadoop cluster periodically copies a batch of new logs from the NAS, via NFS, into Hadoop's HDFS, processes them, and copies the output back to the NAS. The HDFS is cleaned up at the end of each batch (i.e., everything in it is deleted). The problem is that reads off the NAS via NFS don't scale, even if we try to scale the copying process by adding more threads to read in parallel. 
If we instead stored the log files on an HDFS cluster (instead of the NAS), it seems like the reads would scale, since the data can be read from multiple data nodes at the same time without any contention (except network IO, which shouldn't be a problem). I would appreciate it if anyone could share any similar experience they have had with doing parallel reads from a storage HDFS. Also, is it a good idea to have a separate HDFS for storage vs. for doing the batch processing? Best Regards, TCK
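The scaling intuition in this thread can be put into a toy model (the numbers below are mine, purely illustrative): a single NAS head caps aggregate reads at its own bandwidth, while HDFS aggregates the bandwidth of every data node holding the blocks.

```python
# Aggregate read throughput is capped by whichever side saturates first:
# the readers or the servers holding the data.
def aggregate_read_mb_s(readers, per_reader_mb_s, servers, per_server_mb_s):
    return min(readers * per_reader_mb_s, servers * per_server_mb_s)

# 20 readers at 50 MB/s each against one NAS head at 125 MB/s:
print(aggregate_read_mb_s(20, 50, 1, 125))   # 125: the NAS is the bottleneck
# The same readers against 10 data nodes at 125 MB/s each:
print(aggregate_read_mb_s(20, 50, 10, 125))  # 1000: readers become the limit
```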
Re: Reducers spawned when mapred.reduce.tasks=0
May have found the answer; waiting on confirmation from users. Turns out 0.19.0 and .1 instantiate the reducer class when the task is actually intended for job/task cleanup. branch-0.19 looks like it resolves this issue by not instantiating the reducer class in this case. I've got a workaround in the next maint release: http://github.com/cwensel/cascading/tree/wip-1.0.5 ckw On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote: Hey all, Have some users reporting intermittent spawning of Reducers when the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1. This is also confirmed when jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the default reducer is IdentityReducer, but since it should be ignored in the Mapper-only case, we don't bother unsetting the value, and it subsequently comes to one's attention rather abruptly. Am happy to open a JIRA, but wanted to see if anyone else is experiencing this issue. Note the issue seems to manifest with or without spec exec. ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Hadoop User Group Meeting (Bay Area) 3/18
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday, March 18th at Yahoo! 2811 Mission College Blvd, Santa Clara, Building 2, Training Rooms 5 & 6 from 6:00-7:30 pm. Agenda: "Performance Enhancement Techniques with Hadoop - a Case Study" - Milind Bhandarkar "RPMs for Hadoop Deployment and Configuration Management" - Matt Massie, Christophe Bisciglia Registration: http://upcoming.yahoo.com/event/2112338/ Look forward to seeing you there. Ajay
Re: Building Release 0.19.1
Hi Aviad, You are right. The eclipse plugin cannot be compiled in Windows. See also HADOOP-4310, https://issues.apache.org/jira/browse/HADOOP-4310 Nicholas Sze - Original Message > From: Aviad sela > To: Hadoop Users Support > Sent: Thursday, March 12, 2009 1:00:12 PM > Subject: Building Release 0.19.1 > > Building the eclipse project in Windows XP, using Eclipse 3.4, > results in the following error. > It seems that some of the jars needed to build the project are missing > * > > compile*: > [*echo*] contrib: eclipse-plugin > [*javac*] Compiling 45 source files to > D:\Work\AviadWork\workspace\cur\W_ECLIPSE\E34_Hadoop_Core_19_1\Hadoop\build\contrib\eclipse-plugin\classes > [*javac*] * > D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java > *:35: cannot find symbol > [*javac*] symbol : class JavaApplicationLaunchShortcut > [*javac*] location: package org.eclipse.jdt.internal.debug.ui.launcher > [*javac*] import > org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut; > [*javac*] ^ > [*javac*] * > D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java > *:49: cannot find symbol > [*javac*] symbol: class JavaApplicationLaunchShortcut > [*javac*] JavaApplicationLaunchShortcut { > [*javac*] ^ > [*javac*] * > D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java > *:66: cannot find symbol > [*javac*] symbol : variable super > [*javac*] location: class > org.apache.hadoop.eclipse.launch.HadoopApplicationLaunchShortcut > [*javac*] super.findLaunchConfiguration(type, configType); > [*javac*] ^ > [*javac*] * > D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java > *:60: method does not override or implement a method from a supertype > [*javac*] @Override > [*javac*] ^ > [*javac*] Note: Some 
input files use or override a deprecated API. > [*javac*] Note: Recompile with -Xlint:deprecation for details. > [*javac*] Note: Some input files use unchecked or unsafe operations. > [*javac*] Note: Recompile with -Xlint:unchecked for details. > [*javac*] 4 errors
Re: tuning performance
Xeon vs. Opteron is likely not going to be a major factor. More important than this is the number of disks you have per machine. Task performance is proportional to both the number of CPUs and the number of disks. You are probably using way too many tasks. Adding more tasks/node isn't necessarily going to increase utilization if they're waiting on data from the disks; you'll just increase the IO pressure and probably make things seek more. You've configured up to 40 simultaneous tasks/node, to run on 8 cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better usage since they don't compete for IO as much. The correct value for you might lie somewhere in between, too, depending on your workload. How many hdds does your machine have? You should stripe dfs.data.dir and mapred.local.dir across all of them. Having a 2:1 core:disk (and thus a 2:1 map task:disk) ratio is probably a good starting point, so for eight cores, you should have at least 4 disks. - Aaron On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva wrote: > Hi! > > I have a question about fine-tuning hadoop performance on 8-core > machines. I have 2 machines I am testing. One is an 8-core Xeon and the other an 8-core > Opteron, 16Gb RAM each. They both run mapreduce and dfs nodes. Currently > I've set up each of them to run 32 map and 8 reduce tasks. > Also, HADOOP_HEAPSIZE=2048. > > I see CPU is under-utilized. Is there a guideline for how I can find the optimal > number of tasks and memory settings for this kind of hardware? > > Also, since we are going to buy more machines like this, I need to decide > whether to buy Xeons or Opterons. Any advice on that? > > Sincerely, > Vadim > > P.S. I am using Hadoop 19 and java version "1.6.0_12": > Java(TM) SE Runtime Environment (build 1.6.0_12-b04) > Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode) >
Building Release 0.19.1
Building the eclipse project in Windows XP, using Eclipse 3.4, results in the following error. It seems that some of the jars needed to build the project are missing: * compile*: [*echo*] contrib: eclipse-plugin [*javac*] Compiling 45 source files to D:\Work\AviadWork\workspace\cur\W_ECLIPSE\E34_Hadoop_Core_19_1\Hadoop\build\contrib\eclipse-plugin\classes [*javac*] * D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java *:35: cannot find symbol [*javac*] symbol : class JavaApplicationLaunchShortcut [*javac*] location: package org.eclipse.jdt.internal.debug.ui.launcher [*javac*] import org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut; [*javac*] ^ [*javac*] * D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java *:49: cannot find symbol [*javac*] symbol: class JavaApplicationLaunchShortcut [*javac*] JavaApplicationLaunchShortcut { [*javac*] ^ [*javac*] * D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java *:66: cannot find symbol [*javac*] symbol : variable super [*javac*] location: class org.apache.hadoop.eclipse.launch.HadoopApplicationLaunchShortcut [*javac*] super.findLaunchConfiguration(type, configType); [*javac*] ^ [*javac*] * D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java *:60: method does not override or implement a method from a supertype [*javac*] @Override [*javac*] ^ [*javac*] Note: Some input files use or override a deprecated API. [*javac*] Note: Recompile with -Xlint:deprecation for details. [*javac*] Note: Some input files use unchecked or unsafe operations. [*javac*] Note: Recompile with -Xlint:unchecked for details. [*javac*] 4 errors
Re: about block size
One factor is that the block size should minimize the impact of disk seeks. For example, if a disk seeks in 10ms and transfers at 100MB/s, then a good block size will be substantially larger than 1MB. With 100MB blocks, seeks would only slow things down by 1%. Another factor is that, unless files are smaller than the block size, larger blocks mean fewer blocks, and fewer blocks make for a more efficient namenode. The primary harm of too-large blocks is that you will end up with fewer map tasks than nodes, and not use your cluster optimally. Doug ChihChun Chu wrote: Hi, I have a question about how to decide the block size. As I understand it, the block size is related to the namenode's heap size (how many blocks can be handled), the total storage capacity of the cluster, the file sizes (depending on the application, e.g. a 1T log file), the # of replicas, and the performance of mapreduce. In Google's paper, they used 64MB as their block size. Yahoo and Facebook seem to set the block size to 128MB. Hadoop's default value is 64MB. I don't know why 64MB or 128MB. Is that the result of the tradeoffs I mentioned above? How do I decide the block size if I want to build my application upon Hadoop? Is there any criterion or formula? Any opinions or comments will be appreciated. stchu
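Doug's seek-overhead argument, worked as code (the 10 ms seek and 100 MB/s transfer rate are the numbers from his message):

```python
# Fraction of a block read spent on the initial disk seek, as a function
# of block size. Larger blocks amortize the fixed seek cost.
def seek_overhead(block_mb, seek_s=0.010, transfer_mb_s=100.0):
    transfer_s = block_mb / transfer_mb_s
    return seek_s / (seek_s + transfer_s)

print(round(seek_overhead(100), 3))  # 0.01: ~1% overhead with 100 MB blocks
print(round(seek_overhead(1), 2))    # 0.5: with 1 MB blocks, seeks dominate
```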
Reducers spawned when mapred.reduce.tasks=0
Hey all, Have some users reporting intermittent spawning of Reducers when the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1. This is also confirmed when jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the default reducer is IdentityReducer, but since it should be ignored in the Mapper-only case, we don't bother unsetting the value, and it subsequently comes to one's attention rather abruptly. Am happy to open a JIRA, but wanted to see if anyone else is experiencing this issue. Note the issue seems to manifest with or without spec exec. ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
How to limit concurrent task numbers of a job.
Here I have a job that contains 2000 map tasks, and each map needs an hour or so (the maps cannot be split because their input is a compressed archive). How can I set this job's maximum number of concurrent tasks (map and reduce) to leave resources for other urgent jobs? Thanks.
Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value
I am running the same test, and the job that completes in 10 mins for the (hk,lv) case is still running after 30 mins have passed for the (lk,hv) case. Would be interesting to pinpoint the reason behind it. On Wed, Mar 11, 2009 at 1:27 PM, Gyanit wrote: > > Here are exact numbers: > # of (k,v) pairs = 1.2 Mil; this is the same in both cases. > # of unique k = 1000; k is an integer. > # of unique v = 1 Mil; v is a big, big string. > For a given k, the cumulative size of all v associated with it is about 30 MB. > (That is, each v is about 25~30 KB.) > # of Mappers = 30 > # of Reducers = 10 > > (v,k) is at least 4/5 times faster than (k,v). > > -Gyanit > > > Scott Carey wrote: > > > > Well, if the smaller keys are producing fewer unique values, there should > > be some more significant differences. > > > > I had assumed that your test produced the same number of unique values. > > > > I'm still not sure why there would be that significant of a difference as > > long as the total number of unique values in the small-key test is a good > > deal larger than the number of reducers and there is not too much skew in > > the bucket sizes. If there is a small subset of keys in the small-key > > test that contains a large subset of the values, then the reducers will > > have very skewed work sizes, and this could explain your observation. > > > > > > On 3/11/09 11:50 AM, "Gyanit" wrote: > > > > I noticed one more thing. Lighter keys tend to produce a smaller number of > > unique keys. > > For example, (key,value) pairs may be 10 Mil, but if the key is lighter, unique > > keys might be just 1000. > > In the other case, if keys are heavier, unique keys might be 5 mil. > > I think this might have something to do with it. > > Bottom line: if your reduce is a simple dump with no combining, then put data > > in keys rather than values. > > > > I need to put data in values. Any suggestions on how to make it faster? > > > > -Gyanit. > > > > > > Scott Carey wrote: > >> > >> That is a fascinating question. 
I would also love to know the reason > >> behind this. > >> > >> If I were to guess I would have thought that smaller keys and heavier > >> values would slightly outperform, rather than significantly > underperform. > >> (assuming total pair count at each phase is the same). Perhaps there > is > >> room for optimization here? > >> > >> > >> > >> On 3/10/09 6:44 PM, "Gyanit" wrote: > >> > >> > >> > >> I have large number of key,value pairs. I don't actually care if data > >> goes > >> in > >> value or key. Let me be more exact. > >> (k,v) pair after combiner is about 1 mil. I have approx 1kb data for > each > >> pair. I can put it in keys or values. > >> I have experimented with both options (heavy key , light value) vs > >> (light > >> key, heavy value). It turns out that hk,lv option is much much better > >> than > >> (lk,hv). > >> Has someone else also noticed this? > >> Is there a way to make things faster in light key , heavy value option. > >> As > >> some application will need that also. > >> Remember in both cases we are talking about atleast dozen or so million > >> pairs. > >> There is a difference of time in shuffle phase. Which is weird as amount > >> of > >> data transferred is same. > >> > >> -gyanit > >> -- > >> View this message in context: > >> > http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22447877.html > >> Sent from the Hadoop core-user mailing list archive at Nabble.com. > >> > >> > >> > >> > > > > -- > > View this message in context: > > > http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22463049.html > > Sent from the Hadoop core-user mailing list archive at Nabble.com. 
> > > > > > > > > > -- > View this message in context: > http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22463784.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > -- Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-7763
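Scott's skew hypothesis is easy to sanity-check outside Hadoop. The sketch below is plain Java, not Hadoop code (the class and method names are made up); it mimics the default HashPartitioner formula, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and shows how a single hot key among 1000 unique integer keys concentrates bytes on one of the 10 reducers:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: simulates reducer bucket sizes under Hadoop's default
// hash partitioning. With few unique keys, a skewed key distribution
// leaves one reducer with far more data than the others -- one possible
// explanation for the slow (lk,hv) run discussed above.
public class PartitionSkew {
    static int partition(Integer key, int numReducers) {
        // Same formula as HashPartitioner.getPartition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 10;
        long[] bytesPerReducer = new long[numReducers];
        Random rnd = new Random(42);
        // 1.2 Mil pairs over 1000 unique keys; key 0 is "hot" and owns
        // roughly half the records, each carrying a ~30 KB value.
        for (int i = 0; i < 1_200_000; i++) {
            int key = rnd.nextBoolean() ? 0 : 1 + rnd.nextInt(999);
            bytesPerReducer[partition(key, numReducers)] += 30_000;
        }
        // Reducer 0 ends up with roughly half of all bytes; the rest
        // split the remainder about evenly.
        System.out.println(Arrays.toString(bytesPerReducer));
    }
}
```

If the per-reducer totals come out roughly equal for your real key distribution, skew is not the explanation and the difference must lie elsewhere in the shuffle.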
Re: using virtual slave machines
Karthikeyan V wrote:
> There is no specific procedure for configuring virtual-machine slaves.
> Make sure the following things are done.

I've used these as the beginning of a page on this:
http://wiki.apache.org/hadoop/VirtualCluster
Re: Extending ClusterMapReduceTestCase
jason hadoop wrote:
> I am having trouble reproducing this one.

It happened in a very specific environment that pulled in an alternate SAX parser. The bottom line is that Jetty expects a parser with particular capabilities, and if it doesn't get one, odd things happen. In a day or so I will hopefully have worked out the details, but it has been half a year since I dealt with this last.

Unless you are forking to run your JUnit tests, Ant won't let you change the classpath for your unit tests -- much chaos will ensue. Even if you fork, unless you set includeantruntime=false you get Ant's classpath, as the JUnit test listeners are in the ant-optional-junit.jar, and you'd better pull them in somehow.

I can see why AElfred would cause problems for Jetty: it needs to handle web.xml and suchlike, and probably validates them against the schema to reduce support calls.
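A minimal build-file fragment along the lines of the forking advice above might look like this. This is a sketch, not a drop-in target: the path locations and report directory are assumptions, and the jar carrying the test listeners is the ant-optional-junit.jar mentioned above (named ant-junit.jar in later Ant releases).

```xml
<!-- Sketch only: jar locations and directories are assumptions. -->
<junit fork="true" includeantruntime="false" haltonfailure="true">
  <classpath>
    <pathelement location="build/classes"/>
    <pathelement location="build/test-classes"/>
    <pathelement location="lib/junit.jar"/>
    <!-- the JUnit test listeners live here; with includeantruntime=false
         you must pull them onto the classpath yourself -->
    <pathelement location="lib/ant-optional-junit.jar"/>
  </classpath>
  <batchtest todir="build/test-reports">
    <fileset dir="build/test-classes" includes="**/*Test.class"/>
  </batchtest>
</junit>
```

With fork="true" the tests run in a fresh JVM, so the classpath above (including whichever SAX parser you want Jetty to see) is the one that actually takes effect.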
Re: Persistent HDFS On EC2
Kris Jirapinyo wrote:
> Why would you lose the locality of storage-per-machine if one EBS volume is
> mounted to each machine instance? When that machine goes down, you can just
> restart the instance and re-mount the exact same volume. I've tried this
> idea before successfully on a 10-node cluster on EC2 and didn't see any
> adverse performance effects.

I was thinking more of the S3 filesystem, which is remote-ish, with measurable write times.

> Actually, Amazon claims that EBS I/O should be even better than the
> instance stores.

Assuming the transient filesystems are virtual disks (and not physical disks that get scrubbed, formatted and mounted on every VM instantiation), and also assuming that EBS disks are on a SAN in the same datacentre, this is probably true. Disk I/O performance in virtual disks is currently pretty slow, as you are navigating through more than one filesystem and potentially seeking a lot, even for something that appears unfragmented at the VM level.

> The only concerns I see are that you need to pay for EBS storage regardless
> of whether you use that storage or not. So, if you have 10 EBS volumes of
> 1 TB each, and you're just starting out with your cluster so you're using
> only 50 GB on each EBS volume so far for the month, you'd still have to pay
> for 10 TB worth of EBS volumes, and that could be a hefty price each month.
> Also, currently EBS volumes need to be created in the same availability
> zone as your instances, so you need to make sure that they are created
> correctly, as there is no direct migration of EBS to a different
> availability zone.

View EBS as renting space in a SAN and it starts to make sense.

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action  http://antbook.org/
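To make the billed-on-provisioned-size point concrete, here is a back-of-the-envelope sketch. The class and method are made up, and the $0.10/GB-month rate is an assumption about EBS pricing of the time; plug in the current rate.

```java
// Sketch: EBS bills on provisioned volume size, not on the bytes used.
public class EbsCost {
    // volumes * size * rate, regardless of how full the volumes are.
    static double monthlyCost(int volumes, int gbPerVolume, double ratePerGbMonth) {
        return volumes * gbPerVolume * ratePerGbMonth;
    }

    public static void main(String[] args) {
        // 10 x 1 TB volumes at an assumed $0.10/GB-month: the bill is the
        // same whether each volume holds 50 GB or is completely full.
        System.out.println(monthlyCost(10, 1000, 0.10) + " dollars/month");
    }
}
```

The 50 GB actually used in the example above would cost a tenth as much if the volumes were provisioned to match, which is the trade-off being weighed in the thread.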
Re: HADOOP Oracle connection workaround
Would be better to externalize this through either a template or, at the least, message bundles.

- Mridul

evana wrote:
> Out of the box, Hadoop has some issues connecting to Oracle. Looks like
> DBInputFormat is built with MySQL/HSQLDB in mind. You need to modify the
> stock implementation of the getSelectQuery method in DBInputFormat.
> [... full workaround code quoted from the original post below ...]
How to skip bad records in .19.1
Dear all:

I have set the value SkipBadRecords.setMapperMaxSkipRecords(conf, 1), and also SkipBadRecords.setAttemptsToStartSkipping(conf, 2). However, after 3 failed attempts, it gave me this exception:

java.lang.NullPointerException
	at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
	at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
	at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.<init>(SequenceFile.java:1198)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:401)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:306)
	at org.apache.hadoop.mapred.MapTask$SkippingRecordReader.writeSkippedRec(MapTask.java:265)
	at org.apache.hadoop.mapred.MapTask$SkippingRecordReader.next(MapTask.java:237)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)

The last line of the syslog shows:

2009-03-12 16:44:11,218 WARN org.apache.hadoop.mapred.SortedRanges: Skipping index 1-2

I have two questions:
1. Should it skip the bad record automatically after 2 attempts? Why does it start after 3?
2. Why does the skip fail?

Regards,
Song Liu from Suzhou University
HADOOP Oracle connection workaround
Out of the box, Hadoop has some issues connecting to Oracle. Looks like DBInputFormat is built with MySQL/HSQLDB in mind. You need to modify the stock implementation of the getSelectQuery method in DBInputFormat.

WORKAROUND: here is the code snippet. (Remember, this works only on Oracle; if you want to get it working on any db other than Oracle, you have to add if-else logic on the db type.)

  protected String getSelectQuery() {
    StringBuilder query = new StringBuilder();
    if (dbConf.getInputQuery() == null) {
      query.append("SELECT ");
      for (int i = 0; i < fieldNames.length; i++) {
        query.append(fieldNames[i]);
        if (i != fieldNames.length - 1) {
          query.append(", ");
        }
      }
      query.append(" FROM ").append(tableName);
      if (conditions != null && conditions.length() > 0)
        query.append(" WHERE ").append(conditions);
      String orderBy = dbConf.getInputOrderBy();
      if (orderBy != null && orderBy.length() > 0) {
        query.append(" ORDER BY ").append(orderBy);
      }
    } else {
      // PREBUILT QUERY
      query.append(dbConf.getInputQuery());
    }
    try {
      if (split.getLength() > 0 && split.getStart() > 0) {
        String querystring = query.toString();
        query = new StringBuilder();
        query.append("select * from (select a.*,rownum rno from ( ");
        query.append(querystring);
        query.append(" ) a where rownum <= ").append(split.getStart())
             .append(" + ").append(split.getLength());
        query.append(" ) where rno >= ").append(split.getStart());
      }
    } catch (IOException ex) {
      // ignore, will not throw
    }
    return query.toString();
  }

--
View this message in context: http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22471395.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
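To see what the ROWNUM wrapping in the workaround actually produces, here is that part of the logic pulled into a standalone sketch. The class and method names are made up for illustration; this is not part of Hadoop's DBInputFormat API, only a copy of its string-building so it can be checked in isolation.

```java
// Sketch of the ROWNUM-windowing idea from the workaround above.
public class OracleWindow {
    // Wraps baseQuery so that roughly rows start..start+length come back.
    // Oracle cannot filter ROWNUM with a lower bound directly, hence the
    // nested select that aliases ROWNUM as "rno" first.
    static String windowed(String baseQuery, long start, long length) {
        StringBuilder query = new StringBuilder();
        query.append("select * from (select a.*,rownum rno from ( ");
        query.append(baseQuery);
        query.append(" ) a where rownum <= ").append(start)
             .append(" + ").append(length);
        query.append(" ) where rno >= ").append(start);
        return query.toString();
    }

    public static void main(String[] args) {
        // Hypothetical split: start at row 100, window of 50 rows.
        System.out.println(windowed("SELECT id, name FROM employees", 100, 50));
    }
}
```

Note that as written both bounds are inclusive (rno >= start and rownum <= start + length), so each window returns length + 1 rows; whether adjacent splits overlap by one row depends on how the split boundaries are chosen.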