du reserved can run into problems with reserved disk capacity by tune2fs
Hi! I'm using hadoop-0.20.2 on Debian Squeeze and ran into the same confusion as many others with the parameter dfs.datanode.du.reserved. One day some datanodes got out-of-disk errors although there was space left on the disks. The following values are rounded to make the problem clearer:
- the disk for the DFS data has 1000GB and only one partition (ext3) for DFS data
- you plan to set dfs.datanode.du.reserved to 20GB
- the reserved-blocks-percentage set by tune2fs is 5% (the default)
That gives all users except root 5% less capacity than they appear to have, although the system reports the full 1000GB as usable for all users via df. The Hadoop daemons are not running as root. If I read it right, Hadoop gets the free capacity via df. Starting in /src/hdfs/org/apache/hadoop/hdfs/server/datanode/FSDataset.java on line 350: return usage.getCapacity()-reserved; going to /src/core/org/apache/hadoop/fs/DF.java, which says: "Filesystem disk space usage statistics. Uses the unix 'df' program". When you have 5% reserved by tune2fs (in our case 50GB) and you give dfs.datanode.du.reserved only 20GB, you can run into out-of-disk errors that Hadoop can't handle. In this case you must add the planned 20GB of du reserved to the capacity reserved by tune2fs. This results in (at least) 70GB for dfs.datanode.du.reserved in my case. Two ideas: 1. The documentation must be clear at this point to avoid this problem. 2. Hadoop could check for the space reserved by tune2fs (or other tools) and add this value to the dfs.datanode.du.reserved parameter. -- BR Alexander Fahlke Software Development www.fahlke.org
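To make the interaction concrete, here is a rough back-of-the-envelope sketch using the rounded numbers above (class and variable names are only illustrative):

public class DuReservedExample {
    public static void main(String[] args) {
        long GB = 1024L * 1024L * 1024L;
        long dfCapacity   = 1000L * GB; // what df (and therefore Hadoop) sees as total
        long ext3Reserved =   50L * GB; // 5% reserved-blocks-percentage, usable by root only
        long duReserved   =   20L * GB; // dfs.datanode.du.reserved

        long hadoopHeadroom  = dfCapacity - duReserved;   // 980GB: what HDFS thinks it may use
        long nonRootWritable = dfCapacity - ext3Reserved; // 950GB: what the non-root daemon can really write

        // The DataNode believes it still has ~30GB of headroom when the ext3 partition
        // is already full for non-root users, hence the surprise out-of-disk errors.
        System.out.println("HDFS overestimates its headroom by "
                + (hadoopHeadroom - nonRootWritable) / GB + " GB");
    }
}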
Fwd: Delivery Status Notification (Failure)
Hi All, I wanted to know how to connect Hive (hadoop-cdh4 distribution) with MicroStrategy. Any help is very helpful. Waiting for your response. Note: It is a little bit urgent; does anyone have experience with that? Thanks, samir
Re: number input files to mapreduce job
Hi Vikas, You can get the FileSystem instance by calling FileSystem.get(Configuration); once you have the FileSystem instance you can use FileSystem.listStatus(inputPath) to get the FileStatus instances. Best, Mahesh Balija, Calsoft Labs. On Tue, Feb 12, 2013 at 12:35 PM, Vikas Jadhav vikascjadha...@gmail.com wrote: Hi all, How do I get the number of input files (and their paths) for a particular MapReduce job in a Java MapReduce program? -- Thanx and Regards Vikas Jadhav
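If it helps, a minimal, self-contained sketch of that suggestion could look like this (the class name and the input-path argument are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CountInputFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // List everything under the job's input directory.
        FileStatus[] statuses = fs.listStatus(new Path(args[0]));
        System.out.println("Number of input files: " + statuses.length);
        for (FileStatus status : statuses) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}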
NullPointerException in Spring Data Hadoop with CDH4
Hi, I'm trying to use Spring Data Hadoop with CDH4 to write a MapReduce job. On startup, I get the following exception:

Exception in thread "SimpleAsyncTaskExecutor-1" java.lang.ExceptionInInitializerError
  at org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.java:183)
  at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NullPointerException
  at org.springframework.util.ReflectionUtils.makeAccessible(ReflectionUtils.java:405)
  at org.springframework.data.hadoop.mapreduce.JobUtils.<clinit>(JobUtils.java:123)
  ... 2 more

I guess there is a problem with my Hadoop-related dependencies. I couldn't find any reference showing how to configure Spring Data together with CDH4, but Costin showed that he is able to configure it: https://build.springsource.org/browse/SPRINGDATAHADOOP-CDH4-JOB1

**Maven Setup**

<properties>
  <spring.hadoop.version>1.0.0.BUILD-SNAPSHOT</spring.hadoop.version>
  <hadoop.version>2.0.0-cdh4.1.3</hadoop.version>
</properties>

<dependencies>
  ...
  <dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop</artifactId>
    <version>${spring.hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-streaming</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-test</artifactId>
    <version>2.0.0-mr1-cdh4.1.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-tools</artifactId>
    <version>2.0.0-mr1-cdh4.1.3</version>
  </dependency>
  ...
</dependencies>
...
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>spring-snapshot</id>
    <name>Spring Maven SNAPSHOT Repository</name>
    <url>http://repo.springframework.org/snapshot</url>
  </repository>
</repositories>

**Application Context**

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:hadoop="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd
         http://www.springframework.org/schema/context/spring-context.xsd http://www.springframework.org/schema/integration
         http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.1.xsd">

  <context:property-placeholder location="classpath:hadoop.properties" />

  <hdp:configuration id="hadoopConfiguration">
    fs.defaultFS=${hd.fs}
  </hdp:configuration>

  <hdp:job id="wordCountJob"
           input-path="${input.path}"
           output-path="${output.path}"
           mapper="com.example.WordMapper"
           reducer="com.example.WordReducer" />

  <hdp:job-runner job-ref="wordCountJob" run-at-startup="true" wait-for-completion="true"/>

</beans>

**Cluster version**

Hadoop 2.0.0-cdh4.1.3

**Note:** This small unit test runs fine with the current configuration:

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "classpath:/applicationContext.xml" })
public class Starter {

    @Autowired
    private Configuration configuration;

    @Test
    public void shellOps() {
        Assert.assertNotNull(this.configuration);
        FsShell fsShell = new FsShell(this.configuration);
        final Collection<FileStatus> coll = fsShell.ls("/user");
        System.out.println(coll);
    }
}

It would be nice if someone could give me an example configuration. Best Regards, Christian.
Error for Pseudo-distributed Mode
Hi all, I installed a redhat_enterprise-linux-x86 VM in VMware Workstation and gave the virtual machine 1GB of memory. Then I followed the steps in "Installing CDH4 on a Single Linux Node in Pseudo-distributed Mode" (https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode). At the last step, I ran an example Hadoop job with the command

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'

The screen then showed the output below, ending with "AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs", and I wonder: is that because my virtual machine's memory is too little?

[hadoop@localhost hadoop-mapreduce]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z]+'
13/02/11 04:30:44 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
13/02/11 04:30:44 INFO input.FileInputFormat: Total input paths to process : 4
13/02/11 04:30:45 INFO mapreduce.JobSubmitter: number of splits:4
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
13/02/11 04:30:45 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
13/02/11 04:30:45 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/02/11 04:30:45 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
13/02/11 04:30:45 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/02/11 04:30:45 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/02/11 04:30:45 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/02/11 04:30:46 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
13/02/11 04:30:46 INFO mapred.ResourceMgrDelegate: Submitted application application_1360528029309_0001 to ResourceManager at /0.0.0.0:8032
13/02/11 04:30:46 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1360528029309_0001/
13/02/11 04:30:46 INFO mapreduce.Job: Running job: job_1360528029309_0001
13/02/11 04:31:01 INFO mapreduce.Job: Job job_1360528029309_0001 running in uber mode : false
13/02/11 04:31:01 INFO mapreduce.Job: map 0% reduce 0%
13/02/11 04:47:22 INFO mapreduce.Job: Task Id : attempt_1360528029309_0001_r_00_0, Status : FAILED
AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs
cleanup failed for container container_1360528029309_0001_01_06 : java.lang.reflect.UndeclaredThrowableException
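For what it's worth, the "Timed out after 600 secs" message corresponds to the task timeout (mapreduce.task.timeout, 600000 ms by default). If the VM is simply too slow, one option while debugging is to raise it in mapred-site.xml; this is only an illustrative sketch and does not address the underlying cause:

<property>
  <name>mapreduce.task.timeout</name>
  <!-- milliseconds; 0 disables the timeout entirely (not recommended) -->
  <value>1200000</value>
</property>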
Re: NullPointerException in Spring Data Hadoop with CDH4
Hi, For Spring Data Hadoop problems, it's best to use the designated forum [1]. These being said I've tried to reproduce your error but I can't - I've upgraded the build to CDH 4.1.3 which runs fine against the VM on the CI (4.1.1). Maybe you have some other libraries on the client classpath? From the stacktrace, it looks like the org.apache.hadoop.mapreduce.Job class has no 'state' or 'info' fields... Anyway, let's continue the discussion on the forum. Cheers, [1] http://forum.springsource.org/forumdisplay.php?87-Hadoop On 02/12/13 2:51 PM, Christian Schneider wrote: Hi, I try to use Spring Data Hadoop with CDH4 to write a Map Reduce Job. On startup, I get the following exception: Exception in thread SimpleAsyncTaskExecutor-1 java.lang.ExceptionInInitializerError at org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.java:183) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.NullPointerException at org.springframework.util.ReflectionUtils.makeAccessible(ReflectionUtils.java:405) at org.springframework.data.hadoop.mapreduce.JobUtils.clinit(JobUtils.java:123) ... 2 more I guess there is a problem with my Hadoop related dependencies. I couldn't find any reference showing how to configure Spring Data together with CDH4. But Costin showed, he is able to configure it: https://build.springsource.org/browse/SPRINGDATAHADOOP-CDH4-JOB1 **Maven Setup** properties spring.hadoop.version1.0.0.BUILD-SNAPSHOT/spring.hadoop.version hadoop.version2.0.0-cdh4.1.3/hadoop.version /properties dependencies ... dependency groupIdorg.springframework.data/groupId artifactIdspring-data-hadoop/artifactId version${spring.hadoop.version}/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId version${hadoop.version}/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-client/artifactId version${hadoop.version}/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-streaming/artifactId version${hadoop.version}/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-test/artifactId version2.0.0-mr1-cdh4.1.3/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-tools/artifactId version2.0.0-mr1-cdh4.1.3/version /dependency ... /dependencies ... repositories repository idcloudera/id urlhttps://repository.cloudera.com/artifactory/cloudera-repos//url snapshots enabledfalse/enabled /snapshots /repository repository idspring-snapshot/id nameSpring Maven SNAPSHOT Repository/name urlhttp://repo.springframework.org/snapshot/url /repository /repositories **Application Context** ?xml version=1.0 encoding=UTF-8? 
beans xmlns=http://www.springframework.org/schema/beans; xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance; xmlns:hdp=http://www.springframework.org/schema/hadoop; xmlns:context=http://www.springframework.org/schema/context; xmlns:hadoop=http://www.springframework.org/schema/hadoop; xsi:schemaLocation= http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd http://www.springframework.org/schema/context/spring-context.xsd http://www.springframework.org/schema/integration http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.1.xsd; context:property-placeholder location=classpath:hadoop.properties / hdp:configuration id=hadoopConfiguration fs.defaultFS=${hd.fs} /hdp:configuration hdp:job id=wordCountJob input-path=${input.path} output-path=${output.path} mapper=com.example.WordMapper reducer=com.example.WordReducer / hdp:job-runner job-ref=wordCountJob run-at-startup=true wait-for-completion=true/ /beans **Cluster version** Hadoop 2.0.0-cdh4.1.3 **Note:** This small Unittest is running fine with the current configuration: @RunWith(SpringJUnit4ClassRunner.class) @ContextConfiguration(locations = { classpath:/applicationContext.xml }) public class Starter { @Autowired private Configuration configuration; @Test
RE: Error for Pseudo-distributed Mode
Hi, Could you first try running the example: $ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+' Do you receive the same error? Not sure if it's related to a lack of RAM, but as the stack trace shows errors with network timeout (I realise that you're running in pseudo-distributed mode): Caused by: com.google.protobuf.ServiceException: java.net.SocketTimeoutException: Call From localhost.localdomain/127.0.0.1 to localhost.localdomain:54113 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:60976 remote=localhost.localdomain/127.0.0.1:54113]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout Your best bet is probably to start with checking the items mentioned in the wiki page linked to above. While the default firewall rules (on CentOS) usually allows pretty much all traffic on the lo interface it might be worth temporarily turning off iptables (assuming it is on). Vijay From: yeyu1899 [mailto:yeyu1...@163.com] Sent: 12 February 2013 12:58 To: user@hadoop.apache.org Subject: Error for Pseudo-distributed Mode Hi all, I installed a redhat_enterprise-linux-x86 in VMware Workstation, and set the virtual machine 1G memory. Then I followed steps guided by Installing CDH4 on a Single Linux Node in Pseudo-distributed Mode -- https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+N ode+in+Pseudo-distributed+Mode. When at last, I ran an example Hadoop job with the command $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+' then the screen showed as follows, depending AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs and I wonder is that because my virtual machine's memory too little~~?? [hadoop@localhost hadoop-mapreduce]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z]+' 13/02/11 04:30:44 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String). 13/02/11 04:30:44 INFO input.FileInputFormat: Total input paths to process : 4 13/02/11 04:30:45 INFO mapreduce.JobSubmitter: number of splits:4 13/02/11 04:30:45 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class 13/02/11 04:30:45 WARN conf.Configuration: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class 13/02/11 04:30:45 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/02/11 04:30:45 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/02/11 04:30:45 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/02/11 04:30:45 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/02/11 04:30:45 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir 13/02/11 04:30:45 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/02/11 04:30:45 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/02/11 04:30:45 WARN conf.Configuration: mapred.output.key.class is deprecated. 
Instead, use mapreduce.job.output.key.class 13/02/11 04:30:45 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir 13/02/11 04:30:46 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources. 13/02/11 04:30:46 INFO mapred.ResourceMgrDelegate: Submitted application application_1360528029309_0001 to ResourceManager at /0.0.0.0:8032 13/02/11 04:30:46 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1360528029309_0001/ 13/02/11 04:30:46 INFO mapreduce.Job: Running job: job_1360528029309_0001 13/02/11 04:31:01 INFO mapreduce.Job: Job job_1360528029309_0001 running in uber mode : false 13/02/11 04:31:01 INFO mapreduce.Job: map 0% reduce 0% 13/02/11 04:47:22 INFO mapreduce.Job: Task Id : attempt_1360528029309_0001_r_00_0, Status : FAILED AttemptID:attempt_1360528029309_0001_r_00_0 Timed out after 600 secs cleanup failed for container container_1360528029309_0001_01_06 : java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAn dThrowException(YarnRemoteExceptionPBImpl.java:135) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.stopC
RE: number input files to mapreduce job
I don't think you can get a list of all input files in the mapper, but what you can get is the current file's information. From the context object, you can get the InputSplit, which should give you all the information you want about the current input file. http://hadoop.apache.org/docs/r2.0.2-alpha/api/org/apache/hadoop/mapred/FileSplit.html Date: Tue, 12 Feb 2013 12:35:16 +0530 Subject: number input files to mapreduce job From: vikascjadha...@gmail.com To: user@hadoop.apache.org Hi all, How do I get the number of input files (and their paths) for a particular MapReduce job in a Java MapReduce program? -- Thanx and Regards Vikas Jadhav
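A rough sketch of that idea inside a mapper, using the new API (the cast assumes the job uses a plain file-based input format; class and key/value types are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class CurrentFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The split tells you which file (and byte range) this mapper is reading.
        FileSplit split = (FileSplit) context.getInputSplit();
        context.write(new Text(split.getPath().getName()), value);
    }
}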
RE: Loader for small files
Hi, Davie: I am not sure I understand this suggestion. Why smaller block size will help this performance issue? From what the original question about, it looks like the performance problem is due to that there are a lot of small files, and each file will run in its own mapper. As hadoop needs to start a lot of mappers (I think creating a mapper also takes time and resource), but each mapper only take small amount of data (maybe hundreds K or several M of data, much less than the block size), most of the time is wasting on creating task instance for mapper, but each mapper finishes very quickly. This is the reason of performance problem, right? Do I understand the problem wrong? If so, reducing the block size won't help in this case, right? To fix it, we need to merge multi-files into one mapper, so let one mapper has enough data to process. Unless my understanding is total wrong, I don't know how reducing block size will help in this case. Thanks Yong Subject: Re: Loader for small files From: davidlabarb...@localresponse.com Date: Mon, 11 Feb 2013 15:38:54 -0500 CC: user@hadoop.apache.org To: u...@pig.apache.org What process creates the data in HDFS? You should be able to set the block size there and avoid the copy. I would test the dfs.block.size on the copy and see if you get the mapper split you want before worrying about optimizing. David On Feb 11, 2013, at 2:10 PM, Something Something mailinglist...@gmail.com wrote: David: Your suggestion would add an additional step of copying data from one place to another. Not bad, but not ideal. Is there no way to avoid copying of data? BTW, we have tried changing the following options to no avail :( set pig.splitCombination false; a few other 'dfs' options given below: mapreduce.min.split.size mapreduce.max.split.size Thanks. On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera davidlabarb...@localresponse.com wrote: You could store your data in smaller block sizes. Do something like hadoop fs HADOOP_OPTS=-Ddfs.block.size=1048576 -Dfs.local.block.size=1048576 -cp /org-input /small-block-input You might only need one of those parameters. You can verify the block size with hadoop fsck /small-block-input In your pig script, you'll probably need to set pig.maxCombinedSplitSize to something around the block size David On Feb 11, 2013, at 1:24 PM, Something Something mailinglist...@gmail.com wrote: Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to HBase. Adding 'hadoop' user group. On Mon, Feb 11, 2013 at 10:22 AM, Something Something mailinglist...@gmail.com wrote: Hello, We are running into performance issues with Pig/Hadoop because our input files are small. Everything goes to only 1 Mapper. To get around this, we are trying to use our own Loader like this: 1) Extend PigStorage: public class SmallFileStorage extends PigStorage { public SmallFileStorage(String delimiter) { super(delimiter); } @Override public InputFormat getInputFormat() { return new NLineInputFormat(); } } 2) Add command line argument to the Pig command as follows: -Dmapreduce.input.lineinputformat.linespermap=50 3) Use SmallFileStorage in the Pig script as follows: USING com.xxx.yyy.SmallFileStorage ('\t') But this doesn't seem to work. We still see that everything is going to one mapper. Before we spend any more time on this, I am wondering if this is a good approach – OR – if there's a better approach? Please let me know. Thanks.
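As an aside, the merge-many-small-files-into-one-mapper idea Yong describes is what combining input formats do in plain MapReduce. A hedged sketch follows; CombineTextInputFormat ships with newer Hadoop releases, and on older versions you may need to subclass CombineFileInputFormat yourself (class name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);
        // Pack many small files into each split, up to ~128 MB per mapper.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper/reducer/output classes as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}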
Re: NullPointerException in Spring Data Hadoop with CDH4
With the help of Costin I got a running Maven configuration. Thank you :). This is a pom.xml for Spring Data Hadoop and CDH4:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example</groupId>
  <artifactId>com.example.main</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <properties>
    <java-version>1.7</java-version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spring.version>3.2.0.RELEASE</spring.version>
    <spring.hadoop.version>1.0.0.BUILD-SNAPSHOT</spring.hadoop.version>
    <hadoop.version.generic>2.0.0-cdh4.1.3</hadoop.version.generic>
    <hadoop.version.mr1>2.0.0-mr1-cdh4.1.3</hadoop.version.mr1>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework.data</groupId>
      <artifactId>spring-data-hadoop</artifactId>
      <version>${spring.hadoop.version}</version>
      <exclusions>
        <!-- Excluded the Hadoop dependencies to be sure that they are not mixed with them provided by cloudera. -->
        <exclusion>
          <artifactId>hadoop-streaming</artifactId>
          <groupId>org.apache.hadoop</groupId>
        </exclusion>
        <exclusion>
          <artifactId>hadoop-tools</artifactId>
          <groupId>org.apache.hadoop</groupId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- Hadoop Cloudera Dependencies -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version.generic}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version.generic}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-tools</artifactId>
      <version>2.0.0-mr1-cdh4.1.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-streaming</artifactId>
      <version>2.0.0-mr1-cdh4.1.3</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>${java-version}</source>
          <target>${java-version}</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <repositories>
    <repository>
      <id>spring-milestones</id>
      <url>http://repo.springsource.org/libs-milestone</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>spring-snapshot</id>
      <name>Spring Maven SNAPSHOT
Re: Decommissioning Nodes in Production Cluster.
Hi Dhanasekaran, I believe you are trying to ask whether it is recommended to use the decommissioning feature to remove datanodes from your cluster; the answer is yes. As far as how to do it, there is some information here http://wiki.apache.org/hadoop/FAQ that should help. Regards, Robert On Tue, Feb 12, 2013 at 3:50 AM, Dhanasekaran Anbalagan bugcy...@gmail.com wrote: Hi Guys, Is it recommended to remove one of the datanodes in a production cluster via decommissioning that particular datanode? Please guide me. -Dhanasekaran, Did I learn something today? If not, I wasted it.
Re: Decommissioning Nodes in Production Cluster.
Hi, I would like to add another scenario. What are the steps for removing a dead node when the server had a hard failure that is unrecoverable. Thanks, Ben On Tuesday, February 12, 2013 7:30:57 AM UTC-8, sudhakara st wrote: The decommissioning process is controlled by an exclude file, which for HDFS is set by the* dfs.hosts.exclude* property, and for MapReduce by the*mapred.hosts.exclude * property. In most cases, there is one shared file,referred to as the exclude file.This exclude file name should be specified as a configuration parameter *dfs.hosts.exclude *in the name node start up. To remove nodes from the cluster: 1. Add the network addresses of the nodes to be decommissioned to the exclude file. 2. Restart the MapReduce cluster to stop the tasktrackers on the nodes being decommissioned. 3. Update the namenode with the new set of permitted datanodes, with this command: % hadoop dfsadmin -refreshNodes 4. Go to the web UI and check whether the admin state has changed to “Decommission In Progress” for the datanodes being decommissioned. They will start copying their blocks to other datanodes in the cluster. 5. When all the datanodes report their state as “Decommissioned,” then all the blocks have been replicated. Shut down the decommissioned nodes. 6. Remove the nodes from the include file, and run: % hadoop dfsadmin -refreshNodes 7. Remove the nodes from the slaves file. Decommission data nodes in small percentage(less than 2%) at time don't cause any effect on cluster. But it better to pause MR-Jobs before you triggering Decommission to ensure no task running in decommissioning subjected nodes. If very small percentage of task running in the decommissioning node it can submit to other task tracker, but percentage queued jobs larger then threshold then there is chance of job failure. Once triggering the 'hadoop dfsadmin -refreshNodes' command and decommission started, you can resume the MR jobs. *Source : The Definitive Guide [Tom White]* On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan wrote: Hi Guys, It's recommenced do with removing one the datanode in production cluster. via Decommission the particular datanode. please guide me. -Dhanasekaran, Did I learn something today? If not, I wasted it.
Re: Decommissioning Nodes in Production Cluster.
The decommissioning process is controlled by an exclude file, which for HDFS is set by the *dfs.hosts.exclude* property, and for MapReduce by the *mapred.hosts.exclude* property. In most cases, there is one shared file, referred to as the exclude file. This exclude file name should be specified as the configuration parameter *dfs.hosts.exclude* at namenode startup. To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the exclude file.
2. Restart the MapReduce cluster to stop the tasktrackers on the nodes being decommissioned.
3. Update the namenode with the new set of permitted datanodes, with this command: % hadoop dfsadmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to “Decommission In Progress” for the datanodes being decommissioned. They will start copying their blocks to other datanodes in the cluster.
5. When all the datanodes report their state as “Decommissioned,” all the blocks have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run: % hadoop dfsadmin -refreshNodes
7. Remove the nodes from the slaves file.
Decommissioning datanodes in small percentages (less than 2%) at a time doesn't have any noticeable effect on the cluster. But it is better to pause MR jobs before triggering decommissioning, to ensure no tasks are running on the nodes being decommissioned. If only a very small percentage of tasks is running on a decommissioning node, they can be resubmitted to other tasktrackers, but if the percentage of queued jobs is larger than the threshold, there is a chance of job failure. Once the 'hadoop dfsadmin -refreshNodes' command has been triggered and decommissioning has started, you can resume the MR jobs. *Source: The Definitive Guide [Tom White]* On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan wrote: Hi Guys, Is it recommended to remove one of the datanodes in a production cluster via decommissioning that particular datanode? Please guide me. -Dhanasekaran, Did I learn something today? If not, I wasted it.
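For reference, this is roughly how the exclude file is typically wired up in the configuration (the file path is just an example):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>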
Re: Decommissioning Nodes in Production Cluster.
On Tue, Feb 12, 2013 at 11:43 PM, Robert Molina rmol...@hortonworks.com wrote: to do it, there should be some information he This is the best way to remove a datanode from a cluster; you have done the right thing. ∞ Shashwat Shriparv
Re: Loader for small files
No, Yong, I believe you misunderstood. David's explanation makes sense. As pointed out in my original email, everything is going to 1 Mapper. It's not creating multiple mappers. BTW, the code given in my original email, indeed works as expected. It does trigger multiple mappers, but it doesn't really improve the performance. We believe the problem is that there's a data skew. We are looking into creating Partitioner to solve it. Thanks. On Tue, Feb 12, 2013 at 7:15 AM, java8964 java8964 java8...@hotmail.comwrote: Hi, Davie: I am not sure I understand this suggestion. Why smaller block size will help this performance issue? From what the original question about, it looks like the performance problem is due to that there are a lot of small files, and each file will run in its own mapper. As hadoop needs to start a lot of mappers (I think creating a mapper also takes time and resource), but each mapper only take small amount of data (maybe hundreds K or several M of data, much less than the block size), most of the time is wasting on creating task instance for mapper, but each mapper finishes very quickly. This is the reason of performance problem, right? Do I understand the problem wrong? If so, reducing the block size won't help in this case, right? To fix it, we need to merge multi-files into one mapper, so let one mapper has enough data to process. Unless my understanding is total wrong, I don't know how reducing block size will help in this case. Thanks Yong Subject: Re: Loader for small files From: davidlabarb...@localresponse.com Date: Mon, 11 Feb 2013 15:38:54 -0500 CC: user@hadoop.apache.org To: u...@pig.apache.org What process creates the data in HDFS? You should be able to set the block size there and avoid the copy. I would test the dfs.block.size on the copy and see if you get the mapper split you want before worrying about optimizing. David On Feb 11, 2013, at 2:10 PM, Something Something mailinglist...@gmail.com wrote: David: Your suggestion would add an additional step of copying data from one place to another. Not bad, but not ideal. Is there no way to avoid copying of data? BTW, we have tried changing the following options to no avail :( set pig.splitCombination false; a few other 'dfs' options given below: mapreduce.min.split.size mapreduce.max.split.size Thanks. On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera davidlabarb...@localresponse.com wrote: You could store your data in smaller block sizes. Do something like hadoop fs HADOOP_OPTS=-Ddfs.block.size=1048576 -Dfs.local.block.size=1048576 -cp /org-input /small-block-input You might only need one of those parameters. You can verify the block size with hadoop fsck /small-block-input In your pig script, you'll probably need to set pig.maxCombinedSplitSize to something around the block size David On Feb 11, 2013, at 1:24 PM, Something Something mailinglist...@gmail.com wrote: Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to HBase. Adding 'hadoop' user group. On Mon, Feb 11, 2013 at 10:22 AM, Something Something mailinglist...@gmail.com wrote: Hello, We are running into performance issues with Pig/Hadoop because our input files are small. Everything goes to only 1 Mapper. 
To get around this, we are trying to use our own Loader like this: 1) Extend PigStorage: public class SmallFileStorage extends PigStorage { public SmallFileStorage(String delimiter) { super(delimiter); } @Override public InputFormat getInputFormat() { return new NLineInputFormat(); } } 2) Add command line argument to the Pig command as follows: -Dmapreduce.input.lineinputformat.linespermap=50 3) Use SmallFileStorage in the Pig script as follows: USING com.xxx.yyy.SmallFileStorage ('\t') But this doesn't seem to work. We still see that everything is going to one mapper. Before we spend any more time on this, I am wondering if this is a good approach – OR – if there's a better approach? Please let me know. Thanks.
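For completeness, the -Dmapreduce.input.lineinputformat.linespermap setting used above has a programmatic equivalent in plain MapReduce; a hedged sketch (class name and input path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "n-line-example");
        job.setJarByClass(NLineJobSetup.class);
        job.setInputFormatClass(NLineInputFormat.class);
        // Programmatic equivalent of -Dmapreduce.input.lineinputformat.linespermap=50
        NLineInputFormat.setNumLinesPerSplit(job, 50);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper/reducer/output classes as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}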
Java submit job to remote server
I apologize for asking what seems to be such a basic question, but I could use some help with submitting a job to a remote server. I have downloaded and installed hadoop locally in pseudo-distributed mode. I have written some Java code to submit a job. Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper I've written. If I enable the conf.set("mapred.job.tracker", "localhost:9001") line, then I get the exception included below. If that line is disabled, then the job is completed. However, in reviewing the hadoop server administration page (http://localhost:50030/jobtracker.jsp) I don't see the job as processed by the server. Instead, I wonder if my Java code is simply running the necessary mapper Java code, bypassing the locally installed server. Thanks in advance. Alex

public class OfflineDataTool extends Configured implements Tool {

    public int run(final String[] args) throws Exception {
        final Configuration conf = getConf();
        //conf.set("mapred.job.tracker", "localhost:9001");

        final Job job = new Job(conf);
        job.setJarByClass(getClass());
        job.setJobName(getClass().getName());

        job.setMapperClass(OfflineDataMapper.class);
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(args[0]));

        final org.apache.hadoop.fs.Path output = new org.apache.hadoop.fs.Path(args[1]);
        FileSystem.get(conf).delete(output, true);
        FileOutputFormat.setOutputPath(job, output);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) {
        try {
            int result = ToolRunner.run(new Configuration(), new OfflineDataTool(),
                    new String[]{"offline/input", "offline/output"});
            log.error("result = {}", result);
        } catch (final Exception e) {
            throw new RuntimeException(e);
        }
    }
}

public class OfflineDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    public OfflineDataMapper() {
        super();
    }

    @Override
    protected void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException {
        final String inputString = value.toString();
        OfflineDataMapper.log.error("inputString = {}", inputString);
    }
}
Re: Java submit job to remote server
conf.set(mapred.job.tracker, localhost:9001); this means that your jobtracker is on port 9001 on localhost if you change it to the remote host and thats the port its running on then it should work as expected whats the exception you are getting? On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote: I apologize for asking what seems to be such a basic question, but I would use some help with submitting a job to a remote server. I have downloaded and installed hadoop locally in pseudo-distributed mode. I have written some Java code to submit a job. Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper I've written. If I enable the conf.set(mapred.job.tracker, localhost:9001) line, then I get the exception included below. If that line is disabled, then the job is completed. However, in reviewing the hadoop server administration page ( http://localhost:50030/jobtracker.jsp) I don't see the job as processed by the server. Instead, I wonder if my Java code is simply running the necessary mapper Java code, bypassing the locally installed server. Thanks in advance. Alex public class OfflineDataTool extends Configured implements Tool { public int run(final String[] args) throws Exception { final Configuration conf = getConf(); //conf.set(mapred.job.tracker, localhost:9001); final Job job = new Job(conf); job.setJarByClass(getClass()); job.setJobName(getClass().getName()); job.setMapperClass(OfflineDataMapper.class); job.setInputFormatClass(TextInputFormat.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(args[0])); final org.apache.hadoop.fs.Path output = new org.apache.hadoop.fs.Path(args[1]); FileSystem.get(conf).delete(output, true); FileOutputFormat.setOutputPath(job, output); return job.waitForCompletion(true) ? 0 : 1; } public static void main(final String[] args) { try { int result = ToolRunner.run(new Configuration(), new OfflineDataTool(), new String[]{offline/input, offline/output}); log.error(result = {}, result); } catch (final Exception e) { throw new RuntimeException(e); } } } public class OfflineDataMapper extends MapperLongWritable, Text, Text, Text { public OfflineDataMapper() { super(); } @Override protected void map(final LongWritable key, final Text value, final Context context) throws IOException, InterruptedException { final String inputString = value.toString(); OfflineDataMapper.log.error(inputString = {}, inputString); } } -- Nitin Pawar
Re: Java submit job to remote server
Thanks for the prompt reply and I'm sorry I forgot to include the exception. My bad. I've included it below. There certainly appears to be a server running on localhost:9001. At least, I was able to telnet to that address. While in development, I'm treating the server on localhost as the remote server. Moving to production, there'd obviously be a different remote server address configured. Root Exception stack trace: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) + 3 more (set debug level logging or '-Dmule.verbose.exceptions=true' for everything) On Feb 12, 2013, at 4:22 PM, Nitin Pawar nitinpawar...@gmail.com wrote: conf.set(mapred.job.tracker, localhost:9001); this means that your jobtracker is on port 9001 on localhost if you change it to the remote host and thats the port its running on then it should work as expected whats the exception you are getting? On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote: I apologize for asking what seems to be such a basic question, but I would use some help with submitting a job to a remote server. I have downloaded and installed hadoop locally in pseudo-distributed mode. I have written some Java code to submit a job. Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper I've written. If I enable the conf.set(mapred.job.tracker, localhost:9001) line, then I get the exception included below. If that line is disabled, then the job is completed. However, in reviewing the hadoop server administration page (http://localhost:50030/jobtracker.jsp) I don't see the job as processed by the server. Instead, I wonder if my Java code is simply running the necessary mapper Java code, bypassing the locally installed server. Thanks in advance. Alex public class OfflineDataTool extends Configured implements Tool { public int run(final String[] args) throws Exception { final Configuration conf = getConf(); //conf.set(mapred.job.tracker, localhost:9001); final Job job = new Job(conf); job.setJarByClass(getClass()); job.setJobName(getClass().getName()); job.setMapperClass(OfflineDataMapper.class); job.setInputFormatClass(TextInputFormat.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(args[0])); final org.apache.hadoop.fs.Path output = new org.apache.hadoop.fs.Path(args[1]); FileSystem.get(conf).delete(output, true); FileOutputFormat.setOutputPath(job, output); return job.waitForCompletion(true) ? 0 : 1; } public static void main(final String[] args) { try { int result = ToolRunner.run(new Configuration(), new OfflineDataTool(), new String[]{offline/input, offline/output}); log.error(result = {}, result); } catch (final Exception e) { throw new RuntimeException(e); } } } public class OfflineDataMapper extends MapperLongWritable, Text, Text, Text { public OfflineDataMapper() { super(); } @Override protected void map(final LongWritable key, final Text value, final Context context) throws IOException, InterruptedException { final String inputString = value.toString(); OfflineDataMapper.log.error(inputString = {}, inputString); } } -- Nitin Pawar
Re: Question related to Decompressor interface
Hello, can someone share some idea of what the Hadoop source code of the class org.apache.hadoop.io.compress.BlockDecompressorStream, method rawReadInt(), is trying to do here? The BlockDecompressorStream class is used for block-based decompression (e.g. Snappy). Each chunk has a header indicating how many bytes it contains. That header is obtained by the rawReadInt method, so it is expected to return a non-negative value (since you can't have a negative length). George
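In other words, each compressed chunk is framed as a 4-byte big-endian length followed by the payload; a rough illustration of what a helper like rawReadInt amounts to (this is a sketch, not the actual Hadoop source):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public final class ChunkHeader {
    // Read the 4-byte big-endian length header that precedes each compressed chunk.
    static int readChunkLength(InputStream in) throws IOException {
        int b1 = in.read(), b2 = in.read(), b3 = in.read(), b4 = in.read();
        if ((b1 | b2 | b3 | b4) < 0) {
            throw new EOFException("Truncated chunk header");
        }
        // The result is a length, so a negative value indicates corrupt input.
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }
}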
Re: Java submit job to remote server
Can you please include the complete stack trace and not just the root. Also, have you set fs.default.name to a hdfs location like hdfs://localhost:9000 ? Thanks Hemanth On Wednesday, February 13, 2013, Alex Thieme wrote: Thanks for the prompt reply and I'm sorry I forgot to include the exception. My bad. I've included it below. There certainly appears to be a server running on localhost:9001. At least, I was able to telnet to that address. While in development, I'm treating the server on localhost as the remote server. Moving to production, there'd obviously be a different remote server address configured. Root Exception stack trace: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) + 3 more (set debug level logging or '-Dmule.verbose.exceptions=true' for everything) On Feb 12, 2013, at 4:22 PM, Nitin Pawar nitinpawar...@gmail.com wrote: conf.set(mapred.job.tracker, localhost:9001); this means that your jobtracker is on port 9001 on localhost if you change it to the remote host and thats the port its running on then it should work as expected whats the exception you are getting? On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote: I apologize for asking what seems to be such a basic question, but I would use some help with submitting a job to a remote server. I have downloaded and installed hadoop locally in pseudo-distributed mode. I have written some Java code to submit a job. Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper I've written. If I enable the conf.set(mapred.job.tracker, localhost:9001) line, then I get the exception included below. If that line is disabled, then the job is completed. However, in reviewing the hadoop server administration page ( http://localhost:50030/jobtracker.jsp) I don't see the job as processed by the server. Instead, I wonder if my Java code is simply running the necessary mapper Java code, bypassing the locally installed server. Thanks in advance. Alex public class OfflineDataTool extends Configured implements Tool { public int run(final String[] args) throws Exception { final Configuration conf = getConf(); //conf.set(mapred.job.tracker, localhost:9001); final Job job = new Job(conf); job.setJarByClass(getClass()); job.setJobName(getClass().getName()); job.setMapperClass(OfflineDataMapper.class); job.setInputFormatClass(TextInputFormat.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(args[0])); final org.apache.hadoop.fs.Path output = new org.a
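Following up on Hemanth's point, a hedged sketch of the two client-side settings involved when submitting to a (pseudo-)remote MR1 cluster; host names, ports, and the helper class are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RemoteSubmitConfig {
    public static Job configuredJob() throws Exception {
        Configuration conf = new Configuration();
        // Where HDFS lives (NameNode)...
        conf.set("fs.default.name", "hdfs://localhost:9000");
        // ...and where MR1 jobs are submitted (JobTracker).
        conf.set("mapred.job.tracker", "localhost:9001");
        return new Job(conf);
    }
}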
Re: Delivery Status Notification (Failure)
Pls don't cross-post, this belongs only on the CDH lists. On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote: Hi All, I wanted to know how to connect Hive (hadoop-cdh4 distribution) with MicroStrategy. Any help is very helpful. Waiting for your response. Note: It is a little bit urgent; does anyone have experience with that? Thanks, samir -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Delivery Status Notification (Failure)
Arun, I don't understand your reply! Had you redirected this person to the Hive mailing list I would have understood. My philosophy on any mailing list has always been: if I know the answer to a question, I reply; else I humbly walk away! I got a lot of help from this group for my (mostly stupid) questions - and people helped me. I would like to return the favor when (and if) I can. My humble $0.01. :-) And for the record, I don't know the answer to the question on MicroStrategy :-) Raj From: Arun C Murthy a...@hortonworks.com To: user@hadoop.apache.org Sent: Tuesday, February 12, 2013 6:42 PM Subject: Re: Delivery Status Notification (Failure) Pls don't cross-post, this belongs only on the CDH lists. On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote: Hi All, I wanted to know how to connect Hive (hadoop-cdh4 distribution) with MicroStrategy. Any help is very helpful. Waiting for your response. Note: It is a little bit urgent; does anyone have experience with that? Thanks, samir -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Delivery Status Notification (Failure)
With all due respect, sir, these mailing lists have certain rules, which evidently don't coincide with your philosophy. Cos On Tue, Feb 12, 2013 at 08:45PM, Raj Vishwanathan wrote: Arun, I don't understand your reply! Had you redirected this person to the Hive mailing list I would have understood. My philosophy on any mailing list has always been: if I know the answer to a question, I reply; else I humbly walk away! I got a lot of help from this group for my (mostly stupid) questions - and people helped me. I would like to return the favor when (and if) I can. My humble $0.01. :-) And for the record, I don't know the answer to the question on MicroStrategy :-) Raj From: Arun C Murthy a...@hortonworks.com To: user@hadoop.apache.org Sent: Tuesday, February 12, 2013 6:42 PM Subject: Re: Delivery Status Notification (Failure) Pls don't cross-post, this belongs only on the CDH lists. On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote: Hi All, I wanted to know how to connect Hive (hadoop-cdh4 distribution) with MicroStrategy. Any help is very helpful. Waiting for your response. Note: It is a little bit urgent; does anyone have experience with that? Thanks, samir -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Anyway to load certain Key/Value pair fast?
Please do not use the general@ lists for any user-oriented questions. Please redirect them to user@hadoop.apache.org lists, which is where the user community and questions lie. I've moved your post there and have added you on CC in case you haven't subscribed there. Please reply back only to the user@ addresses. The general@ list is for Apache Hadoop project-level management and release oriented discussions alone. On Wed, Feb 13, 2013 at 10:54 AM, William Kang weliam.cl...@gmail.com wrote: Hi All, I am trying to figure out a good solution for such a scenario as following. 1. I have a 2T file (let's call it A), filled by key/value pairs, which is stored in the HDFS with the default 64M block size. In A, each key is less than 1K and each value is about 20M. 2. Occasionally, I will run analysis by using a different type of data (usually less than 10G, and let's call it B) and do look-up table alike operations by using the values in A. B resides in HDFS as well. 3. This analysis would require loading only a small number of values from A (usually less than 1000 of them) into the memory for fast look-up against the data in B. The way B finds the few values in A is by looking up for the key in A. Is there an efficient way to do this? I was thinking if I could identify the locality of the block that contains the few values, I might be able to push the B into the few nodes that contains the few values in A? Since I only need to do this occasionally, maintaining a distributed database such as HBase cant be justified. Many thanks. Cao -- Harsh J
Re: Anyway to load certain Key/Value pair fast?
My reply to your questions is inline. On Wed, Feb 13, 2013 at 10:59 AM, Harsh J ha...@cloudera.com wrote: Please do not use the general@ lists for any user-oriented questions. Please redirect them to user@hadoop.apache.org lists, which is where the user community and questions lie. I've moved your post there and have added you on CC in case you haven't subscribed there. Please reply back only to the user@ addresses. The general@ list is for Apache Hadoop project-level management and release oriented discussions alone. On Wed, Feb 13, 2013 at 10:54 AM, William Kang weliam.cl...@gmail.com wrote: Hi All, I am trying to figure out a good solution for such a scenario as following. 1. I have a 2T file (let's call it A), filled by key/value pairs, which is stored in the HDFS with the default 64M block size. In A, each key is less than 1K and each value is about 20M. 2. Occasionally, I will run analysis by using a different type of data (usually less than 10G, and let's call it B) and do look-up table alike operations by using the values in A. B resides in HDFS as well. 3. This analysis would require loading only a small number of values from A (usually less than 1000 of them) into the memory for fast look-up against the data in B. The way B finds the few values in A is by looking up for the key in A. About 1000 such rows would equal a memory expense of near 20 GB, given the value size of A you've noted above. The solution may need to be considered with this in mind, if the whole lookup table is to be eventually generated into the memory and never discarded until the end of processing. Is there an efficient way to do this? Since HBase may be too much for your simple needs, have you instead considered using MapFiles, which allow fast key lookups at a file level over HDFS/MR? You can have these files either highly replicated (if their size is large), or distributed via the distributed cache in the lookup jobs (if they are infrequently used and small sized), and be able to use the MapFile reader API to perform lookups of keys and read values only when you want them. I was thinking if I could identify the locality of the block that contains the few values, I might be able to push the B into the few nodes that contains the few values in A? Since I only need to do this occasionally, maintaining a distributed database such as HBase cant be justified. I agree that HBase may not be wholly suited to be run just for this purpose (unless A's also gonna be scaling over time). Maintaining value - locality mapping would need to be done by you. FS APIs provide locality info calls, and your files may be key-partitioned enough to identify each one's range, and you can combine the knowledge of these two to do something along these lines. Using HBase may also turn out to be easier, but thats upto you. You can also choose to tear it down (i.e. the services) when not needed, btw. Many thanks. Cao -- Harsh J -- Harsh J
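A minimal sketch of the MapFile lookup idea mentioned above; the paths, key/value types, and class name are assumptions, and the data must already have been written as a MapFile (sorted keys, with its "data" and "index" parts):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: the MapFile directory on HDFS; args[1]: the key to look up.
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
        try {
            Text key = new Text(args[1]);
            BytesWritable value = new BytesWritable();
            if (reader.get(key, value) != null) {
                System.out.println("Found " + key + ": " + value.getLength() + " bytes");
            } else {
                System.out.println("Key not found: " + key);
            }
        } finally {
            reader.close();
        }
    }
}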
Re: Anyway to load certain Key/Value pair fast?
Hi Harsh, Thanks a lot for your reply and great suggestions. In the practical cases, the values usually do not reside in the same data node. Instead, they are mostly distributed by the key range itself. So, it does require 20G of memory, but distributed in different nodes. The MapFile solution is very intriguing. I am not very familiar with it though. I assume it kinda resemble the basic idea of HBase? I will certainly try it out and follow up if there are questions. I agree that using HBase would be much easier. But the value size makes worry if it is going to push it to the edge. If I do this more often, I will definitely consider using HBase. Many thanks for the great reply. William On Wed, Feb 13, 2013 at 12:38 AM, Harsh J ha...@cloudera.com wrote: My reply to your questions is inline. On Wed, Feb 13, 2013 at 10:59 AM, Harsh J ha...@cloudera.com wrote: Please do not use the general@ lists for any user-oriented questions. Please redirect them to user@hadoop.apache.org lists, which is where the user community and questions lie. I've moved your post there and have added you on CC in case you haven't subscribed there. Please reply back only to the user@ addresses. The general@ list is for Apache Hadoop project-level management and release oriented discussions alone. On Wed, Feb 13, 2013 at 10:54 AM, William Kang weliam.cl...@gmail.com wrote: Hi All, I am trying to figure out a good solution for such a scenario as following. 1. I have a 2T file (let's call it A), filled by key/value pairs, which is stored in the HDFS with the default 64M block size. In A, each key is less than 1K and each value is about 20M. 2. Occasionally, I will run analysis by using a different type of data (usually less than 10G, and let's call it B) and do look-up table alike operations by using the values in A. B resides in HDFS as well. 3. This analysis would require loading only a small number of values from A (usually less than 1000 of them) into the memory for fast look-up against the data in B. The way B finds the few values in A is by looking up for the key in A. About 1000 such rows would equal a memory expense of near 20 GB, given the value size of A you've noted above. The solution may need to be considered with this in mind, if the whole lookup table is to be eventually generated into the memory and never discarded until the end of processing. Is there an efficient way to do this? Since HBase may be too much for your simple needs, have you instead considered using MapFiles, which allow fast key lookups at a file level over HDFS/MR? You can have these files either highly replicated (if their size is large), or distributed via the distributed cache in the lookup jobs (if they are infrequently used and small sized), and be able to use the MapFile reader API to perform lookups of keys and read values only when you want them. I was thinking if I could identify the locality of the block that contains the few values, I might be able to push the B into the few nodes that contains the few values in A? Since I only need to do this occasionally, maintaining a distributed database such as HBase cant be justified. I agree that HBase may not be wholly suited to be run just for this purpose (unless A's also gonna be scaling over time). Maintaining value - locality mapping would need to be done by you. FS APIs provide locality info calls, and your files may be key-partitioned enough to identify each one's range, and you can combine the knowledge of these two to do something along these lines. 
Using HBase may also turn out to be easier, but thats upto you. You can also choose to tear it down (i.e. the services) when not needed, btw. Many thanks. Cao -- Harsh J -- Harsh J
Re: Delivery Status Notification (Failure)
Cos, I understand that there are rules. What are these rules? Is it Hive vs Hadoop (this I understand) or Apache Hadoop vs a specific distribution? (This I am not clear about.) Sent from my iPad. Please excuse the typos. On Feb 12, 2013, at 8:56 PM, Konstantin Boudnik c...@apache.org wrote: With all due respect, sir, these mailing lists have certain rules, which evidently don't coincide with your philosophy. Cos On Tue, Feb 12, 2013 at 08:45PM, Raj Vishwanathan wrote: Arun, I don't understand your reply! Had you redirected this person to the Hive mailing list I would have understood. My philosophy on any mailing list has always been: if I know the answer to a question, I reply; else I humbly walk away! I got a lot of help from this group for my (mostly stupid) questions - and people helped me. I would like to return the favor when (and if) I can. My humble $0.01. :-) And for the record, I don't know the answer to the question on MicroStrategy :-) Raj From: Arun C Murthy a...@hortonworks.com To: user@hadoop.apache.org Sent: Tuesday, February 12, 2013 6:42 PM Subject: Re: Delivery Status Notification (Failure) Pls don't cross-post, this belongs only on the CDH lists. On Feb 12, 2013, at 12:55 AM, samir das mohapatra wrote: Hi All, I wanted to know how to connect Hive (hadoop-cdh4 distribution) with MicroStrategy. Any help is very helpful. Waiting for your response. Note: It is a little bit urgent; does anyone have experience with that? Thanks, samir -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/