Re: Hadoop integration with SAS
R has a connector for Hadoop, if that helps. From: jonathan.hw...@accenture.com To: common-user@hadoop.apache.org Sent: Tuesday, 23 August 2011 2:21 PM Subject: Hadoop integration with SAS Has anyone worked on Hadoop data integration with SAS? Does SAS have a connector to HDFS? Can it use data directly on HDFS? Any links, samples or tools? Thanks! Jonathan
MR job to copy to Hadoop
Hi, What is the best and fastest way to achieve a parallel copy into Hadoop from an NFS mount? We have a mount with a huge number of files and we need to copy it into HDFS. Some options: 1. Run copyFromLocal in a multithreaded way. 2. Use distcp in an isolated way. 3. Can I write a map-only job to do the copy (see the sketch below)? Regards, JD
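[Editor's note: a minimal sketch of option 3, a map-only copy job. It assumes the NFS mount is visible at the same local path on every tasktracker node and that the job input is a text file listing one absolute source path per line; the class name and the /data target directory are illustrative, not from the thread. distcp with a file:// source works on the same principle.]

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NfsCopyJob {
  public static class NfsCopyMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      Configuration conf = ctx.getConfiguration();
      Path src = new Path("file://" + value.toString().trim()); // local NFS file
      Path dst = new Path("/data/" + src.getName());            // HDFS target dir
      FileSystem srcFs = src.getFileSystem(conf);
      FileSystem dstFs = dst.getFileSystem(conf);
      // Plain stream copy; each mapper handles a slice of the listing file,
      // which is the parallelism a single copyFromLocal lacks.
      FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "nfs-to-hdfs copy");
    job.setJarByClass(NfsCopyJob.class);
    job.setMapperClass(NfsCopyMapper.class);
    job.setNumReduceTasks(0);                        // map-only
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // the path listing
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}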
Namenode Scalability
In my current project we are planning to stream data to the NameNode (20-node cluster). Data volume would be around 1 PB per day, but there are applications which can publish data at 1 Gbps. A few queries: 1. Can a single NameNode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in? 2. Can we have multiple federated NameNodes sharing the same slaves, so that we can distribute the writes accordingly? 3. Can multiple region servers of HBase help us? Please suggest how we can design the streaming part to handle such a scale of data. Regards, Jagaran Das
Re: Namenode Scalability
What would cause the name node to have a GC issue? - I am opening at most 5000 connections and writing continuously through those 5000 connections to 5000 files at a time. - The volume of data that I would write through those 5000 connections cannot be controlled, as it depends on the upstream applications that publish the data. Now if the heap memory nears its full size (say, M GB), then when the major GC cycle kicks in, the NameNode could stop responding for some time. This stop-the-world time should be directly proportional to the heap size. This may cause the data to back up in the streaming application's memory. As for our architecture: it has a cluster of JMS queues, and we have a multithreaded application that picks messages from the queue and streams them to the NameNode of a 20-node cluster using the exposed FileSystem API. BTW, in the real world, if you have a fast car you can race and win against a slow train; it all depends on what reference frame you are in :) Regards, Jagaran From: Michel Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org Cc: common-user@hadoop.apache.org; jagaran das jagaran_...@yahoo.co.in Sent: Wednesday, 10 August 2011 11:26 AM Subject: Re: Namenode Scalability So many questions, why stop there? First question... What would cause the name node to have a GC issue? Second question... You're streaming 1 PB a day. Is this a single stream of data? Are you writing this to one file before processing, or are you processing the data directly on the ingestion stream? Are you also filtering the data so that you are not saving all of it? This sounds more like a homework assignment than a real-world problem. I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-) Sent from a remote device. Please excuse any typos... Mike Segel On Aug 10, 2011, at 12:07 PM, jagaran das jagaran_...@yahoo.co.in wrote: To be precise, the projected data is around 1 PB, but the publishing rate is also around 1 Gbps. Please suggest. From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Wednesday, 10 August 2011 12:58 AM Subject: Namenode Scalability In my current project we are planning to stream data to the NameNode (20-node cluster). Data volume would be around 1 PB per day, but there are applications which can publish data at 1 Gbps. A few queries: 1. Can a single NameNode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in? 2. Can we have multiple federated NameNodes sharing the same slaves, so that we can distribute the writes accordingly? 3. Can multiple region servers of HBase help us? Please suggest how we can design the streaming part to handle such a scale of data. Regards, Jagaran Das
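[Editor's note: the stop-the-world concern above is about full collections of a large heap. A common mitigation on NameNodes of this era was the concurrent collector; a hedged hadoop-env.sh sketch follows. The 8 GB heap is a placeholder, not a sizing recommendation for this workload.]

# hadoop-env.sh: example NameNode GC settings (values are illustrative)
export HADOOP_NAMENODE_OPTS="-Xmx8g -XX:+UseConcMarkSweepGC \
  -XX:+CMSParallelRemarkEnabled -verbose:gc -XX:+PrintGCDetails \
  $HADOOP_NAMENODE_OPTS"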
Re: java.io.IOException: config()
I am accessing through threads in parallel. What is the concept of a lease in HDFS? Regards, JD From: Harsh J ha...@cloudera.com To: jagaran das jagaran_...@yahoo.co.in Sent: Friday, 5 August 2011 11:37 PM Subject: Re: java.io.IOException: config() How long are you keeping it open for? On 06-Aug-2011, at 10:14 AM, jagaran das wrote: Hi, I am using CDH3. I need to stream a huge amount of data from our application to Hadoop. I am opening a connection like:

config.set("fs.default.name", hdfsURI);
FileSystem dfs = FileSystem.get(config);
String path = hdfsURI + connectionKey;
Path destPath = new Path(path);
logger.debug("Path -- " + destPath.getName());
outStream = dfs.create(destPath);

and keeping the outStream open for some time, writing continuously through it, and then closing it. But it is throwing: 5Aug2011 21:36:48,550 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_218151655, ugi=jagarandas]: java.lang.Throwable: for testing at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181) at org.apache.hadoop.util.Daemon.<init>(Daemon.java:38) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:219) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:584) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:565) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:472) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:464) at com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:66) at com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:93) at com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41) at com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61) at com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276) at com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93) at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506) at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463) at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435) at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322) at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260) at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944) at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868) at java.lang.Thread.run(Thread.java:680) ] (RPC.java:230) - Call: renewLease 4 05Aug2011 21:36:48,550 DEBUG [listenerContainer-1] (DFSClient.java:3274) - DFSClient writeChunk allocating new packet seqno=0, src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819, packetSize=65557, chunksPerPacket=127, bytesCurBlock=0 05Aug2011 21:36:48,551 DEBUG [Thread-11] (DFSClient.java:2499) - 
Allocating new block 05Aug2011 21:36:48,552 DEBUG [sendParams-0] (Client.java:761) - IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas sending #3 05Aug2011 21:36:48,553 DEBUG [IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas] (Client.java:815) - IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas got value #3 05Aug2011 21:36:48,556 DEBUG [Thread-11] (RPC.java:230) - Call: addBlock 4 05Aug2011 21:36:48,557 DEBUG [Thread-11] (DFSClient.java:3094) - pipeline = 127.0.0.1:50010 05Aug2011 21:36:48,557 DEBUG [Thread-11] (DFSClient.java:3102) - Connecting to 127.0.0.1:50010 05Aug2011 21:36:48,559 DEBUG [Thread-11] (DFSClient.java:3109) - Send buf size 131072 05Aug2011 21:36:48,635 DEBUG [DataStreamer for file /home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819 block blk_-5183404460805094255_1042] (DFSClient.java:2533) - DataStreamer block blk_-5183404460805094255_1042 wrote packet seqno:0 size:1522 offsetInBlock:0 lastPacketInBlock:true 05Aug2011 21:36:48,638 DEBUG [ResponseProcessor for block blk_-5183404460805094255_1042] (DFSClient.java:2640) - DFSClient Replies for seqno 0 are SUCCESS 05Aug2011 21:36:48,639 DEBUG [DataStreamer for file /home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819 block blk_-5183404460805094255_1042
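[Editor's note: on the lease question above: a lease is HDFS's exclusive write lock on a path, held by the client while its output stream is open and renewed periodically by the client-side LeaseChecker daemon; the renewLease call in the pasted log is that renewal, logged at DEBUG, not a failure. A minimal sketch of the per-thread write pattern follows; hdfsURI, connectionKey, and payload are stand-ins for values from the thread's own code.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamWriteSketch {
  static void writeOnce(String hdfsURI, String connectionKey, byte[] payload)
      throws Exception {
    Configuration config = new Configuration();
    config.set("fs.default.name", hdfsURI);
    FileSystem dfs = FileSystem.get(config);
    FSDataOutputStream out = dfs.create(new Path(hdfsURI + connectionKey));
    try {
      out.write(payload);   // repeat while data keeps arriving
      out.sync();           // flush to datanodes (0.20/CDH3-era API)
    } finally {
      out.close();          // closing the stream releases the lease
    }
  }
}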
Help on DFSClient
I am keeping a stream open and writing through it using a multithreaded application. The application is in a different box and I am connecting to the NN remotely. I was using FileSystem and getting the same error, and now I am trying DFSClient and getting the same error. When I run it via a simple standalone class it does not throw any error, but when I put it in my application, it throws this error. Please help me with this. Regards, JD

public String toString() {
  String s = getClass().getSimpleName();
  if (LOG.isTraceEnabled()) {
    return s + "@" + DFSClient.this + ": "
        + StringUtils.stringifyException(new Throwable("for testing"));
  }
  return s;
}

My stack trace: 06Aug2011 12:29:24,345 DEBUG [listenerContainer-1] (DFSClient.java:1115) - Wait for lease checker to terminate 06Aug2011 12:29:24,346 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_280246853, ugi=jagarandas]: java.lang.Throwable: for testing at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181) at org.apache.hadoop.util.Daemon.<init>(Daemon.java:38) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:513) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:497) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:442) at com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:74) at com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:95) at com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41) at com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61) at com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276) at com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93) at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506) at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463) at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435) at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322) at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260) at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944) at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868) at java.lang.Thread.run(Thread.java:680)
NameNode Profiling Tools
Hi, Please suggest what would be the best way to profile the NameNode. Any specific tools? We would be streaming transaction data to the NameNode continuously, using around 2000 concurrent threads. Size is around 300 KB/transaction. I am using DataInputStream and writing continuously through each of the 2000 connections for 5 mins, then closing them and opening 2000 new connections. Any benchmarks on CPU and memory utilization of the NameNode? My NameNode box config: HP DL360 G7, 2 x 2.66 GHz CPUs, 72 GB RAM, 8 x 300 GB drives. Regards, JD
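[Editor's note: a common low-friction approach is exposing the NameNode JVM over JMX and sampling GC with jstat; a hedged sketch follows. The port and the disabled auth/SSL settings are illustrative only and not safe defaults for production.]

# hadoop-env.sh: let jconsole/VisualVM attach to the NameNode remotely
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false $HADOOP_NAMENODE_OPTS"

# Quick GC/heap sampling without a UI: print utilization every second
jstat -gcutil <namenode-pid> 1000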
java.io.IOException: config()
Hi, I have been stuck with this exception: java.io.IOException: config() at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198) at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:99) at test.TestApp.main(TestApp.java:19) 05Aug2011 20:08:53,303 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_-1591195062, ugi=jagarandas,staff,com.apple.sharepoint.group.1,_developer,_lpoperator,_lpadmin,_appserveradm,admin,_appserverusr,localaccounts,everyone,fmsadmin,com.apple.access_screensharing,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3]: java.lang.Throwable: for testing 05Aug2011 20:08:53,315 DEBUG [listenerContainer-1] (DFSClient.java:3012) - DFSClient writeChunk allocating new packet seqno=0, src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_222812011-08-05-20-08-52, packetSize=65557, chunksPerPacket=127, bytesCurBlock=0 I saw the source code:

public Configuration(boolean loadDefaults) {
  this.loadDefaults = loadDefaults;
  if (LOG.isDebugEnabled()) {
    LOG.debug(StringUtils.stringifyException(new IOException("config()")));
  }
  synchronized(Configuration.class) {
    REGISTRY.put(this, null);
  }
}

The log is in debug mode. Can anyone please help me with this? Regards, JD
Re: java.io.IOException: config() IMP
:8020 from jagarandas got value #4 05Aug2011 21:36:48,648 DEBUG [listenerContainer-1] (RPC.java:230) - Call: complete 3 Please help, as it is a production enhancement for us. Regards Jagaran From: Harsh J ha...@cloudera.com To: u...@pig.apache.org; jagaran das jagaran_...@yahoo.co.in Sent: Friday, 5 August 2011 8:54 PM Subject: Re: java.io.IOException: config() Could you explain how/where you're stuck? That DEBUG log doesn't even seem like a valid throw; it's just there to get a stack trace, I believe. On Sat, Aug 6, 2011 at 8:52 AM, jagaran das jagaran_...@yahoo.co.in wrote: Hi, I have been stuck with this exception: java.io.IOException: config() at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198) at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:99) at test.TestApp.main(TestApp.java:19) 05Aug2011 20:08:53,303 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_-1591195062, ugi=jagarandas,staff,com.apple.sharepoint.group.1,_developer,_lpoperator,_lpadmin,_appserveradm,admin,_appserverusr,localaccounts,everyone,fmsadmin,com.apple.access_screensharing,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3]: java.lang.Throwable: for testing 05Aug2011 20:08:53,315 DEBUG [listenerContainer-1] (DFSClient.java:3012) - DFSClient writeChunk allocating new packet seqno=0, src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_222812011-08-05-20-08-52, packetSize=65557, chunksPerPacket=127, bytesCurBlock=0 I saw the source code:

public Configuration(boolean loadDefaults) {
  this.loadDefaults = loadDefaults;
  if (LOG.isDebugEnabled()) {
    LOG.debug(StringUtils.stringifyException(new IOException("config()")));
  }
  synchronized(Configuration.class) {
    REGISTRY.put(this, null);
  }
}

The log is in debug mode. Can anyone please help me with this? Regards, JD -- Harsh J
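[Editor's note: since, as Harsh says, the "exception" is only a stringified throwable logged at DEBUG, raising the log level for the noisy classes silences it without turning DEBUG off globally. A hedged log4j.properties sketch; the category names match the classes in the pasted logs.]

# log4j.properties: quiet the synthetic config()/lease-checker DEBUG traces
log4j.logger.org.apache.hadoop.conf.Configuration=INFO
log4j.logger.org.apache.hadoop.hdfs.DFSClient=INFO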
Max Number of Open Connections
Hi, What is the max number of open connections to a NameNode? I am using FSDataOutputStream out = dfs.create(src); Cheers, JD
DFSClient Protocol and FileSystem class
What is the difference between the DFSClient protocol and the FileSystem class in Hadoop DFS (HDFS)? Both of these classes are used for connecting a remote client to the NameNode in HDFS, so I wanted to know the advantages of one over the other and which one is suitable for a remote-client connection. Regards, JD
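[Editor's note: FileSystem is the public, stable abstraction; DFSClient is an internal class that the HDFS implementation (DistributedFileSystem) delegates to, as the stack traces elsewhere in this digest show. A minimal sketch of the recommended client-side usage; the URI and path are illustrative.]

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // For an hdfs:// URI this instantiates DistributedFileSystem,
    // which owns the DFSClient internally.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"));
    out.writeUTF("hello");
    out.close();
    fs.close();
  }
}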
Hadoop Production Issue
Hi, Due to requirements in our current production CDH3 cluster, we need to copy around 11,520 small files (total size 12 GB) to the cluster for one application. We have 20 such applications that would run in parallel, so one set would have 11,520 files of total size 12 GB, and we would have 15 sets in parallel. The total SLA for the pipeline, from copy through Pig aggregation to copy-to-local and SQL load, is 15 minutes. What we do: 1. Merge files so that we get rid of small files. This is a huge time hit for the process; do we have any other option? 2. Copy to the cluster. 3. Execute the Pig job. 4. Copy to local. 5. SQL loader. Can we perform the merge and the copy to the cluster from a host other than the NameNode? We want an out-of-cluster machine running a Java process that would: 1. Run periodically. 2. Merge files. 3. Copy to the cluster. Secondly, can we append to an existing file in the cluster? Please share your thoughts, as maintaining the SLA is becoming tough. Regards, Jagaran
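[Editor's note: the merge-and-copy step can indeed run on any edge node that has the Hadoop client configuration; it does not need to run on the NameNode. A hedged sketch using FileUtil.copyMerge, which concatenates a directory of small files into one HDFS file in a single pass; the paths and NameNode URI are illustrative.]

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeAndLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem localFs = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    // Concatenate every file under the local staging dir into one HDFS file.
    FileUtil.copyMerge(localFs, new Path("/staging/app1"),
                       hdfs, new Path("/data/app1/merged.dat"),
                       false /* keep sources */, conf, null /* no separator */);
  }
}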
Re: Any reason Hadoop logs can't be directed to a separate filesystem?
Yeah, that's what we do. But it's again an extra process; if Hadoop had this ability built in, it would be great. It uses log4j; I tried to tweak it, but it throws an error. Regards, Jagaran From: Michael Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org Sent: Sat, 25 June, 2011 3:58:19 AM Subject: RE: Any reason Hadoop logs can't be directed to a separate filesystem? Yes, and it's called using cron and writing a simple ksh script to clear out any files that are older than 15 days. There may be another way, but that's really the easiest. Date: Thu, 23 Jun 2011 02:44:48 +0530 From: jagaran_...@yahoo.co.in Subject: Re: Any reason Hadoop logs can't be directed to a separate filesystem? To: common-user@hadoop.apache.org Hi, Can I limit the log file retention? I want to keep files for the last 15 days only. Regards, Jagaran From: Jack Craig jcr...@carrieriq.com To: common-user@hadoop.apache.org Sent: Wed, 22 June, 2011 2:00:23 PM Subject: Re: Any reason Hadoop logs can't be directed to a separate filesystem? Thx to both respondents. Note I've not tried this redirection, as I have only production grids available. Our grids are growing and, with them, log volume. Until now the log location has been in the same filesystem as the grid data, so running out of space due to log bloat is a growing problem. From your replies it sounds like I can relocate my logs. Cool! But now the tough question: if I set up too small a partition and it runs out of space, will my grid become unstable if Hadoop can no longer write to its logs? Thx again, jackc... Jack Craig, Operations CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 On Jun 22, 2011, at 1:09 PM, Harsh J wrote: Jack, I believe the location can definitely be set to any desired path. Could you tell us the issues you face when you change it? P.s. The env var is used to set the config property hadoop.log.dir internally, so as long as you use the regular scripts (bin/ or init.d/ ones) to start daemons, it will apply fine. On Thu, Jun 23, 2011 at 1:32 AM, Jack Craig jcr...@carrieriq.com wrote: Hi Folks, In hadoop-env.sh we find: ... # Where log files are stored. $HADOOP_HOME/logs by default. # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs Is there any reason this location could not be a separate filesystem on the name node? Thx, jackc... Jack Craig, Operations CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 -- Harsh J
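[Editor's note: spelling out Mike's cron suggestion as a concrete sketch, since the thread never shows it. The schedule and the assumption that logs live under the default ${HADOOP_HOME}/logs are illustrative; adjust the path if HADOOP_LOG_DIR was relocated as discussed above.]

# crontab entry on each node: at 02:00 daily, delete daemon logs older
# than 15 days
0 2 * * * find ${HADOOP_HOME}/logs -type f -mtime +15 -delete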
Re: Automatic Configuration of Hadoop Clusters
Puppetize. From: gokul gokraz...@gmail.com To: common-user@hadoop.apache.org Sent: Wed, 22 June, 2011 8:38:13 AM Subject: Automatic Configuration of Hadoop Clusters Dear all, for benchmarking purposes we would like to adjust configurations as well as flexibly add/remove machines from our Hadoop clusters. Is there any framework around that allows this in an easy manner, without having to manually distribute the changed configuration files? We are considering writing a bash script for that purpose, but hope that there is a tool out there that saves us the work. Thanks in advance, Gokul
Re: Any reason Hadoop logs can't be directed to a separate filesystem?
Hi, Can I limit the log file retention? I want to keep files for the last 15 days only. Regards, Jagaran From: Jack Craig jcr...@carrieriq.com To: common-user@hadoop.apache.org Sent: Wed, 22 June, 2011 2:00:23 PM Subject: Re: Any reason Hadoop logs can't be directed to a separate filesystem? Thx to both respondents. Note I've not tried this redirection, as I have only production grids available. Our grids are growing and, with them, log volume. Until now the log location has been in the same filesystem as the grid data, so running out of space due to log bloat is a growing problem. From your replies it sounds like I can relocate my logs. Cool! But now the tough question: if I set up too small a partition and it runs out of space, will my grid become unstable if Hadoop can no longer write to its logs? Thx again, jackc... Jack Craig, Operations CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 On Jun 22, 2011, at 1:09 PM, Harsh J wrote: Jack, I believe the location can definitely be set to any desired path. Could you tell us the issues you face when you change it? P.s. The env var is used to set the config property hadoop.log.dir internally, so as long as you use the regular scripts (bin/ or init.d/ ones) to start daemons, it will apply fine. On Thu, Jun 23, 2011 at 1:32 AM, Jack Craig jcr...@carrieriq.com wrote: Hi Folks, In hadoop-env.sh we find: ... # Where log files are stored. $HADOOP_HOME/logs by default. # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs Is there any reason this location could not be a separate filesystem on the name node? Thx, jackc... Jack Craig, Operations CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 -- Harsh J
Re: Append to Existing File
Hi All, Does CDH3 support appending to an existing file? Regards, Jagaran From: Eric Charles eric.char...@u-mangate.com To: common-user@hadoop.apache.org Sent: Tue, 21 June, 2011 3:53:33 AM Subject: Re: Append to Existing File When you say bugs pending, are you referring to HDFS-265 (which links to HDFS-1060, HADOOP-6239 and HDFS-744)? Are there other issues related to append than the ones above? Tks, Eric https://issues.apache.org/jira/browse/HDFS-265 On 21/06/11 12:36, madhu phatak wrote: It's not stable. There are some bugs pending. According to one of the discussions, to date append is not ready for production. On Tue, Jun 14, 2011 at 12:19 AM, jagaran das jagaran_...@yahoo.co.in wrote: I am using the hadoop-0.20.203.0 version. I have set dfs.support.append to true and am using the append method. It is working, but I need to know how stable it is to deploy and use in production clusters. Regards, Jagaran From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Mon, 13 June, 2011 11:07:57 AM Subject: Append to Existing File Hi All, Is appending to an existing file now supported in Hadoop for production clusters? If yes, please let me know which version and how. Thanks Jagaran -- Eric
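[Editor's note: for reference, the flag the thread keeps mentioning lives in hdfs-site.xml. Enabling it only exposes the append API; it does not make append production-safe by itself, as the caveats throughout this thread make clear.]

<!-- hdfs-site.xml -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>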
Fw: HDFS File Appending URGENT
Please help me on this. I need it very urgently. Regards, Jagaran - Forwarded Message From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 9:51:51 PM Subject: Re: HDFS File Appending URGENT Thanks a lot, Xiaobo. I have tried the below code on HDFS version 0.20.20 and it worked. Is it not stable yet?

public class HadoopFileWriter {
  public static void main(String[] args) throws Exception {
    try {
      URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");
      Path pt = new Path(uri);
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedWriter br;
      if (fs.isFile(pt)) {
        br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
        br.newLine();
      } else {
        br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
      }
      String line = args[0];
      System.out.println(line);
      br.write(line);
      br.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("File not found");
    }
  }
}

Thanks a lot for your help. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 8:01:14 PM Subject: Re: HDFS File Appending URGENT You can merge multiple files into a new one; there is no means to append to an existing file. On Fri, Jun 17, 2011 at 10:29 AM, jagaran das jagaran_...@yahoo.co.in wrote: Is this the Hadoop 0.20.203.0 API? Does that mean files in HDFS version 0.20.20 are still immutable? And is there no means to append to an existing file in HDFS? We need to do this urgently, as we have to set up the pipeline accordingly in production. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 6:26:45 PM Subject: Re: HDFS File Appending please refer to FileUtil.copyMerge On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote: Hi, We have a requirement where there would be a huge number of small files to be pushed to HDFS, which we then analyze with Pig. To get around the classic small-files issue we merge the files and push a bigger file into HDFS, but we are losing time in this merging step of our pipeline. If we could directly append to an existing file in HDFS, we could save this file-merging time. Can you please suggest whether there is a newer stable version of Hadoop where we can go for appending? Thanks and Regards, Jagaran
Re: HDFS File Appending URGENT
Thanks a lot, guys. Another query for production: do we have any way to purge the HDFS job and history logs on a time basis? For example, we want to keep only the last 30 days of logs, and their size is increasing a lot in production. Thanks again. Regards, Jagaran From: Tsz Wo (Nicholas), Sze s29752-hadoopu...@yahoo.com To: common-user@hadoop.apache.org Sent: Fri, 17 June, 2011 11:45:22 AM Subject: Re: HDFS File Appending URGENT Hi Jagaran, Short answer: the append feature is not in any release. In this sense, it is not stable. Below are more details on the append feature status. - 0.20.x (includes release 0.20.2): There are known bugs in append. The bugs may cause data loss. - 0.20-append: There was an effort to fix the known append bugs, but there are no releases. I heard Facebook was using it (with additional patches?) in production, but I do not have the details. - 0.21: It has a new append design (HDFS-265). However, the 0.21.0 release is only a minor release. It has not undergone testing at scale and should not be considered stable or suitable for production. Also, 0.21 development has been discontinued; newly discovered bugs may not be fixed. - 0.22, 0.23: Not yet released. Regards, Tsz-Wo From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Fri, June 17, 2011 11:15:04 AM Subject: Fw: HDFS File Appending URGENT Please help me on this. I need it very urgently. Regards, Jagaran - Forwarded Message From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 9:51:51 PM Subject: Re: HDFS File Appending URGENT Thanks a lot, Xiaobo. I have tried the below code on HDFS version 0.20.20 and it worked. Is it not stable yet?

public class HadoopFileWriter {
  public static void main(String[] args) throws Exception {
    try {
      URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");
      Path pt = new Path(uri);
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedWriter br;
      if (fs.isFile(pt)) {
        br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
        br.newLine();
      } else {
        br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
      }
      String line = args[0];
      System.out.println(line);
      br.write(line);
      br.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("File not found");
    }
  }
}

Thanks a lot for your help. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 8:01:14 PM Subject: Re: HDFS File Appending URGENT You can merge multiple files into a new one; there is no means to append to an existing file. On Fri, Jun 17, 2011 at 10:29 AM, jagaran das jagaran_...@yahoo.co.in wrote: Is this the Hadoop 0.20.203.0 API? Does that mean files in HDFS version 0.20.20 are still immutable? And is there no means to append to an existing file in HDFS? We need to do this urgently, as we have to set up the pipeline accordingly in production. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 6:26:45 PM Subject: Re: HDFS File Appending please refer to FileUtil.copyMerge On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote: Hi, We have a requirement where there would be a huge number of small files to be pushed to HDFS, which we then analyze with Pig. To get around the classic small-files issue we merge the files and push a bigger file into HDFS, but we are losing time in this merging step of our pipeline. If we could directly append to an existing file in HDFS, we could save this file-merging time. Can you please suggest whether there is a newer stable version of Hadoop where we can go for appending? Thanks and Regards, Jagaran
HDFS File Appending
Hi, We have a requirement where there would be a huge number of small files to be pushed to HDFS, which we then analyze with Pig. To get around the classic small-files issue we merge the files and push a bigger file into HDFS, but we are losing time in this merging step of our pipeline. If we could directly append to an existing file in HDFS, we could save this file-merging time. Can you please suggest whether there is a newer stable version of Hadoop where we can go for appending? Thanks and Regards, Jagaran
Re: HDFS File Appending URGENT
Is this the Hadoop 0.20.203.0 API? Does that mean files in HDFS version 0.20.20 are still immutable? And is there no means to append to an existing file in HDFS? We need to do this urgently, as we have to set up the pipeline accordingly in production. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 6:26:45 PM Subject: Re: HDFS File Appending please refer to FileUtil.copyMerge On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote: Hi, We have a requirement where there would be a huge number of small files to be pushed to HDFS, which we then analyze with Pig. To get around the classic small-files issue we merge the files and push a bigger file into HDFS, but we are losing time in this merging step of our pipeline. If we could directly append to an existing file in HDFS, we could save this file-merging time. Can you please suggest whether there is a newer stable version of Hadoop where we can go for appending? Thanks and Regards, Jagaran
Re: HDFS File Appending URGENT
Thanks a lot, Xiaobo. I have tried the below code on HDFS version 0.20.20 and it worked. Is it not stable yet?

public class HadoopFileWriter {
  public static void main(String[] args) throws Exception {
    try {
      URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");
      Path pt = new Path(uri);
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedWriter br;
      if (fs.isFile(pt)) {
        br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
        br.newLine();
      } else {
        br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
      }
      String line = args[0];
      System.out.println(line);
      br.write(line);
      br.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("File not found");
    }
  }
}

Thanks a lot for your help. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 8:01:14 PM Subject: Re: HDFS File Appending URGENT You can merge multiple files into a new one; there is no means to append to an existing file. On Fri, Jun 17, 2011 at 10:29 AM, jagaran das jagaran_...@yahoo.co.in wrote: Is this the Hadoop 0.20.203.0 API? Does that mean files in HDFS version 0.20.20 are still immutable? And is there no means to append to an existing file in HDFS? We need to do this urgently, as we have to set up the pipeline accordingly in production. Regards, Jagaran From: Xiaobo Gu guxiaobo1...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 16 June, 2011 6:26:45 PM Subject: Re: HDFS File Appending please refer to FileUtil.copyMerge On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote: Hi, We have a requirement where there would be a huge number of small files to be pushed to HDFS, which we then analyze with Pig. To get around the classic small-files issue we merge the files and push a bigger file into HDFS, but we are losing time in this merging step of our pipeline. If we could directly append to an existing file in HDFS, we could save this file-merging time. Can you please suggest whether there is a newer stable version of Hadoop where we can go for appending? Thanks and Regards, Jagaran
Append to Existing File
Hi All, Is appending to an existing file now supported in Hadoop for production clusters? If yes, please let me know which version and how. Thanks Jagaran
Re: Append to Existing File
I am using the hadoop-0.20.203.0 version. I have set dfs.support.append to true and am using the append method. It is working, but I need to know how stable it is to deploy and use in production clusters. Regards, Jagaran From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Mon, 13 June, 2011 11:07:57 AM Subject: Append to Existing File Hi All, Is appending to an existing file now supported in Hadoop for production clusters? If yes, please let me know which version and how. Thanks Jagaran
Re: NameNode is starting with exceptions whenever it's trying to start datanodes
Check two things: 1. Some of your data nodes are getting connected; that means passwordless SSH is not working within the nodes. 2. Then clear the dir where your data is persisted on the data nodes and format the namenode. It should definitely work then. Cheers, Jagaran From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 3:14:01 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes But I don't have any data on my HDFS. I had some data before, but now I have deleted all the files from HDFS. I don't know why the datanodes are taking time to start; I guess because of this exception it's taking more time to start. On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote: On 06/07/2011 10:50 AM, praveenesh kumar wrote: The logs say "The ratio of reported blocks 0.9091 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Not enough datanodes reported in, or they are missing data.
Re: NameNode is starting with exceptions whenever it's trying to start datanodes
Sorry, I mean: some of your data nodes are NOT getting connected. From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 10:45:59 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes Check two things: 1. Some of your data nodes are getting connected; that means passwordless SSH is not working within the nodes. 2. Then clear the dir where your data is persisted on the data nodes and format the namenode. It should definitely work then. Cheers, Jagaran From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 3:14:01 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes But I don't have any data on my HDFS. I had some data before, but now I have deleted all the files from HDFS. I don't know why the datanodes are taking time to start; I guess because of this exception it's taking more time to start. On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote: On 06/07/2011 10:50 AM, praveenesh kumar wrote: The logs say "The ratio of reported blocks 0.9091 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Not enough datanodes reported in, or they are missing data.
Re: NameNode is starting with exceptions whenever it's trying to start datanodes
Yes, correct. Passwordless SSH between your name node and some of your datanodes is not working. From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 10:56:08 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes "1. Some of your data nodes are getting connected; that means passwordless SSH is not working within the nodes." So you mean that passwordless SSH should be there among datanodes also? In Hadoop we used to do passwordless SSH from the namenode to the data nodes. Do we have to do passwordless SSH among the datanodes also? On Tue, Jun 7, 2011 at 11:15 PM, jagaran das jagaran_...@yahoo.co.in wrote: Check two things: 1. Some of your data nodes are getting connected; that means passwordless SSH is not working within the nodes. 2. Then clear the dir where your data is persisted on the data nodes and format the namenode. It should definitely work then. Cheers, Jagaran From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 3:14:01 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes But I don't have any data on my HDFS. I had some data before, but now I have deleted all the files from HDFS. I don't know why the datanodes are taking time to start; I guess because of this exception it's taking more time to start. On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote: On 06/07/2011 10:50 AM, praveenesh kumar wrote: The logs say "The ratio of reported blocks 0.9091 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Not enough datanodes reported in, or they are missing data.
Re: NameNode is starting with exceptions whenever it's trying to start datanodes
Cleaning the data from the data dir of the datanodes and formatting the name node may help you. From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 11:05:03 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes "Sorry, I mean: some of your data nodes are not getting connected." So are you sticking with the solution you gave me, to go for passwordless SSH for all datanodes? Because on my Hadoop, all datanodes are running fine. On Tue, Jun 7, 2011 at 11:32 PM, jagaran das jagaran_...@yahoo.co.in wrote: Sorry, I mean: some of your data nodes are NOT getting connected. From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 10:45:59 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes Check two things: 1. Some of your data nodes are getting connected; that means passwordless SSH is not working within the nodes. 2. Then clear the dir where your data is persisted on the data nodes and format the namenode. It should definitely work then. Cheers, Jagaran From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 3:14:01 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes But I don't have any data on my HDFS. I had some data before, but now I have deleted all the files from HDFS. I don't know why the datanodes are taking time to start; I guess because of this exception it's taking more time to start. On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote: On 06/07/2011 10:50 AM, praveenesh kumar wrote: The logs say "The ratio of reported blocks 0.9091 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Not enough datanodes reported in, or they are missing data.
Re: NameNode is starting with exceptions whenever it's trying to start datanodes
I mean running rm -rf * in the datanode data dir. These are the debugging steps that I followed. From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 11:19:50 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes How shall I clean my data dir? By cleaning the data dir, do you mean deleting all files from HDFS? Is there any special command to clean all the datanodes in one step? On Tue, Jun 7, 2011 at 11:46 PM, jagaran das jagaran_...@yahoo.co.in wrote: Cleaning the data from the data dir of the datanodes and formatting the name node may help you. From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 11:05:03 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes "Sorry, I mean: some of your data nodes are not getting connected." So are you sticking with the solution you gave me, to go for passwordless SSH for all datanodes? Because on my Hadoop, all datanodes are running fine. On Tue, Jun 7, 2011 at 11:32 PM, jagaran das jagaran_...@yahoo.co.in wrote: Sorry, I mean: some of your data nodes are NOT getting connected. From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 10:45:59 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes Check two things: 1. Some of your data nodes are getting connected; that means passwordless SSH is not working within the nodes. 2. Then clear the dir where your data is persisted on the data nodes and format the namenode. It should definitely work then. Cheers, Jagaran From: praveenesh kumar praveen...@gmail.com To: common-user@hadoop.apache.org Sent: Tue, 7 June, 2011 3:14:01 AM Subject: Re: NameNode is starting with exceptions whenever it's trying to start datanodes But I don't have any data on my HDFS. I had some data before, but now I have deleted all the files from HDFS. I don't know why the datanodes are taking time to start; I guess because of this exception it's taking more time to start. On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote: On 06/07/2011 10:50 AM, praveenesh kumar wrote: The logs say "The ratio of reported blocks 0.9091 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Not enough datanodes reported in, or they are missing data.
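[Editor's note: spelling out the "clear the dir and reformat" advice as concrete steps, since the thread never does. This is destructive and discards all HDFS data, so it only fits a cluster whose data you can throw away; the directory paths are whatever dfs.name.dir and dfs.data.dir point to in your hdfs-site.xml.]

bin/stop-all.sh
rm -rf /path/to/dfs.data.dir/*   # on every datanode
rm -rf /path/to/dfs.name.dir/*   # on the namenode
bin/hadoop namenode -format
bin/start-all.sh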
Re: Adding first datanode isn't working
Check whether passwordless SSH is working or not. Regards, Jagaran From: MilleBii mille...@gmail.com To: common-user@hadoop.apache.org Sent: Wed, 1 June, 2011 12:28:54 PM Subject: Adding first datanode isn't working Newbie on Hadoop clusters. I have set up my two-node conf as described by M. G. Noll: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ The data node has datanode and tasktracker running (the jps command shows them), which means start-dfs.sh and start-mapred.sh worked fine. I can also shut them down gracefully. However, in the web UI I only see one node for the DFS: Live Nodes: 1, Dead Nodes: 0. Same thing on the MapRed web interface. Datanode logs on the slave are just empty. I did check the network settings; both nodes have access to each other on the relevant ports. I did make sure the namespaceIDs are the same (https://issues.apache.org/jira/browse/HDFS-107). I did try to put data in the DFS; it worked, but no data seemed to arrive on the slave datanode. Also tried a small MapRed job; only the master node has actually been working, but that could be because there is only data on the master. Right? -- -MilleBii-
Re: Adding first datanode isn't working
ufw. From: MilleBii mille...@gmail.com To: common-user@hadoop.apache.org Sent: Wed, 1 June, 2011 3:37:23 PM Subject: Re: Adding first datanode isn't working OK, found my issue. Turned off ufw and it sees the datanode. So I need to fix my ufw setup. 2011/6/1 MilleBii mille...@gmail.com Thx, already did that, so I can ssh passphraseless master to master and master to slave1. Same as before: datanode and tasktracker are starting up/shutting down well on slave1. 2011/6/1 jagaran das jagaran_...@yahoo.co.in Check whether passwordless SSH is working or not. Regards, Jagaran From: MilleBii mille...@gmail.com To: common-user@hadoop.apache.org Sent: Wed, 1 June, 2011 12:28:54 PM Subject: Adding first datanode isn't working Newbie on Hadoop clusters. I have set up my two-node conf as described by M. G. Noll: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ The data node has datanode and tasktracker running (the jps command shows them), which means start-dfs.sh and start-mapred.sh worked fine. I can also shut them down gracefully. However, in the web UI I only see one node for the DFS: Live Nodes: 1, Dead Nodes: 0. Same thing on the MapRed web interface. Datanode logs on the slave are just empty. I did check the network settings; both nodes have access to each other on the relevant ports. I did make sure the namespaceIDs are the same (https://issues.apache.org/jira/browse/HDFS-107). I did try to put data in the DFS; it worked, but no data seemed to arrive on the slave datanode. Also tried a small MapRed job; only the master node has actually been working, but that could be because there is only data on the master. Right? -- -MilleBii- -- -MilleBii- -- -MilleBii-
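[Editor's note: rather than leaving ufw off, the fix is usually to allow the Hadoop ports from the cluster subnet. The ports below are the defaults used in the Michael Noll tutorial referenced above (fs.default.name on 54310, jobtracker on 54311, datanode transfer on 50010); they depend entirely on your conf and your subnet, so treat every value as an assumption to adjust.]

sudo ufw allow from 192.168.0.0/24 to any port 54310
sudo ufw allow from 192.168.0.0/24 to any port 54311
sudo ufw allow from 192.168.0.0/24 to any port 50010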
Re: Hadoop project - help needed
Hi, To be very precise, the input to the mapper should be the data you want to filter, on the basis of which you want to do the aggregation. The reducer is where you aggregate the output from the mapper. Check the WordCount example in Hadoop; it can help you understand the basic concepts (a trimmed sketch follows below). Cheers, Jagaran From: parismav paok_gate...@hotmail.com To: core-u...@hadoop.apache.org Sent: Tue, 31 May, 2011 8:35:27 AM Subject: Hadoop project - help needed Hello dear forum, I am working on a project on Apache Hadoop. I am totally new to this software and I need some help understanding the basic features! To sum up, for my project I have configured Hadoop so that it runs 3 datanodes on one machine. The project's main goal is to use both the Flickr API (flickr.com) libraries and the Hadoop libraries in Java, so that each one of the 3 datanodes chooses a Flickr group and returns photo info from that group. In order to do that, I have 3 Flickr accounts, each one with a different API key. I don't need any help on the Flickr side of the code, of course. But what I don't understand is how to use the Mapper and Reducer parts of the code. What input do I have to give the map() function? Do I have to contain this whole info-downloading process in the map() function? In a few words, how do I convert my code so that it runs distributedly on Hadoop? Thank you!
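[Editor's note: the WordCount shape Jagaran points to, trimmed to its essentials. The map() input is one record of your raw data (here a line of text, via TextInputFormat), and the reducer aggregates everything emitted under the same key. For the Flickr case the map input would be whatever record you fetch per photo, not the download process itself.]

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        ctx.write(word, ONE);    // emit (word, 1) per occurrence
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();  // aggregate per key
      ctx.write(key, new IntWritable(sum));
    }
  }
}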
Re: trying to select technology
Think of Lucene and Apache Solr. Cheers, Jagaran From: cs230 chintanjs...@gmail.com To: core-u...@hadoop.apache.org Sent: Tue, 31 May, 2011 10:50:49 AM Subject: trying to select technology Hello All, I am planning to start a project where I have to do extensive storage of XML and text files. On top of that, I have to implement an efficient algorithm for searching over thousands or millions of files, and also build some indexes to make search faster next time. I looked into the Oracle database but it delivers very poor results. Can I use Hadoop for this? Which Hadoop project would be the best fit? Is there anything from Google I can use? Thanks a lot in advance.
Re: Poor IO performance on a 10 node cluster.
Your font block size got increased dynamically, check core-site :) :) - Jagaran From: He Chen airb...@gmail.com To: common-user@hadoop.apache.org Sent: Mon, 30 May, 2011 11:39:35 AM Subject: Re: Poor IO performance on a 10 node cluster. Hi Gyuribácsi, I would suggest you divide the MapReduce program execution time into 3 parts. a) Map stage. In this stage, wc splits the input data and generates map tasks. Each map task processes one block (by default; you can change this in FileInputFormat.java). As Brian said, if you have a larger block size, you may have a smaller number of map tasks, and then probably less overhead. b) Reduce stage. 2) Shuffle phase: in this phase, each reduce task collects intermediate results from every node that has executed map tasks. Each reduce task can have many concurrent threads to obtain data (you can configure this in mapred-site.xml; it is mapreduce.reduce.shuffle.parallelcopies). But be careful about your data popularity. For example, suppose you have "Hadoop, Hadoop, Hadoop, hello". The default Hadoop partitioner will assign the 3 ("Hadoop", 1) key-value pairs to one node. Thus, if you have two nodes running reduce tasks, one of them will copy 3 times more data than the other. This will cause one node to be slower than the other. You may rewrite the partitioner. 3) Sort and reduce phase: I think the Hadoop UI will give you some hints about how long this phase takes. By dividing the MapReduce application into these 3 parts, you can easily find which one is your bottleneck and do some profiling. And I don't know why my font changed to this type. :( Hope this is helpful. Chen On Mon, May 30, 2011 at 12:32 PM, Harsh J ha...@cloudera.com wrote: Psst. The cats speak in their own language ;-) On Mon, May 30, 2011 at 10:31 PM, James Seigel ja...@tynt.com wrote: Not sure that will help ;) Sent from my mobile. Please excuse the typos. On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky balek...@gmail.com wrote: Ljddfjfjfififfifjftjiifjfjjjffkxbznzsjxodiewisshsudddudsjidhddueiweefiuftttoitfiirriifoiffkllddiririiriioerorooiieirrioeekroooeoooirjjfdijdkkduddjudiiehs s On May 30, 2011 5:28 AM, Gyuribácsi bogyo...@gmail.com wrote: Hi, I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB disk, 16 HT cores). I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming jar with 'wc -l' as mapper and 'cat' as reducer. I use 64MB block size and the default replication (3). The wc on the 100 GB took about 220 seconds, which translates to about 3.5 Gbit/s processing speed. One disk can do sequential reads at 1 Gbit/s, so I would expect something around 20 Gbit/s (minus some overhead), and I'm getting only 3.5. Is my expectation valid? I checked the jobtracker and it seems all nodes are working, each reading the right blocks. I have not played with the number of mappers and reducers yet. It seems the number of mappers is the same as the number of blocks and the number of reducers is 20 (there are 20 disks). This looks OK to me. We also did an experiment with TestDFSIO, with similar results. Aggregate read IO speed is around 3.5 Gbit/s. It is just too far from my expectation :( Please help! Thank you, Gyorgy -- Harsh J
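[Editor's note: a minimal sketch of the skew workaround He Chen describes, spreading a known hot key across reducers instead of hashing it to one. "Hadoop" as the hot key is taken from his example; a real job would make the hot-key set configurable, and a follow-up pass must merge the partial aggregates that the salting produces.]

import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
  private final Random random = new Random();

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if ("Hadoop".equals(key.toString())) {
      return random.nextInt(numPartitions);   // spread the hot key around
    }
    // Everything else gets the default hash behavior.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}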
Re: No. of Map and reduce tasks
Hi Mohit, Number of maps: it depends on total file size / block size. Number of reducers: you can specify it. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand, it's based on the number of input files passed to Hadoop. So if I have 4 files, there will be 4 map tasks launched, and the reducer is dependent on the hash partitioner.
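[Editor's note: the two knobs differ in kind, so a quick sketch. Reducer count is set directly on the job; map count follows the input splits and can only be influenced, for instance by capping the split size. The paths and values are illustrative.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskCountSketch {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "example");
    job.setNumReduceTasks(8);                              // direct knob
    FileInputFormat.addInputPath(job, new Path("/data/in"));
    // Indirect knob: splits no larger than 64 KB force more map tasks,
    // subject to the input format being splittable.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024);
  }
}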
Re: No. of Map and reduce tasks
If you give it really small files, the benefit of Hadoop's big block size goes away. Instead, try merging files. Hope that helps. From: James Seigel ja...@tynt.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 6:04:07 PM Subject: Re: No. of Map and reduce tasks Set the input split size really low; you might get something. But I'd rather you fire up some *nix commands, pack that file together onto itself a bunch of times, then put it back into HDFS and let 'er rip. Sent from my mobile. Please excuse the typos. On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I think I understand that from the last 2 replies :) But my question is: can I change this configuration to, say, split the file at 250K so that multiple mappers can be invoked? On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote: have more data for it to process :) On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote: I ran a simple Pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute across multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, Number of maps: it depends on total file size / block size. Number of reducers: you can specify it. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand, it's based on the number of input files passed to Hadoop. So if I have 4 files, there will be 4 map tasks launched, and the reducer is dependent on the hash partitioner.