Hadoop integration with SAS
Has anyone worked on integrating Hadoop data with SAS? Does SAS have a connector to HDFS? Can it use data directly on HDFS? Any links, samples, or tools?

Thanks!
Jonathan

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
RE: Speed up under-replicated block replication during decommission
I did have these settings in hdfs-site.xml on all the nodes:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>
<property>
  <name>dfs.max-repl-streams</name>
  <value>50</value>
</property>

It is still taking over a day for 1 TB of under-replicated blocks to replicate.

Thanks!
Jonathan

-----Original Message-----
From: Joey Echeverria [mailto:j...@cloudera.com]
Sent: Friday, August 12, 2011 9:14 AM
To: common-user@hadoop.apache.org
Subject: Re: Speed up under-replicated block replication during decommission

You can configure the undocumented variable dfs.max-repl-streams to increase the number of replications a data node is allowed to handle at one time. The default value is 2. [1]

-Joey

[1] https://issues.apache.org/jira/browse/HADOOP-2606?focusedCommentId=12578700&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12578700

On Fri, Aug 12, 2011 at 12:09 PM, Charles Wimmer wrote:
> The balancer bandwidth setting does not affect decommissioning nodes.
> Decommissioning nodes replicate as fast as the cluster is capable.
>
> The replication pace has many variables:
> - the number of nodes participating in the replication,
> - the amount of network bandwidth each has,
> - the amount of other HDFS activity at the time,
> - the total number of blocks being replicated,
> - the total amount of data being replicated,
> - and many others.
>
> On 8/12/11 8:58 AM, "jonathan.hw...@accenture.com" wrote:
>
> Hi All,
>
> I'm trying to decommission a data node from my cluster. I put the data node in the
> /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name node. The
> under-replicated blocks are starting to replicate, but at a very slow pace: for
> 1 TB of data it takes over a day to complete. We changed the settings below to
> try to increase the replication rate.
>
> Added this to hdfs-site.xml on all the nodes in the cluster and restarted the
> data node and name node processes:
>
> <property>
>   <name>dfs.balance.bandwidthPerSec</name>
>   <value>131072000</value>
> </property>
>
> Speed didn't seem to pick up. Do you know what may be happening?
>
> Thanks!
> Jonathan

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
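For reference, the two settings discussed in this thread belong in hdfs-site.xml as full property blocks. A sketch with the values from the thread; note that dfs.max-repl-streams is undocumented, so its name and behavior are version-specific and worth verifying against your release:

```xml
<!-- hdfs-site.xml: settings from this thread, in full property form. -->
<!-- Caps balancer traffic only; per Charles's reply, it does NOT speed up
     replication triggered by decommissioning. Value is bytes/sec. -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>
<!-- Undocumented (see HADOOP-2606): max concurrent replication streams a
     datanode may handle at once; the default is 2. -->
<property>
  <name>dfs.max-repl-streams</name>
  <value>50</value>
</property>
```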
Speed up under-replicated block replication during decommission
Hi All,

I'm trying to decommission a data node from my cluster. I put the data node in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name node. The under-replicated blocks are starting to replicate, but at a very slow pace: for 1 TB of data it takes over a day to complete. We changed the settings below to try to increase the replication rate.

Added this to hdfs-site.xml on all the nodes in the cluster and restarted the data node and name node processes:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>

Speed didn't seem to pick up. Do you know what may be happening?

Thanks!
Jonathan
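For context, the standard decommission setup has two parts: hdfs-site.xml must point at the exclude file via dfs.hosts.exclude (listing a host in the file does nothing unless the namenode knows about the file), and the namenode can be told to re-read it without a restart. A sketch, assuming the exclude-file path from the message above:

```xml
<!-- hdfs-site.xml on the namenode: register the exclude file. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/usr/lib/hadoop/conf/dfs.hosts.exclude</value>
</property>
```

After adding a hostname to that file, `hadoop dfsadmin -refreshNodes` makes the namenode pick up the change and begin decommissioning, with no restart needed.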
Hadoop cluster network requirement
I was asked by our IT folks whether we can put the Hadoop name node's storage on a shared disk storage unit. Does anyone have experience with how much I/O throughput is required on the name node? What are the latency and data-throughput requirements between the master and the data nodes? Can this traffic tolerate network routing? Has anyone published throughput requirements or recommendations for the best network setup?

Thanks!
Jonathan
Deduplication Effort in Hadoop
Hi All,

In databases you can define primary keys to ensure no duplicate data gets loaded into the system. Say I have about 1 billion records flowing into my system every day, and some of them are repeats (the same records). I can use 2-3 columns in each record to match and look for duplicates. What is the best de-duplication strategy? Duplicated records should only appear within the last 2 weeks. I want a fast way to get the data into the system without much delay. Can HBase or Hive help?

Thanks!
Jonathan
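One way to frame the 2-week constraint is as a rolling lookup of the composite key (your 2-3 matching columns) against the last-seen date of each key. On a cluster, that lookup table is exactly the shape HBase handles well (row key = concatenated columns, value = last-seen timestamp, with a TTL near 14 days). A minimal in-memory sketch of the logic, with hypothetical record fields and column names:

```python
from datetime import datetime, timedelta

def dedupe(records, key_cols, window_days=14):
    """Drop records whose key columns match a record kept within window_days.

    Each record is a dict with a 'date' field (datetime) plus data columns.
    The in-memory dict below stands in for a cluster-scale lookup table
    (e.g. an HBase table keyed on the concatenated key columns).
    """
    last_seen = {}              # composite key -> date of last record kept
    window = timedelta(days=window_days)
    kept = []
    for rec in records:
        key = tuple(rec[c] for c in key_cols)
        prev = last_seen.get(key)
        if prev is not None and rec["date"] - prev <= window:
            continue            # duplicate inside the window: drop it
        last_seen[key] = rec["date"]
        kept.append(rec)
    return kept
```

With per-record lookups against HBase, the data can stream in with little delay; a batch alternative is a periodic Hive/MapReduce job that groups by the key columns and keeps one record per key per window.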
Debug hadoop error
I need some help figuring out why my job failed. I built a single-node cluster just to try it out, following this tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Everything seems to be working correctly. I formatted the namenode and am able to connect to the jobtracker, datanode, and namenode via the web UI. I was able to start and stop all the Hadoop services. However, when I try to run the wordcount example, I get this:

Error initializing attempt_201105161023_0002_m_11_0:
java.io.IOException: Exception reading file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161023_0002/jobToken
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
        at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:163)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1064)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1001)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2161)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2125)
Caused by: java.io.FileNotFoundException: File file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161023_0002/jobToken does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:125)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
        ... 5 more

I created the directory on the local file system:
$ sudo mkdir /app/hadoop/tmp
$ sudo chown hadoop:hadoop /app/hadoop/tmp

I also modified conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

When I formatted the namenode, it created the subdirectories on both the local file system and HDFS successfully. When I look at the output of the failed wordcount job, the error message complains about an I/O error on the file /app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161023_0002/jobToken

I did some troubleshooting: I can browse to this jobToken file on the local file system with no problem. Its content is something like: HDTS MapReduce.job 201105161023_0002

So is it a permission issue? I made the owner of the Hadoop process able to write to all the subfolders, and it was able to create the file. So what else can be wrong? What does that error message mean? So puzzling... I wish there were better error messages.

BELOW IS THE DETAILED OUTPUT FROM THE COMMAND LINE:

hadoop@jonathan-VirtualBox:/usr/local/hadoop/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount app/download app/output4
11/05/16 13:38:56 INFO input.FileInputFormat: Total input paths to process : 3
11/05/16 13:39:05 INFO mapred.JobClient: Running job: job_201105161222_0003
11/05/16 13:39:06 INFO mapred.JobClient:  map 0% reduce 0%
11/05/16 13:39:17 INFO mapred.JobClient: Task Id : attempt_201105161222_0003_m_04_0, Status : FAILED
Error initializing attempt_201105161222_0003_m_04_0:
java.io.IOException: Exception reading file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161222_0003/jobToken
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
        at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:163)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1064)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1001)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2161)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2125)
Caused by: java.io.FileNotFoundException: File file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161222_0003/jobToken does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:125)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
        ... 5 more
11/05/16 13:39:21 WARN mapred.JobClient: Error reading task output http://jonathan-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201105161222_0003_m_04_0&filter=stdout
11/05/16 13:39:21 WARN mapred.JobClient: Error reading
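One possibility worth checking (an assumption, not a confirmed diagnosis): on 0.20.2xx this symptom, where the jobToken file is visible on disk but the TaskTracker throws FileNotFoundException, is frequently an ownership or mode problem somewhere under mapred.local.dir, since the TaskTracker expects to own its private tree. A sketch of the check, using a scratch directory in place of the real /app/hadoop/tmp so it can run anywhere:

```shell
# Sketch: the TaskTracker keeps job tokens under .../mapred/local/ttprivate,
# which it expects to own with mode 0700. A scratch dir stands in for the
# real /app/hadoop/tmp tree here.
SCRATCH=$(mktemp -d)
mkdir -p "$SCRATCH/mapred/local/ttprivate"
chmod 700 "$SCRATCH/mapred/local/ttprivate"
# The check to run against the real tree: owner should be the user running
# the TaskTracker daemon (here: hadoop), mode should be 700. A root-owned
# or differently-owned dir here can break job localization.
TT_OWNER=$(stat -c '%U' "$SCRATCH/mapred/local/ttprivate")
TT_MODE=$(stat -c '%a' "$SCRATCH/mapred/local/ttprivate")
echo "ttprivate: owner=$TT_OWNER mode=$TT_MODE"
rm -rf "$SCRATCH"
```

Running the same two `stat` commands against the real /app/hadoop/tmp/mapred/local/ttprivate (as root) would show whether the ownership matches the user the TaskTracker runs as.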