Hadoop integration with SAS

2011-08-23 Thread jonathan.hwang
Has anyone worked on Hadoop data integration with SAS?

Does SAS have a connector to HDFS?  Can it use data directly on HDFS?  Any
links, samples, or tools?

Thanks!
Jonathan




RE: Speed up under-replicated block replication during node decommission

2011-08-12 Thread jonathan.hwang
I do have these settings in hdfs-site.xml on all the nodes:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>

<property>
  <name>dfs.max-repl-streams</name>
  <value>50</value>
</property>

It is still taking a day or longer for 1 TB of under-replicated blocks to
replicate.

Thanks!
Jonathan


-Original Message-
From: Joey Echeverria [mailto:j...@cloudera.com] 
Sent: Friday, August 12, 2011 9:14 AM
To: common-user@hadoop.apache.org
Subject: Re: Speed up under-replicated block replication during node decommission

You can configure the undocumented variable dfs.max-repl-streams to
increase the number of replications a data-node is allowed to handle
at one time. The default value is 2. [1]

-Joey

[1] 
https://issues.apache.org/jira/browse/HADOOP-2606?focusedCommentId=12578700&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12578700

On Fri, Aug 12, 2011 at 12:09 PM, Charles Wimmer  wrote:
> The balancer bandwidth setting does not affect decommissioning nodes.
> Decommissioning nodes replicate as fast as the cluster is capable.
>
> The replication pace depends on many variables:
>  the number of nodes participating in the replication,
>  the amount of network bandwidth each has,
>  the amount of other HDFS activity at the time,
>  the total number of blocks being replicated,
>  the total amount of data being replicated,
>  and many others.
>
>
> On 8/12/11 8:58 AM, "jonathan.hw...@accenture.com" 
>  wrote:
>
> Hi All,
>
> I'm trying to decommission a data node from my cluster.  I put the data node
> in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name
> nodes.  The under-replicated block count is going down, but at a very slow
> pace.  For 1 TB of data it takes over a day to complete.  We changed the
> settings below to try to increase the replication rate.
>
> Added this to hdfs-site.xml on all the nodes in the cluster and restarted
> the data node and name node processes.
>
> <property>
>   <name>dfs.balance.bandwidthPerSec</name>
>   <value>131072000</value>
> </property>
>
> Speed didn't seem to pick up. Do you know what may be happening?
>
> Thanks!
> Jonathan
>
>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434



Speed up under-replicated block replication during node decommission

2011-08-12 Thread jonathan.hwang
Hi All,

I'm trying to decommission a data node from my cluster.  I put the data node in
the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name nodes.
The under-replicated block count is going down, but at a very slow pace.  For
1 TB of data it takes over a day to complete.  We changed the settings below to
try to increase the replication rate.

Added this to hdfs-site.xml on all the nodes in the cluster and restarted the
data node and name node processes.

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>

Speed didn't seem to pick up. Do you know what may be happening?
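(A side note, a sketch rather than a definitive fix: assuming the NameNode's
hdfs-site.xml already has dfs.hosts.exclude pointing at that exclude file, the
decommission can be triggered without a restart; the hostname below is
hypothetical.)

# add the host to the file referenced by dfs.hosts.exclude, then tell the
# NameNode to re-read its hosts/exclude files -- no restart needed
echo "datanode04.example.com" >> /usr/lib/hadoop/conf/dfs.hosts.exclude
hadoop dfsadmin -refreshNodes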

Thanks!
Jonathan



Hadoop cluster network requirement

2011-07-31 Thread jonathan.hwang
I was asked by our IT folks whether we can put the Hadoop name node's storage
on a shared disk storage unit.  Does anyone have experience with how much I/O
throughput is required on the name node?  What are the latency/data throughput
requirements between the master and data nodes - can this tolerate network
routing?

Has anyone published throughput requirements or a recommended network setup?

Thanks!
Jonathan





Deduplication Effort in Hadoop

2011-07-14 Thread jonathan.hwang
Hi All,
In databases you can define primary keys to ensure no duplicate data gets
loaded into the system.  Say I have about 1 billion records flowing into my
system every day, and some of them are repeated data (the same records).  I can
use 2-3 columns in each record to match and look for duplicates.  What is the
best de-duplication strategy?  Duplicate records should only appear within the
last 2 weeks.  I want a fast way to get the data into the system without much
delay.  Can HBase or Hive help here?
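(One way HBase can help, sketched below and not taken from this thread: build
the row key from the 2-3 matching columns, so a duplicate record lands on the
same row and simply overwrites it instead of accumulating.  The table name
"events", the column family "d", and the record fields are all hypothetical,
and the code assumes the old HBase 0.90-style client API.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");      // hypothetical table

    // Hypothetical record fields; in practice these come from the input feed.
    String custId = "42", txnId = "9001", eventDate = "2011-07-14";
    String payload = "rest of the record";

    // Row key built from the "natural key" columns: loading the same record
    // twice writes the same row key, so duplicates never accumulate.
    byte[] rowKey = Bytes.toBytes(custId + "|" + txnId + "|" + eventDate);

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(payload));
    table.put(put);
    table.close();
  }
}

Because duplicates only show up within a two-week window, only recent row keys
ever collide, so writes behave much like a plain append-only load; Hive (or
periodic exports) can then handle the analytical queries.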

Thanks!
Jonathan




Debug hadoop error

2011-05-17 Thread jonathan.hwang
I need some help figuring out why my job failed.  I built a single-node cluster
just to try it out, following this tutorial:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Everything seems to be working correctly.  I formatted the namenode and can
connect to the jobtracker, datanode, and namenode via the web UI.  I was able
to start and stop all the Hadoop services.

However, when I try to run the wordcount example, I got this:

Error initializing attempt_201105161023_0002_m_11_0:
java.io.IOException: Exception reading file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161023_0002/jobToken
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
        at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:163)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1064)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1001)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2161)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2125)
Caused by: java.io.FileNotFoundException: File file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161023_0002/jobToken does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
        ... 5 more

I created the directory on the local file system:

$ sudo mkdir /app/hadoop/tmp
$ sudo chown hadoop:hadoop /app/hadoop/tmp

I also added this to conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

When I formatted the namenode, it created the subdirectories on both the local
file system and HDFS successfully.

When I look at the output of the failed wordcount job, the error message
complains about an IO error on the file
/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161023_0002/jobToken

I did some troubleshooting: I can browse to this jobToken file on the local
file system with no problem, and its content is something like HDTS
MapReduce.job 201105161023_0002.

So is it a permission issue?  I made the owner of the hadoop process able to
write to all the subfolders, and it was able to create the file.  So what else
can be wrong?
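(A few checks that might narrow it down - just a sketch; the paths are the ones
from the error above, and "hadoop" is assumed to be the user running the
TaskTracker.)

# every directory on the path must be traversable by the TaskTracker user
ls -ld /app/hadoop/tmp /app/hadoop/tmp/mapred /app/hadoop/tmp/mapred/local

# the ttprivate tree is created by the TaskTracker itself and should be owned
# by the user running it (here: hadoop)
ls -ld /app/hadoop/tmp/mapred/local/ttprivate

# if ownership drifted (e.g. a daemon was once started as root), reset it
sudo chown -R hadoop:hadoop /app/hadoop/tmp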

What does that error message mean?  So puzzling... I wish the error messages
were more helpful.

BELOW IS THE DETAILED OUTPUT FROM THE COMMAND LINE:




hadoop@jonathan-VirtualBox:/usr/local/hadoop/hadoop-0.20.203.0$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount app/download app/output4
11/05/16 13:38:56 INFO input.FileInputFormat: Total input paths to process : 3
11/05/16 13:39:05 INFO mapred.JobClient: Running job: job_201105161222_0003
11/05/16 13:39:06 INFO mapred.JobClient:  map 0% reduce 0%
11/05/16 13:39:17 INFO mapred.JobClient: Task Id : attempt_201105161222_0003_m_04_0, Status : FAILED
Error initializing attempt_201105161222_0003_m_04_0:
java.io.IOException: Exception reading file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161222_0003/jobToken
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
        at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:163)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1064)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1001)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2161)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2125)
Caused by: java.io.FileNotFoundException: File file:/app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201105161222_0003/jobToken does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
        at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
        ... 5 more

11/05/16 13:39:21 WARN mapred.JobClient: Error reading task outputhttp://jonathan-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201105161222_0003_m_04_0&filter=stdout
11/05/16 13:39:21 WARN mapred.JobClient: Error reading