Re: Hadoop integration with SAS

2011-08-23 Thread jagaran das
R has a connector for Hadoop, if that helps.



From: jonathan.hw...@accenture.com jonathan.hw...@accenture.com
To: common-user@hadoop.apache.org
Sent: Tuesday, 23 August 2011 2:21 PM
Subject: Hadoop integration with SAS

Has anyone worked on Hadoop data integration with SAS?

Does SAS have a connector to HDFS? Can it use data directly on HDFS? Any links, 
samples, or tools?

Thanks!
Jonathan



MR job to copy to hadoop

2011-08-13 Thread jagaran das
Hi,

What is the best and fastest way to achieve a parallel copy into Hadoop from an 
NFS mount?
We have a mount with a huge number of files, and we need to copy them into HDFS.

Some options:

1. Run copyFromLocal in a multithreaded way (a sketch of this is below).
2. Use distcp in an isolated way.
3. Can I write a map-only job to do the copy?
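
For option 1, a minimal sketch of a client-side multithreaded copy might look like 
the following. The mount path, target directory, thread count, and class name are 
placeholders, and this assumes the NFS mount is visible as a local directory on 
the machine running the copy:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelNfsCopy {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration()); // picks up fs.default.name
    final Path dest = new Path("/data/incoming");              // assumed HDFS target dir

    File[] files = new File("/mnt/nfs/export").listFiles();    // assumed NFS mount
    ExecutorService pool = Executors.newFixedThreadPool(8);    // tune to taste
    for (final File f : files) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            // each thread streams one local file into HDFS
            fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                                 new Path(dest, f.getName()));
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    fs.close();
  }
}

For option 3, a map-only job would likely need the NFS mount (or a list of source 
paths) visible from every task node, which is roughly what distcp with a file:// 
source would require as well.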

Regards,
JD

Namenode Scalability

2011-08-10 Thread jagaran das
In my current project we are planning to stream data to the NameNode of a 20-node 
cluster.
The data volume would be around 1 PB per day, but there are applications that can 
publish data at 1 GBps.

A few queries:

1. Can a single NameNode handle such high-speed writes, or does it become 
unresponsive when a GC cycle kicks in?
2. Can we have multiple federated NameNodes sharing the same slaves, so that we 
can distribute the writes accordingly?
3. Can multiple HBase region servers help us?

Please suggest how we can design the streaming part to handle data at this scale.

Regards,
Jagaran Das 

Re: Namenode Scalability

2011-08-10 Thread jagaran das
What would cause the name node to have a GC issue?

- I am opening at most 5000 connections and writing continuously through those 
5000 connections to 5000 files at a time.
- The volume of data that I would write through the 5000 connections cannot be 
controlled, as it depends on the upstream applications that publish the data.

Now if the heap memory nears its full size (say M GB) and a major GC cycle kicks 
in, the NameNode could stop responding for some time.
This stop-the-world time should be roughly proportional to the heap size.
This may cause data to back up in the streaming application's memory.

As for our architecture:

It has a cluster of JMS queues, and we have a multithreaded application that picks 
messages from the queue and streams them to the NameNode of a 20-node cluster 
using the exposed FileSystem API.

BTW, in the real world, if you have a fast car you can race and win against a slow 
train; it all depends on what reference frame you are in :)

Regards,
Jagaran 


From: Michel Segel michael_se...@hotmail.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org; jagaran 
das jagaran_...@yahoo.co.in
Sent: Wednesday, 10 August 2011 11:26 AM
Subject: Re: Namenode Scalability

So many questions, why stop there?

First question... What would cause the name node to have a GC issue?
Second question... You're streaming 1PB a day. Is this a single stream of data?
Are you writing this to one file before processing, or are you processing the 
data directly on the ingestion stream?

Are you also filtering the data so that you are not saving all of the data?

This sounds more like a homework assignment than a real-world problem.

I guess people don't race cars against trains or have two trains traveling in 
different directions anymore... :-)


Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 10, 2011, at 12:07 PM, jagaran das jagaran_...@yahoo.co.in wrote:

 To be precise, the projected data is around 1 PB.
 But the publishing rate is also around 1GBPS.
 
 Please suggest.
 
 
 
 From: jagaran das jagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Wednesday, 10 August 2011 12:58 AM
 Subject: Namenode Scalability
 
 In my current project we  are planning to streams of data to Namenode (20 
 Node Cluster).
 Data Volume would be around 1 PB per day.
 But there are application which can publish data at 1GBPS.
 
 Few queries:
 
 1. Can a single Namenode handle such high speed writes? Or it becomes 
 unresponsive when GC cycle kicks in.
 2. Can we have multiple federated Name nodes  sharing the same slaves and 
 then we can distribute the writes accordingly.
 3. Can multiple region servers of HBase help us ??
 
 Please suggest how we can design the streaming part to handle such scale of 
 data. 
 
 Regards,
 Jagaran Das 

Re: java.io.IOException: config()

2011-08-06 Thread jagaran das
I am accessing it through threads in parallel.

What is the concept of a lease in HDFS?

Regards,
JD



From: Harsh J ha...@cloudera.com
To: jagaran das jagaran_...@yahoo.co.in
Sent: Friday, 5 August 2011 11:37 PM
Subject: Re: java.io.IOException: config()


How long are you keeping it open for?


On 06-Aug-2011, at 10:14 AM, jagaran das wrote:

Hi,


I am using CDH3.
I need to stream a huge amount of data from our application to Hadoop.
I am opening a connection like:

config.set("fs.default.name", hdfsURI);
FileSystem dfs = FileSystem.get(config);
String path = hdfsURI + connectionKey;
Path destPath = new Path(path);
logger.debug("Path -- " + destPath.getName());
outStream = dfs.create(destPath);

and keeping the outStream open for some time, writing continuously through it, and 
then closing it.
But it is throwing:


 
5Aug2011 21:36:48,550 DEBUG 
[LeaseChecker@DFSClient[clientName=DFSClient_218151655, ugi=jagarandas]: 
java.lang.Throwable: for testing
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181)
at org.apache.hadoop.util.Daemon.init(Daemon.java:38)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:219)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:584)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:565)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:472)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:464)
at 
com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:66)
at 
com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:93)
at 
com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41)
at 
com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61)
at 
com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276)
at 
com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435)
at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322)
at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260)
at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944)
at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868)
at java.lang.Thread.run(Thread.java:680)
] (RPC.java:230) - Call: renewLease 4
05Aug2011 21:36:48,550 DEBUG [listenerContainer-1] (DFSClient.java:3274) - 
DFSClient writeChunk allocating new packet seqno=0, 
src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819,
 packetSize=65557, chunksPerPacket=127, bytesCurBlock=0
05Aug2011 21:36:48,551 DEBUG [Thread-11] (DFSClient.java:2499) - Allocating 
new block
05Aug2011 21:36:48,552 DEBUG [sendParams-0] (Client.java:761) - IPC Client 
(47) connection to localhost/127.0.0.1:8020 from jagarandas sending #3
05Aug2011 21:36:48,553 DEBUG [IPC Client (47) connection to 
localhost/127.0.0.1:8020 from jagarandas] (Client.java:815) - IPC Client (47) 
connection to localhost/127.0.0.1:8020 from jagarandas got value #3
05Aug2011 21:36:48,556 DEBUG [Thread-11] (RPC.java:230) - Call: addBlock 4
05Aug2011 21:36:48,557 DEBUG [Thread-11] (DFSClient.java:3094) - pipeline = 
127.0.0.1:50010
05Aug2011 21:36:48,557 DEBUG [Thread-11] (DFSClient.java:3102) - Connecting to 
127.0.0.1:50010
05Aug2011 21:36:48,559 DEBUG [Thread-11] (DFSClient.java:3109) - Send buf size 
131072
05Aug2011 21:36:48,635 DEBUG [DataStreamer for file 
/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819
 block blk_-5183404460805094255_1042] (DFSClient.java:2533) - DataStreamer 
block blk_-5183404460805094255_1042 wrote packet seqno:0 size:1522 
offsetInBlock:0 lastPacketInBlock:true
05Aug2011 21:36:48,638 DEBUG [ResponseProcessor for block 
blk_-5183404460805094255_1042] (DFSClient.java:2640) - DFSClient Replies for 
seqno 0 are SUCCESS
05Aug2011 21:36:48,639 DEBUG [DataStreamer for file 
/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819
 block blk_-5183404460805094255_1042

Help on DFSClient

2011-08-06 Thread jagaran das
I am keeping a stream open and writing through it using a multithreaded 
application.
The application is on a different box and I am connecting to the NameNode remotely.

I was using FileSystem and getting this error, and now I am trying DFSClient and 
getting the same error.

When I run it via a simple standalone class it does not throw any error, but when 
I put it into my application, it throws this error.

Please help me with this.

Regards,
JD 

  
 public String toString() {
      String s = getClass().getSimpleName();
      if (LOG.isTraceEnabled()) {
        return s + "@" + DFSClient.this + ": "
               + StringUtils.stringifyException(new Throwable("for testing"));
      }
      return s;
    }

My Stack Trace :::

  
06Aug2011 12:29:24,345 DEBUG [listenerContainer-1] (DFSClient.java:1115) - Wait 
for lease checker to terminate
06Aug2011 12:29:24,346 DEBUG 
[LeaseChecker@DFSClient[clientName=DFSClient_280246853, ugi=jagarandas]: 
java.lang.Throwable: for testing
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181)
at org.apache.hadoop.util.Daemon.init(Daemon.java:38)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:513)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:497)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:442)
at 
com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:74)
at 
com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:95)
at 
com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41)
at 
com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61)
at 
com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276)
at 
com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435)
at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322)
at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260)
at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944)
at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868)
at java.lang.Thread.run(Thread.java:680)

NameNode Profiling Tools

2011-08-06 Thread jagaran das
Hi,

Please suggest the best way to profile the NameNode. Any specific tools?
We would be streaming transaction data to the NameNode continuously, using around 
2000 concurrent threads. The size is around 300 KB per transaction.
I am using DataInputStream and writing continuously through each of the 2000 
connections for 5 minutes, then closing them and reopening 2000 new connections.
Are there any benchmarks on CPU and memory utilization of the NameNode?

My NameNode box config:
1. HP DL360 G7, 2 x 2.66 GHz CPUs, 72 GB RAM, 8 x 300 GB drives.

Regards,
JD 

java.io.IOException: config()

2011-08-05 Thread jagaran das
Hi,

I have been stuck with this exception:

java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.(Configuration.java:211)
at org.apache.hadoop.conf.Configuration.(Configuration.java:198)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:99)
at test.TestApp.main(TestApp.java:19)


  
05Aug2011 20:08:53,303 DEBUG 
[LeaseChecker@DFSClient[clientName=DFSClient_-1591195062, 
ugi=jagarandas,staff,com.apple.sharepoint.group.1,_developer,_lpoperator,_lpadmin,_appserveradm,admin,_appserverusr,localaccounts,everyone,fmsadmin,com.apple.access_screensharing,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3]:
 java.lang.Throwable: for testing

  
05Aug2011 20:08:53,315 DEBUG [listenerContainer-1] (DFSClient.java:3012) - 
DFSClient writeChunk allocating new packet seqno=0, 
src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_222812011-08-05-20-08-52,
 packetSize=65557, chunksPerPacket=127, bytesCurBlock=0

I saw the source code :

 public Configuration(boolean loadDefaults) {
    this.loadDefaults = loadDefaults;
    if (LOG.isDebugEnabled()) {
      LOG.debug(StringUtils.stringifyException(new IOException("config()")));
    }
    synchronized(Configuration.class) {
      REGISTRY.put(this, null);
    }
  }

The log is in debug mode.

Can anyone please help me with this?

Regards,
JD

Re: java.io.IOException: config() IMP

2011-08-05 Thread jagaran das
:8020 from jagarandas got value #4
05Aug2011 21:36:48,648 DEBUG [listenerContainer-1] (RPC.java:230) - Call: 
complete 3

Please help, as it is a production enhancement for us.

Regards
Jagaran 



From: Harsh J ha...@cloudera.com
To: u...@pig.apache.org; jagaran das jagaran_...@yahoo.co.in
Sent: Friday, 5 August 2011 8:54 PM
Subject: Re: java.io.IOException: config()

Could you explain how/where you're stuck?

That DEBUG log doesn't even seem like a valid throw; it's just there to get a
stack trace, I believe.

On Sat, Aug 6, 2011 at 8:52 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi,

 I have been stuck with this exception:

 java.io.IOException: config()
 at org.apache.hadoop.conf.Configuration.(Configuration.java:211)
 at org.apache.hadoop.conf.Configuration.(Configuration.java:198)
 at 
 org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:99)
 at test.TestApp.main(TestApp.java:19)



 05Aug2011 20:08:53,303 DEBUG 
 [LeaseChecker@DFSClient[clientName=DFSClient_-1591195062, 
 ugi=jagarandas,staff,com.apple.sharepoint.group.1,_developer,_lpoperator,_lpadmin,_appserveradm,admin,_appserverusr,localaccounts,everyone,fmsadmin,com.apple.access_screensharing,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3]:
  java.lang.Throwable: for testing


 05Aug2011 20:08:53,315 DEBUG [listenerContainer-1] (DFSClient.java:3012) - 
 DFSClient writeChunk allocating new packet seqno=0, 
 src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_222812011-08-05-20-08-52,
  packetSize=65557, chunksPerPacket=127, bytesCurBlock=0

 I saw the source code :

  public Configuration(boolean loadDefaults) {
     this.loadDefaults = loadDefaults;
     if (LOG.isDebugEnabled()) {
       LOG.debug(StringUtils.stringifyException(new IOException("config()")));
     }
     synchronized(Configuration.class) {
       REGISTRY.put(this, null);
     }
   }

 Log is in debug mode.

 Can anyone please help me on this??

 Regards,
 JD



-- 
Harsh J

Max Number of Open Connections

2011-08-01 Thread jagaran das


Hi,

What is the max number of open connections to a NameNode?

I am using 


FSDataOutputStream out = dfs.create(src);

Cheers,
JD 


DFSClient Protocol and FileSystem class

2011-07-31 Thread jagaran das


What is the difference between the DFSClient protocol and the FileSystem class in 
Hadoop DFS (HDFS)? Both of these classes are used to connect a remote client to 
the NameNode in HDFS.

I wanted to know the advantages of one over the other, and which one is more 
suitable for a remote-client connection.


Regards,
JD

Hadoop Production Issue

2011-07-16 Thread jagaran das
Hi,

Due to requirements in our current production CDH3 cluster, we need to copy around 
11520 small files (total size 12 GB) to the cluster for one application.
We have 20 such applications that would run in parallel.

So one set would have 11520 files with a total size of 12 GB, and we would have 15 
such sets in parallel.

The total SLA for the pipeline, from copy to Pig aggregation to copy-to-local and 
SQL load, is 15 minutes.

What we do:

1. Merge files so that we get rid of small files - a huge time hit. Do we have any 
other option?
2. Copy to the cluster
3. Execute the Pig job
4. Copy to local
5. SQL loader

Can we perform the merge and the copy to the cluster from a host other than the 
NameNode?
We want an out-of-cluster machine running a Java process that would (see the 
sketch below):
1. Run periodically
2. Merge files
3. Copy to the cluster
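
As a rough sketch of steps 2 and 3 collapsed into a single pass (directory names 
and the target path are placeholders, and scheduling is left to cron or similar), 
the small files can be streamed straight into one HDFS file from any edge node 
that has the cluster's client configuration:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeAndUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml on the edge node
    FileSystem fs = FileSystem.get(conf);

    Path target = new Path("/staging/app1/merged-" + System.currentTimeMillis());
    FSDataOutputStream out = fs.create(target);

    // stream every small local file into the single HDFS file
    for (File f : new File("/local/staging/app1").listFiles()) {
      InputStream in = new FileInputStream(f);
      try {
        IOUtils.copyBytes(in, out, conf, false); // false = keep the HDFS stream open
      } finally {
        in.close();
      }
    }
    out.close();
    fs.close();
  }
}

The client only needs network access to the NameNode and DataNodes; it does not 
have to run on the NameNode itself.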

Secondly, can we append to an existing file in the cluster?

Please provide your thoughts, as maintaining the SLA is becoming tough.

Regards,
Jagaran 

Re: Any reason Hadoop logs cant be directed to a separate filesystem?

2011-06-25 Thread jagaran das
Yeah, that's what we do.
But it's again an extra process; if Hadoop had this ability built in, it would be 
great.
It uses log4j; I tried to tweak it, but it is throwing errors.

Regards,
Jagaran 




From: Michael Segel michael_se...@hotmail.com
To: common-user@hadoop.apache.org
Sent: Sat, 25 June, 2011 3:58:19 AM
Subject: RE: Any reason Hadoop logs cant be directed to a separate filesystem?


Yes, and it's called using cron and writing a simple ksh script to clear out any 
files that are older than 15 days.

There may be another way, but that's really the easiest.


 Date: Thu, 23 Jun 2011 02:44:48 +0530
 From: jagaran_...@yahoo.co.in
 Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem?
 To: common-user@hadoop.apache.org
 
 Hi,
 
 Can I limit the log file duration ?
 I want to keep files for last 15 days only.
 
 Regards,
 Jagaran 
 
 
 
 
 From: Jack Craig jcr...@carrieriq.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Wed, 22 June, 2011 2:00:23 PM
 Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem?
 
 Thx to both respondents.
 
 Note i've not tried this redirection as I have only production grids 
available.
 
 Our grids are growing and with them, log volume.
 
 As until now that log location has been in the same fs as the grid data,
 so running out of space due log bloat is a growing problem.
 
 From your replies, sounds like I can relocate my logs, Cool!
 
 But now the tough question, if i set up a too small partition and it runs out 
of 

 space,
 will my grid become unstable if hadoop can no longer write to its logs?
 
 Thx again, jackc...
 
 
 Jack Craig, Operations
 CarrierIQ.comhttp://CarrierIQ.com
 1200 Villa Ct, Suite 200
 Mountain View, CA. 94041
 650-625-5456
 
 On Jun 22, 2011, at 1:09 PM, Harsh J wrote:
 
 Jack,
 
 I believe the location can definitely be set to any desired path.
 Could you tell us the issues you face when you change it?
 
 P.s. The env var is used to set the config property hadoop.log.dir
 internally. So as long as you use the regular scripts (bin/ or init.d/
 ones) to start daemons, it would apply fine.
 
 On Thu, Jun 23, 2011 at 1:32 AM, Jack Craig 
 jcr...@carrieriq.commailto:jcr...@carrieriq.com wrote:
 Hi Folks,
 
 In the hadoop-env.sh, we find, ...
 
 # Where log files are stored.  $HADOOP_HOME/logs by default.
 # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
 
 is there any reason this location could not be a separate filesystem on the 
name 

 node?
 
 Thx, jackc...
 
 Jack Craig, Operations
 CarrierIQ.comhttp://CarrierIQ.com
 1200 Villa Ct, Suite 200
 Mountain View, CA. 94041
 650-625-5456
 
 
 
 
 
 --
 Harsh J

Re: Automatic Configuration of Hadoop Clusters

2011-06-22 Thread jagaran das
Puppetize it.




From: gokul gokraz...@gmail.com
To: common-user@hadoop.apache.org
Sent: Wed, 22 June, 2011 8:38:13 AM
Subject: Automatic Configuration of Hadoop Clusters

Dear all,
for benchmarking purposes we would like to adjust configurations as well as
flexibly adding/removing machines from our Hadoop clusters. Is there any
framework around allowing this in an easy manner without having to manually
distribute the changed configuration files? We consider writing a bash
script for that purpose, but hope that there is a tool out there saving us
the work.
Thanks in advance,
Gokul



Re: Any reason Hadoop logs cant be directed to a separate filesystem?

2011-06-22 Thread jagaran das
Hi,

Can I limit the log file retention?
I want to keep only the last 15 days of files.

Regards,
Jagaran 




From: Jack Craig jcr...@carrieriq.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Wed, 22 June, 2011 2:00:23 PM
Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem?

Thx to both respondents.

Note I've not tried this redirection, as I have only production grids available.

Our grids are growing and, with them, log volume.

Until now the log location has been in the same filesystem as the grid data, so 
running out of space due to log bloat is a growing problem.

From your replies, it sounds like I can relocate my logs. Cool!

But now the tough question: if I set up a partition that is too small and it runs 
out of space, will my grid become unstable if Hadoop can no longer write to its 
logs?

Thx again, jackc...


Jack Craig, Operations
CarrierIQ.comhttp://CarrierIQ.com
1200 Villa Ct, Suite 200
Mountain View, CA. 94041
650-625-5456

On Jun 22, 2011, at 1:09 PM, Harsh J wrote:

Jack,

I believe the location can definitely be set to any desired path.
Could you tell us the issues you face when you change it?

P.s. The env var is used to set the config property hadoop.log.dir
internally. So as long as you use the regular scripts (bin/ or init.d/
ones) to start daemons, it would apply fine.

On Thu, Jun 23, 2011 at 1:32 AM, Jack Craig 
jcr...@carrieriq.commailto:jcr...@carrieriq.com wrote:
Hi Folks,

In the hadoop-env.sh, we find, ...

# Where log files are stored.  $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

is there any reason this location could not be a separate filesystem on the 
name 
node?

Thx, jackc...

Jack Craig, Operations
CarrierIQ.comhttp://CarrierIQ.com
1200 Villa Ct, Suite 200
Mountain View, CA. 94041
650-625-5456





--
Harsh J

Re: Append to Existing File

2011-06-21 Thread jagaran das
Hi All,

Does CDH3 support appending to an existing file?

Regards,
Jagaran 




From: Eric Charles eric.char...@u-mangate.com
To: common-user@hadoop.apache.org
Sent: Tue, 21 June, 2011 3:53:33 AM
Subject: Re: Append to Existing File

When you say bugs are pending, are you referring to HDFS-265 (which links to 
HDFS-1060, HADOOP-6239 and HDFS-744)?

Are there other issues related to append than the ones above?

Tks, Eric

https://issues.apache.org/jira/browse/HDFS-265


On 21/06/11 12:36, madhu phatak wrote:
 It's not stable. There are some bugs pending. According to one of the
 discussions, to date append is not ready for production.

 On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.inwrote:

 I am using hadoop-0.20.203.0 version.
 I have set

 dfs.support.append to true and then using append method

 It is working but need to know how stable it is to deploy and use in
 production
 clusters ?

 Regards,
 Jagaran



 
 From: jagaran dasjagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org
 Sent: Mon, 13 June, 2011 11:07:57 AM
 Subject: Append to Existing File

 Hi All,

 Is append to an existing file is now supported in Hadoop for production
 clusters?
 If yes, please let me know which version and how

 Thanks
 Jagaran



-- 
Eric


Fw: HDFS File Appending URGENT

2011-06-17 Thread jagaran das
Please help me with this.
I need it very urgently.

Regards,
Jagaran 


- Forwarded Message 
From: jagaran das jagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Thu, 16 June, 2011 9:51:51 PM
Subject: Re: HDFS File Appending URGENT

Thanks a lot, Xiaobo.

I have tried the code below with HDFS version 0.20.20 and it worked.
Is it not stable yet?

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopFileWriter {
  public static void main(String[] args) throws Exception {
    try {
      URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");

      Path pt = new Path(uri);
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedWriter br;
      if (fs.isFile(pt)) {
        // append to the existing file
        br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
        br.newLine();
      } else {
        // create (or overwrite) the file
        br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
      }
      String line = args[0];
      System.out.println(line);
      br.write(line);
      br.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("File not found");
    }
  }
}

Thanks a lot for your help.

Regards,
Jagaran 





From: Xiaobo Gu guxiaobo1...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, 16 June, 2011 8:01:14 PM
Subject: Re: HDFS File Appending URGENT

You can merge multiple files into a new one; there is no way to
append to an existing file.

On Fri, Jun 17, 2011 at 10:29 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Is the hadoop version Hadoop 0.20.203.0 API

 That means still the hadoop files in HDFS version 0.20.20  are immutable?
 And there is no means we can append to an existing file in HDFS?

 We need to do this urgently as we have do set up the pipeline accordingly in
 production?

 Regards,
 Jagaran



 
 From: Xiaobo Gu guxiaobo1...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 16 June, 2011 6:26:45 PM
 Subject: Re: HDFS File Appending

 please refer to FileUtil.CopyMerge

 On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi,

 We have a requirement where

  There would be huge number of small files to be pushed to hdfs and then use
pig
 to do analysis.
  To get around the classic Small File Issue we merge the files and push a
 bigger file in to HDFS.
  But we are loosing time in this merging process of our pipeline.

 But If we can directly append to an existing file in HDFS we can save this
 Merging Files time.

 Can you please suggest if there a newer stable version of Hadoop where can go
 for appending ?

 Thanks and Regards,
 Jagaran



Re: HDFS File Appending URGENT

2011-06-17 Thread jagaran das
Thanks a lot, guys.

Another query for production:

Do we have any way to purge the HDFS job and history logs on a time basis?
For example, we want to keep only the last 30 days of logs; their size is 
increasing a lot in production.

Thanks again

Regards,
Jagaran 




From: Tsz Wo (Nicholas), Sze s29752-hadoopu...@yahoo.com
To: common-user@hadoop.apache.org
Sent: Fri, 17 June, 2011 11:45:22 AM
Subject: Re: HDFS File Appending URGENT

Hi Jagaran,

Short answer: the append feature is not in any release.  In this sense, it is 
not stable.  Below are more details on the Append feature status.

- 0.20.x (includes release 0.20.2)
There are known bugs in append.  The bugs may cause data loss.

- 0.20-append
There was an effort to fix the known append bugs, but there are no releases. I 
heard Facebook was using it (with additional patches?) in production but I did 
not have the details.

- 0.21
It has a new append design (HDFS-265).  However, the 0.21.0 release is only a 
minor release.  It has not undergone testing at scale and should not be 
considered stable or suitable for production.  Also, 0.21 development has been 
discontinued.  Newly discovered bugs may not be fixed.

- 0.22, 0.23
Not yet released.


Regards,
Tsz-Wo





From: jagaran das jagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Fri, June 17, 2011 11:15:04 AM
Subject: Fw: HDFS File Appending URGENT

Please help me on this.
I need it very urgently

Regards,
Jagaran 


- Forwarded Message 
From: jagaran das jagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Thu, 16 June, 2011 9:51:51 PM
Subject: Re: HDFS File Appending URGENT

Thanks a lot, Xiaobo.

I have tried the code below with HDFS version 0.20.20 and it worked.
Is it not stable yet?

public class HadoopFileWriter {
  public static void main(String[] args) throws Exception {
    try {
      URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");

      Path pt = new Path(uri);
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedWriter br;
      if (fs.isFile(pt)) {
        br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
        br.newLine();
      } else {
        br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
      }
      String line = args[0];
      System.out.println(line);
      br.write(line);
      br.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("File not found");
    }
  }
}

Thanks a lot for your help.

Regards,
Jagaran 





From: Xiaobo Gu guxiaobo1...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, 16 June, 2011 8:01:14 PM
Subject: Re: HDFS File Appending URGENT

You can merge multiple files into a new one; there is no way to
append to an existing file.

On Fri, Jun 17, 2011 at 10:29 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Is the hadoop version Hadoop 0.20.203.0 API

 That means still the hadoop files in HDFS version 0.20.20  are immutable?
 And there is no means we can append to an existing file in HDFS?

 We need to do this urgently as we have do set up the pipeline accordingly in
 production?

 Regards,
 Jagaran



 
 From: Xiaobo Gu guxiaobo1...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 16 June, 2011 6:26:45 PM
 Subject: Re: HDFS File Appending

 please refer to FileUtil.CopyMerge

 On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi,

 We have a requirement where

  There would be huge number of small files to be pushed to hdfs and then use
pig
 to do analysis.
  To get around the classic Small File Issue we merge the files and push a
 bigger file in to HDFS.
  But we are loosing time in this merging process of our pipeline.

 But If we can directly append to an existing file in HDFS we can save this
 Merging Files time.

 Can you please suggest if there a newer stable version of Hadoop where can go
 for appending ?

 Thanks and Regards,
 Jagaran



HDFS File Appending

2011-06-16 Thread jagaran das
Hi,

We have a requirement where:

 There would be a huge number of small files to be pushed to HDFS, and then we use 
Pig to do the analysis.
 To get around the classic small-files issue, we merge the files and push a bigger 
file into HDFS.
 But we are losing time in this merging step of our pipeline.

If we could directly append to an existing file in HDFS, we could save this 
file-merging time.

Can you please suggest whether there is a newer, stable version of Hadoop where we 
can go for appending?

Thanks and Regards,
Jagaran 

Re: HDFS File Appending URGENT

2011-06-16 Thread jagaran das
Is this the Hadoop 0.20.203.0 API?

Does that mean the files in HDFS version 0.20.20 are still immutable, and there is 
no way we can append to an existing file in HDFS?

We need to do this urgently, as we have to set up the pipeline accordingly in 
production.

Regards,
Jagaran 




From: Xiaobo Gu guxiaobo1...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, 16 June, 2011 6:26:45 PM
Subject: Re: HDFS File Appending

please refer to FileUtil.CopyMerge

On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi,

 We have a requirement where

  There would be huge number of small files to be pushed to hdfs and then use 
pig
 to do analysis.
  To get around the classic Small File Issue we merge the files and push a
 bigger file in to HDFS.
  But we are loosing time in this merging process of our pipeline.

 But If we can directly append to an existing file in HDFS we can save this
 Merging Files time.

 Can you please suggest if there a newer stable version of Hadoop where can go
 for appending ?

 Thanks and Regards,
 Jagaran


Re: HDFS File Appending URGENT

2011-06-16 Thread jagaran das
Thanks a lot, Xiaobo.

I have tried the code below with HDFS version 0.20.20 and it worked.
Is it not stable yet?

public class HadoopFileWriter {
  public static void main(String[] args) throws Exception {
    try {
      URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");

      Path pt = new Path(uri);
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedWriter br;
      if (fs.isFile(pt)) {
        br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
        br.newLine();
      } else {
        br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
      }
      String line = args[0];
      System.out.println(line);
      br.write(line);
      br.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("File not found");
    }
  }
}

Thanks a lot for your help.

Regards,
Jagaran 





From: Xiaobo Gu guxiaobo1...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, 16 June, 2011 8:01:14 PM
Subject: Re: HDFS File Appending URGENT

You can merge multiple files into a new one; there is no way to
append to an existing file.

On Fri, Jun 17, 2011 at 10:29 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Is the hadoop version Hadoop 0.20.203.0 API

 That means still the hadoop files in HDFS version 0.20.20  are immutable?
 And there is no means we can append to an existing file in HDFS?

 We need to do this urgently as we have do set up the pipeline accordingly in
 production?

 Regards,
 Jagaran



 
 From: Xiaobo Gu guxiaobo1...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 16 June, 2011 6:26:45 PM
 Subject: Re: HDFS File Appending

 please refer to FileUtil.CopyMerge

 On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi,

 We have a requirement where

  There would be huge number of small files to be pushed to hdfs and then use
pig
 to do analysis.
  To get around the classic Small File Issue we merge the files and push a
 bigger file in to HDFS.
  But we are loosing time in this merging process of our pipeline.

 But If we can directly append to an existing file in HDFS we can save this
 Merging Files time.

 Can you please suggest if there a newer stable version of Hadoop where can go
 for appending ?

 Thanks and Regards,
 Jagaran



Append to Existing File

2011-06-13 Thread jagaran das
Hi All,

Is append to an existing file now supported in Hadoop for production clusters?
If yes, please let me know in which version and how.

Thanks
Jagaran 

Re: Append to Existing File

2011-06-13 Thread jagaran das
I am using the hadoop-0.20.203.0 version.
I have set

dfs.support.append to true and am using the append method.

It is working, but I need to know how stable it is to deploy and use in production 
clusters.

Regards,
Jagaran 




From: jagaran das jagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Mon, 13 June, 2011 11:07:57 AM
Subject: Append to Existing File

Hi All,

Is append to an existing file is now supported in Hadoop for production 
clusters?
If yes, please let me know which version and how

Thanks
Jagaran 

Re: NameNode is starting with exceptions whenever its trying to start datanodes

2011-06-07 Thread jagaran das
Check two things:

1. Some of your data nodes are getting connected; that means passwordless SSH is 
not working between the nodes.
2. Then clear the directory where your data is persisted on the data nodes and 
format the NameNode.

It should definitely work then.

Cheers,
Jagaran 




From: praveenesh kumar praveen...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tue, 7 June, 2011 3:14:01 AM
Subject: Re: NameNode is starting with exceptions whenever its trying to start 
datanodes

But I don't have any data on my HDFS. I was having some data before, but now
I have deleted all the files from HDFS.
I don't know why the datanodes are taking time to start. I guess because of this
exception it's taking more time to start.

On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote:

 On 06/07/2011 10:50 AM, praveenesh kumar wrote:

 The logs say


  The ratio of reported blocks 0.9091 has not reached the threshold 0.9990.
 Safe mode will be turned off automatically.



 not enough datanodes reported in, or they are missing data



Re: NameNode is starting with exceptions whenever its trying to start datanodes

2011-06-07 Thread jagaran das
Sorry, I meant: some of your data nodes are not getting connected.




From: jagaran das jagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Tue, 7 June, 2011 10:45:59 AM
Subject: Re: NameNode is starting with exceptions whenever its trying to start 
datanodes

Check two things:

1. Some of your data node is getting connected, that means password less SSH is 
not working within nodes.
2. Then Clear the Dir where you data is persisted in data nodes and format the 
namenode.

It should definitely work then

Cheers,
Jagaran 




From: praveenesh kumar praveen...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tue, 7 June, 2011 3:14:01 AM
Subject: Re: NameNode is starting with exceptions whenever its trying to start 
datanodes

But I dnt have any data on my HDFS.. I was having some date before.. but now
I deleted all the files from HDFS..
I dnt know why datanodes are taking time to start.. I guess because of this
exception its taking more time to start.

On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote:

 On 06/07/2011 10:50 AM, praveenesh kumar wrote:

 The logs say


  The ratio of reported blocks 0.9091 has not reached the threshold 0.9990.
 Safe mode will be turned off automatically.



 not enough datanodes reported in, or they are missing data



Re: NameNode is starting with exceptions whenever its trying to start datanodes

2011-06-07 Thread jagaran das
Yes, correct.
Passwordless SSH between your NameNode and some of your datanodes is not 
working.





From: praveenesh kumar praveen...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tue, 7 June, 2011 10:56:08 AM
Subject: Re: NameNode is starting with exceptions whenever its trying to start 
datanodes

1. Some of your data node is getting connected, that means password less
SSH is
not working within nodes.

So you mean that passwordless SSH should also be set up among the datanodes?
In Hadoop we usually set up passwordless SSH from the NameNode to the datanodes.
Do we have to set up passwordless SSH among the datanodes as well?

On Tue, Jun 7, 2011 at 11:15 PM, jagaran das jagaran_...@yahoo.co.inwrote:

 Check two things:

 1. Some of your data node is getting connected, that means password less
 SSH is
 not working within nodes.
 2. Then Clear the Dir where you data is persisted in data nodes and format
 the
 namenode.

 It should definitely work then

 Cheers,
 Jagaran



 
 From: praveenesh kumar praveen...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Tue, 7 June, 2011 3:14:01 AM
 Subject: Re: NameNode is starting with exceptions whenever its trying to
 start
 datanodes

 But I dnt have any data on my HDFS.. I was having some date before.. but
 now
 I deleted all the files from HDFS..
 I dnt know why datanodes are taking time to start.. I guess because of this
 exception its taking more time to start.

 On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote:

  On 06/07/2011 10:50 AM, praveenesh kumar wrote:
 
  The logs say
 
 
   The ratio of reported blocks 0.9091 has not reached the threshold
 0.9990.
  Safe mode will be turned off automatically.
 
 
 
  not enough datanodes reported in, or they are missing data
 



Re: NameNode is starting with exceptions whenever its trying to start datanodes

2011-06-07 Thread jagaran das
Cleaning the data from the datanode's data directory and formatting the NameNode 
may help you.





From: praveenesh kumar praveen...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tue, 7 June, 2011 11:05:03 AM
Subject: Re: NameNode is starting with exceptions whenever its trying to start 
datanodes

Sorry I mean Some of your data nodes are not  getting connected..

So are you sticking with the solution you suggested, to set up passwordless SSH 
for all datanodes?
Because in my Hadoop setup all the datanodes are running fine.



On Tue, Jun 7, 2011 at 11:32 PM, jagaran das jagaran_...@yahoo.co.inwrote:

 Sorry I mean Some of your data nodes are not  getting connected



 
 From: jagaran das jagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org
 Sent: Tue, 7 June, 2011 10:45:59 AM
  Subject: Re: NameNode is starting with exceptions whenever its trying to
 start
 datanodes

 Check two things:

 1. Some of your data node is getting connected, that means password less
 SSH is
 not working within nodes.
 2. Then Clear the Dir where you data is persisted in data nodes and format
 the
 namenode.

 It should definitely work then

 Cheers,
 Jagaran



 
 From: praveenesh kumar praveen...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Tue, 7 June, 2011 3:14:01 AM
 Subject: Re: NameNode is starting with exceptions whenever its trying to
 start
 datanodes

 But I dnt have any data on my HDFS.. I was having some date before.. but
 now
 I deleted all the files from HDFS..
 I dnt know why datanodes are taking time to start.. I guess because of this
 exception its taking more time to start.

 On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org wrote:

  On 06/07/2011 10:50 AM, praveenesh kumar wrote:
 
  The logs say
 
 
   The ratio of reported blocks 0.9091 has not reached the threshold
 0.9990.
  Safe mode will be turned off automatically.
 
 
 
  not enough datanodes reported in, or they are missing data
 



Re: NameNode is starting with exceptions whenever its trying to start datanodes

2011-06-07 Thread jagaran das
I mean running rm -rf * in the datanode data directory.

These are the debugging steps that I followed.





From: praveenesh kumar praveen...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tue, 7 June, 2011 11:19:50 AM
Subject: Re: NameNode is starting with exceptions whenever its trying to start 
datanodes

How shall I clean my data dir?
By cleaning the data dir, do you mean deleting all files from HDFS?

Is there any special command to clean all the datanodes in one step?

On Tue, Jun 7, 2011 at 11:46 PM, jagaran das jagaran_...@yahoo.co.inwrote:

 Cleaning data from data dir of datanode and formatting the name node may
 help
 you




 
 From: praveenesh kumar praveen...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Tue, 7 June, 2011 11:05:03 AM
  Subject: Re: NameNode is starting with exceptions whenever its trying to
 start
 datanodes

 Sorry I mean Some of your data nodes are not  getting connected..

 So are you sticking with your solution that you are saying to me.. to go
 for
 passwordless ssh for all datanodes..
 because for my hadoop.. all datanodes are running fine



 On Tue, Jun 7, 2011 at 11:32 PM, jagaran das jagaran_...@yahoo.co.in
 wrote:

  Sorry I mean Some of your data nodes are not  getting connected
 
 
 
  
  From: jagaran das jagaran_...@yahoo.co.in
  To: common-user@hadoop.apache.org
  Sent: Tue, 7 June, 2011 10:45:59 AM
   Subject: Re: NameNode is starting with exceptions whenever its trying to
  start
  datanodes
 
  Check two things:
 
  1. Some of your data node is getting connected, that means password less
  SSH is
  not working within nodes.
  2. Then Clear the Dir where you data is persisted in data nodes and
 format
  the
  namenode.
 
  It should definitely work then
 
  Cheers,
  Jagaran
 
 
 
  
  From: praveenesh kumar praveen...@gmail.com
  To: common-user@hadoop.apache.org
  Sent: Tue, 7 June, 2011 3:14:01 AM
  Subject: Re: NameNode is starting with exceptions whenever its trying to
  start
  datanodes
 
  But I dnt have any data on my HDFS.. I was having some date before.. but
  now
  I deleted all the files from HDFS..
  I dnt know why datanodes are taking time to start.. I guess because of
 this
  exception its taking more time to start.
 
  On Tue, Jun 7, 2011 at 3:34 PM, Steve Loughran ste...@apache.org
 wrote:
 
   On 06/07/2011 10:50 AM, praveenesh kumar wrote:
  
   The logs say
  
  
The ratio of reported blocks 0.9091 has not reached the threshold
  0.9990.
   Safe mode will be turned off automatically.
  
  
  
   not enough datanodes reported in, or they are missing data
  
 



Re: Adding first datanode isn't working

2011-06-01 Thread jagaran das
Check whether passwordless SSH is working or not.

Regards,
Jagaran 




From: MilleBii mille...@gmail.com
To: common-user@hadoop.apache.org
Sent: Wed, 1 June, 2011 12:28:54 PM
Subject: Adding first datanode isn't working

Newbie on Hadoop clusters.
I have set up my two-node configuration as described by M. G. Noll:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/


The data node has the datanode and tasktracker running (the jps command shows 
them), which means start-dfs.sh and start-mapred.sh worked fine.

I can also shut them down gracefully.

However in the WEB UI I only see one node for the DFS

Live Node : 1
Dead Node : 0

Same thing on the MapRed WEB interface.

Datanode logs on the slave are just empty.
I did check the network settings; both nodes have access to each other on the
relevant ports.

I did make sure the namespaceIDs are the same (
https://issues.apache.org/jira/browse/HDFS-107).
I did try to put data into the DFS, which worked, but no data seemed to arrive on
the slave datanode.
I also tried a small MapRed job; only the master node has actually been doing the
work, but that could be because there is only data on the master. Right?

-- 
-MilleBii-


Re: Adding first datanode isn't working

2011-06-01 Thread jagaran das


ufw 





From: MilleBii mille...@gmail.com
To: common-user@hadoop.apache.org
Sent: Wed, 1 June, 2011 3:37:23 PM
Subject: Re: Adding first datanode isn't working

OK found my issue. Turned off ufw and it sees the datanode. So I need to fix
my ufw setup.

2011/6/1 MilleBii mille...@gmail.com

 Thx, already did that,
  so I can ssh passphraseless from master to master and from master to slave1.
 Same as before: datanode and tasktracker are starting up/shutting down well on
 slave1





 2011/6/1 jagaran das jagaran_...@yahoo.co.in

 Check the password less SSH is working or not

 Regards,
 Jagaran



 
 From: MilleBii mille...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Wed, 1 June, 2011 12:28:54 PM
 Subject: Adding first datanode isn't working

 Newbie on hadoop clusters.
 I have setup my two nodes conf as described by M. G. Noll

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
/


 The data node has datanode  tasktracker running (jps command shows them),
 which means start-dfs.sh and start-mapred.sh worked fine.

 I can also shut them down gracefully.

 However in the WEB UI I only see one node for the DFS

 Live Node : 1
 Dead Node : 0

 Same thing on the MapRed WEB interface.

 Datanode logs on slave are just empty.
 Did check the network settings both nodes have access to each other on
 relevant ports.

 Did make sure namespaceID are the same (
 https://issues.apache.org/jira/browse/HDFS-107)
 I did try to put data in the DFS worked but no data seemed to arrive in
 the
 slave datanode.
 Also tried a small MapRed only master node has been actually working, but
 that could be because there is only data in the master. Right ?

 --
 -MilleBii-




 --
 -MilleBii-




-- 
-MilleBii-


Re: Hadoop project - help needed

2011-05-31 Thread jagaran das
Hi,

To be very precise:

The input to the mapper should be the records you want to filter, and the mapper 
emits the keys on which you want to do the aggregation.
The reducer is where you aggregate the output from the mappers.

Check the WordCount example in Hadoop; it can help you understand the basic 
concepts.
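
As a rough illustration of that split (a generic word-count shape, not specific to 
the Flickr case; class names are placeholders):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits (key, 1) for each item it wants counted; the reducer sums
// the 1s per key. For the Flickr case the map input value would be whatever
// record you feed in (e.g. a line describing a photo), and the emitted key
// would be the thing you aggregate on.
public class CountExample {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);       // filter/emit step
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                 // aggregation step
      }
      context.write(key, new IntWritable(sum));
    }
  }
}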

Cheers,
Jagaran 




From: parismav paok_gate...@hotmail.com
To: core-u...@hadoop.apache.org
Sent: Tue, 31 May, 2011 8:35:27 AM
Subject: Hadoop project - help needed


Hello dear forum,
I am working on a project on Apache Hadoop. I am totally new to this software and 
I need some help understanding the basic features!

To sum up, for my project I have configured Hadoop so that it runs 3 datanodes on 
one machine.
The project's main goal is to use both the Flickr API (flickr.com) libraries and 
the Hadoop libraries in Java, so that each one of the 3 datanodes chooses a Flickr 
group and returns photo info from that group.

In order to do that, I have 3 Flickr accounts, each one with a different API key.

I don't need any help on the Flickr side of the code, of course. But what I don't 
understand is how to use the Mapper and Reducer parts of the code.
What input do I have to give the map() function?
Do I have to contain this whole info-downloading process in the map() function?

In a few words, how do I convert my code so that it runs in a distributed way on 
Hadoop?
Thank you!

Re: trying to select technology

2011-05-31 Thread jagaran das
Think of Lucene and Apache Solr.

Cheers,
Jagaran 




From: cs230 chintanjs...@gmail.com
To: core-u...@hadoop.apache.org
Sent: Tue, 31 May, 2011 10:50:49 AM
Subject: trying to select technology


Hello All,

I am planning to start a project where I have to do extensive storage of XML
and text files. On top of that, I have to implement an efficient algorithm for
searching over thousands or millions of files, and also build some indexes to
make search faster next time. 

I looked into the Oracle database but it delivers very poor results. Can I use
Hadoop for this? Which Hadoop project would be the best fit for this? 

Is there anything from Google I can use? 

Thanks a lot in advance.

Re: Poor IO performance on a 10 node cluster.

2011-05-30 Thread jagaran das
Your font block size got increased dynamically, check in core-site :) :)

- Jagaran 




From: He Chen airb...@gmail.com
To: common-user@hadoop.apache.org
Sent: Mon, 30 May, 2011 11:39:35 AM
Subject: Re: Poor IO performance on a 10 node cluster.

Hi Gyuribácsi,

I would suggest you divide the MapReduce program execution time into 3 parts:

a) Map stage
In this stage, wc splits the input data and generates map tasks. Each map task
processes one block (by default; you can change this in FileInputFormat.java).
As Brian said, if you have a larger block size, you may have fewer map tasks, and
then probably less overhead.

b) Reduce stage
2) Shuffle phase
In this phase, each reduce task collects intermediate results from every node
that has executed map tasks. Each reduce task can use many concurrent threads
to obtain data (you can configure this in mapred-site.xml; the property is
mapreduce.reduce.shuffle.parallelcopies). But be careful about your data
popularity. For example, say you have "Hadoop, Hadoop, Hadoop, hello". The
default Hadoop partitioner will assign the 3 (Hadoop, 1) key-value pairs to one
node. Thus, if you have two nodes running reduce tasks, one of them will copy 3
times more data than the other. This will make one node slower than the
other. You may rewrite the partitioner (see the sketch below).

3) Sort and reduce phase
I think the Hadoop UI will give you some hints about how long this phase
takes.

By dividing the MapReduce application into these 3 parts, you can easily find
which one is your bottleneck and do some profiling. And I don't know why my
font changed to this type. :(
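
As a hypothetical sketch of that last point, a custom partitioner could spread one 
assumed hot key ("Hadoop" here) round-robin across reducers instead of hashing it 
to a single one. This only works when the per-key aggregation can be recombined 
afterwards, and it is a sketch, not a drop-in fix:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SaltedPartitioner extends Partitioner<Text, IntWritable> {
  private int next = 0;

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if ("Hadoop".equals(key.toString())) {          // assumed hot key
      next = (next + 1) % numPartitions;            // round-robin the hot key
      return next;
    }
    // default behavior for everything else
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(SaltedPartitioner.class).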

Hope it will be helpful.
Chen

On Mon, May 30, 2011 at 12:32 PM, Harsh J ha...@cloudera.com wrote:

 Psst. The cats speak in their own language ;-)

 On Mon, May 30, 2011 at 10:31 PM, James Seigel ja...@tynt.com wrote:
  Not sure that will help ;)
 
  Sent from my mobile. Please excuse the typos.
 
  On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky balek...@gmail.com
 wrote:
 
 
Ljddfjfjfififfifjftjiifjfjjjffkxbznzsjxodiewisshsudddudsjidhddueiweefiuftttoitfiirriifoiffkllddiririiriioerorooiieirrioeekroooeoooirjjfdijdkkduddjudiiehs
s
  On May 30, 2011 5:28 AM, Gyuribácsi bogyo...@gmail.com wrote:
 
 
  Hi,
 
  I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16
 HT
  cores).
 
  I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming
  jar
  with 'wc -l' as mapper and 'cat' as reducer.
 
  I use 64MB block size and the default replication (3).
 
  The wc on the 100 GB took about 220 seconds, which translates to about 3.5
  Gbit/sec processing speed. One disk can do sequential reads at 1 Gbit/sec, so
  I would expect something around 20 Gbit/sec (minus some overhead), and I'm
  getting only 3.5.
 
  Is my expectation valid?
 
  I checked the jobtracker and it seems all nodes are working, each reading
  the right blocks. I have not played with the number of mappers and reducers
  yet. It seems the number of mappers is the same as the number of blocks and
  the number of reducers is 20 (there are 20 disks). This looks OK to me.
 
  We also did an experiment with TestDFSIO with similar results.
 Aggregated
  read io speed is around 3.5Gbit/sec. It is just too far from my
  expectation:(
 
  Please help!
 
  Thank you,
  Gyorgy
 
 



 --
 Harsh J



Re: No. of Map and reduce tasks

2011-05-26 Thread jagaran das
Hi Mohit,

Number of maps - it depends on the total input size divided by the block size.
Number of reducers - you can specify it (see the sketch below).
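
For a hand-written MapReduce job (new API), the two knobs look roughly like this; 
the values are examples only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class JobKnobs {
  public static Job configure() throws Exception {
    Job job = new Job(new Configuration(), "example");

    // Reducers: set explicitly.
    job.setNumReduceTasks(4);

    // Mappers: follow from input size / split size, so shrinking the max split
    // size forces more map tasks even on a small input (value is an example).
    FileInputFormat.setMaxInputSplitSize(job, 256 * 1024);

    return job;
  }
}

Whether a very small input actually gets multiple splits also depends on the input 
format in use.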

Regards,
Jagaran 




From: Mohit Anchlia mohitanch...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, 26 May, 2011 2:48:20 PM
Subject: No. of Map and reduce tasks

How can I tell how the map and reduce tasks were spread across the
cluster? I looked at the jobtracker web page but can't find that info.

Also, can I specify how many map or reduce tasks I want to be launched?

From what I understand, it's based on the number of input files
passed to Hadoop. So if I have 4 files there will be 4 map tasks
launched, and the reducer is dependent on the hash partitioner.


Re: No. of Map and reduce tasks

2011-05-26 Thread jagaran das
If you feed in really small files, then the benefit of Hadoop's big block size 
goes away.
Instead, try merging the files.

Hope that helps




From: James Seigel ja...@tynt.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Thu, 26 May, 2011 6:04:07 PM
Subject: Re: No. of Map and reduce tasks

Set the input split size really low, and you might get something.

I'd rather you fire up some *nix commands, pack that file together onto itself a
bunch of times, then put it back into HDFS and let 'er rip.

Sent from my mobile. Please excuse the typos.

On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I think I understand that by last 2 replies :)  But my question is can
 I change this configuration to say split file into 250K so that
 multiple mappers can be invoked?

 On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote:
 have more data for it to process :)


 On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:

 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

 that orders the contents by name. But it only created one mapper. How
  can I change this to distribute across multiple machines?

 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in 
wrote:
 Hi Mohit,

 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.