MapFile throwing IOException though reading data properly

2009-09-17 Thread Pallavi Palleti
Hi all,

I came across a strange error where my MapFile reads data correctly into the object passed to it, yet throws an IOException saying

java.io.IOException: @e5b723 read 2628 bytes, should read 2628

When I went through the code of SequenceFile.java (line no: 1796), I could see the following snippet of code, which is throwing the IOException:
if (valIn.read() > 0) {
  LOG.info("available bytes: " + valIn.available());
  throw new IOException(val + " read " + (valBuffer.getPosition() - keyLength)
                        + " bytes, should read " +
                        (valBuffer.getLength() - keyLength));
}

Can someone please tell me what this condition does and what it is for? I am using hadoop-0.20. This didn't happen in hadoop-0.18.2.

Thanks
Pallavi
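
For context, this check appears to fire when a Writable's readFields() consumes fewer bytes for the value than its write() produced, leaving unread bytes in the value buffer after deserialization. A hypothetical Writable that would trip it (the class and field names are made up for illustration, not taken from the original code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class LopsidedWritable implements Writable {
  private String text = "";

  public void write(DataOutput out) throws IOException {
    out.writeInt(text.length());
    out.writeBytes(text);   // one byte per char (low-order byte only)
    out.writeByte(1);       // extra non-zero byte that readFields never consumes
  }

  public void readFields(DataInput in) throws IOException {
    byte[] buf = new byte[in.readInt()];
    in.readFully(buf);      // the trailing byte written above is left unread
    text = new String(buf);
  }
}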



Re: MapFile throwing IOException though reading data properly

2009-09-17 Thread Pallavi Palleti
Hi all,

I found a fix for this problem. I was using dataout.writeBytes(str) in my class that implements Writable, and the problem was resolved when I switched to dataout.write(str.getBytes()). I was under the assumption that both do the same thing, but apparently they do not. I thought of sharing this with the group so that it can be useful for others. However, I am still puzzled about the difference between them.

Thanks
Pallavi
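
For reference, a small standalone example of the difference (not from the original code): DataOutputStream.writeBytes(String) writes only the low-order byte of each char and no length prefix, while write(str.getBytes()) writes the string's encoded bytes, so for non-ASCII text the two can produce different byte counts, which is exactly the kind of mismatch the SequenceFile check above detects.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class WriteBytesVsWrite {
  public static void main(String[] args) throws Exception {
    String s = "caf\u00e9";   // contains one non-ASCII character

    ByteArrayOutputStream a = new ByteArrayOutputStream();
    new DataOutputStream(a).writeBytes(s);                 // low byte of each char: 4 bytes

    ByteArrayOutputStream b = new ByteArrayOutputStream();
    new DataOutputStream(b).write(s.getBytes("UTF-8"));    // UTF-8 encoding: 5 bytes

    System.out.println(a.size() + " vs " + b.size());      // prints "4 vs 5"
  }
}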




Redirecting hadoop log messages to a log file at client side

2010-03-29 Thread Pallavi Palleti

Hi,

I am copying certain data from a client machine (which is not part of the cluster) to HDFS using DFSClient. During this process, I am encountering some issues, and the error/info logs are going to stdout. Is there a way I can configure a property on the client side so that the error/info logs are appended to the existing log file (created by the logger in the client code) rather than written to stdout?


Thanks
Pallavi


Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Pallavi Palleti

Hi Alex,

Thanks for the reply. I have already created a logger (from log4j.Logger) and configured it to log to a file, and it does log all the log statements in my client code. However, the error/info logs of DFSClient are going to stdout. The DFSClient code uses a Log from commons-logging.jar. I am wondering how to redirect those logs (which currently go to stdout) so that they are appended to the existing log file used by the client code.


Thanks
Pallavi


On 03/30/2010 12:06 PM, Alex Kozlov wrote:

Hi Pallavi,

It depends on what logging configuration you are using.  If it's log4j, you
need to modify (or create) a log4j.properties file and point your code (via
the classpath) to it.

A sample log4j.properties is in the conf directory (of either the Apache or
CDH distribution).

Alex K
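
A minimal sketch of the idea (assuming log4j 1.x is on the client classpath; the class name and file path here are made up): the client can either point log4j at an explicit properties file or attach a FileAppender to the root logger programmatically, so that DFSClient's commons-logging output (which binds to log4j when log4j is present) lands in the same file as the client's own logs.

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.PropertyConfigurator;

public class ClientLogSetup {
  public static void main(String[] args) throws Exception {
    // Option 1: point log4j at an explicit properties file.
    // PropertyConfigurator.configure("/path/to/client-log4j.properties");

    // Option 2: attach a file appender to the root logger programmatically.
    Logger root = Logger.getRootLogger();
    root.setLevel(Level.INFO);
    root.addAppender(new FileAppender(
        new PatternLayout("%d{ISO8601} %-5p %c: %m%n"),
        "/var/log/myclient/client.log",    // hypothetical log file
        true));                            // append to the existing file

    // From here on, both the client's own statements and DFSClient's
    // INFO/ERROR messages should go to client.log instead of stdout.
  }
}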



Query over DFSClient

2010-03-30 Thread Pallavi Palleti

Hi,

Could someone kindly let me know whether the DFSClient takes care of 
datanode failures and attempts to write to another datanode if the primary 
datanode (and the replica datanodes) fail? I looked into the source code 
of DFSClient and figured out that it attempts to write to one of the 
datanodes in the pipeline and fails if it cannot write to at least one of 
them. However, I am not sure, as I haven't explored it fully. If so, is 
there a way of querying the namenode to provide different datanodes in the 
case of failure? I am sure the Mapper does something similar (attempting 
to fetch a different datanode from the namenode) if it fails to write to 
the datanodes. Kindly let me know.


Thanks
Pallavi



Re: Redirecting hadoop log messages to a log file at client side

2010-03-31 Thread Pallavi Palleti

Hi Alex,

I created a jar including my client code (specified in the manifest) and 
the needed jar files such as hadoop-20.jar, log4j.jar, and commons-logging.jar, 
and ran the application as
java -cp  -jar above_jar needed-parameters-for-client-code.


I will explore using commons-logging in my client code.

Thanks
Pallavi

On 03/30/2010 10:14 PM, Alex Kozlov wrote:

Hi Pallavi,

DFSClient uses log4j.properties for configuration.  What is your classpath?
I need to know exactly how you invoke your program (java, hadoop script,
etc.).  The log level and appender are driven by the hadoop.root.logger
config variable.

I would also recommend using one logging system in the code, which would be
commons-logging in this case.

Alex K



Re: Query over DFSClient

2010-03-31 Thread Pallavi Palleti

Hi,

I am looking into the hadoop-0.20 source code for the issue below. From DFSClient, 
I can see that once the datanodes given by the namenode are not reachable, 
it sets the "lastException" variable to an error message saying "recovery 
from primary datanode failed N times, aborting..." (line no: 2546 in 
processDatanodeError). However, I couldn't figure out where this 
exception is thrown. I can see the throw statement in isClosed(), but I am 
not finding the exact sequence from the Streamer exiting with lastException 
set to the isClosed() method call. It would be great if someone could shed 
some light on this. I am essentially looking at whether DFSClient 
approaches the namenode after all of the datanodes it previously gave for 
a given data block have failed.


Thanks
Pallavi
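
A simplified sketch of the pattern in question (this is an illustration of the mechanism, not the actual Hadoop source): the background DataStreamer thread records the failure in lastException and marks the stream closed, and the stored exception only surfaces later, when the user's thread calls write()/flush()/close(), each of which checks isClosed() first.

import java.io.IOException;

class StreamSketch {
  private volatile boolean closed = false;
  private volatile IOException lastException;

  // called by the background streamer thread when recovery gives up
  void abort(IOException e) {
    lastException = e;
    closed = true;
  }

  // called at the top of write()/flush()/close() in the user's thread;
  // this is where a failure recorded earlier by the streamer resurfaces
  private void isClosed() throws IOException {
    if (closed) {
      IOException e = lastException;
      throw (e != null) ? e : new IOException("stream closed");
    }
  }

  void write(byte[] b) throws IOException {
    isClosed();   // a stored streamer failure is rethrown here
    // ... enqueue the packet for the streamer thread ...
  }

  void close() throws IOException {
    isClosed();   // and here, which is why a later close() also fails
    // ... flush remaining packets and complete the file at the namenode ...
  }
}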





Re: Issue with HDFS Client when datanode is temporarily unavailable

2009-07-24 Thread Pallavi Palleti
Could someone let me know what the reason for the failure would be? If the stream 
could be closed without any issue once the datanodes are available again, it would 
remove most of the complexity needed at my end. As the stream fails to close even 
when the datanodes are available, I have to maintain a kind of checkpointing to 
resume from where the data failed to copy to HDFS, which adds overhead for a 
solution that is meant to be near real time.

Thanks
Pallavi
- Original Message -
From: "Pallavi Palleti" 
To: common-user@hadoop.apache.org
Sent: Wednesday, July 22, 2009 5:06:49 PM GMT +05:30 Chennai, Kolkata, Mumbai, 
New Delhi
Subject: RE: Issue with HDFS Client when datanode is temporarily unavailable

Hi all,

In simple terms: why does an output stream that failed to close when the
datanodes weren't available still fail when I try to close it again once
the datanodes are available? Could someone kindly help me tackle this
situation?

Thanks
Pallavi
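
One possible client-side workaround, sketched under the assumption that the data for the current file is still available to the caller (buffered in memory or re-readable from the local log): since an aborted DFSOutputStream cannot be closed successfully later, the file is recreated from scratch and the records are rewritten on each attempt. The class name, retry count, and sleep interval below are illustrative.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryingHdfsWriter {

  /** Writes data to the given HDFS file, recreating the file on each retry. */
  public static void writeWithRetry(FileSystem fs, Path file, byte[] data,
                                    int maxAttempts)
      throws IOException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        // Overwrite any partial file left behind by a failed attempt;
        // do not try to reuse or re-close a stream that has already aborted.
        FSDataOutputStream out = fs.create(file, true);
        out.write(data);
        out.close();              // the data really reaches HDFS here
        return;
      } catch (IOException ioe) {
        if (attempt >= maxAttempts) {
          throw ioe;              // give up after maxAttempts tries
        }
        Thread.sleep(30000L);     // wait for the datanodes to recover
      }
    }
  }
}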

-Original Message-
From: Palleti, Pallavi [mailto:pallavi.pall...@corp.aol.com] 
Sent: Tuesday, July 21, 2009 10:21 PM
To: common-user@hadoop.apache.org
Subject: Issue with HDFS Client when datanode is temporarily unavailable

Hi all,

 

We are facing issues with an external application when it tries to write
data into HDFS using FSDataOutputStream. We are using hadoop-0.18.2.
The code works perfectly fine as long as the datanodes are doing well. If
the datanodes are unavailable for some reason (no space left, etc., which
is temporary and due to map-reduce jobs running on the machine), the code
fails. I tried to fix the issue by catching the error and waiting for some
time before retrying. While doing this, I came to know that the actual
writes do not happen when we call out.write() (the same is true for
out.write() followed by out.flush()); they happen only when we actually
call out.close(). At that point, if the datanodes are unavailable, the
DFSClient internally retries multiple times before throwing an exception.
Below is the sequence of exceptions that I am seeing.

 

09/07/21 19:33:25 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/07/21 19:33:25 INFO dfs.DFSClient: Abandoning block blk_2612177980121914843_134112
09/07/21 19:33:31 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/07/21 19:33:31 INFO dfs.DFSClient: Abandoning block blk_-3499389777806382640_134112
09/07/21 19:33:37 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/07/21 19:33:37 INFO dfs.DFSClient: Abandoning block blk_1835125657840860999_134112
09/07/21 19:33:43 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/07/21 19:33:43 INFO dfs.DFSClient: Abandoning block blk_-3979824251735502509_134112   [4 attempts made by the DFSClient before throwing the exception, during which the datanode is unavailable]
09/07/21 19:33:49 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2357)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1743)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1920)

09/07/21 19:33:49 WARN dfs.DFSClient: Error Recovery for block blk_-3979824251735502509_134112 bad datanode[0]
09/07/21 19:33:49 ERROR logwriter.LogWriterToHDFSV2: Failed while creating file for data:some dummy line [21/Jul/2009:17:15:18 somethinghere] with other dummy info :to HDFS
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2151)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1743)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1897)
09/07/21 19:33:49 INFO logwriter.LogWriterToHDFSV2: Retrying again...number of Attempts =0   [retry done by me manually, during which the datanode is available]
09/07/21 19:33:54 ERROR logwriter.LogWriterToHDFSV2: Failed while creating file for data:some dummy line [21/Jul/2009:17:15:18 somethinghere] with other dummy info :to HDFS
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2151)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1743)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1897)
09/07/21 19:33:54 INFO logwriter.LogWriterToHDFSV2: Retrying again...number of Attempts =1   [retry done by me manually, during which the datanode is available]
09/07/21 19:33:59 ERROR logwriter.Lo

Re: Remote access to cluster using user as hadoop

2009-07-24 Thread Pallavi Palleti
Hi all,

I tried to track down the place where I could add a condition to disallow any 
remote user with the username hadoop (the superuser), other than from some 
specific hostnames or IP addresses. I could see the call path as FsShell -> 
DistributedFileSystem -> DFSClient -> ClientProtocol. As there is no way to debug 
the code via Eclipse (when I ran it through Eclipse it points to LocalFileSystem), I 
followed the naive way of debugging by adding print statements. After DFSClient, I 
couldn't figure out which class gets called. From the code, I could see that 
only NameNode extends ClientProtocol, so I was pretty sure that NameNode 
methods were getting called, but I couldn't see my debug print statements in the 
logs when I added them to the NameNode. Can someone explain the flow when a 
call is made from a remote machine with the same superuser name (hadoop)?

I tried the mkdir command, which essentially calls the mkdirs() method in DFSClient 
and thereby the ClientProtocol mkdirs() method.

Thanks
Pallavi 
- Original Message -
From: "Ted Dunning" 
To: common-user@hadoop.apache.org
Sent: Friday, July 24, 2009 6:22:12 AM GMT +05:30 Chennai, Kolkata, Mumbai, New 
Delhi
Subject: Re: Remote access to cluster using user as hadoop

Interesting approach.

My guess is that this would indeed protect the datanodes from accidental
"attack" by stopping access before they are involved.

You might also consider just changing the name of the magic hadoop user to
something that is more unlikely.  The name "hadoop" is not far off what
somebody might come up with as a user name for experimenting or running
scheduled jobs.

On Thu, Jul 23, 2009 at 3:28 PM, Ian Holsman  wrote:

> I was thinking of alternatives similar to creating a proxy nameserver that
> non-privileged users can attach to that forwards those to the "real"
> nameserver or just hacking the nameserver so that it switches "hadoop" to
> "hadoop_remote" for sessions from untrusted IP's.
>
> not being familiar with the code, I am presuming that there is a point
> where the code determines the userID. can anyone point me to that bit?
> I just want to hack it to  downgrade superusers, and it doesn't have to be
> too clean or work for every edge case. it's more to stop accidental
> problems.
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Remote access to cluster using user as hadoop

2009-07-24 Thread Pallavi Palleti
I guess I forgot to restart the namenode after the changes. It is working fine now. 
Apologies for the spam.

Thanks
Pallavi


Getting Slaves list in hadoop

2009-07-27 Thread Pallavi Palleti
Hi all,

Is there an easy way to get the slaves list in Server.java code?

Thanks
Pallavi


Re: Getting Slaves list in hadoop

2009-07-27 Thread Pallavi Palleti
I can do that. But what if a user gives a different conf directory at startup? 
Is there a way to find that? Essentially, what I am looking for is a variable or 
property that specifies the location of the conf directory so that I can read 
the slaves file from there.

Thanks
Pallavi
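
A rough sketch of one way to do this (the HADOOP_CONF_DIR environment variable and the hadoop.home.dir system property are assumptions based on how the start-up scripts usually pass the conf location; adjust to however your daemons are actually launched):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SlavesList {

  /** Reads the slaves file from the configured conf directory. */
  public static List<String> readSlaves() throws IOException {
    String confDir = System.getenv("HADOOP_CONF_DIR");
    if (confDir == null) {
      // fall back to the default layout used by the start-up scripts
      confDir = System.getProperty("hadoop.home.dir", ".") + "/conf";
    }
    List<String> slaves = new ArrayList<String>();
    BufferedReader in = new BufferedReader(
        new FileReader(new File(confDir, "slaves")));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0 && !line.startsWith("#")) {
          slaves.add(line);   // one slave hostname per line
        }
      }
    } finally {
      in.close();
    }
    return slaves;
  }
}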
- Original Message -
From: "Ninad Raut" 
To: common-user@hadoop.apache.org
Sent: Monday, July 27, 2009 3:34:58 PM GMT +05:30 Chennai, Kolkata, Mumbai, New 
Delhi
Subject: Re: Getting Slaves list in hadoop

Write a Java class to read the slaves file in the conf folder. I hope that's what
you want.

On Mon, Jul 27, 2009 at 3:19 PM, Pallavi Palleti <
pallavi.pall...@corp.aol.com> wrote:

> Hi all,
>
> Is there an easy way to get the slaves list in Server.java code?
>
> Thanks
> Pallavi
>


Re: Remote access to cluster using user as hadoop

2009-07-30 Thread Pallavi Palleti
Hi all,

I have made changes to the hadoop-0.18.2 code to allow hadoop superuser access 
only from a specified IP range. If the IP is untrusted, it throws an exception. 
I would like to contribute it as a patch so that people can use it if needed in 
their environment. Can someone tell me the procedure for creating a patch? I can 
see from the trunk code that some related work is happening; in particular, I am 
looking at the Server.java code (PrivilegedActionException being thrown for an 
untrusted user, I believe). Can someone please clarify whether that was written 
for the purpose we are discussing (validating that a trusted superuser is 
connecting from a specific remote IP)? If not, then I would like to add my patch.

Thanks
Pallavi
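
A hypothetical sketch of the kind of check described above (the class, the trusted range, and the exact hook point are illustrative, not the actual patch; Server.getRemoteIp() is assumed to be callable from an IPC handler thread, which may not hold for every Hadoop version):

import java.net.InetAddress;

import org.apache.hadoop.ipc.Server;

final class SuperuserIpCheck {
  private static final String SUPERUSER = "hadoop";
  private static final String TRUSTED_PREFIX = "10.0.0.";   // assumed trusted range

  /** Rejects superuser calls that do not originate from the trusted range. */
  static void checkSuperuserAccess(String userName) {
    InetAddress caller = Server.getRemoteIp();   // address of the current RPC caller
    if (SUPERUSER.equals(userName) && caller != null
        && !caller.getHostAddress().startsWith(TRUSTED_PREFIX)) {
      throw new SecurityException(
          "Superuser access denied from untrusted address " + caller);
    }
  }
}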

 


Re: Remote access to cluster using user as hadoop

2009-07-30 Thread Pallavi Palleti
Thanks, Steve, for the link. It has answers to all my questions. I will discuss 
this on the dev mailing list.

Regards
Pallavi

- Original Message -
From: "Steve Loughran" 
To: common-user@hadoop.apache.org
Sent: Thursday, July 30, 2009 6:00:45 PM GMT +05:30 Chennai, Kolkata, Mumbai, 
New Delhi
Subject: Re: Remote access to cluster using user as hadoop

Pallavi Palleti wrote:
> Hi all,
> 
> I have made changes to the hadoop-0.18.2 code to allow hadoop super user 
> access only from some specified IP Range. If it is untrusted IP, it throws an 
> exception. I would like to add it as a patch so that people can use it if 
> needed in their environment. Can some one tell me what is the procedure to 
> create a patch? I could see from trunk code that there is some work related 
> to it happening. Especially, I am looking at Server.java code 
> (PrivilegedActionException being thrown for untrusted user I believe).Can 
> some one please clarify if it is written for the purpose that we are 
> discussing(validating whether it is trusted super user from a specific remote 
> IP)? If it is not, then I would like to add my patch.

Follow the advice on http://wiki.apache.org/hadoop/HowToContribute ; the 
developer mailing list and hadoop issues are the ways to discuss this. 
New features go onto SVN_HEAD, so checking out and building that will be 
your first bit of work.

-steve


No Space Left On Device though space is available

2009-08-02 Thread Pallavi Palleti
Hi all,

We have a 60-node cluster running hadoop-0.18.2. We are seeing "No space left on device" errors; the detailed error is:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: javax.xml.transform.TransformerException: java.io.IOException: No space left on device
        at org.apache.hadoop.conf.Configuration.write(Configuration.java:996)
        at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:530)
        at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:196)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1783)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

        at org.apache.hadoop.ipc.Client.call(Client.java:715)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)

Surprisingly, there is no space issue on the nodes. Still, it gives the above error. Can someone kindly let me know what the issue could be?

Thanks
Pallavi


File is closed but data is not visible

2009-08-11 Thread Pallavi Palleti
Hi all,

We have an application where we pull logs from an external server (far apart 
from the hadoop cluster) to the hadoop cluster. Sometimes we see a huge delay (of 
1 hour or more) before the data is actually visible in HDFS, even though the file 
has been closed and the variable set to null in the external application. I was 
under the impression that when I close the file, the data is reflected in the 
hadoop cluster. In this situation, it is even more complicated to handle write 
failures, as the client is given the false impression that data has been 
written to HDFS. Kindly clarify if my perception is correct. If yes, could 
someone tell me what is causing the delay in actually showing the data? In 
those cases, how can we tackle write failures (due to temporary issues 
like a data node being unavailable or a disk being full) when there is no way 
to detect the failure at the client side?

Thanks
Pallavi
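
For reference, a minimal sketch of how a client can confirm visibility immediately after close(), so that a failure surfaces right away instead of being discovered much later (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloseAndVerify {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/logs/2009/08/11/09/_1.txt");   // hypothetical path

    FSDataOutputStream out = fs.create(p);
    out.write("one log line\n".getBytes());
    out.close();   // on success, the data is visible to readers from here on

    // Fail fast if the file is not there: getFileStatus throws
    // FileNotFoundException when the path does not exist.
    FileStatus st = fs.getFileStatus(p);
    System.out.println("visible, length = " + st.getLen());
  }
}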


Re: File is closed but data is not visible

2009-08-11 Thread Pallavi Palleti
Hi Raghu,

The file doesn't appear in the cluster when I saw it from Namenode UI. Also, I 
have a monitor at cluster side which checks whether file is created and throws 
an exception when it is not created. And, it threw an exception saying "File 
not found".

Thanks
Pallavi
- Original Message -
From: "Raghu Angadi" 
To: common-user@hadoop.apache.org
Sent: Wednesday, August 12, 2009 12:10:12 AM GMT +05:30 Chennai, Kolkata, 
Mumbai, New Delhi
Subject: Re: File is closed but data is not visible


Your assumption is correct. When you close the file, others can read the 
data. There is no delay expected before the data is visible. If there is 
an error either write() or close() would throw an error.

When you say data is not visible do you mean readers can not see the 
file or can not see the data? Is it guaranteed that readers open the 
file _after_ close returns on the writer?

Raghu.

Palleti, Pallavi wrote:
> Hi Jason,
> 
> Apologies for missing version information in my previous mail. I am
> using hadoop-0.18.3. I am getting FSDataOutputStream object using
> fs.create(new Path(some_file_name)), where fs is FileSystem object. And,
> I am closing the file using close(). 
> 
> Thanks
> Pallavi
> 
> -Original Message-
> From: Jason Venner [mailto:jason.had...@gmail.com] 
> Sent: Tuesday, August 11, 2009 6:24 PM
> To: common-user@hadoop.apache.org
> Subject: Re: File is closed but data is not visible
> 
> Please provide information on what version of hadoop you are using and
> the
> method of opening and closing the file.
> 
> 



Re: File is closed but data is not visible

2009-08-12 Thread Pallavi Palleti
@corp.aol.com> wrote:
>>
>>> Hi Jason,
>>>
>>> The file is neither visible via the Namenode UI nor via program (checking
>>> whether a file exists).
>>>
>>> There is no caching happening at the application level. The application
>>> is pretty simple. We are taking apache logs and trying to put them into
>>> timely buckets based on the logged time of the records. We are creating 4
>>> files (one for every 15 minutes) for every hour. So, at the client side,
>>> we look into the logs and see if the data belongs to the current
>>> interval; if so, we write into the currently opened HDFS file. If it
>>> belongs to a new interval, the old file is closed and a new file is
>>> created. I have been logging the time at which the file is created and
>>> the time at which the file is closed at my client side, and I could see
>>> that the file is getting closed at the expected time. But when I look
>>> for the same file in the hadoop cluster, it is still not created, and if
>>> I wait for another 1 to 2 hours, I can see the file.
>>>
>>> Thanks
>>> Pallavi
>>>
>>>
>>> -Original Message-
>>> From: Jason Venner [mailto:jason.had...@gmail.com]
>>> Sent: Wednesday, August 12, 2009 6:03 PM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: File is closed but data is not visible
>>>
>>> Is it possible that your application is caching some data and not
>>> refreshing it when you expect?
>>> The HDFS file visibility semantics are well understood, and your case
>>> does not fit with that understanding.
>>> A factor that hints strongly at this is that your file is visible via
>>> the Namenode UI, there is nothing special about that UI
>>>



Re: File is closed but data is not visible

2009-08-13 Thread Pallavi Palleti
Hi Raghu and Jason,

Thanks for your help. Due to some confusion at our end, I was looking at a 
different output log file. Right now, I can see consistency in timing 
between the output log file and the namenode log file, though the problem is 
not resolved completely. I will seek your help if needed, and apologies 
for the confusion.
 
Thanks
Pallavi

 
- Original Message -
From: "Raghu Angadi" 
To: common-user@hadoop.apache.org
Sent: Thursday, August 13, 2009 10:04:47 AM GMT +05:30 Chennai, Kolkata, 
Mumbai, New Delhi
Subject: Re: File is closed but data is not visible

Pallavi Palleti wrote:
> yes.

Then you can check NameNode log for such a file name. If it is closed 
then you will notice 'completeFile...' message with the filename. This 
will also show if there was anything odd with the file.

Raghu.

> - Original Message -
> From: "Raghu Angadi" 
> To: common-user@hadoop.apache.org
> Sent: Wednesday, August 12, 2009 10:09:55 PM GMT +05:30 Chennai, Kolkata, 
> Mumbai, New Delhi
> Subject: Re: File is closed but data is not visible
> 
> 
> What happens when the while loop ends? Is 'out' closed then?
> 
> Palleti, Pallavi wrote:
>> No. I am closing it before opening a new one
>>
>> if (out != null) // if any output stream opened previously, close it
>> {
>>   logger.info("Closing writer of -" + paramWrapper.getOutFileStr());
>>   out.close();
>>   out = null;
>> }
>>
>> Thanks
>> Pallavi
>>
>> -Original Message-
>> From: Jason Venner [mailto:jason.had...@gmail.com] 
>> Sent: Wednesday, August 12, 2009 7:31 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: File is closed but data is not visible
>>
>> You do not appear to close out, except when an exception occurs.
>> The finally block only closes the reader.
>>
>> On Wed, Aug 12, 2009 at 6:24 AM, Palleti, Pallavi <
>> pallavi.pall...@corp.aol.com> wrote:
>>
>>> Hi Jason,
>>>
>>> Kindly find the snippet of code which creates and closes the file.
>>>
>>> Variables passed to the method: FSDataOutputStream out, ParamWrapper paramWrapper
>>>
>>> Snippet:
>>>
>>> String inputLine = null;
>>> int status = 0;
>>>
>>> BufferedReader reader = null;
>>>
>>> try {
>>>   reader = new BufferedReader(new InputStreamReader(...)); // reader initialization
>>>
>>>   while ((inputLine = reader.readLine()) != null) {
>>>
>>>     Date date = getLoggedDate(inputLine); // process the line to get the logged date; null if it is wrong
>>>     if (date == null) // if input data is wrong, don't write
>>>     {
>>>       continue;
>>>     }
>>>     Calendar cal = Calendar.getInstance();
>>>     cal.setTime(date);
>>>     int hour = cal.get(Calendar.HOUR_OF_DAY); // get input hour
>>>     int minutes = cal.get(Calendar.MINUTE);   // get input minute
>>>
>>>     int outputMinute = minutes / timePeriod + 1; // compute the slot
>>>     if (paramWrapper.prevHour != hour
>>>         || paramWrapper.prevMin != outputMinute) // if it is a new slot
>>>     {
>>>       if (out != null) // if any output stream opened previously, close it
>>>       {
>>>         logger.info("Closing writer of -" + paramWrapper.getOutFileStr());
>>>         out.close();
>>>         out = null;
>>>       }
>>>       String outFileStr = generateFileName(rootDir, hdfsOutFile, outputMinute, date); // generate file name, e.g. location/year/month/day/hour/_1.txt
>>>       Path outFile = new Path(outFileStr);
>>>
>>>       paramWrapper.setOutFileStr(outFileStr);
>>>       logger.info("Creating outFile:" + outFileStr);
>>>
>>>       out = fs.create(outFile); // create new file and get output stream
>>>       paramWrapper.setPrevHour(hour);
>>>       paramWrapper.setPrevMin(outputMinute);
>>>     }
>>>     StringBuilder outLineStr = new StringBuilder();
>>>     outLineStr.append(inputLine).append("\n");
>>>     out.write(outLineStr.toString().getBytes());
>>>   }
>>> } catch (IOException ioe) {
>>>   logger.error("Main: