RE: MODERATE for hdfs-iss...@hadoop.apache.org

2023-10-22 Thread Sergey Onuchin
Hello Ayush Saxena.

Thank you for your response.

We have parallel processes working on the same HDFS, but they are not touching 
the affected directory.
We cannot exclude (stop) them, as it is production under load.

Also, we cannot enable debug mode because of the risk of impacting ongoing operations.

Scanning the hdfs-audit log shows creation of, and data access to, the 'lost' 
directories and their files, up to the day they were, well, lost.
No 'delete' or 'rename' operations are visible in the logs; the matches simply stop.
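
A minimal sketch of the kind of audit-log scan described above, extended to also check every parent directory of a lost partition. It assumes the default audit line layout (tab-separated key=value fields such as cmd=, src=, dst=); the function names and the example paths are hypothetical, not taken from our cluster.

```python
import re
from typing import Dict, Iterable, Iterator, Set

# Assumption: audit fields are tab-separated key=value pairs after the log prefix.
FIELD_RE = re.compile(r"(\w+)=([^\t\n]*)")
SUSPECT_CMDS = {"delete", "rename"}


def normalize(path: str) -> str:
    """Strip trailing slashes, keeping '/' for the root."""
    return path.rstrip("/") or "/"


def ancestors(path: str) -> Set[str]:
    """Return the path itself plus every parent directory up to '/'."""
    parts = [p for p in path.split("/") if p]
    result = {"/"}
    for i in range(1, len(parts) + 1):
        result.add("/" + "/".join(parts[:i]))
    return result


def suspicious_lines(lines: Iterable[str], lost_path: str) -> Iterator[str]:
    """Yield audit lines that delete or rename lost_path or one of its parents."""
    targets = ancestors(lost_path)
    for line in lines:
        fields: Dict[str, str] = dict(FIELD_RE.findall(line))
        # The cmd value may carry a suffix, e.g. "rename (options=[...])".
        cmd = fields.get("cmd", "").split(" ", 1)[0]
        src = fields.get("src")
        if cmd in SUSPECT_CMDS and src and normalize(src) in targets:
            yield line.rstrip("\n")
```

Feeding it the lines of hdfs-audit.log and one lost partition path yields any delete or rename whose source is that path or one of its ancestors, which a search for the partition name alone would miss.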

>> then maybe check in edit logs, or enable debug logs and see for entries for 
>> edit log,"doEditTx op"
I don't know how to do that, please elaborate.

I've attached yesterday's evidence (from the user's perspective) of the loss of 2 partitions.

So far I have done the following:
- copied whole parent directory to another HDFS location
- started rebuilding 'lost' partitions, this will take 3-4 calendar days to 
cover all missing days.
- only one partition is done so far, no loss appeared yet.


Thank you!

-Original Message-
From: Ayush Saxena  
Sent: 18 October, 2023 2:25
To: Sergey Onuchin 
Cc: Hdfs-dev ; Xiaoqiao He ; 
user.hadoop 
Subject: Re: MODERATE for hdfs-iss...@hadoop.apache.org

+ user@hadoop

This sounds pretty strange. Do you have any background job running in your 
cluster, such as compaction, that touches the files? Are there any traces in 
the NameNode logs? What happens to the blocks associated with those files? If 
they get deleted before an FBR (full block report), I believe that isn't 
metadata loss; something triggered a delete, maybe on the parent directory.

Would it be possible to enable debug logs and grep for 
"DIR* FSDirectory.delete:" (code here [1]), or check other delete-related 
entries in the StateChangeLog?
Maybe try to capture all the audit logs from the create entry to the moment 
when you notice the files are missing, and look for all the delete entries.
If there is still no luck, then maybe check the edit logs, or enable debug logs 
and look for the edit log entries, "doEditTx op".

-Ayush

[1] 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java#L175

On Tue, 17 Oct 2023 at 17:57, Xiaoqiao He  wrote:
>
> Hi Sergey Onuchin,
>
> Sorry to hear that. But we cannot give any suggestions based only on 
> the information you mentioned.
> More on-site information would make this easier to trace, such as the 
> deployment architecture, NameNode log, jstack output, etc.
> In my experience, I have not seen any case where a directory was 
> deleted without leaving a trace.
> Did you try to check operations (rename and delete) on the 
> parent directory?
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
>
> On Mon, Oct 16, 2023 at 11:58 PM <
> hdfs-issues-reject-1697471875.2027154.pkchcedhioidkhech...@hadoop.apache.org>
> wrote:
>
> >
> > -- Forwarded message --
> > From: Sergey Onuchin 
> > To: "hdfs-iss...@hadoop.apache.org" 
> > Cc:
> > Bcc:
> > Date: Mon, 16 Oct 2023 15:57:47 +
> > Subject: HDFS loses directories with production data
> >
> > Hello,
> >
> >
> >
> > We’ve been using Hadoop (+Spark) for 3 years in production without major 
> > issues.
> >
> >
> >
> > Lately we have observed that whole non-empty directories (table 
> > partitions) disappear at random.
> >
> > In the application logs (and in the hdfs-audit logs) we see the creation 
> > of the directory and its data files.
> >
> > Later, the directory no longer exists in HDFS.
> >
> >
> >
> > hdfs-audit.log shows no traces of deletes or renames for the 
> > disappeared directories.
> >
> > We can trust these logs, as we see our manual operations are present 
> > in the logs.
> >
> >
> >
> > The time between creation and disappearance is 1-2 days.
> >
> >
> >
> > We may be losing individual files as well; we just cannot determine 
> > this reliably.
> >
> >
> >
> > This is a blocking issue for us: we have to stop production data 
> > processing until we find and fix the root cause of the data loss.
> >
> >
> >
> > Please help to identify the root cause or find the right direction 
> > for search/further questions.
> >
> >
> >
> >
> >
> > -- Hadoop version: --
> >
> > Hadoop 3.2.1
> >
> > Source code repository 
> > https://gitbox.apache.org/repos/asf/hadoop.git -r
> > b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
> >
> > Compiled by rohithsharmaks on 2019-09-10T15:56Z
> >
> > Compiled with 

Re: MODERATE for hdfs-iss...@hadoop.apache.org

2023-10-17 Thread Ayush Saxena
+ user@hadoop

This sounds pretty strange. Do you have any background job running in
your cluster, such as compaction, that touches the files? Are there any
traces in the NameNode logs? What happens to the blocks associated with
those files? If they get deleted before an FBR (full block report), I
believe that isn't metadata loss; something triggered a delete, maybe
on the parent directory.

Would it be possible to enable debug logs and grep for
"DIR* FSDirectory.delete:" (code here [1]), or check other
delete-related entries in the StateChangeLog?
Maybe try to capture all the audit logs from the create entry to the
moment when you notice the files are missing, and look for all the
delete entries.
If there is still no luck, then maybe check the edit logs, or enable
debug logs and look for the edit log entries, "doEditTx op".
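
A rough sketch of the edit-log check, without enabling debug logs: the Offline Edits Viewer can dump a copied edits file to XML (hdfs oev -i <edits file> -o edits.xml -p XML), and the XML can then be searched for delete and rename operations on the lost path or any of its parents. The element names used here (RECORD, OPCODE, DATA, TXID, PATH, SRC) are what I would expect in that dump, so treat them as assumptions and check them against your own output; the function names and paths are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Opcodes that can make a directory vanish from its original location.
WATCHED_OPS = {"OP_DELETE", "OP_RENAME", "OP_RENAME_OLD"}


def is_related(candidate: str, lost_path: str) -> bool:
    """True if candidate is lost_path itself or one of its parent directories."""
    candidate = candidate.rstrip("/") or "/"
    lost_path = lost_path.rstrip("/") or "/"
    if candidate == "/":
        return True
    return lost_path == candidate or lost_path.startswith(candidate + "/")


def find_ops(xml_source, lost_path: str) -> List[Tuple[Optional[str], str, str]]:
    """Return (txid, opcode, path) for each watched op touching lost_path.

    xml_source is a file name or binary file object holding the oev XML dump.
    """
    hits = []
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "RECORD":
            continue
        opcode = elem.findtext("OPCODE")
        data = elem.find("DATA")
        if opcode in WATCHED_OPS and data is not None:
            # Assumption: deletes carry PATH, renames carry SRC (and DST).
            path = data.findtext("PATH") or data.findtext("SRC") or ""
            if path and is_related(path, lost_path):
                hits.append((data.findtext("TXID"), opcode, path))
        elem.clear()  # keep memory flat on large dumps
    return hits
```

Calling find_ops("edits.xml", "/path/to/lost/partition") on each dumped segment that covers the 1-2 day window should show whether the NameNode recorded a delete or rename that never surfaced in the audit log search.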

-Ayush

[1] 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java#L175

On Tue, 17 Oct 2023 at 17:57, Xiaoqiao He  wrote:
>
> Hi Sergey Onuchin,
>
> Sorry to hear that. But we cannot give any suggestions based only on the
> information you mentioned.
> More on-site information would make this easier to trace, such as the
> deployment architecture, NameNode log, jstack output, etc.
> In my experience, I have not seen any case where a directory was deleted
> without leaving a trace.
> Did you try to check operations (rename and delete) on the
> parent directory?
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
>
> On Mon, Oct 16, 2023 at 11:58 PM <
> hdfs-issues-reject-1697471875.2027154.pkchcedhioidkhech...@hadoop.apache.org>
> wrote:
>
> >
> > -- Forwarded message --
> > From: Sergey Onuchin 
> > To: "hdfs-iss...@hadoop.apache.org" 
> > Cc:
> > Bcc:
> > Date: Mon, 16 Oct 2023 15:57:47 +
> > Subject: HDFS loses directories with production data
> >
> > Hello,
> >
> >
> >
> > We’ve been using Hadoop (+Spark) for 3 years in production without major
> > issues.
> >
> >
> >
> > Lately we have observed that whole non-empty directories (table partitions)
> > disappear at random.
> >
> > In the application logs (and in the hdfs-audit logs) we see the creation of
> > the directory and its data files.
> >
> > Later, the directory no longer exists in HDFS.
> >
> >
> >
> > hdfs-audit.log shows no traces of deletes or renames for the disappeared
> > directories.
> >
> > We can trust these logs, as we see our manual operations are present in
> > the logs.
> >
> >
> >
> > The time between creation and disappearance is 1-2 days.
> >
> >
> >
> > We may be losing individual files as well; we just cannot determine this
> > reliably.
> >
> >
> >
> > This is a blocking issue for us: we have to stop production data processing
> > until we find and fix the root cause of the data loss.
> >
> >
> >
> > Please help to identify the root cause or find the right direction for
> > search/further questions.
> >
> >
> >
> >
> >
> > -- Hadoop version: --
> >
> > Hadoop 3.2.1
> >
> > Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r
> > b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
> >
> > Compiled by rohithsharmaks on 2019-09-10T15:56Z
> >
> > Compiled with protoc 2.5.0
> >
> > From source with checksum 776eaf9eee9c0ffc370bcbc1888737
> >
> >
> >
> > Thank you!
> >
> > Sergey Onuchin
> >
> >
> >
