Hello Ayush Saxena.
Thank you for your response.
We have parallel processes working on the same HDFS, but they are not touching
the affected directory.
We cannot exclude (stop) them, as it is production under load.
Also, we cannot enable debug mode because of the risk of impacting ongoing operations.
Scanning hdfs-audit log shows creation and data access to 'lost' directories or
their files, up to the day they were, well, lost.
No 'delete' or 'rename' operations are visible in the logs - the matches just stop.
>> then maybe check in edit logs, or enable debug logs and see for entries for
>> edit log,"doEditTx op"
I don't know how to do that, please elaborate.
I've attached yesterday's evidence (user perspective) of the loss of 2 partitions.
Right now I did the following:
- copied whole parent directory to another HDFS location
- started rebuilding 'lost' partitions, this will take 3-4 calendar days to
cover all missing days.
- only one partition is done so far, no loss appeared yet.
Thank you!
-----Original Message-----
From: Ayush Saxena
Sent: 18 October, 2023 2:25
To: Sergey Onuchin
Cc: Hdfs-dev ; Xiaoqiao He ;
user.hadoop
Subject: Re: MODERATE for hdfs-iss...@hadoop.apache.org
+ user@hadoop
This sounds pretty strange. Do you have any background job running in your
cluster, such as compaction, that touches the files? Are there any traces in
the NameNode logs? What happens to the blocks associated with those files? If
they get deleted before an FBR (full block report), I don't believe that is a
metadata loss; something triggered a delete, maybe on the parent directory.
Would it be possible to enable debug logs and grep for "DIR*
FSDirectory.delete:" (code here [1]), or check other delete-related entries in
the StateChangeLog?
Maybe try to capture all the Audit logs from the create entry to the moment
when you figure out files are missing & look for all the delete entries.
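A rough sketch of such a scan, in case it helps. It assumes the usual
hdfs-audit.log layout of tab-separated key=value fields (cmd=, src=, dst=); the
function names are made up for illustration. It also matches operations on any
ancestor of the lost path, since a delete or rename of a parent directory would
not mention the partition itself:

```python
def parse_audit_line(line):
    """Split one audit line into a dict of its tab-separated key=value fields."""
    fields = {}
    for part in line.rstrip("\n").split("\t"):
        key, sep, value = part.partition("=")
        if sep and key.strip():
            # The first field carries the log prefix, e.g. "... audit: allowed".
            fields[key.split()[-1]] = value
    return fields


def covers(candidate, target):
    """True if candidate is the target path itself or one of its ancestors."""
    if not candidate:
        return False
    candidate = candidate.rstrip("/") or "/"
    target = target.rstrip("/") or "/"
    if candidate == "/":
        return True
    return target == candidate or target.startswith(candidate + "/")


def find_removals(lines, lost_path):
    """Return (cmd, src, dst) for delete/rename entries that affect lost_path."""
    hits = []
    for line in lines:
        fields = parse_audit_line(line)
        cmd = fields.get("cmd", "")
        # Renames may be logged as "rename (options=[...])", so compare the first word.
        if not cmd or cmd.split()[0] not in ("delete", "rename"):
            continue
        if covers(fields.get("src", ""), lost_path):
            hits.append((cmd, fields["src"], fields.get("dst", "")))
    return hits
```

To use it, open each audit log file that covers the period from creation to
disappearance and pass the file object as `lines`, together with the full path
of one lost partition.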
Still no luck, then maybe check in edit logs, or enable debug logs and see for
entries for edit log,"doEditTx op"
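One way to check the edit logs without touching the running NameNode is to copy
the edits_* files from the NameNode's edits directory and convert them with the
Offline Edits Viewer, e.g. "hdfs oev -p xml -i <edits file> -o edits.xml". The
sketch below then scans that XML for delete and rename operations on the lost
path or any of its ancestors. The element names (RECORD, OPCODE, DATA, TXID,
PATH, SRC, TIMESTAMP) are what I would expect in oev's XML output, so please
verify them against your own file; the function names are only illustrative:

```python
import xml.etree.ElementTree as ET


def covers(candidate, target):
    """True if candidate is the target path itself or one of its ancestors."""
    if not candidate:
        return False
    candidate = candidate.rstrip("/") or "/"
    target = target.rstrip("/") or "/"
    if candidate == "/":
        return True
    return target == candidate or target.startswith(candidate + "/")


def find_edit_ops(xml_source, lost_path,
                  opcodes=("OP_DELETE", "OP_RENAME", "OP_RENAME_OLD")):
    """Return (txid, opcode, path, timestamp) for matching edit log records.

    xml_source is a file name or a binary file object holding oev XML output.
    """
    hits = []
    # iterparse streams the file, so large edit logs need not fit in memory.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "RECORD":
            continue
        opcode = elem.findtext("OPCODE")
        data = elem.find("DATA")
        if opcode in opcodes and data is not None:
            # Assumption: deletes carry PATH, renames carry SRC (and DST).
            path = data.findtext("PATH") or data.findtext("SRC")
            if covers(path, lost_path):
                hits.append((data.findtext("TXID"), opcode, path,
                             data.findtext("TIMESTAMP")))
        elem.clear()  # free the processed record
    return hits
```

Any hit gives a transaction ID and timestamp, which can then be matched against
the NameNode and audit logs around the same time.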
-Ayush
[1]
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java#L175
On Tue, 17 Oct 2023 at 17:57, Xiaoqiao He wrote:
>
> Hi Sergey Onuchin,
>
> Sorry to hear that. But we cannot give any suggestions based only on
> the information you mentioned.
> More on-site information would make this easier to trace, such as the
> deployment architecture, NameNode log, jstack, etc.
> In my experience, I have not seen a case where a directory was deleted
> without leaving a trace.
> Did you try to check operations (rename and delete) on the
> parent directory?
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
>
> On Mon, Oct 16, 2023 at 11:58 PM <
> hdfs-issues-reject-1697471875.2027154.pkchcedhioidkhech...@hadoop.apache.org>
> wrote:
>
> >
> > -- Forwarded message --
> > From: Sergey Onuchin
> > To: "hdfs-iss...@hadoop.apache.org"
> > Cc:
> > Bcc:
> > Date: Mon, 16 Oct 2023 15:57:47 +
> > Subject: HDFS loses directories with production data
> >
> > Hello,
> >
> >
> >
> > We’ve been using Hadoop (+Spark) in production for 3 years without major
> > issues.
> >
> >
> >
> > Lately we observe that whole non-empty directories (table
> > partitions) are disappearing in random ways.
> >
> > We see the creation of the directory and its data files in the
> > application logs (and in the hdfs-audit logs).
> >
> > Later, the directory is no longer present in HDFS.
> >
> >
> >
> > hdfs-audit.log shows no traces of deletes or renames for the
> > disappeared directories.
> >
> > We can trust these logs, as we see our manual operations are present
> > in the logs.
> >
> >
> >
> > The time between creation and disappearance is 1-2 days.
> >
> >
> >
> > We may be losing individual files as well; we just cannot determine
> > this reliably.
> >
> >
> >
> > This is a blocker issue for us; we have to stop production data
> > processing until we find and fix the root cause of the data loss.
> >
> >
> >
> > Please help to identify the root cause or find the right direction
> > for search/further questions.
> >
> >
> >
> >
> >
> > -- Hadoop version: --
> >
> > Hadoop 3.2.1
> >
> > Source code repository
> > https://gitbox.apache.org/repos/asf/hadoop.git -r
> > b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
> >
> > Compiled by rohithsharmaks on 2019-09-10T15:56Z
> >
> > Compiled with