Hello Ayush Saxena.

Thank you for your response.

We have parallel processes working on the same HDFS, but they do not touch the 
affected directory.
We cannot stop them, as this is a production cluster under load.

We also cannot enable debug logging, due to the risk of impacting ongoing 
operations.

Scanning the hdfs-audit log shows creation of and data access to the 'lost' 
directories and their files, up to the day they were, well, lost.
No 'delete' or 'rename' operations are visible in the logs - the matches simply 
stop.
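
For reference, this is roughly the kind of scan we run (the audit log location 
below is illustrative; the partition path is one of the lost ones):

# look for any delete/rename touching one of the lost partitions
grep 'calculated_at=2023-10-15' /var/log/hadoop/hdfs/hdfs-audit.log* \
  | grep -E 'cmd=(delete|rename)'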

>> Then maybe check the edit logs, or enable debug logs and look for edit-log 
>> entries ("doEditTx op")
I don't know how to do that, please elaborate.

I've attached yesterday's evidence (from the user perspective) of the loss of 
2 partitions.

Right now I have done the following:
- copied the whole parent directory to another HDFS location (a sketch of the 
copy is shown after this list)
- started rebuilding the 'lost' partitions; this will take 3-4 calendar days 
to cover all missing days
- only one partition is done so far, and no loss has appeared yet
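
A rough sketch of the backup copy (the destination path is illustrative):

# preserve a copy of the parent directory before rebuilding
hdfs dfs -mkdir -p /acep/backup
hdfs dfs -cp -p /acep/data/marts/fact_lm_usages /acep/backup/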


Thank you!

-----Original Message-----
From: Ayush Saxena <ayush...@gmail.com> 
Sent: 18 October, 2023 2:25
To: Sergey Onuchin <sergey.onuc...@acronis.com>
Cc: Hdfs-dev <hdfs-...@hadoop.apache.org>; Xiaoqiao He <xq.he2...@gmail.com>; 
user.hadoop <user@hadoop.apache.org>
Subject: Re: MODERATE for hdfs-iss...@hadoop.apache.org

+ user@hadoop

This sounds pretty strange. Do you have any background job running in your 
cluster, like compaction-type work, that plays with the files? Any traces in 
the NameNode logs of what happens to the blocks associated with those files? 
If they get deleted before an FBR, that isn't metadata loss, I believe - 
something triggered a delete, maybe on the parent directory?

Will it be possible to enable debug logs and grep for "DIR* 
FSDirectory.delete:" (code here [1]), or check for other delete-related 
entries from the StateChange log?
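
A minimal sketch of one way to do that at runtime (the host name, HTTP port 
9870 and log path below are assumptions - adjust for your cluster):

# raise the StateChange logger (which emits "DIR* FSDirectory.delete:") to DEBUG
hadoop daemonlog -setlevel nn-host.example.com:9870 \
    org.apache.hadoop.hdfs.StateChange DEBUG

# then grep the NameNode log for delete entries on the affected path
grep 'DIR\* FSDirectory.delete:' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log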
Maybe try to capture all the audit logs from the create entry up to the moment 
you find the files are missing, and look for all the delete entries.
Still no luck? Then maybe check the edit logs, or enable debug logs and look 
for edit-log entries ("doEditTx op")
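
If you go the edit-log route, one possible offline approach (the edits file 
name and directory below are illustrative - the real segments live under 
dfs.namenode.name.dir/current on the NameNode):

# dump an edit log segment to XML with the offline edits viewer
hdfs oev -p xml \
    -i /data/hadoop/name/current/edits_0000000000000000001-0000000000000000100 \
    -o /tmp/edits.xml

# look for delete operations and the paths they touched
grep -B 2 -A 6 'OP_DELETE' /tmp/edits.xml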

-Ayush

[1] 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java#L175

On Tue, 17 Oct 2023 at 17:57, Xiaoqiao He <xq.he2...@gmail.com> wrote:
>
> Hi Sergey Onuchin,
>
> Sorry to hear that, but we cannot offer concrete suggestions based only on 
> the information you have provided.
> More on-site information would make this easier to trace, such as the 
> deployment architecture, NameNode logs, jstack output, etc.
> In my experience, I have not come across a case where a directory was 
> deleted without leaving any trace.
> Did you check for rename and delete operations on the parent directory?
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
>
> On Mon, Oct 16, 2023 at 11:58 PM
> <hdfs-issues-reject-1697471875.2027154.pkchcedhioidkhech...@hadoop.apache.org>
> wrote:
>
> >
> > ---------- Forwarded message ----------
> > From: Sergey Onuchin <sergey.onuc...@acronis.com.invalid>
> > To: "hdfs-iss...@hadoop.apache.org" <hdfs-iss...@hadoop.apache.org>
> > Cc:
> > Bcc:
> > Date: Mon, 16 Oct 2023 15:57:47 +0000
> > Subject: HDFS loses directories with production data
> >
> > Hello,
> >
> >
> >
> > We’ve been using Hadoop (+Spark) for 3 years on production w/o major 
> > issues.
> >
> >
> >
> > Lately we have observed that whole non-empty directories (table 
> > partitions) disappear in seemingly random ways.
> >
> > We see in the application logs (and in the hdfs-audit log) the creation of 
> > the directory and its data files.
> >
> > Later, the directory is simply no longer present in HDFS.
> >
> >
> >
> > hdfs-audit.log shows no traces of deletes or renames for the 
> > disappeared directories.
> >
> > We can trust these logs, as our own manual operations do appear in them.
> >
> >
> >
> > The time between creation and disappearance is 1-2 days.
> >
> >
> >
> > We may be losing individual files as well; we just cannot determine this 
> > reliably.
> >
> >
> >
> > This is a blocker issue for us; we have to stop production data 
> > processing until we find and fix the root cause of the data loss.
> >
> >
> >
> > Please help us identify the root cause, or point us in the right 
> > direction for further investigation.
> >
> >
> >
> >
> >
> > -- Hadoop version: ------------------------------
> >
> > Hadoop 3.2.1
> >
> > Source code repository 
> > https://gitbox.apache.org/repos/asf/hadoop.git -r
> > b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
> >
> > Compiled by rohithsharmaks on 2019-09-10T15:56Z
> >
> > Compiled with protoc 2.5.0
> >
> > From source with checksum 776eaf9eee9c0ffc370bcbc1888737
> >
> >
> >
> > Thank you!
> >
> > Sergey Onuchin
> >
> >
> >
Evidence of data loss: in 1 hour we lost (at least) 2 folders.
ACEP Marts were NOT running, so we can conclude this is not a bug in the marts 
calculation.

ACEP DWH was running and was restarted once within this hour.
distcp was in progress.



[19:25:33] r...@acep-dwh-hadoop-apps-vm01.acep:/home/sonuchin # hdfs dfs -ls -R 
/acep/data/marts | grep -v parquet | grep '=2023-10'
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-02 09:30 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-01
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-02 12:43 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-02
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-03 03:26 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-03
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-16 20:46 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-15   <------ HERE
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-16 23:31 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-16   <------ HERE
...
 
 
[20:30:21] r...@acep-dwh-hadoop-apps-vm01.acep:/home/sonuchin # hdfs dfs -ls 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-15
ls: `/acep/data/marts/fact_lm_usages/calculated_at=2023-10-15': No such file or 
directory

[20:32:24] r...@acep-dwh-hadoop-apps-vm01.acep:/home/sonuchin # hdfs dfs -ls 
/acep/data/marts/fact_lm_usages | grep '=2023-10'
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-02 09:30 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-01
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-02 12:43 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-02
drwxr-xr-x   - acepdwh acepdwh_all_ro          0 2023-10-03 03:26 
/acep/data/marts/fact_lm_usages/calculated_at=2023-10-03



distcp also fell victim to the data loss:

2023-10-17 20:45:38,497 INFO mapreduce.Job: Task Id : 
attempt_1694746061700_0447_m_000004_2, Status : FAILED
Error: java.io.IOException: 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: 
java.io.FileNotFoundException: File does not exist: 
hdfs://dwh/acep/data/marts/fact_lm_usages/calculated_at=2023-10-16/part-00047-3dde665e-647d-4ad0-8b06-2fd6bc101312-c000.snappy.parquet