[ https://issues.apache.org/jira/browse/HADOOP-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933718#comment-16933718 ]

Siddharth Seth edited comment on HADOOP-16207 at 9/19/19 8:02 PM:
------------------------------------------------------------------

Seeing several MR job failures when running tests on HADOOP-16445.

{code}
[ERROR]   ITestMagicCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestPartitionCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[ERROR]   ITestStagingCommitMRJob>AbstractITCommitMRJob.testMRJob:146->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
{code}
All of these fail consistently when run with -Ds3guard -Ddynamo -Dauth (they fail even when starting with a clean DDB table).

The test setup seems broken to me:
* Cluster setup happens with createCluster(new JobConf()).
* After this, AbstractITCommitMRJob creates the MR job with Job.getInstance(getClusterBinding().getConf() ...), which ends up reusing the previously created JobConf.
* That fresh JobConf reads only the default resources (core-site.xml etc.), so the command-line parameters -Ds3guard, -Ddynamo and -Dauth make no difference; see the sketch below.
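A minimal sketch of why the fresh JobConf drops the test properties (the wrapper class here is illustrative, not the actual test code; only JobConf itself is the real API):

{code}
import org.apache.hadoop.mapred.JobConf;

public class FreshJobConfSketch {
  public static void main(String[] args) {
    // A fresh JobConf loads only the default resources
    // (core-default.xml, core-site.xml, mapred-site.xml, ...).
    // The properties the test harness derives from -Ds3guard/-Ddynamo/-Dauth
    // live in the test's own Configuration object, which is never consulted
    // here, so the cluster configuration never sees them.
    JobConf clusterConf = new JobConf();
    System.out.println(clusterConf.get("fs.s3a.metadatastore.impl"));
    // -> whatever core-site.xml says (often null), regardless of -Ddynamo
  }
}
{code}

Job.getInstance(getClusterBinding().getConf()) then copies that same configuration into every MR job, so the jobs inherit the gap.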

Adding fs.s3a.metadatastore.authoritative=true and fs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore to auth-keys.xml or core-site.xml fixed all the test failures for me. (With those additions, the JobConf used by the cluster carries these settings, and the tests do what they're supposed to.)
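Concretely, that meant adding something like the following to auth-keys.xml (or core-site.xml); the property names and values are exactly the ones above:

{code}
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.metadatastore.authoritative</name>
  <value>true</value>
</property>
{code}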

That isn't the correct fix, though. Making sure the test configuration is used to create the JobConf for the cluster and jobs would let the test properties take effect; a sketch of that direction follows.
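A possible shape for that fix, assuming the test's Configuration is reachable at cluster-creation time (the class and method below are hypothetical; JobConf(Configuration) is the real constructor):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ClusterConfFixSketch {
  // 'testConf' stands in for the Configuration the test case built, i.e.
  // the S3A contract configuration with the s3guard/dynamo/auth options
  // already applied from the command line.
  static JobConf clusterJobConf(Configuration testConf) {
    // JobConf(Configuration) copies the given configuration, so the
    // cluster - and every Job later built from
    // getClusterBinding().getConf() - would see the test properties.
    return new JobConf(testConf);
  }
}
{code}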

That said, I did see three empty entries, marked as deleted, in the S3Guard table: part_0000, part_0001 and _SUCCESS. I suspect this is a result of the committer accessing a file on the client, getting a cached FileSystem instance (same UGI), whose getFileStatus call (maybe) creates these S3Guard DDB entries?
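For reference, a sketch of the FileSystem caching behaviour behind that suspicion: FileSystem.get() caches instances keyed by scheme, authority and UGI, so committer code running in the client JVM under the same UGI shares the test's S3AFileSystem, and with it the S3Guard metadata store (the bucket name below is a placeholder):

{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Two lookups of the same URI under the same UGI hit the cache:
    FileSystem fs1 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
    FileSystem fs2 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
    System.out.println(fs1 == fs2); // true - same instance
    // Any getFileStatus() the committer issues on the client therefore
    // goes through the shared instance's S3Guard store, which could
    // explain DDB entries appearing for paths the job touched.
  }
}
{code}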



> Fix ITestDirectoryCommitMRJob.testMRJob
> ---------------------------------------
>
>                 Key: HADOOP-16207
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16207
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3, test
>    Affects Versions: 3.3.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Critical
>
> Reported failure of {{ITestDirectoryCommitMRJob}} in validation runs of 
> HADOOP-16186; assertIsDirectory with s3guard enabled and a parallel test run: 
> Path "is recorded as deleted by S3Guard"
> {code}
>     waitForConsistency();
>     assertIsDirectory(outputPath) /* here */
> {code}
> The file is there but there's a tombstone. Possibilities:
> * some race condition with another test
> * tombstones aren't timing out
> * committers aren't creating that base dir in a way which cleans up S3Guard's 
> tombstones. 
> Remember: we do have to delete that dest dir before the committer runs unless 
> overwrite==true, so at the start of the run there will be a tombstone. It 
> should be overwritten by a success.


