Re: Update UGI with new tokens during the lifespan of a yarn application

2024-06-11 Thread Clay B.

Hi Ankur,

There was some work I did in HADOOP-16298; the final code I used for 
$dayjob works for HDFS and HBase (HBase is non-renewable) tokens but 
Kubernetes was doing the on-disk token updates for me[1]; I just had to 
refresh the state in UGI. I ended up making the code proactively refresh 
tokens rather than wait for a token error, as there was a race condition 
when HBase 2.x came out. I think my last copy as I was trying to port it 
upstream was at: 
https://github.com/cbaenziger/hadoop/tree/hadoop-16298-wip (however, I no 
longer recall what was left to do).
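
A minimal sketch of that "refresh the state in UGI from the externally
updated token file" step (this is not the HADOOP-16298 code; the class name
and path below are purely illustrative) would look roughly like:

```
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

public class TokenRefresher {
  /**
   * Re-read a token file that something else (Kubernetes, in my case) keeps
   * up to date, and merge its tokens into the current UGI so that tokens
   * under the same alias supersede the stale ones.
   */
  public static void refreshFromFile(File tokenFile, Configuration conf) throws Exception {
    Credentials fresh = Credentials.readTokenStorageFile(tokenFile, conf);
    UserGroupInformation.getCurrentUser().addCredentials(fresh);
  }

  public static void main(String[] args) throws Exception {
    // Illustrative location only; use wherever your sidecar writes the tokens.
    refreshFromFile(new File("/var/run/secrets/hadoop/tokens"), new Configuration());
  }
}
```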


Unfortunately, I got moved off this for my day job, so I haven't had time to 
revisit completing the contribution. Perhaps some of my hacks can help 
you? Also, the testing got a bit zany[2] in order to tickle the races.


-Clay

[1]: A shim to get tokens similarly can be seen at 
https://github.com/cbaenziger/hadoop-token-cli
[2]: The testing code I used is at 
https://github.com/cbaenziger/delegation_token_tests


On Tue, 11 Jun 2024, Wei-Chiu Chuang wrote:


That sounds like what Spark did. Take a look at this doc:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/README.md
The Spark AM has a Kerberos keytab and periodically acquires a new
delegation token (the old one is ignored) to make sure it always has a valid DT.
Finally, it distributes the DT to all executors.

On Tue, Jun 11, 2024 at 4:34 AM Ankur Khanna  
wrote:

  Hi experts,

   

  I have a use-case with an external session token that is short-lived and
  does not renew (i.e., unlike a Hadoop delegation token, the expiry time
  is not updated for this token). For a long-running application (longer
  than the lifespan of the external token), I want to update the
  UGI/Credentials object of each and every worker container with a new token.

  If I understand correctly, all delegation tokens are shared at the launch 
of a container.

  Is there any way to update the credential object after the launch of the 
container and during the lifespan of the application?


  Best,

  Ankur Khanna

   

 




-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org

Re: Update UGI with new tokens during the lifespan of a yarn application

2024-06-11 Thread Wei-Chiu Chuang
That sounds like what Spark did.
Take a look at this doc:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/README.md
The Spark AM has a Kerberos keytab and periodically acquires a new
delegation token (the old one is ignored) to make sure it always has a
valid DT. Finally, it distributes the DT to all executors.
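
Roughly, that pattern looks like the sketch below (this is not Spark's
actual code; the principal, keytab, renewer name and output path are all
illustrative assumptions):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

public class DelegationTokenFetcher {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The AM keeps a keytab so it can always authenticate with Kerberos.
    UserGroupInformation.loginUserFromKeytab("app@EXAMPLE.COM",
        "/etc/security/keytabs/app.keytab");

    // Run this periodically: fetch brand-new delegation tokens instead of
    // renewing the old ones.
    Credentials creds = new Credentials();
    FileSystem fs = FileSystem.get(conf);
    fs.addDelegationTokens("yarn", creds);

    // Persist the fresh tokens somewhere the executors can read them from,
    // then have each executor merge them into its own UGI.
    creds.writeTokenStorageFile(new Path("hdfs:///app/tokens/current"), conf);
  }
}
```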

On Tue, Jun 11, 2024 at 4:34 AM Ankur Khanna
 wrote:

> Hi experts,
>
>
>
> I have a use-case with an external session token that is short-lived and
> does not renew (i.e., unlike a Hadoop delegation token, the expiry time is not
> updated for this token). For a long-running application (longer than the
> lifespan of the external token), I want to update the UGI/Credentials object
> of each and every worker container with a new token.
>
> If I understand correctly, all delegation tokens are shared at the launch
> of a container.
>
> Is there any way to update the credential object after the launch of the
> container and during the lifespan of the application?
>
>
> Best,
>
> Ankur Khanna
>
>
>
>
>


Re: bootstrap standby namenode failure

2024-05-28 Thread anup ahire
Thanks Ayush,

I am trying to understand the reason why the Active NN does not have a record
of the txn ids that are in the shared edits space.

On Sat, May 25, 2024 at 7:54 AM Ayush Saxena  wrote:

> Hi Anup,
> Did you explore -skipSharedEditsCheck? Check this ticket once [1] to see if
> your use case is similar; a little bit of description can be found here
> [2] (search for skipSharedEditsCheck). The jira does mention another
> solution as well, in case you don't like this one or it doesn't work.
>
> -Ayush
>
>
> [1] https://issues.apache.org/jira/browse/HDFS-4120
> [2]
> https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#namenode
>
> On Sat, 25 May 2024 at 01:59, anup ahire  wrote:
> >
> > Hello Team,
> >
> > I am trying to recover the failed node which has namenode and journal
> node,  the cluster has one active NN and 2 journal nodes currently.
> > When I am trying to setup node being recovered as standby, I am getting
> this error.
> >
> > java.io.IOException: Gap in transactions. Expected to be able to read up
> until at least txid 22450 but unable to find any edit logs containing txid 1
> >
> > Any idea what might be happening? As one NN active and 2 journal nodes
> are running, I was hoping all edit logs would be in sync.
> >
> > Thanks.
>


Re: HBase lz4 UnsatisfiedLinkError

2024-05-27 Thread fetch

Hi Ayush,

Upgrading to 2.6.0-hadoop3 worked, thanks so much!

On 2024-05-25 20:15, Ayush Saxena wrote:

Multiple things. The output of checknative only contains these specific
items, not everything; see the code [1]. So looking at your command
output, everything is sorted there barring OpenSSL & PMDK, which you
explicitly didn't ask for in your maven command, & I believe you don't
need them either; in case you do need them, the instructions are in
[2].

Looking at the trace:

at org.apache.hadoop.io.compress.Lz4Codec.getLibraryName(Lz4Codec.java:73)

You mentioned building ver 3.3.6, but your exception trace is calling
getLibraryName, which isn't present in Lz4Codec.java in ver 3.3.6
[3]; this method got removed as part of HADOOP-17292 [4], that is, post
Hadoop ver 3.3.1. If you read the release notes of that ticket
you can see that for Hadoop 3.3.1+ the Lz4 support works OOTB.

So, most likely it isn't a Hadoop problem.

What could be happening is that the HBase version you are using is
pulling in an older Hadoop release which is messing things up. So I
would say try using the hadoop-3 binary of the latest HBase version, 2.6.0
[5], & see how things go; else download the source tar of their latest
release 2.6.0 and build with -Phadoop-3.0
-Dhadoop-three.version=3.3.6. Looking at their source code, they still
use 3.3.5 by default.

-Ayush

[1]
https://github.com/apache/hadoop/blob/1baf0e889fec54b6560417b62cada75daf6fe312/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NativeLibraryChecker.java#L137-L144
[2] https://github.com/apache/hadoop/blob/branch-3.3.6/BUILDING.txt
[3]
https://github.com/apache/hadoop/blob/branch-3.3.6/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/Lz4Codec.java#L73
[4] https://issues.apache.org/jira/browse/HADOOP-17292
[5]
https://www.apache.org/dyn/closer.lua/hbase/2.6.0/hbase-2.6.0-hadoop3-bin.tar.gz

On Sat, 25 May 2024 at 22:41,  wrote:


Hey Ayush, thanks for the advice!

Building 3.3.6 from an EL9.4 machine resulted in the following:

[root@localhost bin]# JAVA_HOME=/etc/alternatives/java_sdk_openjdk/
./hadoop checknative -a
2024-05-25 19:05:56,068 INFO bzip2.Bzip2Factory: Successfully loaded &
initialized native-bzip2 library system-native
2024-05-25 19:05:56,071 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
2024-05-25 19:05:56,097 INFO nativeio.NativeIO: The native code was
built without PMDK support.
Native library checking:
hadoop:  true
/root/build/hadoop-3.3.6-src/hadoop-dist/target/hadoop-3.3.6/lib/native/libhadoop.so.1.0.0
zlib:true /lib64/libz.so.1
zstd  :  true /lib64/libzstd.so.1
bzip2:   true /lib64/libbz2.so.1
openssl: false EVP_CIPHER_CTX_block_size
ISA-L:   true /lib64/libisal.so.2
PMDK:false The native code was built without PMDK support.

No mention of lz4, though lz4[-devel] packages were installed on the
compiling host as per the BUILDING instructions. Is there a build option
I'm missing? I'm using:

* mvn -X package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true



Unfortunately the hbase "org.apache.hadoop.util.NativeLibraryChecker",
using this freshly made hadoop native library, also failed to load lz4
in the same way as the initial message with no extra information from 
debug:


2024-05-25T19:03:42,320 WARN  [main] lz4.Lz4Compressor:
java.lang.UnsatisfiedLinkError: 'void
org.apache.hadoop.io.compress.lz4.Lz4Compressor.initIDs()'
Exception in thread "main" java.lang.UnsatisfiedLinkError:
'java.lang.String
org.apache.hadoop.io.compress.lz4.Lz4Compressor.getLibraryName()'
 at
org.apache.hadoop.io.compress.lz4.Lz4Compressor.getLibraryName(Native
Method)
 at
org.apache.hadoop.io.compress.Lz4Codec.getLibraryName(Lz4Codec.java:73)
 at
org.apache.hadoop.util.NativeLibraryChecker.main(NativeLibraryChecker.java:109)

Thanks for your help!


On 5/25/24 4:16 PM, Ayush Saxena wrote:
> above things don't work then enable debug logging &
> then run the checknative command and capture the log & exception as
> here [2] & they might give you an an


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: HBase lz4 UnsatisfiedLinkError

2024-05-25 Thread Ayush Saxena
Multiple things. The output of checknative only contains these specific
items, not everything; see the code [1]. So looking at your command
output, everything is sorted there barring OpenSSL & PMDK, which you
explicitly didn't ask for in your maven command, & I believe you don't
need them either; in case you do need them, the instructions are in
[2].

Looking at the trace:
> at org.apache.hadoop.io.compress.Lz4Codec.getLibraryName(Lz4Codec.java:73)

You mentioned building ver 3.3.6, but your exception trace is calling
getLibraryName, which isn't present in Lz4Codec.java in ver 3.3.6
[3]; this method got removed as part of HADOOP-17292 [4], that is, post
Hadoop ver 3.3.1. If you read the release notes of that ticket
you can see that for Hadoop 3.3.1+ the Lz4 support works OOTB.

So, most likely it isn't a Hadoop problem.

What could be happening is that the HBase version you are using is
pulling in an older Hadoop release which is messing things up. So I
would say try using the hadoop-3 binary of the latest HBase version, 2.6.0
[5], & see how things go; else download the source tar of their latest
release 2.6.0 and build with -Phadoop-3.0
-Dhadoop-three.version=3.3.6. Looking at their source code, they still
use 3.3.5 by default.

-Ayush

[1] 
https://github.com/apache/hadoop/blob/1baf0e889fec54b6560417b62cada75daf6fe312/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NativeLibraryChecker.java#L137-L144
[2] https://github.com/apache/hadoop/blob/branch-3.3.6/BUILDING.txt
[3] 
https://github.com/apache/hadoop/blob/branch-3.3.6/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/Lz4Codec.java#L73
[4] https://issues.apache.org/jira/browse/HADOOP-17292
[5] 
https://www.apache.org/dyn/closer.lua/hbase/2.6.0/hbase-2.6.0-hadoop3-bin.tar.gz

On Sat, 25 May 2024 at 22:41,  wrote:
>
> Hey Ayush, thanks for the advice!
>
> Building 3.3.6 from an EL9.4 machine resulted in the following:
>
> [root@localhost bin]# JAVA_HOME=/etc/alternatives/java_sdk_openjdk/
> ./hadoop checknative -a
> 2024-05-25 19:05:56,068 INFO bzip2.Bzip2Factory: Successfully loaded &
> initialized native-bzip2 library system-native
> 2024-05-25 19:05:56,071 INFO zlib.ZlibFactory: Successfully loaded &
> initialized native-zlib library
> 2024-05-25 19:05:56,097 INFO nativeio.NativeIO: The native code was
> built without PMDK support.
> Native library checking:
> hadoop:  true
> /root/build/hadoop-3.3.6-src/hadoop-dist/target/hadoop-3.3.6/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> zstd  :  true /lib64/libzstd.so.1
> bzip2:   true /lib64/libbz2.so.1
> openssl: false EVP_CIPHER_CTX_block_size
> ISA-L:   true /lib64/libisal.so.2
> PMDK:false The native code was built without PMDK support.
>
> No mention of lz4, though lz4[-devel] packages were installed on the
> compiling host as per the BUILDING instructions. Is there a build option
> I'm missing? I'm using:
>
> * mvn -X package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true
>
>
> Unfortunately the hbase "org.apache.hadoop.util.NativeLibraryChecker",
> using this freshly made hadoop native library, also failed to load lz4
> in the same way as the initial message with no extra information from debug:
>
> 2024-05-25T19:03:42,320 WARN  [main] lz4.Lz4Compressor:
> java.lang.UnsatisfiedLinkError: 'void
> org.apache.hadoop.io.compress.lz4.Lz4Compressor.initIDs()'
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
> 'java.lang.String
> org.apache.hadoop.io.compress.lz4.Lz4Compressor.getLibraryName()'
>  at
> org.apache.hadoop.io.compress.lz4.Lz4Compressor.getLibraryName(Native
> Method)
>  at
> org.apache.hadoop.io.compress.Lz4Codec.getLibraryName(Lz4Codec.java:73)
>  at
> org.apache.hadoop.util.NativeLibraryChecker.main(NativeLibraryChecker.java:109)
>
> Thanks for your help!
>
>
> On 5/25/24 4:16 PM, Ayush Saxena wrote:
> > above things don't work then enable debug logging &
> > then run the checknative command and capture the log & exception as
> > here [2] & they might give you an an

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: HBase lz4 UnsatisfiedLinkError

2024-05-25 Thread fetch

Hey Ayush, thanks for the advice!

Building 3.3.6 from an EL9.4 machine resulted in the following:

[root@localhost bin]# JAVA_HOME=/etc/alternatives/java_sdk_openjdk/ 
./hadoop checknative -a
2024-05-25 19:05:56,068 INFO bzip2.Bzip2Factory: Successfully loaded & 
initialized native-bzip2 library system-native
2024-05-25 19:05:56,071 INFO zlib.ZlibFactory: Successfully loaded & 
initialized native-zlib library
2024-05-25 19:05:56,097 INFO nativeio.NativeIO: The native code was 
built without PMDK support.

Native library checking:
hadoop:  true 
/root/build/hadoop-3.3.6-src/hadoop-dist/target/hadoop-3.3.6/lib/native/libhadoop.so.1.0.0

zlib:    true /lib64/libz.so.1
zstd  :  true /lib64/libzstd.so.1
bzip2:   true /lib64/libbz2.so.1
openssl: false EVP_CIPHER_CTX_block_size
ISA-L:   true /lib64/libisal.so.2
PMDK:    false The native code was built without PMDK support.

No mention of lz4, though lz4[-devel] packages were installed on the 
compiling host as per the BUILDING instructions. Is there a build option 
I'm missing? I'm using:


* mvn -X package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true


Unfortunately the hbase "org.apache.hadoop.util.NativeLibraryChecker", 
using this freshly made hadoop native library, also failed to load lz4 
in the same way as the initial message with no extra information from debug:


2024-05-25T19:03:42,320 WARN  [main] lz4.Lz4Compressor: 
java.lang.UnsatisfiedLinkError: 'void 
org.apache.hadoop.io.compress.lz4.Lz4Compressor.initIDs()'
Exception in thread "main" java.lang.UnsatisfiedLinkError: 
'java.lang.String 
org.apache.hadoop.io.compress.lz4.Lz4Compressor.getLibraryName()'
    at 
org.apache.hadoop.io.compress.lz4.Lz4Compressor.getLibraryName(Native 
Method)
    at 
org.apache.hadoop.io.compress.Lz4Codec.getLibraryName(Lz4Codec.java:73)
    at 
org.apache.hadoop.util.NativeLibraryChecker.main(NativeLibraryChecker.java:109)


Thanks for your help!


On 5/25/24 4:16 PM, Ayush Saxena wrote:

above things don't work then enable debug logging &
then run the checknative command and capture the log & exception as
here [2] & they might give you an an


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: bootstrap standby namenode failure

2024-05-25 Thread Ayush Saxena
Hi Anup,
Did you explore -skipSharedEditsCheck? Check this ticket once [1] to see if
your use case is similar; a little bit of description can be found here
[2] (search for skipSharedEditsCheck). The jira does mention another
solution as well, in case you don't like this one or it doesn't work.

-Ayush


[1] https://issues.apache.org/jira/browse/HDFS-4120
[2] 
https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#namenode

On Sat, 25 May 2024 at 01:59, anup ahire  wrote:
>
> Hello Team,
>
> I am trying to recover the failed node which has namenode and journal node,  
> the cluster has one active NN and 2 journal nodes currently.
> When I am trying to setup node being recovered as standby, I am getting this 
> error.
>
> java.io.IOException: Gap in transactions. Expected to be able to read up 
> until at least txid 22450 but unable to find any edit logs containing txid 1
>
> Any idea what might be happening? As one NN active and 2 journal nodes are 
> running, I was hoping all edit logs would be in sync.
>
> Thanks.

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: HBase lz4 UnsatisfiedLinkError

2024-05-25 Thread Ayush Saxena
Hi,

We can't help with the HBase thing, for that you need to chase the
HBase user ML.

For the `hadoop checknative -a` showing false, maybe the native
libraries that are pre-built & published aren't compatible with the OS
you are using, In that case you need to build them on the "same" OS,
the instructions are here: [1] & replace those generated native files
with the existing ones.

Second, Run the command `hadoop jnipath` and see the output path &
check you have the native libs in that directory.

If both of the above things don't work then enable debug logging &
then run the checknative command and capture the log & exception as
here [2] & they might give you an answer why the native libraries
aren't getting loaded.

Most probably solving the Hadoop stuff should dispel the HBase or any
downstream problem tethered to native libs.

-Ayush


[1] 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Build
[2] 
https://github.com/apache/hadoop/blob/1baf0e889fec54b6560417b62cada75daf6fe312/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NativeCodeLoader.java#L52-L55

On Sat, 25 May 2024 at 18:58,  wrote:
>
> Hi all,
>
> Using hadoop 3.3.6, hbase 2.5.6, and jdk 11  on EL9 we're seeing an
> UnsatisfiedLinkError when running the NativeLibraryChecker. It's
> identical to this question on StackOverflow:
>
> *
> https://stackoverflow.com/questions/72517212/check-hbase-native-extension-got-warn-main-lz4-lz4compressor-java-lang-unsati
>
> I've noticed it was moved from the os packages to lz4-java,  and now
> exists in hbase/libs. Is this just a java library path issue?
>
> On the NativeLibrary docs page it says the native hadoop library
> includes various components, including lz4. When running 'hadoop
> checknative -a' as is done in the example down the page, our output is
> missing lz4.
>
> *
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html
>
> Thanks for your time!
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: How is HDFS Erasure Coding Phase II now?

2024-04-22 Thread Ayush Saxena
Hi,
>  Or is it just not developed to this point?

It isn't developed & I don't think there is any effort going on in that
direction

> I learned that continuous layout can ensure the locality of file blocks

How? Erasure Coding will have BlockGroups, not just one Block; whether you
write in a striped manner or in a contiguous manner, it will spread over
an equal number of Datanodes based on the BPP. I am not sure that anything
changes with locality just because of the way the EC blocks are written.

> , I have large files and write them once and read them many times.

Erasure Coding in general was developed for storing Archival data, so you
need to figure out how "many" is ok.


-Ayush

On Mon, 22 Apr 2024 at 15:56, zhangzhengli <1278789...@qq.com.invalid>
wrote:

> Hi all, Since HDFS-8030, hdfs ec continuous layout has not developed much.
> Are there any difficulties? Or is it just not developed to this point?
>
>
>
> I learned that continuous layout can ensure the locality of file blocks,
> and I want to use this feature in near-data scenarios. For example, I have
> large files and write them once and read them many times.
>
>
>
> Any suggestions are appreciated
>
>
>
> Sent from Mail for Windows
>
>
>


Re: How to contribute code for the first time

2024-04-16 Thread Ayush Saxena
Hi Jim,
Directly create a PR against the trunk branch in the Hadoop repo; if it is accepted, 
then add the link to the PR and resubmit your request for a Jira account, and it will 
get approved.

-Ayush

> On 17 Apr 2024, at 10:02 AM, Jim Chen  wrote:
> 
> 
> Hi all, I want to optimize a script in dev-support in a hadoop project, how 
> do I submit a PR?
> 
> I tried to apply for a jira account so that I could create an issue in jira 
> first, but the application was rejected. I was prompted to send a developer 
> email first.
> 
> Can anyone help me with this? Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Recommended way of using hadoop-minicluster für unit testing?

2024-04-15 Thread Richard Zowalla
Hi Ayush,

thanks for your time investigating!

I followed your recommendation and it seems to work (also for some of
our consumer projects), so thanks a lot for your time!

Gruß
Richard


Am Samstag, dem 13.04.2024 um 03:35 +0530 schrieb Ayush Saxena:
> Hi Richard,
> Thanx for sharing the steps to reproduce the issue. I cloned the
> Apache Storm repo and was able to repro the issue. The build was
> indeed failing due to missing classes.
> 
> Spent some time to debug the issue, might not be very right (no
> experience with Storm), There are Two ways to get this going
> 
> First Approach: If we want to use the shaded classes
> 
> 1. I think the artifact to be used for minicluster should be `hadoop-
> client-minicluster`, even spark uses the same [1], the one which you
> are using is `hadoop-minicluster`, which in its own is empty
> ```
> ayushsaxena@ayushsaxena ~ %  jar tf
> /Users/ayushsaxena/.m2/repository/org/apache/hadoop/hadoop-
> minicluster/3.3.6/hadoop-minicluster-3.3.6.jar  | grep .class
> ayushsaxena@ayushsaxena ~ %
> ```
> 
> It just defines artifacts which are to be used by `hadoop-client-
> minicluster` and this jar has that shading and stuff, using `hadoop-
> minicluster` is like adding the hadoop dependencies into the pom
> transitively, without any shading or so, which tends to conflict with
> `hadoop-client-api` and `hadoop-client-runtime` jars, which uses the
> shaded classes.
> 
> 2. Once you change `hadoop-minicluster` to `hadoop-client-
> minicluster`, still the tests won't pass, the reason being the
> `storm-autocreds` dependency which pulls in the hadoop jars via
> `hbase-client` & `hive-exec`, So, we need to exclude them as well
> 
> 3. I reverted your classpath hack, changed the jar, & excluded the
> dependencies from storm-autocreds & ran the storm-hdfs tests & all
> the tests passed, which were failing initially without any code
> change
> ```
> [INFO] Results:
> [INFO]
> [INFO] Tests run: 57, Failures: 0, Errors: 0, Skipped: 0
> [INFO]
> [INFO] --
> --
> [INFO] BUILD SUCCESS
> [INFO] --
> --
> ```
> 
> 4. Putting the code diff here might make this mail unreadable, so I
> am sharing the link to the commit which fixed Storm for me here [2],
> let me know if it has any access issues, I will put the diff on the
> mail itself in text form.
> 
> Second Approach: If we don't want to use the shaded classes
> 
> 1. The `hadoop-client-api` & the` hadoop-client-runtime` jars uses
> shading which tends to conflict with your non shaded `hadoop-
> minicluster`, Rather than using these jars use the `hadoop-client`
> jar
> 
> 2. I removed your hack & changed those two jars with `hadoop-client`
> jar & the storm-hdfs tests passes
> 
> 3. I am sharing the link to the commit in my fork, it is here at [3],
> one advantage is, you don't have to change your existing jar nor you
> would need to add those exclusions in the `storm-cred` dependency.
> 
> ++ Adding common-dev, in case any fellow developers with more
> experience around using the hadoop-client jars can help, if things
> still don't work or Storm needs something more. The downstream
> projects which I have experience with don't use these jars (which
> they should ideally) :-) 
> 
> -Ayush
> 
> 
> [1] https://github.com/apache/spark/blob/master/pom.xml#L1382
> [2]
> https://github.com/ayushtkn/storm/commit/e0cd8e21201e01d6d0e1f3ac1bc5ada8354436e6
> [3] 
> https://github.com/apache/storm/commit/fb5acdedd617de65e494c768b6ae4b
> ab9b3f7ac8
> 
> 
> On Fri, 12 Apr 2024 at 10:41, Richard Zowalla 
> wrote:
> > Hi,
> > 
> > thanks for the fast reply. The PR is here [1].
> > 
> > It works, if I exclude the client-api and client-api-runtime from
> > being scanned in surefire, which is a hacky workaround for the
> > actual issue.
> > 
> > The hadoop-commons jar is a transient dependency of the
> > minicluster, which is used for testing.
> > 
> > Debugging the situation shows, that HttpServer2  is in the same
> > package in hadoop-commons as well as in the client-api but with
> > differences in methods / classes used, so depending on the
> > classpath order the wrong class is loaded.
> > 
> > Stacktraces are in the first GH Action run.here: [1]. 
> > 
> > A reproducer would be to check out Storm, go to storm-hdfs and
> > remove the exclusion in [2] and run the tests in that module, which
> > will fail due to a missing jetty server class (as the HTTPServer2
> > class is loaded from client-api instead of minicluster).
> > 
> > Gruß & Thx
> > Richard 
> > 
> > [1] https://github.com/apache/storm/pull/3637
> > [2]
> > https://github.com/apache/storm/blob/e44f72767370d10a682446f8f36b75242040f675/external/storm-hdfs/pom.xml#L120
> > 
> > On 2024/04/11 21:29:13 Ayush Saxena wrote:
> > > Hi Richard,
> > > I am not able to decode the issue properly here, It would have
> > > been
> > > better if you shared the PR or the failure 

Re: Recommended way of using hadoop-minicluster für unit testing?

2024-04-12 Thread Ayush Saxena
Hi Richard,
Thanx for sharing the steps to reproduce the issue. I cloned the Apache
Storm repo and was able to repro the issue. The build was indeed failing
due to missing classes.

Spent some time to debug the issue, might not be very right (no
experience with Storm), There are Two ways to get this going

*First Approach: If we want to use the shaded classes*

1. I think the artifact to be used for minicluster should be
`hadoop-client-minicluster`, even spark uses the same [1], the one which
you are using is `hadoop-minicluster`, which in its own is empty
```
ayushsaxena@ayushsaxena ~ %  jar tf
/Users/ayushsaxena/.m2/repository/org/apache/hadoop/hadoop-minicluster/3.3.6/hadoop-minicluster-3.3.6.jar
 | grep .class
ayushsaxena@ayushsaxena ~ %
```

It just defines artifacts which are to be used by
`hadoop-client-minicluster` and this jar has that shading and stuff, using
`hadoop-minicluster` is like adding the hadoop dependencies into the pom
transitively, without any shading or so, which tends to conflict with
`hadoop-client-api` and `hadoop-client-runtime` jars, which uses the shaded
classes.

2. Once you change `hadoop-minicluster` to `hadoop-client-minicluster`,
still the tests won't pass, the reason being the `storm-autocreds`
dependency which pulls in the hadoop jars via `hbase-client` & `hive-exec`,
So, we need to exclude them as well

3. I reverted your classpath hack, changed the jar, & excluded the
dependencies from storm-autocreds & ran the storm-hdfs tests & all the
tests passed, which were failing initially without any code change
```
[INFO] Results:
[INFO]
[INFO] Tests run: 57, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

```

4. Putting the code diff here might make this mail unreadable, so I am
sharing the link to the commit which fixed Storm for me here [2], let me
know if it has any access issues, I will put the diff on the mail itself in
text form.

*Second Approach: If we don't want to use the shaded classes*

1. The `hadoop-client-api` & the` hadoop-client-runtime` jars uses shading
which tends to conflict with your non shaded `hadoop-minicluster`, Rather
than using these jars use the `hadoop-client` jar

2. I removed your hack & changed those two jars with `hadoop-client` jar &
the storm-hdfs tests passes

3. I am sharing the link to the commit in my fork, it is here at [3], one
advantage is, you don't have to change your existing jar nor you would need
to add those exclusions in the `storm-cred` dependency.

++ Adding common-dev, in case any fellow developers with more
experience around using the hadoop-client jars can help, if things still
don't work or Storm needs something more. The downstream projects which I
have experience with don't use these jars (which they should ideally) :-)

-Ayush


[1] https://github.com/apache/spark/blob/master/pom.xml#L1382
[2]
https://github.com/ayushtkn/storm/commit/e0cd8e21201e01d6d0e1f3ac1bc5ada8354436e6
[3]
https://github.com/apache/storm/commit/fb5acdedd617de65e494c768b6ae4bab9b3f7ac8


On Fri, 12 Apr 2024 at 10:41, Richard Zowalla  wrote:

> Hi,
>
> thanks for the fast reply. The PR is here [1].
>
> It works, if I exclude the client-api and client-api-runtime from being
> scanned in surefire, which is a hacky workaround for the actual issue.
>
> The hadoop-commons jar is a transient dependency of the minicluster, which
> is used for testing.
>
> Debugging the situation shows, that HttpServer2  is in the same package in
> hadoop-commons as well as in the client-api but with differences in methods
> / classes used, so depending on the classpath order the wrong class is
> loaded.
>
> Stacktraces are in the first GH Action run.here: [1].
>
> A reproducer would be to check out Storm, go to storm-hdfs and remove the
> exclusion in [2] and run the tests in that module, which will fail due to a
> missing jetty server class (as the HTTPServer2 class is loaded from
> client-api instead of minicluster).
>
> Gruß & Thx
> Richard
>
> [1] https://github.com/apache/storm/pull/3637
> [2]
> https://github.com/apache/storm/blob/e44f72767370d10a682446f8f36b75242040f675/external/storm-hdfs/pom.xml#L120
>
> On 2024/04/11 21:29:13 Ayush Saxena wrote:
> > Hi Richard,
> > I am not able to decode the issue properly here, It would have been
> > better if you shared the PR or the failure trace as well.
> > QQ: Why are you having hadoop-common as an explicit dependency? Those
> > hadoop-common stuff should be there in hadoop-client-api
> > I quickly checked once on the 3.4.0 release and I think it does have
> them.
> >
> > ```
> > ayushsaxena@ayushsaxena client % jar tf hadoop-client-api-3.4.0.jar |
> > grep org/apache/hadoop/fs/FileSystem.class
> > org/apache/hadoop/fs/FileSystem.class
> > ``
> >
> > You didn't mention which shaded classes are being reported as
> > missing... I think spark uses 

Re: Recommended way of using hadoop-minicluster für unit testing?

2024-04-11 Thread Richard Zowalla
Hi,

thanks for the fast reply. The PR is here [1].

It works if I exclude the client-api and client-api-runtime from being scanned 
in surefire, which is a hacky workaround for the actual issue.

The hadoop-common jar is a transitive dependency of the minicluster, which is 
used for testing.

Debugging the situation shows that HttpServer2 is in the same package in 
hadoop-common as well as in the client-api, but with differences in the methods / 
classes used, so depending on the classpath order the wrong class is loaded.

Stacktraces are in the first GH Action run here: [1].

A reproducer would be to check out Storm, go to storm-hdfs, remove the 
exclusion in [2] and run the tests in that module, which will fail due to a 
missing jetty server class (as the HttpServer2 class is loaded from client-api 
instead of minicluster).

Gruß & Thx
Richard 

[1] https://github.com/apache/storm/pull/3637
[2] 
https://github.com/apache/storm/blob/e44f72767370d10a682446f8f36b75242040f675/external/storm-hdfs/pom.xml#L120

On 2024/04/11 21:29:13 Ayush Saxena wrote:
> Hi Richard,
> I am not able to decode the issue properly here, It would have been
> better if you shared the PR or the failure trace as well.
> QQ: Why are you having hadoop-common as an explicit dependency? Those
> hadoop-common stuff should be there in hadoop-client-api
> I quickly checked once on the 3.4.0 release and I think it does have them.
> 
> ```
> ayushsaxena@ayushsaxena client % jar tf hadoop-client-api-3.4.0.jar |
> grep org/apache/hadoop/fs/FileSystem.class
> org/apache/hadoop/fs/FileSystem.class
> ``
> 
> You didn't mention which shaded classes are being reported as
> missing... I think spark uses these client jars, you can use that as
> an example, can grab pointers from here: [1] & [2]
> 
> -Ayush
> 
> [1] https://github.com/apache/spark/blob/master/pom.xml#L1361
> [2] https://issues.apache.org/jira/browse/SPARK-33212
> 
> On Thu, 11 Apr 2024 at 17:09, Richard Zowalla  wrote:
> >
> > Hi all,
> >
> > we are using "hadoop-minicluster" in Apache Storm to test our hdfs
> > integration.
> >
> > Recently, we were cleaning up our dependencies and I noticed, that if I
> > am adding
> >
> > <dependency>
> >   <groupId>org.apache.hadoop</groupId>
> >   <artifactId>hadoop-client-api</artifactId>
> >   <version>${hadoop.version}</version>
> > </dependency>
> > <dependency>
> >   <groupId>org.apache.hadoop</groupId>
> >   <artifactId>hadoop-client-runtime</artifactId>
> >   <version>${hadoop.version}</version>
> > </dependency>
> >
> > and have
> > <dependency>
> >   <groupId>org.apache.hadoop</groupId>
> >   <artifactId>hadoop-minicluster</artifactId>
> >   <version>${hadoop.version}</version>
> >   <scope>test</scope>
> > </dependency>
> >
> > as a test dependency to setup a mini-cluster to test our storm-hdfs
> > integration.
> >
> > This fails weirdly because of missing (shaded) classes as well as a
> > class ambiguity with HttpServer2.
> >
> > It is present as a class inside of the "hadoop-client-api" and within
> > "hadoop-common".
> >
> > Is this setup wrong or should we try something different here?
> >
> > Gruß
> > Richard
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: [ANNOUNCE] Apache Hadoop 3.4.0 release

2024-04-11 Thread Sammi Chen
Xiaoqiao He and Shilun Fan

Awesome!  Thanks for leading the effort to release the Hadoop 3.4.0 !

Bests,
Sammi

On Tue, 19 Mar 2024 at 21:12, slfan1989  wrote:

> On behalf of the Apache Hadoop Project Management Committee, We are
> pleased to announce the release of Apache Hadoop 3.4.0.
>
> This is a release of Apache Hadoop 3.4 line.
>
> Key changes include
>
> * S3A: Upgrade AWS SDK to V2
> * HDFS DataNode Split one FsDatasetImpl lock to volume grain locks
> * YARN Federation improvements
> * YARN Capacity Scheduler improvements
> * HDFS RBF: Code Enhancements, New Features, and Bug Fixes
> * HDFS EC: Code Enhancements and Bug Fixes
> * Transitive CVE fixes
>
> This is the first release of Apache Hadoop 3.4 line. It contains 2888 bug
> fixes, improvements and enhancements since 3.3.
>
> Users are encouraged to read the [overview of major changes][1].
> For details of please check [release notes][2] and [changelog][3].
>
> [1]: http://hadoop.apache.org/docs/r3.4.0/index.html
> [2]:
>
> http://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/release/3.4.0/RELEASENOTES.3.4.0.html
> [3]:
>
> http://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/release/3.4.0/CHANGELOG.3.4.0.html
>
> Many thanks to everyone who helped in this release by supplying patches,
> reviewing them, helping get this release building and testing and
> reviewing the final artifacts.
>
> Best Regards,
> Xiaoqiao He And Shilun Fan.
>


Re: Recommended way of using hadoop-minicluster für unit testing?

2024-04-11 Thread Ayush Saxena
Hi Richard,
I am not able to decode the issue properly here, It would have been
better if you shared the PR or the failure trace as well.
QQ: Why are you having hadoop-common as an explicit dependency? Those
hadoop-common stuff should be there in hadoop-client-api
I quickly checked once on the 3.4.0 release and I think it does have them.

```
ayushsaxena@ayushsaxena client % jar tf hadoop-client-api-3.4.0.jar |
grep org/apache/hadoop/fs/FileSystem.class
org/apache/hadoop/fs/FileSystem.class
``

You didn't mention which shaded classes are being reported as
missing... I think spark uses these client jars, you can use that as
an example, can grab pointers from here: [1] & [2]

-Ayush

[1] https://github.com/apache/spark/blob/master/pom.xml#L1361
[2] https://issues.apache.org/jira/browse/SPARK-33212

On Thu, 11 Apr 2024 at 17:09, Richard Zowalla  wrote:
>
> Hi all,
>
> we are using "hadoop-minicluster" in Apache Storm to test our hdfs
> integration.
>
> Recently, we were cleaning up our dependencies and I noticed, that if I
> am adding
>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client-api</artifactId>
>   <version>${hadoop.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client-runtime</artifactId>
>   <version>${hadoop.version}</version>
> </dependency>
>
> and have
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minicluster</artifactId>
>   <version>${hadoop.version}</version>
>   <scope>test</scope>
> </dependency>
>
> as a test dependency to setup a mini-cluster to test our storm-hdfs
> integration.
>
> This fails weirdly because of missing (shaded) classes as well as a
> class ambiguity with HttpServer2.
>
> It is present as a class inside of the "hadoop-client-api" and within
> "hadoop-common".
>
> Is this setup wrong or should we try something different here?
>
> Gruß
> Richard

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: ContainerId starts with 1 ?

2024-03-20 Thread 李响
Dear Hadoop/Yarn community,

I still beg your help for the question above.

Additionally, I might have other questions.
The target is to get the driver container id of a Spark app, from Yarn
Aggregation Log. I would like to call
LogAggregationIndexedFileController#readAggregatedLogsMeta()

then get the first ContainerLogMeta from the returned list, and then call
getContainerId() on it.
The questions are:

   1. Is the first ContainerLogMeta always the driver container?
   2. If the driver failed to come up the first time somehow, but
   succeeded on its second try, the container id will be incremented by 1 if I
   understand it correctly. In this case, will the first ContainerLogMeta
   returned by that function above be the first, failed container, or the
   second, successful container? Or does the container id stay unchanged after a
   failure?

Thanks!


On Fri, Feb 23, 2024 at 4:21 PM 李响  wrote:

> Dear Hadoop/Yarn community,
>
> In Yarn, a container is represented as
> container_e*epoch*_*clusterTimestamp*_*appId*_*attemptId*_*containerId*
>
> Regarding the last section, "containerId", as the sequential number of
> containers, I notice it does not start with 0, but 1.
>
> My question is:
> 1. Is that observation correct?
> 2. Sorry, I could not find the code to support that. I read ContainerId.java
> and ContainerIdPBImpl.java but did not find the answer. Could you please
> show me the code path that supports it starting with 1?
> 3. It seems counter-intuitive for me, as a programmer ^_^, who thinks the
> index should start with 0, rather than 1. If it is designed to start with
> 1, any background / thought / discussion to share?
>
> Thanks !!!
>
>
>
> --
>
>李响 Xiang Li
>
>
>

-- 

   李响 Xiang Li

手机 cellphone :+86-136-8113-8972
邮件 e-mail  :wate...@gmail.com


Re: Why does the constructor of org.apache.hadoop.fs.FileSystem.Cache.Key need a conf parameter?

2024-03-19 Thread Shuyan Zhang
hi 黄晟,
This is a leftover from legacy code. In the past, obtaining the UGI required the conf;
later the way the UGI is obtained was changed, but the parameter was never removed. See
https://issues.apache.org/jira/browse/HADOOP-6299

黄晟  wrote on Monday, 18 March 2024 at 19:24:

>
>
> Why does the constructor of org.apache.hadoop.fs.FileSystem.Cache.Key need a conf
> parameter? The conf that is passed in is not used anywhere.
>
>
>
> 黄晟
> huangshen...@163.com
>
>


Re: NM status during RM failover

2024-02-25 Thread Hariharan
> We observe a drop of NumActiveNodes metric when fails over on a new RM.
Is that normal?

Yes, this does not seem unusual - the NMs will try to connect to the old RM
for some time before they fail over to the new RM. If this time exceeds the
heartbeat interval, the NMs may show up as disconnected until they reach
out to the new RM.

~ Hariharan


On Sun, Feb 25, 2024 at 4:12 PM Dong Ye  wrote:

> Hi, All:
>
>   I have a question, in the high availability resource manager
> scenario, how does the states of NodeManagers change if a new leader RM is
> elected? We observe a drop of NumActiveNodes metric when fails over on a
> new RM. Is that normal? Any documentation explains how the NM states will
> change? RM version is 2.8.5.
>
> Thanks.
> Have a nice day!
>


Re: NM status during RM failover

2024-02-24 Thread Dong Ye
Hi, All:

How can we reduce RM failover? It introduces disturbances to the current
workload. The failover is mainly because of JVM pauses (around 6 seconds) and
high CPU usage.

Thanks.
Have a nice day!

On Sat, Feb 24, 2024 at 8:06 PM Dong Ye  wrote:

> Hi, All:
>
>   I have a question, in the high availability resource manager
> scenario, how does the states of NodeManagers change if a new leader RM is
> elected? We observe a drop of NumActiveNodes metric when fails over on a
> new RM. Is that normal? Any documentation explains how the NM states will
> change? RM version is 2.8.5.
>
> Thanks.
> Have a nice day!
>


Re: subscribe

2024-02-20 Thread Battula, Brahma Reddy
Please drop mail to 
"user-unsubscr...@hadoop.apache.org" 
as mentioned in the footer mail.

From: Shuyan Zhang 
Date: Thursday, February 1, 2024 at 09:00
To: user@hadoop.apache.org 
Subject: subscribe
subscribe


Re: unsubscribe

2024-02-10 Thread Brahma Reddy Battula
Please drop mail to "user-unsubscr...@hadoop.apache.org" as mentioned in
the footer mail.

On Fri, Feb 9, 2024 at 2:32 PM Henning Blohm 
wrote:

> unsubscribe
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: observer namenode and Router-based Federation

2024-01-26 Thread Ayush Saxena
RBF does support observer reads; it was added as part of
https://issues.apache.org/jira/browse/HDFS-16767

You need to go through it; there are different configs and stuff you might
need to set up to get RBF & Observer NN to work together.

-Ayush

On Fri, 26 Jan 2024 at 13:44, 尉雁磊  wrote:

> Can't the observer namenode and Router-based Federation be used together? While
> using RBF, I also
> configure org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider, but
> it complains.
>


Re: Data Remanence in HDFS

2024-01-13 Thread Jim Halfpenny
Hi Daniel,
In short, you can’t create an HDFS block with unallocated data. You can create a 
zero-length block, which will result in a zero-byte file being created on the 
data node, but you can’t create a sparse file in HDFS. While HDFS has a block 
size, e.g. 128MB, if you create a small file then the file on the data node will 
be sized according to the data and not the block length; 
creating a 32kB HDFS file will in turn create a single 32kB file on the 
datanodes. The way HDFS is built is not like a traditional file system with 
fixed-size blocks/extents in fixed disk locations.
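
To make that concrete, here is a tiny illustrative sketch (the path, class
name and size are arbitrary, not anything from your setup) of what that
scenario looks like through the FileSystem API:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemanenceCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/remanence-test");

    // "Creating a 32k file" means writing 32k of your own bytes; there is
    // no API to reserve 32k of unwritten block space.
    try (FSDataOutputStream out = fs.create(p)) {
      out.write(new byte[32 * 1024]);
    }

    // The file length is exactly what was written, so a reader only ever
    // sees bytes the writer produced, never leftover disk contents.
    System.out.println("length = " + fs.getFileStatus(p).getLen()); // 32768
  }
}
```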

Kind regards,
Jim

> On 12 Jan 2024, at 18:35, Daniel Howard  wrote:
> 
> Thank Jim,
> 
> The scenario I have in mind is something like:
> 1) Ask HDFS to create a file that is 32k in length.
> 2) Attempt to read the contents of the file.
> 
> Can I even attempt to read the contents of a file that has not yet been 
> written? If so, what data would get sent?
> 
> For example, I asked a version of this question of ganeti with regard to 
> creating VMs. You can, by default, read the previous contents of the disk in 
> your new VM, but they have an option to wipe newly allocated VM disks for 
> added security.[1]
> 
> [1]: https://groups.google.com/g/ganeti/c/-c_KoLd6mnI
> 
> Thanks,
> -danny
> 
> On Fri, Jan 12, 2024 at 8:03 AM Jim Halfpenny  
> wrote:
>> Hi Danny,
>> This does depend on a number of circumstances, mostly based on file 
>> permissions. If for example a file is deleted without the -skipTrash option 
>> then it will be moved to the .Trash directory. From here it could be read, 
>> but the original file permissions will be preserved. Therefore if a user did 
>> not have read access before it was deleted then it won’t be able to read it 
>> from .Trash and if they did have read access then this ought to remain the 
>> case.
>> 
>> If a file is deleted then the blocks are marked for deletion by the namenode 
>> and won’t be available through HDFS, but there will be some lag between the 
>> HDFS delete operation and the block files being removed from the datanodes. 
>> It’s possible that someone could read the block from the datanode file 
>> system directly, but not through the HDFS file system. The blocks will exist 
>> on disk until the datanode itself deletes them.
>> 
>> The way HDFS works you won’t get previous data when you create a new block 
>> since unallocated spaces doesn’t exist in the same way as it does on a 
>> regular file system. Each HDFS block maps to a file on the datanodes and 
>> block files can be an arbitrary size, unlike the fixed block/extent size of 
>> a regular file system. You don’t “reuse" HDFS blocks, a block in HDFS is 
>> just a file on the data node. You could potentially recover data from 
>> unallocated space on the datanode disk the same way you would for any other 
>> deleted file.
>> 
>> If you want to remove the chance of data recovery on HDFS then encrypting 
>> the blocks using HDFS transparent encryption is the way to do it. They 
>> encryption keys reside in the namenode metadata so once they are deleted the 
>> data in that file is effectively lost. Beware of snapshots though since a 
>> deleted file in the live HDFS view may exist in a previous snapshot.
>> 
>> Kind regards,
>> Jim
>> 
>> 
>>> On 11 Jan 2024, at 21:50, Daniel Howard >> > wrote:
>>> 
>>> Is it possible for a user with HDFS access to read the contents of a file 
>>> previously deleted by a different user?
>>> 
>>> I know a user can employ KMS to encrypt files with a personal key, making 
>>> this sort of data leakage effectively impossible. But, without KMS, is it 
>>> possible to allocate a file with uninitialized data, and then read the data 
>>> that exists on the underlying disk?
>>> 
>>> Thanks,
>>> -danny
>>> 
>>> --
>>> http://dannyman.toldme.com 
> 
> 
> --
> http://dannyman.toldme.com 


Re: Data Remanence in HDFS

2024-01-12 Thread Daniel Howard
Thanks, Jim.

The scenario I have in mind is something like:
1) Ask HDFS to create a file that is 32k in length.
2) Attempt to read the contents of the file.

Can I even attempt to read the contents of a file that has not yet been
written? If so, what data would get sent?

For example, I asked a version of this question of ganeti with regard to
creating VMs. You can, by default, read the previous contents of the disk
in your new VM, but they have an option to wipe newly allocated VM disks
for added security.[1]

[1]: https://groups.google.com/g/ganeti/c/-c_KoLd6mnI

Thanks,
-danny

On Fri, Jan 12, 2024 at 8:03 AM Jim Halfpenny 
wrote:

> Hi Danny,
> This does depend on a number of circumstances, mostly based on file
> permissions. If for example a file is deleted without the -skipTrash option
> then it will be moved to the .Trash directory. From here it could be read,
> but the original file permissions will be preserved. Therefore if a user
> did not have read access before it was deleted then it won’t be able to
> read it from .Trash and if they did have read access then this ought to
> remain the case.
>
> If a file is deleted then the blocks are marked for deletion by the
> namenode and won’t be available through HDFS, but there will be some lag
> between the HDFS delete operation and the block files being removed from
> the datanodes. It’s possible that someone could read the block from the
> datanode file system directly, but not through the HDFS file system. The
> blocks will exist on disk until the datanode itself deletes them.
>
> The way HDFS works you won’t get previous data when you create a new block
> since unallocated spaces doesn’t exist in the same way as it does on a
> regular file system. Each HDFS block maps to a file on the datanodes and
> block files can be an arbitrary size, unlike the fixed block/extent size of
> a regular file system. You don’t “reuse" HDFS blocks, a block in HDFS is
> just a file on the data node. You could potentially recover data from
> unallocated space on the datanode disk the same way you would for any other
> deleted file.
>
> If you want to remove the chance of data recovery on HDFS then encrypting
> the blocks using HDFS transparent encryption is the way to do it. They
> encryption keys reside in the namenode metadata so once they are deleted
> the data in that file is effectively lost. Beware of snapshots though since
> a deleted file in the live HDFS view may exist in a previous snapshot.
>
> Kind regards,
> Jim
>
>
> On 11 Jan 2024, at 21:50, Daniel Howard  wrote:
>
> Is it possible for a user with HDFS access to read the contents of a file
> previously deleted by a different user?
>
> I know a user can employ KMS to encrypt files with a personal key, making
> this sort of data leakage effectively impossible. But, without KMS, is it
> possible to allocate a file with uninitialized data, and then read the data
> that exists on the underlying disk?
>
> Thanks,
> -danny
>
> --
> http://dannyman.toldme.com
>
>
>

-- 
http://dannyman.toldme.com


Re: Data Remanence in HDFS

2024-01-12 Thread Jim Halfpenny
Hi Danny,
This does depend on a number of circumstances, mostly based on file 
permissions. If for example a file is deleted without the -skipTrash option 
then it will be moved to the .Trash directory. From here it could be read, but 
the original file permissions will be preserved. Therefore if a user did not 
have read access before it was deleted then it won’t be able to read it from 
.Trash and if they did have read access then this ought to remain the case.

If a file is deleted then the blocks are marked for deletion by the namenode 
and won’t be available through HDFS, but there will be some lag between the 
HDFS delete operation and the block files being removed from the datanodes. 
It’s possible that someone could read the block from the datanode file system 
directly, but not through the HDFS file system. The blocks will exist on disk 
until the datanode itself deletes them.

The way HDFS works, you won’t get previous data when you create a new block, 
since unallocated space doesn’t exist in the same way as it does on a regular 
file system. Each HDFS block maps to a file on the datanodes and block files 
can be an arbitrary size, unlike the fixed block/extent size of a regular file 
system. You don’t “reuse” HDFS blocks; a block in HDFS is just a file on the 
data node. You could potentially recover data from unallocated space on the 
datanode disk the same way you would for any other deleted file.

If you want to remove the chance of data recovery on HDFS then encrypting the 
blocks using HDFS transparent encryption is the way to do it. The encryption 
keys reside in the namenode metadata, so once they are deleted the data in that 
file is effectively lost. Beware of snapshots though, since a deleted file in 
the live HDFS view may exist in a previous snapshot.

Kind regards,
Jim


> On 11 Jan 2024, at 21:50, Daniel Howard  wrote:
> 
> Is it possible for a user with HDFS access to read the contents of a file 
> previously deleted by a different user?
> 
> I know a user can employ KMS to encrypt files with a personal key, making 
> this sort of data leakage effectively impossible. But, without KMS, is it 
> possible to allocate a file with uninitialized data, and then read the data 
> that exists on the underlying disk?
> 
> Thanks,
> -danny
> 
> --
> http://dannyman.toldme.com 


Re: I don't want to set quotas through the router

2024-01-12 Thread Ayush Saxena
Hi,
Your question is not very clear. So, I am answering whatever I understand.

1. You don't want Router to manage Quotas?
Ans: Then you can use this config: dfs.federation.router.quota.enable
and set it to false

2. You have default NS as Router but you want to set Quota individually to NS?
Ans. Then use generic options in DFSAdmin

3. You want to set a Quota on /path & it should set the quota on NS1
/somePath & NS2 /somePath at the same time?
Ans. You should explore mount entries with multiple
destinations (MultipleDestinationMountTableResolver); RBF supports
that, so if a path resolves to multiple destinations in different NSs,
it would set the same quota on all the target destinations. If it is a
mount entry you need to go via DfsRouterAdmin, else normal DfsAdmin
should do...

-Ayush

On Fri, 12 Jan 2024 at 12:16, 尉雁磊  wrote:
>
> Hello everyone, our cluster recently deployed router federation. Because the 
> upper-layer custom component depends on the way to set quotas and get quotas 
> without the router, it does not want to set quotas through the router.
> After my test, hdfs dfsadmin -setSpaceQuota /path cannot be executed through 
> the router. I want to execute hdfs dfsadmin -setSpaceQuota /path through the 
> router at the same time in two clusters, which can achieve the desired 
> effect. This is the least change for us. Do you support the way I said?
>

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: JSON in Kafka -> ORC in HDFS - Thoughts on different tools?

2023-12-10 Thread Aaron Grubb
Hi Michal,

Thanks for your detailed reply, it was very helpful. The motivation for 
replacing Kafka Connect is mostly related to having to run backfills from 
time-to-time - we store all the raw data from Kafka Connect, extract the fields 
we're currently using, then drop the extracted data and keep the raw JSON, and 
that's fine in the case that backfilling is never needed, but when it becomes 
necessary, processing 90+ days of JSON at 12+ billion rows per day using Hive 
LLAP is excruciatingly slow. Therefore we wanted to have the data in ORC format 
as early as possible instead of adding an intermediate job to transform the 
JSON to ORC in the current pipeline. Changing this part of the pipeline over 
should also result in an overall reduction of resources used - nothing crazy 
for this first topic that we're changing over but if it goes well, we have a 
few Kafka Connect clusters that we would be interested in converting, and that 
would also free up a ton of CPU time in Hive.

Thanks,
Aaron


On Thu, 2023-12-07 at 13:32 +0100, Michal Klempa wrote:
Hi Aaron,
I do not know Gobblin, so no advice there.

You write that currently Kafka Connect dumps to files, as you probably already 
know, Kafka Connect can't do the aggregation.
To my knowledge, NiFi is also ETL with local Transformation, there is no state 
maintenance on a global scale. You can write processors to do stateful 
transformation, but for this task, would be tedious in my opinion. I would put 
NiFi out of the game.

Now to the requirements, I assume, you have:
- large volume (~1k events per second)
- small messages (<1k JSONs)
- need to have data in near real-time (seconds at most) after a window 
(aggregation) is triggered for data stake holders to query


Then, it makes sense to think of doing the aggregation on-the-fly, in a 
real-time framework, i.e. real-time ETL non-stop running job.
If your use-case is not satisfying the criteria, e.g low volume, or no 
real-time need (1 to 5 minutes lateness is fine), I would strongly encourage to 
avoid using real-time stateful streaming, as it is complicated to set up, scale, 
maintain, run and, mostly, code bug-free. It is a non-stop running application; 
any memory leak -> you have restarts on OOM every couple of hours. It is hard.

You may have:
- high volume + no real-time (5 minutes lateness is fine)

In that case, running any pyspark every 5 minutes with ad-hoc AWS spot 
instances cluster with batch job is just fine.

You may have:
- low volume + no real-time (5 minutes lateness is fine)
In that case, just run plain 1 instance python script doing the job, 1k to 100k 
events you can just consume from Kafka directly, pack into ORC, and dump on 
S3/HDFS on a single CPU. Use any cron to run it every 5 minutes. Done.


In case your use case is:
- large volume + real-time
for this, Flink and Spark Structured Streaming are both a good fit, but there is
also Kafka Streams, which I would suggest adding as a competitor. There is also
Beam (Google Dataflow) if you are on GCP already. All of them do
the same job.


Flink vs. Spark Structured Streaming vs. Kafka Streams:
Deployment: Kafka Streams is just one fat jar; with Flink+Spark you need to
maintain clusters, and although both frameworks are working on being k8s native,
they are not easy to set up either.
Coding: everything is JVM; Spark has Python, and Flink added Python too. There seem
to be some Python attempts around Kafka Streams as well, but I have no experience there.
Fault tolerance: I have real-world experience with Flink and Spark Structured
Streaming; both can restart from checkpoints. Flink also has savepoints, which
are a good feature for starting a new job after modifications (but also not easy
to set up).
Automatically scalable: I think none of the open-source options has this feature
out-of-the-box (correct me if wrong). You may want to pay for the Ververica platform
(the Flink authors' offering) or Databricks (the Spark authors' offering), and there
must be something from Confluent or its competitors, too. Google of course has its
Dataflow (Beam API). All auto-scaling is a pain, however: each rescale means a
reshuffle of the data.
Exactly once: To my knowledge, only Flink nowadays offers end-to-end exactly
once, and I am not sure whether that can be achieved with ORC on HDFS as the
destination. Maybe an idempotent ORC writer can be used, or some other form of
"transaction" on the destination must exist.

All in all, if I were solving your problem, I would first attack the
requirements list and see whether it can't be done more simply. If not, Flink
would be my choice, as I have had good experience with it and you can really
hack anything inside. But prepare yourself: that requirement list is hard, and
even if you get the pipeline up in 2 weeks, you will surely revisit the decision
after some incidents in the next 6 months.
If you loosen the requirements a bit, it becomes easier and easier. Your current
solution sounds very reasonable to me. You picked something that works out of
the box (Kafka Connect) and did ELT, where something that can aggregate out of
the box (Hive) does the aggregation. Why exactly do you need to replace it?
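
Aaron mentions he already has a working PySpark Structured Streaming job for
this, so purely as a reference for the archive, here is a rough sketch of the
shape such a job can take. The broker, topic, JSON schema and HDFS paths are
made-up placeholders (nothing from this thread), and the job needs the
spark-sql-kafka package supplied at submit time:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Submit with, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 job.py
spark = SparkSession.builder.appName("kafka-json-to-orc").getOrCreate()

# Placeholder schema for the JSON payload.
schema = StructType([
    StructField("event_time", LongType()),   # epoch millis
    StructField("user_id", StringType()),
    StructField("country", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*")
          .withColumn("ts", (F.col("event_time") / 1000).cast("timestamp")))

# Windowed aggregation with a watermark so the file sink can run in append mode.
agg = (parsed
       .withWatermark("ts", "10 minutes")
       .groupBy(F.window("ts", "5 minutes"), "country")
       .count())

query = (agg.writeStream
         .format("orc")
         .option("path", "hdfs://namenode:8020/warehouse/events_orc")
         .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/events_orc")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()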

Re: JSON in Kafka -> ORC in HDFS - Thoughts on different tools?

2023-12-07 Thread Michal Klempa
Hi Aaron,
I do not know Gobblin, so no advice there.

You write that currently Kafka Connect dumps to files, as you probably
already know, Kafka Connect can't do the aggregation.
To my knowledge, NiFi is also ETL with local transformation; there is no
state maintenance on a global scale. You can write processors to do
stateful transformation, but for this task that would be tedious in my opinion.
I would put NiFi out of the game.

Now to the requirements, I assume you have:
- large volume (~1k events per second)
- small messages (<1k JSONs)
- a need to have data in near real-time (seconds at most) after a window
(aggregation) is triggered, for data stakeholders to query

Then it makes sense to think of doing the aggregation on-the-fly, in a
real-time framework, i.e. a non-stop running real-time ETL job.
If your use-case does not satisfy these criteria, e.g. low volume, or no
real-time need (1 to 5 minutes of lateness is fine), I would strongly
encourage you to avoid real-time stateful streaming, as it is complicated
to set up, scale, maintain, run and, mostly, to code bug-free. It is a non-stop
running application; any memory leak means you are restarting on OOM every
couple of hours. It is hard.

You may have:
- high volume + no real-time (5 minutes lateness is fine)
In that case, running a PySpark batch job every 5 minutes on an ad-hoc AWS
spot-instance cluster is just fine.

You may have:
- low volume + no real-time (5 minutes lateness is fine)
In that case, just run a plain single-instance Python script doing the job: 1k to
100k events you can consume from Kafka directly, pack into ORC, and
dump on S3/HDFS on a single CPU. Use any cron to run it every 5 minutes.
Done.
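
To make that low-volume option concrete, here is a rough sketch of such a
script. The library choices (kafka-python and pyarrow) and every name in it
are my own assumptions, not something from this thread:

import json

import pyarrow as pa
import pyarrow.orc as orc
from kafka import KafkaConsumer  # kafka-python

# Drain whatever is currently in the topic, then stop (consumer_timeout_ms).
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="broker1:9092",
    group_id="orc-dumper",
    enable_auto_commit=False,
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,
)

rows = [json.loads(msg.value) for msg in consumer]
if rows:
    table = pa.Table.from_pylist(rows)  # assumes flat, homogeneous JSON
    # Write one ORC file per run; move it into HDFS/S3 afterwards, or open a
    # stream through pyarrow.fs.HadoopFileSystem instead of a local path.
    orc.write_table(table, "/tmp/events-batch.orc")
    consumer.commit()
consumer.close()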

In case your use case is:
- large volume + real-time
for this, Flink and Spark Structured Streaming are both a good fit, but there
is also Kafka Streams, which I would suggest adding as a competitor. There is
also Beam (Google Dataflow) if you are on GCP already.
All of them do the same job.

Flink vs. Spark Structured Streaming vs. Kafka Streams:
Deployment: Kafka Streams is just one fat jar; with Flink+Spark you need to
maintain clusters, and although both frameworks are working on being k8s native,
they are not easy to set up either.
Coding: everything is JVM; Spark has Python, and Flink added Python too. There seem
to be some Python attempts around Kafka Streams as well, but I have no experience there.
Fault tolerance: I have real-world experience with Flink and Spark Structured
Streaming; both can restart from checkpoints. Flink also has savepoints, which
are a good feature for starting a new job after modifications (but also not easy
to set up).
Automatically scalable: I think none of the open-source options has this feature
out-of-the-box (correct me if wrong). You may want to pay for the Ververica platform
(the Flink authors' offering) or Databricks (the Spark authors' offering), and there
must be something from Confluent or its competitors, too. Google of course has its
Dataflow (Beam API). All auto-scaling is a pain, however: each rescale means a
reshuffle of the data.
Exactly once: To my knowledge, only Flink nowadays offers end-to-end exactly
once, and I am not sure whether that can be achieved with ORC on HDFS as the
destination. Maybe an idempotent ORC writer can be used, or some other form of
"transaction" on the destination must exist.

All in all, if I were solving your problem, I would first attack the
requirements list and see whether it can't be done more simply. If not, Flink
would be my choice, as I have had good experience with it and you can really
hack anything inside. But prepare yourself: that requirement list is hard, and
even if you get the pipeline up in 2 weeks, you will surely revisit the decision
after some incidents in the next 6 months.
If you loosen the requirements a bit, it becomes easier and easier. Your
current solution sounds very reasonable to me. You picked something that
works out of the box (Kafka Connect) and did ELT, where something that
can aggregate out of the box (Hive) does the aggregation. Why exactly do you
need to replace it?

Good luck, M.

On Fri, Dec 1, 2023 at 11:38 AM Aaron Grubb  wrote:

> Hi all,
>
> Posting this here to avoid biases from the individual mailing lists on why
> the product they're using is the best. I'm analyzing tools to
> replace a section of our pipeline with something more efficient. Currently
> we're using Kafka Connect to take data from Kafka and put it into
> S3 (not HDFS cause the connector is paid) in JSON format, then Hive reads
> JSON from S3 and creates ORC files in HDFS after a group by. I
> would like to replace this with something that reads Kafka, applies
> aggregations and windowing in-place and writes HDFS directly. I know that
> the impending Hive 4 release will support this but Hive LLAP is *very*
> slow when processing JSON. So far I have a working PySpark application
> that accomplishes this replacement using structured streaming + windowing,
> however the decision to evaluate Spark was based on there

Re: INFRA-25203

2023-11-27 Thread Peter Boot
unsubscribe

On Mon, 27 Nov 2023, 11:26 pm Drew Foulks,  wrote:

> Redirect test.
>
> --
> Cheers,
>
> Drew Foulks
>  ASF Infra
>
>
>


Re: Details about cluster balancing

2023-11-27 Thread Akash Jain
Thanks Ayush!

> On 15-Nov-2023, at 10:59 PM, Ayush Saxena  wrote:
> 
> Hi Akash,
> You can read about balancer here:
> https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
> HADOOP-1652(https://issues.apache.org/jira/browse/HADOOP-1652) has
> some details around it as well, it has some docs attached to it, you
> can read them...
> For the code, you can explore something over here:
> https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java#L473-L479
> 
> -Ayush
> 
> On Sun, 5 Nov 2023 at 22:33, Akash Jain  wrote:
>> 
>> Hello,
>> 
>> For my project, I am analyzing an algorithm to balance the disk usage across 
>> thousands of storage nodes across different availability zones.
>> 
>> Let’s say
>> Availability zone 1
>> Disk usage for data of customer 1 is 70%
>> Disk usage for data of customer 2 is 10%
>> 
>> Availability zone 2
>> Disk usage for data of customer 1 is 30%
>> Disk usage for data of customer 2 is 90%
>> 
>> and so forth…
>> 
> >> Clearly in the above example customer 1's data has much higher data locality 
> >> in AZ1 compared to AZ2. Similarly, customer 2's data has higher data locality 
> >> in AZ2 compared to AZ1.
>> 
>> In an ideal world, the data of the customers would look something like this
>> 
>> 
>> Availability zone 1
>> Disk usage for data of customer 1 is 50%
>> Disk usage for data of customer 2 is 50%
>> 
>> Availability zone 2
>> Disk usage for data of customer 1 is 50%
>> Disk usage for data of customer 2 is 50%
>> 
>> 
>> HDFS Balancer looks related, however I have some questions:
>> 
> >> 1. Why does the algorithm try to pair an over-utilized node with an 
> >> under-utilized one instead of every node holding the average amount of data?
>> (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)
>> 
>> 2. Where can I find more algorithmic details of how the pairing happens?
>> 
>> 3. Is this the only balancing algorithm supported by HDFS?
>> 
>> Thanks
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Details about cluster balancing

2023-11-15 Thread Ayush Saxena
Hi Akash,
You can read about balancer here:
https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
HADOOP-1652(https://issues.apache.org/jira/browse/HADOOP-1652) has
some details around it as well, it has some docs attached to it, you
can read them...
For the code, you can explore something over here:
https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java#L473-L479

-Ayush

On Sun, 5 Nov 2023 at 22:33, Akash Jain  wrote:
>
> Hello,
>
> For my project, I am analyzing an algorithm to balance the disk usage across 
> thousands of storage nodes across different availability zones.
>
> Let’s say
> Availability zone 1
> Disk usage for data of customer 1 is 70%
> Disk usage for data of customer 2 is 10%
>
> Availability zone 2
> Disk usage for data of customer 1 is 30%
> Disk usage for data of customer 2 is 90%
>
> and so forth…
>
> Clearly in the above example customer 1's data has much higher data locality in AZ1 
> compared to AZ2. Similarly, customer 2's data has higher data locality in 
> AZ2 compared to AZ1.
>
> In an ideal world, the data of the customers would look something like this
>
>
> Availability zone 1
> Disk usage for data of customer 1 is 50%
> Disk usage for data of customer 2 is 50%
>
> Availability zone 2
> Disk usage for data of customer 1 is 50%
> Disk usage for data of customer 2 is 50%
>
>
> HDFS Balancer looks related, however I have some questions:
>
> 1. Why does the algorithm try to pair an over-utilized node with an 
> under-utilized one instead of every node holding the average amount of data?
> (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)
>
> 2. Where can I find more algorithmic details of how the pairing happens?
>
> 3. Is this the only balancing algorithm supported by HDFS?
>
> Thanks
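
To illustrate the pairing idea behind question 1: this is only a toy sketch of
the concept, not the actual Balancer code (the real thing works on storage
groups, a configurable threshold, bytes rather than percentages, and network
topology). Presumably the point of pairing over-utilized sources with
under-utilized targets is to bound the number of node pairs and the amount of
data moved, rather than pushing every node to exactly the average.

def plan_moves(utilization, threshold=10.0):
    """utilization: {datanode: used percent}. Returns (source, target, percent) moves.

    Nodes more than `threshold` above the cluster average are treated as
    over-utilized sources, nodes more than `threshold` below it as
    under-utilized targets; each source is paired with targets until its
    excess is gone. A simplified illustration only.
    """
    avg = sum(utilization.values()) / len(utilization)
    sources = {n: u - (avg + threshold) for n, u in utilization.items() if u > avg + threshold}
    targets = {n: (avg - threshold) - u for n, u in utilization.items() if u < avg - threshold}
    moves = []
    for src, excess in sorted(sources.items(), key=lambda kv: -kv[1]):
        for dst in sorted(targets, key=lambda n: -targets[n]):
            if excess <= 0:
                break
            step = min(excess, targets[dst])
            if step > 0:
                moves.append((src, dst, round(step, 2)))
                targets[dst] -= step
                excess -= step
    return moves

print(plan_moves({"dn1": 70.0, "dn2": 10.0, "dn3": 30.0, "dn4": 90.0}))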

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: 20-minute delay in hadoop-3.3.1 when using libhdfs3.so against the RPC port of a router node in HDFS federation mode

2023-10-30 Thread Xiaoqiao He
Add hdfs-dev@h.a.o and user@h.a.o

On Thu, Oct 26, 2023 at 7:07 PM 王继泽  wrote:

> Recently, while using Hadoop, I ran into the following situation.
> When I use the C API to send requests to the RPC port of a router node in an
> HDFS federation, for example to write a file, then after the client has
> finished sending the request the Hadoop side needs a 20-minute delay before
> the file shows a byte size, and the file cannot be operated on during that
> delay.
>
> After the client side finishes, the rough sequence in the Hadoop-side logs is:
> 1. The namenode receives the client's request; FSEditLog prints a log entry.
> 2. blockmanager.BlockPlacementPolicy: reports that there are not enough
> replicas to choose from. Reason:{NO_REQUIRED_STORAGE_TYPE=1}
> 3. StateChange: allocates a block
> 4. StateChange: acquires a lease for the file in the Hadoop directory
> 5. ipc.Server: the lease check throws an exception LeaseExpiredException: INode is not a regular
> file: /
> 6. (it starts waiting)
> 7. After 20 minutes the hard limit is reached and the lease is forcibly closed.
> 8. Lease recovery is triggered
> 9. Only then does the write complete successfully.
>
> I also suspected a client-side problem, but I ran several groups of tests
> (all of them sending write requests to Hadoop via the C API; abbreviated below):
> version 3.3.1, router, RPC port    --> 20-minute delay
> version 3.3.1, namenode, RPC port  --> no problem
> version 3.3.1, router, HTTP port   --> no problem
> version 3.3.1, namenode, HTTP port --> no problem
>
> version 3.1.1, router, RPC port    --> no problem
> version 3.1.1, namenode, RPC port  --> no problem
> version 3.1.1, router, RPC port    --> no problem
> version 3.1.1, namenode, RPC port  --> no problem
>
> Here is my guess:
> Judging from the Hadoop logs, with version 3.3.1, router, RPC port the lease
> is not acquired at the beginning, so it cannot be closed normally and the
> write only finishes once the hard limit is triggered. But I cannot explain
> why the same client shows no such behaviour on version 3.1.1. I suspect that
> some change between the versions is incompatible with some part of
> libhdfs3.so and causes this behaviour.
>
>
> If anyone has seen a similar situation, I would appreciate a reply pointing
> me in the right direction on this problem.
>
>
> 王继泽
> y98d...@163.com
>
>



Re: Namenode Connection Refused

2023-10-24 Thread Harry Jamison
It is not an HA cluster, I gave up on that due to separate problems.
And I am doing this query from the same host as the namenode.

I am including the netstat -tulapn output,
which shows the namenode is not exposing the port.









On Tuesday, October 24, 2023 at 09:40:09 AM PDT, Wei-Chiu Chuang 
 wrote: 





If it's an HA cluster, is it possible the client doesn't have the proper HA 
configuration so it doesn't know what host name to connect to?

Otherwise, the usual suspect is the firewall configuration between the client 
and the NameNode.

On Tue, Oct 24, 2023 at 9:05 AM Harry Jamison 
 wrote:
> I feel like I am doing something really dumb here, but my namenode is having 
> a connection refused on port 8020.
> 
> There is nothing in the logs that seems to indicate an error as far as I can 
> tell
> 
> ps aux shows the namenode is running
> 
> root   13169   10196  9 21:18 pts/100:00:02 
> /usr/lib/jvm/java-11-openjdk-amd64//bin/java -Dproc_namenode 
> -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender 
> -Dhadoop.security.logger=INFO,RFAS 
> -Dyarn.log.dir=/hadoop/hadoop/hadoop-3.3.6/logs -Dyarn.log.file=hadoop.log 
> -Dyarn.home.dir=/hadoop/hadoop/hadoop-3.3.6 -Dyarn.root.logger=INFO,console 
> -Djava.library.path=/hadoop/hadoop/hadoop-3.3.6/lib/native 
> -Dhadoop.log.dir=/hadoop/hadoop/hadoop-3.3.6/logs 
> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/hadoop/hadoop/hadoop-3.3.6 
> -Dhadoop.id.str=root -Dhadoop.root.logger=TRACE,console 
> -Dhadoop.policy.file=hadoop-policy.xml 
> org.apache.hadoop.hdfs.server.namenode.NameNode
> 
> Netstat shows that this port is not open but others are
> root@vmnode1:/hadoop/hadoop/hadoop# netstat -tulapn|grep 802
> tcp        0      0 192.168.1.159:8023      0.0.0.0:*               LISTEN    
>   16347/java          
> tcp        0      0 192.168.1.159:8022      0.0.0.0:*               LISTEN    
>   16347/java          
> tcp        0      0 192.168.1.159:8022      192.168.1.159:56830     
> ESTABLISHED 16347/java          
> tcp        0      0 192.168.1.159:56830     192.168.1.159:8022      
> ESTABLISHED 13889/java          
> tcp        0      0 192.168.1.159:8022      192.168.1.104:58264     
> ESTABLISHED 16347/java          
> 
> 
> From the namenode logs I see that it has 8020 as the expected port
> [2023-10-23 21:18:21,739] INFO fs.defaultFS is hdfs://vmnode1:8020/ 
> (org.apache.hadoop.hdfs.server.namenode.NameNodeUtils)
> [2023-10-23 21:18:21,739] INFO Clients should use vmnode1:8020 to access this 
> namenode/service. (org.apache.hadoop.hdfs.server.namenode.NameNode)
> 
> My datanodes seem to be connecting, because I see that information about 0 
> invalid blocks in the logs
> [2023-10-24 09:03:21,255] INFO BLOCK* registerDatanode: from 
> DatanodeRegistration(192.168.1.159:9866, 
> datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786)
>  storage fbefce35-15f7-43df-a666-ecc90f4bef0f 
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-24 09:03:21,255] INFO Removing a node: 
> /default-rack/192.168.1.159:9866 (org.apache.hadoop.net.NetworkTopology)
> [2023-10-24 09:03:21,255] INFO Adding a new node: 
> /default-rack/192.168.1.159:9866 (org.apache.hadoop.net.NetworkTopology)
> [2023-10-24 09:03:21,281] INFO BLOCK* processReport 0x746ca82e1993dcbb with 
> lease ID 0xa39c5071fd7ca21f: Processing first storage report for 
> DS-ab8f27ed-6129-492c-9b8a-3800c46703fb from datanode 
> DatanodeRegistration(192.168.1.159:9866, 
> datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786)
>  (BlockStateChange)
> [2023-10-24 09:03:21,281] INFO BLOCK* processReport 0x746ca82e1993dcbb with 
> lease ID 0xa39c5071fd7ca21f: from storage 
> DS-ab8f27ed-6129-492c-9b8a-3800c46703fb node 
> DatanodeRegistration(192.168.1.159:9866, 
> datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786),
>  blocks: 0, hasStaleStorage: false, processing time: 0 msecs, 
> invalidatedBlocks: 0 (BlockStateChange)
> 
> 
> Is there anything else that I should look at?
> I am not sure how to debug why it is not starting up on this port
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Namenode Connection Refused

2023-10-24 Thread Wei-Chiu Chuang
If it's an HA cluster, is it possible the client doesn't have the proper HA
configuration so it doesn't know what host name to connect to?

Otherwise, the usual suspect is the firewall configuration between the
client and the NameNode.

On Tue, Oct 24, 2023 at 9:05 AM Harry Jamison
 wrote:

> I feel like I am doing something really dumb here, but my namenode is
> having a connection refused on port 8020.
>
> There is nothing in the logs that seems to indicate an error as far as I
> can tell
>
> ps aux shows the namenode is running
>
> root   13169   10196  9 21:18 pts/100:00:02
> /usr/lib/jvm/java-11-openjdk-amd64//bin/java -Dproc_namenode
> -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender
> -Dhadoop.security.logger=INFO,RFAS
> -Dyarn.log.dir=/hadoop/hadoop/hadoop-3.3.6/logs -Dyarn.log.file=hadoop.log
> -Dyarn.home.dir=/hadoop/hadoop/hadoop-3.3.6 -Dyarn.root.logger=INFO,console
> -Djava.library.path=/hadoop/hadoop/hadoop-3.3.6/lib/native
> -Dhadoop.log.dir=/hadoop/hadoop/hadoop-3.3.6/logs
> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/hadoop/hadoop/hadoop-3.3.6
> -Dhadoop.id.str=root -Dhadoop.root.logger=TRACE,console
> -Dhadoop.policy.file=hadoop-policy.xml
> org.apache.hadoop.hdfs.server.namenode.NameNode
>
> Netstat shows that this port is not open but others are
> root@vmnode1:/hadoop/hadoop/hadoop# netstat -tulapn|grep 802
> tcp0  0 192.168.1.159:8023  0.0.0.0:*
>  LISTEN  16347/java
> tcp0  0 192.168.1.159:8022  0.0.0.0:*
>  LISTEN  16347/java
> tcp0  0 192.168.1.159:8022  192.168.1.159:56830
>  ESTABLISHED 16347/java
> tcp0  0 192.168.1.159:56830 192.168.1.159:8022
> ESTABLISHED 13889/java
> tcp0  0 192.168.1.159:8022  192.168.1.104:58264
>  ESTABLISHED 16347/java
>
>
> From the namenode logs I see that it has 8020 as the expected port
> [2023-10-23 21:18:21,739] INFO fs.defaultFS is hdfs://vmnode1:8020/
> (org.apache.hadoop.hdfs.server.namenode.NameNodeUtils)
> [2023-10-23 21:18:21,739] INFO Clients should use vmnode1:8020 to access
> this namenode/service. (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> My datanodes seem to be connecting, because I see that information about 0
> invalid blocks in the logs
> [2023-10-24 09:03:21,255] INFO BLOCK* registerDatanode: from
> DatanodeRegistration(192.168.1.159:9866,
> datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f, infoPort=9864,
> infoSecurePort=0, ipcPort=9867,
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786)
> storage fbefce35-15f7-43df-a666-ecc90f4bef0f
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-24 09:03:21,255] INFO Removing a node: /default-rack/
> 192.168.1.159:9866 (org.apache.hadoop.net.NetworkTopology)
> [2023-10-24 09:03:21,255] INFO Adding a new node: /default-rack/
> 192.168.1.159:9866 (org.apache.hadoop.net.NetworkTopology)
> [2023-10-24 09:03:21,281] INFO BLOCK* processReport 0x746ca82e1993dcbb
> with lease ID 0xa39c5071fd7ca21f: Processing first storage report for
> DS-ab8f27ed-6129-492c-9b8a-3800c46703fb from datanode DatanodeRegistration(
> 192.168.1.159:9866, datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f,
> infoPort=9864, infoSecurePort=0, ipcPort=9867,
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786)
> (BlockStateChange)
> [2023-10-24 09:03:21,281] INFO BLOCK* processReport 0x746ca82e1993dcbb
> with lease ID 0xa39c5071fd7ca21f: from storage
> DS-ab8f27ed-6129-492c-9b8a-3800c46703fb node DatanodeRegistration(
> 192.168.1.159:9866, datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f,
> infoPort=9864, infoSecurePort=0, ipcPort=9867,
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786),
> blocks: 0, hasStaleStorage: false, processing time: 0 msecs,
> invalidatedBlocks: 0 (BlockStateChange)
>
>
> Is there anything else that I should look at?
> I am not sure how to debug why it is not starting up on this port
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


RE: MODERATE for hdfs-iss...@hadoop.apache.org

2023-10-22 Thread Sergey Onuchin
Hello Ayush Saxena.

Thank you for your response.

We have parallel processes working on the same HDFS, but they are not touching 
the affected directory.
We cannot exclude (stop) them, as it is production under load.

Also we cannot enable debug mode due to the risk of impacting ongoing operations.

Scanning the hdfs-audit log shows creation of, and data access to, the 'lost'
directories and their files, up to the day they were, well, lost.
No 'delete' or 'rename' operations are visible in the logs - just no more matches.

>> then maybe check in edit logs, or enable debug logs and see for entries for 
>> edit log,"doEditTx op"
I don't know how to do that, please elaborate.

I've attached yesterday's evidence (from the user's perspective) of the loss of 2 partitions.

Right now I did the following:
- copied whole parent directory to another HDFS location
- started rebuilding 'lost' partitions, this will take 3-4 calendar days to 
cover all missing days.
- only one partition is done so far, no loss appeared yet.


Thank you!

-Original Message-
From: Ayush Saxena  
Sent: 18 October, 2023 2:25
To: Sergey Onuchin 
Cc: Hdfs-dev ; Xiaoqiao He ; 
user.hadoop 
Subject: Re: MODERATE for hdfs-iss...@hadoop.apache.org

+ user@hadoop

This sounds pretty strange, do you have any background job in your cluster 
running, like for compaction kind of stuff, which plays with the files? Any 
traces in the Namenode Logs, what happens to the blocks associated with those 
files, If they get deleted before a FBR, that ain't a metadata loss I believe, 
something triggered a delete, maybe on the parent directory?

Will it be possible to enable debug logs and grep for "DIR* 
FSDirectory.delete:" (code here [1]) or check other delete related entries from 
StateChangeLog?
Maybe try to capture all the Audit logs from the create entry to the moment 
when you figure out files are missing & look for all the delete entries.
If there is still no luck, then maybe check the edit logs, or enable debug logs
and look for edit-log entries for "doEditTx op"

-Ayush

[1] 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java#L175

On Tue, 17 Oct 2023 at 17:57, Xiaoqiao He  wrote:
>
> Hi Sergey Onuchin,
>
> Sorry to hear that. But we could not give some suggestions based on 
> the only information you mentioned.
> If any more on-site information may be better to trace, such as depoy 
> architecture, NameNode log and jstack etc.
> Based on my practice, I did not receive some cases which delete 
> directory without noise.
> Did you try to check operations (rename and delete) about the 
> parent-directory?
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
>
> On Mon, Oct 16, 2023 at 11:58 PM <
> hdfs-issues-reject-1697471875.2027154.pkchcedhioidkhech...@hadoop.apac
> he.org>
> wrote:
>
> >
> > -- Forwarded message --
> > From: Sergey Onuchin 
> > To: "hdfs-iss...@hadoop.apache.org" 
> > Cc:
> > Bcc:
> > Date: Mon, 16 Oct 2023 15:57:47 +
> > Subject: HDFS loses directories with production data
> >
> > Hello,
> >
> >
> >
> > We’ve been using Hadoop (+Spark) for 3 years on production w/o major 
> > issues.
> >
> >
> >
> > Lately we observe that whole non-empty directories (table 
> > partitions) are disappearing in random ways.
> >
> > We see in application logs (and in hdfs-audit) logs creation of the 
> > directory + data files.
> >
> > Then later we see NO this directory in HDFS.
> >
> >
> >
> > hdfs-audit.log shows no traces of deletes or renames for the 
> > disappeared directories.
> >
> > We can trust these logs, as we see our manual operations are present 
> > in the logs.
> >
> >
> >
> > Time between creation and disappearing is 1-2 days.
> >
> >
> >
> > Maybe we are losing individual files as well, we just cannot find 
> > this out reliably.
> >
> >
> >
> > This is a blocker issue for us, we have to stop production data 
> > processing until we find out and fix data loss root cause.
> >
> >
> >
> > Please help to identify the root cause or find the right direction 
> > for search/further questions.
> >
> >
> >
> >
> >
> > -- Hadoop version: --
> >
> > Hadoop 3.2.1
> >
> > Source code repository 
> > https://gitbox.apache.org/repos/asf/hadoop.git -r
> > b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
> >
> > Compiled by rohithsharmaks on 2019-09-10T15:56Z
> >
> > Compiled with protoc 2.5.0
> >
> > From source with check

Re: How to clear EXPIRED routers?

2023-10-21 Thread Takanobu Asanuma
dfs.federation.router.store.router.expiration.deletion is the configuration
value for that purpose.
https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml

- Takanobu

2023年10月19日(木) 0:41 杨光 :

> Hi everyone!
>
> I'm using hadoop 3.3.4, and started 5 hdfs routers on servers. Now I have
> to delete two of them using: hdfs --daemon stop dfsrouter. The command
> executed successfully, but on the router WebUI (
> http://url-to-router-webui:50071), it shows 5 routers and 2 of them are
> in EXPIRED status. How can I clear them?
>


Re: DistCP from Hadoop 2.X to 3.X - where to compute

2023-10-18 Thread 杨光
Hi PA,

We just did the same work recently, copying data from Hadoop 2 to Hadoop 3;
to be precise, the src Hadoop version was CDH hadoop-2.6 (a federation of 5 HDFS
nameservices) and the dst Hadoop version was hadoop 3.3.4. Both clusters are
protected with Kerberos, and of course the two realms trust each other.
We executed the DistCP on the Hadoop 3 cluster, but also
tried it on Hadoop 2. Both worked nicely. I can confirm that copying
data with DistCP from 1.x to 2.x needs webhdfs, which is slow compared to
the RPC one. Here is an execution example:

hadoop --config /home/hadoop/conf distcp \
  -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1,ns2,ns3,ns4,ns5 \
  -update -skipcrccheck \
  hdfs://hadoop2-cluster/user/test \
  hdfs://hadoop3-cluster/user/test


Re: MODERATE for hdfs-iss...@hadoop.apache.org

2023-10-17 Thread Ayush Saxena
+ user@hadoop

This sounds pretty strange, do you have any background job in your
cluster running, like for compaction kind of stuff, which plays with
the files? Any traces in the Namenode Logs, what happens to the blocks
associated with those files, If they get deleted before a FBR, that
ain't a metadata loss I believe, something triggered a delete, maybe
on the parent directory?

Will it be possible to enable debug logs and grep for "DIR*
FSDirectory.delete:" (code here [1]) or check other delete related
entries from StateChangeLog?
Maybe try to capture all the Audit logs from the create entry to the
moment when you figure out files are missing & look for all the delete
entries.
If there is still no luck, then maybe check the edit logs, or enable debug logs
and look for edit-log entries for "doEditTx op"

-Ayush

[1] 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java#L175
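
To make that audit-log sweep concrete, a rough Python filter along these lines
might help (assuming the standard key=value hdfs-audit.log layout; the file
name, path prefix and the exact cmd values checked are assumptions to adjust):

import re
import sys

# Pull key=value fields out of hdfs-audit.log lines, e.g.
# ... cmd=delete  src=/warehouse/tbl/part=2023-10-01  dst=null  perm=null ...
FIELD = re.compile(r'(\w+)=(\S+)')

def suspicious_ops(audit_path, prefix):
    with open(audit_path, errors="replace") as f:
        for line in f:
            fields = dict(FIELD.findall(line))
            if fields.get("cmd") not in ("delete", "rename"):
                continue
            if fields.get("src", "").startswith(prefix) or \
               fields.get("dst", "").startswith(prefix):
                yield line.rstrip()

if __name__ == "__main__":
    # usage: python audit_scan.py hdfs-audit.log /warehouse/tbl
    for hit in suspicious_ops(sys.argv[1], sys.argv[2]):
        print(hit)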

On Tue, 17 Oct 2023 at 17:57, Xiaoqiao He  wrote:
>
> Hi Sergey Onuchin,
>
> Sorry to hear that. But we could not give some suggestions based on the
> only information you mentioned.
> If any more on-site information may be better to trace, such as
> depoy architecture, NameNode log and jstack etc.
> Based on my practice, I did not receive some cases which delete directory
> without noise.
> Did you try to check operations (rename and delete) about the
> parent-directory?
> Good luck!
>
> Best Regards,
> - He Xiaoqiao
>
>
> On Mon, Oct 16, 2023 at 11:58 PM <
> hdfs-issues-reject-1697471875.2027154.pkchcedhioidkhech...@hadoop.apache.org>
> wrote:
>
> >
> > -- Forwarded message --
> > From: Sergey Onuchin 
> > To: "hdfs-iss...@hadoop.apache.org" 
> > Cc:
> > Bcc:
> > Date: Mon, 16 Oct 2023 15:57:47 +
> > Subject: HDFS loses directories with production data
> >
> > Hello,
> >
> >
> >
> > We’ve been using Hadoop (+Spark) for 3 years on production w/o major
> > issues.
> >
> >
> >
> > Lately we observe that whole non-empty directories (table partitions) are
> > disappearing in random ways.
> >
> > We see in application logs (and in hdfs-audit) logs creation of the
> > directory + data files.
> >
> > Then later we see NO this directory in HDFS.
> >
> >
> >
> > hdfs-audit.log shows no traces of deletes or renames for the disappeared
> > directories.
> >
> > We can trust these logs, as we see our manual operations are present in
> > the logs.
> >
> >
> >
> > Time between creation and disappearing is 1-2 days.
> >
> >
> >
> > Maybe we are losing individual files as well, we just cannot find this out
> > reliably.
> >
> >
> >
> > This is a blocker issue for us, we have to stop production data processing
> > until we find out and fix data loss root cause.
> >
> >
> >
> > Please help to identify the root cause or find the right direction for
> > search/further questions.
> >
> >
> >
> >
> >
> > -- Hadoop version: --
> >
> > Hadoop 3.2.1
> >
> > Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r
> > b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
> >
> > Compiled by rohithsharmaks on 2019-09-10T15:56Z
> >
> > Compiled with protoc 2.5.0
> >
> > From source with checksum 776eaf9eee9c0ffc370bcbc1888737
> >
> >
> >
> > Thank you!
> >
> > Sergey Onuchin
> >
> >
> >

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Unsubscribe

2023-10-05 Thread Niketh Nikky
Unsubscribe 
Thanks 
Niketh 

> On Oct 5, 2023, at 7:56 AM, Viral Mehta  wrote:
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: HDFS HA standby

2023-10-04 Thread Kiyoshi Mizumaru
First of all, could you please explain how you installed Hadoop? It's
possible that you may have already disclosed this information in a previous
thread, but please understand that I haven't gone through all of them and
don't have all the details memorized.

I haven't actually tried it, but I believe that when we want to change the
log level for processes that are started as daemons, such as Namenode and
Datanode, we should configure the HADOOP_DAEMON_ROOT_LOGGER environment
variable in etc/hadoop/hadoop-env.sh:


# Default log4j setting for interactive commands
# Java property: hadoop.root.logger
# export HADOOP_ROOT_LOGGER=INFO,console

# Default log4j setting for daemons spawned explicitly by
# --daemon option of hadoop, hdfs, mapred and yarn command.
# Java property: hadoop.root.logger
# export HADOOP_DAEMON_ROOT_LOGGER=INFO,RFA
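#
# A hedged, untested suggestion for this thread: adding a line like the
# following should give DEBUG output in the daemon's rolling log file:
# export HADOOP_DAEMON_ROOT_LOGGER=DEBUG,RFA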


On Wed, Oct 4, 2023 at 5:11 PM Harry Jamison
 wrote:

> @*Kiyoshi Mizumaru*
>
> How would I do that?
> I tried changing
>
> /hadoop/etc/hadoop/hadoop-env.sh
>
> export HADOOP_*ROOT*_LOGGER=TRACE,console
>
> But that did not seem to work, I still only get INFO.
> On Tuesday, October 3, 2023 at 09:13:13 PM PDT, Harry Jamison
>  wrote:
>
>
> I am not sure exactly what the problem is now.
>
> My namenode (and I think journal node are getting shut down.
> Is there a way to tell Why it is getting the shutdown signal?
>
> Also the datanode seems to be getting this error
> End of File Exception between local host is
>
>
> Here are the logs, and I only see INFO logging, and then a the Shutdown
>
> [2023-10-03 20:53:00,873] INFO Initializing quota with 12 thread(s)
> (org.apache.hadoop.hdfs.server.namenode.FSDirectory)
>
> [2023-10-03 20:53:00,876] INFO Quota initialization completed in 1
> milliseconds
>
> name space=2
>
> storage space=0
>
> storage types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
> (org.apache.hadoop.hdfs.server.namenode.FSDirectory)
>
> [2023-10-03 20:53:00,882] INFO Total number of blocks= 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Starting CacheReplicationMonitor with
> interval 3 milliseconds
> (org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor)
>
> [2023-10-03 20:53:00,884] INFO Number of invalid blocks  = 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Number of under-replicated blocks = 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Number of  over-replicated blocks = 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Number of blocks being written= 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO STATE* Replication Queue initialization
> scan for invalid, over- and under-replicated blocks completed in 67 msec
> (org.apache.hadoop.hdfs.StateChange)
>
> [2023-10-03 20:54:16,453] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-03 20:54:16,467] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at vmnode1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
>
>
>
> When I start the data node I see this
>
> [2023-10-03 20:53:00,882] INFO Namenode Block pool
> BP-1620264838-192.168.1.159-1696370857417 (Datanode Uuid
> 66068658-b08b-49cd-aba0-56ac1f29e7d5) service to vmnode1/
> 192.168.1.159:8020 trying to claim ACTIVE state with txid=15
> (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> [2023-10-03 20:53:00,882] INFO Acknowledging ACTIVE Namenode Block pool
> BP-1620264838-192.168.1.159-1696370857417 (Datanode Uuid
> 66068658-b08b-49cd-aba0-56ac1f29e7d5) service to vmnode1/
> 192.168.1.159:8020 (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> [2023-10-03 20:53:00,882] INFO After receiving heartbeat response,
> updating state of namenode vmnode1:8020 to active
> (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> [2023-10-03 20:54:18,771] WARN IOException in offerService
> (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> java.io.EOFException: End of File Exception between local host is:
> "vmnode1/192.168.1.159"; destination host is: "vmnode1":8020; :
> java.io.EOFException; For more details see:
> http://wiki.apache.org/hadoop/EOFException
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
> at
> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
>
> at 

Re: HDFS HA standby

2023-10-04 Thread Harry Jamison
@Kiyoshi Mizumaru
How would I do that? I tried changing
/hadoop/etc/hadoop/hadoop-env.sh

export HADOOP_ROOT_LOGGER=TRACE,console

But that did not seem to work, I still only get INFO.

On Tuesday, October 3, 2023 at 09:13:13 PM PDT, Harry Jamison
 wrote:

I am not sure exactly what the problem is now.
My namenode (and I think journal node) are getting shut down. Is there a way to
tell why it is getting the shutdown signal?
Also the datanode seems to be getting this error:
End of File Exception between local host is


Here are the logs, and I only see INFO logging, and then the Shutdown
[2023-10-03 20:53:00,873] INFO Initializing quota with 12 thread(s) 
(org.apache.hadoop.hdfs.server.namenode.FSDirectory)

[2023-10-03 20:53:00,876] INFO Quota initialization completed in 1 milliseconds

name space=2

storage space=0

storage types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0 
(org.apache.hadoop.hdfs.server.namenode.FSDirectory)

[2023-10-03 20:53:00,882] INFO Total number of blocks            = 0 
(org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)

[2023-10-03 20:53:00,884] INFO Starting CacheReplicationMonitor with interval 
3 milliseconds 
(org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor)

[2023-10-03 20:53:00,884] INFO Number of invalid blocks          = 0 
(org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)

[2023-10-03 20:53:00,884] INFO Number of under-replicated blocks = 0 
(org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)

[2023-10-03 20:53:00,884] INFO Number of  over-replicated blocks = 0 
(org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)

[2023-10-03 20:53:00,884] INFO Number of blocks being written    = 0 
(org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)

[2023-10-03 20:53:00,884] INFO STATE* Replication Queue initialization scan for 
invalid, over- and under-replicated blocks completed in 67 msec 
(org.apache.hadoop.hdfs.StateChange)

[2023-10-03 20:54:16,453] ERROR RECEIVED SIGNAL 15: SIGTERM 
(org.apache.hadoop.hdfs.server.namenode.NameNode)

[2023-10-03 20:54:16,467] INFO SHUTDOWN_MSG: 

/

SHUTDOWN_MSG: Shutting down NameNode at vmnode1/192.168.1.159

/ 
(org.apache.hadoop.hdfs.server.namenode.NameNode)




When I start the data node I see this
[2023-10-03 20:53:00,882] INFO Namenode Block pool 
BP-1620264838-192.168.1.159-1696370857417 (Datanode Uuid 
66068658-b08b-49cd-aba0-56ac1f29e7d5) service to vmnode1/192.168.1.159:8020 
trying to claim ACTIVE state with txid=15 
(org.apache.hadoop.hdfs.server.datanode.DataNode)

[2023-10-03 20:53:00,882] INFO Acknowledging ACTIVE Namenode Block pool 
BP-1620264838-192.168.1.159-1696370857417 (Datanode Uuid 
66068658-b08b-49cd-aba0-56ac1f29e7d5) service to vmnode1/192.168.1.159:8020 
(org.apache.hadoop.hdfs.server.datanode.DataNode)

[2023-10-03 20:53:00,882] INFO After receiving heartbeat response, updating 
state of namenode vmnode1:8020 to active 
(org.apache.hadoop.hdfs.server.datanode.DataNode)

[2023-10-03 20:54:18,771] WARN IOException in offerService 
(org.apache.hadoop.hdfs.server.datanode.DataNode)

java.io.EOFException: End of File Exception between local host is: 
"vmnode1/192.168.1.159"; destination host is: "vmnode1":8020; : 
java.io.EOFException; For more details see:  
http://wiki.apache.org/hadoop/EOFException

 at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

 at 
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)

 at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:930)

 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:879)

 at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1571)

 at org.apache.hadoop.ipc.Client.call(Client.java:1513)

 at org.apache.hadoop.ipc.Client.call(Client.java:1410)

 at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)

 at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)

 at com.sun.proxy.$Proxy19.sendHeartbeat(Unknown Source)

 at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)

 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:562)

 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:710)

 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)

 at java.base/java.lang.Thread.run(Thread.java:829)

Caused by: java.io.EOFException

 at 

Re: HDFS HA standby

2023-10-03 Thread Kiyoshi Mizumaru
Why don't you try to change the logging level? DEBUG or TRACE would be
helpful.


On Wed, Oct 4, 2023 at 1:13 PM Harry Jamison
 wrote:

> I am not sure exactly what the problem is now.
>
> My namenode (and I think journal node are getting shut down.
> Is there a way to tell Why it is getting the shutdown signal?
>
> Also the datanode seems to be getting this error
> End of File Exception between local host is
>
>
> Here are the logs, and I only see INFO logging, and then a the Shutdown
>
> [2023-10-03 20:53:00,873] INFO Initializing quota with 12 thread(s)
> (org.apache.hadoop.hdfs.server.namenode.FSDirectory)
>
> [2023-10-03 20:53:00,876] INFO Quota initialization completed in 1
> milliseconds
>
> name space=2
>
> storage space=0
>
> storage types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
> (org.apache.hadoop.hdfs.server.namenode.FSDirectory)
>
> [2023-10-03 20:53:00,882] INFO Total number of blocks= 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Starting CacheReplicationMonitor with
> interval 3 milliseconds
> (org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor)
>
> [2023-10-03 20:53:00,884] INFO Number of invalid blocks  = 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Number of under-replicated blocks = 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Number of  over-replicated blocks = 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO Number of blocks being written= 0
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
>
> [2023-10-03 20:53:00,884] INFO STATE* Replication Queue initialization
> scan for invalid, over- and under-replicated blocks completed in 67 msec
> (org.apache.hadoop.hdfs.StateChange)
>
> [2023-10-03 20:54:16,453] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-03 20:54:16,467] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at vmnode1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
>
>
>
> When I start the data node I see this
>
> [2023-10-03 20:53:00,882] INFO Namenode Block pool
> BP-1620264838-192.168.1.159-1696370857417 (Datanode Uuid
> 66068658-b08b-49cd-aba0-56ac1f29e7d5) service to vmnode1/
> 192.168.1.159:8020 trying to claim ACTIVE state with txid=15
> (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> [2023-10-03 20:53:00,882] INFO Acknowledging ACTIVE Namenode Block pool
> BP-1620264838-192.168.1.159-1696370857417 (Datanode Uuid
> 66068658-b08b-49cd-aba0-56ac1f29e7d5) service to vmnode1/
> 192.168.1.159:8020 (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> [2023-10-03 20:53:00,882] INFO After receiving heartbeat response,
> updating state of namenode vmnode1:8020 to active
> (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> [2023-10-03 20:54:18,771] WARN IOException in offerService
> (org.apache.hadoop.hdfs.server.datanode.DataNode)
>
> java.io.EOFException: End of File Exception between local host is:
> "vmnode1/192.168.1.159"; destination host is: "vmnode1":8020; :
> java.io.EOFException; For more details see:
> http://wiki.apache.org/hadoop/EOFException
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
> at
> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
>
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:930)
>
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:879)
>
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1571)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1513)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>
> at com.sun.proxy.$Proxy19.sendHeartbeat(Unknown Source)
>
> at
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
>
> at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:562)
>
> at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:710)
>
> at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
>
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> Caused by: java.io.EOFException
>
> at 

Re: HDFS HA namenode issue

2023-10-03 Thread Harry Jamison
Thanks guys, I figured out what my issue was. I did not set up the ssh key
correctly; it was for my user, but I started the service as root.
Now it is working except none of the namenodes are transitioning to active on
startup, and the datanodes are not starting automatically (I think because no
namenode is active).
I can start everything manually though.
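
In case it helps anyone hitting the same thing: in an HA setup without
ZKFC/automatic failover the NameNodes always come up as standby and one of them
has to be promoted by hand. A rough sketch of a helper one could run until
automatic failover is configured (the service ids nn1/nn2 are placeholders for
the values in dfs.ha.namenodes.<nameservice>):

import subprocess

# Placeholder service ids; use the values from dfs.ha.namenodes.<nameservice>.
NAMENODES = ["nn1", "nn2"]

def active_exists():
    out = subprocess.run(["hdfs", "haadmin", "-getAllServiceState"],
                         capture_output=True, text=True, check=True).stdout
    return "active" in out.lower()

if not active_exists():
    # Without ZKFC, an administrator has to transition one NameNode to active.
    subprocess.run(["hdfs", "haadmin", "-transitionToActive", NAMENODES[0]],
                   check=True)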

On Tuesday, October 3, 2023 at 11:03:33 AM PDT, Susheel Kumar Gadalay 
 wrote:  
 
Why have you set this again in hdfs-site.xml at the end?

  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>nn1:8020</value>
  </property>

Remove this and start the name node again.
Regards, Susheel Kumar

On Tue, 3 Oct 2023, 10:09 pm Harry Jamison,  wrote:

OK here is where I am at now.
When I start the namenodes, they work, but they are all in standby mode. When I
start my first datanode it seems to kill one of the namenodes (the active one I
assume).
I am getting 2 different warnings in the namenode:
[2023-10-03 09:03:52,162] WARN Unable to initialize FileSignerSecretProvider,
falling back to use random secrets. Reason: Could not read signature secret
file: /root/hadoop-http-auth-signature-secret
(org.apache.hadoop.security.authentication.server.AuthenticationFilter)

[2023-10-03 09:03:52,350] WARN Only one image storage directory
(dfs.namenode.name.dir) configured. Beware of data loss due to lack of
redundant storage directories!
(org.apache.hadoop.hdfs.server.namenode.FSNamesystem)

I am using a journal node, so I am not clear if I am supposed to have multiple
dfs.namenode.name.dir directories; I thought each namenode has 1 directory.

Susheel Kumar Gadalay said that my shared.edits.dir is wrong, but I am not
clear how it is wrong. From here mine looks right:
https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

This is what is in the logs right before the namenode dies:
[2023-10-03 09:01:22,054] INFO Listener at vmnode3:8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,054] INFO Starting Socket Reader #1 for port 8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,097] INFO Registered FSNamesystemState, ReplicatedBlocksState and ECBlockGroupsState MBeans. (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,119] INFO Number of blocks under construction: 0 (org.apache.hadoop.hdfs.server.namenode.LeaseManager)
[2023-10-03 09:01:22,122] INFO Initialized the Default Decommission and Maintenance monitor (org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminDefaultMonitor)
[2023-10-03 09:01:22,131] INFO STATE* Leaving safe mode after 0 secs (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* Network topology has 0 racks and 0 datanodes (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* UnderReplicatedBlocks has 0 blocks (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,130] INFO Start MarkedDeleteBlockScrubber thread (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
[2023-10-03 09:01:22,158] INFO IPC Server Responder: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,159] INFO IPC Server listener on 8020: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,165] INFO NameNode RPC up at: vmnode3/192.168.1.103:8020 (org.apache.hadoop.hdfs.server.namenode.NameNode)
[2023-10-03 09:01:22,166] INFO Starting services required for standby state (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,168] INFO Will roll logs on active node every 120 seconds. (org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer)
[2023-10-03 09:01:22,171] INFO Starting standby checkpoint thread...
Checkpointing active NN to possible NNs: [http://vmnode1:9870, http://vmnode2:9870]
Serving checkpoints at http://vmnode3:9870 (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15187
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15187
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited






On Tuesday, October 3, 2023 at 03:54:23 AM PDT, Liming Cui 
 wrote:  
 
 Harry,
Great question. I would say the same configurations in core-site.xml and 
hdfs-site.xml will be overwriting each other in some way.
Glad you found the root cause.
Keep going.
On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison  wrote:

 Liming 
After looking at my 

Re: HDFS HA namenode issue

2023-10-03 Thread Susheel Kumar Gadalay
Why you have set this again in hdfs-site.xml at the end.


dfs.namenode.rpc-address
nn1:8020
  

Remove this and start name node again.

Regards
Susheel Kumar
On Tue, 3 Oct 2023, 10:09 pm Harry Jamison,
 wrote:

> OK here is where I am at now.
>
> When I start the namenodes, they work, but they are all in standby mode.
> When I start my first datanode it seems to kill one of the namenodes (the
> active one I assume)
>
> I am getting 2 different warnings in the namenode
>
> [2023-10-03 09:03:52,162] WARN Unable to initialize
> FileSignerSecretProvider, falling back to use random secrets. Reason: Could
> not read signature secret file: /root/hadoop-http-auth-signature-secret
> (org.apache.hadoop.security.authentication.server.AuthenticationFilter)
>
> [2023-10-03 09:03:52,350] WARN Only one image storage directory
> (dfs.namenode.name.dir) configured. Beware of data loss due to lack of
> redundant storage directories!
> (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
>
> I am using a journal node, so I am not clear if I am supposed to have
> multiple dfs.namenode.name.dir directories
> I thought each namenode has 1 directory.
>
>
> Susheel Kumar Gadalay said that my shared.edits.dir Is wrong, but I am
> not clear how it is wrong
> From here mine looks right
>
> https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>
> This is what is in the logs right before the namenode dies
> [2023-10-03 09:01:22,054] INFO Listener at vmnode3:8020
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,054] INFO Starting Socket Reader #1 for port 8020
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,097] INFO Registered FSNamesystemState,
> ReplicatedBlocksState and ECBlockGroupsState MBeans.
> (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
> [2023-10-03 09:01:22,119] INFO Number of blocks under construction: 0
> (org.apache.hadoop.hdfs.server.namenode.LeaseManager)
> [2023-10-03 09:01:22,122] INFO Initialized the Default Decommission and
> Maintenance monitor
> (org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminDefaultMonitor)
> [2023-10-03 09:01:22,131] INFO STATE* Leaving safe mode after 0 secs
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-03 09:01:22,131] INFO STATE* Network topology has 0 racks and 0
> datanodes (org.apache.hadoop.hdfs.StateChange)
> [2023-10-03 09:01:22,131] INFO STATE* UnderReplicatedBlocks has 0 blocks
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-03 09:01:22,130] INFO Start MarkedDeleteBlockScrubber thread
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
> [2023-10-03 09:01:22,158] INFO IPC Server Responder: starting
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,159] INFO IPC Server listener on 8020: starting
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,165] INFO NameNode RPC up at: vmnode3/
> 192.168.1.103:8020 (org.apache.hadoop.hdfs.server.namenode.NameNode)
> [2023-10-03 09:01:22,166] INFO Starting services required for standby
> state (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
> [2023-10-03 09:01:22,168] INFO Will roll logs on active node every 120
> seconds. (org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer)
> [2023-10-03 09:01:22,171] INFO Starting standby checkpoint thread...
> Checkpointing active NN to possible NNs: [http://vmnode1:9870,
> http://vmnode2:9870]
> Serving checkpoints at http://vmnode3:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
> real-time non-blocking time  (microseconds, -R) unlimited
> core file size  (blocks, -c) 0
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals (-i) 15187
> max locked memory   (kbytes, -l) 8192
> max memory size (kbytes, -m) unlimited
> open files  (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 15187
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
>
>
>
>
>
>
> On Tuesday, October 3, 2023 at 03:54:23 AM PDT, Liming Cui <
> anyone.cui...@gmail.com> wrote:
>
>
> Harry,
>
> Great question.
> I would say the same configurations in core-site.xml and hdfs-site.xml
> will be overwriting each other in some way.
>
> Glad you found the root cause.
>
> Keep going.
>
> On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison 
> wrote:
>
> Liming
>
After looking at my config, I think that maybe my problem is because my
fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml.
What do hdfs-site.xml and core-site.xml each do, and why is the same setting
in 2 different places?
Or do I just have it there mistakenly?

Re: Locating frequent data blocks

2023-10-03 Thread Mohammad Aghanabi
Hello. I would appreciate any help on this matter. Thanks

On Wed, Sep 13, 2023 at 1:30 PM Mohammad Aghanabi 
wrote:

> Hello.
>
> I read in a few articles like [1] that we can obtain data block stats from
> "historical data access recorder from the NameNode log file" or in another
> paper it's stated that frequent data blocks can be determined using
> NameNode provided logs.
>
> I searched for related information on hadoop.apache.org but didn't find
> anything. I read about job counters, fsimage, edit logs, audit logs... but
> nothing related to a metric that represents "frequently accessed data
> blocks" of DataNodes.
>
> I'd appreciate any help on whether this kind of stat is being collected by
> a component or not.
>
> Thank you
>
>
> [1] Jia-xuan Wu, Chang-sheng Zhang, Bin Zhang, Peng Wang, "A new
> data-grouping-aware dynamic data placement method that take into account
> jobs execute frequency for Hadoop", Microprocessors and Microsystems,
> Volume 47, Part A, 2016, Pages 161-169
>


Re: HDFS HA namenode issue

2023-10-03 Thread Harry Jamison
OK here is where I am at now.
When I start the namenodes, they work, but they are all in standby mode. When I
start my first datanode it seems to kill one of the namenodes (the active one I
assume).
I am getting 2 different warnings in the namenode
[2023-10-03 09:03:52,162] WARN Unable to initialize FileSignerSecretProvider, 
falling back to use random secrets. Reason: Could not read signature secret 
file: /root/hadoop-http-auth-signature-secret 
(org.apache.hadoop.security.authentication.server.AuthenticationFilter)

[2023-10-03 09:03:52,350] WARN Only one image storage directory 
(dfs.namenode.name.dir) configured. Beware of data loss due to lack of 
redundant storage directories! 
(org.apache.hadoop.hdfs.server.namenode.FSNamesystem)

I am using a journal node, so I am not clear if I am supposed to have multiple
dfs.namenode.name.dir directories. I thought each namenode has 1 directory.

Susheel Kumar Gadalay said that my shared.edits.dir is wrong, but I am not
clear how it is wrong. From here mine looks right:
https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

This is what is in the logs right before the namenode dies:

[2023-10-03 09:01:22,054] INFO Listener at vmnode3:8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,054] INFO Starting Socket Reader #1 for port 8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,097] INFO Registered FSNamesystemState, ReplicatedBlocksState and ECBlockGroupsState MBeans. (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,119] INFO Number of blocks under construction: 0 (org.apache.hadoop.hdfs.server.namenode.LeaseManager)
[2023-10-03 09:01:22,122] INFO Initialized the Default Decommission and Maintenance monitor (org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminDefaultMonitor)
[2023-10-03 09:01:22,131] INFO STATE* Leaving safe mode after 0 secs (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* Network topology has 0 racks and 0 datanodes (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* UnderReplicatedBlocks has 0 blocks (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,130] INFO Start MarkedDeleteBlockScrubber thread (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
[2023-10-03 09:01:22,158] INFO IPC Server Responder: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,159] INFO IPC Server listener on 8020: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,165] INFO NameNode RPC up at: vmnode3/192.168.1.103:8020 (org.apache.hadoop.hdfs.server.namenode.NameNode)
[2023-10-03 09:01:22,166] INFO Starting services required for standby state (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,168] INFO Will roll logs on active node every 120 seconds. (org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer)
[2023-10-03 09:01:22,171] INFO Starting standby checkpoint thread...
Checkpointing active NN to possible NNs: [http://vmnode1:9870, http://vmnode2:9870]
Serving checkpoints at http://vmnode3:9870 (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15187
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15187
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited






On Tuesday, October 3, 2023 at 03:54:23 AM PDT, Liming Cui 
 wrote:  
 
Harry,

Great question. I would say the same configurations in core-site.xml and
hdfs-site.xml will be overwriting each other in some way.

Glad you found the root cause.

Keep going.
On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison  wrote:

Liming

After looking at my config, I think that maybe my problem is because my
fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml.
What do hdfs-site.xml and core-site.xml each do, and why is the same setting
in 2 different places? Or do I just have it there mistakenly?

this is what I have in hdfs-site.xml

fs.defaultFS
hdfs://mycluster

ha.zookeeper.quorum
nn1:2181,nn2:2181,nn3:2181

dfs.nameservices
mycluster

dfs.ha.namenodes.mycluster
nn1,nn2,nn3

dfs.namenode.rpc-address.mycluster.nn1
nn1:8020

dfs.namenode.rpc-address.mycluster.nn2
nn2:8020

dfs.namenode.rpc-address.mycluster.nn3
nn3:8020

Re: HDFS HA namenode issue

2023-10-03 Thread Liming Cui
Harry,

Great question.
I would say the same configurations in core-site.xml and hdfs-site.xml will
be overwriting each other in some way.

Glad you found the root cause.

Keep going.

On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison 
wrote:

> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS
> is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2
> different places?
> Or do I just have it there mistakenly?
>
> this is what I have in hdfs-site.xml
>
> 
> 
> 
>   
>   fs.defaultFS
>   hdfs://mycluster
>
>   
> ha.zookeeper.quorum
> nn1:2181,nn2:2181,nn3:2181
>   
>
>   
> dfs.nameservices
> mycluster
>   
>
>   
> dfs.ha.namenodes.mycluster
> nn1,nn2,nn3
>   
>
>   
> dfs.namenode.rpc-address.mycluster.nn1
> nn1:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn2
> nn2:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn3
> nn3:8020
>   
>
>   
> dfs.namenode.http-address.mycluster.nn1
> nn1:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn2
> nn2:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn3
> nn3:9870
>   
>
>   
> dfs.namenode.shared.edits.dir
> qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster
>   
>   
> dfs.client.failover.proxy.provider.mycluster
>
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>   
>
>   
> dfs.ha.fencing.methods
> sshfence
>   
>
>   
> dfs.ha.fencing.ssh.private-key-files
> /home/harry/.ssh/id_rsa
>   
>
>   
> dfs.namenode.name.dir
> file:/hadoop/data/hdfs/namenode
>   
>   
> dfs.datanode.data.dir
> file:/hadoop/data/hdfs/datanode
>   
>   
> dfs.journalnode.edits.dir
> /hadoop/data/hdfs/journalnode
>   
>   
> dfs.namenode.rpc-address
> nn1:8020
>   
>
>   
> dfs.ha.nn.not-become-active-in-safemode
> true
>   
>
> 
>
>
>
> In core-site.xml I have this
>
> 
>
> 
>
> 
>
>
> 
>
>
> 
>
>   
>
> fs.defaultFS
>
> hdfs://nn1:8020
>
>   
>
>
> 
>
>
> On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui <
> anyone.cui...@gmail.com> wrote:
>
>
> Can you show us the configuration files?
> Maybe I can help you with some suggestions.
>
>
> On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison
>  wrote:
>
> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
> 
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing
> wrong here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no
> edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size   

Re: HDFS HA namenode issue

2023-10-03 Thread Susheel Kumar Gadalay
The core-site.xml configuration settings will be overridden by
hdfs-site.xml, mapred-site.xml, yarn-site.xml. This was like that but don't
know if it is changed now.

Look at your shared.edits.dir configuration. You have not set it correct
across name nodes.

Regards


On Tue, 3 Oct 2023, 1:59 pm Harry Jamison, 
wrote:

> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS
> is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2
> different places?
> Or do I just have it there mistakenly?
>
> this is what I have in hdfs-site.xml
>
> 
> 
> 
>   
>   fs.defaultFS
>   hdfs://mycluster
>
>   
> ha.zookeeper.quorum
> nn1:2181,nn2:2181,nn3:2181
>   
>
>   
> dfs.nameservices
> mycluster
>   
>
>   
> dfs.ha.namenodes.mycluster
> nn1,nn2,nn3
>   
>
>   
> dfs.namenode.rpc-address.mycluster.nn1
> nn1:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn2
> nn2:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn3
> nn3:8020
>   
>
>   
> dfs.namenode.http-address.mycluster.nn1
> nn1:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn2
> nn2:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn3
> nn3:9870
>   
>
>   
> dfs.namenode.shared.edits.dir
> qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster
>   
>   
> dfs.client.failover.proxy.provider.mycluster
>
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>   
>
>   
> dfs.ha.fencing.methods
> sshfence
>   
>
>   
> dfs.ha.fencing.ssh.private-key-files
> /home/harry/.ssh/id_rsa
>   
>
>   
> dfs.namenode.name.dir
> file:/hadoop/data/hdfs/namenode
>   
>   
> dfs.datanode.data.dir
> file:/hadoop/data/hdfs/datanode
>   
>   
> dfs.journalnode.edits.dir
> /hadoop/data/hdfs/journalnode
>   
>   
> dfs.namenode.rpc-address
> nn1:8020
>   
>
>   
> dfs.ha.nn.not-become-active-in-safemode
> true
>   
>
> 
>
>
>
> In core-site.xml I have this
>
> 
>
> 
>
> 
>
>
> 
>
>
> 
>
>   
>
> fs.defaultFS
>
> hdfs://nn1:8020
>
>   
>
>
> 
>
>
> On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui <
> anyone.cui...@gmail.com> wrote:
>
>
> Can you show us the configuration files?
> Maybe I can help you with some suggestions.
>
>
> On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison
>  wrote:
>
> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
> 
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing
> wrong here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no
> edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 

Re: HDFS HA namenode issue

2023-10-03 Thread Ayush Saxena
> Or do I just have it there mistakenly?

Yes, It should be in core-site.xml

It is there in the HA doc
```

fs.defaultFS - the default path prefix used by the Hadoop FS client
when none is given

Optionally, you may now configure the default path for Hadoop clients
to use the new HA-enabled logical URI. If you used “mycluster” as the
nameservice ID earlier, this will be the value of the authority
portion of all of your HDFS paths. This may be configured like so, in
your core-site.xml file:

```

-Ayush
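
A rough, untested sketch of what this looks like from the client side, reusing
the "mycluster"/nn1-nn3 names from this thread (the class name and the check
against "/" are made up for illustration): once fs.defaultFS in core-site.xml
carries the logical nameservice URI, a plain client never has to name a single
NameNode and is routed to whichever one is active by the configured failover
proxy provider.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaDefaultFsCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml and hdfs-site.xml from HADOOP_CONF_DIR / classpath.
    Configuration conf = new Configuration();

    // With a consistent HA setup this should print the logical URI
    // (hdfs://mycluster), not a single NameNode address such as hdfs://nn1:8020.
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

    // The logical URI is resolved through
    // dfs.client.failover.proxy.provider.mycluster, so this call works
    // regardless of which NameNode is currently active.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println("connected to " + fs.getUri()
        + ", / exists: " + fs.exists(new Path("/")));
  }
}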

On Tue, 3 Oct 2023 at 13:58, Harry Jamison
 wrote:
>
> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2 
> different places?
> Or do I just have it there mistakenly?
>
> this is what I have in hdfs-site.xml
>
> 
> 
> 
>   
>   fs.defaultFS
>   hdfs://mycluster
>
>   
> ha.zookeeper.quorum
> nn1:2181,nn2:2181,nn3:2181
>   
>
>   
> dfs.nameservices
> mycluster
>   
>
>   
> dfs.ha.namenodes.mycluster
> nn1,nn2,nn3
>   
>
>   
> dfs.namenode.rpc-address.mycluster.nn1
> nn1:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn2
> nn2:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn3
> nn3:8020
>   
>
>   
> dfs.namenode.http-address.mycluster.nn1
> nn1:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn2
> nn2:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn3
> nn3:9870
>   
>
>   
> dfs.namenode.shared.edits.dir
> qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster
>   
>   
> dfs.client.failover.proxy.provider.mycluster
> 
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>   
>
>   
> dfs.ha.fencing.methods
> sshfence
>   
>
>   
> dfs.ha.fencing.ssh.private-key-files
> /home/harry/.ssh/id_rsa
>   
>
>   
> dfs.namenode.name.dir
> file:/hadoop/data/hdfs/namenode
>   
>   
> dfs.datanode.data.dir
> file:/hadoop/data/hdfs/datanode
>   
>   
> dfs.journalnode.edits.dir
> /hadoop/data/hdfs/journalnode
>   
>   
> dfs.namenode.rpc-address
> nn1:8020
>   
>
>   
> dfs.ha.nn.not-become-active-in-safemode
> true
>   
>
> 
>
>
>
> In core-site.xml I have this
>
> 
>
> 
>
> 
>
>
> 
>
>
> 
>
>   
>
> fs.defaultFS
>
> hdfs://nn1:8020
>
>   
>
>
> 
>
>
>
> On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui 
>  wrote:
>
>
> Can you show us the configuration files?
> Maybe I can help you with some suggestions.
>
>
> On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison 
>  wrote:
>
> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode 
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal 
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing wrong 
> here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870 
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM 
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> / 
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no edit 
> logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority   

Re: HDFS HA namenode issue

2023-10-03 Thread Harry Jamison
Liming

After looking at my config, I think that maybe my problem is because my
fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml.
What do hdfs-site.xml and core-site.xml each do, and why is the same setting
in 2 different places? Or do I just have it there mistakenly?
this is what I have in hdfs-site.xml
        
fs.defaultFS
hdfs://mycluster

ha.zookeeper.quorum
nn1:2181,nn2:2181,nn3:2181

dfs.nameservices
mycluster

dfs.ha.namenodes.mycluster
nn1,nn2,nn3

dfs.namenode.rpc-address.mycluster.nn1
nn1:8020

dfs.namenode.rpc-address.mycluster.nn2
nn2:8020

dfs.namenode.rpc-address.mycluster.nn3
nn3:8020

dfs.namenode.http-address.mycluster.nn1
nn1:9870

dfs.namenode.http-address.mycluster.nn2
nn2:9870

dfs.namenode.http-address.mycluster.nn3
nn3:9870

dfs.namenode.shared.edits.dir
qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster

dfs.client.failover.proxy.provider.mycluster
org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

dfs.ha.fencing.methods
sshfence

dfs.ha.fencing.ssh.private-key-files
/home/harry/.ssh/id_rsa

dfs.namenode.name.dir
file:/hadoop/data/hdfs/namenode

dfs.datanode.data.dir
file:/hadoop/data/hdfs/datanode

dfs.journalnode.edits.dir
/hadoop/data/hdfs/journalnode

dfs.namenode.rpc-address
nn1:8020

dfs.ha.nn.not-become-active-in-safemode
true



In core-site.xml I have this
















  

    fs.defaultFS

    hdfs://nn1:8020

  







On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui 
 wrote:  
 
 Can you show us the configuration files? Maybe I can help you with some 
suggestions.

On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison  
wrote:

I am trying to setup a HA HDFS cluster, and I am running into a problem
I am not sure what I am doing wrong, I thought I followed the HA namenode 
guide, but it is not working.

Apache Hadoop 3.3.6 – HDFS High Availability


I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
After some period of time I see the following and my namenode and journal node
die. I am not sure where the problem is, or how to diagnose what I am doing
wrong here.  And the logging here does not make sense to me.

Namenode
Serving checkpoints at http://nn1:9870
(org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
real-time non-blocking time  (microseconds, -R) unlimited

core file size              (blocks, -c) 0

data seg size               (kbytes, -d) unlimited

scheduling priority                 (-e) 0

file size                   (blocks, -f) unlimited

pending signals                     (-i) 15187

max locked memory           (kbytes, -l) 8192

max memory size             (kbytes, -m) unlimited

open files                          (-n) 1024

pipe size                (512 bytes, -p) 8

POSIX message queues         (bytes, -q) 819200

real-time priority                  (-r) 0

stack size                  (kbytes, -s) 8192

cpu time                   (seconds, -t) unlimited

max user processes                  (-u) 15187

virtual memory              (kbytes, -v) unlimited

file locks                          (-x) unlimited

[2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM 
(org.apache.hadoop.hdfs.server.namenode.NameNode)

[2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG: 

/

SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159

/ 
(org.apache.hadoop.hdfs.server.namenode.NameNode)

JournalNode
[2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no edit
logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
real-time non-blocking time  (microseconds, -R) unlimited

core file size              (blocks, -c) 0

data seg size               (kbytes, -d) unlimited

scheduling priority                 (-e) 0

file size                   (blocks, -f) unlimited

pending signals                     (-i) 15187

max locked memory           (kbytes, -l) 8192

max memory size             (kbytes, -m) unlimited

open files                          (-n) 1024

pipe size                (512 bytes, -p) 8

POSIX message queues         (bytes, -q) 819200

real-time priority                  (-r) 0

stack size                  (kbytes, -s) 8192

cpu time                   (seconds, -t) unlimited

max user processes                  (-u) 15187

virtual memory              (kbytes, -v) unlimited

file locks                          (-x) unlimited





-- 
Best
Liming  

Re: HDFS HA namenode issue

2023-10-03 Thread Liming Cui
Can you show us the configuration files?
Maybe I can help you with some suggestions.


On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison
 wrote:

> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
> 
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing
> wrong here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no
> edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
>
>

-- 
*Best*

Liming


Re: Compare hadoop and ytsaurus

2023-09-29 Thread Susheel Kumar Gadalay
Why still invest in these old technologies? Any reasons except for not being
able to migrate to cloud because of non-availability and data residency
requirements?

How much is Hadoop data compatibility (Parquet and HBase data), code
compatibility of UDFs, metastore migration, etc.?

Thanks
Susheel Kumar

On Fri, 29 Sep 2023, 5:31 pm Roman Shaposhnik,  wrote:

> On Thu, Sep 28, 2023 at 7:31 PM Kirill  wrote:
> >
> > Hi everyone!
> >
> > Have you seen this platform https://ytsaurus.tech/platform-overview ?
>
> Yes ;-) I was pretty involved in a few Open Source projects that came
> out of Yandex recently.
>
> > What do you think? Has somebody tried it?
>
> Their CLA is weird in a sense that it is explicitly governed by the
> Russian law AND
> it used to contain some really weird language about "not harming interests
> of
> Russian federation" or some such -- I checked it now and that language
> seems
> to be gone from the CLA -- so maybe they wised up and stopped putting crazy
> statements into these documents.
>
> Another question about signing the CLA with them (especially if you're
> in the US or EU)
> is whether that would be problematic from a sanctions perspective.
>
> Again, they were supposed to move all that into their non-Russian
> entity, but it looks
> like that move hasn't happened yet.
>
> Other than that -- it is a decent C++ code base.
>
> > Is it based on Hadoop source code?
>
> No. Absolutely not.
>
> > It is claimed that there is also a MapReduce in it.
>
> Yeah, but their own version.
>
> > Is it possible to run Hadoop programs and Hive queries on ytsaurus?
>
> I would be surprised if Hive worked on it.
>
> Thanks,
> Roman.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: Compare hadoop and ytsaurus

2023-09-29 Thread Roman Shaposhnik
On Thu, Sep 28, 2023 at 7:31 PM Kirill  wrote:
>
> Hi everyone!
>
> Have you seen this platform https://ytsaurus.tech/platform-overview ?

Yes ;-) I was pretty involved in a few Open Source projects that came
out of Yandex recently.

> What do you think? Has somebody tried it?

Their CLA is weird in a sense that it is explicitly governed by the
Russian law AND
it used to contain some really weird language about "not harming interests of
Russian federation" or some such -- I checked it now and that language seems
to be gone from the CLA -- so maybe they wised up and stopped putting crazy
statements into these documents.

Another question about signing the CLA with them (especially if you're
in the US or EU)
is whether that would be problematic from a sanctions perspective.

Again, they were supposed to move all that into their non-Russian
entity, but it looks
like that move hasn't happened yet.

Other than that -- it is a decent C++ code base.

> Is it based on Hadoop source code?

No. Absolutely not.

> It is claimed that there is also a MapReduce in it.

Yeah, but their own version.

> Is it possible to run Hadoop programs and Hive queries on ytsaurus?

I would be surprised if Hive worked on it.

Thanks,
Roman.

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Compare hadoop and ytsaurus

2023-09-28 Thread Wei-Chiu Chuang
Hey Kirill,

Thanks for sharing! I wasn't aware of this project. According to the blog
post
https://medium.com/yandex/ytsaurus-exabyte-scale-storage-and-processing-system-is-now-open-source-42e7f5fa5fc6
It was released in public earlier this year by Yandex.

It was inspired by Google's MapReduce, so it has the same root as Hadoop
but I don't think they use the same code. Looks like a very mature project
with more than 60 thousand commits in the repo.

Maybe I'll put it this way, an entire Hadoop ecosystem in a parallel
universe. (Hats off to YTsaurus developers). It's got its own scheduler
similar to YARN, dynamic table support like HBase, query engine similar to
Hive, consensus protocol similar to Raft (we have Apache Zookeeper and
Ratis)


On Thu, Sep 28, 2023 at 1:46 AM Kirill  wrote:

> Hi everyone!
>
> Have you seen this platform https://ytsaurus.tech/platform-overview ?
> What do you think? Has somebody tried it?
> Is it based on Hadoop source code? It is claimed that there is also a
> MapReduce in it.
> Is it possible to run Hadoop programs and Hive queries on ytsaurus?
>
>
>
> Regards,
> Kirill
>


Re: Deploy multi-node Hadoop with Docker

2023-09-22 Thread Ayush Saxena
Hi Nikos,
I think you are talking about the documentation in the overview
section of the docker image: https://hub.docker.com/r/apache/hadoop

I just wrote that 2-3 Months back particularly for dev purposes not
for any prod use case, you should change those values accordingly. The
docker-compose file I copied from
https://github.com/apache/hadoop/blob/docker-hadoop-3/docker-compose.yaml

-Ayush

On Fri, 22 Sept 2023 at 22:28, Nikos Spanos  wrote:
>
> Hi,
>
>
>
> I am creating a multi-node Hadoop cluster for a personal project, and I would 
> like to use the official docker image (apache/hadoop).
>
>
>
> However, looking at the official docker image documentation and the 
> docker-compose file I have seen the following environment variable:
>
>
>
> environment:
>
>   ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
>
>
>
> I would like to know if it is safe to create the namenode directory in the 
> /tmp folder since this kind of folder is neither secure nor data persistent. 
> Thus, I would like to understand which path is the best practice for this. 
> Moreover, what other environment variables could I make use of?
>
>
>
> Thanks a lot, in advance.
>
>
>
> Kind regards,
>
>
>
> Nikos Spanos
>
>
>
> M.Sc Business Analytics & Big Data| Athens University of Economics & Business
>
> Phone Number: +306982310494
>
> Linkedin profile
>
>

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Deploy multi-node Hadoop with Docker

2023-09-22 Thread Wei-Chiu Chuang
Hadoop's docker image is not for production use. That's why.

But we should update that if people are thinking to use it for production.
Not familiar with docker compose but contributions welcomed:
https://github.com/apache/hadoop/blob/docker-hadoop-3/docker-compose.yaml

On Fri, Sep 22, 2023 at 5:44 AM Nikos Spanos 
wrote:

> Hi,
>
>
>
> I am creating a multi-node Hadoop cluster for a personal project, and I
> would like to use the official docker image (apache/hadoop
> ).
>
>
>
> However, looking at the official docker image documentation and the
> docker-compose file I have seen the following environment variable:
>
>
>
> environment:
>
>   ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
>
>
>
> I would like to know if it is safe to create the namenode directory in the
> /tmp folder since this kind of folder is neither secure nor data
> persistent. Thus, I would like to understand which path is the best
> practice for this. Moreover, what other environment variables could I make
> use of?
>
>
>
> Thanks a lot, in advance.
>
>
>
> Kind regards,
>
>
>
> *Nikos Spanos*
>
>
>
> M.Sc Business Analytics & Big Data| Athens University of Economics &
> Business
>
> Phone Number: +306982310494
>
> Linkedin profile  
>
>
>


Re:

2023-06-15 Thread Ayush Saxena
Well sending this unsubscribe won’t do anything, send a mail to:

user-unsubscr...@hadoop.apache.org

And for any other individual, if you want to unsubscribe, the above mail id
does that. Not this one!!!

It is mentioned over here as well:
https://hadoop.apache.org/mailing_lists.html

-Ayush

On 15-Jun-2023, at 12:46 PM, 陈伟  wrote:


unsubscribe


Re: Hadoop execution failure

2023-05-04 Thread Ayush Saxena
What is the bug here? Connection reset by peer, mostly n/w issue or the client
aborted the connection. What were you executing? Is this intermittent? What is
the state of the task that you ran? Is it happening for all operations or few?

Mostly this ain't a bug but some issue with your cluster.

-Ayush

Sent from my iPhone

On 05-May-2023, at 7:42 AM, 马福辰  wrote:

I found a bug when executing the hadoop in namenode, the version is 3.3.2. The
namenode throws the following trace:

```
2023-03-27 17:35:23,759 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 9000: readAndProcess from client 192.168.101.162:56078 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:245)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:370)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:3639)
        at org.apache.hadoop.ipc.Server.access$2600(Server.java:144)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:2262)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:1449)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1304)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1275)
2023-03-27 17:35:24,271 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 9000: readAndProcess from client 192.168.101.162:56084 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:245)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:370)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:3639)
        at org.apache.hadoop.ipc.Server.access$2600(Server.java:144)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:2262)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:1449)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1304)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1275)
```


Re: Question about getSchema method in SFTPFileSystem

2023-04-21 Thread Chris Nauroth
SFTPFileSystem was introduced in HADOOP-5732. I don't see any discussion
there about the getScheme() implementation, so this might not have been an
intentional design choice. I think it's a bug.

Are you interested in contributing a patch?
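
For what it's worth, a hypothetical sketch of the kind of one-method patch
being discussed (mirroring what sibling implementations such as FTPFileSystem
do, which reports "ftp"); this is only an illustration, not a reviewed change:

// Hypothetical addition to org.apache.hadoop.fs.sftp.SFTPFileSystem. Without
// an override, getScheme() falls through to the FileSystem base class, which
// throws UnsupportedOperationException instead of reporting "sftp".
@Override
public String getScheme() {
  return "sftp";
}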

Chris Nauroth


On Thu, Apr 20, 2023 at 6:00 AM Wenqi Ma  wrote:

>
> Dear Hadoop Team,
>
> I am a developer working with the Hadoop platform, and I recently noticed
> that the *SFTPFileSystem *class in Hadoop 3.x extends the FileSystem
> class and should therefore inherit the *getScheme *method. However, I
> also noticed that there is no *getScheme *override specifically in
> *SFTPFileSystem*.
>
> I am curious about the reason for this design decision, and whether there
> are any specific considerations or use cases that led to the exclusion of
> this method in *SFTPFileSystem*. Would it be possible to provide some
> insight into this decision, or perhaps point me to any relevant
> documentation or discussions on this topic?
>
> Thank you for your time and help.
>
> Sincerely,
> Wenqi Ma
>


Re: Why is this node shutting down?

2023-03-01 Thread Douglas A. Whitfield
Thanks!


On Wed, 1 Mar 2023 at 16:54, Ayush Saxena  wrote:

> Not related to hadoop, reach out to hbase ML
>
> -Ayush
>
> On 02-Mar-2023, at 4:17 AM, Douglas A. Whitfield 
> wrote:
>
> 
>
> I can see a call and response between the regionserver and the
> central node, but I don't know why there is a shutdown happening. Do I need
> to raise the log level?
>
> Call:
>
> ./hbase/logs/hbase-ubuntu-regionserver-dmp-central-capacity-node-06-production-v02.log.2023-02-25:2023-02-25
>  06:04:57,538 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
> Installed shutdown hook thread: Shutdownhook:regionserver60020
> ./hbase/logs/hbase-ubuntu-regionserver-dmp-central-capacity-node-06-production-v02.log.2023-02-25:2023-02-25
>  06:50:07,832 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
> Installed shutdown hook thread: Shutdownhook:regionserver60020
> ./hbase/logs/hbase-ubuntu-regionserver-dmp-central-capacity-node-06-production-v02.log.2023-02-25:2023-02-25
>  07:45:37,378 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
> Installed shutdown hook thread: Shutdownhook:regionserver60020
>
> Response:
>
> ./central-node-logs/dmp-central-node-production-v02-westeurope/Feb_25/hbase-ubuntu-master-dmp-central-node-production-v02.log.2023-02-25:2023-02-25
>  06:05:26,334 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of 
> dmp-central-capacity-node-06-production-v02.internal.cloudapp.net,60020,1677014910209
> ./central-node-logs/dmp-central-node-production-v02-westeurope/Feb_25/hbase-ubuntu-master-dmp-central-node-production-v02.log.2023-02-25:2023-02-25
>  06:50:24,394 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of 
> dmp-central-capacity-node-06-production-v02.internal.cloudapp.net,60020,1677323097091
> ./central-node-logs/dmp-central-node-production-v02-westeurope/Feb_25/hbase-ubuntu-master-dmp-central-node-production-v02.log.2023-02-25:2023-02-25
>  07:45:54,575 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of 
> dmp-central-capacity-node-06-production-v02.internal.cloudapp.net,60020,1677325807482
>
>
>


Re: Why is this node shutting down?

2023-03-01 Thread Ayush Saxena
Not related to hadoop, reach out to hbase ML

-Ayush

On 02-Mar-2023, at 4:17 AM, Douglas A. Whitfield  wrote:

I can see a call and response between the regionserver and the central node,
but I don't know why there is a shutdown happening. Do I need to raise the log
level?

Call:

./hbase/logs/hbase-ubuntu-regionserver-dmp-central-capacity-node-06-production-v02.log.2023-02-25:2023-02-25 06:04:57,538 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
./hbase/logs/hbase-ubuntu-regionserver-dmp-central-capacity-node-06-production-v02.log.2023-02-25:2023-02-25 06:50:07,832 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
./hbase/logs/hbase-ubuntu-regionserver-dmp-central-capacity-node-06-production-v02.log.2023-02-25:2023-02-25 07:45:37,378 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020

Response:

./central-node-logs/dmp-central-node-production-v02-westeurope/Feb_25/hbase-ubuntu-master-dmp-central-node-production-v02.log.2023-02-25:2023-02-25 06:05:26,334 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of dmp-central-capacity-node-06-production-v02.internal.cloudapp.net,60020,1677014910209
./central-node-logs/dmp-central-node-production-v02-westeurope/Feb_25/hbase-ubuntu-master-dmp-central-node-production-v02.log.2023-02-25:2023-02-25 06:50:24,394 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of dmp-central-capacity-node-06-production-v02.internal.cloudapp.net,60020,1677323097091
./central-node-logs/dmp-central-node-production-v02-westeurope/Feb_25/hbase-ubuntu-master-dmp-central-node-production-v02.log.2023-02-25:2023-02-25 07:45:54,575 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of dmp-central-capacity-node-06-production-v02.internal.cloudapp.net,60020,1677325807482


Re: Monitoring HDFS filesystem changes

2023-02-15 Thread phiroc
Many thanks, Wei-Chiu.

- Original Message -
From: "Wei-Chiu Chuang" 
To: phi...@free.fr
Cc: user@hadoop.apache.org
Sent: Wednesday, 15 February 2023 16:50:44
Subject: Re: Monitoring HDFS filesystem changes


Use the inotify api 


https://dev-listener.medium.com/watch-for-changes-in-hdfs-800c6fb5481f 

https://github.com/onefoursix/hdfs-inotify-example/blob/master/src/main/java/com/onefoursix/HdfsINotifyExample.java
 





On Wed, Feb 15, 2023 at 1:12 AM < phi...@free.fr > wrote: 


Hello, 
is there an efficient way to monitoring the HDFS Filesystem for owner-right 
changes? 
For instance, let's say the /a/b/c/d HDFS Directory's owner is called user1. 
However, overnight, the owner changed for some unknown reason. 
How can I monitor the /a/b/c/d directory and determine what caused the owner to 
change? 
Many thanks. 
Best regards, 
Philippe 


- 
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org 
For additional commands, e-mail: user-h...@hadoop.apache.org 


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: Monitoring HDFS filesystem changes

2023-02-15 Thread Wei-Chiu Chuang
Use the inotify api

https://dev-listener.medium.com/watch-for-changes-in-hdfs-800c6fb5481f
https://github.com/onefoursix/hdfs-inotify-example/blob/master/src/main/java/com/onefoursix/HdfsINotifyExample.java
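
A bare-bones sketch along the lines of those examples (untested here; it needs
to run as an HDFS superuser, and the hdfs://mycluster URI and class name are
placeholders) that tails the NameNode inotify stream and prints owner changes
as they happen:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class OwnerChangeWatcher {
  public static void main(String[] args) throws Exception {
    HdfsAdmin admin =
        new HdfsAdmin(URI.create("hdfs://mycluster"), new Configuration());
    DFSInotifyEventInputStream stream = admin.getInotifyEventStream();
    while (true) {
      EventBatch batch = stream.take();  // blocks until new edit-log events arrive
      for (Event event : batch.getEvents()) {
        if (event.getEventType() != Event.EventType.METADATA) {
          continue;
        }
        Event.MetadataUpdateEvent m = (Event.MetadataUpdateEvent) event;
        if (m.getMetadataType() == Event.MetadataUpdateEvent.MetadataType.OWNER) {
          // inotify reports what changed, not who changed it; pair this with
          // the HDFS audit log if the caller's identity is needed.
          System.out.println("owner changed on " + m.getPath() + " -> "
              + m.getOwnerName() + ":" + m.getGroupName());
        }
      }
    }
  }
}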


On Wed, Feb 15, 2023 at 1:12 AM  wrote:

> Hello,
> is there an efficient way to monitoring the HDFS Filesystem for
> owner-right changes?
> For instance, let's say the /a/b/c/d HDFS Directory's owner is called
> user1.
> However, overnight, the owner changed for some unknown reason.
> How can I monitor the /a/b/c/d directory and determine what caused the
> owner to change?
> Many thanks.
> Best regards,
> Philippe
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: request open hadoop issues to create Jira tickets

2023-02-15 Thread Xiaoqiao He
Hi Liangrui,

Please offer information as mentioned at link[1]. Thanks.

[1]
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute#HowToContribute-RequestingforaJiraaccount

Best Regards,
- He Xiaoqiao

On Wed, Feb 15, 2023 at 4:41 PM liang...@yy.com  wrote:

>
> hello
>   Can you help create a Jira ticket? I would like to open hadoop issues and
> create related issues, thank you
>
>
> liang...@yy.com
>


Re: Monitoring HDFS filesystem changes

2023-02-15 Thread Ayush Saxena
Hey,
The best I know you can check in the HDFS Audit logs. Just copying a sample
entry,

2023-02-15 14:47:30,679 [IPC Server handler 1 on default port 59514] INFO
 FSNamesystem.audit (FSNamesystem.java:logAuditMessage(8852)) -
allowed=true ugi=ayushsaxena (auth:SIMPLE) ip=localhost/127.0.0.1
cmd=setOwner src=/test dst=null perm=Ayush:Hadoop:rwxr-xr-x proto=rpc

This is what you can get from there
-Ayush

On Wed, 15 Feb 2023 at 14:42,  wrote:

> Hello,
> is there an efficient way to monitoring the HDFS Filesystem for
> owner-right changes?
> For instance, let's say the /a/b/c/d HDFS Directory's owner is called
> user1.
> However, overnight, the owner changed for some unknown reason.
> How can I monitor the /a/b/c/d directory and determine what caused the
> owner to change?
> Many thanks.
> Best regards,
> Philippe
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: Query on HDFS version 3.3.4

2023-02-07 Thread Ayush Saxena
We had to revert it since it broke a lot of downstream stuff, the upgrade patch
had issues. At present we know it requires a Jersey upgrade as well for sure,
which is in a blocked state as well, and not sure what else comes up post that.

So, short answer: it isn't there in the upcoming release, nor does it look like
it will happen anytime soon, and we don't have a fixed timeline for that
either. But we are trying…

-Ayush

On 07-Feb-2023, at 3:14 PM, Deepti Sharma S  wrote:







Hello Team,
 
We have to migrate the Apache Hadoop HDFS Client from 3.3.3 to 3.3.4.
In version 3.3.4, one of its dependencies, Jackson-databind, has been downgraded from 2.13.2 to 2.12.7,
due to this JIRA: https://issues.apache.org/jira/browse/HADOOP-18332 .

Please let us know in which future version of
Apache Hadoop HDFS Client Jackson-databind will be upgraded to 2.13.2 again?
 
 
 
Regards,
Deepti Sharma

PMP® & ITIL 
 





Re: unsubscribe

2023-01-30 Thread Tushar Kapila
Hello

Please stop spamming all of us. If you want to unsubscribe, you're software
folks: Google the instructions and follow them.

*Instructions for this group, to unsubscribe is to send an email to :
user-unsubscr...@hadoop.apache.org *

You do not need to also email all of us by marking this group.

Thank you

On Mon, 30 Jan, 2023, 17:00 Destin Ashwin,  wrote:

> unsubscribe
>
> On Mon, 30 Jan, 2023, 4:58 pm Lake Chang,  wrote:
>
>> unsubscribe
>>
>


Re: unsubscribe

2023-01-30 Thread Destin Ashwin
unsubscribe

On Mon, 30 Jan, 2023, 4:58 pm Lake Chang,  wrote:

> unsubscribe
>


Re: consistency of yarn exclude file

2023-01-04 Thread Chris Nauroth
Yes, I expect that will work (for both
yarn.resourcemanager.nodes.exclude-path and
yarn.resourcemanager.nodes.include-path), using the "s3a://..." scheme to
specify a file in an S3 bucket.

Chris Nauroth


On Tue, Jan 3, 2023 at 11:50 PM Dong Ye  wrote:

> Hi, All:
>
> For resource manager, can we set
> yarn.resourcemanager.nodes.exclude-path to a s3 file, so all 3 resource
> managers can access it. The benefit is that there is no need to sync the
> exclude.xml file. If not, how to sync the file on different HA resource
> managers?
>
> Thanks.
>
> Ref :
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html
>


Re: consistency of yarn exclude file

2023-01-04 Thread Vinod Kumar Vavilapalli
You can do this by pushing the same file to all Resource Managers at the same 
time.

This is either done by (a) admins / ops via something like scp / rsync with the
source file in something like git, or (b) by an installer application that
keeps the source in a DB and pushes to all the nodes.

Thanks
+Vinod 

> On 04-Jan-2023, at 1:18 PM, Dong Ye  wrote:
> 
> Hi, All:
> 
>For resource manager, can we set 
> yarn.resourcemanager.nodes.exclude-path to a s3 file, so all 3 resource 
> managers can access it. The benefit is that there is no need to sync the 
> exclude.xml file. If not, how to sync the file on different HA resource 
> managers?
> 
> Thanks.
> 
> Ref : 
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: stale_status_of_NM_from_standby_RM

2023-01-03 Thread Chris Nauroth
You can only run "yarn rmadmin -refreshNodes" against the active
ResourceManager instance. In an HA deployment, a standby instance would
return a "not active" error if it received this call, and then the client
would failover to the other instance to retry.

The ResourceManagers do not synchronize the state of include/exclude files.

Chris Nauroth


On Wed, Dec 28, 2022 at 11:08 PM Dong Ye  wrote:

> Hi, Chris:
>
> Thank you very much! Yes, I am also concerned with the
> decommissioning of nodemanager in a Resource Manager High Availability
> scenario. In order to decommission a node manager,
>
> Can I add the node manager address to a standby RM exclude.xml and run
> "yarn refreshnodes"? Or I can only do that on an active RM? Do RM's sync
> the exclude/include xml file?
>
> Thanks.
> Have a nice holiday.
>
>
> On Tue, Dec 27, 2022 at 11:44 AM Chris Nauroth 
> wrote:
>
>> Every NodeManager registers and heartbeats to the active ResourceManager
>> instance, which acts as the source of truth for cluster node status. If the
>> active ResourceManager terminates, then another becomes active, and every
>> NodeManager will start a new connection to register and heartbeat with that
>> new active ResourceManager.
>>
>> As such, a standby ResourceManager cannot satisfy requests for node
>> status and instead will redirect to the current active:
>>
>> curl -i '
>> http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
>> '
>> HTTP/1.1 307 Temporary Redirect
>> Date: Tue, 27 Dec 2022 19:28:38 GMT
>> Cache-Control: no-cache
>> Expires: Tue, 27 Dec 2022 19:28:38 GMT
>> Date: Tue, 27 Dec 2022 19:28:38 GMT
>> Pragma: no-cache
>> Content-Type: text/plain;charset=utf-8
>> X-Content-Type-Options: nosniff
>> X-XSS-Protection: 1; mode=block
>> X-Frame-Options: SAMEORIGIN
>> Location:
>> http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
>> Content-Length: 136
>>
>> If it looked like you were able to query a standby, then perhaps you were
>> using a browser or some other client that automatically follows redirects
>> (e.g. curl -L)?
>>
>> The data really would have come from the active though, so you can trust
>> that it's not stale. The only thing you might have to consider is that
>> after a failover, it might take a while before every NodeManager registers
>> with the new ResourceManager.
>>
>> Separately, if you're concerned about divergence of node include/exclude
>> files, you can configure them to be stored at a shared file system (e.g.
>> your preferred cloud object store) to be used by all ResourceManager
>> instances.
>>
>> Chris Nauroth
>>
>>
>> On Sat, Dec 24, 2022 at 6:27 PM Dong Ye  wrote:
>>
>>> Hi, All:
>>>
>>> I have some questions about the state of the node manager. If I use
>>> the rest API
>>>
>>>- http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>>>
>>> to get node manager state from a standby RM,
>>> 1) is it possible that it could be stale?
>>> 2) If it is possible, how long will the node manager state be updated?
>>> 3) Is it possible that the NM state returned from standby RM be very
>>> different from that returned from the active RM? Say one is returning
>>> RUNNING while the other returns DECOMMISSIONED because the local
>>> exclude.xml is very different/diverges?
>>>
>>> Thanks.
>>> Have a good holiday.
>>>
>>


Re: Block missing due to power failure

2022-12-30 Thread Viraj Jasani
Agree. For some reason, if you would not like to use more than one datanode
(let alone datanodes across multiple racks for fault tolerance) for some
non-critical usecase, it's still recommended to use hsync over the output
stream for on-disk persistence (unless the single DN setup is being used
only for some deliberate resilience testing of hflush, and data loss is not
a concern).
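
To make the distinction concrete, a minimal, untested sketch (the
hdfs://mycluster URI, path, and class name are placeholders) of hflush vs
hsync on an HDFS output stream:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HsyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs =
        FileSystem.get(URI.create("hdfs://mycluster"), new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hsync-demo.txt"))) {
      out.write("important record\n".getBytes(StandardCharsets.UTF_8));
      out.hflush(); // visible to new readers, but may still sit in the OS page cache
      out.hsync();  // additionally asks the DataNode to sync the block file to disk
    }               // close() completes the block and finalizes the file
  }
}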


On Fri, Dec 30, 2022 at 9:04 AM Ayush Saxena  wrote:

> The file was in progress? In that case this is possible, once the data
> gets persisted on the disk of the datanode then the data loss ain’t
> possible.
>
> If someone did a hflush and not hsync while writing and the power loss
> happens immediately after that, so in that case also I feel there is a
> possibility that data might get lost post restart.
>
> Rest if the file was complete, then I don’t think in any circumstance data
> should get lost
>
> -Ayush
>
>
> On 30-Dec-2022, at 5:17 PM, hehaore...@gmail.com wrote:
>
> 
>
> Hi,
>
> A 1-replica HDFS cluster with a single DataNode. When the DataNode was
> restarted after power failure, it found a file with a missing block. The
> size of the block and meta files found in the storage path is empty, and
> the last modification time is the power off time. Besides the fact that the
> file is being written, what else could be causing this phenomenon?
>
> I wish you a happy New Year
>
>
>
> Hao He
>
> Sent from Mail for Windows
>
>
> - To
> unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org For additional
> commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: Block missing due to power failure

2022-12-30 Thread Ayush Saxena
The file was in progress? In that case this is possible; once the data gets
persisted on the disk of the datanode then the data loss ain't possible.

If someone did a hflush and not hsync while writing and the power loss happens
immediately after that, then in that case also I feel there is a possibility
that data might get lost post restart.

Rest if the file was complete, then I don't think in any circumstance data
should get lost.

-Ayush

On 30-Dec-2022, at 5:17 PM, hehaore...@gmail.com wrote:

Hi,

A 1-replica HDFS cluster with a single DataNode. When the DataNode was
restarted after power failure, it found a file with a missing block. The size
of the block and meta files found in the storage path is empty, and the last
modification time is the power off time. Besides the fact that the file is
being written, what else could be causing this phenomenon?

I wish you a happy New Year

Hao He

Sent from Mail for Windows

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org



Re: stale_status_of_NM_from_standby_RM

2022-12-28 Thread Dong Ye
Hi, Chris:

Thank you very much! Yes, I am also concerned with the
decommissioning of nodemanager in a Resource Manager High Availability
scenario. In order to decommission a node manager,

Can I add the node manager address to a standby RM exclude.xml and run
"yarn refreshnodes"? Or I can only do that on an active RM? Do RM's sync
the exclude/include xml file?

Thanks.
Have a nice holiday.


On Tue, Dec 27, 2022 at 11:44 AM Chris Nauroth  wrote:

> Every NodeManager registers and heartbeats to the active ResourceManager
> instance, which acts as the source of truth for cluster node status. If the
> active ResourceManager terminates, then another becomes active, and every
> NodeManager will start a new connection to register and heartbeat with that
> new active ResourceManager.
>
> As such, a standby ResourceManager cannot satisfy requests for node status
> and instead will redirect to the current active:
>
> curl -i '
> http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
> '
> HTTP/1.1 307 Temporary Redirect
> Date: Tue, 27 Dec 2022 19:28:38 GMT
> Cache-Control: no-cache
> Expires: Tue, 27 Dec 2022 19:28:38 GMT
> Date: Tue, 27 Dec 2022 19:28:38 GMT
> Pragma: no-cache
> Content-Type: text/plain;charset=utf-8
> X-Content-Type-Options: nosniff
> X-XSS-Protection: 1; mode=block
> X-Frame-Options: SAMEORIGIN
> Location:
> http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
> Content-Length: 136
>
> If it looked like you were able to query a standby, then perhaps you were
> using a browser or some other client that automatically follows redirects
> (e.g. curl -L)?
>
> The data really would have come from the active though, so you can trust
> that it's not stale. The only thing you might have to consider is that
> after a failover, it might take a while before every NodeManager registers
> with the new ResourceManager.
>
> Separately, if you're concerned about divergence of node include/exclude
> files, you can configure them to be stored at a shared file system (e.g.
> your preferred cloud object store) to be used by all ResourceManager
> instances.
>
> Chris Nauroth
>
>
> On Sat, Dec 24, 2022 at 6:27 PM Dong Ye  wrote:
>
>> Hi, All:
>>
>> I have some questions about the state of the node manager. If I use
>> the rest API
>>
>>- http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>>
>> to get node manager state from a standby RM,
>> 1) is it possible that it could be stale?
>> 2) If it is possible, how long will the node manager state be updated?
>> 3) Is it possible that the NM state returned from standby RM be very
>> different from that returned from the active RM? Say one is returning
>> RUNNING while the other returns DECOMMISSIONED because the local
>> exclude.xml is very different/diverges?
>>
>> Thanks.
>> Have a good holiday.
>>
>


Re: stale_status_of_NM_from_standby_RM

2022-12-27 Thread Chris Nauroth
Every NodeManager registers and heartbeats to the active ResourceManager
instance, which acts as the source of truth for cluster node status. If the
active ResourceManager terminates, then another becomes active, and every
NodeManager will start a new connection to register and heartbeat with that
new active ResourceManager.

As such, a standby ResourceManager cannot satisfy requests for node status
and instead will redirect to the current active:

curl -i '
http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
'
HTTP/1.1 307 Temporary Redirect
Date: Tue, 27 Dec 2022 19:28:38 GMT
Cache-Control: no-cache
Expires: Tue, 27 Dec 2022 19:28:38 GMT
Date: Tue, 27 Dec 2022 19:28:38 GMT
Pragma: no-cache
Content-Type: text/plain;charset=utf-8
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Location:
http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
Content-Length: 136

If it looked like you were able to query a standby, then perhaps you were
using a browser or some other client that automatically follows redirects
(e.g. curl -L)?

The data really would have come from the active though, so you can trust
that it's not stale. The only thing you might have to consider is that
after a failover, it might take a while before every NodeManager registers
with the new ResourceManager.

Separately, if you're concerned about divergence of node include/exclude
files, you can configure them to be stored at a shared file system (e.g.
your preferred cloud object store) to be used by all ResourceManager
instances.
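
As a concrete sketch of that last point (the property names are the standard
YARN ones; the paths below are only placeholders), both ResourceManagers can be
pointed at the same node-list files in yarn-site.xml:

  <property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>
    <value>/shared/hadoop-conf/yarn.exclude</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nodes.include-path</name>
    <value>/shared/hadoop-conf/yarn.include</value>
  </property>

After editing the shared exclude file, "yarn rmadmin -refreshNodes" is issued
against the active RM; since every RM reads the same file, the lists cannot
diverge across instances.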

Chris Nauroth


On Sat, Dec 24, 2022 at 6:27 PM Dong Ye  wrote:

> Hi, All:
>
> I have some questions about the state of the node manager. If I use
> the rest API
>
>- http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>
> to get node manager state from a standby RM,
> 1) is it possible that it could be stale?
> 2) If it is possible, how long will the node manager state be updated?
> 3) Is it possible that the NM state returned from standby RM be very
> different from that returned from the active RM? Say one is returning
> RUNNING while the other returns DECOMMISSIONED because the local
> exclude.xml is very different/diverges?
>
> Thanks.
> Have a good holiday.
>


Re: Unsubscribe

2022-12-19 Thread Azir Aliu
Unsubscribe


On Mon, Dec 19, 2022 at 7:26 PM Gabriel James 
wrote:

>
>
> --
>
> *Gabriel James, PhD*
>
> Director
>
>
>
> *Heliase *
>
>
>
> This correspondence is for the named person’s use only. It may contain
> information that is confidential, proprietary or the subject of legal
> privilege. No confidentiality or privilege is waived or lost by any
> mistransmission. If you receive this correspondence in error, please
> immediately delete it from your system and notify the sender. You must not
> disclose, copy or relay any part of this correspondence if you are not the
> intended recipient
>


Re: unsubscribe

2022-12-18 Thread Gabriel James
unsubscribe

On Sat, 17 Dec 2022 at 17:22, Agron Cela  wrote:

> unsubscribe
>


-- 

*Gabriel James, PhD*

Director



*Heliase *



This correspondence is for the named person’s use only. It may contain
information that is confidential, proprietary or the subject of legal
privilege. No confidentiality or privilege is waived or lost by any
mistransmission. If you receive this correspondence in error, please
immediately delete it from your system and notify the sender. You must not
disclose, copy or relay any part of this correspondence if you are not the
intended recipient


Re: Hadoop 2 to Hadoop 3 Rolling Upgrade feasibility

2022-12-15 Thread Nishtha Shah
FYI, We are trying to upgrade from 2.10 to 3.3.

On Fri, Dec 16, 2022 at 10:20 AM Nishtha Shah 
wrote:

> Hi team,
>
> While I am checking on feasible upgrade plans for this major upgrade, A
> quick check if someone was able to perform a successful rolling upgrade
> from Hadoop 2 to Hadoop 3.
>
> I have gone through a couple of articles online which are suggesting to
> opt for Express Upgrade and avoid Rolling upgrade. A quick check if someone
> was able to successfully do a rolling upgrade in recent years.
>
> References I have gone through:
> https://blog.cloudera.com/upgrading-clusters-workloads-hadoop-2-hadoop-3/
>
> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts
>
> https://www.adaltas.com/en/2018/07/25/clusters-workloads-migration-hadoop-2-to-3/
>
> Any leads/responses are appreciated.
>
> --
> Thanks,
> Nishtha Shah
>


-- 
With Regards,
Nishtha Shah


Re: dfs.namenode.blockreport.queue Full of frequently,It may be related to the datanode Capacity

2022-12-01 Thread Ayush Saxena
Hi,
Is it happening regularly? kind of with regular FBR's, in that case you
need to configure your Datanode's block report interval high enough and in
a way that all of them don't bombard the namenode at same time and there is
enough gap between FBR's from the datanodes.
If it is happening with First FBR itself, give a check to the config:
dfs.blockreport.initialDelay it can be used to prevent all datanodes from
shooting requests to the namenode with block reports at the  same time.
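
For reference, a hedged hdfs-site.xml sketch of those two knobs (the values are
only illustrative, not recommendations):

  <property>
    <!-- Interval between full block reports; the default is 6 hours. -->
    <name>dfs.blockreport.intervalMsec</name>
    <value>21600000</value>
  </property>
  <property>
    <!-- Random delay (in seconds) before the first FBR after startup, so the
         datanodes don't all report to the namenode at the same moment. -->
    <name>dfs.blockreport.initialDelay</name>
    <value>600</value>
  </property>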

Long back I had a proposal to optimize or maybe get rid of the FBR's itself
in cases where it wasn't required,we tried a POC and if I remember it was
working, but I never got a chance to contribute it back or explore more
around the challenges. The ML proposal[1] and the Jira stays at[2]

There is a jira chasing to improve the lock time as well in case of FBR[3],
that too isn't merged yet. If you want to explore more, you can get some
stuff at [4], it points to two Jira's as well, give a check if they can
help, if you already don't have them.

-Ayush

[1] https://lists.apache.org/thread/8poctzcmcxk7jmyj8vyb5txwdv6t57lc
[2] https://issues.apache.org/jira/browse/HDFS-15162
[3] https://issues.apache.org/jira/browse/HDFS-14657
[4] https://issues.apache.org/jira/browse/HDFS-14186

On Thu, 1 Dec 2022 at 12:34, 尉雁磊  wrote:

> hi ,In a cluster of thousands of Datanodes, the Capacity of each datanode
> is not exactly the same.  In this case, Datanodes with a large Capacity
> have more blocks and report more blocks to FBR, which takes longer to
> process FBR.  FBR processing is done in a queue (
> dfs.namenode.blockreport.queue.size).  If FBR takes a long time to
> process, FBRIBRQUEUQE will pile up.  Obviously, as more data is added,
> FbribrQueuQE will pile up more frequently, affecting namenode performance.
> Is this a problem or is there any optimization method?
>


Re: subscribe

2022-12-01 Thread Ayush Saxena
Send a mail to user-subscr...@hadoop.apache.org

-Ayush

On 01-Dec-2022, at 11:31 AM, fanyuping [范育萍]  wrote:

Hi Community,

I’d like to subscribe to this mailing list.

Best Regards
Yuping Fan

RE: Vulnerability query on Hadoop

2022-11-29 Thread Deepti Sharma S
Thank you Ayush


Regards,
Deepti Sharma
PMP® & ITIL

From: Ayush Saxena 
Sent: 29 November 2022 16:27
To: Deepti Sharma S 
Cc: user@hadoop.apache.org
Subject: Re: Vulnerability query on Hadoop

Hi Deepti,
The OkHttp one I think got sorted as part of HDFS-16453, It is there in 
Hadoop-3.3.4(Released),
Second, netty is also upgraded as part of HADOOP-18079 and is also there in 
Hadoop-3.3.4, I tried to grep on the dependency tree of 3.3.4 and didn't find 
4.1.42. If you still see it let me know what is pulling that in, we can fix 
that in the next release(3.3.5) next month.

So, ideally an upgrade from hadoop 3.3.3 to 3.3.4 should get things fixed for 
you.

-Ayush

Refs:
https://issues.apache.org/jira/browse/HDFS-16453
https://issues.apache.org/jira/browse/HADOOP-18079


On Tue, 29 Nov 2022 at 09:54, Deepti Sharma S 
mailto:deepti.s.sha...@ericsson.com.invalid>>
 wrote:
Hello Team,

We had a query regarding below High and Critical vulnerability on Hadoop, could 
you please help here.

Query for below mentioned HIGH Vulnerability.

We are having java based HDFS client which uses Hadoop-Common-3.3.3, 
Hadoop-hdfs-3.3.3 and Hadoop-hdfs-client-3.3.3 as it's dependency.
Hadoop-Common and Hadoop-hdfs uses protobuf-java-2.5.0 as dependency.
Hadoop-hdfs-client uses okhttp-2.7.5 as dependency

We got the following high vulnerablilities in protobuf-java using "Anchore 
Grype" and in okhttp using "JFrog Xray".

1. Description : A parsing issue with binary data in protobuf-java core and 
lite versions prior to 3.21.7, 3.20.3, 3.19.6 and 3.16.3 can lead to a denial 
of service attack.
 Inputs containing multiple instances of non-repeated embedded 
messages with repeated or unknown fields causes objects to be converted 
back-n-forth between mutable and immutable forms,
 resulting in potentially long garbage collection pauses. We 
recommend updating to the versions mentioned above.


2. Description : OkHttp contains a flaw that is triggered during the handling 
of non-ASCII ETag headers. This may allow a remote attacker to crash a process 
linked against the library.

3. Description : OkHttp contains a flaw that is triggered during the reading of 
non-ASCII characters in HTTP/2 headers or in cached HTTP headers. This may 
allow a remote attacker to crash a process linked against the library.

What is the impact of these vulnerablilities on HDFS client?
If HDFS Client is impacted then what is the mitigation plan for that?

Query for below mentioned CRITICAL Vulnerability.

We are having java based HDFS client which uses Hadoop-Common-3.3.3 as it's 
dependency. in our application.
Hadoop-Common-3.3.3 uses netty-codec-4.1.42.Final as deep dependency.

We got the following critical vulnerablility in netty-codec using JFrog Xray.

Description : Netty contains an overflow condition in the 
Lz4FrameEncoder::finishEncode() function in 
codec/src/main/java/io/netty/handler/codec/compression/Lz4FrameEncoder.java
that is triggered when compressing data and writing the last header.
This may allow an attacker to cause a buffer overflow, resulting in a denial of 
service or potentially allowing the execution of arbitrary code.

What is the impact of this vulnerablility on HDFS client?
If HDFS Client is impacted then what is the mitigation plan for that?



Regards,
Deepti Sharma
PMP® & ITIL



Re: Vulnerability query on Hadoop

2022-11-29 Thread Ayush Saxena
Hi Deepti,
The OkHttp one I think got sorted as part of HDFS-16453, It is there in
Hadoop-3.3.4(Released),
Second, netty is also upgraded as part of HADOOP-18079 and is also there in
Hadoop-3.3.4, I tried to grep on the dependency tree of 3.3.4 and didn't
find 4.1.42. If you still see it let me know what is pulling that in, we
can fix that in the next release(3.3.5) next month.
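
If you want to verify on your side which artifact (if any) is pulling
netty-codec into the client, a quick hedged check from the module that declares
the Hadoop dependencies:

  # Prints only the dependency paths that lead to netty-codec, if any exist.
  mvn dependency:tree -Dincludes=io.netty:netty-codec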

So, ideally an upgrade from hadoop 3.3.3 to 3.3.4 should get things fixed
for you.

-Ayush

Refs:
https://issues.apache.org/jira/browse/HDFS-16453
https://issues.apache.org/jira/browse/HADOOP-18079


On Tue, 29 Nov 2022 at 09:54, Deepti Sharma S
 wrote:

> Hello Team,
>
>
>
> We had a query regarding below High and Critical vulnerability on Hadoop,
> could you please help here.
>
>
>
> *Query for below mentioned HIGH Vulnerability.*
>
>
>
> We are having java based HDFS client which uses Hadoop-Common-3.3.3,
> Hadoop-hdfs-3.3.3 and Hadoop-hdfs-client-3.3.3 as it's dependency.
>
> Hadoop-Common and Hadoop-hdfs uses protobuf-java-2.5.0 as dependency.
>
> Hadoop-hdfs-client uses okhttp-2.7.5 as dependency
>
>
>
> We got the following high vulnerablilities in protobuf-java using "Anchore
> Grype" and in okhttp using "JFrog Xray".
>
>
>
> 1. Description : A parsing issue with binary data in protobuf-java core
> and lite versions prior to 3.21.7, 3.20.3, 3.19.6 and 3.16.3 can lead to a
> denial of service attack.
>
>  Inputs containing multiple instances of non-repeated
> embedded messages with repeated or unknown fields causes objects to be
> converted back-n-forth between mutable and immutable forms,
>
>  resulting in potentially long garbage collection pauses.
> We recommend updating to the versions mentioned above.
>
>
>
>
>
> 2. Description : OkHttp contains a flaw that is triggered during the
> handling of non-ASCII ETag headers. This may allow a remote attacker to
> crash a process linked against the library.
>
>
>
> 3. Description : OkHttp contains a flaw that is triggered during the
> reading of non-ASCII characters in HTTP/2 headers or in cached HTTP
> headers. This may allow a remote attacker to crash a process linked against
> the library.
>
>
>
> What is the impact of these vulnerablilities on HDFS client?
>
> If HDFS Client is impacted then what is the mitigation plan for that?
>
>
>
> *Query for below mentioned CRITICAL Vulnerability.*
>
>
>
> We are having java based HDFS client which uses Hadoop-Common-3.3.3 as
> it's dependency. in our application.
>
> Hadoop-Common-3.3.3 uses netty-codec-4.1.42.Final as deep dependency.
>
>
>
> We got the following *critical vulnerablility* in netty-codec using JFrog
> Xray.
>
>
>
> *Description* : Netty contains an overflow condition in the
> Lz4FrameEncoder::finishEncode() function in
> codec/src/main/java/io/netty/handler/codec/compression/Lz4FrameEncoder.java
>
> that is triggered when compressing data and writing the last header.
>
> This may allow an attacker to cause a buffer overflow, resulting in a
> denial of service or potentially allowing the execution of arbitrary code.
>
>
>
> What is the impact of this vulnerablility on HDFS client?
>
> If HDFS Client is impacted then what is the mitigation plan for that?
>
>
>
>
>
>
>
> Regards,
>
> Deepti Sharma
> * PMP® & ITIL*
>
>
>


Re: hdfs dfsadmin -printTopology The target of the information may be abnormal

2022-11-10 Thread Ayush Saxena
What you are trying to achieve via that extra parameter can easily be done
using GenericOptions, use the -fs and specify the namenode and port for
which you want to get the results[1]
check the overview [2] here to see how to use them.

the second point doesn't make sense, fetch from all return result for one
and log others, that isn't something doable

-Ayush

[1]
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CommandsManual.html#:~:text=Generic%20Options,-Many%20subcommands%20honor=Use%20value%20for%20given%20property.=Specify%20comma%20separated%20files%20to,Applies%20only%20to%20job.=Specify%20default%20filesystem%20URL%20to%20use
.

[2]
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
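
Concretely, something along these lines (hostnames and ports are placeholders)
queries each namenode for its own view of the topology:

  # Ask a specific namenode directly via the -fs generic option.
  hdfs dfsadmin -fs hdfs://nn1.example.com:8020 -printTopology
  hdfs dfsadmin -fs hdfs://nn2.example.com:8020 -printTopology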



On Thu, 10 Nov 2022 at 16:31, 尉雁磊  wrote:

>
>
> I agree with you, and I wonder if there is anything that can be done to
> help managers look at possible problems in this area
>
> I have two ideas:
>
> 1.  Add a namenodeIp parameter to hdfs dfsadmin-printTopology to obtain
> rack information about the specified namenode.
>
> 2.  Add debug information to the printTopology method of class DFSAdmin.
> However, the command only requests a fixed namenode, and the debug logs of
> the other namenode cannot be printed
>
>
>
>
> At 2022-11-10 18:44:19, "Ayush Saxena"  wrote:
>
> If some sort of debugging is going on which doubts topological
> misconfiguration, you anyway need to check all the namenodes, if one
> namenode is misconfigured and if another is not. Maybe the issue won't
> surface if the properly configured namenode is the Active namenode at that
> time, but one failover can screw things up.
>
> Secondly, checking the topology to triage a potential issue which doubts
> rack misconfiguration just by checking Active namenode itself isn't a
> complete solution, what if when the issue occurred the present standby
> namenode was active then. In such cases anyway you have to check all the
> Namenodes.
>
> Getting Topology from Individual Namenodes is a doable task for any Admin
> & isn't as such difficult. If that wasn't naive to do so, We could have
> explored getting Topology from all the namenodes as part of DebugAdmin
> commands maybe
>
> -Ayush
>
>
> On Thu, 10 Nov 2022 at 15:45, 尉雁磊  wrote:
>
>>
>>
>> So what you are saying is that this is a management issue, not a code
>> issue.  Even if the manager has misdeployed the rack perception of namenode,
>> the manager will not be able to locate the actual problem from the log and
>> will only be able to check whether the deployment operation is correct.
>>
>>
>>
>>
>> At 2022-11-10 17:34:37, "Ayush Saxena"  wrote:
>>
>> In a stable cluster, usually all the datanodes report to all the
>> namenodes and mostly the information would be more or less same in all
>> namenodes. This isn't data which goes stale you might land up in  some
>> mess, and moreover these aren't user commands but Admin commands, it is pre
>> assumed that the admin would be having idea about the system and how it
>> behaves, and there are ways to get this detail from a specific Namenode, it
>> can be done if required, even each namenode UI gives details about the
>> datanode states and so.
>>
>> From the code point of view, I don't think it is a good idea to change or
>> something which is gonna get accepted.
>>
>> -Ayush
>>
>> On Thu, 10 Nov 2022 at 13:53, 尉雁磊  wrote:
>>
>>> hdfs dfsadmin  -printTopology Always get information from this namenode
>>> in the cluster, whether the namenode is active or standby. I don't think
>>> this is normal, this command should always get information from the active
>>> namenode!
>>>
>>


Re: hdfs dfsadmin -printTopology The target of the information may be abnormal

2022-11-10 Thread 尉雁磊






I agree with you, and I wonder if there is anything that can be done to help 
managers look at possible problems in this area 

I have two ideas: 

1.  Add a namenodeIp parameter to hdfs dfsadmin-printTopology to obtain rack 
information about the specified namenode. 

2.  Add debug information to the printTopology method of class DFSAdmin.  
However, the command only requests a fixed namenode, and the debug logs of the 
other namenode cannot be printed










At 2022-11-10 18:44:19, "Ayush Saxena"  wrote:

If some sort of debugging is going on which doubts topological 
misconfiguration, you anyway need to check all the namenodes, if one namenode 
is misconfigured and if another is not. Maybe the issue won't surface if the 
properly configured namenode is the Active namenode at that time, but one 
failover can screw things up.


Secondly, checking the topology to triage a potential issue which doubts rack 
misconfiguration just by checking Active namenode itself isn't a complete 
solution, what if when the issue occurred the present standby namenode was 
active then. In such cases anyway you have to check all the Namenodes. 


Getting Topology from Individual Namenodes is a doable task for any Admin & 
isn't as such difficult. If that wasn't naive to do so, We could have explored 
getting Topology from all the namenodes as part of DebugAdmin commands maybe


-Ayush




On Thu, 10 Nov 2022 at 15:45, 尉雁磊  wrote:








So what you are saying is that this is a management issue, not a code issue.  
Even if the manager has misdeployed the rack perception of namenode, the manager
will not be able to locate the actual problem from the log and will only be
able to check whether the deployment operation is correct.










At 2022-11-10 17:34:37, "Ayush Saxena"  wrote:

In a stable cluster, usually all the datanodes report to all the namenodes and 
mostly the information would be more or less same in all namenodes. This isn't 
data which goes stale you might land up in  some mess, and moreover these 
aren't user commands but Admin commands, it is pre assumed that the admin would 
be having idea about the system and how it behaves, and there are ways to get 
this detail from a specific Namenode, it can be done if required, even each 
namenode UI gives details about the datanode states and so.


From the code point of view, I don't think it is a good idea to change or 
something which is gonna get accepted.


-Ayush


On Thu, 10 Nov 2022 at 13:53, 尉雁磊  wrote:


hdfs dfsadmin  -printTopology Always get information from this namenode in the 
cluster, whether the namenode is active or standby. I don't think this is normal,
this command should always get information from the active namenode!


Re: Re: hdfs dfsadmin -printTopology The target of the information may be abnormal

2022-11-10 Thread Ayush Saxena
If some sort of debugging is going on which doubts topological
misconfiguration, you anyway need to check all the namenodes, if one
namenode is misconfigured and if another is not. Maybe the issue won't
surface if the properly configured namenode is the Active namenode at that
time, but one failover can screw things up.

Secondly, checking the topology to triage a potential issue which doubts
rack misconfiguration just by checking Active namenode itself isn't a
complete solution, what if when the issue occurred the present standby
namenode was active then. In such cases anyway you have to check all the
Namenodes.

Getting Topology from Individual Namenodes is a doable task for any Admin &
isn't as such difficult. If that wasn't naive to do so, We could have
explored getting Topology from all the namenodes as part of DebugAdmin
commands maybe

-Ayush


On Thu, 10 Nov 2022 at 15:45, 尉雁磊  wrote:

>
>
> So what you are saying is that this is a management issue, not a code
> issue.  Even if the manager has misdeployed the rack perception of namenode,
> the manager will not be able to locate the actual problem from the log and
> will only be able to check whether the deployment operation is correct.
>
>
>
>
> At 2022-11-10 17:34:37, "Ayush Saxena"  wrote:
>
> In a stable cluster, usually all the datanodes report to all the namenodes
> and mostly the information would be more or less same in all namenodes.
> This isn't data which goes stale you might land up in  some mess, and
> moreover these aren't user commands but Admin commands, it is pre assumed
> that the admin would be having idea about the system and how it behaves,
> and there are ways to get this detail from a specific Namenode, it can be
> done if required, even each namenode UI gives details about the datanode
> states and so.
>
> From the code point of view, I don't think it is a good idea to change or
> something which is gonna get accepted.
>
> -Ayush
>
> On Thu, 10 Nov 2022 at 13:53, 尉雁磊  wrote:
>
>> hdfs dfsadmin  -printTopology Always get information from this namenode
>> in the cluster, whether the namenode is active or standby. I don't think
>> this is normal, this command should always get information from the active
>> namenode!
>>
>


Re: hdfs dfsadmin -printTopology The target of the information may be abnormal

2022-11-10 Thread Ayush Saxena
In a stable cluster, usually all the datanodes report to all the namenodes
and mostly the information would be more or less same in all namenodes.
This isn't data which goes stale you might land up in  some mess, and
moreover these aren't user commands but Admin commands, it is pre assumed
that the admin would be having idea about the system and how it behaves,
and there are ways to get this detail from a specific Namenode, it can be
done if required, even each namenode UI gives details about the datanode
states and so.

From the code point of view, I don't think it is a good idea to change or
something which is gonna get accepted.

-Ayush

On Thu, 10 Nov 2022 at 13:53, 尉雁磊  wrote:

> hdfs dfsadmin  -printTopology Always get information from this namenode
> in the cluster, whether the namenode is active or standby. I don't think
> this is normal, this command should always get information from the active
> namenode!
>


Re: HDFS space quota exception

2022-11-09 Thread Chris Nauroth
Is this cluster using snapshots? I'm not sure if this completely explains
what you're seeing, but there were several bugs in accounting of space
consumption by snapshots prior to 2.8.0, for example:

https://issues.apache.org/jira/browse/HDFS-7728
https://issues.apache.org/jira/browse/HDFS-9063
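
A couple of hedged commands that can help confirm whether snapshots are in play
and how the quota accounting currently looks (the path is a placeholder):

  # List the snapshottable directories known to the NameNode.
  hdfs lsSnapshottableDir

  # Show name/space quota, remaining quota, and space consumed for the directory.
  hdfs dfs -count -q -h /path/with/quota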

Chris Nauroth


On Tue, Nov 8, 2022 at 7:56 PM hehaore...@gmail.com 
wrote:

> hello
>
> HDFS cluster version 2.7.2, I set a space quota for the directory, but the
> available space is much less than expected, for example, this image has a
> quota of 600T, 31T used space, it should be 500T free space, but it
> actually shows only 132T.
>
> I randomly checked the number of file copies in the directory, and found
> that they were all 3. May I ask what caused this problem? I checked the
> patch of the community, but could not find a good explanation for this
> problem.
>
>
>
> Sent from Mail for Windows
>
>
>


Re: Unsubscribe

2022-11-04 Thread Daniel Cowden
Unsubscribe



On Friday, November 4, 2022 at 12:33:44 AM EDT, 
rajachivuk...@yahoo.com.INVALID  wrote:  
 
 Unsubscribe

Sent from Yahoo Mail on Android 
 
  On Thu, 3 Nov 2022 at 16:34, rajila2008 wrote:   
Unsubscribe
On Sun, 16 Oct, 2022, 4:47 AM Manish Verma,  wrote:

Please unsubscribe me from this account.

  
  

Re: Unsubscribe

2022-11-03 Thread rajachivuk...@yahoo.com.INVALID
Unsubscribe

Sent from Yahoo Mail on Android 
 
  On Thu, 3 Nov 2022 at 16:34, rajila2008 wrote:   
Unsubscribe
On Sun, 16 Oct, 2022, 4:47 AM Manish Verma,  wrote:

Please unsubscribe me from this account.

  


Re: issue when enable gpu isolation

2022-10-31 Thread zxcs
Also when we directly use container-executor command to put something into 
devices.deny, it reports an unexpected operation code.

test@ip:/opt/hadoop-3.3.0$ sudo -U yarn 
/opt/hadoop-3.3.0/bin/container-executor  --module-gpu --container_id 
container_e57_1667177358230_0650_01_01
-excluded_gpus 1,2,3,4,5,6,7
[sudo] password for alpha:
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:1 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:2 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:3 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:4 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:5 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:6 rwm
CGroups: Updating cgroups, path=/sys/fs/cgroup/devices/yarn/container_e57_1667177358230_0650_01_01/devices.deny, value=c 195:7 rwm
Unexpected operation code: -1
Nonzero exit code=3, error message='Invalid command provided'


Thanks,
Xiong


> 2022年10月31日 22:21,zxcs  写道:
> 
> Hi, experts,
> 
> we are using hadoop-3.3.0 and trying using cpu also enable gpu isolation 
> following guide 
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
>  
> 
> 
> but when we start a  yarn job, node manager always failed at unexpected 
> operation code:-1 , could  any experts shed some light here? Thanks in 
> advance!
> 
> (sorry for the picture due, this due to we banned the copy anything from 
> testbed to outside)
> 
> <粘贴的图形-4.tiff>
> 
> 
> 
> here is the yarn-site.xml config 
> 
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> 
> <property>
>   <name>yarn.nodemanager.resource-plugins</name>
>   <value>yarn.io/gpu</value>
> </property>
> 
> and below is container-executor.cfg
>  yarn.nodemanager.linux-container-executor.group=hadoop
> banned.users=root
> min.user.id =500
> allowed.system.users=yarn
> [gpu]
> module.enabled=true
> [cgroups]
> root=/sys/fs/cgroup
> yarn-hierarchy=yarn
> 
> below is the directory of /sys/fs/cgroup
> <粘贴的图形-3.tiff>
> 



Re: unsubscribe

2022-10-29 Thread Chris Nauroth
https://hadoop.apache.org/mailing_lists.html

As described here, you can unsubscribe by sending an email to
user-unsubscr...@hadoop.apache.org. (That's a general pattern for all ASF
mailing lists.)

Chris Nauroth


On Sat, Oct 29, 2022 at 1:14 AM Vara Prasad Beerakam <
mr.b.varapra...@gmail.com> wrote:

> unsubscribe
>


RE: CVE-2022-42889

2022-10-27 Thread Deepti Sharma S
Thank you for sharing the link, however when is the plan to release version 
3.3.5 which has the fix of this CVE?


Regards,
Deepti Sharma
PMP® & ITIL

From: Wei-Chiu Chuang 
Sent: 27 October 2022 21:21
Cc: user@hadoop.apache.org
Subject: Re: CVE-2022-42889


  1.  HADOOP-18497<https://issues.apache.org/jira/browse/HADOOP-18497>

On Thu, Oct 27, 2022 at 4:45 AM Deepti Sharma S 
mailto:deepti.s.sha...@ericsson.com.invalid>>
 wrote:
Hello Team,

As we have received the vulnerability “CVE-2022-42889”. We are using Apache 
Hadoop common 3pp version 3.3.3 which has transitive dependency of Common text.

Do you have any plans to fix this vulnerability in Hadoop next version and when 
is the plan?


Regards,
Deepti Sharma
PMP® & ITIL



Re: CVE-2022-42889

2022-10-27 Thread Wei-Chiu Chuang
   1. HADOOP-18497 


On Thu, Oct 27, 2022 at 4:45 AM Deepti Sharma S
 wrote:

> Hello Team,
>
>
>
> As we have received the vulnerability “CVE-2022-42889”. We are using
> Apache Hadoop common 3pp version 3.3.3 which has transitive dependency of
> Common text.
>
>
>
> Do you have any plans to fix this vulnerability in Hadoop next version and
> when is the plan?
>
>
>
>
>
> Regards,
>
> Deepti Sharma
> * PMP® & ITIL*
>
>
>


Re: HDFS DataNode unavailable

2022-10-25 Thread Chris Nauroth
Hello,

I think broadly there could be 2 potential root cause explanations:

1. Logs are routed to a volume that is too small to hold the expected
logging. You can review configuration settings in log4j.properties related
to the rolling file appender. This determines how large logs can get and
how many of the old rolled files to retain. If the maximum would exceed the
capacity on the volume holding these logs, then you either need to
configure smaller retention or redirect the logs to a larger volume.
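
For point 1, the relevant knobs in a stock Hadoop 2.x log4j.properties look
roughly like this (the values are just an example of capping the
RollingFileAppender, not a recommendation):

  # Cap each rolled log at 256MB and keep at most 20 of them,
  # i.e. roughly 5GB per daemon on the log volume.
  hadoop.log.maxfilesize=256MB
  hadoop.log.maxbackupindex=20
  log4j.appender.RFA.MaxFileSize=${hadoop.log.maxfilesize}
  log4j.appender.RFA.MaxBackupIndex=${hadoop.log.maxbackupindex}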

2. Some error condition caused abnormal log spam. If the log isn't there
anymore, then it's difficult to say what this could have been specifically.
You could keep an eye on logs for the next few days after the restart to
see if there are a lot of unexpected errors.

On a separate note, version 2.7.2 is quite old, released in 2017. It's
missing numerous bug fixes and security patches. I recommend looking into
an upgrade to 2.10.2 in the short term, followed by a plan for getting onto
a currently supported 3.x release.

I hope this helps.

Chris Nauroth


On Mon, Oct 24, 2022 at 11:31 PM hehaore...@gmail.com 
wrote:

> I have an HDFS cluster, version 2.7.2, with two namenodes and three
> datanodes. While uploads the file, an exception is found:
> java.io.IOException: Got error,status message,ack with firstBadLink as
> X:50010.
>
> I noticed that the datanode log is stopped, only datanode.log.1, not
> datanode.log. But the rest of the process logs are normal. The HDFS log
> directory is out of space. I did nothing but restart all the datanodes, and
> HDFS was back to normal.
>
> What's the reason?
>
> Sent from Mail for Windows
>
>
> - To
> unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org For additional
> commands, e-mail: user-h...@hadoop.apache.org
>


Re: Make hadoop not listen on public network interface

2022-10-13 Thread Malcolm McFarland
Hey Pratyush,

If you're talking specifically about YARN, have you tried modifying the
 yarn.resourcemanager.hostname property in yarn-default.xml (at least in
version 2.10.x)?
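
A rough yarn-site.xml sketch of that idea (10.0.0.5 stands in for whatever
private address your IT team provided):

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>10.0.0.5</value>
  </property>
  <!-- Optionally pin the listen address itself instead of the 0.0.0.0 default. -->
  <property>
    <name>yarn.resourcemanager.bind-host</name>
    <value>10.0.0.5</value>
  </property>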

Cheers,
Malcolm McFarland
Cavulus

Malcolm McFarland
Cavulus


This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
unauthorized or improper disclosure, copying, distribution, or use of the
contents of this message is prohibited. The information contained in this
message is intended only for the personal and confidential use of the
recipient(s) named above. If you have received this message in error,
please notify the sender immediately and delete the original message.


On Thu, Oct 13, 2022 at 10:46 AM Pratyush Das  wrote:

> Hi,
>
> My IT administrator asked me to configure Hadoop not to listen on the
> public network interface (and gave me a particular IP address). Could
> someone help me with this?
>
> Regards,
>
> --
> Pratyush Das
>


Re: Performance with large no of files

2022-10-10 Thread Wei-Chiu Chuang
Do you have security enabled?

We did some preliminary benchmarks around webhdfs (i really want to revisit
it again) and with security enabled, a lot of overhead is between client
and KDC (SPENGO). Try run webhdfs using delegation tokens should help
remove that bottleneck.
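
Roughly, the flow looks like this (host, port and renewer are placeholders; the
GETDELEGATIONTOKEN op and the "delegation" query parameter are part of the
WebHDFS REST API):

  # 1. Authenticate once via SPNEGO/KDC and fetch a delegation token.
  curl -s --negotiate -u : \
    "http://namenode.example.com:9870/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=backupuser"

  # 2. Reuse the returned urlString on every subsequent request, skipping the KDC round trips.
  curl -s "http://namenode.example.com:9870/webhdfs/v1/data/file1?op=OPEN&delegation=<token-urlString>"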

On Sat, Oct 8, 2022 at 8:26 PM Abhishek  wrote:

> Hi,
> We want to backup large no of hadoop small files (~1mn) with webhdfs API
> We are getting a performance bottleneck here and it's taking days to back
> it up.
> Anyone know any solution where performance could be improved using any xml
> settings?
> This would really help us.
> v 3.1.1
>
> Appreciate your help !!
>
> --
>
>
>
>
>
>
>
>
>
>
>
>
>
> ~
> *Abhishek...*
>


Re: Performance with large no of files

2022-10-08 Thread Brahma Reddy Battula
Not sure, what's your backup approach.  One option can be archiving[1] the
files which were done for yarn logs[2].
To Speed on this, you can write one mapreduce job for archiving the files.
Please refer to the Document for sample mapreduce[3].


1.https://hadoop.apache.org/docs/stable/hadoop-archives/HadoopArchives.html
2.
https://hadoop.apache.org/docs/stable/hadoop-archive-logs/HadoopArchiveLogs.html
3.
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
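
For reference, a hedged example of the archive command itself (paths are made
up); it runs as a MapReduce job, so it parallelizes far better than copying the
small files one by one:

  # Pack everything under /data/small-files into a single HAR in /backup/archives.
  hadoop archive -archiveName backup-2022-10.har -p /data/small-files /backup/archives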

On Sun, Oct 9, 2022 at 9:22 AM Ayush Saxena  wrote:

> Using DistCp is the only option AFAIK. Distcp does support webhdfs, then
> try playing with the number of mappers and so to tune it for better
> performance
>
> -Ayush
>
>
> On 09-Oct-2022, at 8:56 AM, Abhishek  wrote:
>
> 
> Hi,
> We want to backup large no of hadoop small files (~1mn) with webhdfs API
> We are getting a performance bottleneck here and it's taking days to back
> it up.
> Anyone know any solution where performance could be improved using any xml
> settings?
> This would really help us.
> v 3.1.1
>
> Appreciate your help !!
>
> --
>
>
>
>
>
>
>
>
>
>
>
>
>
> ~
> *Abhishek...*
>
>


Re: Performance with large no of files

2022-10-08 Thread Ayush Saxena
Using DistCp is the only option AFAIK. Distcp does support webhdfs, then try 
playing with the number of mappers and so to tune it for better performance
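
As a concrete, hedged starting point (hosts and paths are placeholders), reading
the source over WebHDFS with a larger map count:

  # 50 mappers; tune -m (and optionally -bandwidth) based on how the NameNode copes.
  hadoop distcp -m 50 webhdfs://src-nn.example.com:9870/data hdfs://dst-nn.example.com:8020/backup/data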

-Ayush


> On 09-Oct-2022, at 8:56 AM, Abhishek  wrote:
> 
> 
> Hi,
> We want to backup large no of hadoop small files (~1mn) with webhdfs API
> We are getting a performance bottleneck here and it's taking days to back it 
> up.
> Anyone know any solution where performance could be improved using any xml 
> settings?
> This would really help us.
> v 3.1.1
> 
> Appreciate your help !!
> 
> -- 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ~
> Abhishek...


Re: Communicating between yarn and tasks after delegation token renewal

2022-10-08 Thread Vinod Kumar Vavilapalli
There’s no way to do that.

Once YARN launches containers, it doesn’t communicate with them for anything 
after that. The tasks / containers can obviously always reach out to YARN 
services. But even that in this case is not helpful because YARN never exposes 
through APIs what it is doing with the tokens or when it is renewing them.

What is it that you are doing? What new information are you trying to share 
with the tasks? What framework is this? A custom YARN app or MapReduce / Tez / 
Spark / Flink etc..? 

Thanks
+Vinod

> On Oct 7, 2022, at 10:40 PM, Julien Phalip  wrote:
> 
> Hi,
> 
> IIUC, when a distributed job is started, Yarn first obtains a delegation 
> token from the target resource, then securely pushes the delegation token to 
> the individual tasks. If the job lasts longer than a given period of time, 
> then Yarn renews the delegation token (or more precisely, extends its 
> lifetime), therefore allowing the tasks to continue using the delegation 
> token. This is based on the assumption that the delegation token itself is 
> static and doesn't change (only its lifetime can be extended on the target 
> resource's server).
> 
> I'm building a custom service where I'd like to share new information with 
> the tasks once the delegation token has been renewed. Is there a way to let 
> Yarn push new information to the running tasks right after renewing the token?
> 
> Thanks,
> 
> Julien


