Re: Performance with large no of files

2022-10-08 Thread Brahma Reddy Battula
Not sure what your backup approach is. One option is archiving[1] the
files, as was done for YARN logs[2].
To speed this up, you can write a MapReduce job to archive the files.
Please refer to the tutorial for a sample MapReduce job[3].


1. https://hadoop.apache.org/docs/stable/hadoop-archives/HadoopArchives.html
2. https://hadoop.apache.org/docs/stable/hadoop-archive-logs/HadoopArchiveLogs.html
3. https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
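As a rough sketch of option [1] (all paths and the archive name below are
placeholders, not from this thread), a HAR archive packs a directory of small
files into a few large HDFS files, and the contents stay readable in place:

```shell
# Pack the small-files directory (under parent /data) into one Hadoop
# Archive (HAR) written to /backups. The archive job runs as a MapReduce
# job, so the copy itself is parallelized.
hadoop archive -archiveName backup-2022-10.har \
    -p /data small-files /backups

# Archived files remain readable without unpacking, via the har:// scheme:
hadoop fs -ls har:///backups/backup-2022-10.har
```

Note that a HAR is immutable once created, so this suits periodic snapshot-style
backups rather than incremental ones.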

On Sun, Oct 9, 2022 at 9:22 AM Ayush Saxena  wrote:

> Using DistCp is the only option AFAIK. DistCp does support webhdfs, so
> try playing with the number of mappers and so on to tune it for better
> performance.
>
> -Ayush
>
>
> On 09-Oct-2022, at 8:56 AM, Abhishek  wrote:
>
> 
> Hi,
> We want to back up a large number of small Hadoop files (~1 million) with
> the webhdfs API.
> We are hitting a performance bottleneck here, and it's taking days to back
> them up.
> Does anyone know of a solution, such as any XML settings, that could
> improve performance?
> This would really help us.
> v 3.1.1
>
> Appreciate your help !!
>
> --
> ~
> *Abhishek...*
>
>


Re: Performance with large no of files

2022-10-08 Thread Ayush Saxena
Using DistCp is the only option AFAIK. DistCp does support webhdfs, so try 
playing with the number of mappers and so on to tune it for better performance.
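For illustration (hostnames, ports and paths below are made up), a tuned run
might look like:

```shell
# Sketch: back up over webhdfs with DistCp, raising the mapper count from
# the default of 20. With ~1M small files the job is dominated by per-file
# overhead, so more mappers usually help until the NameNode becomes the
# bottleneck. -strategy dynamic lets fast mappers pick up remaining work,
# which evens out skew across map tasks.
hadoop distcp \
    -m 100 \
    -strategy dynamic \
    webhdfs://source-nn:9870/data \
    hdfs://backup-nn:8020/backup/data
```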

-Ayush


> On 09-Oct-2022, at 8:56 AM, Abhishek  wrote:
> 
> 
> Hi,
> We want to back up a large number of small Hadoop files (~1 million) with 
> the webhdfs API.
> We are hitting a performance bottleneck here, and it's taking days to back 
> them up.
> Does anyone know of a solution, such as any XML settings, that could 
> improve performance?
> This would really help us.
> v 3.1.1
> 
> Appreciate your help !!
> 
> -- 
> ~
> Abhishek...


Performance with large no of files

2022-10-08 Thread Abhishek
Hi,
We want to back up a large number of small Hadoop files (~1 million) with
the webhdfs API.
We are hitting a performance bottleneck here, and it's taking days to back
them up.
Does anyone know of a solution, such as any XML settings, that could improve
performance?
This would really help us.
v 3.1.1

Appreciate your help !!

-- 
~
*Abhishek...*


Re: Communicating between yarn and tasks after delegation token renewal

2022-10-08 Thread Vinod Kumar Vavilapalli
There’s no way to do that.

Once YARN launches containers, it doesn't communicate with them for anything 
after that. The tasks / containers can obviously always reach out to YARN 
services. But even that is not helpful in this case, because YARN never exposes 
through its APIs what it is doing with the tokens or when it renews them.

What is it that you are doing? What new information are you trying to share 
with the tasks? What framework is this? A custom YARN app, or MapReduce / Tez / 
Spark / Flink etc.?

Thanks
+Vinod

> On Oct 7, 2022, at 10:40 PM, Julien Phalip  wrote:
> 
> Hi,
> 
> IIUC, when a distributed job is started, Yarn first obtains a delegation 
> token from the target resource, then securely pushes the delegation token to 
> the individual tasks. If the job lasts longer than a given period of time, 
> then Yarn renews the delegation token (or more precisely, extends its 
> lifetime), therefore allowing the tasks to continue using the delegation 
> token. This is based on the assumption that the delegation token itself is 
> static and doesn't change (only its lifetime can be extended on the target 
> resource's server).
> 
> I'm building a custom service where I'd like to share new information with 
> the tasks once the delegation token has been renewed. Is there a way to let 
> Yarn push new information to the running tasks right after renewing the token?
> 
> Thanks,
> 
> Julien