[
https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sailee Jain updated HIVE-16999:
-------------------------------
Description:
A performance bottleneck was found when adding a resource residing on HDFS to the
distributed cache.
Commands used:
1. ADD ARCHIVE "{color:#d04437}hdfs{color}://some_dir/archive.tar"
2. ADD FILE "{color:#d04437}hdfs{color}://some_dir/file.txt"
Here is the log corresponding to the ADD ARCHIVE operation:
{noformat}
converting to local hdfs://some_dir/archive.tar
Added resources: [hdfs://some_dir/archive.tar]
{noformat}
Hive downloads the resource to the local filesystem (shown in the log by
"converting to local").
Ideally there is no need to bring the file to the local filesystem, since this
operation only copies the file from one HDFS location to another (the
distributed cache).
This localization becomes a significant bottleneck when the resource is a big
file and many commands need the same resource.
After debugging, the impacted piece of code was found to be:
{noformat}
public List<String> add_resources(ResourceType t, Collection<String> values,
    boolean convertToUnix) throws RuntimeException {
  Set<String> resourceSet = resourceMaps.getResourceSet(t);
  Map<String, Set<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
  Map<String, Set<String>> reverseResourcePathMap =
      resourceMaps.getReverseResourcePathMap(t);
  List<String> localized = new ArrayList<String>();
  try {
    for (String value : values) {
      String key;
      {color:#d04437}// get the local path of downloaded jars{color}
      List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
      ...
{noformat}
{noformat}
List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix)
    throws URISyntaxException, IOException {
  URI uri = createURI(value);
  if (getURLType(value).equals("file")) {
    return Arrays.asList(uri);
  } else if (getURLType(value).equals("ivy")) {
    return dependencyResolver.downloadDependencies(uri);
  } else { // goes here for HDFS
    // When the resource is not local, it is downloaded to the local machine.
    return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
  }
}
{noformat}
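One possible direction (a minimal standalone sketch, not the actual Hive fix; `needsLocalization` is a hypothetical helper) is to inspect the URI scheme and skip the download step entirely when the resource already lives on HDFS, so the original hdfs:// URI can be handed to the distributed cache as-is:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ResourceLocalizationSketch {
  /**
   * Hypothetical helper: decide whether an ADD FILE/ARCHIVE resource must be
   * copied to the local filesystem before use. Local paths need no copy, and
   * resources already on HDFS could bypass localization entirely.
   */
  static boolean needsLocalization(String value) throws URISyntaxException {
    String scheme = new URI(value).getScheme();
    if (scheme == null || scheme.equals("file")) {
      return false; // already on the local filesystem
    }
    if (scheme.equals("hdfs")) {
      return false; // already on HDFS; usable by the distributed cache directly
    }
    return true; // e.g. ivy:// dependencies still have to be downloaded
  }

  public static void main(String[] args) throws URISyntaxException {
    System.out.println(needsLocalization("hdfs://some_dir/archive.tar")); // false
    System.out.println(needsLocalization("ivy://example/dep"));           // true
  }
}
```

With such a check in place, `resolveAndDownload` would call `downloadResource` only for schemes that genuinely require a local copy.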
Thanks,
Sailee
> Performance bottleneck in the add_resource api
> ----------------------------------------------
>
> Key: HIVE-16999
> URL: https://issues.apache.org/jira/browse/HIVE-16999
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: Sailee Jain
> Priority: Critical
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)