[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417979#comment-16417979 ]

Hudson commented on HADOOP-15320:
----------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13895 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13895/])
HADOOP-15320. Remove customized getFileBlockLocations for hadoop-azure (cdouglas: rev 081c3501885c543bb1f159929d456d1ba2e3650c)
* (delete) hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/TestNativeAzureFileSystemBlockLocations.java
* (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java
* (edit) hadoop-tools/hadoop-azure-datalake/src/main/java/org/apache/hadoop/fs/adl/AdlFileSystem.java

> Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-15320
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15320
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/adl, fs/azure
>    Affects Versions: 2.7.3, 2.9.0, 3.0.0
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>             Fix For: 3.1.0, 2.9.1
>
>         Attachments: HADOOP-15320.01.patch, HADOOP-15320.patch
>
>
> hadoop-azure and hadoop-azure-datalake each have their own implementation of getFileBlockLocations(), which fakes a list of artificial blocks based on a hard-coded block size, with every block reporting a single host named "localhost". Take a look at this code:
> [https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]
> This is an unnecessary mock-up for a "remote" file system to mimic HDFS. The problem with this mock is that for large (~TB) files it generates lots of artificial blocks, and FileInputFormat.getSplits() is slow when calculating splits from them.
> We can safely remove this customized getFileBlockLocations() implementation and fall back to the default FileSystem.getFileBlockLocations() implementation, which returns one block for any file, with the single host "localhost". Note that this doesn't mean we will create far fewer splits, because the split size is still limited by the blockSize in FileInputFormat.computeSplitSize():
> {code:java}
> return Math.max(minSize, Math.min(goalSize, blockSize));{code}
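As an illustration of the two behaviors the description above leans on, here is a minimal Java sketch (not the actual Hadoop source; the block "name" host:port string is only a placeholder): the default FileSystem.getFileBlockLocations() reports a single "localhost" block spanning the whole file, and the mapred FileInputFormat.computeSplitSize() formula quoted above still caps each split at blockSize, so the number of splits stays roughly the same.

{code:java}
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

public class BlockLocationSketch {

  // Mirrors the default behavior described above: one artificial block
  // spanning the whole file, reported on the single host "localhost".
  // The block "name" (host:port) string is only a placeholder here.
  static BlockLocation[] defaultBlockLocations(FileStatus file) {
    String[] names = { "localhost:50010" };
    String[] hosts = { "localhost" };
    return new BlockLocation[] {
        new BlockLocation(names, hosts, 0, file.getLen())
    };
  }

  // Same formula as the mapred FileInputFormat.computeSplitSize() quoted
  // above: each split is still capped by blockSize (or goalSize), so one
  // giant "block" does not turn into one giant split.
  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }
}
{code}

For example, a 1 TB file with a 256 MB block/goal size still produces roughly 4096 splits from that single BlockLocation, which is why split counts are essentially unchanged while getSplits() no longer has to walk thousands of synthetic blocks.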
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417139#comment-16417139 ]

Steve Loughran commented on HADOOP-15320:
------------------------------------------

If you are happy, I am +1
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416578#comment-16416578 ]

genericqa commented on HADOOP-15320:
-------------------------------------

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
|  0 | mvndep | 0m 28s | Maven dependency ordering for branch |
| +1 | mvninstall | 27m 31s | trunk passed |
| +1 | compile | 2m 8s | trunk passed |
| +1 | checkstyle | 0m 37s | trunk passed |
| +1 | mvnsite | 0m 52s | trunk passed |
| +1 | shadedclient | 12m 10s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 5s | trunk passed |
| +1 | javadoc | 0m 38s | trunk passed |
|| || || || Patch Compile Tests ||
|  0 | mvndep | 0m 11s | Maven dependency ordering for patch |
| +1 | mvninstall | 0m 49s | the patch passed |
| +1 | compile | 1m 57s | the patch passed |
| +1 | javac | 1m 57s | the patch passed |
| +1 | checkstyle | 0m 33s | hadoop-tools: The patch generated 0 new + 25 unchanged - 24 fixed = 25 total (was 49) |
| +1 | mvnsite | 0m 49s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 1s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 18s | the patch passed |
| +1 | javadoc | 0m 36s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 1m 8s | hadoop-azure in the patch passed. |
| +1 | unit | 0m 48s | hadoop-azure-datalake in the patch passed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 66m 11s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b |
| JIRA Issue | HADOOP-15320 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916504/HADOOP-15320.01.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 04be5bf449b9 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 2a2ef15 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14400/testReport/ |
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416516#comment-16416516 ]

Chris Douglas commented on HADOOP-15320:
-----------------------------------------

Fixed checkstyle warnings. +1 from me. [~ste...@apache.org], lgty?
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416497#comment-16416497 ]

genericqa commented on HADOOP-15320:
-------------------------------------

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
|  0 | mvndep | 0m 16s | Maven dependency ordering for branch |
| +1 | mvninstall | 26m 8s | trunk passed |
| +1 | compile | 1m 47s | trunk passed |
| +1 | checkstyle | 0m 37s | trunk passed |
| +1 | mvnsite | 0m 50s | trunk passed |
| +1 | shadedclient | 11m 45s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 4s | trunk passed |
| +1 | javadoc | 0m 36s | trunk passed |
|| || || || Patch Compile Tests ||
|  0 | mvndep | 0m 11s | Maven dependency ordering for patch |
| +1 | mvninstall | 0m 45s | the patch passed |
| +1 | compile | 1m 43s | the patch passed |
| +1 | javac | 1m 43s | the patch passed |
| -0 | checkstyle | 0m 30s | hadoop-tools: The patch generated 2 new + 25 unchanged - 24 fixed = 27 total (was 49) |
| +1 | mvnsite | 0m 42s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 38s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 25s | the patch passed |
| +1 | javadoc | 0m 40s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 1m 5s | hadoop-azure in the patch passed. |
| +1 | unit | 0m 43s | hadoop-azure-datalake in the patch passed. |
| +1 | asflicense | 0m 23s | The patch does not generate ASF License warnings. |
| | | 62m 56s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b |
| JIRA Issue | HADOOP-15320 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12914953/HADOOP-15320.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux baac5fbe684c 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 3fe41c6 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle |
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416424#comment-16416424 ]

Chris Douglas commented on HADOOP-15320:
-----------------------------------------

Thanks [~shanyu]. Running this through Jenkins. We could add a unit test to signal that a change to the default behavior could affect these FS implementations, but that should be implied.
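A minimal sketch of the kind of unit test mentioned above, assuming the filesystem under test inherits the default single-"localhost"-block behavior from FileSystem. It is written against the local filesystem purely so the sketch is self-contained (the class and file names are made up); a real test would use the NativeAzureFileSystem / AdlFileSystem instances from the existing test setups.

{code:java}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class TestDefaultBlockLocations {

  @Test
  public void testSingleLocalhostBlock() throws Exception {
    // Local filesystem is used only for illustration; in the real suites this
    // would be the Azure / ADL filesystem created by the test setup.
    FileSystem fs = FileSystem.getLocal(new Configuration());
    Path file = new Path(System.getProperty("java.io.tmpdir"), "blockloc-test.dat");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write(new byte[4 * 1024 * 1024]); // 4 MB, far smaller than any block size
    }

    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] locs = fs.getFileBlockLocations(status, 0, status.getLen());

    // Default FileSystem behavior: one block covering the whole file,
    // reported on the single host "localhost".
    assertEquals(1, locs.length);
    assertEquals(status.getLen(), locs[0].getLength());
    assertEquals("localhost", locs[0].getHosts()[0]);

    fs.delete(file, false);
  }
}
{code}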
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416283#comment-16416283 ]

shanyu zhao commented on HADOOP-15320:
---------------------------------------

I also ran the following manual tests successfully:
1) A Hive TPC-H test with my change on WASB; it passed with the correct number of splits.
2) A Spark application converting a huge CSV file to Parquet.
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411841#comment-16411841 ]

Steve Loughran commented on HADOOP-15320:
------------------------------------------

OK, maybe I'm the confused one. Given it's working for S3a, if you're happy, so am I
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411827#comment-16411827 ]

Chris Douglas commented on HADOOP-15320:
-----------------------------------------

bq. I know s3 "appears" to work, but I'm not actually confident that everything is getting the splits right there.
Me neither, but 1.5h to generate synthetic splits is definitely wrong. If we develop a new best practice for object stores, then we can apply it across the stores we support. The {{BlockLocation[]}} return type is pretty restrictive, but we could probably do better.

bq. The one I want you look at is: Spark, CSV, multiGB: SPARK-22240 . That's what's been niggling at me for a while.
Maybe I'm missing the bug. Block locations are hints for locality, not format partitioning. In that JIRA, gzip is not splittable, so a single reader is correct, absent some other preparation (saving the dictionary at offsets, writing zero-length gzip files as split markers, etc.). In general, framework parallelism should not rely exclusively on block locations...
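To make the splittability point concrete, here is a rough sketch of the check TextInputFormat applies (simplified, not the exact Hadoop source): whether a file can be split is decided by its compression codec, not by its block locations, so a gzip file gets a single reader regardless of how many blocks are reported.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {

  // Roughly the check TextInputFormat makes before splitting a file:
  // an uncompressed file is splittable; a compressed one is splittable only
  // if its codec is (gzip is not, bzip2 is). Block locations never enter it.
  static boolean isSplittable(Configuration conf, Path file) {
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    return codec == null || codec instanceof SplittableCompressionCodec;
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    System.out.println(isSplittable(conf, new Path("/data/huge.csv")));    // true
    System.out.println(isSplittable(conf, new Path("/data/huge.csv.gz"))); // false
  }
}
{code}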
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411716#comment-16411716 ]

Steve Loughran commented on HADOOP-15320:
------------------------------------------

I know s3 "appears" to work, but I'm not actually confident that everything is getting the splits right there.

The one I want you to look at is: Spark, CSV, multi-GB: SPARK-22240. That's what's been niggling at me for a while.
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410265#comment-16410265 ]

Chris Douglas commented on HADOOP-15320:
-----------------------------------------

bq. Anything else we should run?
As [~ste...@apache.org] suggested, the hadoop-azure and hadoop-azuredatalake test suites and contract tests should pass.
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405682#comment-16405682 ]

shanyu zhao commented on HADOOP-15320:
---------------------------------------

I've run a few Spark jobs on a very large input file (hundreds of TB), and getSplits() on this file took a few seconds, vs. 1.5 hours without the change. I'm in the middle of running Hive TPC-H tests. Anything else we should run?

As [~chris.douglas] mentioned, since S3A is running fine, we should be good to go for this patch.
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405583#comment-16405583 ]

Chris Douglas commented on HADOOP-15320:
-----------------------------------------

What testing has been done with this already?

bq. do think it will need be bounced past the various tools, including: hive, spark, pig to see that it all goes OK. But given S3A is using that default with no adverse consequences, I think you'll be right.
Wouldn't one expect the same results, if the pattern worked for S3A? One would expect to find framework code that is unnecessarily serial after this change. What tests did S3A run that should be repeated?

bq. which endpoints did you run the entire hadoop-azure and hadoop-azuredatalake test suites?
Running these integration tests is a good idea. It's why they're there, after all.
[jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403665#comment-16403665 ]

Steve Loughran commented on HADOOP-15320:
------------------------------------------

Interesting. In HADOOP-14943 I'd proposed pulling the Azure one up into hadoop-common for shared use, speccing a bit tighter what it did, and then wiring S3A up to it too. Now you are saying that for multi-TB files we don't need this code at all? Well, that's good news.

I see your arguments, but I do think it will need to be bounced past the various tools, including Hive, Spark and Pig, to see that it all goes OK. But given S3A is using that default with no adverse consequences, I think you'll be right.

As usual: against which endpoints did you run the entire hadoop-azure and hadoop-azure-datalake test suites?