[jira] [Created] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake

2018-03-16 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-15320:


 Summary: Remove customized getFileBlockLocations for hadoop-azure 
and hadoop-azure-datalake
 Key: HADOOP-15320
 URL: https://issues.apache.org/jira/browse/HADOOP-15320
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs/adl, fs/azure
Affects Versions: 3.0.0, 2.9.0, 2.7.3
Reporter: shanyu zhao
Assignee: shanyu zhao


hadoop-azure and hadoop-azure-datalake have their own implementations of 
getFileBlockLocations(), which fabricate a list of artificial blocks based on a 
hard-coded block size, each block having a single host named "localhost". Take 
a look at this code:

[https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]

This is an unnecessary mock-up for a "remote" file system to mimic HDFS. The 
problem with this mock is that for large (~TB) files we generate lots of 
artificial blocks, and FileInputFormat.getSplits() is slow at calculating 
splits based on these blocks.

We can safely remove this customized getFileBlockLocations() implementation and 
fall back to the default FileSystem.getFileBlockLocations() implementation, 
which returns one block for any file, with the single host "localhost". Note that 
this doesn't mean we will create far fewer splits, because the number of splits 
is still limited by the blockSize in FileInputFormat.computeSplitSize():
{code:java}
return Math.max(minSize, Math.min(goalSize, blockSize));{code}
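For reference, here is a minimal sketch of the default behaviour we would fall 
back to. It paraphrases FileSystem.getFileBlockLocations() from memory, and the 
datanode address string is a placeholder, so treat it as an illustration rather 
than the exact shipped code:
{code:java}
// Hedged sketch of the default FileSystem.getFileBlockLocations():
// a single artificial block covering the whole file, hosted on "localhost".
// Types are org.apache.hadoop.fs.FileStatus and org.apache.hadoop.fs.BlockLocation.
public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
    throws IOException {
  if (file == null) {
    return null;
  }
  if (file.getLen() <= start) {
    return new BlockLocation[0];
  }
  String[] name = { "localhost:50010" }; // placeholder datanode address
  String[] host = { "localhost" };
  return new BlockLocation[] { new BlockLocation(name, host, 0, file.getLen()) };
}
{code}
Because FileInputFormat derives the split size from the file's reported block 
size rather than from the number of BlockLocation entries, returning a single 
block does not change how many splits are produced; it only avoids enumerating 
thousands of artificial blocks for TB-sized files.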





[jira] [Created] (HADOOP-11629) WASB filesystem should not start BandwidthGaugeUpdater if fs.azure.skip.metrics set to true

2015-02-24 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-11629:


 Summary: WASB filesystem should not start BandwidthGaugeUpdater if 
fs.azure.skip.metrics set to true
 Key: HADOOP-11629
 URL: https://issues.apache.org/jira/browse/HADOOP-11629
 Project: Hadoop Common
  Issue Type: Bug
  Components: tools
Affects Versions: 2.6.1
Reporter: shanyu zhao
Assignee: shanyu zhao


In HADOOP-11248 we added the configuration "fs.azure.skip.metrics". If it is set 
to true, we do not register Azure file system metrics with the metrics system. 
However, the BandwidthGaugeUpdater object is still created in 
AzureNativeFileSystemStore, resulting in unnecessary threads being spawned.

Under heavy load the system can be busy dealing with these threads, and GC has 
to work on removing the thread objects. For example, when multiple WebHCat 
clients submit jobs to a WebHCat server, we observed the server spawning ~400 
daemon threads, which slows down the server and sometimes causes timeouts.
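A hedged sketch of the direction this points in; the field names below are 
illustrative stand-ins for whatever AzureNativeFileSystemStore actually holds, 
not the committed patch:
{code:java}
// Sketch: only spawn the bandwidth-updater thread when metrics are enabled.
// sessionConfiguration and instrumentation are illustrative field names.
if (!sessionConfiguration.getBoolean("fs.azure.skip.metrics", false)) {
  bandwidthGaugeUpdater = new BandwidthGaugeUpdater(instrumentation);
}
{code}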






[jira] [Created] (HADOOP-11248) Add hadoop configuration to disable Azure Filesystem metrics collection

2014-10-29 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-11248:


 Summary: Add hadoop configuration to disable Azure Filesystem 
metrics collection
 Key: HADOOP-11248
 URL: https://issues.apache.org/jira/browse/HADOOP-11248
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 2.5.1, 2.4.1
Reporter: shanyu zhao
Assignee: shanyu zhao


Today, whenever the Azure file system is used, metrics collection is enabled 
through the AzureFileSystemMetricsSystem class. The metrics collected include 
bytes transferred and throughput.

In some situations we do not want to collect metrics for the Azure file system, 
e.g. on a WebHCat server. We need to introduce a new configuration property, 
"fs.azure.skip.metrics", to disable metrics collection.






[jira] [Created] (HADOOP-10840) Fix OutOfMemoryError caused by metrics system in Azure File System

2014-07-15 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-10840:


 Summary: Fix OutOfMemoryError caused by metrics system in Azure 
File System
 Key: HADOOP-10840
 URL: https://issues.apache.org/jira/browse/HADOOP-10840
 Project: Hadoop Common
  Issue Type: Bug
  Components: metrics
Affects Versions: 2.4.1
Reporter: shanyu zhao
Assignee: shanyu zhao


In Hadoop 2.x the Hadoop file system framework changed and no cache is 
implemented (refer to HADOOP-6356). This means that for every WASB access a new 
NativeAzureFileSystem is created, along with a metrics source that is created 
and added to MetricsSystemImpl. Over time these sources accumulate, eating 
memory and eventually causing a Java OutOfMemoryError.

The fix is to use the unregisterSource() method added to MetricsSystem in 
HADOOP-10839.
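A hedged sketch of that direction, assuming HADOOP-10839's unregisterSource() is 
available; metricsSourceName is an illustrative field, not the actual member name:
{code:java}
// Sketch, not the committed patch: drop this instance's metrics source when
// the file system is closed, so sources no longer pile up in MetricsSystemImpl.
@Override
public void close() throws IOException {
  super.close();
  // metricsSourceName is whatever name this instance registered under
  DefaultMetricsSystem.instance().unregisterSource(metricsSourceName);
}
{code}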





[jira] [Created] (HADOOP-10839) Add unregisterSource() to MetricsSystem API

2014-07-15 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-10839:


 Summary: Add unregisterSource() to MetricsSystem API
 Key: HADOOP-10839
 URL: https://issues.apache.org/jira/browse/HADOOP-10839
 Project: Hadoop Common
  Issue Type: Bug
  Components: metrics
Affects Versions: 2.4.1
Reporter: shanyu zhao
Assignee: shanyu zhao


Currently the MetricsSystem API has a register() method to register a 
MetricsSource but no unregister() method. This means that once a MetricsSource 
is registered with the MetricsSystem, it stays there until the MetricsSystem is 
shut down, which in some cases can cause a Java OutOfMemoryError.

One such case is the file system metrics implementation. The new 
AbstractFileSystem/FileContext framework does not implement a cache, so every 
file system access can lead to the creation of a NativeFileSystem instance 
(refer to HADOOP-6356). All of these NativeFileSystem instances need to share 
the same MetricsSystemImpl instance, which means we cannot shut down the 
MetricsSystem to clean up MetricsSources that have been registered but are no 
longer active. Over time the MetricsSource instances accumulate and eventually 
we see OutOfMemoryError.
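The addition itself is small; a sketch of the proposed shape, modeled on the 
existing register() method (treat the exact signature as tentative):
{code:java}
// Sketch of the proposed addition to org.apache.hadoop.metrics2.MetricsSystem
public abstract class MetricsSystem implements MetricsSystemMXBean {
  // existing: register a source under a name
  public abstract <T> T register(String name, String desc, T source);

  // proposed: remove a previously registered source by name, so callers such
  // as per-instance file system metrics can clean up without shutting down
  // the whole MetricsSystem
  public abstract void unregisterSource(String name);

  // (other existing members elided)
}
{code}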





[jira] [Created] (HADOOP-10245) Hadoop command line always appends "-Xmx" option twice

2014-01-20 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-10245:


 Summary: Hadoop command line always appends "-Xmx" option twice
 Key: HADOOP-10245
 URL: https://issues.apache.org/jira/browse/HADOOP-10245
 Project: Hadoop Common
  Issue Type: Bug
  Components: bin
Affects Versions: 2.2.0
Reporter: shanyu zhao
Assignee: shanyu zhao


The Hadoop command line scripts (hadoop.sh or hadoop.cmd) call java with the 
"-Xmx" option twice. The impact is that any user-defined HADOOP_HEAP_SIZE env 
variable takes no effect, because it is overwritten by the second "-Xmx" 
option.

For example, here is the java command generated for "hadoop fs -ls /". Notice 
that there are two "-Xmx" options, "-Xmx1000m" and "-Xmx512m", in the command 
line:

java -Xmx1000m -Dhadoop.log.dir=C:\tmp\logs -Dhadoop.log.file=hadoop.log -Dhadoop.root.logger=INFO,console,DRFA -Xmx512m -Dhadoop.security.logger=INFO,RFAS -classpath XXX org.apache.hadoop.fs.FsShell -ls /

Here is the root cause. The call flow is: hadoop.sh calls hadoop-config.sh, 
which in turn calls hadoop-env.sh.
In hadoop.sh, the command line is generated by the following pseudo code:
java $JAVA_HEAP_MAX $HADOOP_CLIENT_OPTS -classpath ...

In hadoop-config.sh, $JAVA_HEAP_MAX is initialized to "-Xmx1000m" if the user 
didn't set the $HADOOP_HEAP_SIZE env variable.

In hadoop-env.sh, $HADOOP_CLIENT_OPTS is set like this:
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"

To fix this problem, we should remove "-Xmx512m" from HADOOP_CLIENT_OPTS. If we 
really want to change the memory settings, we should use the $HADOOP_HEAP_SIZE 
env variable.





[jira] [Created] (HADOOP-10178) Configuration deprecation always emits "deprecated" warnings when a new key is used

2013-12-20 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-10178:


 Summary: Configuration deprecation always emits "deprecated" 
warnings when a new key is used
 Key: HADOOP-10178
 URL: https://issues.apache.org/jira/browse/HADOOP-10178
 Project: Hadoop Common
  Issue Type: Bug
  Components: conf
Affects Versions: 2.2.0
Reporter: shanyu zhao
Assignee: shanyu zhao


Even if you only use new configuration property names, you still find 
"deprecated" warnings in your logs, e.g.:
13/12/14 01:00:51 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
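A minimal snippet that, per this report, is enough to trigger the warning even 
though only the new key is touched (illustrative; the exact code path that logs 
it may vary):
{code:java}
// Only the new key is used, yet the log still shows
// "mapred.input.dir.recursive is deprecated. Instead, use ..."
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
{code}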






[jira] [Created] (HADOOP-10093) hadoop.cmd fs -copyFromLocal fails with large files on WASB

2013-11-12 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-10093:


 Summary: hadoop.cmd fs -copyFromLocal fails with large files on 
WASB
 Key: HADOOP-10093
 URL: https://issues.apache.org/jira/browse/HADOOP-10093
 Project: Hadoop Common
  Issue Type: Bug
  Components: conf
Affects Versions: 2.2.0
Reporter: shanyu zhao
Assignee: shanyu zhao


When WASB is configured as the default file system, if you run this:
 hadoop fs -copyFromLocal largefile (>150MB) /test

You'll see this error message:
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at com.microsoft.windowsazure.services.blob.client.BlobOutputStream.writeInternal(BlobOutputStream.java:618)
 at com.microsoft.windowsazure.services.blob.client.BlobOutputStream.write(BlobOutputStream.java:545)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:307)
 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:59)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:80)
 at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
 at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
 at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:299)
 at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:281)
 at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:245)
 at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:188)
 at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:173)
 at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306)
 at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
 at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:168)
 at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
 at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
 at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:145)
 at org.apache.hadoop.fs.shell.CopyCommands$Put.processArguments(CopyCommands.java:229)
 at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
 at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
 at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
 at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)






[jira] [Created] (HADOOP-9924) FileUtil.createJarWithClassPath() does not generate relative classpath correctly

2013-08-30 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-9924:
---

 Summary: FileUtil.createJarWithClassPath() does not generate 
relative classpath correctly
 Key: HADOOP-9924
 URL: https://issues.apache.org/jira/browse/HADOOP-9924
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 0.23.9, 2.1.0-beta
Reporter: shanyu zhao
Assignee: shanyu zhao


On Windows, FileUtil.createJarWithClassPath() is called to generate a manifest 
jar file that packs the classpath, to avoid the problem of the classpath being 
too long. However, relative classpath entries are not handled correctly: the 
code relies on Java's File(relativePath) to resolve the relative path, but it 
really should use the given pwd parameter.

To reproduce this bug, run a Pig job on Windows; it will fail, and the Pig log 
on the application master will look like this:

2013-08-29 23:25:55,498 INFO [main] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat not found
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat not found

This is because the PigOutputFormat class is in the job.jar file, but the 
classpath manifest has:
file:/c:/apps/dist/hadoop-2.1.0-beta/bin/job.jar/job.jar
when it really should be:
file://job.jar/job.jar
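A hedged sketch of the intended resolution; the pwd value and classpath entry 
below are made-up examples, and this shows the general idea rather than the 
committed patch:
{code:java}
// Resolve a relative classpath entry against the supplied pwd parameter
// instead of letting java.io.File resolve it against the JVM's working
// directory. Both values here are hypothetical.
Path pwd = new Path("file:///c:/apps/dist/pig-job"); // the pwd parameter
String entry = "job.jar/job.jar";                    // relative classpath entry
Path resolved = new Path(entry).isAbsolute()
    ? new Path(entry)
    : new Path(pwd, entry);
{code}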



[jira] [Created] (HADOOP-9896) TestIPC fails with VM crash or System.exit

2013-08-21 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-9896:
---

 Summary: TestIPC fails with VM crash or System.exit
 Key: HADOOP-9896
 URL: https://issues.apache.org/jira/browse/HADOOP-9896
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Affects Versions: 2.0.5-alpha
Reporter: shanyu zhao
 Attachments: org.apache.hadoop.ipc.TestIPC-output.txt

I'm running Hadoop unit tests on an Ubuntu 12.04 virtual machine. Every time I 
try to run all unit tests with the command "mvn test", the TestIPC unit test 
fails and the console shows "The forked VM terminated without saying properly 
goodbye. VM crash or System.exit called?"





[jira] [Created] (HADOOP-9776) HarFileSystem.listStatus() returns "har://-localhost:/..." if port number is empty

2013-07-25 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-9776:
---

 Summary: HarFileSystem.listStatus() returns 
"har://-localhost:/..." if port number is empty
 Key: HADOOP-9776
 URL: https://issues.apache.org/jira/browse/HADOOP-9776
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 0.23.9
Reporter: shanyu zhao


If the given har URI is "har://-localhost/usr/my.har/a", the result of 
HarFileSystem.listStatus() will have a ":" appended after localhost, like this: 
"har://-localhost:/usr/my.har/a". It should return 
"har://-localhost/usr/my.har/a" instead.

This creates a problem when running the hive unit test TestCliDriver 
(archive_excludeHadoop20.q), generating the following error:

java.io.IOException: cannot find dir = har://pfile-localhost:/GitHub/hive-monarch/build/ql/test/data/warehouse/tstsrcpart/ds=2008-04-08/hr=12/data.har/00_0 in pathToPartitionInfo: [pfile:/GitHub/hive-monarch/build/ql/test/data/warehouse/tstsrcpart/ds=2008-04-08/hr=11, har://pfile-localhost/GitHub/hive-monarch/build/ql/test/data/warehouse/tstsrcpart/ds=2008-04-08/hr=12/data.har]
[junit] at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:298)
[junit] at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:260)
[junit] at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:104)
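For illustration only, a small snippet of the kind of authority handling that 
avoids the stray ":"; it uses plain java.net.URI and is not the HarFileSystem 
code itself:
{code:java}
// When the underlying URI has no port, URI.getPort() returns -1 and nothing
// should be appended after the host.
URI under = URI.create("har://pfile-localhost/usr/my.har/a");
String authority = under.getPort() >= 0
    ? under.getHost() + ":" + under.getPort()
    : under.getHost(); // "pfile-localhost", no trailing ":"
{code}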




[jira] [Created] (HADOOP-9774) RawLocalFileSystem.listStatus() returns absolute paths when input path is relative on Windows

2013-07-25 Thread shanyu zhao (JIRA)
shanyu zhao created HADOOP-9774:
---

 Summary: RawLocalFileSystem.listStatus() returns absolute paths 
when input path is relative on Windows
 Key: HADOOP-9774
 URL: https://issues.apache.org/jira/browse/HADOOP-9774
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Affects Versions: 0.23.9, 0.23.8, 0.23.7, 0.23.6, 0.23.5
Reporter: shanyu zhao


On Windows, when using RawLocalFileSystem.listStatus() to enumerate a relative 
path (without a drive spec), e.g. "file:///mydata", the resulting paths become 
absolute paths, e.g. ["file://E:/mydata/t1.txt", "file://E:/mydata/t2.txt"...]. 
Note that if we use it to enumerate an absolute path, e.g. "file://E:/mydata", 
then we get the same results as above.

This breaks some hive unit tests which use the local file system to simulate 
HDFS during testing, and which therefore strip the drive spec. After 
listStatus() the path is changed to an absolute path, and hive fails to find 
the path in its map reduce job.

You'll see the following exception:
[junit] java.io.IOException: cannot find dir = pfile:/E:/GitHub/hive-monarch/build/ql/test/data/warehouse/src/kv1.txt in pathToPartitionInfo: [pfile:/GitHub/hive-monarch/build/ql/test/data/warehouse/src]
[junit] at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:298)


This problem was introduced by HADOOP-8962.

Prior to the fix for HADOOP-8962 (merged in 0.23.5), the resulting paths were 
relative paths if the parent path was relative, e.g. 
["file:///mydata/t1.txt", "file:///mydata/t2.txt"...]

This behavior change is a side effect of the fix in HADOOP-8962, not an 
intended change. The resulting behavior, even though it is legitimate from a 
functional point of view, breaks consistency from the caller's point of view: 
when the caller uses a relative path (without a drive spec) for listStatus(), 
the resulting paths should be relative. Therefore, I think this should be fixed.
