[Hadoop Wiki] Trivial Update of "首页" by sunlightcs
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification. The "首页" (Home Page) page has been changed by sunlightcs. http://wiki.apache.org/hadoop/%E9%A6%96%E9%A1%B5?action=diff&rev1=7&rev2=8 -- Development of Hadoop was started by Doug Cutting in 2004. It began to catch on in China in 2008 and was widely popular there by 2009: many companies, including China Mobile, Baidu, NetEase, Taobao, Tencent, Kingsoft, and Huawei, are studying and using it, and institutions such as the Chinese Academy of Sciences, Jinan University, and Zhejiang University are researching it as well. + + * [[http://www.juziku.com/|聚资库]] +
[Hadoop Wiki] Update of "Hive/DesignDocs" by JohnSichi
The "Hive/DesignDocs" page has been changed by JohnSichi. http://wiki.apache.org/hadoop/Hive/DesignDocs?action=diff&rev1=2&rev2=3 -- * [[Hive/HBaseIntegration|HBase Integration]] * [[Hive/HBaseBulkLoad|HBase Bulk Load]] * [[Hive/Locking|Locking]] + * [[Hive/FilterPushdownDev|Filter Pushdown]]
Page nainai deleted from Hadoop Wiki
The page "nainai" has been deleted by DougCutting. The comment on this change is: spam. http://wiki.apache.org/hadoop/nainai
[Hadoop Wiki] Update of "Hive/LanguageManual/UDF" by ArvindPrabhakar
The "Hive/LanguageManual/UDF" page has been changed by ArvindPrabhakar. http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF?action=diff&rev1=44&rev2=45 -- == Built-in Aggregate Functions (UDAF) == The following built-in aggregate functions are supported in Hive: ||<10%>'''Return Type''' ||<10%>'''Name(Signature)''' ||'''Description''' || - ||bigint ||count(1), count(DISTINCT col [, col]...) ||count(1) returns the number of members in the group, whereas count(DISTINCT col) gets the count of distinct values of the columns in the group || + ||bigint ||count(*), count(expr), count(DISTINCT expr[, expr...]) || count(*) - Returns the total number of retrieved rows, including rows containing NULL values; count(expr) - Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. || ||double ||sum(col), sum(DISTINCT col) ||Returns the sum of the elements in the group or the sum of the distinct values of the column in the group || ||double ||avg(col), avg(DISTINCT col) ||Returns the average of the elements in the group or the average of the distinct values of the column in the group || ||double ||min(col) ||Returns the minimum of the column in the group ||
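The three count() variants in the updated table row differ only in how they treat NULLs and duplicates. A minimal sketch of those semantics in plain Java (the `CountSemantics` class and sample data are hypothetical, for illustration only, not part of Hive):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Illustrates the semantics of the three count() variants described above. */
public class CountSemantics {

    /** count(*): every retrieved row, including rows containing NULLs. */
    static long countAll(List<String> col) {
        return col.size();
    }

    /** count(expr): rows for which the expression is non-NULL. */
    static long countNonNull(List<String> col) {
        return col.stream().filter(Objects::nonNull).count();
    }

    /** count(DISTINCT expr): unique, non-NULL values. */
    static long countDistinct(List<String> col) {
        Set<String> seen = new HashSet<>();
        for (String v : col) {
            if (v != null) seen.add(v);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> col = Arrays.asList("a", "b", null, "a", null);
        System.out.println(countAll(col));      // 5
        System.out.println(countNonNull(col));  // 3
        System.out.println(countDistinct(col)); // 2
    }
}
```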
[Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by ArvindPrabhakar
The "Hive/GenericUDAFCaseStudy" page has been changed by ArvindPrabhakar. http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy?action=diff&rev1=1&rev2=2 -- == Writing the source == - As stated above, create a new file called `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java`, relative to the Hive root directory. Please see the `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java` for a detailed example of a UDAF. + This section gives a high-level outline of how to implement your own generic UDAF. For a concrete example, look at any of the existing UDAF sources in the `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/` directory. + + At a high level, there are two parts to implementing a generic UDAF. The first is to write an ''evaluator'', and the second is to create a ''resolver''. An evaluator is the actual implementation of the generic UDAF, with the processing logic in place. The resolver, on the other hand, provides a mechanism for the evaluator to be accessed by the query processing framework. + + All evaluators must extend the abstract base class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. This class provides a few abstract methods that must be implemented by the extending class. These methods establish the processing semantics followed by the UDAF. Please refer to the javadocs of the abstract methods for their exact specifications. + + A resolver is written by either implementing the interface org.apache.hadoop.hive.ql.udf.GenericUDAFResolver2 or extending the abstract class org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver. There is also an interface org.apache.hadoop.hive.ql.udf.GenericUDAFResolver that can be implemented, but it is deprecated as of the 0.6.0 release. 
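The evaluator contract mentioned above boils down to an aggregation buffer plus a handful of lifecycle methods. Below is an illustrative, self-contained analogue of a sum() evaluator; the method names mirror GenericUDAFEvaluator's contract (iterate, terminatePartial, merge, terminate), but the ObjectInspector plumbing of the real class is omitted, so this is a sketch, not code that compiles against Hive:

```java
/**
 * Illustrative analogue of a Hive generic UDAF evaluator computing sum().
 * The real GenericUDAFEvaluator passes values via ObjectInspectors; here
 * plain doubles are used to keep the sketch self-contained.
 */
public class SumEvaluatorSketch {

    /** Per-group intermediate state (Hive calls this an AggregationBuffer). */
    static class SumBuffer {
        double sum;
    }

    SumBuffer getNewAggregationBuffer() {
        return new SumBuffer();
    }

    /** Map side: fold one raw row value into the buffer. */
    void iterate(SumBuffer buf, double value) {
        buf.sum += value;
    }

    /** Map side: emit a partial aggregate for the shuffle. */
    double terminatePartial(SumBuffer buf) {
        return buf.sum;
    }

    /** Reduce side: fold a partial aggregate from another task into the buffer. */
    void merge(SumBuffer buf, double partial) {
        buf.sum += partial;
    }

    /** Reduce side: produce the final aggregate. */
    double terminate(SumBuffer buf) {
        return buf.sum;
    }

    public static void main(String[] args) {
        SumEvaluatorSketch eval = new SumEvaluatorSketch();
        // Two "map tasks" each process part of the group...
        SumBuffer m1 = eval.getNewAggregationBuffer();
        eval.iterate(m1, 1.0);
        eval.iterate(m1, 2.0);
        SumBuffer m2 = eval.getNewAggregationBuffer();
        eval.iterate(m2, 3.0);
        // ...and the reducer merges their partial results.
        SumBuffer r = eval.getNewAggregationBuffer();
        eval.merge(r, eval.terminatePartial(m1));
        eval.merge(r, eval.terminatePartial(m2));
        System.out.println(eval.terminate(r)); // 6.0
    }
}
```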
The key difference between the GenericUDAFResolver and GenericUDAFResolver2 interfaces is that the latter allows the evaluator implementation to access extra information about the function invocation, such as the presence of the DISTINCT qualifier or invocation with the wildcard syntax, e.g. FUNCTION(*). Evaluators that implement the deprecated GenericUDAFResolver interface cannot tell the difference between an invocation such as FUNCTION() and FUNCTION(*), since the information about the wildcard is not available. Similarly, these implementations cannot tell the difference between FUNCTION(EXPR) and FUNCTION(DISTINCT EXPR), since the information about the presence of the DISTINCT qualifier is likewise not available. + + Note that while resolvers which implement the GenericUDAFResolver2 interface are provided the extra information about the presence of the DISTINCT qualifier or invocation with the wildcard syntax, they can choose to ignore it completely if it is of no significance to them. The underlying data manipulation that ensures the DISTINCT nature of the expression values is actually done by the framework, not by the evaluator or resolver. UDAF implementations that do not care about this extra information can simply extend the AbstractGenericUDAFResolver class, which insulates the implementation from this information. It also offers an easy way to migrate previously written UDAF implementations to the new resolver interface without having to rewrite them, since the change from implementing the GenericUDAFResolver interface to extending the AbstractGenericUDAFResolver class is fairly minimal. There may be issues with implementations that are part of an inheritance hierarchy, since it may not be easy to change the base class. == Modifying the function registry ==
svn commit: r963907 - in /hadoop/common/branches/branch-0.20: ./ src/hdfs/org/apache/hadoop/hdfs/server/namenode/ src/hdfs/org/apache/hadoop/hdfs/server/namenode/metrics/ src/test/org/apache/hadoop/hd
Author: shv Date: Tue Jul 13 23:49:58 2010 New Revision: 963907 URL: http://svn.apache.org/viewvc?rev=963907&view=rev Log: HDFS-132. Port to branch 0.20. Contributed by Konstantin Shvachko. Modified: hadoop/common/branches/branch-0.20/CHANGES.txt hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/metrics/NameNodeMetrics.java hadoop/common/branches/branch-0.20/src/test/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java Modified: hadoop/common/branches/branch-0.20/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/CHANGES.txt?rev=963907&r1=963906&r2=963907&view=diff == --- hadoop/common/branches/branch-0.20/CHANGES.txt (original) +++ hadoop/common/branches/branch-0.20/CHANGES.txt Tue Jul 13 23:49:58 2010 @@ -39,6 +39,9 @@ Release 0.20.3 - Unreleased HDFS-1258. Clearing namespace quota on "/" corrupts fs image. (Aaron T. Myers via szetszwo) +HDFS-132. Fix namenode to not report files deleted metrics for deletions +done while replaying edits during startup. (suresh & shv) + IMPROVEMENTS MAPREDUCE-1407. 
Update javadoc in mapreduce.{Mapper,Reducer} to match Modified: hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java URL: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java?rev=963907&r1=963906&r2=963907&view=diff == --- hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java (original) +++ hadoop/common/branches/branch-0.20/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java Tue Jul 13 23:49:58 2010 @@ -25,9 +25,6 @@ import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.ContentSummary; import org.apache.hadoop.fs.permission.*; -import org.apache.hadoop.metrics.MetricsRecord; -import org.apache.hadoop.metrics.MetricsUtil; -import org.apache.hadoop.metrics.MetricsContext; import org.apache.hadoop.hdfs.protocol.FSConstants; import org.apache.hadoop.hdfs.protocol.Block; import org.apache.hadoop.hdfs.protocol.QuotaExceededException; @@ -49,8 +46,6 @@ class FSDirectory implements FSConstants final INodeDirectoryWithQuota rootDir; FSImage fsImage; private boolean ready = false; - // Metrics record - private MetricsRecord directoryMetrics = null; /** Access an existing dfs name directory. 
*/ FSDirectory(FSNamesystem ns, Configuration conf) { @@ -65,13 +60,6 @@ class FSDirectory implements FSConstants Integer.MAX_VALUE, -1); this.fsImage = fsImage; namesystem = ns; -initialize(conf); - } - - private void initialize(Configuration conf) { -MetricsContext metricsContext = MetricsUtil.getContext("dfs"); -directoryMetrics = MetricsUtil.createRecord(metricsContext, "FSDirectory"); -directoryMetrics.setTag("sessionId", conf.get("session.id")); } void loadFSImage(Collection dataDirs, @@ -103,8 +91,8 @@ class FSDirectory implements FSConstants } private void incrDeletedFileCount(int count) { -directoryMetrics.incrMetric("files_deleted", count); -directoryMetrics.update(); +if (namesystem != null) + NameNode.getNameNodeMetrics().numFilesDeleted.inc(count); } /** @@ -569,17 +557,19 @@ class FSDirectory implements FSConstants /** * Remove the file from management, return blocks */ - INode delete(String src) { + boolean delete(String src) { if (NameNode.stateChangeLog.isDebugEnabled()) { NameNode.stateChangeLog.debug("DIR* FSDirectory.delete: "+src); } waitForReady(); long now = FSNamesystem.now(); -INode deletedNode = unprotectedDelete(src, now); -if (deletedNode != null) { - fsImage.getEditLog().logDelete(src, now); +int filesRemoved = unprotectedDelete(src, now); +if (filesRemoved <= 0) { + return false; } -return deletedNode; +incrDeletedFileCount(filesRemoved); +fsImage.getEditLog().logDelete(src, now); +return true; } /** Return if a directory is empty or not **/ @@ -604,9 +594,9 @@ class FSDirectory implements FSConstants * @param src a string representation of a path to an inode * @param modificationTime the time the inode is removed * @param deletedBlocks the place holder for the blocks to be removed - * @return if the deletion succeeds + * @return the number of inodes deleted; 0 if no inodes are deleted. */ - INode unprotectedDelete(String src, long modificationTime) { + int
[Hadoop Wiki] Update of "Hive/LanguageManual/Joins" by EdwardCapriolo
The "Hive/LanguageManual/Joins" page has been changed by EdwardCapriolo. http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins?action=diff&rev1=19&rev2=20 -- <> ## page was renamed from Hive/LanguageManual/LanguageManual/Joins - == Join Syntax == + == THIS PAGE WAS MOVED TO HIVE XDOCS! DO NOT EDIT! Join Syntax == Hive supports the following syntax for joining tables: {{{ @@ -24, +24 @@ join_condition: ON equality_expression ( AND equality_expression )* - equality_expression: + equality_expression: expression = expression }}} + Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive. - - Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality - conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be - joined in Hive. Some salient points to consider when writing join queries are as follows: * Only equality joins are allowed e.g. + - {{{ + {{{ - SELECT a.* FROM a JOIN b ON (a.id = b.id) + SELECT a.* FROM a JOIN b ON (a.id = b.id) }}} - {{{ + {{{ - SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department) + SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department) }}} - are both valid joins, however + . are both valid joins, however + {{{ SELECT a.* FROM a JOIN b ON (a.id <> b.id) }}} - is NOT allowed + . is NOT allowed + * More than 2 tables can be joined in the same query e.g. + {{{ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) }}} - is a valid join. + . is a valid join. 
+ * Hive converts joins over multiple tables into a single map/reduce job if for every table the same column is used in the join clauses e.g. + {{{ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) }}} - is converted into a single map/reduce job as only key1 column for b is involved in the join. On the other hand + . is converted into a single map/reduce job as only key1 column for b is involved in the join. On the other hand + {{{ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) }}} - is converted into two map/reduce jobs because key1 column from b is used in the first join condition and key2 column from b is used in the second one. The first map/reduce job joins a with b and the results are then joined with c in the second map/reduce job. + . is converted into two map/reduce jobs because key1 column from b is used in the first join condition and key2 column from b is used in the second one. The first map/reduce job joins a with b and the results are then joined with c in the second map/reduce job. + * In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in + {{{ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) }}} - all the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables a and b are buffered in the memory in the reducers. Then for each row retrieved from c, the join is computed with the buffered rows. Similarly for + . all the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables a and b are buffered in the memory in the reducers. 
Then for each row retrieved from c, the join is computed with the buffered rows. Similarly for + {{{ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) }}} - there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second of one of these jobs buffers the results of the first join while streaming the values of c through the reducers. + . there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second of these jobs buffers the results of the first join while streaming the values of c through the reducers. + * In every map/reduce stage of the joi
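The buffer-versus-stream behavior described above can be sketched for a single join key as follows; `StreamedJoinSketch` and its string-valued rows are hypothetical stand-ins for the reducer logic, not Hive code:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the reducer-side behavior described above: for one value of the
 * join key, rows from every table except the last are held in memory, and
 * rows of the last table are streamed one at a time against that buffer.
 */
public class StreamedJoinSketch {

    /** Join all buffered rows with each streamed row for one join key. */
    static List<String> joinForKey(List<String> bufferedRows, Iterable<String> streamedRows) {
        List<String> out = new ArrayList<>();
        for (String s : streamedRows) {       // streamed table: one row at a time
            for (String b : bufferedRows) {   // buffered tables: fully in memory
                out.add(b + "," + s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows of a (buffered, so ideally the smaller table) and c (streamed last).
        List<String> a = List.of("a1", "a2");
        List<String> c = List.of("c1", "c2", "c3");
        System.out.println(joinForKey(a, c).size()); // 6
    }
}
```

Since only the buffered tables consume reducer memory, listing the largest table last, as the bullet above recommends, keeps the in-memory footprint small.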