GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1631
[SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark sortOrder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1631.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1631 ---- commit 2b8d89e30ebfe2272229a1eddd7542d7437c9924 Author: Cheng Hao <hao.ch...@intel.com> Date: 2014-07-28T17:59:53Z [SPARK-2523] [SQL] Hadoop table scan bug fixing In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table & partitions. This is the follow up with: https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 I've run a micro benchmark in my local with 15000000 records totally, and got the result as below: With This Patch | Partition-Based Table | Non-Partition-Based Table ------------ | ------------- | ------------- No | 1927 ms | 1885 ms Yes | 1541 ms | 1524 ms It showed this patch will also improve the performance. PS: the benchmark code is also attached. (thanks liancheng ) ``` package org.apache.spark.sql.hive import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql._ object HiveTableScanPrepare extends App { case class Record(key: String, value: String) val sparkContext = new SparkContext( new SparkConf() .setMaster("local") .setAppName(getClass.getSimpleName.stripSuffix("$"))) val hiveContext = new LocalHiveContext(sparkContext) val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i"))) import hiveContext._ hql("SHOW TABLES") hql("DROP TABLE if exists part_scan_test") hql("DROP TABLE if exists scan_test") hql("DROP TABLE if exists records") rdd.registerAsTable("records") hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING) | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' | STORED AS RCFILE """.stripMargin) hql("""CREATE TABLE scan_test (key STRING, value STRING) | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' | STORED AS RCFILE """.stripMargin) for (part1 <- 2000 until 2001) { for (part2 <- 1 to 5) { hql(s"""from records | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2') | select key, value """.stripMargin) hql(s"""from records | insert into table scan_test select key, value """.stripMargin) } } } object HiveTableScanTest extends App { val sparkContext = new SparkContext( new SparkConf() .setMaster("local") .setAppName(getClass.getSimpleName.stripSuffix("$"))) val hiveContext = new LocalHiveContext(sparkContext) import hiveContext._ hql("SHOW TABLES") val part_scan_test = hql("select key, value from part_scan_test") val scan_test = hql("select key, value from scan_test") val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test)) val r_scan_test = (0 to 5).map(i => benchmark(scan_test)) println("Scanning Partition-Based Table") r_part_scan_test.foreach(printResult) println("Scanning Non-Partition-Based Table") r_scan_test.foreach(printResult) def printResult(result: (Long, Long)) { println(s"Duration: ${result._1} ms Result: ${result._2}") } def benchmark(srdd: SchemaRDD) = { val begin = System.currentTimeMillis() val result = srdd.count() val end = System.currentTimeMillis() ((end - begin), result) } } ``` Author: Cheng Hao <hao.ch...@intel.com> Closes #1439 from chenghao-intel/hadoop_table_scan and squashes the following commits: 888968f [Cheng Hao] Fix issues in code style 27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs 40a24a7 [Cheng Hao] Add Unit Test commit 255b56f9f530e8594a7e6055ae07690454c66799 Author: DB Tsai <dbt...@alpinenow.com> Date: 2014-07-28T18:34:19Z [SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors. Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result. That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored. Based on discussion in the community, we have implemented two different APIs for relative tolerance, and absolute tolerance. It makes sense that test writers should know which one they need depending on their circumstances. Developers also need to explicitly specify the eps, and there is no default value which will sometimes cause confusion. When comparing against zero using relative tolerance, a exception will be raised to warn users that it's meaningless. For relative tolerance, users can now write assert(23.1 ~== 23.52 relTol 0.02) assert(23.1 ~== 22.74 relTol 0.02) assert(23.1 ~= 23.52 relTol 0.02) assert(23.1 ~= 22.74 relTol 0.02) assert(!(23.1 !~= 23.52 relTol 0.02)) assert(!(23.1 !~= 22.74 relTol 0.02)) // This will throw exception with the following message. // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance." assert(23.1 !~== 23.52 relTol 0.02) // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance." assert(23.1 ~== 22.34 relTol 0.02) For absolute error, assert(17.8 ~== 17.99 absTol 0.2) assert(17.8 ~== 17.61 absTol 0.2) assert(17.8 ~= 17.99 absTol 0.2) assert(17.8 ~= 17.61 absTol 0.2) assert(!(17.8 !~= 17.99 absTol 0.2)) assert(!(17.8 !~= 17.61 absTol 0.2)) // This will throw exception with the following message. // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error." assert(17.8 !~== 17.99 absTol 0.2) // "Expected 17.8 and 17.59 to be within 0.2 using absolute error." assert(17.8 ~== 17.59 absTol 0.2) Authors: DB Tsai <dbtsaialpinenow.com> Marek Kolodziej <marekalpinenow.com> Author: DB Tsai <dbt...@alpinenow.com> Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits: 8c7cbcc [DB Tsai] Alpine Data Labs commit a7a9d14479ea6421513a962ff0f45cb969368bab Author: Cheng Lian <lian.cs....@gmail.com> Date: 2014-07-28T19:07:30Z [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix) JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410) Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar. In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity. Author: Cheng Lian <lian.cs....@gmail.com> Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits: 629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server commit 39ab87b924ad65b6b9b7aa6831f3e9ddc2b76dd7 Author: Aaron Davidson <aa...@databricks.com> Date: 2014-07-28T20:37:44Z Use commons-lang3 in SignalLogger rather than commons-lang Spark only transitively depends on the latter, based on the Hadoop version. Author: Aaron Davidson <aa...@databricks.com> Closes #1621 from aarondav/lang3 and squashes the following commits: 93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than commons-lang commit 16ef4d110f15dfe66852802fdadfe2ed7574ddc2 Author: Yadong Qi <qiyadong2...@gmail.com> Date: 2014-07-29T04:39:02Z Excess judgment Author: Yadong Qi <qiyadong2...@gmail.com> Closes #1629 from watermen/bug-fix2 and squashes the following commits: 59b7237 [Yadong Qi] Update HiveQl.scala commit c9d37e1bacaff2be9ee9174a2965fdc2e9a04245 Author: Reynold Xin <r...@apache.org> Date: 2014-07-29T05:15:05Z [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---