[GitHub] spark pull request: [SPARK-2726] and [SPARK-2727] Remove SortOrder...

rxin Mon, 28 Jul 2014 22:21:16 -0700

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/1631


    [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark sortOrder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1631.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1631
    
----
commit 2b8d89e30ebfe2272229a1eddd7542d7437c9924
Author: Cheng Hao <hao.ch...@intel.com>
Date:   2014-07-28T17:59:53Z

    [SPARK-2523] [SQL] Hadoop table scan bug fixing
    
    In HiveTableScan.scala, ObjectInspector was created for all of the
    partition-based records, which can cause a ClassCastException if the
    object inspector is not identical between the table and its partitions.
    
    This is a follow-up to:
    https://github.com/apache/spark/pull/1408
    https://github.com/apache/spark/pull/1390
    
    I ran a micro benchmark locally with 15,000,000 records in total, and
    got the results below:
    
    With This Patch  |  Partition-Based Table  |  Non-Partition-Based Table
    ------------ | ------------- | -------------
    No  |  1927 ms  |  1885 ms
    Yes  | 1541 ms  |  1524 ms
    
    The patch also improves scan performance, by about 20% for the
    partition-based table and about 19% for the non-partition-based table.
    
    PS: The benchmark code is attached below (thanks, liancheng).
    ```
    package org.apache.spark.sql.hive
    
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.sql._
    
    object HiveTableScanPrepare extends App {
      case class Record(key: String, value: String)
    
      val sparkContext = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
    
      val hiveContext = new LocalHiveContext(sparkContext)
    
      val rdd = sparkContext.parallelize(
        (1 to 3000000).map(i => Record(s"$i", s"val_$i")))
    
      import hiveContext._
    
      hql("SHOW TABLES")
      hql("DROP TABLE if exists part_scan_test")
      hql("DROP TABLE if exists scan_test")
      hql("DROP TABLE if exists records")
      rdd.registerAsTable("records")
    
      hql("""CREATE TABLE part_scan_test (key STRING, value STRING)
                     | PARTITIONED BY (part1 string, part2 STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
      hql("""CREATE TABLE scan_test (key STRING, value STRING)
                     | ROW FORMAT SERDE
                     | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                     | STORED AS RCFILE
                   """.stripMargin)
    
      for (part1 <- 2000 until 2001) {
        for (part2 <- 1 to 5) {
          hql(s"""from records
                     | insert into table part_scan_test
                     | PARTITION (part1='$part1', part2='2010-01-$part2')
                     | select key, value
                   """.stripMargin)
          hql(s"""from records
                     | insert into table scan_test select key, value
                   """.stripMargin)
        }
      }
    }
    
    object HiveTableScanTest extends App {
      val sparkContext = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
    
      val hiveContext = new LocalHiveContext(sparkContext)
    
      import hiveContext._
    
      hql("SHOW TABLES")
      val part_scan_test = hql("select key, value from part_scan_test")
      val scan_test = hql("select key, value from scan_test")
    
      val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
      val r_scan_test = (0 to 5).map(i => benchmark(scan_test))
      println("Scanning Partition-Based Table")
      r_part_scan_test.foreach(printResult)
      println("Scanning Non-Partition-Based Table")
      r_scan_test.foreach(printResult)
    
      def printResult(result: (Long, Long)) {
        println(s"Duration: ${result._1} ms Result: ${result._2}")
      }
    
      def benchmark(srdd: SchemaRDD) = {
        val begin = System.currentTimeMillis()
        val result = srdd.count()
        val end = System.currentTimeMillis()
        ((end - begin), result)
      }
    }
    ```
    
    Author: Cheng Hao <hao.ch...@intel.com>
    
    Closes #1439 from chenghao-intel/hadoop_table_scan and squashes the
    following commits:
    
    888968f [Cheng Hao] Fix issues in code style
    27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
    40a24a7 [Cheng Hao] Add Unit Test

commit 255b56f9f530e8594a7e6055ae07690454c66799
Author: DB Tsai <dbt...@alpinenow.com>
Date:   2014-07-28T18:34:19Z

    [SPARK-2479][MLlib] Comparing floating-point numbers using relative error
    in UnitTests
    
    Floating-point math is not exact, and most floating-point numbers end up
    being slightly imprecise due to rounding errors.
    
    Simple values like 0.1 cannot be precisely represented using binary
    floating-point numbers, and the limited precision of floating-point
    numbers means that slight changes in the order of operations or the
    precision of intermediates can change the result.
    
    That means that comparing two floats to see if they are equal is usually
    not what we want. As long as this imprecision stays small, it can usually
    be ignored.
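    
    A minimal Scala sketch of the effect described above. It is illustrative
    only; the object and value names are hypothetical and not part of this
    patch:
    
    ```scala
    object FloatImprecision {
      // 0.1 and 0.2 have no exact binary representation, so their sum is
      // not exactly the double closest to 0.3.
      val sum: Double = 0.1 + 0.2

      // Changing only the order of the additions changes the result.
      val leftFirst: Double = (0.1 + 0.2) + 0.3
      val rightFirst: Double = 0.1 + (0.2 + 0.3)

      def main(args: Array[String]): Unit = {
        println(s"sum == 0.3: ${sum == 0.3}")
        println(s"|sum - 0.3| < 1e-12: ${math.abs(sum - 0.3) < 1e-12}")
        println(s"leftFirst == rightFirst: ${leftFirst == rightFirst}")
      }
    }
    ```
    
    Exact equality fails in both cases even though the values differ only in
    the last bits, which is why the tests compare within a tolerance instead.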
    
    Based on discussion in the community, we have implemented two different
    APIs, one for relative tolerance and one for absolute tolerance. Test
    writers should know which one they need depending on their circumstances.
    
    Developers also need to specify eps explicitly; there is no default
    value, since a default would sometimes cause confusion.
    
    When comparing against zero using relative tolerance, an exception will
    be raised to warn users that it's meaningless.
    
    For relative tolerance, users can now write
    
        assert(23.1 ~== 23.52 relTol 0.02)
        assert(23.1 ~== 22.74 relTol 0.02)
        assert(23.1 ~= 23.52 relTol 0.02)
        assert(23.1 ~= 22.74 relTol 0.02)
        assert(!(23.1 !~= 23.52 relTol 0.02))
        assert(!(23.1 !~= 22.74 relTol 0.02))
    
        // This will throw an exception with the following message.
        // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
        assert(23.1 !~== 23.52 relTol 0.02)
    
        // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
        assert(23.1 ~== 22.34 relTol 0.02)
    
    For absolute error,
    
        assert(17.8 ~== 17.99 absTol 0.2)
        assert(17.8 ~== 17.61 absTol 0.2)
        assert(17.8 ~= 17.99 absTol 0.2)
        assert(17.8 ~= 17.61 absTol 0.2)
        assert(!(17.8 !~= 17.99 absTol 0.2))
        assert(!(17.8 !~= 17.61 absTol 0.2))
    
        // This will throw an exception with the following message.
        // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
        assert(17.8 !~== 17.99 absTol 0.2)
    
        // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
        assert(17.8 ~== 17.59 absTol 0.2)
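    
    The two kinds of comparison can be sketched as plain functions. This is
    a sketch under assumptions, not the code in this patch: the names
    `withinRelTol` and `withinAbsTol` are hypothetical, and scaling eps by
    the smaller of the two magnitudes is one common choice that happens to
    agree with the examples above.
    
    ```scala
    object ToleranceSketch {
      // Hypothetical helper: relative comparison. The allowed difference
      // scales with the smaller magnitude (an assumption of this sketch).
      def withinRelTol(x: Double, y: Double, eps: Double): Boolean = {
        if (x == 0.0 || y == 0.0) {
          // Against zero the allowed difference would be zero as well,
          // so the comparison is meaningless; fail loudly instead.
          throw new IllegalArgumentException(
            s"Relative tolerance is meaningless when comparing $x and $y.")
        }
        math.abs(x - y) < eps * math.min(math.abs(x), math.abs(y))
      }

      // Hypothetical helper: absolute comparison with a fixed bound.
      def withinAbsTol(x: Double, y: Double, eps: Double): Boolean =
        math.abs(x - y) < eps

      def main(args: Array[String]): Unit = {
        println(withinRelTol(23.1, 23.52, 0.02))
        println(withinRelTol(23.1, 22.34, 0.02))
        println(withinAbsTol(17.8, 17.99, 0.2))
        println(withinAbsTol(17.8, 17.59, 0.2))
      }
    }
    ```
    
    Relative tolerance suits values whose scale varies widely, while
    absolute tolerance suits values expected to be near zero.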
    
    Authors:
      DB Tsai <dbtsaialpinenow.com>
      Marek Kolodziej <marekalpinenow.com>
    
    Author: DB Tsai <dbt...@alpinenow.com>
    
    Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes
    the following commits:
    
    8c7cbcc [DB Tsai] Alpine Data Labs

commit a7a9d14479ea6421513a962ff0f45cb969368bab
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-28T19:07:30Z

    [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
    
    JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
    
    Another try for #1399 & #1600. Those two PRs broke Jenkins builds because
    we made a separate profile `hive-thriftserver` in sub-project `assembly`,
    but the `hive-thriftserver` module was defined outside the
    `hive-thriftserver` profile. Thus every pull request, even one that
    doesn't touch SQL code, also executed the test suites defined in
    `hive-thriftserver`, and those tests failed because the related .class
    files were not included in the assembly jar.
    
    In the most recent commit, module `hive-thriftserver` is moved into its
    own profile to fix this problem. All previous commits are squashed for
    clarity.
    
    Author: Cheng Lian <lian.cs....@gmail.com>
    
    Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the
    following commits:
    
    629988e [Cheng Lian] Moved hive-thriftserver module definition into its
    own profile
    ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server

commit 39ab87b924ad65b6b9b7aa6831f3e9ddc2b76dd7
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-07-28T20:37:44Z

    Use commons-lang3 in SignalLogger rather than commons-lang
    
    Spark only transitively depends on the latter, based on the Hadoop version.
    
    Author: Aaron Davidson <aa...@databricks.com>
    
    Closes #1621 from aarondav/lang3 and squashes the following commits:
    
    93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than
    commons-lang

commit 16ef4d110f15dfe66852802fdadfe2ed7574ddc2
Author: Yadong Qi <qiyadong2...@gmail.com>
Date:   2014-07-29T04:39:02Z

    Excess judgment
    
    Author: Yadong Qi <qiyadong2...@gmail.com>
    
    Closes #1629 from watermen/bug-fix2 and squashes the following commits:
    
    59b7237 [Yadong Qi] Update HiveQl.scala

commit c9d37e1bacaff2be9ee9174a2965fdc2e9a04245
Author: Reynold Xin <r...@apache.org>
Date:   2014-07-29T05:15:05Z

    [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
