[GitHub] spark pull request: [deprecated] [SPARK-5821] [SPARK-5746] [SQL] J...
Github user yanbohappy closed the pull request at: https://github.com/apache/spark/pull/4607
[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4607#issuecomment-74405767 Actually, the insert function (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L107) is never called; the CTAS command is executed directly at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L81
[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4527#issuecomment-74424959 cc @liancheng @rxin @yhuai
[GitHub] spark pull request: [SPARK-5821] [SQL] JSON external data source i...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4610#issuecomment-74425262 @yhuai OK, if we don't support writing to a table while reading from it, the code will be more concise. I will rewrite the code to focus on SPARK-5821. Thank you for your comments.
[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4527#issuecomment-74424908 This improvement is very similar to #758, so I ran a similar performance test. The benchmark suggests the optimization makes scanning a JSON table about 1.5x to 2x faster, though the gain depends on the JSON schema, in particular on whether different records have different schemas. For a JSON file with 188,010 lines, the build-scan times were: original: 15598 ms; optimized: 10152 ms.
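A minimal timing harness in the spirit of that test might look as follows (Spark 1.x API; the file path, app name, and the use of jsonFile/registerTempTable are illustrative assumptions, not the PR's actual benchmark code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonScanBenchmark {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("json-scan-bench").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        // Load and register the JSON table once, so schema inference is not timed.
        sqlContext.jsonFile("/path/to/test.json").registerTempTable("t") // hypothetical path
        val start = System.nanoTime()
        // A full scan forces the string-to-row conversion to run for every record.
        sqlContext.sql("SELECT * FROM t").count()
        println(s"Takes ${(System.nanoTime() - start) / 1000000} ms")
        sc.stop()
      }
    }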
[GitHub] spark pull request: JSON external data source INSERT improvements ...
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/4610 JSON external data source INSERT improvements initial draft

JSON external data source INSERT operation improvements and bug fix:

1. The path in CREATE TABLE AS SELECT must be a directory, whether or not it already exists: we use the directory to represent the table. In this scenario we need to write (INSERT OVERWRITE) or append (INSERT INTO) data to the existing table, and we can't append to the HDFS files that back an RDD; to implement append semantics, we need new files added to the table's directory. Another reason is that a directory-based table is more reasonable for access control, authentication, and authorization. As SPARK-5821 mentions, if we don't have write permission on the parent directory of the table, the CTAS command will fail. It is reasonable that we are not granted access rights to directories outside the table's scope, so basing the table on a directory is the better choice. This restriction applies only to CREATE TABLE AS SELECT; other DDL such as CREATE TABLE can be based on an ordinary file or a directory, because it only scans the table without inserting new data.

2. New INSERT OVERWRITE implementation. First insert the newly generated table into a temporary directory named _temporary under the table path. After the insert finishes, delete the original files. Finally, rename _temporary to data. This fixes the bug described in SPARK-5746. (A sketch of this staged flow appears after this message.)

3. Why rename _temporary to data rather than move all the files in _temporary into the path and then delete _temporary? Because RDD.saveAsTextFile(path) and related operations store the whole RDD as part-* files under the path. If the original files were produced this way and we then run INSERT without OVERWRITE, the newly generated table files are also named part-*, which would corrupt the table.

Todo:
1. If an existing RDD based on path a/b/c has already been cached, then after an INSERT we need to recompute that RDD by rescanning the directory. Can we trigger a rescan after the INSERT?
2. Is renaming _temporary to data, as described above, enough? If the base directory was produced by another CTAS command, it will already contain a data directory. Should we append a unique number after data, or just use the jobId or taskId to identify the subdirectory? Resolving these problems in a follow-up PR may be better.

This is the initial draft and needs optimization. Looking forward to your comments.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark jsonInsertImprovements
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4610.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #4610

commit 307278ff6bdcd1d1a5a50650fb3dfa6da3db070f
Author: Yanbo Liang yanboha...@gmail.com
Date: 2015-02-15T05:23:14Z
JSON external data source INSERT improvements initial draft
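A minimal sketch of the staged-overwrite flow from points 2 and 3, using the Hadoop FileSystem API (writeResult is a placeholder for whatever materializes the SELECT output as files; error handling and the Todo items above are deliberately ignored):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Sketch of INSERT OVERWRITE into a directory-backed JSON table.
    def insertOverwrite(tablePath: String, writeResult: Path => Unit): Unit = {
      val table = new Path(tablePath)
      val fs = table.getFileSystem(new Configuration())
      val staging = new Path(table, "_temporary")

      // 1. Write the new table contents into the staging directory first,
      //    so a failure here leaves the original data untouched.
      writeResult(staging)

      // 2. Delete the original files (everything except the staging directory).
      fs.listStatus(table)
        .filterNot(_.getPath == staging)
        .foreach(s => fs.delete(s.getPath, true))

      // 3. Rename the staging directory to `data` in one metadata operation,
      //    instead of moving individual part-* files, so a later INSERT INTO
      //    cannot collide with existing part-* names at the top level.
      fs.rename(staging, new Path(table, "data"))
    }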
[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4607#issuecomment-74405701 @yhuai Thank you for your reply. Adding an analysis rule and throwing an exception is reasonable; looking forward to your PR. I can address SPARK-5821: I'm working on another PR, #4610, which not only resolves SPARK-5821 but also adds some improvements. Could I close this PR and discuss the JSON data source improvements at #4610?
[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/4607 [SPARK-5821] [SPARK-5746] [SQL] JSON data source refactor initial draft

JSON data source refactor:

1. The path in CREATE TABLE AS SELECT must be a directory, because in this scenario we need to write or append files to the existing table, and an underlying directory is more reasonable for append operations, authentication, and authorization. For SPARK-5821: if we don't have write permission on the parent directory, the CTAS command will fail. Another reason is that we can't append to the HDFS files that back an RDD; to implement append semantics, we need new files added to a specific directory.

2. New INSERT OVERWRITE implementation. First insert the newly generated table into a temporary directory named _temporary under the table path. After the insert finishes, delete the original files. Finally, rename _temporary to data. This fixes the bug described in SPARK-5746. Why rename _temporary to data rather than move all the files in _temporary into the path and then delete _temporary? Because RDD.saveAsTextFile(path) and related operations store the whole RDD as part-* files under the path. If the original files were produced this way and we then run INSERT without OVERWRITE, the newly generated table files are also named part-*, which would corrupt the table.

This is the initial draft and needs optimization. Looking forward to your opinions and comments.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark JSONDataSourceRefactor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4607.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #4607

commit 8683a483c074f692152159d63a101f78c3c3fe58
Author: Yanbo Liang yanboha...@gmail.com
Date: 2015-02-14T17:37:05Z
JSON data source refactor initial draft
[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4527#issuecomment-74388397 @yhuai This improvement is very similar to #758, so I leveraged the performance test from there. The benchmark suggests the optimization makes scanning a JSON table about 1.5x faster, though the result is not very stable. For a JSON file with 188,010 lines, the build-scan times were: original: 15598 ms; optimized: 10152 ms.
[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...
Github user yanbohappy commented on a diff in the pull request: https://github.com/apache/spark/pull/4527#discussion_r24574169

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -39,7 +39,19 @@ private[sql] object JsonRDD extends Logging {
     json: RDD[String],
     schema: StructType,
     columnNameOfCorruptRecords: String): RDD[Row] = {
-    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
+    // Reuse the mutable row for each record; however, we still need to
+    // create a new row for every nested struct type in each record
+    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
--- End diff --

You are right, it's not appropriate to use SpecificMutableRow here. I will change it back to GenericMutableRow.
[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4527#issuecomment-74059428 @chenghao-intel @yhuai Thank you for your advice; it was very useful. We can now use mutable rows for both top-level records and inner structures.
[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4527#issuecomment-73855155 retest this please
[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4527#issuecomment-73850498 https://issues.apache.org/jira/browse/SPARK-5738
[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/4527 [SQL] Reuse mutable row for each record at jsonStringToRow

When converting a JSON string to a row, reuse one mutable row for every record instead of creating a new one per record. For the nested struct types within a record, however, we still need to create new rows. (A sketch of the pattern appears after this message.)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark jsonStringToRowOptimization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4527.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #4527

commit b0c2b145950c18e30a8e88086e018cb66931fbec
Author: Yanbo Liang yanboha...@gmail.com
Date: 2015-02-11T08:36:50Z
[SQL] Reuse mutable row for each record at jsonStringToRow
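A self-contained sketch of the reuse pattern (plain Scala; MutableRow here is a hypothetical stand-in for Spark's internal mutable row classes). Note that every returned row aliases the same shared buffer, so a consumer that keeps rows across next() calls must copy them — exactly the hazard discussed under SPARK-4963 later in this thread:

    // Hypothetical stand-in for Spark's mutable row classes.
    final class MutableRow(arity: Int) {
      private val values = new Array[Any](arity)
      def update(i: Int, v: Any): Unit = values(i) = v
      def apply(i: Int): Any = values(i)
      override def toString: String = values.mkString("[", ",", "]")
    }

    def toRows(records: Iterator[Map[String, Any]], fields: Seq[String]): Iterator[MutableRow] = {
      // One row allocated up front, overwritten for every record.
      val row = new MutableRow(fields.length)
      records.map { rec =>
        var i = 0
        while (i < fields.length) {
          row.update(i, rec.getOrElse(fields(i), null))
          i += 1
        }
        row // caller must consume (or copy) before calling next()
      }
    }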
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71626077 Moved the Describe command parser into DDLParser, which does not throw an error on a parse exception. That way, HiveContext-specific describe commands can be parsed by the fallback Hive parser and run as Hive native commands.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71794484 @OopsOutOfMemory Since you have gone deep into this issue and I agree your PR is more mature, I will close this one.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71793953 @lianhuiwang In PR https://github.com/apache/spark/pull/3948, CommandStrategy was removed and commands were refactored.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user yanbohappy closed the pull request at: https://github.com/apache/spark/pull/4207
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71482088 https://issues.apache.org/jira/browse/SPARK-5324
[GitHub] spark pull request: Implement Describe Table for SQLContext
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/4207 Implement Describe Table for SQLContext

Initial code snippet for the Describe Table command.

#1 SQL parsing and logical plan generation. Add a DESCRIBE [FORMATTED] [db_name.]table_name command parser to SparkSQLParser and generate the same logical plan for both SQLContext and HiveContext. (Note: HiveContext also supports DESCRIBE [FORMATTED] [db_name.]table_name PARTITION partition_column_name, and DESCRIBE [FORMATTED] [db_name.]table_name column_name is implemented via a Hive native command.) A sketch of such a grammar rule appears after this message.

#2 Implement DescribeCommand, which is invoked as a RunnableCommand.

#3 For SQLContext the code is cleanly structured, but for HiveContext the output of the describe command needs to stay the same as Hive's. So for HiveContext we still translate the logical command into DescribeHiveTableCommand, which is already implemented in HiveStrategies.HiveCommandStrategy.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark spark-5324
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4207.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #4207

commit 7fb2e81d682d8f641275077b94dfb6d7de466de8
Author: Yanbo Liang yanboha...@gmail.com
Date: 2015-01-26T15:29:10Z
Implement Describe Table for SQLContext
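A self-contained sketch of what such a grammar rule can look like, written with Scala's standard parser combinators rather than Spark's internal parser classes (the DescribeCommand case class here is illustrative, not the PR's actual logical plan node):

    import scala.util.parsing.combinator.JavaTokenParsers

    // Illustrative logical-plan node; the real PR defines its own DescribeCommand.
    case class DescribeCommand(table: String, db: Option[String], formatted: Boolean)

    object DescribeParser extends JavaTokenParsers {
      // Case-insensitive keyword helper.
      private def kw(word: String): Parser[String] = ("""(?i)\b""" + word + """\b""").r

      // DESCRIBE [FORMATTED] [db_name.]table_name
      def describe: Parser[DescribeCommand] =
        kw("DESCRIBE") ~> opt(kw("FORMATTED")) ~ opt(ident <~ ".") ~ ident ^^ {
          case formatted ~ db ~ table => DescribeCommand(table, db, formatted.isDefined)
        }

      def parse(sql: String): Option[DescribeCommand] =
        parseAll(describe, sql) match {
          case Success(cmd, _) => Some(cmd) // handled here
          case _               => None      // fall through to another parser (e.g. Hive)
        }
    }

    // DescribeParser.parse("DESCRIBE FORMATTED mydb.people")
    //   => Some(DescribeCommand("people", Some("mydb"), formatted = true))

Returning None instead of throwing on a parse failure is what lets a fallback parser (as in the DDLParser comment above) take over for Hive-specific variants.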
[GitHub] spark pull request: SPARK-4963 [SQL] Add copy to SQL's Sample oper...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/3827#issuecomment-69160371 Can anyone verify and merge this patch? The bug appears frequently, so fixing it as soon as possible would be better.
[GitHub] spark pull request: SPARK-4963 [SQL] HiveTableScan return mutable ...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/3827#issuecomment-68339079 @liancheng I agree with moving the copy call into execution.Sample.execute and have added new commits. It has no effect on HiveTableScan.
[GitHub] spark pull request: SPARK-4963 [SQL] Add copy to SQL's Sample oper...
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/3827#issuecomment-68425561 Changed for better test output and moved the test to another test file, which is more reasonable.
[GitHub] spark pull request: HiveTableScan return mutable row with copy
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/3827 HiveTableScan return mutable row with copy

https://issues.apache.org/jira/browse/SPARK-4963

SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows. HiveTableScan builds its RDD with SpecificMutableRow, and SchemaRDD.sample() returns a GapSamplingIterator for iteration:

override def next(): T = {
  val r = data.next()
  advance
  r
}

GapSamplingIterator.next() returns the current underlying element, assigning it to r. However, when the underlying iterator yields mutable rows, as HiveTableScan's does, the underlying iterator and r point to the same object. The advance operation drops some underlying elements, but in doing so it also changes r, which is not expected, so we return a value different from the initial r. (A self-contained reproduction appears after this message.)

To fix this issue, the most direct way is to make HiveTableScan return mutable rows with copy, as in my initial commit. This solution means HiveTableScan cannot take full advantage of the reusable MutableRow, but it makes the sample operation return correct results. Going further, we should investigate GapSamplingIterator.next() and make it copy internally; to achieve that, every element an RDD can store would have to implement something like a cloneable interface, which would be a huge change.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark spark-4963
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3827.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #3827

commit 6eaee5e7b1b5aca7f6abd16892f8312c7d6d7917
Author: Yanbo Liang yanboha...@gmail.com
Date: 2014-12-29T09:00:44Z
HiveTableScan return mutable row with copy
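A Spark-free reproduction of the aliasing bug (MutableRow is a hypothetical stand-in for SpecificMutableRow):

    // Hypothetical stand-in for Spark's SpecificMutableRow.
    final class MutableRow(var value: Int) {
      def copy(): MutableRow = new MutableRow(value)
      override def toString: String = s"Row($value)"
    }

    // A scan that reuses ONE row object for every record, like HiveTableScan.
    def scan(values: Seq[Int]): Iterator[MutableRow] = {
      val row = new MutableRow(0)
      values.iterator.map { v => row.value = v; row }
    }

    // A sampler that, like GapSamplingIterator, grabs a reference to the current
    // element and then advances past skipped ones -- mutating the shared row.
    def gapSample(data: Iterator[MutableRow]): Iterator[MutableRow] =
      new Iterator[MutableRow] {
        def hasNext: Boolean = data.hasNext
        def next(): MutableRow = {
          val r = data.next()            // r aliases the shared row
          if (data.hasNext) data.next()  // "advance" also changes r
          r
        }
      }

    // gapSample(scan(Seq(1, 2, 3, 4))).toList
    //   => List(Row(4), Row(4))   -- every element aliases the one shared row
    // With the PR's fix -- scan returning row.copy() per record -- the result is
    //   List(Row(1), Row(3)), at the cost of losing mutable-row reuse.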
[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/2978#issuecomment-60883826 Renamed parameters and function names to be consistent with Spark naming rules. Deleted unused columns and made prediction the first column. Added an explanation of and references for r2Score and explained variance. Other code-style cleanups.
[GitHub] spark pull request: add regression metrics
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/2978 add regression metrics

Add RegressionMetrics.scala, regression metrics used for evaluation, and the corresponding test case RegressionMetricsSuite.scala.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark regression_metrics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2978.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #2978

commit 43bb12b0ce64a70c5d655ad19930b17a7d69ab6e
Author: liangyanbo liangya...@meituan.com
Date: 2014-10-28T11:23:09Z
add regression metrics
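A plain-Scala sketch of the kind of metrics such a class provides, computed eagerly over a Seq here; the actual class works on an RDD of (value, pred) pairs via MultivariateOnlineSummarizer in a single pass, and also reports explained variance:

    // Sketch only: eager, in-memory versions of standard regression metrics.
    case class RegressionMetricsSketch(valuesAndPreds: Seq[(Double, Double)]) {
      private val n = valuesAndPreds.size.toDouble
      private val residuals = valuesAndPreds.map { case (value, pred) => value - pred }
      private val meanValue = valuesAndPreds.map(_._1).sum / n

      def meanAbsoluteError: Double = residuals.map(math.abs).sum / n
      def meanSquaredError: Double = residuals.map(r => r * r).sum / n
      // R^2 = 1 - SS_res / SS_tot
      def r2: Double = {
        val ssRes = residuals.map(r => r * r).sum
        val ssTot = valuesAndPreds.map { case (v, _) => (v - meanValue) * (v - meanValue) }.sum
        1.0 - ssRes / ssTot
      }
    }

    // val m = RegressionMetricsSketch(Seq((3.0, 2.5), (-0.5, 0.0), (2.0, 2.0), (7.0, 8.0)))
    // m.meanSquaredError  => 0.375
    // m.r2                => ~0.9486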
[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/2978#issuecomment-60745341 Renamed re_score() and removed the unused column.
[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics
Github user yanbohappy commented on a diff in the pull request: https://github.com/apache/spark/pull/2978#discussion_r19465890

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.mllib.rdd.RDDFunctions._
+
+/**
+ * :: Experimental ::
+ * Evaluator for regression.
+ *
+ * @param valuesAndPreds an RDD of (value, pred) pairs.
+ */
+@Experimental
+class RegressionMetrics(valuesAndPreds: RDD[(Double, Double)]) extends Logging {
+
+  /**
+   * Use MultivariateOnlineSummarizer to calculate the mean and variance of different
+   * combinations. MultivariateOnlineSummarizer is a numerically stable algorithm to
+   * compute mean and variance in an online fashion.
+   */
+  private lazy val summarizer: MultivariateOnlineSummarizer = {
+    val summarizer: MultivariateOnlineSummarizer = valuesAndPreds.map {
+      case (value, pred) => Vectors.dense(
+        Array(value, pred, value - pred, math.abs(value - pred), math.pow(value - pred, 2.0))
--- End diff --

Yes, it's not used and I have removed it in a new commit.