[GitHub] spark pull request: [deprecated] [SPARK-5821] [SPARK-5746] [SQL] J...

2015-02-15 Thread yanbohappy
Github user yanbohappy closed the pull request at:

https://github.com/apache/spark/pull/4607





[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

2015-02-15 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4607#issuecomment-74405767
  
Actually, the insert function 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L107)
 is never called. The CTAS command is simply executed at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L81
 





[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

2015-02-15 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4527#issuecomment-74424959
  
cc @liancheng @rxin @yhuai 





[GitHub] spark pull request: [SPARK-5821] [SQL] JSON external data source i...

2015-02-15 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4610#issuecomment-74425262
  
@yhuai 
OK, if we don't support writing to a table while reading from it, the code will be 
more concise.
I will try to rewrite the code to focus on SPARK-5821.
Thank you for your comments.





[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

2015-02-15 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4527#issuecomment-74424908
  
This improvement is very similar to #758, so I have run a similar 
performance test.
The benchmark suggests the optimized version is about 1.5x to 2x faster when 
scanning a JSON table, though the gain depends on the JSON schema, especially 
on whether different records have different schemas.
For a JSON file with 188010 lines, the table scan took:
original: 15598 ms
optimized: 10152 ms






[GitHub] spark pull request: JSON external data source INSERT improvements ...

2015-02-15 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/4610

JSON external data source INSERT improvements initial draft



JSON external data source INSERT operation improvements and bug fixes:
1. The path in CREATE TABLE AS SELECT must be a directory, whether or not it 
already exists; we use the directory to represent the table.
In this scenario we need to write (INSERT OVERWRITE) or append (INSERT INTO) 
data to the existing table, and we can't append to the HDFS files that back an 
RDD. To implement append semantics, we have to create new files and add them 
to the table's directory.
Another reason is that basing the table on a directory is better for access 
control, authentication, and authorization. As SPARK-5821 mentions, if we 
don't have write permission on the parent directory of the table, the CTAS 
command will fail. It is reasonable not to be granted access rights to 
directories outside the table's scope, so basing the table on a directory may 
be the better choice.
This restriction applies only to CREATE TABLE AS SELECT; other DDL such as 
CREATE TABLE can be based on an ordinary file or a directory, since CREATE 
TABLE only scans the table without inserting new data.

2. New INSERT OVERWRITE implementation (see the sketch after this description).
First, insert the newly generated table into a temporary directory named 
_temporary under the table's path. After the insert finishes, delete the 
original files. Finally, rename _temporary to data.
This fixes the bug described in SPARK-5746.

3. Why rename _temporary to data rather than move all the files in _temporary 
into the path and then delete _temporary? Because Spark operations such as 
RDD.saveAsTextFile(path) store the whole RDD as HDFS files named part-* under 
the path. If the original files were produced this way and we then run INSERT 
without OVERWRITE, the newly generated table files are also named part-*, 
which would corrupt the table.

Todo:
1. If there is an existing RDD based on path a/b/c that has already been 
cached, after an INSERT operation we need to recompute that RDD by rescanning 
the directory. Can we trigger a rescan after INSERT?
2. Is the rename of _temporary to data mentioned above sufficient? If the 
base directory was produced by another CTAS command, there will already be a 
data directory under it. Can we append a unique number after data, or just 
use the jobId or taskId to identify the subdirectory?
I think it may be better to resolve these problems in a follow-up PR.

This is an initial draft that still needs optimization. Looking forward to 
your comments.
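
A minimal sketch of the overwrite scheme in point 2, using the Hadoop
FileSystem API (the writeData callback, the deletion order, and the helper
name are illustrative assumptions, not this PR's actual code):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch only: write the new data under <table>/_temporary first, so a
    // failed job never destroys the original files, then swap it in.
    def insertOverwrite(tableDir: Path, writeData: Path => Unit): Unit = {
      val fs = FileSystem.get(tableDir.toUri, new Configuration())
      val temporary = new Path(tableDir, "_temporary")

      writeData(temporary)                    // 1. produce part-* files under _temporary

      fs.listStatus(tableDir)                 // 2. delete the original table files,
        .map(_.getPath)                       //    keeping the fresh _temporary dir
        .filter(_.getName != "_temporary")
        .foreach(fs.delete(_, true))

      fs.rename(temporary, new Path(tableDir, "data"))  // 3. publish as <table>/data
    }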


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark jsonInsertImprovements

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4610


commit 307278ff6bdcd1d1a5a50650fb3dfa6da3db070f
Author: Yanbo Liang yanboha...@gmail.com
Date:   2015-02-15T05:23:14Z

JSON external data source INSERT improvements initial draft







[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

2015-02-15 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4607#issuecomment-74405701
  
@yhuai Thank you for your reply. Adding an analysis rule and throwing an 
exception is reasonable; looking forward to your PR.
I can address SPARK-5821: I'm working on another PR, #4610, which not only 
resolves SPARK-5821 but also includes some other improvements.
Could I close this PR and discuss JSON data source improvements at #4610?





[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...

2015-02-14 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/4607

[SPARK-5821] [SPARK-5746] [SQL] JSON data source refactor initial draft

JSON data source refactor:
1. The path in CREATE TABLE AS SELECT must be a directory. In this scenario 
we need to write or append files to the existing table, and an underlying 
directory is better suited to append operations, authentication, and 
authorization.
For SPARK-5821: if we don't have write permission on the parent directory, 
the CTAS command will fail.
Another reason is that we can't append to the HDFS files that back an RDD; to 
implement append semantics, we have to create new files and add them to a 
specific directory.
2. New INSERT OVERWRITE implementation.
First, insert the newly generated table into a temporary directory named 
_temporary under the table's path. After the insert finishes, delete the 
original files. Finally, rename _temporary to data.
This fixes the bug described in SPARK-5746.
Why rename _temporary to data rather than move all the files in _temporary 
into the path and then delete _temporary? Because Spark operations such as 
RDD.saveAsTextFile(path) store the whole RDD as HDFS files named part-* under 
the path. If the original files were produced this way and we then run INSERT 
without OVERWRITE, the newly generated table files are also named part-*, 
which would corrupt the table.
This is an initial draft that still needs optimization. Looking forward to 
your opinions and comments.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark JSONDataSourceRefactor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4607


commit 8683a483c074f692152159d63a101f78c3c3fe58
Author: Yanbo Liang yanboha...@gmail.com
Date:   2015-02-14T17:37:05Z

JSON data source refactor initial draft







[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

2015-02-14 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4527#issuecomment-74388397
  
@yhuai 
This improvement is very similar to #758, so I leveraged the 
performance test from there.
The benchmark suggests this optimization makes the optimized version about 
1.5x faster when scanning a JSON table, but the result is not very stable.
For a JSON file with 188010 lines, the table scan took:
original: 15598 ms
optimized: 10152 ms





[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

2015-02-12 Thread yanbohappy
Github user yanbohappy commented on a diff in the pull request:

https://github.com/apache/spark/pull/4527#discussion_r24574169
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -39,7 +39,19 @@ private[sql] object JsonRDD extends Logging {
      json: RDD[String],
      schema: StructType,
      columnNameOfCorruptRecords: String): RDD[Row] = {
-    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
+    // Reuse the mutable row for each record; however, we still need to
+    // create a new row for every nested struct type in each record
+    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
--- End diff --

You are right; it's not appropriate to use SpecificMutableRow here. I will 
change it back to GenericMutableRow.





[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

2015-02-12 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4527#issuecomment-74059428
  
@chenghao-intel @yhuai 
Thank you for your advice; it's very useful.
We can now use mutable rows for both top-level records and inner structures.





[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

2015-02-11 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4527#issuecomment-73855155
  
retest this please





[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

2015-02-11 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4527#issuecomment-73850498
  
https://issues.apache.org/jira/browse/SPARK-5738





[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

2015-02-11 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/4527

[SQL] Reuse mutable row for each record at jsonStringToRow

When converting a JSON string into a row, reuse one mutable row per record 
instead of creating a new one for every record. For every nested struct type 
within a record, however, we still need to create a new row.
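
A self-contained sketch of the reuse pattern (illustrative only: a plain
Array[Any] stands in for Spark's internal mutable row classes, and the field
lookup is assumed, not the actual JsonRDD code):

    // One buffer is allocated per iterator and filled in place for each record,
    // instead of allocating a fresh row per record. Callers must copy a row
    // before buffering it, because the next record overwrites it.
    def recordsToRows(parsed: Iterator[Map[String, Any]],
                      fieldNames: Array[String]): Iterator[Array[Any]] = {
      val mutableRow = new Array[Any](fieldNames.length)  // reused across records
      parsed.map { record =>
        var i = 0
        while (i < fieldNames.length) {
          mutableRow(i) = record.getOrElse(fieldNames(i), null)
          i += 1
        }
        mutableRow
      }
    }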

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark jsonStringToRowOptimization

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4527


commit b0c2b145950c18e30a8e88086e018cb66931fbec
Author: Yanbo Liang yanboha...@gmail.com
Date:   2015-02-11T08:36:50Z

[SQL] Reuse mutable row for each record at jsonStringToRow







[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-27 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71626077
  
Moved the DESCRIBE command parser into DDLParser, which can avoid throwing an 
error on a parse failure, so that HiveContext-specific describe commands can 
be parsed by the fallback Hive parser and run as Hive native commands.
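
A schematic of that fallback arrangement (hypothetical shapes and names; the
real DDLParser/Hive parser interplay differs in detail): the DDL parser
returns an Option instead of throwing, and the caller falls back to the Hive
parser on None.

    trait LogicalPlan
    case class DescribePlan(table: String, formatted: Boolean) extends LogicalPlan
    case class HiveNativePlan(sql: String) extends LogicalPlan

    object DdlParser {
      private val Describe = """(?i)DESCRIBE\s+(FORMATTED\s+)?([\w.]+)""".r
      // Return None for anything we cannot parse instead of throwing,
      // so the caller can try the fallback parser.
      def parse(sql: String): Option[LogicalPlan] = sql.trim match {
        case Describe(formatted, table) => Some(DescribePlan(table, formatted != null))
        case _                          => None
      }
    }

    def parseSql(sql: String): LogicalPlan =
      DdlParser.parse(sql).getOrElse(HiveNativePlan(sql))  // Hive-native fallback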





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-27 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71794484
  
@OopsOutOfMemory Since you have gone deep into this issue, and I agree your PR 
is more mature, I'll close this one.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-27 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71793953
  
@lianhuiwang In PR https://github.com/apache/spark/pull/3948, 
CommandStrategy was removed and the command handling was refactored.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-27 Thread yanbohappy
Github user yanbohappy closed the pull request at:

https://github.com/apache/spark/pull/4207





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71482088
  
https://issues.apache.org/jira/browse/SPARK-5324





[GitHub] spark pull request: Implement Describe Table for SQLContext

2015-01-26 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/4207

Implement Describe Table for SQLContext

Initial code snippet for the DESCRIBE TABLE command.
1. SQL parsing and logical plan generation. Add a parser for the DESCRIBE 
[FORMATTED] [db_name.]table_name command in SparkSQLParser and generate the 
same logical plan for both SQLContext and HiveContext.
(Note: HiveContext also supports DESCRIBE [FORMATTED] [db_name.]table_name 
PARTITION partition_column_name and DESCRIBE [FORMATTED] [db_name.]table_name 
column_name, which are implemented by Hive native commands.)
2. Implement DescribeCommand, which is invoked as a RunnableCommand; a 
hypothetical sketch follows below.
3. For SQLContext the code is clearly structured, but for HiveContext the 
output of the describe command needs to stay the same as Hive's. So for 
HiveContext we still translate the logical command into 
DescribeHiveTableCommand, which is already implemented in 
HiveStrategies.HiveCommandStrategy.
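
A hypothetical sketch of the command's shape (assuming the Spark 1.x
RunnableCommand contract of run(sqlContext): Seq[Row]; note the real trait is
private[sql], and the actual PR's fields and output schema may differ):

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.execution.RunnableCommand

    // Hypothetical: emit one row per column as (col_name, data_type, comment).
    case class DescribeCommand(tableName: String, isFormatted: Boolean)
      extends RunnableCommand {

      override def run(sqlContext: SQLContext): Seq[Row] = {
        sqlContext.table(tableName).schema.fields.map { field =>
          Row(field.name, field.dataType.simpleString, "")
        }
      }
    }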

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark spark-5324

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4207


commit 7fb2e81d682d8f641275077b94dfb6d7de466de8
Author: Yanbo Liang yanboha...@gmail.com
Date:   2015-01-26T15:29:10Z

Implement Describe Table for SQLContext







[GitHub] spark pull request: SPARK-4963 [SQL] Add copy to SQL's Sample oper...

2015-01-08 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/3827#issuecomment-69160371
  
Can anyone verify and merge this patch? This bug appears frequently, so 
fixing it asap would be better.





[GitHub] spark pull request: SPARK-4963 [SQL] HiveTableScan return mutable ...

2014-12-30 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/3827#issuecomment-68339079
  
@liancheng I agree with moving the copy call into execution.Sample.execute 
and have added new commits.
This has no effect on HiveTableScan.





[GitHub] spark pull request: SPARK-4963 [SQL] Add copy to SQL's Sample oper...

2014-12-30 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/3827#issuecomment-68425561
  
Changed for better test output, and moved the test to another file where it 
fits better.





[GitHub] spark pull request: HiveTableScan return mutable row with copy

2014-12-29 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/3827

HiveTableScan return mutable row with copy

https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() returns wrong results because GapSamplingIterator 
operates on mutable rows.
HiveTableScan builds its RDD out of SpecificMutableRow instances, and 
SchemaRDD.sample() iterates over them with a GapSamplingIterator.

override def next(): T = {
  val r = data.next()
  advance
  r
}

GapSamplingIterator.next() takes the current underlying element and 
assigns it to r.
However, if the underlying iterator yields mutable rows, as HiveTableScan's 
does, the underlying iterator and r point to the same object.
During the advance step we skip over some underlying elements, which also 
mutates r unexpectedly; we then return a value different from the initial r.

The most direct fix is to make HiveTableScan return mutable rows with copy, 
as in my initial commit. This solution keeps HiveTableScan from getting the 
full benefit of the reusable MutableRow, but it makes the sample operation 
return correct results. See the illustration below.
Going further, we could investigate making GapSamplingIterator.next() perform 
the copy itself. To achieve that, every element type an RDD can store would 
have to implement something like a clone operation, which would be a huge 
change.
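
A self-contained illustration of the aliasing bug (simplified shapes; not the
real GapSamplingIterator or MutableRow):

    // The scan reuses one mutable buffer, like HiveTableScan's MutableRow.
    def scan(): Iterator[Array[Int]] = {
      val buffer = Array(0)
      (1 to 5).iterator.map { i => buffer(0) = i; buffer }
    }

    // Like GapSamplingIterator.next(): grab the current element, then advance
    // past skipped elements before returning it.
    def sampleEveryOther[T](it: Iterator[T]): Iterator[T] = new Iterator[T] {
      def hasNext: Boolean = it.hasNext
      def next(): T = {
        val r = it.next()              // r aliases the shared buffer
        if (it.hasNext) it.next()      // "advance" mutates what r points to
        r
      }
    }

    println(sampleEveryOther(scan()).map(_(0)).toList)               // buggy: List(2, 4, 5)
    println(sampleEveryOther(scan().map(_.clone)).map(_(0)).toList)  // with copy: List(1, 3, 5)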


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark spark-4963

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3827.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3827


commit 6eaee5e7b1b5aca7f6abd16892f8312c7d6d7917
Author: Yanbo Liang yanboha...@gmail.com
Date:   2014-12-29T09:00:44Z

HiveTableScan return mutable row with copy







[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics

2014-10-29 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/2978#issuecomment-60883826
  
Renamed parameters and functions to be consistent with Spark naming 
rules.
Deleted unused columns and made prediction the first column.
Added an explanation of, and references for, r2Score and explained variance.
Other code-style cleanup.





[GitHub] spark pull request: add regression metrics

2014-10-28 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/2978

add regression metrics

Add RegressionMetrics.scala, regression metrics used for evaluation, along 
with the corresponding test suite RegressionMetricsSuite.scala.
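
A hedged usage sketch (the method names below match RegressionMetrics as
eventually merged into MLlib; the draft in this thread still names the
constructor argument valuesAndPreds and orders the pair as (value, pred)):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.evaluation.RegressionMetrics

    val sc = new SparkContext(
      new SparkConf().setAppName("regression-metrics-demo").setMaster("local[*]"))
    val predictionAndLabels = sc.parallelize(Seq(
      (2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)))  // (prediction, label)

    val metrics = new RegressionMetrics(predictionAndLabels)
    println(s"MSE  = ${metrics.meanSquaredError}")
    println(s"RMSE = ${metrics.rootMeanSquaredError}")
    println(s"MAE  = ${metrics.meanAbsoluteError}")
    println(s"R2   = ${metrics.r2}")
    sc.stop()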

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark regression_metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2978.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2978


commit 43bb12b0ce64a70c5d655ad19930b17a7d69ab6e
Author: liangyanbo liangya...@meituan.com
Date:   2014-10-28T11:23:09Z

add regression metrics







[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics

2014-10-28 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/2978#issuecomment-60745341
  
Renamed re_score() and removed an unused column.





[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics

2014-10-28 Thread yanbohappy
Github user yanbohappy commented on a diff in the pull request:

https://github.com/apache/spark/pull/2978#discussion_r19465890
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.mllib.rdd.RDDFunctions._
+
+/**
+ * :: Experimental ::
+ * Evaluator for regression.
+ *
+ * @param valuesAndPreds an RDD of (value, pred) pairs.
+ */
+@Experimental
+class RegressionMetrics(valuesAndPreds: RDD[(Double, Double)]) extends Logging {
+
+  /**
+   * Use MultivariateOnlineSummarizer to calculate the mean and variance of
+   * different combinations. MultivariateOnlineSummarizer is a numerically
+   * stable algorithm for computing mean and variance in an online fashion.
+   */
+  private lazy val summarizer: MultivariateOnlineSummarizer = {
+    val summarizer: MultivariateOnlineSummarizer = valuesAndPreds.map {
+      case (value, pred) => Vectors.dense(
+        Array(value, pred, value - pred, math.abs(value - pred), math.pow(value - pred, 2.0)))
--- End diff --

Yes, it's not used, and I have removed it in a new commit.
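
For context, a small sketch of how the metrics fall out of the summarizer's
column means, assuming the (value, pred, value - pred, |value - pred|,
(value - pred)^2) column layout in the diff above:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

    val summarizer = new MultivariateOnlineSummarizer
    Seq((3.0, 2.5), (-0.5, 0.0), (2.0, 2.0), (7.0, 8.0)).foreach {
      case (value, pred) =>
        summarizer.add(Vectors.dense(value, pred, value - pred,
          math.abs(value - pred), math.pow(value - pred, 2.0)))
    }
    val mae  = summarizer.mean(3)    // mean of |value - pred|
    val mse  = summarizer.mean(4)    // mean of (value - pred)^2
    val rmse = math.sqrt(mse)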

