[GitHub] spark pull request #20027: Branch 2.2

2018-01-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20027


---




[GitHub] spark pull request #20027: Branch 2.2

2017-12-19 Thread Maple-Wang
GitHub user Maple-Wang opened a pull request:

https://github.com/apache/spark/pull/20027

Branch 2.2

When I use SparkR from the R shell, the master parameter seems too old and I cannot run "spark-submit --master yarn --deploy-mode client". I have installed R on all nodes.

When I use it this way:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/usr/hdp/2.6.1.0-129/spark2")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "yarn", sparkConfig = list(spark.driver.memory = "10g"))



it fails with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, node23.nuctech.com, executor 1): java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:372)
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20027


commit 96c04f1edcd53798d9db5a356482248868a0a905
Author: Marcelo Vanzin 
Date:   2017-06-24T05:23:43Z

[SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode.

Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available.

Tested by running a cluster mode app with "SparkLauncher.startApplication".
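
For reference, a minimal sketch of launching a cluster-mode app through `SparkLauncher.startApplication`; the jar path, main class, and master URL below are placeholders, not values taken from this PR:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchClusterApp {
  def main(args: Array[String]): Unit = {
    // Launch a cluster-mode application. The returned handle still reports
    // coarse state transitions, but (per SPARK-11033) full monitoring is not
    // implemented for standalone cluster mode.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/example-app.jar")   // placeholder jar
      .setMainClass("com.example.ExampleApp")       // placeholder main class
      .setMaster("spark://master:7077")             // placeholder standalone master
      .setDeployMode("cluster")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    // Poll until the launcher reports a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}
```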

Author: Marcelo Vanzin 

Closes #18397 from vanzin/SPARK-21159.

(cherry picked from commit bfd73a7c48b87456d1b84d826e04eca938a1be64)
Signed-off-by: Wenchen Fan 

commit ad44ab5cb9cdaff836c7469d10b00a86a3e46adf
Author: gatorsmile 
Date:   2017-06-24T14:35:59Z

[SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct

### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet

INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))

SELECT custom_fields.id, custom_fields.value FROM tab1
```

The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
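
The shared-mutable-object pitfall can be shown outside Spark's internals. A minimal, self-contained sketch of the failure mode (this is an illustration only, not Spark's actual cast code):

```scala
// Illustration of the bug class described above: reusing one mutable row
// while building an array makes every slot reference the last-written value.
object SharedRowPitfall {
  final class Row(var id: Long, var value: String) {
    override def toString: String = s"($id, $value)"
  }

  def main(args: Array[String]): Unit = {
    val input = Seq((1, "a"), (2, "b"))

    // Buggy: a single Row is reused for every element, so both array slots
    // point to the same object, which last held (2, "b").
    val shared = new Row(0L, "")
    val buggy = input.map { case (id, v) =>
      shared.id = id.toLong
      shared.value = v
      shared
    }.toArray
    println(buggy.mkString(", "))   // prints: (2, b), (2, b)

    // Fixed: produce a fresh row (or a copy) per element, which is what the
    // patch does for the casted rows.
    val fixed = input.map { case (id, v) => new Row(id.toLong, v) }.toArray
    println(fixed.mkString(", "))   // prints: (1, a), (2, b)
  }
}
```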

### How was this patch tested?

Author: gatorsmile 

Closes #18412 from gatorsmile/castStruct.

(cherry picked from commit 2e1586f60a77ea0adb6f3f68ba74323f0c242199)
Signed-off-by: Wenchen Fan 

commit d8e3a4af36f85455548e82ae4acd525f5e52f322
Author: Masha Basmanova 
Date:   2017-06-25T05:49:35Z

[SPARK-21079][SQL] Calculate total size of a partition table as a sum of individual partitions

## What changes were proposed in this pull request?

The storage URI of a partitioned table may or may not point to a directory under which the individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, the ANALYZE TABLE table COMPUTE STATISTICS command calculated the total size of a table by adding up the sizes of the files found under the table's storage URI. This calculation could produce 0 if the partitions are stored elsewhere.

This change uses the storage URIs of the individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of the table.
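
The approach can be sketched with plain Hadoop FileSystem calls: sum the bytes under each partition's own location rather than only under the table root. The paths and helper below are illustrative, not the code in this commit:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative sketch: total table size as the sum of per-partition sizes.
object PartitionSizeSketch {
  def sizeOf(location: String, conf: Configuration): Long = {
    val path = new Path(location)
    val fs = path.getFileSystem(conf)
    if (fs.exists(path)) fs.getContentSummary(path).getLength else 0L
  }

  def totalTableSize(partitionLocations: Seq[String], conf: Configuration): Long =
    partitionLocations.map(sizeOf(_, conf)).sum

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical partition locations, possibly outside the table's root URI.
    val partitions = Seq("hdfs:///warehouse/tab/p=1", "hdfs:///data/elsewhere/p=2")
    println(s"total size: ${totalTableSize(partitions, conf)} bytes")
  }
}
```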