GitHub user Maple-Wang opened a pull request:
https://github.com/apache/spark/pull/20027
Branch 2.2
When using SparkR in the R shell, the master parameter behaves as in old releases and "spark-submit --master yarn --deploy-mode client" cannot run. I have installed R on all nodes.
When used in this way:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/usr/hdp/2.6.1.0-129/spark2")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "yarn", sparkConfig = list(spark.driver.memory = "10g"))
the following error is thrown:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, node23.nuctech.com, executor 1): java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:372)
    at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
    at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.
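The trace shows the failure originating in `RRunner.createRWorker`: the executor blocks in `ServerSocket.accept`, waiting for an R worker process to connect back, so if `Rscript` cannot start on the executor node (missing installation, wrong path, or a different environment under YARN), the accept call times out. A minimal Scala sketch of that accept-with-timeout pattern, purely illustrative and not Spark's actual RRunner code:

```scala
import java.net.{ServerSocket, SocketTimeoutException}

object AcceptTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0)   // bind an ephemeral port for the worker to call back on
    server.setSoTimeout(10000)         // give the worker at most 10 seconds to connect
    try {
      // Spark would launch the R worker process at this point; if Rscript never
      // starts, nothing ever connects and accept() throws, as in the trace above.
      val socket = server.accept()
      socket.close()
    } catch {
      case e: SocketTimeoutException => println(s"Accept timed out: $e")
    } finally {
      server.close()
    }
  }
}
```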
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20027.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20027
commit 96c04f1edcd53798d9db5a356482248868a0a905
Author: Marcelo Vanzin
Date: 2017-06-24T05:23:43Z
[SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode.

Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available.
Tested by running a cluster mode app with "SparkLauncher.startApplication".
Author: Marcelo Vanzin
Closes #18397 from vanzin/SPARK-21159.
(cherry picked from commit bfd73a7c48b87456d1b84d826e04eca938a1be64)
Signed-off-by: Wenchen Fan
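For context, `SparkLauncher.startApplication` is the launcher-library entry point the test above refers to. A hedged sketch of how a standalone cluster-mode app might be started through it (the jar path and main class are placeholders, not from the patch):

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // placeholder application jar
      .setMainClass("com.example.Main")     // placeholder main class
      .setMaster("spark://master:7077")     // standalone master
      .setDeployMode("cluster")
      .startApplication()
    // Per this commit, the app launches correctly in cluster mode, but state
    // monitoring through `handle` is not available (see SPARK-11033).
    println(handle.getState)
  }
}
```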
commit ad44ab5cb9cdaff836c7469d10b00a86a3e46adf
Author: gatorsmile
Date: 2017-06-24T14:35:59Z
[SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet

INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))

SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
### How was this patch tested?
Author: gatorsmile
Closes #18412 from gatorsmile/castStruct.
(cherry picked from commit 2e1586f60a77ea0adb6f3f68ba74323f0c242199)
Signed-off-by: Wenchen Fan
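To illustrate the class of bug described in SPARK-21203 above, here is a simplified Scala sketch (an assumption-laden model, not Spark's actual cast code): reusing one mutable buffer for every array element makes all results alias the last element, which is why the query repeatedly returned the last struct:

```scala
object RowReuseSketch {
  def main(args: Array[String]): Unit = {
    val input = Seq(Array("1", "a"), Array("2", "b"))

    // Buggy pattern: one shared mutable buffer, reused for every element.
    val shared = new Array[Any](2)
    val buggy = input.map { f =>
      shared(0) = f(0).toInt
      shared(1) = f(1)
      shared                                 // every result is the SAME array object
    }
    println(buggy.map(_.mkString("(", ",", ")")).mkString(" "))  // (2,b) (2,b)

    // Fixed pattern: allocate a fresh buffer per element.
    val fixed = input.map(f => Array[Any](f(0).toInt, f(1)))
    println(fixed.map(_.mkString("(", ",", ")")).mkString(" "))  // (1,a) (2,b)
  }
}
```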
commit d8e3a4af36f85455548e82ae4acd525f5e52f322
Author: Masha Basmanova
Date: 2017-06-25T05:49:35Z
[SPARK-21079][SQL] Calculate total size of a partition table as a sum of individual partitions
## What changes were proposed in this pull request?
The storage URI of a partitioned table may or may not point to a directory under which individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, the ANALYZE TABLE table COMPUTE STATISTICS command calculated the total size of a table by adding up the sizes of files found under the table's storage URI. This calculation could produce 0 if partitions are stored elsewhere.

This change uses the storage URIs of individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of the table.
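A hedged sketch of the approach described above: sum the bytes under each partition's own storage location rather than only under the table root (the Hadoop `FileSystem` usage is a plausible illustration of the idea, and the paths are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PartitionSizeSketch {
  // Total table size = sum of the sizes of each partition's own location,
  // which may live under unrelated directories or even other filesystems.
  def totalSize(partitionLocations: Seq[String], conf: Configuration): Long =
    partitionLocations.map { loc =>
      val path = new Path(loc)
      val fs   = path.getFileSystem(conf)
      fs.getContentSummary(path).getLength   // bytes under this partition
    }.sum

  def main(args: Array[String]): Unit = {
    val partitions = Seq("hdfs:///warehouse/tab1/p=1", "hdfs:///other/dir/p=2")
    println(totalSize(partitions, new Configuration()))
  }
}
```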