[GitHub] spark pull request #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessar...

2017-12-25 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/20078

[SPARK-22900] [Spark-Streaming] Remove unnecessary restriction for streaming dynamic allocation

## What changes were proposed in this pull request?

When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the `num-executors` option cannot be set. As a result, the application is allocated the default 2 executors, all receivers run on those 2 executors, and there may be no spare CPU cores left for tasks, so the job is stuck all the time.

In my opinion, we should remove this unnecessary restriction for streaming dynamic allocation, so that `num-executors` and `spark.streaming.dynamicAllocation.enabled=true` can be set together. When the application starts, each receiver can then run on its own executor.
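For illustration, here is the kind of configuration this change would allow (a hypothetical spark-defaults sketch; the counts are made up, and `spark.executor.instances` is the conf equivalent of `num-executors`):

```
# Hypothetical values: 5 receivers, so seed 5 executors at startup and
# let streaming dynamic allocation scale from there.
spark.executor.instances                          5
spark.streaming.dynamicAllocation.enabled         true
spark.streaming.dynamicAllocation.minExecutors    5
spark.streaming.dynamicAllocation.maxExecutors    50
```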

## How was this patch tested?

Manual test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20078.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20078


commit 6a7d07b7f135ed8ad079a1918fe3484757960df0
Author: sharkdtu 
Date:   2017-12-25T13:13:16Z

remove unnecessary restrict for streaming dynamic allocation




---




[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...

2018-01-01 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/20078
  
@felixcheung 
At the beginning, if numReceivers > totalExecutorCores, there are no CPU cores left for batch processing, and `ExecutorAllocationManager` cannot observe metrics from any batch. As a result, it doesn't work.
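
A quick worked example of the failure mode (all numbers hypothetical):

```scala
// Hypothetical numbers illustrating the deadlock described above.
val numReceivers       = 3  // each receiver permanently occupies one core
val defaultExecutors   = 2  // the YARN default when num-executors is forced off
val coresPerExecutor   = 1
val totalExecutorCores = defaultExecutors * coresPerExecutor  // = 2

// 3 receivers > 2 cores: no core is ever free for batch processing, so no
// batch ever completes, ExecutorAllocationManager sees no batch metrics,
// and it never scales up -- the application is stuck from the start.
assert(numReceivers > totalExecutorCores)
```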


---




[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...

2018-01-01 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/20078
  
@felixcheung
If you submit Spark on YARN with `spark.streaming.dynamicAllocation.enabled=true`, `num-executors` cannot be set. So, at the beginning, there are only 2 executors (the default value).


---




[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...

2018-01-03 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/20078
  
@jerryshao 
If there are bugs here as you said, why not fix them? Otherwise, the feature should be marked as deprecated.


---




[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...

2018-01-03 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/20078
  
@felixcheung
Have you considered the initial num-executors? It is 2 executors by default when you run Spark on YARN. How can you make sure those 2 executors have enough cores for the receivers at the beginning?


---




[GitHub] spark pull request #14479: [SPARK-16873] [Core] Fix SpillReader NPE when spi...

2016-08-03 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/14479

[SPARK-16873] [Core] Fix SpillReader NPE when spillFile has no data

## What changes were proposed in this pull request?

SpillReader throws an NPE when the spill file has no data. See the following logs:

```
16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, fileSize:0.0 B
16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from org.apache.spark.util.collection.ExternalSorter@3db4b52d
16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size = 190458101 bytes, TID = 23585
16/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 18.0 (TID 23585)
java.lang.NullPointerException
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:624)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:539)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:507)
    at org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:816)
    at org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:251)
    at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
16/07/31 20:54:30 INFO executor.Executor: Executor is trying to kill task 1090.1 in stage 18.0 (TID 23793)
16/07/31 20:54:30 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
```
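
For context, a minimal self-contained sketch of the kind of null guard that addresses this (field names follow `SpillReader`; it illustrates the intent, not necessarily the exact patch). When the spill file is empty, `nextBatchStream` reaches `cleanup` with `deserializeStream` still null:

```scala
import java.io.InputStream

// Illustrative sketch only: guard cleanup against a never-opened stream.
class SpillReaderSketch(batchOffsets: Array[Long]) {
  private var batchId: Int = 0
  private var fileStream: InputStream = _
  private var deserializeStream: InputStream = _  // stays null for an empty spill file

  def cleanup(): Unit = {
    batchId = batchOffsets.length    // stop any further batch reads
    val ds = deserializeStream
    deserializeStream = null
    fileStream = null
    if (ds != null) {                // the guard that prevents the NPE above
      ds.close()
    }
  }
}
```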


## How was this patch tested?

Manual test.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14479.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14479


commit d8cf2b493a589b745d54b3b903848d4d0827e642
Author: sharkd 
Date:   2016-07-12T23:59:26Z

rebase apache/master

commit 8b0c40ab555336899b684fc2a1d6cc1c0886cd11
Author: sharkd 
Date:   2016-07-11T16:49:56Z

fix style

commit 888cf1fa2187e4f92286c74ba6a05196348eff79
Author: sharkd 
Date:   2016-07-12T23:59:26Z

rebase apache/master

commit c470ab74b1bfc4814f0ca683102ed55b6c2a1410
Author: sharkd 
Date:   2016-07-11T16:49:56Z

fix style

commit 8ae5ec71c9e12b4004d0563c9b581b590890369f
Author: sharkdtu 
Date:   2016-08-03T11:51:45Z

SpillReader NPE when spillFile has no data




---



[GitHub] spark pull request #17963: [SPARK-20722][Core][History Server] Replay newer ...

2017-05-12 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/17963

[SPARK-20722][Core][History Server] Replay a newer event log that hasn't been replayed yet in advance for the request

## What changes were proposed in this pull request?

The history server may replay logs slowly if the total size of the event logs in the current checking period is very large. It can get stuck for a while before entering the next checking period; if we request the UI of a newer application during that window, we get an error like "Application application_1481785469354_934016 not found". We can let the history server replay the newer event log on demand for such a request.
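
Illustrative pseudocode of the idea only; every name here is a stand-in invented for this sketch, not Spark's actual API:

```scala
// Sketch: serve a UI request for an app whose event log exists on disk
// but hasn't been picked up by the periodic scan yet.
case class LoadedAppUI(appId: String)

class HistoryProviderSketch(
    applications: Map[String, String],            // appId -> already-replayed listing
    replayLogForApp: String => Option[String]) {  // assumed eager-replay fallback

  def getAppUI(appId: String): Option[LoadedAppUI] =
    applications.get(appId)             // replayed in an earlier checking period?
      .orElse(replayLogForApp(appId))   // otherwise replay its log right now
      .map(_ => LoadedAppUI(appId))
}
```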


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17963


commit 3005c7cd7c57fd0c6a0ea318760dc2dc3010e3aa
Author: sharkdtu 
Date:   2017-05-12T07:50:44Z

Replay event log that hasn't be replayed in current checking period in 
advance for request




---



[GitHub] spark pull request #16912: [SPARK-19576] [Core] Task attempt paths exist in ...

2017-05-13 Thread sharkdtu
Github user sharkdtu closed the pull request at:

https://github.com/apache/spark/pull/16912


---



[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...

2017-05-13 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/17963
  
cc @srowen @ajbozarth 


---



[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...

2017-05-15 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/17963
  
@jerryshao An event log file will not be processed twice; see `FsHistoryProvider.checkForLogs` and `FsHistoryProvider.mergeApplicationListing`. In the next checking period, it checks the event log length against the corresponding app info in `fileToAppInfo`.
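
A simplified sketch of that length check (adapted from the shape of `FsHistoryProvider.checkForLogs`; names are approximate):

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

case class AppInfoSketch(fileSize: Long)  // stand-in for the cached app info

// A log is (re)processed only if it is new or has grown since the last
// scan, so an eager replay will not cause a second full replay.
def needsProcessing(
    entry: FileStatus,
    fileToAppInfo: collection.Map[Path, AppInfoSketch]): Boolean =
  fileToAppInfo.get(entry.getPath) match {
    case Some(prev) => entry.getLen > prev.fileSize  // grew since last scan
    case None       => true                          // never seen before
  }
```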


---



[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...

2017-05-15 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/17963
  
@jerryshao Thanks, I agree with that. This PR may be a temporary fix until SPARK-18085.


---



[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...

2017-05-15 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/17963
  
@ajbozarth Yes, this case is a big issue in my production cluster, which runs nearly 20,000 applications every day.


---



[GitHub] spark pull request #17963: [SPARK-20722][CORE] Replay newer event log that h...

2017-06-19 Thread sharkdtu
Github user sharkdtu closed the pull request at:

https://github.com/apache/spark/pull/17963


---



[GitHub] spark pull request #18352: [SPARK-21138] [YARN] Cannot delete staging dir wh...

2017-06-19 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/18352

[SPARK-21138] [YARN] Cannot delete staging dir when the clusters of 
"spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

## What changes were proposed in this pull request?

When I set different clusters for "spark.hadoop.fs.defaultFS" and 
"spark.yarn.stagingDir" as follows:
```
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
```
The staging dir cannot be deleted; it fails with the following message:
```
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
```
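
A hedged sketch of the fix this points at (simplified; the real change is in `ApplicationMaster`'s staging-dir cleanup): resolve the `FileSystem` from the staging path itself instead of from `fs.defaultFS`, so the two clusters may differ.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: derive the FileSystem from the staging path, not the default FS.
def cleanupStagingDir(stagingDirUri: String, conf: Configuration): Unit = {
  val stagingDirPath = new Path(stagingDirUri)
  // Path.getFileSystem resolves hdfs://ss-teg-2-v2/... against its own
  // cluster; FileSystem.get(conf) would resolve against fs.defaultFS and
  // raise the "Wrong FS" error shown above.
  val fs = stagingDirPath.getFileSystem(conf)
  fs.delete(stagingDirPath, true)
}
```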

## How was this patch tested?

Existing tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18352


commit b74138e31d3317b34ffb9f13cf7fdd7873edc1a6
Author: sharkdtu 
Date:   2017-06-19T11:03:01Z

Cannot delete staging dir when the clusters of spark.yarn.stagingDir and 
spark.hadoop.fs.defaultFS are different




---



[GitHub] spark pull request #21658: [SPARK-24678][Spark-Streaming] Give priority in u...

2018-06-28 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/21658

[SPARK-24678][Spark-Streaming] Give priority in use of 'PROCESS_LOCAL' for 
spark-streaming

## What changes were proposed in this pull request?

Currently, `BlockRDD.getPreferredLocations` only returns the hosts of blocks, so the resulting scheduling locality is never better than 'NODE_LOCAL'. With a small change, the locality can be improved to 'PROCESS_LOCAL'.
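
Why the preference-string format matters: the scheduler's `TaskLocation` parser (a `private[spark]` internal) treats an `executor_<host>_<executorId>` string as an executor-level location, which can schedule PROCESS_LOCAL, while a bare hostname only yields NODE_LOCAL. A small sketch mirroring that convention (not Spark's actual code):

```scala
// Sketch of the preference-string convention this PR relies on.
def describe(pref: String): String =
  if (pref.startsWith("executor_")) {
    // "executor_<host>_<executorId>" -> executor location -> PROCESS_LOCAL
    val Array(_, host, executorId) = pref.split("_", 3)
    s"PROCESS_LOCAL on executor $executorId at $host"
  } else {
    // bare "<host>" -> host location -> NODE_LOCAL at best
    s"NODE_LOCAL on $pref"
  }
```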

## How was this patch tested?

manual test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21658.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21658


commit 666fb4c5d343a1ea439ecc284d047810d6189c23
Author: sharkdtu 
Date:   2018-06-28T07:35:52Z

give priority in use of 'PROCESS_LOCAL' for spark-streaming




---




[GitHub] spark pull request #21658: [SPARK-24678][Spark-Streaming] Give priority in u...

2018-07-05 Thread sharkdtu
Github user sharkdtu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21658#discussion_r200310184
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -1569,7 +1569,7 @@ private[spark] object BlockManager {
 
     val blockManagers = new HashMap[BlockId, Seq[String]]
     for (i <- 0 until blockIds.length) {
-      blockManagers(blockIds(i)) = blockLocations(i).map(_.host)
+      blockManagers(blockIds(i)) = blockLocations(i).map(b => s"executor_${b.host}_${b.executorId}")
--- End diff --

blockIdsToLocations ?


---




[GitHub] spark issue #21658: [SPARK-24678][Spark-Streaming] Give priority in use of '...

2018-07-09 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/21658
  
@jerryshao Yeah, I have verified it in our cluster, and the locality is 'PROCESS_LOCAL'.


---




[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...

2017-02-13 Thread sharkdtu
Github user sharkdtu closed the pull request at:

https://github.com/apache/spark/pull/16651


---



[GitHub] spark pull request #16911: [SPARK-19576] [Core] Task attempt paths exist in ...

2017-02-13 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/16911

[SPARK-19576] [Core] Task attempt paths exist in output path after 
saveAsNewAPIHadoopFile completes with speculation enabled

`writeShard` in `saveAsNewAPIHadoopDataset` always committed its tasks without question. The problem is that when speculation is enabled, this can sometimes result in multiple tasks committing their output to the same path, which may leave task temporary paths in the output path after `saveAsNewAPIHadoopFile` completes.

```
-rw-r--r--   3  user  group     0  2017-02-11 19:36  hdfs://.../output/_SUCCESS
drwxr-xr-x   -  user  group     0  2017-02-11 19:36  hdfs://.../output/attempt_201702111936_32487_r_44_0
-rw-r--r--   3  user  group  8952  2017-02-11 19:36  hdfs://.../output/part-r-0
-rw-r--r--   3  user  group  7878  2017-02-11 19:36  hdfs://.../output/part-r-1
```
Assume two task attempts commit at the same time: both may rename their task attempt paths to the task committed path simultaneously. When one task's `rename` completes, the other task's `rename` will place its task attempt path underneath the task committed path.

In any case, `writeShard` in `saveAsNewAPIHadoopDataset` should not commit its tasks unconditionally. A similar issue triggered by calling `saveAsHadoopFile` was solved in SPARK-4879, and the newest master has solved this one too. This PR only fixes branch-2.1.
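
For reference, a hedged sketch of the guarded commit that master uses (arbitration goes through the driver-side OutputCommitCoordinator; this is simplified, not the exact backport):

```scala
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

// Sketch: commit only when this attempt both has output to commit and is
// authorized by the driver, so speculative twins cannot both rename.
def commitIfAllowed(
    committer: OutputCommitter,
    ctx: TaskAttemptContext,
    driverSaysCanCommit: => Boolean): Unit = {
  if (committer.needsTaskCommit(ctx) && driverSaysCanCommit) {
    committer.commitTask(ctx)  // rename attempt path -> committed path
  }
  // A losing speculative attempt simply skips the commit; its attempt
  // directory is removed by job-level cleanup instead of being renamed
  // into the output path.
}
```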

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16911


commit a7f8ebb8629706c54c286b7aca658838e718e804
Author: Cheng Lian 
Date:   2016-12-02T06:02:45Z

[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary 
columns due to PARQUET-686

This PR targets to both master and branch-2.1.

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't do string comparison correctly while 
doing filter push-down for string columns. This PR disables filter push-down 
for both string and binary columns to work around this issue. Binary columns 
are also affected because some Parquet data models (like Hive) may store string 
columns as a plain Parquet `binary` instead of a `binary (UTF8)`.

## How was this patch tested?

New test case added in `ParquetFilterSuite`.

Author: Cheng Lian 

Closes #16106 from liancheng/spark-17213-bad-string-ppd.

(cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258)
Signed-off-by: Reynold Xin 

commit 65e896a6e9a5378f2d3a02c0c2a57fdb8d8f1d9d
Author: Eric Liang 
Date:   2016-12-02T12:59:39Z

[SPARK-18679][SQL] Fix regression in file listing performance for 
non-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed 
to InMemoryFileIndex). This introduced a regression where parallelism could 
only be introduced at the very top of the tree. However, in many cases (e.g. 
`spark.read.parquet(topLevelDir)`), the top of the tree is only a single 
directory.

This PR simplifies and fixes the parallel recursive listing code to allow 
parallelism to be introduced at any level during recursive descent (though note 
that once we decide to list a sub-tree in parallel, the sub-tree is listed in 
serial on executors).

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang 

Closes #16112 from ericl/spark-18679.

(cherry picked from commit 294163ee9319e4f7f6da1259839eb3c80bba25c2)
Signed-off-by: Wenchen Fan 

commit 415730e19cea3a0e7ea5491bf801a22859bbab66
Author: Dongjoon Hyun 
Date:   2016-12-02T13:48:22Z

[SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options

## What changes were proposed in this pull request?

Currently, `JDBCRelation.insert` removes Spark options too early by 
mistakenly using `asConnectionProperties`. Spark options like `numPartitions` 
should be passed into `DataFrameWriter.jdbc` correctly. This bug have been 
**hidden** because `JDBCOptions.asConnectionProperties` fails to filter out the 
mixed-case options. This PR aims to fix both.

**JDBCRelation.insert**
```scala
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
  val url = jdbcOptions.url
  val table = jdbcOptions.table
- val properties = jdbcOptions.asConnectionProperties
+

[GitHub] spark pull request #16911: [SPARK-19576] [Core] Task attempt paths exist in ...

2017-02-13 Thread sharkdtu
Github user sharkdtu closed the pull request at:

https://github.com/apache/spark/pull/16911


---



[GitHub] spark pull request #16912: [SPARK-19576] [Core] Task attempt paths exist in ...

2017-02-13 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/16912

[SPARK-19576] [Core] Task attempt paths exist in output path after 
saveAsNewAPIHadoopFile completes with speculation enabled

`writeShard` in `saveAsNewAPIHadoopDataset` always committed its tasks without question. The problem is that when speculation is enabled, this can sometimes result in multiple tasks committing their output to the same path, which may leave task temporary paths in the output path after `saveAsNewAPIHadoopFile` completes.

```
-rw-r--r--   3  user  group     0  2017-02-11 19:36  hdfs://.../output/_SUCCESS
drwxr-xr-x   -  user  group     0  2017-02-11 19:36  hdfs://.../output/attempt_201702111936_32487_r_44_0
-rw-r--r--   3  user  group  8952  2017-02-11 19:36  hdfs://.../output/part-r-0
-rw-r--r--   3  user  group  7878  2017-02-11 19:36  hdfs://.../output/part-r-1
```
Assume two task attempts commit at the same time: both may rename their task attempt paths to the task committed path simultaneously. When one task's `rename` completes, the other task's `rename` will place its task attempt path underneath the task committed path.

In any case, `writeShard` in `saveAsNewAPIHadoopDataset` should not commit its tasks unconditionally. A similar issue triggered by calling `saveAsHadoopFile` was solved in SPARK-4879, and the newest master has solved this one too. This PR only fixes branch-2.1.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16912.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16912


commit 6f41b90583c585414b99fe716377d0576499de8d
Author: sharkdtu 
Date:   2017-02-13T11:46:48Z

Task attempt paths exist in output path after saveAsNewAPIHadoopFile 
completes with speculation enabled




---



[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...

2017-01-19 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/16651

[SPARK-19298][Core] History server can't match MalformedInputException and prompt the detailed logs while replaying an event log

The history server can't match MalformedInputException and prompt the detailed logs while replaying an event log, because MalformedInputException is a subclass of IOException.
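
The hierarchy is easy to check (a runnable snippet, not part of the PR):

```scala
import java.io.IOException
import java.nio.charset.MalformedInputException

// MalformedInputException extends CharacterCodingException, which extends
// IOException -- so `case ioe: IOException` matches it first and rethrows
// before any logging case can run.
val e = new MalformedInputException(1)
assert(e.isInstanceOf[IOException])
```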

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16651


commit 07f59016d6175d5aac0242f7432ce09bb3f984b0
Author: sharkdtu 
Date:   2017-01-20T02:06:55Z

fix MalformedInputException match




---



[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...

2017-01-20 Thread sharkdtu
Github user sharkdtu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16651#discussion_r97034201
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -107,11 +107,11 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
         }
       }
     } catch {
-      case ioe: IOException =>
-        throw ioe
-      case e: Exception =>
-        logError(s"Exception parsing Spark event log: $sourceName", e)
+      case ex: MalformedInputException =>
--- End diff --

Thanks, I forgot to import MalformedInputException.


---



[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...

2017-01-20 Thread sharkdtu
Github user sharkdtu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16651#discussion_r97037524
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -107,11 +107,11 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
         }
       }
     } catch {
-      case ioe: IOException =>
-        throw ioe
-      case e: Exception =>
-        logError(s"Exception parsing Spark event log: $sourceName", e)
+      case ex: MalformedInputException =>
--- End diff --

please check: https://issues.apache.org/jira/browse/SPARK-19298


---



[GitHub] spark issue #16651: [SPARK-19298][Core] History server can't match Malformed...

2017-01-20 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/16651
  
@srowen 
I think the logs were just for `MalformedInputException`. It doesn't matter that non-IOExceptions will be rethrown, because they will be caught by upper callers.


---



[GitHub] spark pull request: [Core] Remove unnecessary calculation of stage...

2016-05-15 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/13123

[Core] Remove unnecessary calculation of stage's parents

## What changes were proposed in this pull request?

Remove the unnecessary calculation of a stage's parents, because a stage's parents have already been set at stage construction time.


## How was this patch tested?

Make use of the existing test cases




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13123


commit 6e93108e5a6642938bc1d16c3f204714f05e4bd5
Author: sharkd 
Date:   2016-05-15T09:15:02Z

Remove unnecessary calculation of stage's parents




---



[GitHub] spark pull request #14088: Fix bugs for "Can not get user config when callin...

2016-07-07 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/14088

Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf 
in other places"

## What changes were proposed in this pull request?

Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf 
in other places".

The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster`, so the `sparkConf` and `conf` in the `SparkHadoopUtil` singleton didn't include the user's configuration. But other places, such as `DataSourceStrategy`, use the `hadoopConf` from `SparkHadoopUtil`:

```scala
...
case PhysicalOperation(projects, filters, l @ LogicalRelation(t: HadoopFsRelation, _)) =>
  // See buildPartitionedTableScan for the reason that we need to create a shard
  // broadcast HadoopConf.
  val sharedHadoopConf = SparkHadoopUtil.get.conf
  val confBroadcast =
    t.sqlContext.sparkContext.broadcast(new SerializableConfiguration(sharedHadoopConf))
...
```
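
A sketch of the direction of the fix (the review diff later in this thread quotes the same comment lines; this is Spark-internal code, simplified): load the Spark properties file into system properties in `ApplicationMaster.main` before anything instantiates the `SparkHadoopUtil` singleton. `Utils.getPropertiesFromFile` is Spark's existing helper.

```scala
// Inside org.apache.spark.deploy.yarn.ApplicationMaster (sketch):
def main(args: Array[String]): Unit = {
  val amArgs = new ApplicationMasterArguments(args)
  // Load the properties file with the Spark configuration and set entries
  // as system properties, so user code inside the AM also sees them.
  if (amArgs.propertiesFile != null) {
    Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
      sys.props(k) = v
    }
  }
  // ... continue with normal ApplicationMaster startup ...
}
```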

## How was this patch tested?

Use existing test cases

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14088.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14088


commit 55e66b21cdcd68861db0f1045186048c54b13153
Author: sharkdtu 
Date:   2016-07-07T11:04:11Z

Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf 
in other places, such as DataSourceStrategy"




---



[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread sharkdtu
Github user sharkdtu commented on the issue:

https://github.com/apache/spark/pull/14088
  
@tgravescs Fixed the description and style.


---



[GitHub] spark pull request #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get us...

2016-07-08 Thread sharkdtu
Github user sharkdtu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14088#discussion_r70076189
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -743,6 +735,14 @@ object ApplicationMaster extends Logging {
   def main(args: Array[String]): Unit = {
     SignalUtils.registerLogger(log)
     val amArgs = new ApplicationMasterArguments(args)
+
+    // Load the properties file with the Spark configuration and set entries as system properties,
+    // so that user code run inside the AM also has access to them.
--- End diff --

@tgravescs Thanks, done.


---



[GitHub] spark pull request #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get us...

2016-07-11 Thread sharkdtu
Github user sharkdtu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14088#discussion_r70362297
  
--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala ---
@@ -274,6 +288,37 @@ private object YarnClusterDriverWithFailure extends Logging with Matchers {
   }
 }
 
+private object YarnClusterDriverUseSparkHadoopUtilConf extends Logging with Matchers {
+  def main(args: Array[String]): Unit = {
+    if (args.length != 2) {
+      // scalastyle:off println
+      System.err.println(
+        s"""
+        |Invalid command line: ${args.mkString(" ")}
+        |
+        |Usage: YarnClusterDriverUseSparkHadoopUtilConf [propertyKey=value] [result file]
+        """.stripMargin)
+      // scalastyle:on println
+      System.exit(1)
+    }
+
+    val sc = new SparkContext(new SparkConf()
+      .set("spark.extraListeners", classOf[SaveExecutorInfo].getName)
+      .setAppName("yarn test using SparkHadoopUtil's conf"))
+
+    val propertyKeyValue = args(0).split("=")
+    val status = new File(args(1))
+    var result = "failure"
+    try {
+      SparkHadoopUtil.get.conf.get(propertyKeyValue(0).drop(13)) should be (propertyKeyValue(1))
--- End diff --

It means drop the `spark.hadoop.` prefix. That may be hard to understand, so I will fix it.
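
Concretely (the property name is made up for illustration):

```scala
// "spark.hadoop.".length == 13, hence drop(13):
val key = "spark.hadoop.dfs.replication"
key.drop(13)                      // "dfs.replication"
// A self-documenting equivalent:
key.stripPrefix("spark.hadoop.")  // "dfs.replication"
```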


---



[GitHub] spark pull request #14166: [MINOR][YARN] Fix code error in yarn-cluster unit...

2016-07-12 Thread sharkdtu
GitHub user sharkdtu opened a pull request:

https://github.com/apache/spark/pull/14166

[MINOR][YARN] Fix code error in yarn-cluster unit test

## What changes were proposed in this pull request?

Fix code error in yarn-cluster unit test.


## How was this patch tested?

Use existing tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14166


commit afb56a27b61a81d17a16405c95872eddff7e0bd1
Author: sharkd 
Date:   2016-07-12T23:59:26Z

rebase apache/master

commit 995d606243a95965cb0be28cf7006883400e09ac
Author: sharkd 
Date:   2016-07-11T16:49:56Z

fix style

commit 816979bc5e834aebd23e485bc6251640573fb0a4
Author: sharkd 
Date:   2016-07-12T23:14:02Z

fix code error in yarn-cluster unit test




---