[jira] [Created] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
guoxiaolongzte created SPARK-21250: -- Summary: Add a url in the table of 'Running Executors' in worker page to visit job page Key: SPARK-21250 URL: https://issues.apache.org/jira/browse/SPARK-21250 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.0 Reporter: guoxiaolongzte Priority: Minor Add a URL to the 'Running Executors' table on the worker page that links to the job page. When I click the 'Name' URL, the page jumps to the corresponding job page. This applies only to the 'Running Executors' table; in the 'Finished Executors' table the 'Name' URL does not exist, so clicking there does not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21251) Add Kafka consumer metrics
igor mazor created SPARK-21251: -- Summary: Add Kafka consumer metrics Key: SPARK-21251 URL: https://issues.apache.org/jira/browse/SPARK-21251 Project: Spark Issue Type: Request Components: DStreams Affects Versions: 2.1.1, 2.1.0, 2.0.1, 2.0.0 Reporter: igor mazor Adding detailed Kafka consumer metrics would help very much with debugging any issues related to the consumer. It is also helpful in general for monitoring purposes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21210) Javadoc 8 fixes for ML shared param traits
[ https://issues.apache.org/jira/browse/SPARK-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21210: - Assignee: Nick Pentreath > Javadoc 8 fixes for ML shared param traits > -- > > Key: SPARK-21210 > URL: https://issues.apache.org/jira/browse/SPARK-21210 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.1.1, 2.2.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.2.0 > > > [PR 15999|https://github.com/apache/spark/pull/15999] included fixes for doc > strings in the ML shared param traits ({{>}} and {{>=}} predominatly) - see > [from > here|https://github.com/apache/spark/pull/15999/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L32]. > However, the changes were made directly to the traits, while the changes > should have been made to {{SharedParamsCodeGen}}. So every time the code gen > is run (i.e. whenever a new shared param trait is to be added), the fixes > will be lost. > The changes also need to be made such that only the doc string is changed - > the param doc that gets printed from {{explainParams}} can remain unchanged. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21210) Javadoc 8 fixes for ML shared param traits
[ https://issues.apache.org/jira/browse/SPARK-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21210. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18420 [https://github.com/apache/spark/pull/18420] > Javadoc 8 fixes for ML shared param traits > -- > > Key: SPARK-21210 > URL: https://issues.apache.org/jira/browse/SPARK-21210 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.1.1, 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > Fix For: 2.2.0 > > > [PR 15999|https://github.com/apache/spark/pull/15999] included fixes for doc > strings in the ML shared param traits ({{>}} and {{>=}} predominatly) - see > [from > here|https://github.com/apache/spark/pull/15999/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L32]. > However, the changes were made directly to the traits, while the changes > should have been made to {{SharedParamsCodeGen}}. So every time the code gen > is run (i.e. whenever a new shared param trait is to be added), the fixes > will be lost. > The changes also need to be made such that only the doc string is changed - > the param doc that gets printed from {{explainParams}} can remain unchanged. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21252) The duration times showed by spark web UI are inaccurate
igor mazor created SPARK-21252: -- Summary: The duration times showed by spark web UI are inaccurate Key: SPARK-21252 URL: https://issues.apache.org/jira/browse/SPARK-21252 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.1, 2.1.0, 2.0.1, 2.0.0 Reporter: igor mazor The duration times shown by the Spark UI are inaccurate and seem to be rounded. For example, when a job had 2 stages, the first stage executed in 47 ms and the second stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. In another example, the first stage executed in 20 ms and the second stage in 4 seconds, and the total execution time shown by the UI was in that case also 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
[ https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21250: Assignee: (was: Apache Spark) > Add a url in the table of 'Running Executors' in worker page to visit job > page > --- > > Key: SPARK-21250 > URL: https://issues.apache.org/jira/browse/SPARK-21250 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Priority: Minor > > Add a url in the table of 'Running Executors' in worker page to visit job > page. > When I click URL of 'Name', the current page jumps to the job page. Of course > this is only in the table of 'Running Executors'. > This URL of 'Name' is in the table of 'Finished Executors' does not exist, > the click will not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
[ https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21250: Assignee: Apache Spark > Add a url in the table of 'Running Executors' in worker page to visit job > page > --- > > Key: SPARK-21250 > URL: https://issues.apache.org/jira/browse/SPARK-21250 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > Add a url in the table of 'Running Executors' in worker page to visit job > page. > When I click URL of 'Name', the current page jumps to the job page. Of course > this is only in the table of 'Running Executors'. > This URL of 'Name' is in the table of 'Finished Executors' does not exist, > the click will not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
[ https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068028#comment-16068028 ] Apache Spark commented on SPARK-21250: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/18464 > Add a url in the table of 'Running Executors' in worker page to visit job > page > --- > > Key: SPARK-21250 > URL: https://issues.apache.org/jira/browse/SPARK-21250 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Priority: Minor > > Add a url in the table of 'Running Executors' in worker page to visit job > page. > When I click URL of 'Name', the current page jumps to the job page. Of course > this is only in the table of 'Running Executors'. > This URL of 'Name' is in the table of 'Finished Executors' does not exist, > the click will not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21240: - Assignee: jin xing Issue Type: Improvement (was: Bug) > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > Fix For: 2.3.0 > > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain because only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21240. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18454 [https://github.com/apache/spark/pull/18454] > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Priority: Trivial > Fix For: 2.3.0 > > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain because only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
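A minimal sketch of the test pattern this issue asks for, assuming a plain ScalaTest {{FunSuite}} (the suite name, master URL, and test body below are illustrative placeholders, not code from the pull request): the SparkContext is stopped in a {{finally}} block so a failing test cannot leak a running context into later suites.

{code:scala}
// Hedged sketch only -- not the actual tests touched by SPARK-21240.
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class ExampleSuite extends FunSuite {
  test("a test that needs its own SparkContext") {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("SPARK-21240-style"))
    try {
      assert(sc.parallelize(1 to 10).count() == 10)
    } finally {
      // Always stop the context, even if the assertion above fails,
      // so later suites do not hit "only one SparkContext may be running".
      sc.stop()
    }
  }
}
{code}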
[jira] [Resolved] (SPARK-21135) On history server page,duration of incompleted applications should be hidden instead of showing up as 0
[ https://issues.apache.org/jira/browse/SPARK-21135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21135. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18351 [https://github.com/apache/spark/pull/18351] > On history server page,duration of incompleted applications should be hidden > instead of showing up as 0 > --- > > Key: SPARK-21135 > URL: https://issues.apache.org/jira/browse/SPARK-21135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.2.1 >Reporter: Jinhua Fu >Priority: Minor > Fix For: 2.3.0 > > > On the history server page, the duration of incomplete applications should be hidden > instead of showing up as 0. > In addition, an application that aborts abnormally (for example, one killed in the > background or whose driver goes down) will always be treated as an > incomplete application, and I'm not sure whether this is a problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21135) On history server page,duration of incompleted applications should be hidden instead of showing up as 0
[ https://issues.apache.org/jira/browse/SPARK-21135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21135: - Assignee: Jinhua Fu > On history server page,duration of incompleted applications should be hidden > instead of showing up as 0 > --- > > Key: SPARK-21135 > URL: https://issues.apache.org/jira/browse/SPARK-21135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.2.1 >Reporter: Jinhua Fu >Assignee: Jinhua Fu >Priority: Minor > Fix For: 2.3.0 > > > On the history server page, the duration of incomplete applications should be hidden > instead of showing up as 0. > In addition, an application that aborts abnormally (for example, one killed in the > background or whose driver goes down) will always be treated as an > incomplete application, and I'm not sure whether this is a problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068053#comment-16068053 ] Apache Spark commented on SPARK-21093: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18465 > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 2.3.0 > > > On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks > failed as below: > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d")) > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.debug.maxToStringFields' in SparkEnv.conf. > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > Error in handleErrors(returnStatus, conn) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 > in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage > 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: > R computation failed with > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108) > at > org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432) > at > org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.a > ... 
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated > === Backtrace: = > /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597] > /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750] > /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507] > /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015] > /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e] > /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6] > /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4] > /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e] > /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529] > /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce] > /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e] > /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7] > /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1] > /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9] > /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /u
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068069#comment-16068069 ] Sean Owen commented on SPARK-21252: --- They are rounded, and '3 seconds' is unlikely to be exactly 3 seconds, but maybe "3.49 seconds". This could explain the rounding you see, but, it's on purpose. The output is for humans and it's not worth cluttering the display with the extra digits. I don't think this is a problem. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
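A toy illustration of the rounding effect described in the comment above (this is not Spark's actual formatting code, and round-half-up is only an assumption here): a 47 ms stage plus a "3 second" stage that is really 3.49 s sums to about 3.54 s, which renders as "4 s" even though the individual stages display as "47 ms" and "3 s".

{code:scala}
// Toy example only -- not Spark's UIUtils. It just shows how per-second
// rounding can make 47 ms + "3 seconds" display as a 4-second total.
def formatDuration(ms: Long): String =
  if (ms < 1000) s"$ms ms"
  else s"${math.round(ms / 1000.0)} s" // assumes round-half-up for display

val stage1 = 47L   // milliseconds
val stage2 = 3490L // shown as "3 s", but 3.49 s underneath

println(formatDuration(stage1))          // 47 ms
println(formatDuration(stage2))          // 3 s
println(formatDuration(stage1 + stage2)) // 4 s  (3.537 s rounded up)
{code}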
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068082#comment-16068082 ] igor mazor commented on SPARK-21252: I still think that developers should be able to configure the precision they need. For example, an issue that I have: consuming from Kafka usually takes 40-50 ms. Sometimes there are network issues and the Kafka consumer is unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case; after that the consumer gets a request timeout, retries the request, and the second attempt usually succeeds. So eventually the UI shows that the stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing only 20 seconds in the UI gives no indication that there was a second attempt afterwards that took 40 ms, so it is quite hard to understand what exactly happened. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068082#comment-16068082 ] igor mazor edited comment on SPARK-21252 at 6/29/17 9:43 AM: - I still think that the developer should be able to configure by him self what precision he needs. For example an issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards seconds attend that took 40 ms and hence its quite hard to understand what exactly happens. was (Author: mazor.igal): I still think that the developer should be able to configure by him self what precision he needs. For example and issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards seconds attend that took 40 ms and hence its quite hard to understand what exactly happens. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068082#comment-16068082 ] igor mazor edited comment on SPARK-21252 at 6/29/17 9:45 AM: - I still think that the developer should be able to configure by him self what precision he needs. For example an issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards second request which took 40 ms and hence its quite hard to understand what exactly happens. was (Author: mazor.igal): I still think that the developer should be able to configure by him self what precision he needs. For example an issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards seconds attend that took 40 ms and hence its quite hard to understand what exactly happens. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068085#comment-16068085 ] Sean Owen commented on SPARK-21252: --- You can drill into individual stage times, right? > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
[ https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068106#comment-16068106 ] Kaushal Prajapati commented on SPARK-19908: --- [~zhanzhang] can you plz share some example code for which you are getting this error? > Direct buffer memory OOM should not cause stage retries. > > > Key: SPARK-19908 > URL: https://issues.apache.org/jira/browse/SPARK-19908 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the > exception will be changed to FetchFailedException, causing stage retries. > org.apache.spark.shuffle.FetchFailedException: Direct buffer memory > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAll
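To make the behaviour described in this issue concrete, here is a hedged, self-contained sketch (it is not Spark's source; {{FetchFailed}} below is a stand-in class, since Spark's own {{FetchFailedException}} is private to the project): any {{Throwable}} raised while a reducer materializes a fetched block, including an {{OutOfMemoryError: Direct buffer memory}}, resurfaces as a fetch failure, which the scheduler answers with a stage retry instead of treating it as a memory problem.

{code:scala}
// Stand-in sketch, not Spark internals: wrapping every Throwable from a
// block fetch turns a direct-buffer OOM into a "fetch failed" signal,
// which is what triggers the unwanted stage retries.
final class FetchFailed(host: String, shuffleId: Int, cause: Throwable)
  extends Exception(s"Fetch failed for shuffle $shuffleId on $host", cause)

def materializeBlock(host: String, shuffleId: Int)(fetch: => Array[Byte]): Array[Byte] =
  try {
    fetch
  } catch {
    // java.lang.OutOfMemoryError is a Throwable, so it is caught here too
    // and reported as a fetch failure rather than as an executor OOM.
    case t: Throwable => throw new FetchFailed(host, shuffleId, t)
  }
{code}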
[jira] [Resolved] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.
[ https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-17689. - Resolution: Cannot Reproduce I am unable to reproduce this anymore, looks like this might be fixed by some other changes. > _temporary files breaks the Spark SQL streaming job. > > > Key: SPARK-17689 > URL: https://issues.apache.org/jira/browse/SPARK-17689 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma > > Steps to reproduce: > 1) Start a streaming job which reads from HDFS location hdfs://xyz/* > 2) Write content to hdfs://xyz/a > . > . > repeat a few times. > And then job breaks as follows. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in > stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage > 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not > exist: hdfs://localhost:9000/input/t5/_temporary > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309) > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
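A hedged sketch of the reproduction steps described above (the paths, output sink, and session settings are illustrative placeholders): a file-source streaming query watching an HDFS glob while files are written under that location from outside. If a writer leaves a {{_temporary}} directory under the watched path while the source lists files, the listing can race with the writer and fail with the {{FileNotFoundException}} shown in the stack trace.

{code:scala}
// Sketch only -- paths and sink come from the description above, not from
// a verified reproduction script.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-17689-temporary-files")
  .getOrCreate()

// 1) Start a streaming job which reads from the HDFS location hdfs://xyz/*
val lines = spark.readStream.text("hdfs://xyz/*")

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

// 2) Concurrently write content to hdfs://xyz/a (for example with a batch
//    job that creates a _temporary directory), repeat a few times, and the
//    query fails as shown in the stack trace above.
query.awaitTermination()
{code}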
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068147#comment-16068147 ] igor mazor commented on SPARK-21252: Yes, it is indeed possible to drill into individual stages; however, the scenario I described in my last comment happens within a single stage. The first attempt, followed by a timeout after 20 seconds, and then the second, successful attempt are all in the same stage, so the UI shows a duration of 20 seconds, although that is not accurate. Also, if a stage really took 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding, which also makes the presentation of results inconsistent. If rounding is already applied, why is 40 ms not rounded to 0 seconds? I also noticed, for example, a stage that took 42 ms (Job Duration): when I click on that stage I see Duration = 38 ms, and when going deeper into the Summary Metrics for Tasks, I see that the longest task was 35 ms, so I am not sure where all the gaps went. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068147#comment-16068147 ] igor mazor edited comment on SPARK-21252 at 6/29/17 10:42 AM: -- Yes, its indeed possible to drill to individual stages, however the scenario I described on my last comment is in the same stage. The first attempt following by timeout after 20 seconds and then the second successful attempt are all in the same stage and therefore the UI shows 20 seconds duration, although thats not true. Also, if the stage was indeed 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding which also causing for inconsistent result presentation. If you already do the rounding, why then the 40 ms is not rounded to 0 seconds ? Also I noticed now that for example stage that took 42 ms (Job Duration), when I click on that stage I see Duration = 38 ms and when going deeper into the Summary Metrics for Tasks, I see that the longest task was 35 ms, so not sure where to all the gaps went. was (Author: mazor.igal): Yes, its indeed possible to drill to individual stages, however the scenario I described on my last comment is happens in the same stage. The first attempt following by timeout after 20 seconds and then the second successful attempt are all in the same stage and therefore the UI shows 20 seconds duration, although thats not true. Also, if the stage was indeed 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding which also causing for inconsistent result presentation. If you already do the rounding, why then the 40 ms is not rounded to 0 seconds ? Also I noticed now that for example stage that took 42 ms (Job Duration), when I click on that stage I see Duration = 38 ms and when going deeper into the Summary Metrics for Tasks, I see that the longest task was 35 ms, so not sure where to all the gaps went. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21253) Cannot fetch big blocks to disk
Yuming Wang created SPARK-21253: --- Summary: Cannot fetch big blocks to disk Key: SPARK-21253 URL: https://issues.apache.org/jira/browse/SPARK-21253 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Yuming Wang Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. Need a shuffe: {code:actionscript} scala> val count = sc.parallelize(0 until 300, 10).repartition(2001).collect().length {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
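For reference, the same reproduction as a hedged standalone sketch rather than a spark-shell session (the object and application names are placeholders; the configuration key and value are the ones quoted above): the shuffle-to-disk threshold is lowered so that reducers fetch remote blocks to disk.

{code:scala}
// Sketch of the reproduction above as an application instead of spark-shell.
// The config key/value and the parallelize/repartition line come from the
// issue description; everything else is illustrative. Submit it against a
// cluster -- the report says local mode does not reproduce the failure.
import org.apache.spark.{SparkConf, SparkContext}

object Spark21253Repro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SPARK-21253-repro")
      .set("spark.reducer.maxReqSizeShuffleToMem", "1K") // force fetch-to-disk
    val sc = new SparkContext(conf)
    // A wide repartition so reducers must fetch many remote shuffle blocks.
    val count = sc.parallelize(0 until 300, 10).repartition(2001).collect().length
    println(s"count = $count")
    sc.stop()
  }
}
{code}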
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068169#comment-16068169 ] Apache Spark commented on SPARK-21253: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18466 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused 
by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.O
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21253: Assignee: (was: Apache Spark) > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) > at > org.apache.spark.network.client.T
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21253: Assignee: Apache Spark > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Apache Spark > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) > at > org.apac
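For reference, a minimal sketch of the same reproduction driven from an application instead of spark-shell. The SparkSession builder usage and the "1K" byte-string value are assumptions; the config key and the repartition job are taken from the report above, and as the reporter notes it only reproduces on a real cluster, not in local mode.

{code:scala}
// Sketch only: programmatic version of the spark-shell reproduction above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-21253-repro")
  // Force even tiny shuffle blocks to be streamed to disk instead of memory.
  .config("spark.reducer.maxReqSizeShuffleToMem", "1K")
  .getOrCreate()

// A wide repartition forces a shuffle with many small blocks per reducer.
val count = spark.sparkContext.parallelize(0 until 300, 10)
  .repartition(2001)
  .count()

println(s"count = $count")
{code}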
[jira] [Updated] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21252: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I don't know what the underlying exact values are; those are likely available from the API. That would help figure out if there's an actual rounding problem. If not, I don't think this would be changed. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor >Priority: Minor > > The duration times shown by the Spark UI are inaccurate and seem to be rounded. > For example, when a job had 2 stages, the first stage executed in 47 ms and the second > stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. > In another example, the first stage executed in 20 ms and the second stage in 4 > seconds; the total execution time shown by the UI was in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
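To check whether this is a rounding artifact or a real discrepancy, the per-stage timings can be read from the monitoring REST API, which exposes the underlying millisecond-precision values that the UI summarizes. A minimal sketch; the application id, host, and port are placeholders, and the exact JSON field names should be checked against the version in use.

{code:scala}
// Sketch only: pull the raw per-stage data from the REST API of a running
// application and inspect the timestamps it reports.
import scala.io.Source

val appId = "app-20170629123456-0001"   // hypothetical application id
val url   = s"http://localhost:4040/api/v1/applications/$appId/stages"

// The endpoint returns one JSON object per stage attempt; its timestamp
// fields can show whether the UI is rounding or mis-adding durations.
val json = Source.fromURL(url).mkString
println(json)
{code}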
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Attachment: ui-thread-dump-jqhadoop221-154.gif > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 
1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFet
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068232#comment-16068232 ] Yuming Wang commented on SPARK-21253: - It may be hang for a {{spark-sql}} application also: !ui-thread-dump-jqhadoop221-154.gif! > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) >
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Description: Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. Need a shuffe: {code:actionscript} scala> sc.parallelize(0 until 300, 10).repartition(2001).count() {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) 
at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.ti
[jira] [Resolved] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21225. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18435 [https://github.com/apache/spark/pull/18435] > decrease the Mem using for variable 'tasks' in function resourceOffers > -- > > Key: SPARK-21225 > URL: https://issues.apache.org/jira/browse/SPARK-21225 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: yangZhiguo >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > In the function 'resourceOffers', a variable 'tasks' is declared to > store the tasks that have been allocated an executor. It is declared like this: > *{color:#d04437}val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](o.cores)){color}* > However, this code only considers the situation of one task per core. > If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so > much space. It could be modified as follows: > {color:#14892c}*val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21225: --- Assignee: yangZhiguo > decrease the Mem using for variable 'tasks' in function resourceOffers > -- > > Key: SPARK-21225 > URL: https://issues.apache.org/jira/browse/SPARK-21225 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: yangZhiguo >Assignee: yangZhiguo >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > In the function 'resourceOffers', a variable 'tasks' is declared to > store the tasks that have been allocated an executor. It is declared like this: > *{color:#d04437}val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](o.cores)){color}* > However, this code only considers the situation of one task per core. > If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so > much space. It could be modified as follows: > {color:#14892c}*val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
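To make the saving concrete, a worked sketch of the two capacity hints for an assumed offer of 8 cores with spark.task.cpus = 2; the numbers are illustrative, not taken from the issue.

{code:scala}
// Sketch only: compares the old and proposed initial-capacity hints for the
// per-offer ArrayBuffer[TaskDescription] described above.
val cores       = 8   // o.cores in the snippet above (assumed value)
val cpusPerTask = 2   // CPUS_PER_TASK, i.e. spark.task.cpus (assumed value)

val oldCapacity = cores                                       // 8 slots
val newCapacity = math.ceil(cores * 1.0 / cpusPerTask).toInt  // 4 slots

// At most cores / cpusPerTask tasks can be scheduled on this offer, so the
// smaller initial capacity is sufficient and avoids over-allocating the
// backing array.
println(s"old = $oldCapacity, new = $newCapacity")
{code}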
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Description: Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. A shuffle: {code:actionscript} scala> sc.parallelize(0 until 300, 10).repartition(2001).count() {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at 
io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.timeou
[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
[ https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068301#comment-16068301 ] Steve Loughran commented on SPARK-12868: It's actually not the cause of that, merely the messenger. Cause is HADOOP-14383: a combination of spark 2.2 & Hadoop 2.9+ will trigger the problem. Fix belongs in Hadoop. > ADD JAR via sparkSQL JDBC will fail when using a HDFS URL > - > > Key: SPARK-12868 > URL: https://issues.apache.org/jira/browse/SPARK-12868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Trystan Leftwich >Assignee: Weiqing Yang > Fix For: 2.2.0 > > > When trying to add a jar with a HDFS URI, i.E > {code:sql} > ADD JAR hdfs:///tmp/foo.jar > {code} > Via the spark sql JDBC interface it will fail with: > {code:sql} > java.net.MalformedURLException: unknown protocol: hdfs > at java.net.URL.(URL.java:593) > at java.net.URL.(URL.java:483) > at java.net.URL.(URL.java:432) > at java.net.URI.toURL(URI.java:1089) > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578) > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA 
(v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
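For context, the usual JVM-level mechanism that makes {{hdfs://}} parseable by {{java.net.URL}} is Hadoop's stream handler factory. A hedged sketch of that mechanism only; whether it interacts with the Hadoop-side change Steve references (HADOOP-14383), or is appropriate in the Thrift server code path above, is not established here.

{code:scala}
// Sketch only: registering Hadoop's URL stream handler factory so that
// java.net.URL stops throwing "unknown protocol: hdfs".
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// setURLStreamHandlerFactory may only be called once per JVM, so guard it
// appropriately in real code.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()))

// After registration this constructor no longer fails on the hdfs scheme.
val jarUrl = new URL("hdfs:///tmp/foo.jar")
println(jarUrl.getProtocol)
{code}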
[jira] [Assigned] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21052: --- Assignee: Liang-Chi Hsieh > Add hash map metrics to join > > > Key: SPARK-21052 > URL: https://issues.apache.org/jira/browse/SPARK-21052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We should add avg hash map probe metric to join operator and report it on UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21052. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18301 [https://github.com/apache/spark/pull/18301] > Add hash map metrics to join > > > Key: SPARK-21052 > URL: https://issues.apache.org/jira/browse/SPARK-21052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We should add avg hash map probe metric to join operator and report it on UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Attachment: screenshot-1.png > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time, which in itself is > not a big issue and only causes a small latency in case of 10k+ rows returned > from the server. The problem is that UI spends most of the time processing > the results even according to chrome devtools (newtwork IO is taking less > than 1s): > !attachment-name.jpg|thumbnail! > In case of larger amount of rows returned (10k+) this time grows > dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE > freezes the process. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21254) History UI: Taking over 1 minute for initial page display
Dmitry Parfenchik created SPARK-21254: - Summary: History UI: Taking over 1 minute for initial page display Key: SPARK-21254 URL: https://issues.apache.org/jira/browse/SPARK-21254 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: Dmitry Parfenchik Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that the UI spends most of the time processing the results, even according to Chrome devtools (network IO takes less than 1s): !attachment-name.jpg|thumbnail! With a larger number of rows returned (10k+) this time grows dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Description: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that UI spends most of the time processing the results even according to chrome devtools (newtwork IO is taking less than 1s): !screenshot-1.png! In case of larger amount of rows returned (10k+) this time grows dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE freezes the process. was: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that UI spends most of the time processing the results even according to chrome devtools (newtwork IO is taking less than 1s): !attachment-name.jpg|thumbnail! In case of larger amount of rows returned (10k+) this time grows dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE freezes the process. > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time, which in itself is > not a big issue and only causes a small latency in case of 10k+ rows returned > from the server. The problem is that UI spends most of the time processing > the results even according to chrome devtools (newtwork IO is taking less > than 1s): > !screenshot-1.png! > In case of larger amount of rows returned (10k+) this time grows > dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE > freezes the process. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Description: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time. On large amount of rows returned (10k+) page load time grows dramatically, causing 1min+ delay in Chrome and freezing the process in Firefox, Safari and IE. A simple inspection in Chrome shows that network is not an issue here and only causes a small latency (<1s) while most of the time is spend in UI processing the results even according to chrome devtools: !screenshot-1.png! was: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that UI spends most of the time processing the results even according to chrome devtools (newtwork IO is taking less than 1s): !screenshot-1.png! In case of larger amount of rows returned (10k+) this time grows dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE freezes the process. > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time. On large amount of > rows returned (10k+) page load time grows dramatically, causing 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that network is not an issue here and > only causes a small latency (<1s) while most of the time is spend in UI > processing the results even according to chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Priority: Minor (was: Major) > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Priority: Minor > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time. On large amount of > rows returned (10k+) page load time grows dramatically, causing 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that network is not an issue here and > only causes a small latency (<1s) while most of the time is spend in UI > processing the results even according to chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
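As a workaround sketch while the summary page itself is slow: the same listing is available as JSON from the history server REST API, which skips the client-side table rendering that dominates the load time above. The host, port, and the {{limit}} query parameter are assumptions to check against the deployed version; newer releases also have a {{spark.history.ui.maxApplications}} setting that caps how many applications the summary page renders.

{code:scala}
// Sketch only: fetch a bounded application list as JSON instead of rendering
// the full history table in the browser.
import scala.io.Source

val historyServer = "http://localhost:18080"   // placeholder host/port
val url = s"$historyServer/api/v1/applications?limit=100"

val json = Source.fromURL(url).mkString
println(json.take(2000))   // print only the beginning of the payload
{code}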
[jira] [Created] (SPARK-21255) NPE when creating encoder for enum
Mike created SPARK-21255: Summary: NPE when creating encoder for enum Key: SPARK-21255 URL: https://issues.apache.org/jira/browse/SPARK-21255 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.0 Environment: org.apache.spark:spark-core_2.10:2.1.0 org.apache.spark:spark-sql_2.10:2.1.0 Reporter: Mike When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference:126 following code {code:scala} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
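A minimal reproduction sketch of the failure path described in the report, using a JDK enum as the enum-typed bean property. The bean class and field names are made up for illustration; the expected NPE is taken from the report, not verified here.

{code:scala}
// Sketch only: a bean with an enum-typed property passed to Encoders.bean.
import java.util.concurrent.TimeUnit
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

class EventBean extends Serializable {
  @BeanProperty var name: String = _
  @BeanProperty var unit: TimeUnit = _   // enum-typed property triggers the issue
}

// On affected versions this is reported to fail with a NullPointerException
// from the recursive inspection of the enum's "declaringClass" property,
// as analysed above.
val encoder = Encoders.bean(classOf[EventBean])
println(encoder.schema)
{code}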
[jira] [Updated] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike updated SPARK-21255: - Description: When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference:126 following code {code:java} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I think adding property name "declaringClass" to filtering will resolve this. was: When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference:126 following code {code:scala} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I think adding property name "declaringClass" to filtering will resolve this. > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for Enum type (or bean with enum property) > via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out, that in JavaTypeInference:126 > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. 
Eventually we > try to inspect ClassLoader class, which has property "defaultAssertionStatus" > with no read method, which leads to NPE at TypeToken:495. > I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068520#comment-16068520 ] Sean Owen commented on SPARK-21255: --- Is the change to omit declaringClass too? > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for Enum type (or bean with enum property) > via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out, that in JavaTypeInference:126 > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. Eventually we > try to inspect ClassLoader class, which has property "defaultAssertionStatus" > with no read method, which leads to NPE at TypeToken:495. > I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21254. --- Resolution: Duplicate Duplicate of lots of JIRAs related to making the initial read faster > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Priority: Minor > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time. On large amount of > rows returned (10k+) page load time grows dramatically, causing 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that network is not an issue here and > only causes a small latency (<1s) while most of the time is spend in UI > processing the results even according to chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068553#comment-16068553 ] Monica Raj commented on SPARK-21246: Thanks for your response. I also tried with Seq(3) as Seq(3L), however I had changed this back during the course of trying other options. I should also mention that we are running Zeppelin 0.6.0. I tried running the code you provided and still got the following output: import org.apache.spark.sql.types._ import org.apache.spark.sql.Row schemaString: String = name lstVals: Seq[Long] = List(3) rowRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[30] at map at :59 res20: Array[org.apache.spark.sql.Row] = Array([3]) fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,LongType,true)) schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,LongType,true)) StructType(StructField(name,LongType,true))peopleDF: org.apache.spark.sql.DataFrame = [name: bigint] ++ |name| ++ | 3| ++ > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
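The {{bigint}} in the schema output above is simply Spark SQL's name for {{LongType}}, so the output shows the expected rendering rather than a type conversion. A small sketch illustrating this:

{code:scala}
// Sketch only: LongType renders as "bigint" in Spark SQL type strings.
import org.apache.spark.sql.types.{LongType, StructField, StructType}

println(LongType.simpleString)                                   // "bigint"
println(StructType(Seq(StructField("name", LongType))).simpleString)
// "struct<name:bigint>"
{code}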
[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068700#comment-16068700 ] ARUNA KIRAN NULU commented on SPARK-4131: - Is this feature available? > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: Fei Wang >Priority: Critical > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Spark SQL does not support writing data into the filesystem from queries, > e.g.: > {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
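A commonly used alternative while the quoted SQL syntax is unsupported is to run the query through the DataFrame API and write the result out explicitly. A sketch only: the table name and output path are taken from the example above, the format choice is an assumption, and unlike LOCAL DIRECTORY this writes through whatever filesystem the path resolves to on the executors.

{code:scala}
// Sketch only: write query results to a directory via the DataFrame API.
val result = sqlContext.sql("SELECT * FROM page_views")

result.write
  .mode("overwrite")
  .format("csv")                         // or json/parquet as needed
  .save("file:///data1/wangxj/sql_spark")
{code}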
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-21253: - Target Version/s: 2.2.0 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-21253: Assignee: Shixiong Zhu > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFe
[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation
[ https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20783: - Summary: Enhance ColumnVector to support compressed representation (was: Enhance ColumnVector to keep UnsafeArrayData for array) > Enhance ColumnVector to support compressed representation > - > > Key: SPARK-20783 > URL: https://issues.apache.org/jira/browse/SPARK-20783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnVector}} accepts only primitive-type Java array as an input > for array. It is good to keep data from Parquet. > On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used > to represent array, map, and struct. To keep these data, this JIRA entry > enhances {{ColumnVector}} to keep UnsafeArrayData. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation
[ https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20783: - Description: Current {{ColumnVector}} handles uncompressed data for parquet. For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to have compressed data. As first step of this implementation, this JIRA supports primitive data and string types. was: Current {{ColumnVector}} accepts only primitive-type Java array as an input for array. It is good to keep data from Parquet. On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used to represent array, map, and struct. To keep these data, this JIRA entry enhances {{ColumnVector}} to keep UnsafeArrayData. > Enhance ColumnVector to support compressed representation > - > > Key: SPARK-20783 > URL: https://issues.apache.org/jira/browse/SPARK-20783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnVector}} handles uncompressed data for parquet. > For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to > have compressed data. > As first step of this implementation, this JIRA supports primitive data and > string types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068793#comment-16068793 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18467 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(Transpor
[jira] [Commented] (SPARK-20873) Improve the error message for unsupported Column Type
[ https://issues.apache.org/jira/browse/SPARK-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068800#comment-16068800 ] Apache Spark commented on SPARK-20873: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18468 > Improve the error message for unsupported Column Type > - > > Key: SPARK-20873 > URL: https://issues.apache.org/jira/browse/SPARK-20873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ruben Janssen >Assignee: Ruben Janssen > Fix For: 2.3.0 > > > For unsupported column type, we simply output the column type instead of the > type name. > {noformat} > java.lang.Exception: Unsupported type: > org.apache.spark.sql.execution.columnar.ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d > {noformat} > We should improve it by outputting its name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21256) Add WithSQLConf to Catalyst
Xiao Li created SPARK-21256: --- Summary: Add WithSQLConf to Catalyst Key: SPARK-21256 URL: https://issues.apache.org/jira/browse/SPARK-21256 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
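For reference, a minimal, self-contained sketch of what a {{withSQLConf}} helper does (the actual helper added to Catalyst presumably works against {{SQLConf}} directly, since Catalyst tests have no {{SparkSession}}; the version below takes a session only to keep the example runnable):

{code}
import org.apache.spark.sql.SparkSession

// Temporarily override SQL conf entries for the duration of `f`, then restore
// the previous values (or unset keys that were not set before).
def withSQLConf(spark: SparkSession)(pairs: (String, String)*)(f: => Unit): Unit = {
  val conf = spark.conf
  val previous = pairs.map { case (key, _) => key -> conf.getOption(key) }
  pairs.foreach { case (key, value) => conf.set(key, value) }
  try f finally {
    previous.foreach {
      case (key, Some(old)) => conf.set(key, old)
      case (key, None)      => conf.unset(key)
    }
  }
}
{code}

Typical use in a test is {{withSQLConf(spark)("spark.sql.caseSensitive" -> "true") { ... }}}, so assertions inside the block see the overridden value without leaking it into later tests.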
[jira] [Updated] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21256: Summary: Add WithSQLConf to Catalyst Test (was: Add WithSQLConf to Catalyst) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068832#comment-16068832 ] Apache Spark commented on SPARK-21256: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18469 > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21256: Assignee: Xiao Li (was: Apache Spark) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21256: Assignee: Apache Spark (was: Xiao Li) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21257) LDA : create an Evaluator to enable cross validation
Mathieu DESPRIEE created SPARK-21257: Summary: LDA : create an Evaluator to enable cross validation Key: SPARK-21257 URL: https://issues.apache.org/jira/browse/SPARK-21257 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.1.1 Reporter: Mathieu DESPRIEE I suggest the creation of an LDAEvaluator to use with CrossValidator, using logPerplexity as a metric for evaluation. Unfortunately, the computation of perplexity needs to access some internal data of the model, and the current implementation of CrossValidator does not pass the model being evaluated to the Evaluator. A way could be to change the Evaluator.evaluate() method to pass the model along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
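For concreteness, a rough sketch of such an evaluator is below. It sidesteps the CrossValidator limitation described above by binding the evaluator to an already-fitted model; the class name and that workaround are illustrative assumptions, not a proposed final API.

{code}
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset

// Hypothetical evaluator bound to a fitted model, since CrossValidator does not
// hand the model under evaluation to the Evaluator.
class LDAPerplexityEvaluator(val model: LDAModel, override val uid: String)
  extends Evaluator {

  def this(model: LDAModel) = this(model, Identifiable.randomUID("ldaPerplexityEval"))

  // logPerplexity is an upper bound on perplexity; lower is better.
  override def evaluate(dataset: Dataset[_]): Double = model.logPerplexity(dataset)

  override def isLargerBetter: Boolean = false

  override def copy(extra: ParamMap): LDAPerplexityEvaluator =
    new LDAPerplexityEvaluator(model, uid)
}
{code}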
[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation
[ https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu DESPRIEE updated SPARK-21257: - Issue Type: New Feature (was: Improvement) > LDA : create an Evaluator to enable cross validation > > > Key: SPARK-21257 > URL: https://issues.apache.org/jira/browse/SPARK-21257 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Mathieu DESPRIEE > > I suggest the creation of an LDAEvaluator to use with CrossValidator, using > logPerplexity as a metric for evaluation. > Unfortunately, the computation of perplexity needs to access some internal > data of the model, and the current implementation of CrossValidator does not > pass the model being evaluated to the Evaluator. > A way could be to change the Evaluator.evaluate() method to pass the model > along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation
[ https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu DESPRIEE updated SPARK-21257: - Issue Type: Improvement (was: New Feature) > LDA : create an Evaluator to enable cross validation > > > Key: SPARK-21257 > URL: https://issues.apache.org/jira/browse/SPARK-21257 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Mathieu DESPRIEE > > I suggest the creation of an LDAEvaluator to use with CrossValidator, using > logPerplexity as a metric for evaluation. > Unfortunately, the computation of perplexity needs to access some internal > data of the model, and the current implementation of CrossValidator does not > pass the model being evaluated to the Evaluator. > A way could be to change the Evaluator.evaluate() method to pass the model > along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068876#comment-16068876 ] Shixiong Zhu commented on SPARK-21253: -- [~q79969786] did you run Spark 2.2.0-rcX on Yarn which has a Spark 2.1.* shuffle service? > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.ja
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068980#comment-16068980 ] Cody Koeninger commented on SPARK-18057: Kafka 0.11 is now released. Are we upgrading spark artifacts named kafka-0-10 to use kafka 0.11, or are we renaming them to kafka-0-11? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068988#comment-16068988 ] Asher Krim edited comment on SPARK-20797 at 6/29/17 9:19 PM: - This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294 was (Author: akrim): This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068988#comment-16068988 ] Asher Krim commented on SPARK-20797: This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
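For reference, a minimal sketch of the repartitioning suggested in the description above, sized the same way Word2Vec's save is; {{k}} and {{vocabSize}} stand for the number of topics and the vocabulary size, and the 64m buffer default is an assumption for illustration:

{code}
// Estimate the serialized size of the topic matrix (vocabSize x k doubles) and
// spread it over enough partitions to stay under the kryo buffer limit,
// instead of the current repartition(1).
val approxSizeBytes = 8L * k * vocabSize
val bufferSizeBytes = 64L * 1024 * 1024  // assumed spark.kryoserializer.buffer.max
val nPartitions = ((approxSizeBytes / bufferSizeBytes) + 1).toInt

spark.createDataFrame(topics)
  .repartition(nPartitions)
  .write.parquet(Loader.dataPath(path))
{code}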
[jira] [Created] (SPARK-21258) Window result incorrect using complex object with spilling
Herman van Hovell created SPARK-21258: - Summary: Window result incorrect using complex object with spilling Key: SPARK-21258 URL: https://issues.apache.org/jira/browse/SPARK-21258 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Assignee: Herman van Hovell -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21258: Assignee: Apache Spark (was: Herman van Hovell) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21258: Assignee: Herman van Hovell (was: Apache Spark) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069084#comment-16069084 ] Apache Spark commented on SPARK-21258: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/18470 > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069109#comment-16069109 ] Yuming Wang commented on SPARK-21253: - I checked it, all jars are latest 2.2.0-rcX. > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069146#comment-16069146 ] Mike commented on SPARK-21255: -- Yes, but there may be other problems (at least I cannot guarantee there won't be), since it seems like enums were never used before. > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for Enum type (or bean with enum property) > via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out, that in JavaTypeInference:126 > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. Eventually we > try to inspect ClassLoader class, which has property "defaultAssertionStatus" > with no read method, which leads to NPE at TypeToken:495. > I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
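For clarity, the one-line change proposed in the description would look roughly like this in {{JavaTypeInference}} (a sketch of the proposal, not necessarily the fix that gets merged):

{code}
val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
// Skip "class" as before, and also the enum-specific "declaringClass" property,
// so inference does not recurse into java.lang.Class and hit the NPE.
val properties = beanInfo.getPropertyDescriptors
  .filterNot(p => p.getName == "class" || p.getName == "declaringClass")
{code}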
[jira] [Resolved] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method
[ https://issues.apache.org/jira/browse/SPARK-21188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21188. -- Resolution: Fixed Assignee: Feng Liu Fix Version/s: 2.3.0 > releaseAllLocksForTask should synchronize the whole method > -- > > Key: SPARK-21188 > URL: https://issues.apache.org/jira/browse/SPARK-21188 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Feng Liu >Assignee: Feng Liu > Fix For: 2.3.0 > > > Since the objects readLocksByTask, writeLocksByTask and infos are coupled and > expected to be modified by other threads concurrently, all the reads and > writes of them in the releaseAllLocksForTask method should be protected by a > single synchronized block. The fine-grained synchronization in the current > code can cause some test flakiness. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
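In code terms, the fix amounts to holding the manager's monitor for the entire method rather than for several smaller sections. A sketch with the body elided and the signature approximated:

{code}
import org.apache.spark.storage.BlockId

// One lock around everything that touches readLocksByTask, writeLocksByTask and
// infos, so no other thread can interleave between the individual steps.
def releaseAllLocksForTask(taskAttemptId: Long): Seq[BlockId] = synchronized {
  // ... release this task's read and write locks and update `infos`,
  // all while holding the lock ...
  Seq.empty[BlockId]
}
{code}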
[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-20690: Summary: Subqueries in FROM should have alias names (was: Analyzer shouldn't add missing attributes through subquery) > Subqueries in FROM should have alias names > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21259) More rules for scalastyle
Gengliang Wang created SPARK-21259: -- Summary: More rules for scalastyle Key: SPARK-21259 URL: https://issues.apache.org/jira/browse/SPARK-21259 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Gengliang Wang Priority: Minor During code review, we spent so much time on code style issues. It would be great if we add rules: 1) disallow space before colon 2) disallow space before right parentheses 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
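As a quick illustration of the three rules (hypothetical snippets, not taken from the Spark codebase):

{code}
// Would be flagged by the proposed rules:
def parseBad(s : String): Int = s.trim.toInt   // space before ':'
val bad = parseBad( "42" )                     // space after '(' and space before ')'

// Accepted form:
def parseGood(s: String): Int = s.trim.toInt
val good = parseGood("42")
{code}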
[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-20690: Description: We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: {code} select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 {code} This query works in current codebase. However, the outside where clause shouldn't be able to refer t1.c1 attribute. The root cause is we allow subqueries in FROM have no alias names previously, it is confusing and isn't supported by various databases such as MySQL, Postgres, Oracle. We shouldn't support it too. was: We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: {code} select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 {code} This query works in current codebase. However, the outside where clause shouldn't be able to refer t1.c1 attribute. > Subqueries in FROM should have alias names > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute. > The root cause is we allow subqueries in FROM have no alias names previously, > it is confusing and isn't supported by various databases such as MySQL, > Postgres, Oracle. We shouldn't support it too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
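For illustration, the same query written with an explicit alias on the FROM subquery, which is the form the change requires ({{spark}} is assumed to be a SparkSession; the table and column come from the example above):

{code}
// The subquery must expose the column the outer query filters on and must be
// given an alias; the outer WHERE then refers to that alias, not to t1 inside.
spark.sql("SELECT 1 FROM (SELECT c1 FROM onerow LIMIT 1) t2 WHERE t2.c1 = 1")
{code}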
[jira] [Commented] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069208#comment-16069208 ] Apache Spark commented on SPARK-21259: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/18471 > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Minor > > During code review, we spent so much time on code style issues. > It would be great if we add rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21259: Assignee: Apache Spark > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > During code review, we spent so much time on code style issues. > It would be great if we add rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21259: Assignee: (was: Apache Spark) > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Minor > > During code review, we spent so much time on code style issues. > It would be great if we add rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069227#comment-16069227 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18472 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(Transpor
[jira] [Closed] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
[ https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K closed SPARK-21148. - Resolution: Duplicate > Set SparkUncaughtExceptionHandler to the Master > --- > > Key: SPARK-21148 > URL: https://issues.apache.org/jira/browse/SPARK-21148 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > If any thread of the Master hits an uncaught exception, that thread terminates > and the Master process keeps running without functioning properly. > I think we need to handle the uncaught exception and exit the Master > gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
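For context, the improvement boils down to installing Spark's default uncaught-exception handler early in the Master's startup, as executors already do. A sketch, assuming the Spark-internal {{SparkUncaughtExceptionHandler}} object (whether the handler should also exit the process is the open question):

{code}
import org.apache.spark.util.SparkUncaughtExceptionHandler

// Process-wide handler: an uncaught exception in any Master thread is logged and
// handled instead of silently killing that thread while a half-broken Master
// keeps running.
Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
{code}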
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069269#comment-16069269 ] Leif Walsh commented on SPARK-21190: I agree with [~icexelloss] that we should aim to provide an API which provides "logical" groups of data to the UDF rather than the implementation detail of providing partitions wholesale. Setting aside for the moment the problem with dataset skew which could cause one group to be very large, let's look at some use cases. One obvious use case that tools like dplyr and pandas support is {{df.groupby(...).aggregate(...)}}. Here, we group on some key and apply a function to each logical group. This can be used to e.g. demean each group w.r.t. its cohort. Another use case that we care about with [Flint|https://github.com/twosigma/flint] is aggregating over a window. In pandas terminology this is the {{rolling}} operator. One might want to, for each row, perform a moving window average or rolling regression over a history of some size. The windowed aggregation poses a performance question that the groupby case doesn't: namely, if we naively send each window to the python worker independently, we're transferring a lot of duplicate data since each overlapped window contains many of the same rows. An option here is to transfer the entire partition on the backend and then instruct the python worker to call the UDF with slices of the whole dataset according to the windowing requested by the user. I think the idea of presenting a whole partition in a pandas dataframe to a UDF is a bit off-track. If someone really wants to apply a python function to the "whole" dataset, they'd be best served by pulling those data back to the driver and just using pandas, if they tried to use spark's partitions they'd get somewhat arbitrary partitions and have to implement some kind of merge operator on their own. However, with grouped and windowed aggregations, we can provide an API which truly is parallelizable and useful. I want to focus on use cases where we actually can parallelize without requiring a merge operator right now. Aggregators in pandas and related tools in the ecosystem usually assume they have access to all the data for an operation and don't need to merge results of subaggregations. For aggregations over larger datasets you'd really want to encourage the use of native Spark operations (that use e.g. {{treeAggregate}}). Does that make sense? I think it focuses the problem nicely that it becomes fairly tractable. I think the really hard part of this API design is deciding what the inputs and outputs of the UDF look like, and providing for the myriad use cases therein. For example, one might want to aggregate each group down to a scalar (e.g. mean) and do something with that (either produce a reduced dataset with one value per group, or add a column where each group has the same value across all rows), or one might want to compute over the group and produce a value per row within the group and attach that as a new column (e.g. demeaning or ranking). These translate roughly to the differences between the [**ply operations in dplyr|https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf] or the differences in pandas between {{df.groupby(...).agg(...)}} and {{df.groupby(...).transform(...)}} and {{df.groupby(...).apply(...)}}. 
> SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The
[jira] [Resolved] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21246. -- Resolution: Invalid Of course, it follows the schema a user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception when the schema is mismatched. At the least, I don't think the schema is an issue here. I am resolving this. > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
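One clarifying note on the LONG vs BIGINT concern in the title: in Spark SQL, {{LongType}} is simply rendered with the SQL type name {{bigint}}; no data is converted. A quick check:

{code}
import org.apache.spark.sql.types._

// LongType's SQL-facing name is "bigint"; the in-memory type is unchanged.
val field = StructField("name", LongType, nullable = true)
println(field.dataType.simpleString)  // prints: bigint
{code}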
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069300#comment-16069300 ] Helena Edelson commented on SPARK-18057: IMHO kafka-0-11 to be explicit and wait until kafka 0.11.1.0 which per https://issues.apache.org/jira/browse/KAFKA-4879 resolves the last blocker to upgrading? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error
[ https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21230. -- Resolution: Invalid It is hard for me to read and reproduce as well. Let's close this unless you are going to fix it or provide some explicit steps to reproduce that anyone can follow. Also, I think the point is to narrow the problem down. I don't think this is an actionable JIRA either. > Spark Encoder with mysql Enum and data truncated Error > -- > > Key: SPARK-21230 > URL: https://issues.apache.org/jira/browse/SPARK-21230 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 > Environment: macosX >Reporter: Michael Kunkel > > I am using Spark via Java for a MYSQL/ML(machine learning) project. > In the mysql database, I have a column "status_change_type" of type enum = > {broke, fixed} in a table called "status_change" in a DB called "test". > I have an object StatusChangeDB that constructs the needed structure for the > table, however for the "status_change_type", I constructed it as a String. I > know the bytes from MYSQL enum to Java string are much different, but I am > using Spark, so the encoder does not recognize enums properly. However when I > try to set the value of the enum via a Java string, I receive the "data > truncated" error > h5. org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: > Data truncated for column 'status_change_type' at row 1 at > com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055) > I have tried to use enum for "status_change_type", however it fails with a > stack trace of > h5. 
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ... > h5. > I have tried to use the jdbc setting "jdbcCompliantTruncation=false" but this > does nothing as I get the same error of "data truncated" as first stated. > Here are my jdbc options map, in case I am using the > "jdbcCompliantTruncation=false" incorrectly. > public static Map jdbcOptions() { > Map jdbcOptions = new HashMap(); > jdbcOptions.put("url", > "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false"); > jdbcOp
[jira] [Resolved] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
[ https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21249. -- Resolution: Invalid I am resolving this per {{Type: Question}}. Questions should go to the mailing list. > Is it possible to use File Sink with mapGroupsWithState in Structured > Streaming? > > > Key: SPARK-21249 > URL: https://issues.apache.org/jira/browse/SPARK-21249 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Amit Baghel >Priority: Minor > > I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to > use File Sink with mapGroupsWithState? With append output mode I am getting > below exception. > Exception in thread "main" org.apache.spark.sql.AnalysisException: > mapGroupsWithState is not supported with Append output mode on a streaming > DataFrame/Dataset;; -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18199) Support appending to Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-18199. --- Resolution: Invalid I'm closing this as invalid. It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time. > Support appending to Parquet files > -- > > Key: SPARK-18199 > URL: https://issues.apache.org/jira/browse/SPARK-18199 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeremy Smith > > Currently, appending to a Parquet directory involves simply creating new > parquet files in the directory. With many small appends (for example, in a > streaming job with a short batch duration) this leads to an unbounded number > of small Parquet files accumulating. These must be cleaned up with some > frequency by removing them all and rewriting a new file containing all the > rows. > It would be far better if Spark supported appending to the Parquet files > themselves. HDFS supports this, as does Parquet: > * The Parquet footer can be read in order to obtain necessary metadata. > * The new rows can then be appended to the Parquet file as a row group. > * A new footer can then be appended containing the metadata and referencing > the new row groups as well as the previously existing row groups. > This would result in a small amount of bloat in the file as new row groups > are added (since duplicate metadata would accumulate) but it's hugely > preferable to accumulating small files, which is bad for HDFS health and also > eventually leads to Spark being unable to read the Parquet directory at all. > Periodic rewriting of the file could still be performed in order to remove > the duplicate metadata. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
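Until appending to an existing Parquet file is supported, the periodic cleanup the description refers to is usually a compaction job that rewrites the many small files as a few large ones. A minimal PySpark sketch of that workaround (the paths and file counts are invented for illustration):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

src = "hdfs:///data/events"            # directory accumulating small files
dst = "hdfs:///data/events_compacted"  # staging location for the rewrite

# Read all the small files, coalesce to a handful of partitions,
# and rewrite them as a few large Parquet files.
spark.read.parquet(src).coalesce(8).write.mode("overwrite").parquet(dst)

# After validating the output, swap dst into place of src (e.g. via an
# HDFS rename) so readers see only the compacted files.
{code}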
[jira] [Comment Edited] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069281#comment-16069281 ] Hyukjin Kwon edited comment on SPARK-21246 at 6/30/17 1:27 AM: --- Of course, it follows the schema an user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception as the schema is mismatched. I don't think at least the same schema is an issue here. I am resolving this. was (Author: hyukjin.kwon): Of course, it follows the schema an user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception as the schema is mismatched. I don't think at least the same schema is not an issue here. I am resolving this. > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069332#comment-16069332 ] Apache Spark commented on SPARK-21260: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/18473 > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21260) Remove the unused OutputFakerExec
Jiang Xingbo created SPARK-21260: Summary: Remove the unused OutputFakerExec Key: SPARK-21260 URL: https://issues.apache.org/jira/browse/SPARK-21260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Jiang Xingbo Priority: Minor OutputFakerExec was added long ago and is not used anywhere now so we should remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21260: Assignee: Apache Spark > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21260: Assignee: (was: Apache Spark) > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R
[ https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069330#comment-16069330 ] Hyukjin Kwon commented on SPARK-21224: -- Oh! I almost missed this comment. Sure, I will soon. Yes, I would like to open a new JIRA. Thank you [~felixcheung]. > Support a DDL-formatted string as schema in reading for R > - > > Key: SPARK-21224 > URL: https://issues.apache.org/jira/browse/SPARK-21224 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > This might have to be a followup for SPARK-20431 but I just decided to make > this separate for R specifically as many PRs might be confusing. > Please refer to the discussion in the PR and SPARK-20431. > In a simple view, this JIRA describes the support for a DDL-formatted string > as schema as below: > {code} > mockLines <- c("{\"name\":\"Michael\"}", >"{\"name\":\"Andy\", \"age\":30}", >"{\"name\":\"Justin\", \"age\":19}") > jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") > writeLines(mockLines, jsonPath) > df <- read.df(jsonPath, "json", "name STRING, age DOUBLE") > collect(df) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
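For comparison, the same DDL-formatted schema string can be passed from Python in recent Spark versions (2.3+), where {{DataFrameReader.schema}} accepts a string as well as a StructType. A rough sketch, with an invented file path:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema given as a DDL-formatted string instead of building a StructType.
df = spark.read.schema("name STRING, age DOUBLE").json("/tmp/people.json")
df.show()
{code}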
[jira] [Commented] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069340#comment-16069340 ] Apache Spark commented on SPARK-21235: -- User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/18474 > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21235: Assignee: (was: Apache Spark) > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21235: Assignee: Apache Spark > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20858: Assignee: (was: Apache Spark) > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20858: Assignee: Apache Spark > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Assignee: Apache Spark >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069387#comment-16069387 ] Apache Spark commented on SPARK-20858: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/18476 > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21253. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18472 [https://github.com/apache/spark/pull/18472] > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(Tra
[jira] [Created] (SPARK-21261) SparkSQL regexpExpressions example
zhangxin created SPARK-21261: Summary: SparkSQL regexpExpressions example Key: SPARK-21261 URL: https://issues.apache.org/jira/browse/SPARK-21261 Project: Spark Issue Type: Documentation Components: Examples Affects Versions: 2.1.1 Reporter: zhangxin The following execution results illustrate the escaping behavior (the SQL parser processes backslash escapes in string literals, so the backslash in the regex must itself be escaped, even inside a Scala triple-quoted string):
{code}
scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') """).show
+----------------------------------+
|regexp_replace(100-200, (d+), num)|
+----------------------------------+
|                           100-200|
+----------------------------------+

scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') """).show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
|                            num-num|
+-----------------------------------+
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org