[jira] [Created] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
guoxiaolongzte created SPARK-21250: -- Summary: Add a url in the table of 'Running Executors' in worker page to visit job page Key: SPARK-21250 URL: https://issues.apache.org/jira/browse/SPARK-21250 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.0 Reporter: guoxiaolongzte Priority: Minor Add a URL to the 'Running Executors' table on the worker page that links to the job page. When I click the 'Name' URL, the page jumps to the corresponding job page. This applies only to the 'Running Executors' table; in the 'Finished Executors' table the 'Name' URL does not exist, so clicking there does not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21251) Add Kafka consumer metrics
igor mazor created SPARK-21251: -- Summary: Add Kafka consumer metrics Key: SPARK-21251 URL: https://issues.apache.org/jira/browse/SPARK-21251 Project: Spark Issue Type: Request Components: DStreams Affects Versions: 2.1.1, 2.1.0, 2.0.1, 2.0.0 Reporter: igor mazor Adding detailed Kafka consumer metrics would help very much with debugging any issues related to the consumer. It is also helpful in general for monitoring purposes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21210) Javadoc 8 fixes for ML shared param traits
[ https://issues.apache.org/jira/browse/SPARK-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21210: - Assignee: Nick Pentreath > Javadoc 8 fixes for ML shared param traits > -- > > Key: SPARK-21210 > URL: https://issues.apache.org/jira/browse/SPARK-21210 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.1.1, 2.2.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.2.0 > > > [PR 15999|https://github.com/apache/spark/pull/15999] included fixes for doc > strings in the ML shared param traits ({{>}} and {{>=}} predominatly) - see > [from > here|https://github.com/apache/spark/pull/15999/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L32]. > However, the changes were made directly to the traits, while the changes > should have been made to {{SharedParamsCodeGen}}. So every time the code gen > is run (i.e. whenever a new shared param trait is to be added), the fixes > will be lost. > The changes also need to be made such that only the doc string is changed - > the param doc that gets printed from {{explainParams}} can remain unchanged. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21210) Javadoc 8 fixes for ML shared param traits
[ https://issues.apache.org/jira/browse/SPARK-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21210. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18420 [https://github.com/apache/spark/pull/18420] > Javadoc 8 fixes for ML shared param traits > -- > > Key: SPARK-21210 > URL: https://issues.apache.org/jira/browse/SPARK-21210 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.1.1, 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > Fix For: 2.2.0 > > > [PR 15999|https://github.com/apache/spark/pull/15999] included fixes for doc > strings in the ML shared param traits ({{>}} and {{>=}} predominatly) - see > [from > here|https://github.com/apache/spark/pull/15999/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L32]. > However, the changes were made directly to the traits, while the changes > should have been made to {{SharedParamsCodeGen}}. So every time the code gen > is run (i.e. whenever a new shared param trait is to be added), the fixes > will be lost. > The changes also need to be made such that only the doc string is changed - > the param doc that gets printed from {{explainParams}} can remain unchanged. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21252) The duration times showed by spark web UI are inaccurate
igor mazor created SPARK-21252: -- Summary: The duration times showed by spark web UI are inaccurate Key: SPARK-21252 URL: https://issues.apache.org/jira/browse/SPARK-21252 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.1, 2.1.0, 2.0.1, 2.0.0 Reporter: igor mazor The duration times shown by the Spark UI are inaccurate and seem to be rounded. For example, when a job had 2 stages, the first stage executed in 47 ms and the second stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. In another example, the first stage executed in 20 ms and the second stage in 4 seconds, and the total execution time shown by the UI was in that case also 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
[ https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21250: Assignee: (was: Apache Spark) > Add a url in the table of 'Running Executors' in worker page to visit job > page > --- > > Key: SPARK-21250 > URL: https://issues.apache.org/jira/browse/SPARK-21250 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Priority: Minor > > Add a url in the table of 'Running Executors' in worker page to visit job > page. > When I click URL of 'Name', the current page jumps to the job page. Of course > this is only in the table of 'Running Executors'. > This URL of 'Name' is in the table of 'Finished Executors' does not exist, > the click will not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
[ https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21250: Assignee: Apache Spark > Add a url in the table of 'Running Executors' in worker page to visit job > page > --- > > Key: SPARK-21250 > URL: https://issues.apache.org/jira/browse/SPARK-21250 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > Add a url in the table of 'Running Executors' in worker page to visit job > page. > When I click URL of 'Name', the current page jumps to the job page. Of course > this is only in the table of 'Running Executors'. > This URL of 'Name' is in the table of 'Finished Executors' does not exist, > the click will not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page
[ https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068028#comment-16068028 ] Apache Spark commented on SPARK-21250: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/18464 > Add a url in the table of 'Running Executors' in worker page to visit job > page > --- > > Key: SPARK-21250 > URL: https://issues.apache.org/jira/browse/SPARK-21250 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Priority: Minor > > Add a url in the table of 'Running Executors' in worker page to visit job > page. > When I click URL of 'Name', the current page jumps to the job page. Of course > this is only in the table of 'Running Executors'. > This URL of 'Name' is in the table of 'Finished Executors' does not exist, > the click will not jump to any page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21240: - Assignee: jin xing Issue Type: Improvement (was: Bug) > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > Fix For: 2.3.0 > > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain because only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT
[ https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21240. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18454 [https://github.com/apache/spark/pull/18454] > Fix code style for constructing and stopping a SparkContext in UT > - > > Key: SPARK-21240 > URL: https://issues.apache.org/jira/browse/SPARK-21240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Priority: Trivial > Fix For: 2.3.0 > > > Related to SPARK-20985. > Fix code style for constructing and stopping a SparkContext. Ensure the > context is stopped, so that other tests do not complain because only one > SparkContext can exist at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
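A minimal sketch of the test pattern this issue asks for, assuming a plain ScalaTest {{FunSuite}} (the suite name, master URL, and test body below are illustrative placeholders, not code from the pull request): the SparkContext is stopped in a {{finally}} block so a failing test cannot leak a running context into later suites.

{code:scala}
// Hedged sketch only -- not the actual tests touched by SPARK-21240.
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class ExampleSuite extends FunSuite {
  test("a test that needs its own SparkContext") {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("SPARK-21240-style"))
    try {
      assert(sc.parallelize(1 to 10).count() == 10)
    } finally {
      // Always stop the context, even if the assertion above fails,
      // so later suites do not hit "only one SparkContext may be running".
      sc.stop()
    }
  }
}
{code}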
[jira] [Resolved] (SPARK-21135) On history server page,duration of incompleted applications should be hidden instead of showing up as 0
[ https://issues.apache.org/jira/browse/SPARK-21135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21135. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18351 [https://github.com/apache/spark/pull/18351] > On history server page,duration of incompleted applications should be hidden > instead of showing up as 0 > --- > > Key: SPARK-21135 > URL: https://issues.apache.org/jira/browse/SPARK-21135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.2.1 >Reporter: Jinhua Fu >Priority: Minor > Fix For: 2.3.0 > > > On the history server page, the duration of incomplete applications should be hidden > instead of showing up as 0. > In addition, an application that aborts abnormally (for example, one killed in the > background or whose driver goes down) will always be treated as an > incomplete application, and I'm not sure whether this is a problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21135) On history server page,duration of incompleted applications should be hidden instead of showing up as 0
[ https://issues.apache.org/jira/browse/SPARK-21135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21135: - Assignee: Jinhua Fu > On history server page,duration of incompleted applications should be hidden > instead of showing up as 0 > --- > > Key: SPARK-21135 > URL: https://issues.apache.org/jira/browse/SPARK-21135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.2.1 >Reporter: Jinhua Fu >Assignee: Jinhua Fu >Priority: Minor > Fix For: 2.3.0 > > > On the history server page, the duration of incomplete applications should be hidden > instead of showing up as 0. > In addition, an application that aborts abnormally (for example, one killed in the > background or whose driver goes down) will always be treated as an > incomplete application, and I'm not sure whether this is a problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR
[ https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068053#comment-16068053 ] Apache Spark commented on SPARK-21093: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18465 > Multiple gapply execution occasionally failed in SparkR > > > Key: SPARK-21093 > URL: https://issues.apache.org/jira/browse/SPARK-21093 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.1, 2.2.0 > Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 2.3.0 > > > On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks > failed as below: > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d")) > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.debug.maxToStringFields' in SparkEnv.conf. > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > a b c d > 1 1 1 1 0.1 > > collect(gapply(df, "a", function(key, x) { x }, schema(df))) > Error in handleErrors(returnStatus, conn) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 > in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage > 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: > R computation failed with > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108) > at > org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432) > at > org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.a > ... 
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated > === Backtrace: = > /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597] > /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750] > /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507] > /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015] > /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e] > /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6] > /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4] > /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e] > /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529] > /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce] > /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e] > /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7] > /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1] > /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9] > /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138] > /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af] > /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101] > /u
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068069#comment-16068069 ] Sean Owen commented on SPARK-21252: --- They are rounded, and '3 seconds' is unlikely to be exactly 3 seconds, but maybe "3.49 seconds". This could explain the rounding you see, but, it's on purpose. The output is for humans and it's not worth cluttering the display with the extra digits. I don't think this is a problem. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
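A toy illustration of the rounding effect described in the comment above (this is not Spark's actual formatting code, and round-half-up is only an assumption here): a 47 ms stage plus a "3 second" stage that is really 3.49 s sums to about 3.54 s, which renders as "4 s" even though the individual stages display as "47 ms" and "3 s".

{code:scala}
// Toy example only -- not Spark's UIUtils. It just shows how per-second
// rounding can make 47 ms + "3 seconds" display as a 4-second total.
def formatDuration(ms: Long): String =
  if (ms < 1000) s"$ms ms"
  else s"${math.round(ms / 1000.0)} s" // assumes round-half-up for display

val stage1 = 47L   // milliseconds
val stage2 = 3490L // shown as "3 s", but 3.49 s underneath

println(formatDuration(stage1))          // 47 ms
println(formatDuration(stage2))          // 3 s
println(formatDuration(stage1 + stage2)) // 4 s  (3.537 s rounded up)
{code}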
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068082#comment-16068082 ] igor mazor commented on SPARK-21252: I still think that developers should be able to configure the precision they need. For example, an issue that I have: consuming from Kafka usually takes 40-50 ms. Sometimes there are network issues and the Kafka consumer is unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case; after that the consumer gets a request timeout, retries the request, and the second attempt usually succeeds. So eventually the UI shows that the stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing only 20 seconds in the UI gives no indication that there was a second attempt afterwards that took 40 ms, so it is quite hard to understand what exactly happened. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068082#comment-16068082 ] igor mazor edited comment on SPARK-21252 at 6/29/17 9:43 AM: - I still think that the developer should be able to configure by him self what precision he needs. For example an issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards seconds attend that took 40 ms and hence its quite hard to understand what exactly happens. was (Author: mazor.igal): I still think that the developer should be able to configure by him self what precision he needs. For example and issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards seconds attend that took 40 ms and hence its quite hard to understand what exactly happens. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068082#comment-16068082 ] igor mazor edited comment on SPARK-21252 at 6/29/17 9:45 AM: - I still think that the developer should be able to configure by him self what precision he needs. For example an issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards second request which took 40 ms and hence its quite hard to understand what exactly happens. was (Author: mazor.igal): I still think that the developer should be able to configure by him self what precision he needs. For example an issue that I have: Consuming from kafka usually take 40-50 ms, sometimes there are network issues and the kafka consumer unable to fetch data for a period of request.timeout.ms, which is 20 seconds in my case, after that the kafka consumer get request time out, it would retry the request again and usually the second attempt would succeed. So eventually the UI would show that that stage took 20 seconds, but in reality it took 20 seconds and 40 ms. Seeing in the UI that it took 20 seconds dont give any indication that there was afterwards seconds attend that took 40 ms and hence its quite hard to understand what exactly happens. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068085#comment-16068085 ] Sean Owen commented on SPARK-21252: --- You can drill into individual stage times, right? > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
[ https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068106#comment-16068106 ] Kaushal Prajapati commented on SPARK-19908: --- [~zhanzhang] can you plz share some example code for which you are getting this error? > Direct buffer memory OOM should not cause stage retries. > > > Key: SPARK-19908 > URL: https://issues.apache.org/jira/browse/SPARK-19908 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the > exception will be changed to FetchFailedException, causing stage retries. > org.apache.spark.shuffle.FetchFailedException: Direct buffer memory > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAll
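To make the behaviour described in this issue concrete, here is a hedged, self-contained sketch (it is not Spark's source; {{FetchFailed}} below is a stand-in class, since Spark's own {{FetchFailedException}} is private to the project): any {{Throwable}} raised while a reducer materializes a fetched block, including an {{OutOfMemoryError: Direct buffer memory}}, resurfaces as a fetch failure, which the scheduler answers with a stage retry instead of treating it as a memory problem.

{code:scala}
// Stand-in sketch, not Spark internals: wrapping every Throwable from a
// block fetch turns a direct-buffer OOM into a "fetch failed" signal,
// which is what triggers the unwanted stage retries.
final class FetchFailed(host: String, shuffleId: Int, cause: Throwable)
  extends Exception(s"Fetch failed for shuffle $shuffleId on $host", cause)

def materializeBlock(host: String, shuffleId: Int)(fetch: => Array[Byte]): Array[Byte] =
  try {
    fetch
  } catch {
    // java.lang.OutOfMemoryError is a Throwable, so it is caught here too
    // and reported as a fetch failure rather than as an executor OOM.
    case t: Throwable => throw new FetchFailed(host, shuffleId, t)
  }
{code}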
[jira] [Resolved] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.
[ https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-17689. - Resolution: Cannot Reproduce I am unable to reproduce this anymore, looks like this might be fixed by some other changes. > _temporary files breaks the Spark SQL streaming job. > > > Key: SPARK-17689 > URL: https://issues.apache.org/jira/browse/SPARK-17689 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma > > Steps to reproduce: > 1) Start a streaming job which reads from HDFS location hdfs://xyz/* > 2) Write content to hdfs://xyz/a > . > . > repeat a few times. > And then job breaks as follows. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in > stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage > 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not > exist: hdfs://localhost:9000/input/t5/_temporary > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309) > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
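A hedged sketch of the reproduction steps described above (the paths, output sink, and session settings are illustrative placeholders): a file-source streaming query watching an HDFS glob while files are written under that location from outside. If a writer leaves a {{_temporary}} directory under the watched path while the source lists files, the listing can race with the writer and fail with the {{FileNotFoundException}} shown in the stack trace.

{code:scala}
// Sketch only -- paths and sink come from the description above, not from
// a verified reproduction script.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-17689-temporary-files")
  .getOrCreate()

// 1) Start a streaming job which reads from the HDFS location hdfs://xyz/*
val lines = spark.readStream.text("hdfs://xyz/*")

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

// 2) Concurrently write content to hdfs://xyz/a (for example with a batch
//    job that creates a _temporary directory), repeat a few times, and the
//    query fails as shown in the stack trace above.
query.awaitTermination()
{code}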
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068147#comment-16068147 ] igor mazor commented on SPARK-21252: Yes, it is indeed possible to drill into individual stages; however, the scenario I described in my last comment happens within a single stage. The first attempt, followed by a timeout after 20 seconds, and then the second, successful attempt are all in the same stage, so the UI shows a duration of 20 seconds, although that is not accurate. Also, if a stage really took 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding, which also makes the presentation of results inconsistent. If rounding is already applied, why is 40 ms not rounded to 0 seconds? I also noticed, for example, a stage that took 42 ms (Job Duration): when I click on that stage I see Duration = 38 ms, and when going deeper into the Summary Metrics for Tasks, I see that the longest task was 35 ms, so I am not sure where all the gaps went. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068147#comment-16068147 ] igor mazor edited comment on SPARK-21252 at 6/29/17 10:42 AM: -- Yes, its indeed possible to drill to individual stages, however the scenario I described on my last comment is in the same stage. The first attempt following by timeout after 20 seconds and then the second successful attempt are all in the same stage and therefore the UI shows 20 seconds duration, although thats not true. Also, if the stage was indeed 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding which also causing for inconsistent result presentation. If you already do the rounding, why then the 40 ms is not rounded to 0 seconds ? Also I noticed now that for example stage that took 42 ms (Job Duration), when I click on that stage I see Duration = 38 ms and when going deeper into the Summary Metrics for Tasks, I see that the longest task was 35 ms, so not sure where to all the gaps went. was (Author: mazor.igal): Yes, its indeed possible to drill to individual stages, however the scenario I described on my last comment is happens in the same stage. The first attempt following by timeout after 20 seconds and then the second successful attempt are all in the same stage and therefore the UI shows 20 seconds duration, although thats not true. Also, if the stage was indeed 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding which also causing for inconsistent result presentation. If you already do the rounding, why then the 40 ms is not rounded to 0 seconds ? Also I noticed now that for example stage that took 42 ms (Job Duration), when I click on that stage I see Duration = 38 ms and when going deeper into the Summary Metrics for Tasks, I see that the longest task was 35 ms, so not sure where to all the gaps went. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times showed by spark UI are inaccurate and seems to be rounded. > For example when a job had 2 stages, first stage executed in 47 ms and second > stage in 3 seconds, the total execution time showed by the UI is 4 seconds. > Another example, first stage was executed in 20 ms and second stage in 4 > seconds, the total execution time showed by the UI would be in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21253) Cannot fetch big blocks to disk
Yuming Wang created SPARK-21253: --- Summary: Cannot fetch big blocks to disk Key: SPARK-21253 URL: https://issues.apache.org/jira/browse/SPARK-21253 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Yuming Wang Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. Need a shuffe: {code:actionscript} scala> val count = sc.parallelize(0 until 300, 10).repartition(2001).collect().length {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
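For reference, the same reproduction as a hedged standalone sketch rather than a spark-shell session (the object and application names are placeholders; the configuration key and value are the ones quoted above): the shuffle-to-disk threshold is lowered so that reducers fetch remote blocks to disk.

{code:scala}
// Sketch of the reproduction above as an application instead of spark-shell.
// The config key/value and the parallelize/repartition line come from the
// issue description; everything else is illustrative. Submit it against a
// cluster -- the report says local mode does not reproduce the failure.
import org.apache.spark.{SparkConf, SparkContext}

object Spark21253Repro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SPARK-21253-repro")
      .set("spark.reducer.maxReqSizeShuffleToMem", "1K") // force fetch-to-disk
    val sc = new SparkContext(conf)
    // A wide repartition so reducers must fetch many remote shuffle blocks.
    val count = sc.parallelize(0 until 300, 10).repartition(2001).collect().length
    println(s"count = $count")
    sc.stop()
  }
}
{code}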
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068169#comment-16068169 ] Apache Spark commented on SPARK-21253: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18466 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused 
by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.O
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21253: Assignee: (was: Apache Spark) > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) > at > org.apache.spark.network.client.T
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21253: Assignee: Apache Spark > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Apache Spark > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) > at > org.apac
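For reference, a minimal sketch of the same reproduction driven from an application instead of spark-shell. The SparkSession builder usage and the "1K" byte-string value are assumptions; the config key and the repartition job are taken from the report above, and as the reporter notes it only reproduces on a real cluster, not in local mode.

{code:scala}
// Sketch only: programmatic version of the spark-shell reproduction above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-21253-repro")
  // Force even tiny shuffle blocks to be streamed to disk instead of memory.
  .config("spark.reducer.maxReqSizeShuffleToMem", "1K")
  .getOrCreate()

// A wide repartition forces a shuffle with many small blocks per reducer.
val count = spark.sparkContext.parallelize(0 until 300, 10)
  .repartition(2001)
  .count()

println(s"count = $count")
{code}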
[jira] [Updated] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21252: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I don't know what the underlying exact values are; those are likely available from the API. That would help figure out if there's an actual rounding problem. If not, I don't think this would be changed. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor >Priority: Minor > > The duration times shown by the Spark UI are inaccurate and seem to be rounded. > For example, when a job had 2 stages, the first stage executed in 47 ms and the second > stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. > In another example, the first stage executed in 20 ms and the second stage in 4 > seconds; the total execution time shown by the UI was in that case also > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
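To check whether this is a rounding artifact or a real discrepancy, the per-stage timings can be read from the monitoring REST API, which exposes the underlying millisecond-precision values that the UI summarizes. A minimal sketch; the application id, host, and port are placeholders, and the exact JSON field names should be checked against the version in use.

{code:scala}
// Sketch only: pull the raw per-stage data from the REST API of a running
// application and inspect the timestamps it reports.
import scala.io.Source

val appId = "app-20170629123456-0001"   // hypothetical application id
val url   = s"http://localhost:4040/api/v1/applications/$appId/stages"

// The endpoint returns one JSON object per stage attempt; its timestamp
// fields can show whether the UI is rounding or mis-adding durations.
val json = Source.fromURL(url).mkString
println(json)
{code}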
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Attachment: ui-thread-dump-jqhadoop221-154.gif > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 
1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFet
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068232#comment-16068232 ] Yuming Wang commented on SPARK-21253: - It may be hang for a {{spark-sql}} application also: !ui-thread-dump-jqhadoop221-154.gif! > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. Need a shuffe: > {code:actionscript} > scala> val count = sc.parallelize(0 until 300, > 10).repartition(2001).collect().length > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) >
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Description: Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. Need a shuffe: {code:actionscript} scala> sc.parallelize(0 until 300, 10).repartition(2001).count() {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) 
at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.ti
[jira] [Resolved] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21225. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18435 [https://github.com/apache/spark/pull/18435] > decrease the Mem using for variable 'tasks' in function resourceOffers > -- > > Key: SPARK-21225 > URL: https://issues.apache.org/jira/browse/SPARK-21225 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: yangZhiguo >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > In the function 'resourceOffers', a variable 'tasks' is declared to > store the tasks that have been allocated an executor. It is declared like this: > *{color:#d04437}val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](o.cores)){color}* > However, this code only considers the situation of one task per core. > If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so > much space. It could be modified as follows: > {color:#14892c}*val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21225: --- Assignee: yangZhiguo > decrease the Mem using for variable 'tasks' in function resourceOffers > -- > > Key: SPARK-21225 > URL: https://issues.apache.org/jira/browse/SPARK-21225 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: yangZhiguo >Assignee: yangZhiguo >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > In the function 'resourceOffers', a variable 'tasks' is declared to > store the tasks that have been allocated an executor. It is declared like this: > *{color:#d04437}val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](o.cores)){color}* > However, this code only considers the situation of one task per core. > If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so > much space. It could be modified as follows: > {color:#14892c}*val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
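To make the saving concrete, a worked sketch of the two capacity hints for an assumed offer of 8 cores with spark.task.cpus = 2; the numbers are illustrative, not taken from the issue.

{code:scala}
// Sketch only: compares the old and proposed initial-capacity hints for the
// per-offer ArrayBuffer[TaskDescription] described above.
val cores       = 8   // o.cores in the snippet above (assumed value)
val cpusPerTask = 2   // CPUS_PER_TASK, i.e. spark.task.cpus (assumed value)

val oldCapacity = cores                                       // 8 slots
val newCapacity = math.ceil(cores * 1.0 / cpusPerTask).toInt  // 4 slots

// At most cores / cpusPerTask tasks can be scheduled on this offer, so the
// smaller initial capacity is sufficient and avoids over-allocating the
// backing array.
println(s"old = $oldCapacity, new = $newCapacity")
{code}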
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Description: Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. A shuffle: {code:actionscript} scala> sc.parallelize(0 until 300, 10).repartition(2001).count() {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at 
io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.timeou
[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
[ https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068301#comment-16068301 ] Steve Loughran commented on SPARK-12868: It's actually not the cause of that, merely the messenger. Cause is HADOOP-14383: a combination of spark 2.2 & Hadoop 2.9+ will trigger the problem. Fix belongs in Hadoop. > ADD JAR via sparkSQL JDBC will fail when using a HDFS URL > - > > Key: SPARK-12868 > URL: https://issues.apache.org/jira/browse/SPARK-12868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Trystan Leftwich >Assignee: Weiqing Yang > Fix For: 2.2.0 > > > When trying to add a jar with a HDFS URI, i.E > {code:sql} > ADD JAR hdfs:///tmp/foo.jar > {code} > Via the spark sql JDBC interface it will fail with: > {code:sql} > java.net.MalformedURLException: unknown protocol: hdfs > at java.net.URL.(URL.java:593) > at java.net.URL.(URL.java:483) > at java.net.URL.(URL.java:432) > at java.net.URI.toURL(URI.java:1089) > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578) > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA 
(v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
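For context, the usual JVM-level mechanism that makes {{hdfs://}} parseable by {{java.net.URL}} is Hadoop's stream handler factory. A hedged sketch of that mechanism only; whether it interacts with the Hadoop-side change Steve references (HADOOP-14383), or is appropriate in the Thrift server code path above, is not established here.

{code:scala}
// Sketch only: registering Hadoop's URL stream handler factory so that
// java.net.URL stops throwing "unknown protocol: hdfs".
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// setURLStreamHandlerFactory may only be called once per JVM, so guard it
// appropriately in real code.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()))

// After registration this constructor no longer fails on the hdfs scheme.
val jarUrl = new URL("hdfs:///tmp/foo.jar")
println(jarUrl.getProtocol)
{code}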
[jira] [Assigned] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21052: --- Assignee: Liang-Chi Hsieh > Add hash map metrics to join > > > Key: SPARK-21052 > URL: https://issues.apache.org/jira/browse/SPARK-21052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We should add avg hash map probe metric to join operator and report it on UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21052. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18301 [https://github.com/apache/spark/pull/18301] > Add hash map metrics to join > > > Key: SPARK-21052 > URL: https://issues.apache.org/jira/browse/SPARK-21052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We should add avg hash map probe metric to join operator and report it on UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Attachment: screenshot-1.png > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time, which in itself is > not a big issue and only causes a small latency in case of 10k+ rows returned > from the server. The problem is that UI spends most of the time processing > the results even according to chrome devtools (newtwork IO is taking less > than 1s): > !attachment-name.jpg|thumbnail! > In case of larger amount of rows returned (10k+) this time grows > dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE > freezes the process. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21254) History UI: Taking over 1 minute for initial page display
Dmitry Parfenchik created SPARK-21254: - Summary: History UI: Taking over 1 minute for initial page display Key: SPARK-21254 URL: https://issues.apache.org/jira/browse/SPARK-21254 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: Dmitry Parfenchik Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that the UI spends most of the time processing the results, even according to Chrome devtools (network IO takes less than 1s): !attachment-name.jpg|thumbnail! With a larger number of rows returned (10k+) this time grows dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Description: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that UI spends most of the time processing the results even according to chrome devtools (newtwork IO is taking less than 1s): !screenshot-1.png! In case of larger amount of rows returned (10k+) this time grows dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE freezes the process. was: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that UI spends most of the time processing the results even according to chrome devtools (newtwork IO is taking less than 1s): !attachment-name.jpg|thumbnail! In case of larger amount of rows returned (10k+) this time grows dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE freezes the process. > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time, which in itself is > not a big issue and only causes a small latency in case of 10k+ rows returned > from the server. The problem is that UI spends most of the time processing > the results even according to chrome devtools (newtwork IO is taking less > than 1s): > !screenshot-1.png! > In case of larger amount of rows returned (10k+) this time grows > dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE > freezes the process. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Description: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time. On large amount of rows returned (10k+) page load time grows dramatically, causing 1min+ delay in Chrome and freezing the process in Firefox, Safari and IE. A simple inspection in Chrome shows that network is not an issue here and only causes a small latency (<1s) while most of the time is spend in UI processing the results even according to chrome devtools: !screenshot-1.png! was: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the begging of the time, which in itself is not a big issue and only causes a small latency in case of 10k+ rows returned from the server. The problem is that UI spends most of the time processing the results even according to chrome devtools (newtwork IO is taking less than 1s): !screenshot-1.png! In case of larger amount of rows returned (10k+) this time grows dramatically, causing 1min+ page load in Chrome and in Firefox, Safari and IE freezes the process. > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time. On large amount of > rows returned (10k+) page load time grows dramatically, causing 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that network is not an issue here and > only causes a small latency (<1s) while most of the time is spend in UI > processing the results even according to chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Priority: Minor (was: Major) > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Priority: Minor > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time. On large amount of > rows returned (10k+) page load time grows dramatically, causing 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that network is not an issue here and > only causes a small latency (<1s) while most of the time is spend in UI > processing the results even according to chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
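As a workaround sketch while the summary page itself is slow: the same listing is available as JSON from the history server REST API, which skips the client-side table rendering that dominates the load time above. The host, port, and the {{limit}} query parameter are assumptions to check against the deployed version; newer releases also have a {{spark.history.ui.maxApplications}} setting that caps how many applications the summary page renders.

{code:scala}
// Sketch only: fetch a bounded application list as JSON instead of rendering
// the full history table in the browser.
import scala.io.Source

val historyServer = "http://localhost:18080"   // placeholder host/port
val url = s"$historyServer/api/v1/applications?limit=100"

val json = Source.fromURL(url).mkString
println(json.take(2000))   // print only the beginning of the payload
{code}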
[jira] [Created] (SPARK-21255) NPE when creating encoder for enum
Mike created SPARK-21255: Summary: NPE when creating encoder for enum Key: SPARK-21255 URL: https://issues.apache.org/jira/browse/SPARK-21255 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.0 Environment: org.apache.spark:spark-core_2.10:2.1.0 org.apache.spark:spark-sql_2.10:2.1.0 Reporter: Mike When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference:126 following code {code:scala} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
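A minimal reproduction sketch of the failure path described in the report, using a JDK enum as the enum-typed bean property. The bean class and field names are made up for illustration; the expected NPE is taken from the report, not verified here.

{code:scala}
// Sketch only: a bean with an enum-typed property passed to Encoders.bean.
import java.util.concurrent.TimeUnit
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

class EventBean extends Serializable {
  @BeanProperty var name: String = _
  @BeanProperty var unit: TimeUnit = _   // enum-typed property triggers the issue
}

// On affected versions this is reported to fail with a NullPointerException
// from the recursive inspection of the enum's "declaringClass" property,
// as analysed above.
val encoder = Encoders.bean(classOf[EventBean])
println(encoder.schema)
{code}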
[jira] [Updated] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike updated SPARK-21255: - Description: When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference:126 following code {code:java} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I think adding property name "declaringClass" to filtering will resolve this. was: When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference:126 following code {code:scala} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I think adding property name "declaringClass" to filtering will resolve this. > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for Enum type (or bean with enum property) > via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out, that in JavaTypeInference:126 > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. 
Eventually we > try to inspect ClassLoader class, which has property "defaultAssertionStatus" > with no read method, which leads to NPE at TypeToken:495. > I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068520#comment-16068520 ] Sean Owen commented on SPARK-21255: --- Is the change to omit declaringClass too? > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for Enum type (or bean with enum property) > via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out, that in JavaTypeInference:126 > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. Eventually we > try to inspect ClassLoader class, which has property "defaultAssertionStatus" > with no read method, which leads to NPE at TypeToken:495. > I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21254. --- Resolution: Duplicate Duplicate of lots of JIRAs related to making the initial read faster > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Priority: Minor > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the begging of the time. On large amount of > rows returned (10k+) page load time grows dramatically, causing 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that network is not an issue here and > only causes a small latency (<1s) while most of the time is spend in UI > processing the results even according to chrome devtools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068553#comment-16068553 ] Monica Raj commented on SPARK-21246: Thanks for your response. I also tried with Seq(3) as Seq(3L), however I had changed this back during the course of trying other options. I should also mention that we are running Zeppelin 0.6.0. I tried running the code you provided and still got the following output: import org.apache.spark.sql.types._ import org.apache.spark.sql.Row schemaString: String = name lstVals: Seq[Long] = List(3) rowRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[30] at map at :59 res20: Array[org.apache.spark.sql.Row] = Array([3]) fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,LongType,true)) schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,LongType,true)) StructType(StructField(name,LongType,true))peopleDF: org.apache.spark.sql.DataFrame = [name: bigint] ++ |name| ++ | 3| ++ > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
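The {{bigint}} in the schema output above is simply Spark SQL's name for {{LongType}}, so the output shows the expected rendering rather than a type conversion. A small sketch illustrating this:

{code:scala}
// Sketch only: LongType renders as "bigint" in Spark SQL type strings.
import org.apache.spark.sql.types.{LongType, StructField, StructType}

println(LongType.simpleString)                                   // "bigint"
println(StructType(Seq(StructField("name", LongType))).simpleString)
// "struct<name:bigint>"
{code}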
[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068700#comment-16068700 ] ARUNA KIRAN NULU commented on SPARK-4131: - Is this feature available? > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: Fei Wang >Priority: Critical > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Spark SQL does not support writing data into the filesystem from queries, > e.g.: > {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
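A commonly used alternative while the quoted SQL syntax is unsupported is to run the query through the DataFrame API and write the result out explicitly. A sketch only: the table name and output path are taken from the example above, the format choice is an assumption, and unlike LOCAL DIRECTORY this writes through whatever filesystem the path resolves to on the executors.

{code:scala}
// Sketch only: write query results to a directory via the DataFrame API.
val result = sqlContext.sql("SELECT * FROM page_views")

result.write
  .mode("overwrite")
  .format("csv")                         // or json/parquet as needed
  .save("file:///data1/wangxj/sql_spark")
{code}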
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-21253: - Target Version/s: 2.2.0 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-21253: Assignee: Shixiong Zhu > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFe
[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation
[ https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20783: - Summary: Enhance ColumnVector to support compressed representation (was: Enhance ColumnVector to keep UnsafeArrayData for array) > Enhance ColumnVector to support compressed representation > - > > Key: SPARK-20783 > URL: https://issues.apache.org/jira/browse/SPARK-20783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnVector}} accepts only primitive-type Java array as an input > for array. It is good to keep data from Parquet. > On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used > to represent array, map, and struct. To keep these data, this JIRA entry > enhances {{ColumnVector}} to keep UnsafeArrayData. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation
[ https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20783: - Description: Current {{ColumnVector}} handles uncompressed data for parquet. For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to have compressed data. As first step of this implementation, this JIRA supports primitive data and string types. was: Current {{ColumnVector}} accepts only primitive-type Java array as an input for array. It is good to keep data from Parquet. On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used to represent array, map, and struct. To keep these data, this JIRA entry enhances {{ColumnVector}} to keep UnsafeArrayData. > Enhance ColumnVector to support compressed representation > - > > Key: SPARK-20783 > URL: https://issues.apache.org/jira/browse/SPARK-20783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnVector}} handles uncompressed data for parquet. > For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to > have compressed data. > As first step of this implementation, this JIRA supports primitive data and > string types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068793#comment-16068793 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18467 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(Transpor
[jira] [Commented] (SPARK-20873) Improve the error message for unsupported Column Type
[ https://issues.apache.org/jira/browse/SPARK-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068800#comment-16068800 ] Apache Spark commented on SPARK-20873: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18468 > Improve the error message for unsupported Column Type > - > > Key: SPARK-20873 > URL: https://issues.apache.org/jira/browse/SPARK-20873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ruben Janssen >Assignee: Ruben Janssen > Fix For: 2.3.0 > > > For unsupported column type, we simply output the column type instead of the > type name. > {noformat} > java.lang.Exception: Unsupported type: > org.apache.spark.sql.execution.columnar.ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d > {noformat} > We should improve it by outputting its name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21256) Add WithSQLConf to Catalyst
Xiao Li created SPARK-21256: --- Summary: Add WithSQLConf to Catalyst Key: SPARK-21256 URL: https://issues.apache.org/jira/browse/SPARK-21256 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
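For reference, a minimal, self-contained sketch of what a {{withSQLConf}} helper does (the actual helper added to Catalyst presumably works against {{SQLConf}} directly, since Catalyst tests have no {{SparkSession}}; the version below takes a session only to keep the example runnable):

{code}
import org.apache.spark.sql.SparkSession

// Temporarily override SQL conf entries for the duration of `f`, then restore
// the previous values (or unset keys that were not set before).
def withSQLConf(spark: SparkSession)(pairs: (String, String)*)(f: => Unit): Unit = {
  val conf = spark.conf
  val previous = pairs.map { case (key, _) => key -> conf.getOption(key) }
  pairs.foreach { case (key, value) => conf.set(key, value) }
  try f finally {
    previous.foreach {
      case (key, Some(old)) => conf.set(key, old)
      case (key, None)      => conf.unset(key)
    }
  }
}
{code}

Typical use in a test is {{withSQLConf(spark)("spark.sql.caseSensitive" -> "true") { ... }}}, so assertions inside the block see the overridden value without leaking it into later tests.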
[jira] [Updated] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21256: Summary: Add WithSQLConf to Catalyst Test (was: Add WithSQLConf to Catalyst) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068832#comment-16068832 ] Apache Spark commented on SPARK-21256: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18469 > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21256: Assignee: Xiao Li (was: Apache Spark) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21256: Assignee: Apache Spark (was: Xiao Li) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21257) LDA : create an Evaluator to enable cross validation
Mathieu DESPRIEE created SPARK-21257: Summary: LDA : create an Evaluator to enable cross validation Key: SPARK-21257 URL: https://issues.apache.org/jira/browse/SPARK-21257 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.1.1 Reporter: Mathieu DESPRIEE I suggest the creation of an LDAEvaluator to use with CrossValidator, using logPerplexity as a metric for evaluation. Unfortunately, the computation of perplexity needs to access some internal data of the model, and the current implementation of CrossValidator does not pass the model being evaluated to the Evaluator. A way could be to change the Evaluator.evaluate() method to pass the model along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
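For concreteness, a rough sketch of such an evaluator is below. It sidesteps the CrossValidator limitation described above by binding the evaluator to an already-fitted model; the class name and that workaround are illustrative assumptions, not a proposed final API.

{code}
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset

// Hypothetical evaluator bound to a fitted model, since CrossValidator does not
// hand the model under evaluation to the Evaluator.
class LDAPerplexityEvaluator(val model: LDAModel, override val uid: String)
  extends Evaluator {

  def this(model: LDAModel) = this(model, Identifiable.randomUID("ldaPerplexityEval"))

  // logPerplexity is an upper bound on perplexity; lower is better.
  override def evaluate(dataset: Dataset[_]): Double = model.logPerplexity(dataset)

  override def isLargerBetter: Boolean = false

  override def copy(extra: ParamMap): LDAPerplexityEvaluator =
    new LDAPerplexityEvaluator(model, uid)
}
{code}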
[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation
[ https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu DESPRIEE updated SPARK-21257: - Issue Type: New Feature (was: Improvement) > LDA : create an Evaluator to enable cross validation > > > Key: SPARK-21257 > URL: https://issues.apache.org/jira/browse/SPARK-21257 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Mathieu DESPRIEE > > I suggest the creation of an LDAEvaluator to use with CrossValidator, using > logPerplexity as a metric for evaluation. > Unfortunately, the computation of perplexity needs to access some internal > data of the model, and the current implementation of CrossValidator does not > pass the model being evaluated to the Evaluator. > A way could be to change the Evaluator.evaluate() method to pass the model > along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation
[ https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu DESPRIEE updated SPARK-21257: - Issue Type: Improvement (was: New Feature) > LDA : create an Evaluator to enable cross validation > > > Key: SPARK-21257 > URL: https://issues.apache.org/jira/browse/SPARK-21257 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Mathieu DESPRIEE > > I suggest the creation of an LDAEvaluator to use with CrossValidator, using > logPerplexity as a metric for evaluation. > Unfortunately, the computation of perplexity needs to access some internal > data of the model, and the current implementation of CrossValidator does not > pass the model being evaluated to the Evaluator. > A way could be to change the Evaluator.evaluate() method to pass the model > along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068876#comment-16068876 ] Shixiong Zhu commented on SPARK-21253: -- [~q79969786] did you run Spark 2.2.0-rcX on Yarn which has a Spark 2.1.* shuffle service? > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.ja
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068980#comment-16068980 ] Cody Koeninger commented on SPARK-18057: Kafka 0.11 is now released. Are we upgrading spark artifacts named kafka-0-10 to use kafka 0.11, or are we renaming them to kafka-0-11? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068988#comment-16068988 ] Asher Krim edited comment on SPARK-20797 at 6/29/17 9:19 PM: - This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294 was (Author: akrim): This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068988#comment-16068988 ] Asher Krim commented on SPARK-20797: This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. > word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
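For reference, a minimal sketch of the repartitioning suggested in the description above, sized the same way Word2Vec's save is; {{k}} and {{vocabSize}} stand for the number of topics and the vocabulary size, and the 64m buffer default is an assumption for illustration:

{code}
// Estimate the serialized size of the topic matrix (vocabSize x k doubles) and
// spread it over enough partitions to stay under the kryo buffer limit,
// instead of the current repartition(1).
val approxSizeBytes = 8L * k * vocabSize
val bufferSizeBytes = 64L * 1024 * 1024  // assumed spark.kryoserializer.buffer.max
val nPartitions = ((approxSizeBytes / bufferSizeBytes) + 1).toInt

spark.createDataFrame(topics)
  .repartition(nPartitions)
  .write.parquet(Loader.dataPath(path))
{code}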
[jira] [Created] (SPARK-21258) Window result incorrect using complex object with spilling
Herman van Hovell created SPARK-21258: - Summary: Window result incorrect using complex object with spilling Key: SPARK-21258 URL: https://issues.apache.org/jira/browse/SPARK-21258 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Assignee: Herman van Hovell -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21258: Assignee: Apache Spark (was: Herman van Hovell) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21258: Assignee: Herman van Hovell (was: Apache Spark) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069084#comment-16069084 ] Apache Spark commented on SPARK-21258: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/18470 > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069109#comment-16069109 ] Yuming Wang commented on SPARK-21253: - I checked it, all jars are latest 2.2.0-rcX. > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at > org.apache.spark.network
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069146#comment-16069146 ] Mike commented on SPARK-21255: -- Yes, but there may be other problems (at least I cannot guarantee there won't be), since it seems like enums were never used before. > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for Enum type (or bean with enum property) > via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out, that in JavaTypeInference:126 > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. Eventually we > try to inspect ClassLoader class, which has property "defaultAssertionStatus" > with no read method, which leads to NPE at TypeToken:495. > I think adding property name "declaringClass" to filtering will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
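For clarity, the one-line change proposed in the description would look roughly like this in {{JavaTypeInference}} (a sketch of the proposal, not necessarily the fix that gets merged):

{code}
val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
// Skip "class" as before, and also the enum-specific "declaringClass" property,
// so inference does not recurse into java.lang.Class and hit the NPE.
val properties = beanInfo.getPropertyDescriptors
  .filterNot(p => p.getName == "class" || p.getName == "declaringClass")
{code}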
[jira] [Resolved] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method
[ https://issues.apache.org/jira/browse/SPARK-21188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21188. -- Resolution: Fixed Assignee: Feng Liu Fix Version/s: 2.3.0 > releaseAllLocksForTask should synchronize the whole method > -- > > Key: SPARK-21188 > URL: https://issues.apache.org/jira/browse/SPARK-21188 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Feng Liu >Assignee: Feng Liu > Fix For: 2.3.0 > > > Since the objects readLocksByTask, writeLocksByTask and infos are coupled and > expected to be modified by other threads concurrently, all the reads and > writes of them in the releaseAllLocksForTask method should be protected by a > single synchronized block. The fine-grained synchronization in the current > code can cause some test flakiness. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
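In code terms, the fix amounts to holding the manager's monitor for the entire method rather than for several smaller sections. A sketch with the body elided and the signature approximated:

{code}
import org.apache.spark.storage.BlockId

// One lock around everything that touches readLocksByTask, writeLocksByTask and
// infos, so no other thread can interleave between the individual steps.
def releaseAllLocksForTask(taskAttemptId: Long): Seq[BlockId] = synchronized {
  // ... release this task's read and write locks and update `infos`,
  // all while holding the lock ...
  Seq.empty[BlockId]
}
{code}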
[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-20690: Summary: Subqueries in FROM should have alias names (was: Analyzer shouldn't add missing attributes through subquery) > Subqueries in FROM should have alias names > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21259) More rules for scalastyle
Gengliang Wang created SPARK-21259: -- Summary: More rules for scalastyle Key: SPARK-21259 URL: https://issues.apache.org/jira/browse/SPARK-21259 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Gengliang Wang Priority: Minor During code review, we spent so much time on code style issues. It would be great if we add rules: 1) disallow space before colon 2) disallow space before right parentheses 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
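As a quick illustration of the three rules (hypothetical snippets, not taken from the Spark codebase):

{code}
// Would be flagged by the proposed rules:
def parseBad(s : String): Int = s.trim.toInt   // space before ':'
val bad = parseBad( "42" )                     // space after '(' and space before ')'

// Accepted form:
def parseGood(s: String): Int = s.trim.toInt
val good = parseGood("42")
{code}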
[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-20690: Description: We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: {code} select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 {code} This query works in current codebase. However, the outside where clause shouldn't be able to refer t1.c1 attribute. The root cause is we allow subqueries in FROM have no alias names previously, it is confusing and isn't supported by various databases such as MySQL, Postgres, Oracle. We shouldn't support it too. was: We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: {code} select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 {code} This query works in current codebase. However, the outside where clause shouldn't be able to refer t1.c1 attribute. > Subqueries in FROM should have alias names > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute. > The root cause is we allow subqueries in FROM have no alias names previously, > it is confusing and isn't supported by various databases such as MySQL, > Postgres, Oracle. We shouldn't support it too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
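For illustration, the same query written with an explicit alias on the FROM subquery, which is the form the change requires ({{spark}} is assumed to be a SparkSession; the table and column come from the example above):

{code}
// The subquery must expose the column the outer query filters on and must be
// given an alias; the outer WHERE then refers to that alias, not to t1 inside.
spark.sql("SELECT 1 FROM (SELECT c1 FROM onerow LIMIT 1) t2 WHERE t2.c1 = 1")
{code}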
[jira] [Commented] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069208#comment-16069208 ] Apache Spark commented on SPARK-21259: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/18471 > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Minor > > During code review, we spent so much time on code style issues. > It would be great if we add rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21259: Assignee: Apache Spark > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > During code review, we spent so much time on code style issues. > It would be great if we add rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21259: Assignee: (was: Apache Spark) > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Minor > > During code review, we spent so much time on code style issues. > It would be great if we add rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069227#comment-16069227 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18472 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(Transpor
[jira] [Closed] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
[ https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K closed SPARK-21148. - Resolution: Duplicate > Set SparkUncaughtExceptionHandler to the Master > --- > > Key: SPARK-21148 > URL: https://issues.apache.org/jira/browse/SPARK-21148 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > If any thread of the Master hits an uncaught exception, that thread terminates > and the Master process keeps running without functioning properly. > I think we need to handle the uncaught exception and exit the Master > gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
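For context, the improvement boils down to installing Spark's default uncaught-exception handler early in the Master's startup, as executors already do. A sketch, assuming the Spark-internal {{SparkUncaughtExceptionHandler}} object (whether the handler should also exit the process is the open question):

{code}
import org.apache.spark.util.SparkUncaughtExceptionHandler

// Process-wide handler: an uncaught exception in any Master thread is logged and
// handled instead of silently killing that thread while a half-broken Master
// keeps running.
Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
{code}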
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069269#comment-16069269 ] Leif Walsh commented on SPARK-21190: I agree with [~icexelloss] that we should aim to provide an API which provides "logical" groups of data to the UDF rather than the implementation detail of providing partitions wholesale. Setting aside for the moment the problem with dataset skew which could cause one group to be very large, let's look at some use cases. One obvious use case that tools like dplyr and pandas support is {{df.groupby(...).aggregate(...)}}. Here, we group on some key and apply a function to each logical group. This can be used to e.g. demean each group w.r.t. its cohort. Another use case that we care about with [Flint|https://github.com/twosigma/flint] is aggregating over a window. In pandas terminology this is the {{rolling}} operator. One might want to, for each row, perform a moving window average or rolling regression over a history of some size. The windowed aggregation poses a performance question that the groupby case doesn't: namely, if we naively send each window to the python worker independently, we're transferring a lot of duplicate data since each overlapped window contains many of the same rows. An option here is to transfer the entire partition on the backend and then instruct the python worker to call the UDF with slices of the whole dataset according to the windowing requested by the user. I think the idea of presenting a whole partition in a pandas dataframe to a UDF is a bit off-track. If someone really wants to apply a python function to the "whole" dataset, they'd be best served by pulling those data back to the driver and just using pandas, if they tried to use spark's partitions they'd get somewhat arbitrary partitions and have to implement some kind of merge operator on their own. However, with grouped and windowed aggregations, we can provide an API which truly is parallelizable and useful. I want to focus on use cases where we actually can parallelize without requiring a merge operator right now. Aggregators in pandas and related tools in the ecosystem usually assume they have access to all the data for an operation and don't need to merge results of subaggregations. For aggregations over larger datasets you'd really want to encourage the use of native Spark operations (that use e.g. {{treeAggregate}}). Does that make sense? I think it focuses the problem nicely that it becomes fairly tractable. I think the really hard part of this API design is deciding what the inputs and outputs of the UDF look like, and providing for the myriad use cases therein. For example, one might want to aggregate each group down to a scalar (e.g. mean) and do something with that (either produce a reduced dataset with one value per group, or add a column where each group has the same value across all rows), or one might want to compute over the group and produce a value per row within the group and attach that as a new column (e.g. demeaning or ranking). These translate roughly to the differences between the [**ply operations in dplyr|https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf] or the differences in pandas between {{df.groupby(...).agg(...)}} and {{df.groupby(...).transform(...)}} and {{df.groupby(...).apply(...)}}. 
> SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The
[jira] [Resolved] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21246. -- Resolution: Invalid Of course, it follows the schema a user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception when the schema is mismatched. At the least, I don't think the schema is an issue here. I am resolving this. > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
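One clarifying note on the LONG vs BIGINT concern in the title: in Spark SQL, {{LongType}} is simply rendered with the SQL type name {{bigint}}; no data is converted. A quick check:

{code}
import org.apache.spark.sql.types._

// LongType's SQL-facing name is "bigint"; the in-memory type is unchanged.
val field = StructField("name", LongType, nullable = true)
println(field.dataType.simpleString)  // prints: bigint
{code}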
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069300#comment-16069300 ] Helena Edelson commented on SPARK-18057: IMHO kafka-0-11 to be explicit and wait until kafka 0.11.1.0 which per https://issues.apache.org/jira/browse/KAFKA-4879 resolves the last blocker to upgrading? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error
[ https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21230. -- Resolution: Invalid It is hard for me to read and reproduce as well. Let's close this unless you are going to fix it or provide some explicit steps to reproduce that anyone can follow. Also, I think the point is to narrow the problem down. I don't think this is an actionable JIRA either. > Spark Encoder with mysql Enum and data truncated Error > -- > > Key: SPARK-21230 > URL: https://issues.apache.org/jira/browse/SPARK-21230 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 > Environment: macosX >Reporter: Michael Kunkel > > I am using Spark via Java for a MYSQL/ML(machine learning) project. > In the mysql database, I have a column "status_change_type" of type enum = > {broke, fixed} in a table called "status_change" in a DB called "test". > I have an object StatusChangeDB that constructs the needed structure for the > table, however for the "status_change_type", I constructed it as a String. I > know the bytes from MYSQL enum to Java string are much different, but I am > using Spark, so the encoder does not recognize enums properly. However when I > try to set the value of the enum via a Java string, I receive the "data > truncated" error > h5. org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: > Data truncated for column 'status_change_type' at row 1 at > com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055) > I have tried to use enum for "status_change_type", however it fails with a > stack trace of > h5. 
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ... > h5. > I have tried to use the jdbc setting "jdbcCompliantTruncation=false" but this > does nothing as I get the same error of "data truncated" as first stated. > Here are my jdbc options map, in case I am using the > "jdbcCompliantTruncation=false" incorrectly. > public static Map jdbcOptions() { > Map jdbcOptions = new HashMap(); > jdbcOptions.put("url", > "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false"); > jdbcOp
[jira] [Resolved] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
[ https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21249. -- Resolution: Invalid I am resolving this per {{Type: Question}}. Questions should go to the mailing list. > Is it possible to use File Sink with mapGroupsWithState in Structured > Streaming? > > > Key: SPARK-21249 > URL: https://issues.apache.org/jira/browse/SPARK-21249 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Amit Baghel >Priority: Minor > > I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to > use File Sink with mapGroupsWithState? With append output mode I am getting > below exception. > Exception in thread "main" org.apache.spark.sql.AnalysisException: > mapGroupsWithState is not supported with Append output mode on a streaming > DataFrame/Dataset;; -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18199) Support appending to Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-18199. --- Resolution: Invalid I'm closing this as invalid. It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time. > Support appending to Parquet files > -- > > Key: SPARK-18199 > URL: https://issues.apache.org/jira/browse/SPARK-18199 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeremy Smith > > Currently, appending to a Parquet directory involves simply creating new > parquet files in the directory. With many small appends (for example, in a > streaming job with a short batch duration) this leads to an unbounded number > of small Parquet files accumulating. These must be cleaned up with some > frequency by removing them all and rewriting a new file containing all the > rows. > It would be far better if Spark supported appending to the Parquet files > themselves. HDFS supports this, as does Parquet: > * The Parquet footer can be read in order to obtain necessary metadata. > * The new rows can then be appended to the Parquet file as a row group. > * A new footer can then be appended containing the metadata and referencing > the new row groups as well as the previously existing row groups. > This would result in a small amount of bloat in the file as new row groups > are added (since duplicate metadata would accumulate) but it's hugely > preferable to accumulating small files, which is bad for HDFS health and also > eventually leads to Spark being unable to read the Parquet directory at all. > Periodic rewriting of the file could still be performed in order to remove > the duplicate metadata. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
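Until appending to an existing Parquet file is supported, the periodic cleanup the description refers to is usually a compaction job that rewrites the many small files as a few large ones. A minimal PySpark sketch of that workaround (the paths and file counts are invented for illustration):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

src = "hdfs:///data/events"            # directory accumulating small files
dst = "hdfs:///data/events_compacted"  # staging location for the rewrite

# Read all the small files, coalesce to a handful of partitions,
# and rewrite them as a few large Parquet files.
spark.read.parquet(src).coalesce(8).write.mode("overwrite").parquet(dst)

# After validating the output, swap dst into place of src (e.g. via an
# HDFS rename) so readers see only the compacted files.
{code}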
[jira] [Comment Edited] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069281#comment-16069281 ] Hyukjin Kwon edited comment on SPARK-21246 at 6/30/17 1:27 AM: --- Of course, it follows the schema an user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception as the schema is mismatched. I don't think at least the same schema is an issue here. I am resolving this. was (Author: hyukjin.kwon): Of course, it follows the schema an user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception as the schema is mismatched. I don't think at least the same schema is not an issue here. I am resolving this. > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069332#comment-16069332 ] Apache Spark commented on SPARK-21260: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/18473 > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21260) Remove the unused OutputFakerExec
Jiang Xingbo created SPARK-21260: Summary: Remove the unused OutputFakerExec Key: SPARK-21260 URL: https://issues.apache.org/jira/browse/SPARK-21260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Jiang Xingbo Priority: Minor OutputFakerExec was added long ago and is not used anywhere now so we should remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21260: Assignee: Apache Spark > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21260: Assignee: (was: Apache Spark) > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R
[ https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069330#comment-16069330 ] Hyukjin Kwon commented on SPARK-21224: -- Oh! I almost missed this comment. Sure, I will soon. Yes, I would like to open a new JIRA. Thank you [~felixcheung]. > Support a DDL-formatted string as schema in reading for R > - > > Key: SPARK-21224 > URL: https://issues.apache.org/jira/browse/SPARK-21224 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > This might have to be a followup for SPARK-20431 but I just decided to make > this separate for R specifically as many PRs might be confusing. > Please refer to the discussion in the PR and SPARK-20431. > In a simple view, this JIRA describes the support for a DDL-formatted string > as schema as below: > {code} > mockLines <- c("{\"name\":\"Michael\"}", >"{\"name\":\"Andy\", \"age\":30}", >"{\"name\":\"Justin\", \"age\":19}") > jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") > writeLines(mockLines, jsonPath) > df <- read.df(jsonPath, "json", "name STRING, age DOUBLE") > collect(df) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
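For comparison, the same DDL-formatted schema string can be passed from Python in recent Spark versions (2.3+), where {{DataFrameReader.schema}} accepts a string as well as a StructType. A rough sketch, with an invented file path:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema given as a DDL-formatted string instead of building a StructType.
df = spark.read.schema("name STRING, age DOUBLE").json("/tmp/people.json")
df.show()
{code}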
[jira] [Commented] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069340#comment-16069340 ] Apache Spark commented on SPARK-21235: -- User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/18474 > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21235: Assignee: (was: Apache Spark) > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21235: Assignee: Apache Spark > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20858: Assignee: (was: Apache Spark) > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20858: Assignee: Apache Spark > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Assignee: Apache Spark >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069387#comment-16069387 ] Apache Spark commented on SPARK-20858: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/18476 > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21253. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18472 [https://github.com/apache/spark/pull/18472] > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(Tra
[jira] [Created] (SPARK-21261) SparkSQL regexpExpressions example
zhangxin created SPARK-21261: Summary: SparkSQL regexpExpressions example Key: SPARK-21261 URL: https://issues.apache.org/jira/browse/SPARK-21261 Project: Spark Issue Type: Documentation Components: Examples Affects Versions: 2.1.1 Reporter: zhangxin The following execution results illustrate the escaping behavior (the SQL parser processes backslash escapes in string literals, so the backslash in the regex must itself be escaped, even inside a Scala triple-quoted string):
{code}
scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') """).show
+----------------------------------+
|regexp_replace(100-200, (d+), num)|
+----------------------------------+
|                           100-200|
+----------------------------------+

scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') """).show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
|                            num-num|
+-----------------------------------+
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org