[jira] [Updated] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
[ https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-21249: Description:

I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to use File Sink with mapGroupsWithState? I verified the cases below and got an exception in each.
* Aggregation without watermark and Append mode
{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
{code}
* Aggregation with watermark and Append mode
{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: mapGroupsWithState is not supported with aggregation on a streaming DataFrame/Dataset;;
{code}
* No Aggregation and Append mode
{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: mapGroupsWithState is not supported with Append output mode on a streaming DataFrame/Dataset;;
{code}
* No Aggregation and Update mode (should fail, as File Sink supports only Append)
{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source parquet does not support Update output mode;
{code}
* No Aggregation and Complete mode (should fail, as File Sink supports only Append)
{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source parquet does not support Complete output mode;
{code}

> Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
>
> Key: SPARK-21249
> URL: https://issues.apache.org/jira/browse/SPARK-21249
> Project: Spark
> Issue Type: Question
> Components: Structured Streaming
> Affects Versions: 2.2.0
> Reporter: Amit Baghel
> Priority: Minor

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
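The five cases above can be condensed into a small plain-Python summary of the (output mode, error) matrix observed in this ticket. This is illustrative only, not Spark code: the file (parquet) sink accepts Append only, while mapGroupsWithState rejects Append, so no output mode satisfies both operators.

```python
# Paraphrased summary of the analysis errors reported above for the
# no-aggregation cases (2.2.0-SNAPSHOT). Function name and strings are
# illustrative, not Spark's actual planner logic.

def check_query(output_mode: str) -> str:
    """Return the (paraphrased) analysis error for mapGroupsWithState + file sink."""
    if output_mode == "append":
        # mapGroupsWithState rejects Append before the sink is consulted.
        return "mapGroupsWithState is not supported with Append output mode"
    # Update and Complete are rejected by the file sink, which supports Append only.
    return f"Data source parquet does not support {output_mode.capitalize()} output mode"

for mode in ("append", "update", "complete"):
    print(mode, "->", check_query(mode))
```

Every branch returns an error, which is why all five experiments above fail.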
[jira] [Commented] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
[ https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069528#comment-16069528 ] Amit Baghel commented on SPARK-21249: - [~hyukjin.kwon] Sorry for this jira post; I couldn't decide whether the type should be bug or question. I have updated the jira description with all the cases I tried for File Sink with mapGroupsWithState. I will post this to the mailing list. Thanks.
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069490#comment-16069490 ] Reynold Xin commented on SPARK-21190: - That makes a lot of sense. So the idea is to design APIs similar to a lot of the ply operations. I will think a bit about this next week. BTW, if somebody can also propose some different alternative APIs (perhaps as a gist), it'd facilitate the discussion further. How should we do memory management, though? Should we just have a configurable cap on the maximum number of rows? Should we just let the user crash (not a great experience)?

> SPIP: Vectorized UDFs in Python
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SQL
> Affects Versions: 2.2.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. Spark currently exposes a row-at-a-time interface for defining and executing user-defined functions (UDFs). This introduces high overhead in serialization and deserialization, and also makes it difficult to leverage Python libraries (e.g. numpy, Pandas) that are written in native code.
> This proposal advocates introducing new APIs to support vectorized UDFs in Python, in which a block of data is transferred over to Python in some columnar format for execution.
>
> *Target Personas*
> Data scientists, data engineers, library developers.
>
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization overhead when compared with the row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. numpy, Pandas) for data manipulation in these UDFs
>
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be done in future SPIPs. Nonetheless, it would be good to consider these future use cases during API design, so we can achieve some consistency when rolling out new APIs.
> - Define block-oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>
> *Proposed API Changes*
> The following sketches some possibilities. I haven't spent a lot of time thinking about the API (wrote it down in 5 mins) and I am not attached to this design at all. The main purpose of the SPIP is to get feedback on use cases and see how they can impact API design.
>
> Two things to consider are:
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, analysis-time typing. This means users would need to specify the return type of their UDFs.
> 2. Ratio of input rows to output rows. We propose that initially we require the number of output rows to be the same as the number of input rows. In the future, we can consider relaxing this constraint with support for vectorized aggregate UDFs.
>
> Proposed API sketch (using examples):
>
> Use case 1. A function that defines all the columns of a DataFrame (similar to a "map" function):
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>     """ Some user-defined function.
>
>     :param input: A Pandas DataFrame with two columns, a and b.
>     :return: :class: A Pandas data frame.
>     """
>     input['c'] = input['a'] + input['b']
>     input['d'] = input['a'] - input['b']
>     return input
>
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>
> Use case 2. A function that defines only one column (similar to existing UDFs):
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>     """ Some user-defined function.
>
>     :param input: A Pandas DataFrame with two columns, a and b.
>     :return: :class: A numpy array
>     """
>     return input['a'] + input['b']
>
> my_func = udf(my_func_that_returns_one_column)
>
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>
> *Optional Design Sketch*
> I'm more concerned about getting proper feedback for API design. The implementation should be pretty straightforward and is not a huge concern at this point. We can leverage the same implementation for a faster toPandas (using Arrow).
>
> *Optional Rejected Designs*
> See above.
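The row-at-a-time vs. vectorized contract in the SPIP can be sketched without Spark or Pandas, using plain Python lists in place of columnar batches. The function names here are illustrative, not the proposed API; the point is the per-call granularity and the SPIP's initial requirement that output rows equal input rows.

```python
def row_at_a_time_udf(a, b):
    # Called once per row: per-call (and serialization) overhead dominates
    # for cheap functions - the problem the SPIP wants to remove.
    return a + b

def vectorized_udf(col_a, col_b):
    # Called once per batch of rows; returns the same number of output
    # rows as input rows, matching the SPIP's initial 1:1 constraint.
    return [x + y for x, y in zip(col_a, col_b)]

# Mimic spark.range(1000).selectExpr("id a", "id / 2 b") as two columns.
a = list(range(1000))
b = [x / 2 for x in range(1000)]

per_row = [row_at_a_time_udf(x, y) for x, y in zip(a, b)]
per_batch = vectorized_udf(a, b)
assert per_row == per_batch            # same results, fewer Python calls
assert len(per_batch) == len(a)        # output rows == input rows
```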
[jira] [Resolved] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21258. - Resolution: Fixed Fix Version/s: 2.1.2, 2.2.0 Issue resolved by pull request 18470 [https://github.com/apache/spark/pull/18470]

> Window result incorrect using complex object with spilling
>
> Key: SPARK-21258
> URL: https://issues.apache.org/jira/browse/SPARK-21258
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Herman van Hovell
> Assignee: Herman van Hovell
> Fix For: 2.2.0, 2.1.2
[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21176: --- Assignee: Ingo Schuster

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable externally, other nodes are in an internal network
> Reporter: Ingo Schuster
> Assignee: Ingo Schuster
> Labels: network, web-ui
> Fix For: 2.1.2, 2.2.0
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates selector threads: minimum 1, maximum half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries to instantiate 7 * 44 = 308 selector threads just for the reverse proxy servlets, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ({{val pool = new QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI, however, there will rarely be concurrent accesses for the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default.
> I will propose a fix in a pull request.
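The arithmetic behind the hang can be checked with a back-of-the-envelope sketch. This reproduces only the quoted formula (`max(1, availableProcessors() / 2)` selector threads per ProxyServlet, one servlet per executor), not Jetty's actual implementation.

```python
def selector_threads_per_servlet(cpus: int) -> int:
    # Quoted Jetty formula: Math.max(1, availableProcessors() / 2)
    return max(1, cpus // 2)

def total_selector_threads(executors: int, cpus: int) -> int:
    # One ProxyServlet (hence one HttpClient) per executor in reverse proxy mode.
    return executors * selector_threads_per_servlet(cpus)

# The reporter's setup: 7 executors, 88 CPUs on the master node.
total = total_selector_threads(7, 88)
print(total)        # 308
print(total > 200)  # True: exceeds QueuedThreadPool's default of 200 threads
```

With 308 selector threads demanded from a 200-thread pool, request-handling threads starve and the UI hangs, which is why enlarging the pool (the `QueuedThreadPool(400)` hack) works around it.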
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069424#comment-16069424 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18478 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at >
[jira] [Assigned] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21261: Assignee: Apache Spark

> SparkSQL regexpExpressions example
>
> Key: SPARK-21261
> URL: https://issues.apache.org/jira/browse/SPARK-21261
> Project: Spark
> Issue Type: Documentation
> Components: Examples
> Affects Versions: 2.1.1
> Reporter: zhangxin
> Assignee: Apache Spark
>
> The following shows the execution results:
> {code}
> scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') """).show
> +----------------------------------+
> |regexp_replace(100-200, (d+), num)|
> +----------------------------------+
> |                           100-200|
> +----------------------------------+
>
> scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') """).show
> +-----------------------------------+
> |regexp_replace(100-200, (\d+), num)|
> +-----------------------------------+
> |                            num-num|
> +-----------------------------------+
> {code}
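The behaviour above comes from string-literal escaping: the SQL parser consumes one level of backslashes, so `'(\d+)'` reaches the regex engine as `(d+)` (a literal `d`), while `'(\\d+)'` reaches it as `(\d+)`. The same effect can be demonstrated with Python's `re` module by feeding the engine the two post-parsing patterns directly:

```python
import re

# What the regex engine receives after the SQL string parser strips one
# level of backslashes (illustrative reproduction, not Spark code):
after_single_escape = "(d+)"    # '(\d+)'  -> backslash consumed, literal 'd'
after_double_escape = r"(\d+)"  # '(\\d+)' -> the intended digit pattern

print(re.sub(after_single_escape, "num", "100-200"))  # 100-200 (no 'd' to match)
print(re.sub(after_double_escape, "num", "100-200"))  # num-num
```

This matches the two show() outputs in the report: the single-escaped pattern leaves `100-200` untouched, the double-escaped one yields `num-num`.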
[jira] [Assigned] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21261: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069416#comment-16069416 ] Apache Spark commented on SPARK-21261: -- User 'visaxin' has created a pull request for this issue: https://github.com/apache/spark/pull/18477
[jira] [Resolved] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21176. - Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 18437 [https://github.com/apache/spark/pull/18437] > Master UI hangs with spark.ui.reverseProxy=true if the master node has many > CPUs > > > Key: SPARK-21176 > URL: https://issues.apache.org/jira/browse/SPARK-21176 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1 > Environment: ppc64le GNU/Linux, POWER8, only master node is reachable > externally other nodes are in an internal network >Reporter: Ingo Schuster > Labels: network, web-ui > Fix For: 2.2.0, 2.1.2 > > > In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master > node has too many cpus or the cluster has too many executers: > For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum > half the number of available CPUs: > {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}} > (see > https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java) > In reverse proxy mode, a proxy servlet is set up for each executor. > I have a system with 7 executors and 88 CPUs on the master node. Jetty tries > to instantiate 7*44 = 309 selector threads just for the reverse proxy > servlets, but since the QueuedThreadPool is initialized with 200 threads by > default, the UI gets stuck. > I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new > QueuedThreadPool(400)}}). With this hack, the UI works. > Obviously, the Jetty defaults are meant for a real web server. If that has 88 > CPUs, you do certainly expect a lot of traffic. > For the Spark admin UI however, there will rarely be concurrent accesses for > the same application or the same executor. 
> I therefore propose to dramatically reduce the number of selector threads > that get instantiated - at least by default. > I will propose a fix in a pull request. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
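The thread arithmetic in the report above can be sketched directly. This is an illustrative calculation only — the helper names below are made up for the sketch, not Jetty API:

```scala
// Illustrative sketch of the selector-thread math described in the report.
// Jetty allocates up to max(1, availableProcessors / 2) selector threads
// per ProxyServlet, and reverse-proxy mode creates one servlet per executor.
def selectorsPerServlet(cpus: Int): Int = math.max(1, cpus / 2)

def totalSelectors(cpus: Int, proxiedExecutors: Int): Int =
  selectorsPerServlet(cpus) * proxiedExecutors

// 7 executors on an 88-CPU master: 7 * 44 = 308 selector threads requested,
// exceeding Jetty's default QueuedThreadPool size of 200 threads.
val requested = totalSelectors(88, 7)
val defaultPoolSize = 200
println(s"requested=$requested, poolSize=$defaultPoolSize, starved=${requested > defaultPoolSize}")
```

This is also why enlarging the pool (the `QueuedThreadPool(400)` hack above) unblocks the UI: 400 exceeds the 308 threads requested.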
[jira] [Assigned] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19659: Assignee: jin xing (was: Apache Spark) > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory (off-heap by default) during > shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can > be large in skew situations. If OOM happens during shuffle read, the job is > killed and users are notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM; however, the approach is not well suited to > production environments, especially data warehouses. > Using Spark SQL as the data engine in a warehouse, users want a unified > parameter (e.g. memory) with less resource wasted (resource allocated but not > used). > Skew situations are not always easy to predict; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read rather than > kill the job because of OOM. This approach was mentioned during the discussion > in SPARK-3019 by [~sandyr] and [~mridulm80].
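The proposal above — spill oversized remote blocks to disk instead of buffering them fully in memory — reduces to a size check per fetched block. The sketch below uses hypothetical stand-in names, not Spark internals:

```scala
// Hypothetical sketch of the fetch-to-disk decision proposed above:
// blocks larger than a configured threshold are streamed to local disk
// instead of being buffered in (off-heap) memory, avoiding OOM under skew.
case class BlockFetch(blockId: String, sizeBytes: Long)

def fetchDestination(block: BlockFetch, maxInMemBytes: Long): String =
  if (block.sizeBytes > maxInMemBytes) "disk" else "memory"

val threshold = 1L << 20 // e.g. a 1 MiB cap on in-memory fetches

// A skewed 64 MiB block goes to disk; a normal 4 KiB block stays in memory.
println(fetchDestination(BlockFetch("shuffle_0_1_2", 64L << 20), threshold))
println(fetchDestination(BlockFetch("shuffle_0_1_3", 4096L), threshold))
```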
[jira] [Assigned] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19659: Assignee: Apache Spark (was: jin xing) > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: Apache Spark > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
[jira] [Updated] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19659: - Fix Version/s: (was: 2.2.0) > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
[jira] [Reopened] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reopened SPARK-19659: -- Reopened this as it's disabled. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069396#comment-16069396 ] Apache Spark commented on SPARK-19659: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18467 > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Fix For: 2.2.0 > > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
[jira] [Created] (SPARK-21261) SparkSQL regexpExpressions example
zhangxin created SPARK-21261: Summary: SparkSQL regexpExpressions example Key: SPARK-21261 URL: https://issues.apache.org/jira/browse/SPARK-21261 Project: Spark Issue Type: Documentation Components: Examples Affects Versions: 2.1.1 Reporter: zhangxin The following executions show the results:

scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') """).show
+----------------------------------+
|regexp_replace(100-200, (d+), num)|
+----------------------------------+
|                           100-200|
+----------------------------------+

scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') """).show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
|                            num-num|
+-----------------------------------+
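The behavior shown is ordinary string-literal escaping rather than anything SQL-specific: the SQL parser consumes one level of backslashes, so only `\\d` delivers `\d` to the regex engine, while `\d` degrades to the literal letter `d`. The same effect can be reproduced with plain Scala string replacement:

```scala
// Why the double backslash matters: "(d+)" is a regex matching the literal
// letter d (absent from the input), while "(\\d+)" — backslash-d after
// source-level escaping — matches runs of digits.
val input = "100-200"
val literalD = input.replaceAll("(d+)", "num")   // pattern never matches
val digits   = input.replaceAll("(\\d+)", "num") // digits replaced
println(literalD) // 100-200
println(digits)   // num-num
```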
[jira] [Resolved] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21253. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18472 [https://github.com/apache/spark/pull/18472] > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at >
[jira] [Commented] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069387#comment-16069387 ] Apache Spark commented on SPARK-20858: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/18476 > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
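For intuition on what the property controls: the ListenerBus event queue is bounded, and events are dropped once it fills. A minimal stand-in using a plain bounded queue (not Spark code — the queue size here is just an arbitrary illustration of the capacity limit):

```scala
// A bounded queue drops offers once full, which is exactly the failure
// mode spark.scheduler.listenerbus.eventqueue.size lets users tune away.
import java.util.concurrent.LinkedBlockingQueue

val queueSize = 3 // stand-in for the eventqueue.size setting
val queue = new LinkedBlockingQueue[String](queueSize)

// Offer 5 events into a 3-slot queue: the last 2 are rejected (dropped).
val accepted = (1 to 5).count(i => queue.offer(s"event-$i"))
println(s"accepted=$accepted dropped=${5 - accepted}")
```

In practice the property would be raised (e.g. via `--conf spark.scheduler.listenerbus.eventqueue.size=<n>`) when the driver logs report dropped listener events.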
[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20858: Assignee: (was: Apache Spark) > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property
[ https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20858: Assignee: Apache Spark > Document ListenerBus event queue size property > -- > > Key: SPARK-20858 > URL: https://issues.apache.org/jira/browse/SPARK-20858 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Bjorn Jonsson >Assignee: Apache Spark >Priority: Minor > > SPARK-15703 made the ListenerBus event queue size configurable via > spark.scheduler.listenerbus.eventqueue.size. This should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21235: Assignee: Apache Spark > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21235: Assignee: (was: Apache Spark) > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R
[ https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069330#comment-16069330 ] Hyukjin Kwon commented on SPARK-21224: -- Oh! I almost missed this comment. Sure, I will soon. Yes, I would like to open a new JIRA. Thank you [~felixcheung]. > Support a DDL-formatted string as schema in reading for R > - > > Key: SPARK-21224 > URL: https://issues.apache.org/jira/browse/SPARK-21224 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > This might have to be a followup for SPARK-20431 but I just decided to make > this separate for R specifically as many PRs might be confusing. > Please refer to the discussion in the PR and SPARK-20431. > In a simple view, this JIRA describes the support for a DDL-formatted string > as schema as below: > {code} > mockLines <- c("{\"name\":\"Michael\"}", >"{\"name\":\"Andy\", \"age\":30}", >"{\"name\":\"Justin\", \"age\":19}") > jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") > writeLines(mockLines, jsonPath) > df <- read.df(jsonPath, "json", "name STRING, age DOUBLE") > collect(df) > {code}
[jira] [Commented] (SPARK-21235) UTest should clear temp results when run case
[ https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069340#comment-16069340 ] Apache Spark commented on SPARK-21235: -- User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/18474 > UTest should clear temp results when run case > -- > > Key: SPARK-21235 > URL: https://issues.apache.org/jira/browse/SPARK-21235 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21260: Assignee: (was: Apache Spark) > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21260: Assignee: Apache Spark > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21260) Remove the unused OutputFakerExec
Jiang Xingbo created SPARK-21260: Summary: Remove the unused OutputFakerExec Key: SPARK-21260 URL: https://issues.apache.org/jira/browse/SPARK-21260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Jiang Xingbo Priority: Minor OutputFakerExec was added long ago and is not used anywhere now so we should remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21260) Remove the unused OutputFakerExec
[ https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069332#comment-16069332 ] Apache Spark commented on SPARK-21260: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/18473 > Remove the unused OutputFakerExec > - > > Key: SPARK-21260 > URL: https://issues.apache.org/jira/browse/SPARK-21260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jiang Xingbo >Priority: Minor > > OutputFakerExec was added long ago and is not used anywhere now so we should > remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069281#comment-16069281 ] Hyukjin Kwon edited comment on SPARK-21246 at 6/30/17 1:27 AM: --- Of course, it follows the schema an user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception as the schema is mismatched. I don't think at least the same schema is an issue here. I am resolving this. was (Author: hyukjin.kwon): Of course, it follows the schema an user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception as the schema is mismatched. I don't think at least the same schema is not an issue here. I am resolving this. > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
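For context on the issue title: LONG → BIGINT is not a data conversion at all — `bigint` is simply the SQL name Spark SQL prints for Catalyst's `LongType` (the schema the user specified is preserved, as the comment above demonstrates). A small illustrative mapping (a hand-written subset for this sketch, not a Spark API):

```scala
// Catalyst type names versus the SQL names Spark prints for them.
// LongType rendering as "bigint" is a display convention, not a conversion.
val catalystToSql = Map(
  "LongType"    -> "bigint",
  "IntegerType" -> "int",
  "DoubleType"  -> "double",
  "StringType"  -> "string"
)
println(catalystToSql("LongType"))
```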
[jira] [Closed] (SPARK-18199) Support appending to Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-18199. --- Resolution: Invalid I'm closing this as invalid. It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time. > Support appending to Parquet files > -- > > Key: SPARK-18199 > URL: https://issues.apache.org/jira/browse/SPARK-18199 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeremy Smith > > Currently, appending to a Parquet directory involves simply creating new > parquet files in the directory. With many small appends (for example, in a > streaming job with a short batch duration) this leads to an unbounded number > of small Parquet files accumulating. These must be cleaned up with some > frequency by removing them all and rewriting a new file containing all the > rows. > It would be far better if Spark supported appending to the Parquet files > themselves. HDFS supports this, as does Parquet: > * The Parquet footer can be read in order to obtain necessary metadata. > * The new rows can then be appended to the Parquet file as a row group. > * A new footer can then be appended containing the metadata and referencing > the new row groups as well as the previously existing row groups. > This would result in a small amount of bloat in the file as new row groups > are added (since duplicate metadata would accumulate) but it's hugely > preferable to accumulating small files, which is bad for HDFS health and also > eventually leads to Spark being unable to read the Parquet directory at all. > Periodic rewriting of the file could still be performed in order to remove > the duplicate metadata. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?
[ https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21249. -- Resolution: Invalid I am resolving this per {{Type: Question}}. Questions should go to the mailing list. > Is it possible to use File Sink with mapGroupsWithState in Structured > Streaming? > > > Key: SPARK-21249 > URL: https://issues.apache.org/jira/browse/SPARK-21249 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Amit Baghel >Priority: Minor > > I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to > use File Sink with mapGroupsWithState? With append output mode I am getting > below exception. > Exception in thread "main" org.apache.spark.sql.AnalysisException: > mapGroupsWithState is not supported with Append output mode on a streaming > DataFrame/Dataset;; -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error
[ https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21230. -- Resolution: Invalid It is hard for me to read and reproduce as well. Let's close unless you are going to fix it or provide some steps to reproduce anyone can follow explicitly. Also, I think the point is to narrow down. I don't think this is an actionable JIRA as well. > Spark Encoder with mysql Enum and data truncated Error > -- > > Key: SPARK-21230 > URL: https://issues.apache.org/jira/browse/SPARK-21230 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 > Environment: macosX >Reporter: Michael Kunkel > > I am using Spark via Java for a MYSQL/ML(machine learning) project. > In the mysql database, I have a column "status_change_type" of type enum = > {broke, fixed} in a table called "status_change" in a DB called "test". > I have an object StatusChangeDB that constructs the needed structure for the > table, however for the "status_change_type", I constructed it as a String. I > know the bytes from MYSQL enum to Java string are much different, but I am > using Spark, so the encoder does not recognize enums properly. However when I > try to set the value of the enum via a Java string, I receive the "data > truncated" error > h5. org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: > Data truncated for column 'status_change_type' at row 1 at > com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055) > I have tried to use enum for "status_change_type", however it fails with a > stack trace of > h5. 
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ... > h5. > I have tried the jdbc setting "jdbcCompliantTruncation=false" but it > does nothing; I get the same "data truncated" error as first stated. > Here is my jdbc options map, in case I am using > "jdbcCompliantTruncation=false" incorrectly: > public static Map<String, String> jdbcOptions() { > Map<String, String> jdbcOptions = new HashMap<>(); > jdbcOptions.put("url", >
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069300#comment-16069300 ] Helena Edelson commented on SPARK-18057: IMHO kafka-0-11 to be explicit and wait until kafka 0.11.1.0 which per https://issues.apache.org/jira/browse/KAFKA-4879 resolves the last blocker to upgrading? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21246. -- Resolution: Invalid Of course, it follows the schema a user specified. {code} scala> peopleDF.schema == schema res9: Boolean = true {code} and this throws an exception when the schema is mismatched. Given that the schemas are the same, I don't think this is an issue. I am resolving this. > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. The following code can be run in zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema)
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069269#comment-16069269 ] Leif Walsh commented on SPARK-21190: I agree with [~icexelloss] that we should aim to provide an API which provides "logical" groups of data to the UDF rather than the implementation detail of providing partitions wholesale. Setting aside for the moment the problem with dataset skew which could cause one group to be very large, let's look at some use cases. One obvious use case that tools like dplyr and pandas support is {{df.groupby(...).aggregate(...)}}. Here, we group on some key and apply a function to each logical group. This can be used to e.g. demean each group w.r.t. its cohort. Another use case that we care about with [Flint|https://github.com/twosigma/flint] is aggregating over a window. In pandas terminology this is the {{rolling}} operator. One might want to, for each row, perform a moving window average or rolling regression over a history of some size. The windowed aggregation poses a performance question that the groupby case doesn't: namely, if we naively send each window to the python worker independently, we're transferring a lot of duplicate data since each overlapped window contains many of the same rows. An option here is to transfer the entire partition on the backend and then instruct the python worker to call the UDF with slices of the whole dataset according to the windowing requested by the user. I think the idea of presenting a whole partition in a pandas dataframe to a UDF is a bit off-track. If someone really wants to apply a python function to the "whole" dataset, they'd be best served by pulling those data back to the driver and just using pandas, if they tried to use spark's partitions they'd get somewhat arbitrary partitions and have to implement some kind of merge operator on their own. 
However, with grouped and windowed aggregations, we can provide an API which truly is parallelizable and useful. I want to focus on use cases where we actually can parallelize without requiring a merge operator right now. Aggregators in pandas and related tools in the ecosystem usually assume they have access to all the data for an operation and don't need to merge results of subaggregations. For aggregations over larger datasets you'd really want to encourage the use of native Spark operations (that use e.g. {{treeAggregate}}). Does that make sense? I think it focuses the problem nicely that it becomes fairly tractable. I think the really hard part of this API design is deciding what the inputs and outputs of the UDF look like, and providing for the myriad use cases therein. For example, one might want to aggregate each group down to a scalar (e.g. mean) and do something with that (either produce a reduced dataset with one value per group, or add a column where each group has the same value across all rows), or one might want to compute over the group and produce a value per row within the group and attach that as a new column (e.g. demeaning or ranking). These translate roughly to the differences between the [**ply operations in dplyr|https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf] or the differences in pandas between {{df.groupby(...).agg(...)}} and {{df.groupby(...).transform(...)}} and {{df.groupby(...).apply(...)}}. > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are
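The distinctions drawn in the comment above map directly onto plain pandas. A minimal sketch (pandas only, no Spark) of the grouped-aggregation shapes and the windowed case being discussed:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 10.0]})

# agg: reduce each group to a scalar -> one row per group
group_means = df.groupby("key")["val"].agg("mean")

# transform: one value per input row, aligned with the original index
# (e.g. demeaning each row w.r.t. its group)
demeaned = df["val"] - df.groupby("key")["val"].transform("mean")

# rolling: windowed aggregation over a fixed-size history; note that
# naively shipping each window to a worker would duplicate most rows,
# which is the transfer-cost concern raised in the comment
rolling_mean = df["val"].rolling(window=2).mean()
```

The vectorized-UDF API question is then which of these shapes (reduce per group, value per row, arbitrary per-group output) the Python UDF is allowed to produce.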
[jira] [Closed] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
[ https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K closed SPARK-21148. - Resolution: Duplicate > Set SparkUncaughtExceptionHandler to the Master > --- > > Key: SPARK-21148 > URL: https://issues.apache.org/jira/browse/SPARK-21148 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > If any one of the Master's threads gets an UncaughtException, that thread > terminates and the Master process keeps running without functioning properly. > I think we need to handle the UncaughtException and exit the Master > gracefully.
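The pattern this issue asks for, sketched in Python as an analogy (not the Spark code): install a process-wide uncaught-exception handler so a failure in a worker thread is noticed instead of silently killing only that thread. Here the handler merely records the failure; a daemon like the Master would log it and terminate the whole process.

```python
import threading

uncaught = []

def handler(args):
    # A real daemon would log this and exit the process instead of
    # continuing in a half-broken state.
    uncaught.append((args.thread.name, type(args.exc_value).__name__))

threading.excepthook = handler  # Python 3.8+

t = threading.Thread(target=lambda: 1 / 0, name="worker")
t.start()
t.join()
```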
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069227#comment-16069227 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18472 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at >
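Size-valued settings such as {{spark.reducer.maxReqSizeShuffleToMem}} accept suffixed byte strings ("1K" in the reproduction above). A rough Python analogue of that parsing, to make the reproduction's threshold concrete (the real logic lives in Spark's Utils; this sketch and its names are illustrative):

```python
UNITS = {"b": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}

def byte_string_as_bytes(s: str) -> int:
    # "1K" -> 1024, "64m" -> 67108864, "123" -> 123
    s = s.strip().lower()
    if s[-1].isdigit():
        return int(s)
    return int(s[:-1]) * UNITS[s[-1]]
```

With a 1 KiB threshold, essentially every shuffle block in the 2001-partition repartition is "big" and takes the fetch-to-disk path, which is what exercises the bug.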
[jira] [Assigned] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21259: Assignee: (was: Apache Spark) > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Minor > > During code review, we spend so much time on code style issues. > It would be great if we added rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses
[jira] [Commented] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069208#comment-16069208 ] Apache Spark commented on SPARK-21259: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/18471 > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Minor > > During code review, we spend so much time on code style issues. > It would be great if we added rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses
[jira] [Assigned] (SPARK-21259) More rules for scalastyle
[ https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21259: Assignee: Apache Spark > More rules for scalastyle > - > > Key: SPARK-21259 > URL: https://issues.apache.org/jira/browse/SPARK-21259 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > During code review, we spend so much time on code style issues. > It would be great if we added rules: > 1) disallow space before colon > 2) disallow space before right parentheses > 3) disallow space after left parentheses
[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-20690: Description: We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: {code} select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 {code} This query works in the current codebase. However, the outside where clause shouldn't be able to refer to the t1.c1 attribute. The root cause is that we previously allowed subqueries in FROM to have no alias names; this is confusing and isn't supported by databases such as MySQL, Postgres, and Oracle. We shouldn't support it either. was: We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: {code} select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 {code} This query works in the current codebase. However, the outside where clause shouldn't be able to refer to the t1.c1 attribute. > Subqueries in FROM should have alias names > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in the current codebase. However, the outside where clause > shouldn't be able to refer to the t1.c1 attribute. > The root cause is that we previously allowed subqueries in FROM to have no alias names; > this is confusing and isn't supported by databases such as MySQL, > Postgres, and Oracle. We shouldn't support it either.
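The scoping rule this fix enforces matches other engines. For instance, in SQLite the same query shape fails because t1 is only visible inside the derived table (a small illustration using the stdlib, not Spark):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table onerow (c1 integer)")
conn.execute("insert into onerow values (1)")

try:
    # t1 is scoped to the subquery, so the outer WHERE must not see it
    conn.execute(
        "select 1 from (select 1 from onerow t1 limit 1) where t1.c1 = 1"
    )
    outer_ref_allowed = True
except sqlite3.OperationalError:  # e.g. "no such column: t1.c1"
    outer_ref_allowed = False
```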
[jira] [Created] (SPARK-21259) More rules for scalastyle
Gengliang Wang created SPARK-21259: -- Summary: More rules for scalastyle Key: SPARK-21259 URL: https://issues.apache.org/jira/browse/SPARK-21259 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Gengliang Wang Priority: Minor During code review, we spend so much time on code style issues. It would be great if we added rules: 1) disallow space before colon 2) disallow space before right parentheses 3) disallow space after left parentheses
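The three proposed rules are simple token-adjacency checks. A hedged sketch of what each one would flag (illustrative regexes, not the actual scalastyle checkers, and deliberately ignoring strings and comments):

```python
import re

RULES = {
    "space before colon": re.compile(r"\s:"),
    "space before right paren": re.compile(r"\s\)"),
    "space after left paren": re.compile(r"\(\s"),
}

def violations(line: str) -> list[str]:
    # Return the names of all rules the line breaks.
    return [name for name, pattern in RULES.items() if pattern.search(line)]
```

A production checker would operate on the token stream rather than raw text so that string literals and comments don't trigger false positives.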
[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-20690: Summary: Subqueries in FROM should have alias names (was: Analyzer shouldn't add missing attributes through subquery) > Subqueries in FROM should have alias names > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute.
[jira] [Resolved] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method
[ https://issues.apache.org/jira/browse/SPARK-21188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21188. -- Resolution: Fixed Assignee: Feng Liu Fix Version/s: 2.3.0 > releaseAllLocksForTask should synchronize the whole method > -- > > Key: SPARK-21188 > URL: https://issues.apache.org/jira/browse/SPARK-21188 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Feng Liu >Assignee: Feng Liu > Fix For: 2.3.0 > > > Since the objects readLocksByTask, writeLocksByTask and infos are coupled and > supposed to be modified by other threads concurrently, all the reads and > writes of them in the releaseAllLocksForTask method should be protected by a > single synchronized block. The fine-grained synchronization in the current > code can cause some test flakiness.
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069146#comment-16069146 ] Mike commented on SPARK-21255: -- Yes, but there may be other problems (at least I cannot guarantee there won't be), since it seems like enums were never used before. > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for an Enum type (or a bean with an enum > property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. > I did a little research and it turns out that in JavaTypeInference:126 the > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we are trying to inspect recursively. Eventually we > try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" > with no read method, which leads to the NPE at TypeToken:495. > I think adding the property name "declaringClass" to the filter will resolve this.
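The cycle is easy to see by analogy in Python: an enum member refers back to its declaring class, so a naive "recurse into every property" introspector re-enters the class object instead of terminating, which is exactly why the proposed declaringClass filter is needed on the JVM side (this is only an analogy sketch, not the Spark code path):

```python
import enum

class Status(enum.Enum):
    BROKE = 1
    FIXED = 2

# The member's declaring class is reachable from the member itself,
# mirroring Java's Enum.getDeclaringClass(). A recursive property walker
# that does not filter this edge walks back into the class object (and,
# on the JVM, onward into Class/ClassLoader internals).
declaring_class = type(Status.BROKE)
```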
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069109#comment-16069109 ] Yuming Wang commented on SPARK-21253: - I checked it, all jars are latest 2.2.0-rcX. > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) > at >
[jira] [Commented] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069084#comment-16069084 ] Apache Spark commented on SPARK-21258: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/18470 > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >
[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21258: Assignee: Herman van Hovell (was: Apache Spark) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >
[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21258: Assignee: Apache Spark (was: Herman van Hovell) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >
[jira] [Created] (SPARK-21258) Window result incorrect using complex object with spilling
Herman van Hovell created SPARK-21258: - Summary: Window result incorrect using complex object with spilling Key: SPARK-21258 URL: https://issues.apache.org/jira/browse/SPARK-21258 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Assignee: Herman van Hovell
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068988#comment-16068988 ] Asher Krim commented on SPARK-20797: This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. 
> Word2Vec's save method in the latest code (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala): > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > But the code in mllib.clustering.LDAModel's LocalLDAModel.save (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala) is: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd) } > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with the topic count k, used repartition(nPartitions) in LocalLDAModel's save method, recompiled, and deployed the patched LDA job with large data on our cluster: it works. > I hope this will be fixed in the next version.
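The fix the reporter describes (replacing the hard-coded repartition(1) with a size-based partition count, as Word2Vec's save already does) can be sketched in plain Scala. The object and parameter names below are illustrative, and the 8-bytes-per-Double size estimate is an assumption rather than Spark's exact formula:

```scala
// Hypothetical sketch of the reporter's fix: size the number of output partitions
// from the approximate serialized size of the topics matrix, mirroring Word2Vec's
// save, instead of LocalLDAModel's hard-coded repartition(1).
object LdaSavePartitions {
  // vocabSize x k matrix of Doubles, roughly 8 bytes per entry (illustrative estimate)
  def numPartitions(vocabSize: Long, k: Long, maxBufferBytes: Long): Int = {
    val approxSize = 8L * vocabSize * k
    ((approxSize / maxBufferBytes) + 1).toInt
  }
}
```

With the reporter's sizes (vocabulary near 1,000,000 and k = 300) and a 64 MB kryo buffer, this yields 36 partitions instead of 1, keeping each serialized partition below the buffer limit.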
[jira] [Comment Edited] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068988#comment-16068988 ] Asher Krim edited comment on SPARK-20797 at 6/29/17 9:19 PM: - This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294 was (Author: akrim): This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068980#comment-16068980 ] Cody Koeninger commented on SPARK-18057: Kafka 0.11 is now released. Are we upgrading spark artifacts named kafka-0-10 to use kafka 0.11, or are we renaming them to kafka-0-11? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068876#comment-16068876 ] Shixiong Zhu commented on SPARK-21253: -- [~q79969786] did you run Spark 2.2.0-rcX on Yarn which has a Spark 2.1.* shuffle service? > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif > > > Spark *cluster* can reproduce, *local* can't: > 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: > {code:actionscript} > $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K > {code} > 2. A shuffle: > {code:actionscript} > scala> sc.parallelize(0 until 300, 10).repartition(2001).count() > {code} > The error messages: > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to send request for > 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: > java.io.IOException: Connection reset by peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to > yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: > Connection reset by peer > at > org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) > at > 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) > at > io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) > at > io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) > at > org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) >
[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation
[ https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu DESPRIEE updated SPARK-21257: - Issue Type: Improvement (was: New Feature) > LDA : create an Evaluator to enable cross validation > > > Key: SPARK-21257 > URL: https://issues.apache.org/jira/browse/SPARK-21257 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Mathieu DESPRIEE > > I suggest the creation of an LDAEvaluator to use with CrossValidator, using > logPerplexity as a metric for evaluation. > Unfortunately, the computation of perplexity needs to access some internal > data of the model, and the current implementation of CrossValidator does not > pass the model being evaluated to the Evaluator. > A way could be to change the Evaluator.evaluate() method to pass the model > along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation
[ https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu DESPRIEE updated SPARK-21257: - Issue Type: New Feature (was: Improvement) > LDA : create an Evaluator to enable cross validation > > > Key: SPARK-21257 > URL: https://issues.apache.org/jira/browse/SPARK-21257 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Mathieu DESPRIEE
[jira] [Created] (SPARK-21257) LDA : create an Evaluator to enable cross validation
Mathieu DESPRIEE created SPARK-21257: Summary: LDA : create an Evaluator to enable cross validation Key: SPARK-21257 URL: https://issues.apache.org/jira/browse/SPARK-21257 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.1.1 Reporter: Mathieu DESPRIEE I suggest the creation of an LDAEvaluator to use with CrossValidator, using logPerplexity as a metric for evaluation. Unfortunately, the computation of perplexity needs to access some internal data of the model, and the current implementation of CrossValidator does not pass the model being evaluated to the Evaluator. A way could be to change the Evaluator.evaluate() method to pass the model along with the dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
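The API change proposed in this ticket, passing the fitted model to the evaluator alongside the dataset, could look roughly like the sketch below. All names here (ModelAwareEvaluator, FittedModel, PerplexityEvaluator) are hypothetical; Spark's actual Evaluator.evaluate receives only the dataset:

```scala
// Hypothetical sketch of the proposal: an evaluator variant whose evaluate()
// receives the fitted model, so a metric like logPerplexity can reach model
// internals that the dataset alone does not carry.
trait FittedModel { def logLikelihood(data: Seq[String]): Double }

trait ModelAwareEvaluator {
  // the current Evaluator.evaluate takes only the dataset; this adds the model
  def evaluate(model: FittedModel, data: Seq[String]): Double
}

object PerplexityEvaluator extends ModelAwareEvaluator {
  // negative average log-likelihood per document; lower is better
  def evaluate(model: FittedModel, data: Seq[String]): Double =
    -model.logLikelihood(data) / math.max(1, data.length)
}
```

CrossValidator would then call evaluate(model, dataset) after each fit, which is the change to Evaluator.evaluate() the ticket suggests.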
[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21256: Assignee: Xiao Li (was: Apache Spark) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21256: Assignee: Apache Spark (was: Xiao Li) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068832#comment-16068832 ] Apache Spark commented on SPARK-21256: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18469 > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21256) Add WithSQLConf to Catalyst Test
[ https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21256: Summary: Add WithSQLConf to Catalyst Test (was: Add WithSQLConf to Catalyst) > Add WithSQLConf to Catalyst Test > > > Key: SPARK-21256 > URL: https://issues.apache.org/jira/browse/SPARK-21256 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21256) Add WithSQLConf to Catalyst
Xiao Li created SPARK-21256: --- Summary: Add WithSQLConf to Catalyst Key: SPARK-21256 URL: https://issues.apache.org/jira/browse/SPARK-21256 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Add WithSQLConf to the Catalyst module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
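A withSQLConf-style test helper typically sets configuration keys, runs the test body, and restores the previous values in a finally block. The following is a minimal standalone sketch of that pattern; the mutable map stands in for SQLConf, and all names are illustrative rather than taken from the Catalyst module:

```scala
// Minimal sketch of the withSQLConf test-helper pattern: set key/value pairs,
// run the body, then restore (or remove) the previous values even if the body throws.
object ConfFixture {
  val conf = scala.collection.mutable.Map.empty[String, String]

  def withSQLConf[T](pairs: (String, String)*)(body: => T): T = {
    // remember the prior value (or absence) of every key we are about to touch
    val saved = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally saved.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None)      => conf.remove(k)
    }
  }
}
```

The value of adding such a helper to Catalyst is that suites below the sql module can flip configuration for a single test without leaking state into later tests.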
[jira] [Commented] (SPARK-20873) Improve the error message for unsupported Column Type
[ https://issues.apache.org/jira/browse/SPARK-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068800#comment-16068800 ] Apache Spark commented on SPARK-20873: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18468 > Improve the error message for unsupported Column Type > - > > Key: SPARK-20873 > URL: https://issues.apache.org/jira/browse/SPARK-20873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ruben Janssen >Assignee: Ruben Janssen > Fix For: 2.3.0 > > > For unsupported column type, we simply output the column type instead of the > type name. > {noformat} > java.lang.Exception: Unsupported type: > org.apache.spark.sql.execution.columnar.ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d > {noformat} > We should improve it by outputting its name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
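The improvement can be sketched as giving each column type an explicit name that the error message uses, instead of relying on toString, which prints ClassName@hashCode for anonymous classes as in the quoted exception. The names below are illustrative, not Spark's actual ones:

```scala
// Hypothetical sketch: carry a human-readable typeName on each column type and
// use it when raising the unsupported-type error.
trait NamedColumnType { def typeName: String }
case object LongColumnType extends NamedColumnType { val typeName = "LONG" }

object ColumnErrors {
  def unsupportedTypeMessage(t: NamedColumnType): String =
    s"Unsupported type: ${t.typeName}"
}
```

With this, the error would read "Unsupported type: LONG" rather than an opaque object reference.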
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068793#comment-16068793 ] Apache Spark commented on SPARK-21253: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18467 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif
[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation
[ https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20783: - Description: Current {{ColumnVector}} handles uncompressed data for parquet. For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to have compressed data. As first step of this implementation, this JIRA supports primitive data and string types. was: Current {{ColumnVector}} accepts only primitive-type Java array as an input for array. It is good to keep data from Parquet. On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used to represent array, map, and struct. To keep these data, this JIRA entry enhances {{ColumnVector}} to keep UnsafeArrayData. > Enhance ColumnVector to support compressed representation > - > > Key: SPARK-20783 > URL: https://issues.apache.org/jira/browse/SPARK-20783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnVector}} handles uncompressed data for parquet. > For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to > have compressed data. > As first step of this implementation, this JIRA supports primitive data and > string types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation
[ https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-20783: - Summary: Enhance ColumnVector to support compressed representation (was: Enhance ColumnVector to keep UnsafeArrayData for array) > Enhance ColumnVector to support compressed representation > - > > Key: SPARK-20783 > URL: https://issues.apache.org/jira/browse/SPARK-20783 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnVector}} accepts only primitive-type Java array as an input > for array. It is good to keep data from Parquet. > On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used > to represent array, map, and struct. To keep these data, this JIRA entry > enhances {{ColumnVector}} to keep UnsafeArrayData. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-21253: - Target Version/s: 2.2.0 > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-21253: Assignee: Shixiong Zhu > Cannot fetch big blocks to disk > > > Key: SPARK-21253 > URL: https://issues.apache.org/jira/browse/SPARK-21253 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Shixiong Zhu > Attachments: ui-thread-dump-jqhadoop221-154.gif
[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068700#comment-16068700 ] ARUNA KIRAN NULU commented on SPARK-4131: - Is this feature available? > Support "Writing data into the filesystem from queries" > --- > > Key: SPARK-4131 > URL: https://issues.apache.org/jira/browse/SPARK-4131 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Assignee: Fei Wang >Priority: Critical > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > Spark SQL does not support writing data into the filesystem from queries. > e.g.: > {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * > from page_views; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT
[ https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068553#comment-16068553 ] Monica Raj commented on SPARK-21246: Thanks for your response. I also tried with Seq(3) as Seq(3L), however I had changed this back during the course of trying other options. I should also mention that we are running Zeppelin 0.6.0. I tried running the code you provided and still got the following output: import org.apache.spark.sql.types._ import org.apache.spark.sql.Row schemaString: String = name lstVals: Seq[Long] = List(3) rowRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[30] at map at :59 res20: Array[org.apache.spark.sql.Row] = Array([3]) fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,LongType,true)) schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,LongType,true)) StructType(StructField(name,LongType,true))peopleDF: org.apache.spark.sql.DataFrame = [name: bigint] ++ |name| ++ | 3| ++ > Unexpected Data Type conversion from LONG to BIGINT > --- > > Key: SPARK-21246 > URL: https://issues.apache.org/jira/browse/SPARK-21246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Using Zeppelin Notebook or Spark Shell >Reporter: Monica Raj > > The unexpected conversion occurred when creating a data frame out of an > existing data collection. 
The following code can be run in a Zeppelin notebook > to reproduce the bug: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schemaString = "name" > val lstVals = Seq(3) > val rowRdd = sc.parallelize(lstVals).map(x => Row( x )) > rowRdd.collect() > // Generate the schema from the schema string > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, LongType, nullable = true)) > val schema = StructType(fields) > print(schema) > val peopleDF = sqlContext.createDataFrame(rowRdd, schema) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
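A note on the report above: {{bigint}} is simply Spark SQL's external name for {{LongType}} (a 64-bit integer), so the printed schema is not itself a conversion bug. The mismatch that matters is the runtime type of the row values. This is a minimal JVM-only sketch (no Spark required; class and method names here are illustrative, not Spark's) of why Seq(3) and Seq(3L) behave differently:

```java
public class RowValueTypes {
    // True when the boxed value is a java.lang.Long, i.e. what a
    // LongType ("bigint") schema field expects to receive at runtime.
    static boolean isBoxedLong(Object v) {
        return v instanceof Long;
    }

    public static void main(String[] args) {
        // Scala's Seq(3) supplies Int elements, which box to Integer;
        // Seq(3L) supplies Long elements, which box to java.lang.Long.
        System.out.println(isBoxedLong(3));   // false: int boxes to Integer
        System.out.println(isBoxedLong(3L));  // true: long boxes to Long
    }
}
```

This is why Seq(3L), as suggested earlier in the thread, is the fix: a LongType field needs a boxed Long, and an Integer in the row does not satisfy it even though the schema display says bigint.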
[jira] [Resolved] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21254. --- Resolution: Duplicate Duplicate of many JIRAs related to making the initial read faster > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Priority: Minor > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the beginning of time. When a large number of > rows is returned (10k+), page load time grows dramatically, causing a 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that the network is not an issue here and > only causes a small latency (<1s), while most of the time is spent in the UI > processing the results, according to Chrome DevTools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068520#comment-16068520 ] Sean Owen commented on SPARK-21255: --- Is the change to omit declaringClass too? > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for an Enum type (or a bean with an enum property) > via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. > I did a little research and it turns out that in JavaTypeInference:126 the > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we then try to inspect recursively. Eventually we > try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" > with no read method, which leads to an NPE at TypeToken:495. > I think adding the property name "declaringClass" to the filter will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike updated SPARK-21255: - Description: When you try to create an encoder for an Enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. I did a little research and it turns out that in JavaTypeInference:126 the following code {code:java} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we then try to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to an NPE at TypeToken:495. I think adding the property name "declaringClass" to the filter will resolve this. was: When you try to create an encoder for an Enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. I did a little research and it turns out that in JavaTypeInference:126 the following code {code:scala} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we then try to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to an NPE at TypeToken:495. I think adding the property name "declaringClass" to the filter will resolve this. > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike > > When you try to create an encoder for an Enum type (or a bean with an enum property) > via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. > I did a little research and it turns out that in JavaTypeInference:126 the > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class named > "declaringClass", which we then try to inspect recursively. Eventually we > try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" > with no read method, which leads to an NPE at TypeToken:495. > I think adding the property name "declaringClass" to the filter will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21255) NPE when creating encoder for enum
Mike created SPARK-21255: Summary: NPE when creating encoder for enum Key: SPARK-21255 URL: https://issues.apache.org/jira/browse/SPARK-21255 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.0 Environment: org.apache.spark:spark-core_2.10:2.1.0 org.apache.spark:spark-sql_2.10:2.1.0 Reporter: Mike When you try to create an encoder for an Enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. I did a little research and it turns out that in JavaTypeInference:126 the following code {code:scala} val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") val fields = properties.map { property => val returnType = typeToken.method(property.getReadMethod).getReturnType val (dataType, nullable) = inferDataType(returnType) new StructField(property.getName, dataType, nullable) } (new StructType(fields), true) {code} filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we then try to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to an NPE at TypeToken:495. I think adding the property name "declaringClass" to the filter will resolve this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
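The introspection chain described in the report can be reproduced with the JDK alone. This sketch (class, enum, and helper names here are illustrative, not Spark code) lists the bean properties that java.beans.Introspector discovers on an enum type, which is how JavaTypeInference discovers them, and shows the declaringClass property that the "class" filter does not catch:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class EnumBeanProps {
    // Any enum reproduces the behavior; this one keeps the sketch self-contained.
    enum Color { RED, GREEN }

    // Collect bean property names the way bean introspection exposes them.
    static List<String> propertyNames(Class<?> cls) {
        List<String> names = new ArrayList<>();
        try {
            for (PropertyDescriptor p : Introspector.getBeanInfo(cls).getPropertyDescriptors()) {
                names.add(p.getName());
            }
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // "class" (from Object.getClass) is filtered out by Spark, but
        // "declaringClass" (from java.lang.Enum.getDeclaringClass) is not,
        // so inference recurses into java.lang.Class as described above.
        System.out.println(propertyNames(Color.class));
    }
}
```

The printed list contains both "class" and "declaringClass", which is exactly why filtering only on "class" is insufficient for enums.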
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Priority: Minor (was: Major) > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik >Priority: Minor > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the beginning of time. When a large number of > rows is returned (10k+), page load time grows dramatically, causing a 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that the network is not an issue here and > only causes a small latency (<1s), while most of the time is spent in the UI > processing the results, according to Chrome DevTools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Description: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time. When a large number of rows is returned (10k+), page load time grows dramatically, causing a 1min+ delay in Chrome and freezing the process in Firefox, Safari and IE. A simple inspection in Chrome shows that the network is not an issue here and only causes a small latency (<1s), while most of the time is spent in the UI processing the results, according to Chrome DevTools: !screenshot-1.png! was: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time, which in itself is not a big issue and only causes a small latency in the case of 10k+ rows returned from the server. The problem is that the UI spends most of the time processing the results, according to Chrome DevTools (network IO takes less than 1s): !screenshot-1.png! When a larger number of rows is returned (10k+), this time grows dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the beginning of time. When a large number of > rows is returned (10k+), page load time grows dramatically, causing a 1min+ delay > in Chrome and freezing the process in Firefox, Safari and IE. > A simple inspection in Chrome shows that the network is not an issue here and > only causes a small latency (<1s), while most of the time is spent in the UI > processing the results, according to Chrome DevTools: > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Description: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time, which in itself is not a big issue and only causes a small latency in the case of 10k+ rows returned from the server. The problem is that the UI spends most of the time processing the results, according to Chrome DevTools (network IO takes less than 1s): !screenshot-1.png! When a larger number of rows is returned (10k+), this time grows dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. was: Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time, which in itself is not a big issue and only causes a small latency in the case of 10k+ rows returned from the server. The problem is that the UI spends most of the time processing the results, according to Chrome DevTools (network IO takes less than 1s): !attachment-name.jpg|thumbnail! When a larger number of rows is returned (10k+), this time grows dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the beginning of time, which in itself is > not a big issue and only causes a small latency in the case of 10k+ rows returned > from the server. The problem is that the UI spends most of the time processing > the results, according to Chrome DevTools (network IO takes less > than 1s): > !screenshot-1.png! > When a larger number of rows is returned (10k+), this time grows > dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display
[ https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Parfenchik updated SPARK-21254: -- Attachment: screenshot-1.png > History UI: Taking over 1 minute for initial page display > - > > Key: SPARK-21254 > URL: https://issues.apache.org/jira/browse/SPARK-21254 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Dmitry Parfenchik > Attachments: screenshot-1.png > > > Currently on the first page load (if there is no limit set) the whole jobs > execution history is loaded since the beginning of time, which in itself is > not a big issue and only causes a small latency in the case of 10k+ rows returned > from the server. The problem is that the UI spends most of the time processing > the results, according to Chrome DevTools (network IO takes less > than 1s): > !attachment-name.jpg|thumbnail! > When a larger number of rows is returned (10k+), this time grows > dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21254) History UI: Taking over 1 minute for initial page display
Dmitry Parfenchik created SPARK-21254: - Summary: History UI: Taking over 1 minute for initial page display Key: SPARK-21254 URL: https://issues.apache.org/jira/browse/SPARK-21254 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: Dmitry Parfenchik Currently on the first page load (if there is no limit set) the whole jobs execution history is loaded since the beginning of time, which in itself is not a big issue and only causes a small latency in the case of 10k+ rows returned from the server. The problem is that the UI spends most of the time processing the results, according to Chrome DevTools (network IO takes less than 1s): !attachment-name.jpg|thumbnail! When a larger number of rows is returned (10k+), this time grows dramatically, causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari and IE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21052. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18301 [https://github.com/apache/spark/pull/18301] > Add hash map metrics to join > > > Key: SPARK-21052 > URL: https://issues.apache.org/jira/browse/SPARK-21052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We should add the avg hash map probe metric to the join operator and report it on the UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21052: --- Assignee: Liang-Chi Hsieh > Add hash map metrics to join > > > Key: SPARK-21052 > URL: https://issues.apache.org/jira/browse/SPARK-21052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > We should add the avg hash map probe metric to the join operator and report it on the UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
[ https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068301#comment-16068301 ] Steve Loughran commented on SPARK-12868: It's actually not the cause of that, merely the messenger. The cause is HADOOP-14383: a combination of Spark 2.2 & Hadoop 2.9+ will trigger the problem. The fix belongs in Hadoop. > ADD JAR via sparkSQL JDBC will fail when using a HDFS URL > - > > Key: SPARK-12868 > URL: https://issues.apache.org/jira/browse/SPARK-12868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Trystan Leftwich >Assignee: Weiqing Yang > Fix For: 2.2.0 > > > When trying to add a jar with an HDFS URI, i.e. > {code:sql} > ADD JAR hdfs:///tmp/foo.jar > {code} > via the Spark SQL JDBC interface, it will fail with: > {code:sql} > java.net.MalformedURLException: unknown protocol: hdfs > at java.net.URL.(URL.java:593) > at java.net.URL.(URL.java:483) > at java.net.URL.(URL.java:432) > at java.net.URI.toURL(URI.java:1089) > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578) > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
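The first frames of the quoted trace can be reproduced with plain java.net.URL, which rejects any scheme lacking a registered protocol handler; "hdfs" is not one of the built-ins. A minimal sketch (class and helper names are illustrative):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HdfsUrlDemo {
    // java.net.URL only understands schemes with a registered protocol
    // handler (http, https, file, jar, ...). Constructing a URL with any
    // other scheme throws MalformedURLException: unknown protocol.
    static boolean canParse(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canParse("file:///tmp/foo.jar"));  // true
        System.out.println(canParse("hdfs:///tmp/foo.jar"));  // false: unknown protocol: hdfs
    }
}
```

This is why converting the URI with URI.toURL, as in the quoted ClientWrapper.addJar frame, fails for hdfs:// paths unless a Hadoop filesystem URL stream handler has been installed.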
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253: Description: Spark *cluster* can reproduce, *local* can't: 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. A shuffle: {code:actionscript} scala> sc.parallelize(0 until 300, 10).repartition(2001).count() {code} The error messages: {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at 
scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183) at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at
[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21225: --- Assignee: yangZhiguo > decrease the Mem using for variable 'tasks' in function resourceOffers > -- > > Key: SPARK-21225 > URL: https://issues.apache.org/jira/browse/SPARK-21225 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: yangZhiguo >Assignee: yangZhiguo >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > The function 'resourceOffers' declares a variable 'tasks' to > store the tasks that have been allocated an executor. It is declared like this: > *{color:#d04437}val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](o.cores)){color}* > But I think this code only considers the situation of one task per core. > If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so > much space. I think it can be modified as follows: > {color:#14892c}*val tasks = shuffledOffers.map(o => new > ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers
[ https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21225.
Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18435
[https://github.com/apache/spark/pull/18435]

> decrease the Mem using for variable 'tasks' in function resourceOffers
> ----------------------------------------------------------------------
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253:
Description:

Spark *cluster* can reproduce, *local* can't:

1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
{code:actionscript}
$ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
{code}

2. Need a shuffle:
{code:actionscript}
scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
{code}

The error messages:
{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer
	at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
	at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
	at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
{noformat}
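As a rough model of what the {{spark.reducer.maxReqSizeShuffleToMem}} setting in the repro controls (a sketch under stated assumptions, not Spark's actual implementation, which lives in {{ShuffleBlockFetcherIterator}} and is more involved): shuffle requests above the threshold are streamed to disk rather than buffered in memory, so setting it to 1K forces nearly every fetch down the fetch-to-disk path this bug exercises. The helper names below are hypothetical:

```scala
// Hypothetical parser for size strings like "1K" or "64M" (simplified).
def bytesFromConf(s: String): Long = s.trim.toUpperCase match {
  case k if k.endsWith("K") => k.dropRight(1).toLong * 1024L
  case m if m.endsWith("M") => m.dropRight(1).toLong * 1024L * 1024L
  case b                    => b.toLong
}

// Hypothetical decision: requests larger than the threshold go to disk.
def fetchToDisk(reqSizeBytes: Long, maxReqSizeShuffleToMem: Long): Boolean =
  reqSizeBytes > maxReqSizeShuffleToMem

val threshold = bytesFromConf("1K") // the value used in the repro above
```

With a 1K threshold, even a modest multi-kilobyte shuffle request is streamed to disk, which is why the cluster repro triggers the failing code path so easily.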
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068232#comment-16068232 ] Yuming Wang commented on SPARK-21253:
It may also hang a {{spark-sql}} application: !ui-thread-dump-jqhadoop221-154.gif!

> Cannot fetch big blocks to disk
> -------------------------------
>
>                 Key: SPARK-21253
>                 URL: https://issues.apache.org/jira/browse/SPARK-21253
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yuming Wang
>         Attachments: ui-thread-dump-jqhadoop221-154.gif
[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21253:
Attachment: ui-thread-dump-jqhadoop221-154.gif

> Cannot fetch big blocks to disk
> -------------------------------
>
>                 Key: SPARK-21253
>                 URL: https://issues.apache.org/jira/browse/SPARK-21253
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yuming Wang
>         Attachments: ui-thread-dump-jqhadoop221-154.gif
[jira] [Updated] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21252:
Priority: Minor (was: Major)
Issue Type: Improvement (was: Bug)

I don't know what the underlying exact values are; they are likely available from the API. That would help determine whether there's an actual rounding problem. If not, I don't think this would be changed.

> The duration times showed by spark web UI are inaccurate
> --------------------------------------------------------
>
>                 Key: SPARK-21252
>                 URL: https://issues.apache.org/jira/browse/SPARK-21252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>            Reporter: igor mazor
>            Priority: Minor
>
> The duration times shown by the Spark UI are inaccurate and seem to be rounded. For example, when a job had two stages, the first executed in 47 ms and the second in 3 seconds, the total execution time shown by the UI is 4 seconds. In another example, the first stage executed in 20 ms and the second in 4 seconds; the total execution time shown by the UI was in that case also 4 seconds.
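The reporter's first example can be reproduced with a toy model: rounding each stage duration up to whole seconds before summing inflates the displayed total. `ceilSeconds` below is a hypothetical formatter, not Spark's actual UI code; comparing against the API's raw millisecond values, as suggested above, is what would confirm or refute this explanation.

```scala
// Hypothetical per-stage formatter: round a duration up to whole seconds.
def ceilSeconds(ms: Long): Long = (ms + 999) / 1000

val stageMs = Seq(47L, 3000L)                       // 47 ms + 3 s of real work
val exactTotalMs = stageMs.sum                      // 3047 ms, i.e. about 3 s
val perStageRounded = stageMs.map(ceilSeconds).sum  // 1 + 3 = 4 "seconds"
```

If the UI sums already-rounded per-stage values, 47 ms + 3 s displays as 4 s even though the exact total is barely over 3 s; summing raw milliseconds and rounding once would avoid the discrepancy.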
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21253:
Assignee: Apache Spark

> Cannot fetch big blocks to disk
> -------------------------------
>
>                 Key: SPARK-21253
>                 URL: https://issues.apache.org/jira/browse/SPARK-21253
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yuming Wang
>            Assignee: Apache Spark
[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21253:
Assignee: (was: Apache Spark)

> Cannot fetch big blocks to disk
> -------------------------------
>
>                 Key: SPARK-21253
>                 URL: https://issues.apache.org/jira/browse/SPARK-21253
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yuming Wang
[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk
[ https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068169#comment-16068169 ] Apache Spark commented on SPARK-21253:
User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18466

> Cannot fetch big blocks to disk
> -------------------------------
>
>                 Key: SPARK-21253
>                 URL: https://issues.apache.org/jira/browse/SPARK-21253
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Yuming Wang
[jira] [Created] (SPARK-21253) Cannot fetch big blocks to disk
Yuming Wang created SPARK-21253: --- Summary: Cannot fetch big blocks to disk Key: SPARK-21253 URL: https://issues.apache.org/jira/browse/SPARK-21253 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Yuming Wang This reproduces on a Spark *cluster* but not in *local* mode: 1. Start a spark-shell with {{spark.reducer.maxReqSizeShuffleToMem=1K}}: {code:actionscript} $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K {code} 2. Trigger a shuffle: {code:actionscript} scala> val count = sc.parallelize(0 until 300, 10).repartition(2001).collect().length {code} The error messages: {noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: Connection reset by peer
at org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at
{noformat}
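For context on what the threshold in the repro controls: {{spark.reducer.maxReqSizeShuffleToMem}} caps the size of a fetch request that may be buffered in memory; larger requests are streamed to disk, which is the code path that fails here. A minimal sketch of that per-request decision, in Python for illustration only (this is not Spark's actual implementation; the function name and shape are made up):

```python
def plan_fetch(block_sizes_bytes, max_req_size_to_mem=1024):
    """Decide, per shuffle fetch request, whether the data goes to
    memory or is streamed to disk.  Illustrative stand-in for the
    behaviour governed by spark.reducer.maxReqSizeShuffleToMem."""
    plan = []
    for size in block_sizes_bytes:
        target = "disk" if size > max_req_size_to_mem else "memory"
        plan.append((size, target))
    return plan

# With the 1K threshold from the repro, even modest shuffle blocks are
# forced down the fetch-to-disk path exercised by this bug report.
print(plan_fetch([512, 4096, 900, 1_049_000]))
```

Setting the threshold to 1K in the repro is simply a way to force every non-trivial block onto the disk-streaming path on a cluster.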
[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068147#comment-16068147 ] igor mazor edited comment on SPARK-21252 at 6/29/17 10:42 AM: -- Yes, it's indeed possible to drill into individual stages; however, the scenario I described in my last comment happens within a single stage. The first attempt, followed by a timeout after 20 seconds, and then the second, successful attempt are all in the same stage, so the UI shows a duration of 20 seconds, although that's not true. Also, if a stage indeed took 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding, which also causes inconsistent presentation of the results. If you already do the rounding, why isn't the 40 ms rounded to 0 seconds? I also noticed now that, for example, for a stage that took 42 ms (Job Duration), when I click on that stage I see Duration = 38 ms, and when going deeper into the Summary Metrics for Tasks I see that the longest task was 35 ms, so I am not sure where all the gaps went.
> The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times shown by the Spark UI are inaccurate and seem to be rounded. > For example, when a job had 2 stages, the first stage executed in 47 ms and the second > stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. > In another example, the first stage executed in 20 ms and the second stage in 4 > seconds; the total execution time shown by the UI would in that case also be > 4 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
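The inconsistency the reporter describes can be checked with simple arithmetic: no single whole-second display rule (round-half or ceiling) reproduces the "4 seconds" the UI showed for both stage pairs. A quick Python check (illustrative only; this is not the Web UI's actual formatting code):

```python
import math

def show(ms):
    # Two plausible whole-second display rules for a duration in ms;
    # neither preserves sub-second stage times, and they disagree.
    return {"round": f"{round(ms / 1000)} s", "ceil": f"{math.ceil(ms / 1000)} s"}

# Stage pairs from the report: (47 ms, 3 s) and (20 ms, 4 s),
# both displayed by the UI as a 4-second total.
for stages in ([47, 3000], [20, 4000]):
    total = sum(stages)
    print(total, show(total))
```

Rounding 3047 ms gives 3 s but ceiling gives 4 s, while for 4020 ms rounding gives 4 s and ceiling gives 5 s, so whichever rule the UI applies, one of the two reported totals cannot come out as "4 seconds" consistently.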
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068147#comment-16068147 ] igor mazor commented on SPARK-21252: Yes, it's indeed possible to drill into individual stages; however, the scenario I described in my last comment happens within a single stage. The first attempt, followed by a timeout after 20 seconds, and then the second, successful attempt are all in the same stage, so the UI shows a duration of 20 seconds, although that's not true. Also, if a stage indeed took 40 ms, the UI shows 40 ms as the duration, but with seconds there is rounding, which also causes inconsistent presentation of the results. If you already do the rounding, why isn't the 40 ms rounded to 0 seconds? I also noticed now that, for example, for a stage that took 42 ms (Job Duration), when I click on that stage I see Duration = 38 ms, and when going deeper into the Summary Metrics for Tasks I see that the longest task was 35 ms, so I am not sure where all the gaps went. > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times shown by the Spark UI are inaccurate and seem to be rounded. > For example, when a job had 2 stages, the first stage executed in 47 ms and the second > stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. > In another example, the first stage executed in 20 ms and the second stage in 4 > seconds; the total execution time shown by the UI would in that case also be > 4 seconds.
[jira] [Resolved] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.
[ https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-17689. - Resolution: Cannot Reproduce I am unable to reproduce this anymore; it looks like it may have been fixed by other changes. > _temporary files breaks the Spark SQL streaming job. > > > Key: SPARK-17689 > URL: https://issues.apache.org/jira/browse/SPARK-17689 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma > > Steps to reproduce: > 1) Start a streaming job which reads from HDFS location hdfs://xyz/* > 2) Write content to hdfs://xyz/a > . > . > repeat a few times. > Then the job breaks as follows.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/input/t5/_temporary
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464)
> at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
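The failure happens because the file source lists a directory while a concurrent writer's Hadoop-style working files (_temporary, _SUCCESS) are appearing and disappearing underneath it. The usual guard is to skip any path with a component starting with "_" or "." when listing input files, which is the convention Hadoop's default hidden-file filter follows. A small Python sketch of that listing-side filter (illustrative; not Spark's actual fix for this ticket):

```python
def visible_files(paths):
    """Drop Hadoop-style working files: any path containing a
    component that starts with '_' (e.g. _temporary, _SUCCESS)
    or '.'.  Sketch of the listing-side guard only."""
    def hidden(p):
        return any(part.startswith(("_", ".")) for part in p.split("/") if part)
    return [p for p in paths if not hidden(p)]

print(visible_files([
    "/input/t5/part-00000",
    "/input/t5/_temporary/0/task_00/part-00000",
    "/input/t5/_SUCCESS",
]))  # -> ['/input/t5/part-00000']
```

Filtering at listing time avoids the race where a _temporary directory is seen by the listing but deleted before getFileStatus runs on it.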
[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
[ https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068106#comment-16068106 ] Kaushal Prajapati commented on SPARK-19908: --- [~zhanzhang] can you please share some example code for which you are getting this error? > Direct buffer memory OOM should not cause stage retries. > > > Key: SPARK-19908 > URL: https://issues.apache.org/jira/browse/SPARK-19908 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the > exception will be changed to FetchFailedException, causing stage retries.
> org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
> at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
> at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
> at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
> at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
> at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:658)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
> at
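The core of the complaint above is an error-classification problem: wrapping a fatal java.lang.OutOfMemoryError in FetchFailedException makes the scheduler treat it as a retryable fetch failure and resubmit the stage, which cannot succeed. A minimal Python analogue of the behaviour the ticket asks for, rethrowing fatal errors instead of converting them (illustrative sketch only, not Spark's code; MemoryError stands in for the JVM's direct-buffer OOM):

```python
class FetchFailedError(Exception):
    """Retryable: the scheduler would respond by resubmitting the stage."""

def fetch_block(do_fetch):
    """Run a fetch, converting only genuinely retryable I/O failures
    into FetchFailedError.  Fatal errors propagate unchanged so the
    task dies instead of triggering pointless stage retries."""
    try:
        return do_fetch()
    except MemoryError:
        raise  # fatal: do NOT disguise as a retryable fetch failure
    except OSError as e:  # e.g. "Connection reset by peer": worth retrying
        raise FetchFailedError(str(e)) from e
```

The fix direction, then, is to inspect the cause before wrapping: only network-level I/O errors should become FetchFailedException.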
[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate
[ https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068085#comment-16068085 ] Sean Owen commented on SPARK-21252: --- You can drill into individual stage times, right? > The duration times showed by spark web UI are inaccurate > > > Key: SPARK-21252 > URL: https://issues.apache.org/jira/browse/SPARK-21252 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1 >Reporter: igor mazor > > The duration times shown by the Spark UI are inaccurate and seem to be rounded. > For example, when a job had 2 stages, the first stage executed in 47 ms and the second > stage in 3 seconds, yet the total execution time shown by the UI is 4 seconds. > In another example, the first stage executed in 20 ms and the second stage in 4 > seconds; the total execution time shown by the UI would in that case also be > 4 seconds.