[jira] [Updated] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?

2017-06-29 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-21249:

Description: 
I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use the File Sink with mapGroupsWithState? I tried the cases below and got an 
exception in every one of them (a minimal sketch of the query shape follows the list).



* Aggregation without watermark and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
output mode not supported when there are streaming aggregations on streaming 
DataFrames/DataSets without watermark;;
{code}

* Aggregation with watermark and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with aggregation on a streaming 
DataFrame/Dataset;;
{code}

* No Aggregation and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;
{code}

* No Aggregation and Update mode (Should fail as File Sink supports only Append)

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Update output mode;
{code}

* No Aggregation and Complete mode (Should fail as File Sink supports only 
Append)

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Complete output mode;
{code}
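
For reference, below is a minimal sketch of the query shape behind these cases 
(the socket source and the Event case class are placeholders for illustration 
only); switching the output mode and adding or removing an aggregation step 
reproduces the exceptions listed above.

{code}
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(id: String, ts: Timestamp)

val spark = SparkSession.builder.appName("MapGroupsWithStateFileSink").getOrCreate()
import spark.implicits._

// Toy streaming source; each input line becomes one Event.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", "9999")
  .load()
  .as[String]
  .map(line => Event(line, new Timestamp(System.currentTimeMillis())))

// Arbitrary per-key state: a running count of events per id.
val counts = events
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (id: String, values: Iterator[Event], state: GroupState[Long]) =>
      val total = state.getOption.getOrElse(0L) + values.size
      state.update(total)
      (id, total)
  }

// The File Sink supports only Append; combined with mapGroupsWithState this is
// the "No Aggregation and Append mode" case above and fails at start().
counts.writeStream
  .format("parquet")
  .option("path", "/tmp/mgws-out")
  .option("checkpointLocation", "/tmp/mgws-chk")
  .outputMode(OutputMode.Append())
  .start()
{code}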

  was:
I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use File Sink with mapGroupsWithState? I verified below cases and getting 
exception in all.



* Aggregation without watermark and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
output mode not supported when there are streaming aggregations on streaming 
DataFrames/DataSets without watermark;;
{code}

* Aggregation with watermark and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with aggregation on a streaming 
DataFrame/Dataset;;
{code}

* No Aggregation and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;
{code}

* No Aggregation and Update mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Update output mode;
{code}

* No Aggregation and Complete mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Complete output mode;
{code}


> Is it possible to use File Sink with mapGroupsWithState in Structured 
> Streaming?
> 
>
> Key: SPARK-21249
> URL: https://issues.apache.org/jira/browse/SPARK-21249
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Baghel
>Priority: Minor
>
> I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
> use File Sink with mapGroupsWithState? I verified below cases and getting 
> exception in all.
> * Aggregation without watermark and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
> output mode not supported when there are streaming aggregations on streaming 
> DataFrames/DataSets without watermark;;
> {code}
> * Aggregation with watermark and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with aggregation on a streaming 
> DataFrame/Dataset;;
> {code}
> * No Aggregation and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with Append output mode on a streaming 
> DataFrame/Dataset;;
> {code}
> * No Aggregation and Update mode (Should fail as File Sink supports only 
> Append)
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Update output mode;
> {code}
> * No Aggregation and Complete mode (Should fail as File Sink supports only 
> Append)
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Complete output mode;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?

2017-06-29 Thread Amit Baghel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069528#comment-16069528
 ] 

Amit Baghel commented on SPARK-21249:
-

[~hyukjin.kwon] Sorry for filing this as a JIRA; I couldn't decide whether the 
type should be Bug or Question. I have updated the JIRA description with all the 
cases I tried for the File Sink with mapGroupsWithState. I will post this to the 
mailing list. Thanks.

> Is it possible to use File Sink with mapGroupsWithState in Structured 
> Streaming?
> 
>
> Key: SPARK-21249
> URL: https://issues.apache.org/jira/browse/SPARK-21249
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Baghel
>Priority: Minor
>
> I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
> use File Sink with mapGroupsWithState? I verified below cases and getting 
> exception in all.
> * Aggregation without watermark and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
> output mode not supported when there are streaming aggregations on streaming 
> DataFrames/DataSets without watermark;;
> {code}
> * Aggregation with watermark and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with aggregation on a streaming 
> DataFrame/Dataset;;
> {code}
> * No Aggregation and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with Append output mode on a streaming 
> DataFrame/Dataset;;
> {code}
> * No Aggregation and Update mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Update output mode;
> {code}
> * No Aggregation and Complete mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Complete output mode;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?

2017-06-29 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-21249:

Description: 
I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use File Sink with mapGroupsWithState? I verified below cases and getting 
exception in all.



* Aggregation without watermark and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
output mode not supported when there are streaming aggregations on streaming 
DataFrames/DataSets without watermark;;
{code}

* Aggregation with watermark and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with aggregation on a streaming 
DataFrame/Dataset;;
{code}

* No Aggregation and Append mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;
{code}

* No Aggregation and Update mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Update output mode;
{code}

* No Aggregation and Complete mode

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Complete output mode;
{code}

  was:
I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use File Sink with mapGroupsWithState? I verified below cases and getting 
exception in all.

* mapGroupsWithState - Aggregation without watermark and File Sink (Append mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
output mode not supported when there are streaming aggregations on streaming 
DataFrames/DataSets without watermark;;

mapGroupsWithState - Aggregation with watermark and File Sink (Append mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with aggregation on a streaming 
DataFrame/Dataset;;

mapGroupsWithState - No Aggregation and File Sink (Append mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;

mapGroupsWithState - No Aggregation and File Sink (Update mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Update output mode;

mapGroupsWithState - No Aggregation and File Sink (Complete mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Complete output mode;


> Is it possible to use File Sink with mapGroupsWithState in Structured 
> Streaming?
> 
>
> Key: SPARK-21249
> URL: https://issues.apache.org/jira/browse/SPARK-21249
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Baghel
>Priority: Minor
>
> I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
> use File Sink with mapGroupsWithState? I verified below cases and getting 
> exception in all.
> * Aggregation without watermark and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
> output mode not supported when there are streaming aggregations on streaming 
> DataFrames/DataSets without watermark;;
> {code}
> * Aggregation with watermark and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with aggregation on a streaming 
> DataFrame/Dataset;;
> {code}
> * No Aggregation and Append mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with Append output mode on a streaming 
> DataFrame/Dataset;;
> {code}
> * No Aggregation and Update mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Update output mode;
> {code}
> * No Aggregation and Complete mode
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Complete output mode;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?

2017-06-29 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-21249:

Description: 
I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use File Sink with mapGroupsWithState? I verified below cases and getting 
exception in all.

* mapGroupsWithState - Aggregation without watermark and File Sink (Append mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
output mode not supported when there are streaming aggregations on streaming 
DataFrames/DataSets without watermark;;

mapGroupsWithState - Aggregation with watermark and File Sink (Append mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with aggregation on a streaming 
DataFrame/Dataset;;

mapGroupsWithState - No Aggregation and File Sink (Append mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;

mapGroupsWithState - No Aggregation and File Sink (Update mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Update output mode;

mapGroupsWithState - No Aggregation and File Sink (Complete mode)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source 
parquet does not support Complete output mode;

  was:
I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use File Sink with mapGroupsWithState? With append output mode I am getting 
below exception.

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;


> Is it possible to use File Sink with mapGroupsWithState in Structured 
> Streaming?
> 
>
> Key: SPARK-21249
> URL: https://issues.apache.org/jira/browse/SPARK-21249
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Baghel
>Priority: Minor
>
> I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
> use File Sink with mapGroupsWithState? I verified below cases and getting 
> exception in all.
> * mapGroupsWithState - Aggregation without watermark and File Sink (Append 
> mode)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append 
> output mode not supported when there are streaming aggregations on streaming 
> DataFrames/DataSets without watermark;;
> mapGroupsWithState - Aggregation with watermark and File Sink (Append mode)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with aggregation on a streaming 
> DataFrame/Dataset;;
> mapGroupsWithState - No Aggregation and File Sink (Append mode)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with Append output mode on a streaming 
> DataFrame/Dataset;;
> mapGroupsWithState - No Aggregation and File Sink (Update mode)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Update output mode;
> mapGroupsWithState - No Aggregation and File Sink (Complete mode)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Data 
> source parquet does not support Complete output mode;



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069490#comment-16069490
 ] 

Reynold Xin commented on SPARK-21190:
-

That makes a lot of sense. So the idea would be to design APIs similar to a lot 
of the ply operations. I will think a bit about this next week. By the way, if 
somebody could also propose some different alternative APIs (perhaps as a gist), 
it would facilitate the discussion further.

How should we do memory management, though? Should we just have a configurable 
cap on the maximum number of rows? Should we just let the user crash (not a 
great experience)?
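
To make the row-cap idea concrete, here is a purely hypothetical sketch (all 
names are invented for illustration, not a proposal for the actual API):

{code}
// Hypothetical sketch: hand rows to the Python worker in batches of at most
// maxRowsPerBatch, instead of one whole partition at a time.
def inCappedBatches[T](rows: Iterator[T], maxRowsPerBatch: Int): Iterator[Seq[T]] =
  rows.grouped(maxRowsPerBatch)

// e.g. inCappedBatches(partitionIterator, 10000) bounds the memory held per
// batch regardless of partition size.
{code}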


> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-21258) Window result incorrect using complex object with spilling

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21258.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 18470
[https://github.com/apache/spark/pull/18470]

> Window result incorrect using complex object with spilling
> --
>
> Key: SPARK-21258
> URL: https://issues.apache.org/jira/browse/SPARK-21258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0, 2.1.2
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21176:
---

Assignee: Ingo Schuster

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>Assignee: Ingo Schuster
>  Labels: network, web-ui
> Fix For: 2.1.2, 2.2.0
>
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you do certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.
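
For reference, a minimal sketch of the workaround described above (assuming the 
embedded Jetty 9 API that Spark's JettyUtils uses; this is illustrative only, 
not the proposed fix):

{code}
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Enlarge the server's thread pool so the per-executor ProxyServlet selector
// threads cannot exhaust it (Jetty's default maximum is 200 threads).
val pool = new QueuedThreadPool(400)
pool.setDaemon(true)
val server = new Server(pool)
{code}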



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069424#comment-16069424
 ] 

Apache Spark commented on SPARK-21253:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18478

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> 

[jira] [Assigned] (SPARK-21261) SparkSQL regexpExpressions example

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21261:


Assignee: Apache Spark

> SparkSQL regexpExpressions example 
> ---
>
> Key: SPARK-21261
> URL: https://issues.apache.org/jira/browse/SPARK-21261
> Project: Spark
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: 2.1.1
>Reporter: zhangxin
>Assignee: Apache Spark
>
> The follow execute result.
> scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') 
> """).show
> +--+
> |regexp_replace(100-200, (d+), num)|
> +--+
> |   100-200|
> +--+
> scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') 
> """).show
> +---+
> |regexp_replace(100-200, (\d+), num)|
> +---+
> |num-num|
> +---+



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21261) SparkSQL regexpExpressions example

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21261:


Assignee: (was: Apache Spark)

> SparkSQL regexpExpressions example 
> ---
>
> Key: SPARK-21261
> URL: https://issues.apache.org/jira/browse/SPARK-21261
> Project: Spark
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: 2.1.1
>Reporter: zhangxin
>
> The follow execute result.
> scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') 
> """).show
> +--+
> |regexp_replace(100-200, (d+), num)|
> +--+
> |   100-200|
> +--+
> scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') 
> """).show
> +---+
> |regexp_replace(100-200, (\d+), num)|
> +---+
> |num-num|
> +---+



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21261) SparkSQL regexpExpressions example

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069416#comment-16069416
 ] 

Apache Spark commented on SPARK-21261:
--

User 'visaxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18477

> SparkSQL regexpExpressions example 
> ---
>
> Key: SPARK-21261
> URL: https://issues.apache.org/jira/browse/SPARK-21261
> Project: Spark
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: 2.1.1
>Reporter: zhangxin
>
> The follow execute result.
> scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') 
> """).show
> +--+
> |regexp_replace(100-200, (d+), num)|
> +--+
> |   100-200|
> +--+
> scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') 
> """).show
> +---+
> |regexp_replace(100-200, (\d+), num)|
> +---+
> |num-num|
> +---+



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21176.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 18437
[https://github.com/apache/spark/pull/18437]

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
> Fix For: 2.2.0, 2.1.2
>
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you do certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19659:


Assignee: jin xing  (was: Apache Spark)

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can 
> be large in skew situations. If an OOM happens during shuffle read, the job 
> is killed and users are told to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting that parameter and allocating 
> more memory can resolve the OOM, but the approach is not well suited to a 
> production environment, especially a data warehouse.
> Using Spark SQL as the data engine in a warehouse, users want a unified 
> parameter (e.g. memory) with less resource wasted (resource allocated but not 
> used).
> It is not always easy to predict skew situations; when they happen, it makes 
> sense to fetch remote blocks to disk for shuffle-read rather than kill the 
> job because of OOM. This approach was mentioned during the discussion in 
> SPARK-3019 by [~sandyr] and [~mridulm80].
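
For context, a minimal sketch of the two knobs touched on above (the values are 
arbitrary examples; {{spark.reducer.maxReqSizeShuffleToMem}} is the threshold 
discussed in SPARK-21253):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Current workaround: give each executor more off-heap headroom (MB).
  .set("spark.yarn.executor.memoryOverhead", "4096")
  // With fetch-to-disk: shuffle requests larger than this threshold are
  // written to disk instead of being buffered in memory.
  .set("spark.reducer.maxReqSizeShuffleToMem", "200m")
{code}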



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19659:


Assignee: Apache Spark  (was: jin xing)

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: Apache Spark
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-06-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19659:
-
Fix Version/s: (was: 2.2.0)

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-06-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-19659:
--

Reopened this as it's disabled.

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069396#comment-16069396
 ] 

Apache Spark commented on SPARK-19659:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18467

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.2.0
>
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory(offheap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large when skew situations. If OOM happens during shuffle read, job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more 
> memory can resolve the OOM. However the approach is not perfectly suitable 
> for production environment, especially for data warehouse.
> Using Spark SQL as data engine in warehouse, users hope to have a unified 
> parameter(e.g. memory) but less resource wasted(resource is allocated but not 
> used),
> It's not always easy to predict skew situations, when happen, it make sense 
> to fetch remote blocks to disk for shuffle-read, rather than
> kill the job because of OOM. This approach is mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21261) SparkSQL regexpExpressions example

2017-06-29 Thread zhangxin (JIRA)
zhangxin created SPARK-21261:


 Summary: SparkSQL regexpExpressions example 
 Key: SPARK-21261
 URL: https://issues.apache.org/jira/browse/SPARK-21261
 Project: Spark
  Issue Type: Documentation
  Components: Examples
Affects Versions: 2.1.1
Reporter: zhangxin


The following shows the execution results:

scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') """).show
+----------------------------------+
|regexp_replace(100-200, (d+), num)|
+----------------------------------+
|                           100-200|
+----------------------------------+

scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') """).show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
|                            num-num|
+-----------------------------------+



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21253.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18472
[https://github.com/apache/spark/pull/18472]

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> 

[jira] [Commented] (SPARK-20858) Document ListenerBus event queue size property

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069387#comment-16069387
 ] 

Apache Spark commented on SPARK-20858:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/18476

> Document ListenerBus event queue size property
> --
>
> Key: SPARK-20858
> URL: https://issues.apache.org/jira/browse/SPARK-20858
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Bjorn Jonsson
>Priority: Minor
>
> SPARK-15703 made the ListenerBus event queue size configurable via 
> spark.scheduler.listenerbus.eventqueue.size. This should be documented.
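
As a minimal illustration of the property to document (the value below is an 
arbitrary example, not a recommendation):

{code}
import org.apache.spark.SparkConf

// Capacity of the ListenerBus event queue, configurable since SPARK-15703.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.size", "20000")
{code}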



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20858:


Assignee: (was: Apache Spark)

> Document ListenerBus event queue size property
> --
>
> Key: SPARK-20858
> URL: https://issues.apache.org/jira/browse/SPARK-20858
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Bjorn Jonsson
>Priority: Minor
>
> SPARK-15703 made the ListenerBus event queue size configurable via 
> spark.scheduler.listenerbus.eventqueue.size. This should be documented.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20858) Document ListenerBus event queue size property

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20858:


Assignee: Apache Spark

> Document ListenerBus event queue size property
> --
>
> Key: SPARK-20858
> URL: https://issues.apache.org/jira/browse/SPARK-20858
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Bjorn Jonsson
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-15703 made the ListenerBus event queue size configurable via 
> spark.scheduler.listenerbus.eventqueue.size. This should be documented.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21235:


Assignee: Apache Spark

> UTest should clear temp results when run case 
> --
>
> Key: SPARK-21235
> URL: https://issues.apache.org/jira/browse/SPARK-21235
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: wangjiaochun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21235) UTest should clear temp results when run case

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21235:


Assignee: (was: Apache Spark)

> UTest should clear temp results when run case 
> --
>
> Key: SPARK-21235
> URL: https://issues.apache.org/jira/browse/SPARK-21235
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069330#comment-16069330
 ] 

Hyukjin Kwon commented on SPARK-21224:
--

Oh! I almost missed this comment. Sure, I will soon. Yes, I would like to open 
a new JIRA. Thank you [~felixcheung]. 

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a followup for SPARK-20431 but I just decided to make 
> this separate for R specifically as many PRs might be confusing.
> Please refer the discussion in the PR and SPARK-20431.
> In a simple view, this JIRA describes the support for a DDL-formatted string 
> as schema as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21235) UTest should clear temp results when run case

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069340#comment-16069340
 ] 

Apache Spark commented on SPARK-21235:
--

User 'wangjiaochun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18474

> UTest should clear temp results when run case 
> --
>
> Key: SPARK-21235
> URL: https://issues.apache.org/jira/browse/SPARK-21235
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21260:


Assignee: (was: Apache Spark)

> Remove the unused OutputFakerExec
> -
>
> Key: SPARK-21260
> URL: https://issues.apache.org/jira/browse/SPARK-21260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jiang Xingbo
>Priority: Minor
>
> OutputFakerExec was added long ago and is not used anywhere now so we should 
> remove it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21260) Remove the unused OutputFakerExec

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21260:


Assignee: Apache Spark

> Remove the unused OutputFakerExec
> -
>
> Key: SPARK-21260
> URL: https://issues.apache.org/jira/browse/SPARK-21260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>
> OutputFakerExec was added long ago and is not used anywhere now so we should 
> remove it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21260) Remove the unused OutputFakerExec

2017-06-29 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-21260:


 Summary: Remove the unused OutputFakerExec
 Key: SPARK-21260
 URL: https://issues.apache.org/jira/browse/SPARK-21260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Jiang Xingbo
Priority: Minor


OutputFakerExec was added long ago and is not used anywhere now so we should 
remove it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21260) Remove the unused OutputFakerExec

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069332#comment-16069332
 ] 

Apache Spark commented on SPARK-21260:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18473

> Remove the unused OutputFakerExec
> -
>
> Key: SPARK-21260
> URL: https://issues.apache.org/jira/browse/SPARK-21260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jiang Xingbo
>Priority: Minor
>
> OutputFakerExec was added long ago and is not used anywhere now so we should 
> remove it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT

2017-06-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069281#comment-16069281
 ] 

Hyukjin Kwon edited comment on SPARK-21246 at 6/30/17 1:27 AM:
---

Of course, it follows the schema a user specified.

{code}
scala> peopleDF.schema == schema
res9: Boolean = true
{code}

and this throws an exception as the schema is mismatched. At the least, I don't 
think the schema itself is an issue here. I am resolving this.


was (Author: hyukjin.kwon):
Of course, it follows the schema an user specified.

{code}
scala> peopleDF.schema == schema
res9: Boolean = true
{code}

and this throws an exception as the schema is mismatched. I don't think at 
least the same schema is not an issue here. I am resolving this.

> Unexpected Data Type conversion from LONG to BIGINT
> ---
>
> Key: SPARK-21246
> URL: https://issues.apache.org/jira/browse/SPARK-21246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Using Zeppelin Notebook or Spark Shell
>Reporter: Monica Raj
>
> The unexpected conversion occurred when creating a data frame out of an 
> existing data collection. The following code can be run in zeppelin notebook 
> to reproduce the bug:
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> val schemaString = "name"
> val lstVals = Seq(3)
> val rowRdd = sc.parallelize(lstVals).map(x => Row( x ))
> rowRdd.collect()
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, LongType, nullable = true))
> val schema = StructType(fields)
> print(schema)
> val peopleDF = sqlContext.createDataFrame(rowRdd, schema)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18199) Support appending to Parquet files

2017-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-18199.
---
Resolution: Invalid

I'm closing this as invalid. It is not a good idea to append to an existing 
file in distributed systems, especially given we might have two writers at the 
same time.


> Support appending to Parquet files
> --
>
> Key: SPARK-18199
> URL: https://issues.apache.org/jira/browse/SPARK-18199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory involves simply creating new 
> parquet files in the directory. With many small appends (for example, in a 
> streaming job with a short batch duration) this leads to an unbounded number 
> of small Parquet files accumulating. These must be cleaned up with some 
> frequency by removing them all and rewriting a new file containing all the 
> rows.
> It would be far better if Spark supported appending to the Parquet files 
> themselves. HDFS supports this, as does Parquet:
> * The Parquet footer can be read in order to obtain necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended containing the metadata and referencing 
> the new row groups as well as the previously existing row groups.
> This would result in a small amount of bloat in the file as new row groups 
> are added (since duplicate metadata would accumulate) but it's hugely 
> preferable to accumulating small files, which is bad for HDFS health and also 
> eventually leads to Spark being unable to read the Parquet directory at all.  
> Periodic rewriting of the file could still be performed in order to remove 
> the duplicate metadata.
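
As a practical aside, the periodic rewrite mentioned at the end of the description can be 
done today with plain DataFrame calls. A minimal sketch, assuming a directory of small files 
and hypothetical paths and partition count; this is the compaction workaround, not the 
proposed in-file append:

{code}
// Minimal sketch: compact a directory of many small Parquet files by rewriting it.
// Paths and the partition count are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-parquet").getOrCreate()
val df = spark.read.parquet("/data/events")          // many small files
df.repartition(8)                                    // pick a sensible target file count
  .write.mode("overwrite")
  .parquet("/data/events_compacted")                 // swap into place afterwards
{code}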



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?

2017-06-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21249.
--
Resolution: Invalid

I am resolving this per {{Type: Question}}. Questions should go to the mailing 
list.

> Is it possible to use File Sink with mapGroupsWithState in Structured 
> Streaming?
> 
>
> Key: SPARK-21249
> URL: https://issues.apache.org/jira/browse/SPARK-21249
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Baghel
>Priority: Minor
>
> I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
> use File Sink with mapGroupsWithState? With append output mode I am getting 
> below exception.
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> mapGroupsWithState is not supported with Append output mode on a streaming 
> DataFrame/Dataset;;
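
For completeness, a minimal sketch of the combination that is supported in 2.2: 
mapGroupsWithState with Update output mode and a non-file sink (console here). The socket 
source and the counting state are illustrative assumptions, not taken from the report:

{code}
// Sketch only: mapGroupsWithState + Update mode + console sink (the file sink is Append-only).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.GroupState

val spark = SparkSession.builder().appName("mgws-sketch").getOrCreate()
import spark.implicits._

val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999)  // illustrative source
  .load().as[String]

val counts = lines.groupByKey(identity).mapGroupsWithState[Long, (String, Long)] {
  (key: String, values: Iterator[String], state: GroupState[Long]) =>
    val n = state.getOption.getOrElse(0L) + values.size
    state.update(n)
    (key, n)
}

counts.writeStream.outputMode("update").format("console").start().awaitTermination()
{code}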



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21230.
--
Resolution: Invalid

It is hard for me to read and reproduce as well. Let's close this unless you are 
going to fix it or can provide explicit steps to reproduce that anyone can follow. 
Also, I think the point is to narrow it down; as it stands, I don't think this is an 
actionable JIRA.


> Spark Encoder with mysql Enum and data truncated Error
> --
>
> Key: SPARK-21230
> URL: https://issues.apache.org/jira/browse/SPARK-21230
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
> Environment: macosX
>Reporter: Michael Kunkel
>
> I am using Spark via Java for a MYSQL/ML(machine learning) project.
> In the mysql database, I have a column "status_change_type" of type enum = 
> {broke, fixed} in a table called "status_change" in a DB called "test".
> I have an object StatusChangeDB that constructs the needed structure for the 
> table; however, I declared "status_change_type" as a String. I know the bytes of 
> a MySQL enum and a Java String are quite different, but I am using Spark, and 
> the encoder does not recognize enums properly. When I try to set the value of 
> the enum via a Java string, I receive the "data truncated" error:
> h5. org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: 
> Data truncated for column 'status_change_type' at row 1 at 
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)
> I have tried to use enum for "status_change_type", however it fails with a 
> stack trace of
> h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
> at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
> I have tried the JDBC setting "jdbcCompliantTruncation=false", but this 
> does nothing; I get the same "data truncated" error as first stated. 
> Here is my JDBC options map, in case I am using 
> "jdbcCompliantTruncation=false" incorrectly.
> public static Map jdbcOptions() {
> Map jdbcOptions = new HashMap();
> jdbcOptions.put("url", 
> 

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-06-29 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069300#comment-16069300
 ] 

Helena Edelson commented on SPARK-18057:


IMHO, rename them to kafka-0-11 to be explicit, and wait until Kafka 0.11.1.0, which per 
https://issues.apache.org/jira/browse/KAFKA-4879 resolves the last blocker to 
upgrading?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT

2017-06-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21246.
--
Resolution: Invalid

Of course, it follows the schema a user specified.

{code}
scala> peopleDF.schema == schema
res9: Boolean = true
{code}

and this throws an exception as the schema is mismatched. At the least, I don't 
think the schema itself is an issue here. I am resolving this.

> Unexpected Data Type conversion from LONG to BIGINT
> ---
>
> Key: SPARK-21246
> URL: https://issues.apache.org/jira/browse/SPARK-21246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Using Zeppelin Notebook or Spark Shell
>Reporter: Monica Raj
>
> The unexpected conversion occurred when creating a data frame out of an 
> existing data collection. The following code can be run in zeppelin notebook 
> to reproduce the bug:
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> val schemaString = "name"
> val lstVals = Seq(3)
> val rowRdd = sc.parallelize(lstVals).map(x => Row( x ))
> rowRdd.collect()
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, LongType, nullable = true))
> val schema = StructType(fields)
> print(schema)
> val peopleDF = sqlContext.createDataFrame(rowRdd, schema)
> {code}
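
One detail worth noting for anyone hitting this: LongType is Spark SQL's 8-byte integer, and 
its SQL/DDL name is bigint, so a LongType column rendered as BIGINT is the same type under a 
different name, not a conversion. A quick check (a sketch, assuming the standard types API):

{code}
// LongType's SQL name is "bigint"; seeing BIGINT is the expected rendering of the
// user-specified LongType, not an unexpected conversion.
import org.apache.spark.sql.types.LongType

println(LongType.simpleString)   // prints "bigint"
{code}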



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-29 Thread Leif Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069269#comment-16069269
 ] 

Leif Walsh commented on SPARK-21190:


I agree with [~icexelloss] that we should aim to provide an API which provides 
"logical" groups of data to the UDF rather than the implementation detail of 
providing partitions wholesale.  Setting aside for the moment the problem with 
dataset skew which could cause one group to be very large, let's look at some 
use cases.

One obvious use case that tools like dplyr and pandas support is 
{{df.groupby(...).aggregate(...)}}.  Here, we group on some key and apply a 
function to each logical group.  This can be used to e.g. demean each group 
w.r.t. its cohort.  Another use case that we care about with 
[Flint|https://github.com/twosigma/flint] is aggregating over a window.  In 
pandas terminology this is the {{rolling}} operator.  One might want to, for 
each row, perform a moving window average or rolling regression over a history 
of some size.  The windowed aggregation poses a performance question that the 
groupby case doesn't: namely, if we naively send each window to the python 
worker independently, we're transferring a lot of duplicate data since each 
overlapped window contains many of the same rows.  An option here is to 
transfer the entire partition on the backend and then instruct the python 
worker to call the UDF with slices of the whole dataset according to the 
windowing requested by the user.

I think the idea of presenting a whole partition in a pandas dataframe to a UDF 
is a bit off-track.  If someone really wants to apply a python function to the 
"whole" dataset, they'd be best served by pulling those data back to the driver 
and just using pandas; if they tried to use spark's partitions, they'd get 
somewhat arbitrary partitions and have to implement some kind of merge operator 
on their own.  However, with grouped and windowed aggregations, we can provide 
an API which truly is parallelizable and useful.

I want to focus on use cases where we actually can parallelize without 
requiring a merge operator right now.  Aggregators in pandas and related tools 
in the ecosystem usually assume they have access to all the data for an 
operation and don't need to merge results of subaggregations.  For aggregations 
over larger datasets you'd really want to encourage the use of native Spark 
operations (that use e.g. {{treeAggregate}}).

Does that make sense?  I think it focuses the problem nicely enough that it becomes 
fairly tractable.

I think the really hard part of this API design is deciding what the inputs and 
outputs of the UDF look like, and providing for the myriad use cases therein.  
For example, one might want to aggregate each group down to a scalar (e.g. 
mean) and do something with that (either produce a reduced dataset with one 
value per group, or add a column where each group has the same value across all 
rows), or one might want to compute over the group and produce a value per row 
within the group and attach that as a new column (e.g. demeaning or ranking).  
These translate roughly to the differences between the [**ply operations in 
dplyr|https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf] or the 
differences in pandas between {{df.groupby(...).agg(...)}} and 
{{df.groupby(...).transform(...)}} and {{df.groupby(...).apply(...)}}.
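
To make the two shapes above concrete, here is a small sketch with today's non-vectorized 
DataFrame APIs and made-up toy data: reducing each logical group to a scalar versus attaching 
a per-group value to every row.

{code}
// Illustrative only: "agg" vs "transform" over logical groups, in existing Spark APIs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().appName("group-shapes").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)).toDF("key", "value")

// groupby(...).agg(...): one row per group
val perGroup = df.groupBy("key").agg(avg("value").as("mean"))

// groupby(...).transform(...): one value per input row (demeaning within its group)
val demeaned = df.withColumn(
  "demeaned", col("value") - avg("value").over(Window.partitionBy("key")))
{code}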

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are 

[jira] [Closed] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master

2017-06-29 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K closed SPARK-21148.
-
Resolution: Duplicate

> Set SparkUncaughtExceptionHandler to the Master
> ---
>
> Key: SPARK-21148
> URL: https://issues.apache.org/jira/browse/SPARK-21148
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>
> If any thread of the Master gets an UncaughtException, that 
> thread terminates and the Master process keeps running without 
> functioning properly.
> I think we need to handle the UncaughtException and exit the Master 
> gracefully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069227#comment-16069227
 ] 

Apache Spark commented on SPARK-21253:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18472

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> 

[jira] [Assigned] (SPARK-21259) More rules for scalastyle

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21259:


Assignee: (was: Apache Spark)

> More rules for scalastyle
> -
>
> Key: SPARK-21259
> URL: https://issues.apache.org/jira/browse/SPARK-21259
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Priority: Minor
>
> During code review, we spent so much time on code style issues.
> It would be great if we add rules:
> 1) disallow space before colon
> 2) disallow space before right parentheses
> 3) disallow space after left parentheses



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21259) More rules for scalastyle

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069208#comment-16069208
 ] 

Apache Spark commented on SPARK-21259:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18471

> More rules for scalastyle
> -
>
> Key: SPARK-21259
> URL: https://issues.apache.org/jira/browse/SPARK-21259
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Priority: Minor
>
> During code review, we spent so much time on code style issues.
> It would be great if we add rules:
> 1) disallow space before colon
> 2) disallow space before right parentheses
> 3) disallow space after left parentheses



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21259) More rules for scalastyle

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21259:


Assignee: Apache Spark

> More rules for scalastyle
> -
>
> Key: SPARK-21259
> URL: https://issues.apache.org/jira/browse/SPARK-21259
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> During code review, we spent so much time on code style issues.
> It would be great if we add rules:
> 1) disallow space before colon
> 2) disallow space before right parentheses
> 3) disallow space after left parentheses



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names

2017-06-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-20690:

Description: 
We add missing attributes into Filter in Analyzer. But we shouldn't do it 
through subqueries like this:

{code}
select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1
{code}

This query works in current codebase. However, the outside where clause 
shouldn't be able to refer t1.c1 attribute.

The root cause is that we previously allowed subqueries in FROM to have no alias names; 
this is confusing and isn't supported by various databases such as MySQL, 
Postgres, and Oracle. We shouldn't support it either.

  was:
We add missing attributes into Filter in Analyzer. But we shouldn't do it 
through subqueries like this:

{code}
select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1
{code}

This query works in current codebase. However, the outside where clause 
shouldn't be able to refer t1.c1 attribute.


> Subqueries in FROM should have alias names
> --
>
> Key: SPARK-20690
> URL: https://issues.apache.org/jira/browse/SPARK-20690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>  Labels: release-notes
> Fix For: 2.3.0
>
>
> We add missing attributes into Filter in Analyzer. But we shouldn't do it 
> through subqueries like this:
> {code}
> select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1
> {code}
> This query works in current codebase. However, the outside where clause 
> shouldn't be able to refer t1.c1 attribute.
> The root cause is that we previously allowed subqueries in FROM to have no alias names; 
> this is confusing and isn't supported by various databases such as MySQL, 
> Postgres, and Oracle. We shouldn't support it either.
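
For comparison, a sketch of the same example written the way it would have to look once FROM 
subqueries require aliases; note the subquery must also expose c1 for the outer filter to 
reference it (the table and column names are taken from the example above):

{code}
// Sketch only: with a required alias, the outer WHERE can reference only the
// columns the aliased subquery actually exposes.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alias-sketch").getOrCreate()
spark.sql("SELECT 1 FROM (SELECT c1 FROM onerow LIMIT 1) t WHERE t.c1 = 1")
{code}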



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21259) More rules for scalastyle

2017-06-29 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-21259:
--

 Summary: More rules for scalastyle
 Key: SPARK-21259
 URL: https://issues.apache.org/jira/browse/SPARK-21259
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Gengliang Wang
Priority: Minor


During code review, we spent so much time on code style issues.
It would be great if we add rules:
1) disallow space before colon
2) disallow space before right parentheses
3) disallow space after left parentheses
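
For illustration only, here are the spacing patterns the three proposed rules would flag (a 
sketch of the style being targeted, not the actual scalastyle configuration):

{code}
// Illustrative only: what the proposed checks would reject vs. accept.
object SpacingExamples {
  def bad(x : Int): Int =        // rule 1: space before colon
    identity( x )                // rules 2 and 3: space after '(' and before ')'

  def good(x: Int): Int =
    identity(x)
}
{code}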



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20690) Subqueries in FROM should have alias names

2017-06-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-20690:

Summary: Subqueries in FROM should have alias names  (was: Analyzer 
shouldn't add missing attributes through subquery)

> Subqueries in FROM should have alias names
> --
>
> Key: SPARK-20690
> URL: https://issues.apache.org/jira/browse/SPARK-20690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>  Labels: release-notes
> Fix For: 2.3.0
>
>
> We add missing attributes into Filter in Analyzer. But we shouldn't do it 
> through subqueries like this:
> {code}
> select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1
> {code}
> This query works in current codebase. However, the outside where clause 
> shouldn't be able to refer t1.c1 attribute.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method

2017-06-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21188.
--
   Resolution: Fixed
 Assignee: Feng Liu
Fix Version/s: 2.3.0

> releaseAllLocksForTask should synchronize the whole method
> --
>
> Key: SPARK-21188
> URL: https://issues.apache.org/jira/browse/SPARK-21188
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Feng Liu
>Assignee: Feng Liu
> Fix For: 2.3.0
>
>
> Since the objects readLocksByTask, writeLocksByTask and infos are coupled and 
> supposed to be modified by other threads concurrently, all the reads and 
> writes of them in the releaseAllLocksForTask method should be protected by a 
> single synchronized block. The fine-grained synchronization in the current 
> code can cause some test flakiness.
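
A generic sketch of the pattern being described, with hypothetical field and type names 
rather than the actual BlockInfoManager code: the reads and writes of the coupled maps all 
happen under one synchronized block.

{code}
// Hedged sketch, hypothetical names: coarse-grained locking for coupled mutable state.
import scala.collection.mutable

class LockRegistry {
  private val readLocksByTask  = mutable.Map.empty[Long, mutable.Set[String]]
  private val writeLocksByTask = mutable.Map.empty[Long, mutable.Set[String]]

  // A single synchronized block covers every read and write of the coupled maps,
  // so no other thread can observe them in a half-updated state.
  def releaseAllLocksForTask(taskId: Long): Seq[String] = synchronized {
    val released =
      readLocksByTask.remove(taskId).getOrElse(mutable.Set.empty[String]) ++
      writeLocksByTask.remove(taskId).getOrElse(mutable.Set.empty[String])
    released.toSeq
  }
}
{code}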



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum

2017-06-29 Thread Mike (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069146#comment-16069146
 ] 

Mike commented on SPARK-21255:
--

Yes, but there may be other problems (at least I cannot guarantee there won't 
be), since it seems like enums were never used before.

> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out that in JavaTypeInference:126 the 
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.
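
A standalone sketch of the filter change the reporter suggests (illustrative; the actual 
patch may differ):

{code}
// Hedged sketch: also skip the enum-only "declaringClass" bean property so that
// inspection does not recurse into java.lang.Class / ClassLoader.
import java.beans.{Introspector, PropertyDescriptor}

def inspectableProperties(cls: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(cls)
  beanInfo.getPropertyDescriptors
    .filterNot(p => p.getName == "class" || p.getName == "declaringClass")
}
{code}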



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069109#comment-16069109
 ] 

Yuming Wang commented on SPARK-21253:
-

I checked it, all jars are latest 2.2.0-rcX.

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> 

[jira] [Commented] (SPARK-21258) Window result incorrect using complex object with spilling

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069084#comment-16069084
 ] 

Apache Spark commented on SPARK-21258:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/18470

> Window result incorrect using complex object with spilling
> --
>
> Key: SPARK-21258
> URL: https://issues.apache.org/jira/browse/SPARK-21258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21258:


Assignee: Herman van Hovell  (was: Apache Spark)

> Window result incorrect using complex object with spilling
> --
>
> Key: SPARK-21258
> URL: https://issues.apache.org/jira/browse/SPARK-21258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21258) Window result incorrect using complex object with spilling

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21258:


Assignee: Apache Spark  (was: Herman van Hovell)

> Window result incorrect using complex object with spilling
> --
>
> Key: SPARK-21258
> URL: https://issues.apache.org/jira/browse/SPARK-21258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21258) Window result incorrect using complex object with spilling

2017-06-29 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-21258:
-

 Summary: Window result incorrect using complex object with spilling
 Key: SPARK-21258
 URL: https://issues.apache.org/jira/browse/SPARK-21258
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-06-29 Thread Asher Krim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068988#comment-16068988
 ] 

Asher Krim commented on SPARK-20797:


This looks like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-19294? 

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model on large text data (nearly 1 billion Chinese 
> news abstracts), the training step goes well, but the save step fails. 
> Something like the following happens (e.g. on 1.6.1):
> Problem 1: bigger than spark.kryoserializer.buffer.max (raising that 
> param fixes problem 1, but then leads to problem 2).
> Problem 2: exceeds spark.akka.frameSize (raising this param too far 
> fails with out of memory; on versions > 2.0.0 the message is "exceeds max 
> allowed: spark.rpc.message.maxSize").
> The problem appears when the topic count is large (k=200 is OK, but k=300 fails) 
> and the vocabulary size is large (nearly 1,000,000) too.
> I found that word2vec's save function is similar to LocalLDAModel's save 
> function:
> word2vec's problem (using repartition(1) to save) was fixed in 
> [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition, when saving.
> word2vec's save method from the latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Referring to word2vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save method, 
> recompiled the code, and deployed the new LDA project with large data on our 
> cluster; it works.
> I hope this will be fixed in the next version.
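
A small sketch of the change described above, mirroring Word2Vec's size-based partitioning; 
the size formula is an assumption modelled on the quoted Word2Vec code, not a measured value:

{code}
// Hedged sketch: choose a partition count from the approximate serialized size of the
// topics matrix (k topics x vocabSize terms), instead of repartition(1).
def topicsSavePartitions(k: Int, vocabSize: Int, bufferSizeBytes: Long): Int = {
  val approxSize = (4L * vocabSize + 15) * k   // rough bytes, following the Word2Vec estimate
  ((approxSize / bufferSizeBytes) + 1).toInt
}

// e.g. spark.createDataFrame(topics)
//        .repartition(topicsSavePartitions(k, vocabSize, bufferSize))
//        .write.parquet(Loader.dataPath(path))
{code}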



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-06-29 Thread Asher Krim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068988#comment-16068988
 ] 

Asher Krim edited comment on SPARK-20797 at 6/29/17 9:19 PM:
-

This looks like a duplicate of https://issues.apache.org/jira/browse/SPARK-19294


was (Author: akrim):
This looks like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-19294? 

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model on large text data (nearly 1 billion Chinese 
> news abstracts), the training step goes well, but the save step fails. 
> Something like the following happens (e.g. on 1.6.1):
> Problem 1: bigger than spark.kryoserializer.buffer.max (raising that 
> param fixes problem 1, but then leads to problem 2).
> Problem 2: exceeds spark.akka.frameSize (raising this param too far 
> fails with out of memory; on versions > 2.0.0 the message is "exceeds max 
> allowed: spark.rpc.message.maxSize").
> The problem appears when the topic count is large (k=200 is OK, but k=300 fails) 
> and the vocabulary size is large (nearly 1,000,000) too.
> I found that word2vec's save function is similar to LocalLDAModel's save 
> function:
> word2vec's problem (using repartition(1) to save) was fixed in 
> [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition, when saving.
> word2vec's save method from the latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Referring to word2vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save method, 
> recompiled the code, and deployed the new LDA project with large data on our 
> cluster; it works.
> I hope this will be fixed in the next version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-06-29 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068980#comment-16068980
 ] 

Cody Koeninger commented on SPARK-18057:


Kafka 0.11 is now released.  

Are we upgrading spark artifacts named kafka-0-10 to use kafka 0.11, or are we 
renaming them to kafka-0-11?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068876#comment-16068876
 ] 

Shixiong Zhu commented on SPARK-21253:
--

[~q79969786] did you run Spark 2.2.0-rcX on Yarn which has a Spark 2.1.* 
shuffle service?

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
>

[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation

2017-06-29 Thread Mathieu DESPRIEE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu DESPRIEE updated SPARK-21257:
-
Issue Type: Improvement  (was: New Feature)

> LDA : create an Evaluator to enable cross validation
> 
>
> Key: SPARK-21257
> URL: https://issues.apache.org/jira/browse/SPARK-21257
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Mathieu DESPRIEE
>
> I suggest the creation of an LDAEvaluator to use with CrossValidator, using 
> logPerplexity as a metric for evaluation.
> Unfortunately, the computation of perplexity needs to access some internal 
> data of the model, and the current implementation of CrossValidator does not 
> pass the model being evaluated to the Evaluator. 
> A way could be to change the Evaluator.evaluate() method to pass the model 
> along with the dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21257) LDA : create an Evaluator to enable cross validation

2017-06-29 Thread Mathieu DESPRIEE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu DESPRIEE updated SPARK-21257:
-
Issue Type: New Feature  (was: Improvement)

> LDA : create an Evaluator to enable cross validation
> 
>
> Key: SPARK-21257
> URL: https://issues.apache.org/jira/browse/SPARK-21257
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Mathieu DESPRIEE
>
> I suggest the creation of an LDAEvaluator to use with CrossValidator, using 
> logPerplexity as a metric for evaluation.
> Unfortunately, the computation of perplexity needs to access some internal 
> data of the model, and the current implementation of CrossValidator does not 
> pass the model being evaluated to the Evaluator. 
> A way could be to change the Evaluator.evaluate() method to pass the model 
> along with the dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21257) LDA : create an Evaluator to enable cross validation

2017-06-29 Thread Mathieu DESPRIEE (JIRA)
Mathieu DESPRIEE created SPARK-21257:


 Summary: LDA : create an Evaluator to enable cross validation
 Key: SPARK-21257
 URL: https://issues.apache.org/jira/browse/SPARK-21257
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.1.1
Reporter: Mathieu DESPRIEE


I suggest the creation of an LDAEvaluator to use with CrossValidator, using 
logPerplexity as a metric for evaluation.

Unfortunately, the computation of perplexity needs to access some internal data 
of the model, and the current implementation of CrossValidator does not pass 
the model being evaluated to the Evaluator. 
A way could be to change the Evaluator.evaluate() method to pass the model 
along with the dataset.
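A rough sketch of what such an evaluator could look like is below. It is only an
illustration of the idea: the class name LDAPerplexityEvaluator and the extra
two-argument evaluate() are hypothetical, and that second method is exactly what
the current Evaluator/CrossValidator contract does not provide.

{code:scala}
import org.apache.spark.ml.Model
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset

// Hypothetical evaluator sketch for this proposal.
class LDAPerplexityEvaluator(override val uid: String) extends Evaluator {

  def this() = this(Identifiable.randomUID("ldaPerplexityEval"))

  // Proposed shape: evaluate() is also handed the fitted model, so that
  // model-internal quantities such as logPerplexity become reachable.
  def evaluate(dataset: Dataset[_], model: Model[_]): Double = model match {
    case lda: LDAModel => lda.logPerplexity(dataset)
    case other => throw new IllegalArgumentException(s"Expected an LDAModel, got $other")
  }

  // Current contract: only the transformed dataset is available, which is not
  // enough to compute perplexity -- the limitation this issue describes.
  override def evaluate(dataset: Dataset[_]): Double =
    throw new UnsupportedOperationException(
      "logPerplexity needs the fitted LDAModel, which evaluate(dataset) does not receive")

  // Perplexity: lower is better.
  override def isLargerBetter: Boolean = false

  override def copy(extra: ParamMap): LDAPerplexityEvaluator = defaultCopy(extra)
}
{code}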




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21256:


Assignee: Xiao Li  (was: Apache Spark)

> Add WithSQLConf to Catalyst Test
> 
>
> Key: SPARK-21256
> URL: https://issues.apache.org/jira/browse/SPARK-21256
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add WithSQLConf to the Catalyst module.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21256) Add WithSQLConf to Catalyst Test

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21256:


Assignee: Apache Spark  (was: Xiao Li)

> Add WithSQLConf to Catalyst Test
> 
>
> Key: SPARK-21256
> URL: https://issues.apache.org/jira/browse/SPARK-21256
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Add WithSQLConf to the Catalyst module.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21256) Add WithSQLConf to Catalyst Test

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068832#comment-16068832
 ] 

Apache Spark commented on SPARK-21256:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18469

> Add WithSQLConf to Catalyst Test
> 
>
> Key: SPARK-21256
> URL: https://issues.apache.org/jira/browse/SPARK-21256
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add WithSQLConf to the Catalyst module.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21256) Add WithSQLConf to Catalyst Test

2017-06-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21256:

Summary: Add WithSQLConf to Catalyst Test  (was: Add WithSQLConf to 
Catalyst)

> Add WithSQLConf to Catalyst Test
> 
>
> Key: SPARK-21256
> URL: https://issues.apache.org/jira/browse/SPARK-21256
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add WithSQLConf to the Catalyst module.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21256) Add WithSQLConf to Catalyst

2017-06-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21256:
---

 Summary: Add WithSQLConf to Catalyst
 Key: SPARK-21256
 URL: https://issues.apache.org/jira/browse/SPARK-21256
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


Add WithSQLConf to the Catalyst module.
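For context, a minimal sketch of the kind of helper being moved into Catalyst is
shown below: set some SQL conf entries for the duration of a test body, then
restore the previous values. This is only an illustration (it assumes
SQLConf.get is reachable from the test), not the actual patch.

{code:scala}
import org.apache.spark.sql.internal.SQLConf

trait WithSQLConfSketch {
  // Assumption for this sketch: the active SQLConf is reachable via SQLConf.get.
  protected def conf: SQLConf = SQLConf.get

  protected def withSQLConf(pairs: (String, String)*)(body: => Unit): Unit = {
    // Remember the original values so they can be restored afterwards.
    val originals = pairs.map { case (k, _) =>
      k -> (if (conf.contains(k)) Some(conf.getConfString(k)) else None)
    }
    pairs.foreach { case (k, v) => conf.setConfString(k, v) }
    try body finally {
      originals.foreach {
        case (k, Some(v)) => conf.setConfString(k, v)
        case (k, None)    => conf.unsetConf(k)
      }
    }
  }
}
{code}

A typical use is {{withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") { ... }}}
around an analyzer or optimizer assertion.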



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20873) Improve the error message for unsupported Column Type

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068800#comment-16068800
 ] 

Apache Spark commented on SPARK-20873:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/18468

> Improve the error message for unsupported Column Type
> -
>
> Key: SPARK-20873
> URL: https://issues.apache.org/jira/browse/SPARK-20873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruben Janssen
>Assignee: Ruben Janssen
> Fix For: 2.3.0
>
>
> For unsupported column type, we simply output the column type instead of the 
> type name. 
> {noformat}
> java.lang.Exception: Unsupported type: 
> org.apache.spark.sql.execution.columnar.ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d
> {noformat}
> We should improve it by outputting its name.
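One plausible shape of the improvement (a sketch only; the linked change may do
this differently) is to report the type's readable name, e.g. via
DataType.simpleString, instead of the default toString of an anonymous DataType
subclass:

{code:scala}
import org.apache.spark.sql.types.DataType

// Sketch: fail with a readable type name rather than something like
// "...ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d".
def unsupportedColumnType(dataType: DataType): Nothing =
  throw new RuntimeException(s"Unsupported type: ${dataType.simpleString}")
{code}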



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068793#comment-16068793
 ] 

Apache Spark commented on SPARK-21253:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18467

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> 

[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation

2017-06-29 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20783:
-
Description: 
Current {{ColumnVector}} handles uncompressed data for Parquet.

For handling the table cache, this JIRA entry adds an {{OnHeapCachedBatch}} class 
to hold compressed data.
As a first step of this implementation, this JIRA supports primitive data and 
string types.


  was:
Current {{ColumnVector}} accepts only primitive-type Java array as an input for 
array. It is good to keep data from Parquet.

On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used to 
represent array, map, and struct. To keep these data, this JIRA entry enhances 
{{ColumnVector}} to keep UnsafeArrayData.


> Enhance ColumnVector to support compressed representation
> -
>
> Key: SPARK-20783
> URL: https://issues.apache.org/jira/browse/SPARK-20783
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Current {{ColumnVector}} handles uncompressed data for parquet.
> For handling table cache, this JIRA entry adds {{OnHeapCachedBatch}} class to 
> have compressed data.
> As first step of this implementation, this JIRA supports primitive data and 
> string types.
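Purely as an illustration of what a compressed in-memory representation can mean
here (this is not the proposed {{OnHeapCachedBatch}} design, just a hypothetical
run-length encoding of an int column):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Hypothetical run-length encoding: each run is (value, runLength).
def rleEncode(column: Array[Int]): Array[(Int, Int)] = {
  val runs = ArrayBuffer.empty[(Int, Int)]
  var i = 0
  while (i < column.length) {
    var j = i
    while (j < column.length && column(j) == column(i)) j += 1
    runs += ((column(i), j - i))
    i = j
  }
  runs.toArray
}

// Decode back to the original column.
def rleDecode(runs: Array[(Int, Int)]): Array[Int] =
  runs.flatMap { case (value, length) => Array.fill(length)(value) }
{code}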



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20783) Enhance ColumnVector to support compressed representation

2017-06-29 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20783:
-
Summary: Enhance ColumnVector to support compressed representation  (was: 
Enhance ColumnVector to keep UnsafeArrayData for array)

> Enhance ColumnVector to support compressed representation
> -
>
> Key: SPARK-20783
> URL: https://issues.apache.org/jira/browse/SPARK-20783
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Current {{ColumnVector}} accepts only primitive-type Java array as an input 
> for array. It is good to keep data from Parquet.
> On the other hand, in Spark internal, {{UnsafeArrayData}} is frequently used 
> to represent array, map, and struct. To keep these data, this JIRA entry 
> enhances {{ColumnVector}} to keep UnsafeArrayData.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21253:
-
Target Version/s: 2.2.0

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> 

[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-21253:


Assignee: Shixiong Zhu

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> 

[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-06-29 Thread ARUNA KIRAN NULU (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068700#comment-16068700
 ] 

ARUNA KIRAN NULU commented on SPARK-4131:
-

Is this feature available?

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries is not supported by Spark SQL.
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT

2017-06-29 Thread Monica Raj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068553#comment-16068553
 ] 

Monica Raj commented on SPARK-21246:


Thanks for your response. I also tried Seq(3L) instead of Seq(3); however, I had 
changed this back while trying other options. I should also mention that we are 
running Zeppelin 0.6.0. I tried running the code you provided and still got the 
following output:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
schemaString: String = name
lstVals: Seq[Long] = List(3)
rowRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
MapPartitionsRDD[30] at map at <console>:59
res20: Array[org.apache.spark.sql.Row] = Array([3])
fields: Array[org.apache.spark.sql.types.StructField] = 
Array(StructField(name,LongType,true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,LongType,true))
StructType(StructField(name,LongType,true))peopleDF: 
org.apache.spark.sql.DataFrame = [name: bigint]
++
|name|
++
|   3|
++
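A note on the output above: {{bigint}} appears to be Spark's SQL name for
{{LongType}} rather than a real change of type, so a schema printed as
{{[name: bigint]}} should still be backed by LongType. A quick check (a sketch,
assuming a running spark-shell or Zeppelin Spark interpreter):

{code:scala}
import org.apache.spark.sql.types.LongType

// LongType's SQL/simple name is expected to print as "bigint".
println(LongType.simpleString)
{code}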

> Unexpected Data Type conversion from LONG to BIGINT
> ---
>
> Key: SPARK-21246
> URL: https://issues.apache.org/jira/browse/SPARK-21246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Using Zeppelin Notebook or Spark Shell
>Reporter: Monica Raj
>
> The unexpected conversion occurred when creating a data frame out of an 
> existing data collection. The following code can be run in zeppelin notebook 
> to reproduce the bug:
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> val schemaString = "name"
> val lstVals = Seq(3)
> val rowRdd = sc.parallelize(lstVals).map(x => Row( x ))
> rowRdd.collect()
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
> .map(fieldName => StructField(fieldName, LongType, nullable = true))
> val schema = StructType(fields)
> print(schema)
> val peopleDF = sqlContext.createDataFrame(rowRdd, schema)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21254) History UI: Taking over 1 minute for initial page display

2017-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21254.
---
Resolution: Duplicate

Duplicate of lots of JIRAs related to making the initial read faster

> History UI: Taking over 1 minute for initial page display
> -
>
> Key: SPARK-21254
> URL: https://issues.apache.org/jira/browse/SPARK-21254
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Dmitry Parfenchik
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> Currently on the first page load (if there is no limit set) the whole job 
> execution history is loaded since the beginning of time. When a large number 
> of rows is returned (10k+), page load time grows dramatically, causing a 
> 1min+ delay in Chrome and freezing the process in Firefox, Safari and IE.
> A simple inspection in Chrome shows that the network is not an issue here and 
> only adds a small latency (<1s), while most of the time is spent in the UI 
> processing the results, according to Chrome devtools:
> !screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum

2017-06-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068520#comment-16068520
 ] 

Sean Owen commented on SPARK-21255:
---

Is the change to omit declaringClass too?

> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out, that in JavaTypeInference:126 
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21255) NPE when creating encoder for enum

2017-06-29 Thread Mike (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike updated SPARK-21255:
-
Description: 
When you try to create an encoder for Enum type (or bean with enum property) 
via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
I did a little research and it turns out, that in JavaTypeInference:126 
following code 

{code:java}
val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
val fields = properties.map { property =>
  val returnType = 
typeToken.method(property.getReadMethod).getReturnType
  val (dataType, nullable) = inferDataType(returnType)
  new StructField(property.getName, dataType, nullable)
}
(new StructType(fields), true)
{code}

filters out properties named "class", because we wouldn't want to serialize 
that. But enum types have another property of type Class named 
"declaringClass", which we are trying to inspect recursively. Eventually we try 
to inspect ClassLoader class, which has property "defaultAssertionStatus" with 
no read method, which leads to NPE at TypeToken:495.

I think adding property name "declaringClass" to filtering will resolve this.

  was:
When you try to create an encoder for Enum type (or bean with enum property) 
via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
I did a little research and it turns out, that in JavaTypeInference:126 
following code 

{code:scala}
val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
val fields = properties.map { property =>
  val returnType = 
typeToken.method(property.getReadMethod).getReturnType
  val (dataType, nullable) = inferDataType(returnType)
  new StructField(property.getName, dataType, nullable)
}
(new StructType(fields), true)
{code}

filters out properties named "class", because we wouldn't want to serialize 
that. But enum types have another property of type Class named 
"declaringClass", which we are trying to inspect recursively. Eventually we try 
to inspect ClassLoader class, which has property "defaultAssertionStatus" with 
no read method, which leads to NPE at TypeToken:495.

I think adding property name "declaringClass" to filtering will resolve this.


> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out, that in JavaTypeInference:126 
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21255) NPE when creating encoder for enum

2017-06-29 Thread Mike (JIRA)
Mike created SPARK-21255:


 Summary: NPE when creating encoder for enum
 Key: SPARK-21255
 URL: https://issues.apache.org/jira/browse/SPARK-21255
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.0
 Environment: org.apache.spark:spark-core_2.10:2.1.0
org.apache.spark:spark-sql_2.10:2.1.0
Reporter: Mike


When you try to create an encoder for Enum type (or bean with enum property) 
via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
I did a little research and it turns out, that in JavaTypeInference:126 
following code 

{code:scala}
val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
val fields = properties.map { property =>
  val returnType = 
typeToken.method(property.getReadMethod).getReturnType
  val (dataType, nullable) = inferDataType(returnType)
  new StructField(property.getName, dataType, nullable)
}
(new StructType(fields), true)
{code}

filters out properties named "class", because we wouldn't want to serialize 
that. But enum types have another property of type Class named 
"declaringClass", which we are trying to inspect recursively. Eventually we try 
to inspect ClassLoader class, which has property "defaultAssertionStatus" with 
no read method, which leads to NPE at TypeToken:495.

I think adding property name "declaringClass" to filtering will resolve this.
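A sketch of the proposed filtering change (the helper name below is illustrative;
the real change would go into JavaTypeInference):

{code:scala}
import java.beans.{Introspector, PropertyDescriptor}

// Skip both "class" and the enum-only "declaringClass" property so that bean
// inspection does not recurse into Class / ClassLoader.
def inspectableProperties(clazz: Class[_]): Array[PropertyDescriptor] =
  Introspector.getBeanInfo(clazz).getPropertyDescriptors.filterNot { p =>
    p.getName == "class" || p.getName == "declaringClass"
  }
{code}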



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display

2017-06-29 Thread Dmitry Parfenchik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Parfenchik updated SPARK-21254:
--
Priority: Minor  (was: Major)

> History UI: Taking over 1 minute for initial page display
> -
>
> Key: SPARK-21254
> URL: https://issues.apache.org/jira/browse/SPARK-21254
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Dmitry Parfenchik
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> Currently on the first page load (if there is no limit set) the whole job 
> execution history is loaded since the beginning of time. When a large number 
> of rows is returned (10k+), page load time grows dramatically, causing a 
> 1min+ delay in Chrome and freezing the process in Firefox, Safari and IE.
> A simple inspection in Chrome shows that the network is not an issue here and 
> only adds a small latency (<1s), while most of the time is spent in the UI 
> processing the results, according to Chrome devtools:
> !screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display

2017-06-29 Thread Dmitry Parfenchik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Parfenchik updated SPARK-21254:
--
Description: 
Currently on the first page load (if there is no limit set) the whole job 
execution history is loaded since the beginning of time. When a large number of 
rows is returned (10k+), page load time grows dramatically, causing a 1min+ 
delay in Chrome and freezing the process in Firefox, Safari and IE.
A simple inspection in Chrome shows that the network is not an issue here and 
only adds a small latency (<1s), while most of the time is spent in the UI 
processing the results, according to Chrome devtools:
!screenshot-1.png!



  was:
Currently on the first page load (if there is no limit set) the whole job 
execution history is loaded since the beginning of time, which in itself is not 
a big issue and only causes a small latency in the case of 10k+ rows returned 
from the server. The problem is that the UI spends most of the time processing 
the results, according to Chrome devtools (network IO takes less than 1s):
!screenshot-1.png!
When a larger number of rows is returned (10k+), this time grows dramatically, 
causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari 
and IE.



> History UI: Taking over 1 minute for initial page display
> -
>
> Key: SPARK-21254
> URL: https://issues.apache.org/jira/browse/SPARK-21254
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Dmitry Parfenchik
> Attachments: screenshot-1.png
>
>
> Currently on the first page load (if there is no limit set) the whole job 
> execution history is loaded since the beginning of time. When a large number 
> of rows is returned (10k+), page load time grows dramatically, causing a 
> 1min+ delay in Chrome and freezing the process in Firefox, Safari and IE.
> A simple inspection in Chrome shows that the network is not an issue here and 
> only adds a small latency (<1s), while most of the time is spent in the UI 
> processing the results, according to Chrome devtools:
> !screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display

2017-06-29 Thread Dmitry Parfenchik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Parfenchik updated SPARK-21254:
--
Description: 
Currently on the first page load (if there is no limit set) the whole job 
execution history is loaded since the beginning of time, which in itself is not 
a big issue and only causes a small latency in the case of 10k+ rows returned 
from the server. The problem is that the UI spends most of the time processing 
the results, according to Chrome devtools (network IO takes less than 1s):
!screenshot-1.png!
When a larger number of rows is returned (10k+), this time grows dramatically, 
causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari 
and IE.


  was:
Currently on the first page load (if there is no limit set) the whole job 
execution history is loaded since the beginning of time, which in itself is not 
a big issue and only causes a small latency in the case of 10k+ rows returned 
from the server. The problem is that the UI spends most of the time processing 
the results, according to Chrome devtools (network IO takes less than 1s):
!attachment-name.jpg|thumbnail!
When a larger number of rows is returned (10k+), this time grows dramatically, 
causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari 
and IE.



> History UI: Taking over 1 minute for initial page display
> -
>
> Key: SPARK-21254
> URL: https://issues.apache.org/jira/browse/SPARK-21254
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Dmitry Parfenchik
> Attachments: screenshot-1.png
>
>
> Currently on the first page load (if there is no limit set) the whole job 
> execution history is loaded since the beginning of time, which in itself is 
> not a big issue and only causes a small latency in the case of 10k+ rows 
> returned from the server. The problem is that the UI spends most of the time 
> processing the results, according to Chrome devtools (network IO takes less 
> than 1s):
> !screenshot-1.png!
> When a larger number of rows is returned (10k+), this time grows 
> dramatically, causing 1min+ page loads in Chrome and freezing the process in 
> Firefox, Safari and IE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21254) History UI: Taking over 1 minute for initial page display

2017-06-29 Thread Dmitry Parfenchik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Parfenchik updated SPARK-21254:
--
Attachment: screenshot-1.png

> History UI: Taking over 1 minute for initial page display
> -
>
> Key: SPARK-21254
> URL: https://issues.apache.org/jira/browse/SPARK-21254
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Dmitry Parfenchik
> Attachments: screenshot-1.png
>
>
> Currently on the first page load (if there is no limit set) the whole job 
> execution history is loaded since the beginning of time, which in itself is 
> not a big issue and only causes a small latency in the case of 10k+ rows 
> returned from the server. The problem is that the UI spends most of the time 
> processing the results, according to Chrome devtools (network IO takes less 
> than 1s):
> !attachment-name.jpg|thumbnail!
> When a larger number of rows is returned (10k+), this time grows 
> dramatically, causing 1min+ page loads in Chrome and freezing the process in 
> Firefox, Safari and IE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21254) History UI: Taking over 1 minute for initial page display

2017-06-29 Thread Dmitry Parfenchik (JIRA)
Dmitry Parfenchik created SPARK-21254:
-

 Summary: History UI: Taking over 1 minute for initial page display
 Key: SPARK-21254
 URL: https://issues.apache.org/jira/browse/SPARK-21254
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: Dmitry Parfenchik


Currently on the first page load (if there is no limit set) the whole job 
execution history is loaded since the beginning of time, which in itself is not 
a big issue and only causes a small latency in the case of 10k+ rows returned 
from the server. The problem is that the UI spends most of the time processing 
the results, according to Chrome devtools (network IO takes less than 1s):
!attachment-name.jpg|thumbnail!
When a larger number of rows is returned (10k+), this time grows dramatically, 
causing 1min+ page loads in Chrome and freezing the process in Firefox, Safari 
and IE.
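If the slow page is the history server's application list, one possible
mitigation (an operational assumption, not part of this report) is to cap how
many applications the summary page renders:

{code}
# spark-defaults.conf on the history server side
spark.history.ui.maxApplications  500
{code}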




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21052) Add hash map metrics to join

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21052.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18301
[https://github.com/apache/spark/pull/18301]

> Add hash map metrics to join
> 
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add avg hash map probe metric to join operator and report it on UI.
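A sketch of the metric wiring this implies for the join operators is below. It
assumes the same average-valued SQL metric helper used for the existing
aggregate "avg hash probe" metric; the actual patch may differ.

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Declare an average-valued SQL metric on a join operator so the UI can show
// the average number of hash-map probes per lookup.
def avgHashProbeMetric(sc: SparkContext): SQLMetric =
  SQLMetrics.createAverageMetric(sc, "avg hash probe")
{code}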



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21052) Add hash map metrics to join

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21052:
---

Assignee: Liang-Chi Hsieh

> Add hash map metrics to join
> 
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add avg hash map probe metric to join operator and report it on UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-06-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068301#comment-16068301
 ] 

Steve Loughran commented on SPARK-12868:


It's actually not the cause of that, merely the messenger. The cause is 
HADOOP-14383: a combination of Spark 2.2 & Hadoop 2.9+ will trigger the problem. 
The fix belongs in Hadoop.
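Independently of where the fix lands, the MalformedURLException in the quoted
trace comes from plain java.net.URL not knowing the hdfs scheme. A commonly
cited workaround (a sketch, not the fix referenced above; note the factory can
be installed at most once per JVM) is to register Hadoop's URL handler before
the ADD JAR runs:

{code:scala}
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// Teach java.net.URL about hdfs:// URLs via Hadoop's stream handler factory.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
{code}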

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with a HDFS URI, i.E
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> Via the spark sql JDBC interface it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.(URL.java:593)
> at java.net.URL.(URL.java:483)
> at java.net.URL.(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21253:

Description: 
This reproduces on a Spark *cluster* but not in *local* mode:

1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
{code:actionscript}
$ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
{code}

2. A shuffle:
{code:actionscript}
scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
{code}

The error messages:

{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
java.io.IOException: Connection reset by peer
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
Connection reset by peer
at 
org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at 
io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
at 
io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at 
io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at 
org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 

[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21225:
---

Assignee: yangZhiguo

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Assignee: yangZhiguo
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store the 
> tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> But I think this code only considers the case of one task per core. If the 
> user configures "spark.task.cpus" as 2 or 3, it really doesn't need so much 
> space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}
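A worked example of the proposed sizing: with an offer of 8 cores and
spark.task.cpus = 2, the per-offer buffer capacity drops from 8 to
ceil(8 / 2) = 4 task slots.

{code:scala}
val cores = 8
val cpusPerTask = 2
val oldCapacity = cores                                        // current code: 8 slots
val newCapacity = math.ceil(cores * 1.0 / cpusPerTask).toInt   // proposed: 4 slots
println(s"old=$oldCapacity new=$newCapacity")
{code}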



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21225.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18435
[https://github.com/apache/spark/pull/18435]

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> However, this code only considers the case of one task per core. 
> If the user configures "spark.task.cpus" as 2 or 3, it does not need that 
> much space. It can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21253:

Description: 
This can be reproduced on a Spark *cluster*, but not in *local* mode:

1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
{code:actionscript}
$ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
{code}

2. Run a job that needs a shuffle:
{code:actionscript}
scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
{code}
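For reference, the same reproduction expressed as a small standalone driver rather 
than spark-shell flags; a minimal sketch, assuming the config key above can also be 
set through SparkConf and that the job is submitted to a cluster (the problem does 
not show up in local mode):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the reproduction above, set up programmatically.
object FetchToDiskRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SPARK-21253-repro")
      // Force even tiny shuffle requests to be streamed to disk instead of memory.
      .set("spark.reducer.maxReqSizeShuffleToMem", "1K")
    val sc = new SparkContext(conf)

    // 2001 shuffle partitions, matching the repartition(2001) from the report.
    val count = sc.parallelize(0 until 300, 10).repartition(2001).count()
    println(s"count = $count")

    sc.stop()
  }
}
{code}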

The error messages:

{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
java.io.IOException: Connection reset by peer
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
Connection reset by peer
at 
org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at 
io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
at 
io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at 
io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at 
org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 

[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068232#comment-16068232
 ] 

Yuming Wang commented on SPARK-21253:
-

It may also hang a {{spark-sql}} application:

!ui-thread-dump-jqhadoop221-154.gif!

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> This can be reproduced on a Spark *cluster*, but not in *local* mode:
> 1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. Run a job that needs a shuffle:
> {code:actionscript}
> scala> val count = sc.parallelize(0 until 300, 
> 10).repartition(2001).collect().length
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> 

[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21253:

Attachment: ui-thread-dump-jqhadoop221-154.gif

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> This can be reproduced on a Spark *cluster*, but not in *local* mode:
> 1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. Run a job that needs a shuffle:
> {code:actionscript}
> scala> val count = sc.parallelize(0 until 300, 
> 10).repartition(2001).collect().length
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> 

[jira] [Updated] (SPARK-21252) The duration times showed by spark web UI are inaccurate

2017-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21252:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I don't know what the underlying exact values are; those are likely available 
from the API. That would help figure out if there's an actual rounding problem. 
If not, I don't think this would be changed.
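For reference, a minimal sketch of pulling the raw per-stage timings from the 
monitoring REST API mentioned above; the application id, host, and port below are 
placeholders, and the field names in the comment are examples rather than an 
exhaustive list:

{code:scala}
import scala.io.Source

// Fetch the per-stage records for an application from the Spark UI's REST API.
// Each record carries raw timing fields (e.g. submissionTime, completionTime,
// executorRunTime) that can be compared against the rounded values shown on the page.
object StageTimings {
  def main(args: Array[String]): Unit = {
    val appId = "app-20170629103000-0001"                      // placeholder application id
    val url   = s"http://localhost:4040/api/v1/applications/$appId/stages"
    val json  = Source.fromURL(url).mkString
    println(json)
  }
}
{code}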

> The duration times showed by spark web UI are inaccurate
> 
>
> Key: SPARK-21252
> URL: https://issues.apache.org/jira/browse/SPARK-21252
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>Reporter: igor mazor
>Priority: Minor
>
> The duration times shown by the Spark UI are inaccurate and seem to be rounded.
> For example, for a job with 2 stages where the first stage executed in 47 ms and 
> the second stage in 3 seconds, the total execution time shown by the UI is 4 seconds.
> In another example, the first stage executed in 20 ms and the second stage in 4 
> seconds; the total execution time shown by the UI is in that case also 
> 4 seconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21253:


Assignee: Apache Spark

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>
> This can be reproduced on a Spark *cluster*, but not in *local* mode:
> 1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. Run a job that needs a shuffle:
> {code:actionscript}
> scala> val count = sc.parallelize(0 until 300, 
> 10).repartition(2001).collect().length
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
> at 
> 

[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21253:


Assignee: (was: Apache Spark)

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>
> This can be reproduced on a Spark *cluster*, but not in *local* mode:
> 1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. Run a job that needs a shuffle:
> {code:actionscript}
> scala> val count = sc.parallelize(0 until 300, 
> 10).repartition(2001).collect().length
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
> at 
> 

[jira] [Commented] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068169#comment-16068169
 ] 

Apache Spark commented on SPARK-21253:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18466

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>
> This can be reproduced on a Spark *cluster*, but not in *local* mode:
> 1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. Run a job that needs a shuffle:
> {code:actionscript}
> scala> val count = sc.parallelize(0 until 300, 
> 10).repartition(2001).collect().length
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
> at 
> 

[jira] [Created] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-21253:
---

 Summary: Cannot fetch big blocks to disk 
 Key: SPARK-21253
 URL: https://issues.apache.org/jira/browse/SPARK-21253
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Yuming Wang


This can be reproduced on a Spark *cluster*, but not in *local* mode:

1. Start a Spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
{code:actionscript}
$ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
{code}

2. Run a job that needs a shuffle:
{code:actionscript}
scala> val count = sc.parallelize(0 until 300, 
10).repartition(2001).collect().length
{code}

The error messages:

{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
java.io.IOException: Connection reset by peer
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
Connection reset by peer
at 
org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at 
io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
at 
io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at 
io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at 
org.apache.spark.network.client.TransportClient.stream(TransportClient.java:183)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:123)
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 

[jira] [Comment Edited] (SPARK-21252) The duration times showed by spark web UI are inaccurate

2017-06-29 Thread igor mazor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068147#comment-16068147
 ] 

igor mazor edited comment on SPARK-21252 at 6/29/17 10:42 AM:
--

Yes, it is indeed possible to drill down to individual stages; however, the scenario I 
described in my last comment happens within a single stage.
The first attempt, followed by a timeout after 20 seconds, and then the second 
successful attempt are all in the same stage, so the UI shows a duration of 20 
seconds, although that is not accurate.
Also, if a stage took 40 ms, the UI shows 40 ms as the duration, but for values 
in the seconds range there is rounding, which also causes inconsistent 
presentation of the results.
If rounding is already applied, why is 40 ms not rounded to 0 seconds?

I also noticed, for example, that for a stage that took 42 ms (Job Duration), 
clicking on that stage shows Duration = 38 ms, and going deeper into the 
Summary Metrics for Tasks shows the longest task at 35 ms, so I am not sure 
where the gaps went.


was (Author: mazor.igal):
Yes, it is indeed possible to drill down to individual stages; however, the scenario I 
described in my last comment happens within a single stage.
The first attempt, followed by a timeout after 20 seconds, and then the second 
successful attempt are all in the same stage, so the UI shows a duration of 20 
seconds, although that is not accurate.
Also, if a stage took 40 ms, the UI shows 40 ms as the duration, but for values 
in the seconds range there is rounding, which also causes inconsistent 
presentation of the results.
If rounding is already applied, why is 40 ms not rounded to 0 seconds?

I also noticed, for example, that for a stage that took 42 ms (Job Duration), 
clicking on that stage shows Duration = 38 ms, and going deeper into the 
Summary Metrics for Tasks shows the longest task at 35 ms, so I am not sure 
where the gaps went.

> The duration times showed by spark web UI are inaccurate
> 
>
> Key: SPARK-21252
> URL: https://issues.apache.org/jira/browse/SPARK-21252
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>Reporter: igor mazor
>
> The duration times shown by the Spark UI are inaccurate and seem to be rounded.
> For example, for a job with 2 stages where the first stage executed in 47 ms and 
> the second stage in 3 seconds, the total execution time shown by the UI is 4 seconds.
> In another example, the first stage executed in 20 ms and the second stage in 4 
> seconds; the total execution time shown by the UI is in that case also 
> 4 seconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate

2017-06-29 Thread igor mazor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068147#comment-16068147
 ] 

igor mazor commented on SPARK-21252:


Yes, it is indeed possible to drill down to individual stages; however, the scenario I 
described in my last comment happens within a single stage.
The first attempt, followed by a timeout after 20 seconds, and then the second 
successful attempt are all in the same stage, so the UI shows a duration of 20 
seconds, although that is not accurate.
Also, if a stage took 40 ms, the UI shows 40 ms as the duration, but for values 
in the seconds range there is rounding, which also causes inconsistent 
presentation of the results.
If rounding is already applied, why is 40 ms not rounded to 0 seconds?

I also noticed, for example, that for a stage that took 42 ms (Job Duration), 
clicking on that stage shows Duration = 38 ms, and going deeper into the 
Summary Metrics for Tasks shows the longest task at 35 ms, so I am not sure 
where the gaps went.

> The duration times showed by spark web UI are inaccurate
> 
>
> Key: SPARK-21252
> URL: https://issues.apache.org/jira/browse/SPARK-21252
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>Reporter: igor mazor
>
> The duration times shown by the Spark UI are inaccurate and seem to be rounded.
> For example, for a job with 2 stages where the first stage executed in 47 ms and 
> the second stage in 3 seconds, the total execution time shown by the UI is 4 seconds.
> In another example, the first stage executed in 20 ms and the second stage in 4 
> seconds; the total execution time shown by the UI is in that case also 
> 4 seconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.

2017-06-29 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-17689.
-
Resolution: Cannot Reproduce

I am unable to reproduce this anymore; it looks like this might have been fixed by 
other changes.

> _temporary files breaks the Spark SQL streaming job.
> 
>
> Key: SPARK-17689
> URL: https://issues.apache.org/jira/browse/SPARK-17689
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> Steps to reproduce:
> 1) Start a streaming job which reads from the HDFS location hdfs://xyz/*
> 2) Write content to hdfs://xyz/a
> 3) Repeat step 2 a few times.
> The job then fails as follows.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in 
> stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage 
> 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not 
> exist: hdfs://localhost:9000/input/t5/_temporary
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
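For reference, a minimal sketch of the reproduction steps above as a structured 
streaming job; the hdfs://xyz paths are the placeholders from the report, and the 
console sink is used only to keep the example self-contained:

{code:scala}
import org.apache.spark.sql.SparkSession

// Start a streaming query that reads text files matching the glob; while it runs,
// repeatedly copy new files into hdfs://xyz/a (e.g. with `hdfs dfs -put`).
object TemporaryDirRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SPARK-17689-repro").getOrCreate()

    val lines = spark.readStream.textFile("hdfs://xyz/*")

    val query = lines.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
{code}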



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.

2017-06-29 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068106#comment-16068106
 ] 

Kaushal Prajapati commented on SPARK-19908:
---

[~zhanzhang] could you please share some example code for which you are getting this 
error?

> Direct buffer memory OOM should not cause stage retries.
> 
>
> Key: SPARK-19908
> URL: https://issues.apache.org/jira/browse/SPARK-19908
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently, if there is a java.lang.OutOfMemoryError: Direct buffer memory, the 
> exception is turned into a FetchFailedException, causing stage retries.
> org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> 

[jira] [Commented] (SPARK-21252) The duration times showed by spark web UI are inaccurate

2017-06-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068085#comment-16068085
 ] 

Sean Owen commented on SPARK-21252:
---

You can drill into individual stage times, right?

> The duration times showed by spark web UI are inaccurate
> 
>
> Key: SPARK-21252
> URL: https://issues.apache.org/jira/browse/SPARK-21252
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>Reporter: igor mazor
>
> The duration times showed by spark UI are inaccurate and seems to be rounded.
> For example when a job had 2 stages, first stage executed in 47 ms and second 
> stage in 3 seconds, the total execution time showed by the UI is 4 seconds.
> Another example, first stage was executed in 20 ms and second stage in 4 
> seconds, the total execution time showed by the UI would be in that case also 
> 4 seconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


