[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498029#comment-15498029
 ] 

Apache Spark commented on SPARK-5484:
-

User 'dding3' has created a pull request for this issue:
https://github.com/apache/spark/pull/15125

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.
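
Below is a minimal sketch of the idea, assuming a generic iterative loop rather
than GraphX's actual Pregel implementation: checkpoint the working graph every N
iterations so the lineage chain is truncated before it can overflow the stack.
It requires sc.setCheckpointDir to have been called; the helper name and the
interval are illustrative.

{code}
import org.apache.spark.graphx.Graph

// Illustrative helper, not GraphX's Pregel: run `step` up to maxIters times and
// checkpoint the graph every `checkpointInterval` iterations to cut the lineage.
def iterateWithCheckpoint[VD, ED](
    initial: Graph[VD, ED],
    maxIters: Int,
    checkpointInterval: Int = 25)(step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = initial
  for (i <- 1 to maxIters) {
    g = step(g)
    if (i % checkpointInterval == 0) {
      g.checkpoint()        // truncates the lineage of both vertices and edges
      g.vertices.count()    // force materialization of the checkpoint
      g.edges.count()
    }
  }
  g
}
{code}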






[jira] [Commented] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497917#comment-15497917
 ] 

Apache Spark commented on SPARK-17559:
--

User 'dding3' has created a pull request for this issue:
https://github.com/apache/spark/pull/15124

> PeriodicGraphCheckpointer did not persist edges as expected in some cases
> 
>
> Key: SPARK-17559
> URL: https://issues.apache.org/jira/browse/SPARK-17559
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: ding
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When PeriodicGraphCheckpointer is used to persist a graph, the edges are 
> sometimes not persisted. Currently the graph is persisted only when the 
> vertices' storage level is NONE, but there is a chance that the vertices' 
> storage level is not NONE while the edges' storage level is NONE. For example, 
> in a graph created by an outerJoinVertices operation, the vertices are 
> automatically cached while the edges are not, so the edges will not be 
> persisted when PeriodicGraphCheckpointer is used to persist the graph.
> See the minimal example below:
>     val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], Int](2, sc)
>     val users = sc.textFile("data/graphx/users.txt")
>       .map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail))
>     val followerGraph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
>     val graph = followerGraph.outerJoinVertices(users) {
>       case (uid, deg, Some(attrList)) => attrList
>       case (uid, deg, None) => Array.empty[String]
>     }
>     graphCheckpointer.update(graph)
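
A hedged sketch of the fix this report points at, assuming the checkpointer
should decide persistence for vertices and edges independently; this is
illustrative code, not the actual PeriodicGraphCheckpointer.

{code}
import org.apache.spark.graphx.Graph
import org.apache.spark.storage.StorageLevel

// Persist vertices and edges independently, so a graph whose vertices are
// already cached (e.g. the result of outerJoinVertices) still gets its
// edges persisted.
def persistGraph[VD, ED](graph: Graph[VD, ED]): Unit = {
  if (graph.vertices.getStorageLevel == StorageLevel.NONE) {
    graph.vertices.persist()
  }
  if (graph.edges.getStorageLevel == StorageLevel.NONE) {
    graph.edges.persist()
  }
}
{code}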






[jira] [Commented] (SPARK-17551) support null ordering for DataFrame API

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497874#comment-15497874
 ] 

Apache Spark commented on SPARK-17551:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/15123

> support null ordering for DataFrame API
> ---
>
> Key: SPARK-17551
> URL: https://issues.apache.org/jira/browse/SPARK-17551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause of the 
> SQL interface. This JIRA is to complete the feature by adding the same support 
> to the DataFrame/Dataset APIs.
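
For reference, a small sketch of both forms: the SQL syntax from SPARK-10747
already works, while the DataFrame-side ordering method shown at the end is the
kind of API this issue asks for, so treat its name as tentative.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("null-ordering-sketch").getOrCreate()
import spark.implicits._

// A column containing a null (second row).
val df = Seq((1, "b"), (2, null), (3, "a")).toDF("id", "name")
df.createOrReplaceTempView("t")

// Already supported through SQL (SPARK-10747): sort the nulls last explicitly.
spark.sql("SELECT id, name FROM t ORDER BY name NULLS LAST").show()

// Proposed DataFrame/Dataset equivalent (method name tentative):
df.orderBy($"name".asc_nulls_last).show()
{code}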






[jira] [Commented] (SPARK-17570) Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple for tables

2016-09-16 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497729#comment-15497729
 ] 

Tejas Patil commented on SPARK-17570:
-

[~cloud_fan], [~hvanhovell]: I have been looking into the code to figure out a 
way to make this change. So far I have arrived at the following, but I am not 
sure it is the best approach:
1. Have a way to mutate the `outputPartitioning` of SparkPlan nodes.
2. While creating the physical plan [0], when we come across a case where such 
an optimization can be applied, mutate the child subtree's output partitioning 
from `HashPartitioning(expression, buckets = x * y)` to 
`HashPartitioning(expression, buckets = y)`, where `y` is the desired number of 
buckets in the output table of the sort merge join.
3. When the bucketed RDD is created, pack multiple buckets of the input table 
into the same `FilePartition`.

#2 needs to visit and alter all the nodes of a subtree after the 
outputPartitioning of the subtree's root is changed. Are you OK with this 
approach, or do you have better ideas?

[0] : 
https://github.com/apache/spark/blob/aaf632b2132750c6970469b902d9308dbf36/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L189
[1] : 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L417

> Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple 
> for tables
> -
>
> Key: SPARK-17570
> URL: https://issues.apache.org/jira/browse/SPARK-17570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Minor
>
> In the case of bucketed tables, Spark avoids doing `Sort` and `Exchange` if 
> the input tables and the output table have the same number of buckets. However, 
> unequal bucketing always leads to `Sort` and `Exchange`. If the number of 
> buckets in the output table is a factor of the number of buckets in the input 
> tables, we should be able to avoid `Sort` and `Exchange` and join them directly.
> e.g.
> Assume Input1, Input2, and Output are bucketed + sorted tables over the same 
> columns but with different numbers of buckets: Input1 has 8 buckets, Input2 
> has 12 buckets, and Output has 4 buckets. Since hash partitioning is done using 
> the modulus, if we join buckets (0, 4) of Input1 and buckets (0, 4, 8) of Input2 
> in the same task, it produces bucket 0 of the output table.
> {noformat}
> Input1   (0, 4)  (1, 3)  (2, 5)   (3, 7)
> Input2   (0, 4, 8)   (1, 3, 9)   (2, 5, 10)   (3, 7, 11)
> Output   (0) (1) (2)  (3)
> {noformat}
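
A quick sanity check of the modulus argument above (plain Scala, numbers taken
from the example): with 4 output buckets, Input1's bucket ids {0, 4} and
Input2's ids {0, 4, 8} all collapse onto output bucket 0, which is why reading
them together in one task preserves the output's hash partitioning without an
Exchange.

{code}
val outputBuckets = 4

// Bucket ids are assigned by hashing modulo the bucket count, so an input
// bucket id maps onto output bucket (id % outputBuckets).
val input1Group = Seq(0, 4).map(_ % outputBuckets)     // List(0, 0)
val input2Group = Seq(0, 4, 8).map(_ % outputBuckets)  // List(0, 0, 0)

assert((input1Group ++ input2Group).forall(_ == 0))
{code}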






[jira] [Assigned] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17569:


Assignee: (was: Apache Spark)

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>
> Structured Streaming's FileSource lists files to classify files as Offsets. 
> Once this file list is committed to a metadata log for a batch, this file 
> list is turned into a "Batch FileSource" Relation which acts as the source to 
> the incremental execution.
> While this "Batch FileSource" Relation is resolved, we re-check that every 
> single file exists on the Driver. It takes a horrible amount of time, and is 
> a total waste. We can simply skip the file existence check during execution.






[jira] [Commented] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497680#comment-15497680
 ] 

Apache Spark commented on SPARK-17569:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15122

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>
> Structured Streaming's FileSource lists files to classify files as Offsets. 
> Once this file list is committed to a metadata log for a batch, this file 
> list is turned into a "Batch FileSource" Relation which acts as the source to 
> the incremental execution.
> While this "Batch FileSource" Relation is resolved, we re-check that every 
> single file exists on the Driver. It takes a horrible amount of time, and is 
> a total waste. We can simply skip the file existence check during execution.






[jira] [Assigned] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17569:


Assignee: Apache Spark

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Structured Streaming's FileSource lists files to classify files as Offsets. 
> Once this file list is committed to a metadata log for a batch, this file 
> list is turned into a "Batch FileSource" Relation which acts as the source to 
> the incremental execution.
> While this "Batch FileSource" Relation is resolved, we re-check that every 
> single file exists on the Driver. It takes a horrible amount of time, and is 
> a total waste. We can simply skip the file existence check during execution.






[jira] [Assigned] (SPARK-17567) Broken link to Spark paper

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17567:


Assignee: Apache Spark

> Broken link to Spark paper
> --
>
> Key: SPARK-17567
> URL: https://issues.apache.org/jira/browse/SPARK-17567
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Ondrej Galbavy
>Assignee: Apache Spark
>
> Documentation 
> (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) 
> contains broken link to Spark paper 
> (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I found it 
> elsewhere 
> (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) 
> and I hope it is the same one. It should be uploaded to and linked from some 
> Apache controlled storage, so it won't break again.






[jira] [Created] (SPARK-17570) Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple for tables

2016-09-16 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-17570:
---

 Summary: Avoid Hash and Exchange in Sort Merge join if bucketing 
factor is multiple for tables
 Key: SPARK-17570
 URL: https://issues.apache.org/jira/browse/SPARK-17570
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Tejas Patil
Priority: Minor


In the case of bucketed tables, Spark avoids doing `Sort` and `Exchange` if the 
input tables and the output table have the same number of buckets. However, 
unequal bucketing always leads to `Sort` and `Exchange`. If the number of buckets 
in the output table is a factor of the number of buckets in the input tables, we 
should be able to avoid `Sort` and `Exchange` and join them directly.
e.g.

Assume Input1, Input2, and Output are bucketed + sorted tables over the same 
columns but with different numbers of buckets: Input1 has 8 buckets, Input2 has 
12 buckets, and Output has 4 buckets. Since hash partitioning is done using the 
modulus, if we join buckets (0, 4) of Input1 and buckets (0, 4, 8) of Input2 in 
the same task, it produces bucket 0 of the output table.

{noformat}
Input1   (0, 4)  (1, 3)  (2, 5)   (3, 7)
Input2   (0, 4, 8)   (1, 3, 9)   (2, 5, 10)   (3, 7, 11)
Output   (0) (1) (2)  (3)
{noformat}






[jira] [Assigned] (SPARK-17567) Broken link to Spark paper

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17567:


Assignee: (was: Apache Spark)

> Broken link to Spark paper
> --
>
> Key: SPARK-17567
> URL: https://issues.apache.org/jira/browse/SPARK-17567
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Ondrej Galbavy
>
> Documentation 
> (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) 
> contains broken link to Spark paper 
> (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I found it 
> elsewhere 
> (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) 
> and I hope it is the same one. It should be uploaded to and linked from some 
> Apache controlled storage, so it won't break again.






[jira] [Commented] (SPARK-17567) Broken link to Spark paper

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497670#comment-15497670
 ] 

Apache Spark commented on SPARK-17567:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/15121

> Broken link to Spark paper
> --
>
> Key: SPARK-17567
> URL: https://issues.apache.org/jira/browse/SPARK-17567
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Ondrej Galbavy
>
> Documentation 
> (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) 
> contains broken link to Spark paper 
> (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I found it 
> elsewhere 
> (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) 
> and I hope it is the same one. It should be uploaded to and linked from some 
> Apache controlled storage, so it won't break again.






[jira] [Created] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17569:
---

 Summary: Don't recheck existence of files when generating File 
Relation resolution in StructuredStreaming
 Key: SPARK-17569
 URL: https://issues.apache.org/jira/browse/SPARK-17569
 Project: Spark
  Issue Type: Improvement
Reporter: Burak Yavuz


Structured Streaming's FileSource lists files to classify files as Offsets. 
Once this file list is committed to a metadata log for a batch, this file list 
is turned into a "Batch FileSource" Relation which acts as the source to the 
incremental execution.

While this "Batch FileSource" Relation is resolved, we re-check that every 
single file exists on the Driver. It takes a horrible amount of time, and is a 
total waste. We can simply skip the file existence check during execution.
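
Purely to illustrate the cost being described, a sketch of the kind of
driver-side pass that can be skipped; the paths and the surrounding setup are
hypothetical, since the committed metadata log already guarantees the files
were listed for this batch.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-existence-sketch").getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Hypothetical file list read back from the metadata log for one batch.
val committedFiles = Seq("hdfs:///tmp/stream/part-00000", "hdfs:///tmp/stream/part-00001")

// One namenode round trip per file, serially, on the driver: this is the
// redundant work the issue wants to skip during execution.
val allExist = committedFiles.forall(p => fs.exists(new Path(p)))
{code}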






[jira] [Assigned] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4563:
---

Assignee: (was: Apache Spark)

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver's bind ip and advertised ip are not separately configurable. 
> spark.driver.host only sets the bind ip, and SPARK_PUBLIC_DNS does not work for 
> the spark driver. Allow an option to set the advertised ip/hostname.
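
A hedged sketch of the requested configuration split, as it might look from an
application: spark.driver.host is the existing advertise setting, while the
separate bind-address property is what this issue proposes, so its name here is
illustrative.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("bind-vs-advertise-sketch")
  // Hostname/IP advertised to executors and the cluster manager (exists today).
  .set("spark.driver.host", "driver.example.com")
  // Local address the driver actually binds to (proposed by this issue;
  // property name illustrative).
  .set("spark.driver.bindAddress", "0.0.0.0")
{code}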






[jira] [Commented] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497532#comment-15497532
 ] 

Apache Spark commented on SPARK-4563:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15120

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver's bind ip and advertised ip are not separately configurable. 
> spark.driver.host only sets the bind ip, and SPARK_PUBLIC_DNS does not work for 
> the spark driver. Allow an option to set the advertised ip/hostname.






[jira] [Assigned] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4563:
---

Assignee: Apache Spark

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Apache Spark
>Priority: Minor
>
> The Spark driver's bind ip and advertised ip are not separately configurable. 
> spark.driver.host only sets the bind ip, and SPARK_PUBLIC_DNS does not work for 
> the spark driver. Allow an option to set the advertised ip/hostname.






[jira] [Assigned] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17568:


Assignee: Apache Spark

> Add spark-submit option for user to override ivy settings used to resolve 
> packages/artifacts
> 
>
> Key: SPARK-17568
> URL: https://issues.apache.org/jira/browse/SPARK-17568
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>
> The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven 
> coordinates to package jars. Currently, the IvySettings are hard-coded with 
> Maven Central as the last repository in the chain of resolvers. 
> At IBM, we have heard from several enterprise clients that are frustrated 
> with lack of control over their local Spark installations. These clients want 
> to ensure that certain artifacts can be excluded or patched due to security 
> or license issues. For example, a package may use a vulnerable SSL protocol; 
> or a package may link against an AGPL library written by a litigious 
> competitor.
> While additional repositories and exclusions can be added on the spark-submit 
> command line, this falls short of what is needed. With Maven Central always 
> as a fall-back repository, it is difficult to ensure only approved artifacts 
> are used and it is often the exclusions that site admins are not aware of 
> that can cause problems. Also, known exclusions are better handled through a 
> centralized managed repository rather than as command line arguments.
> To resolve these issues, we propose the following change: allow the user to 
> specify an Ivy Settings XML file to pass in as an optional argument to 
> {{spark-submit}} (or specify in a config file) to define alternate 
> repositories used to resolve artifacts instead of the hard-coded defaults. 
> The use case for this would be to define a managed repository (such as Nexus) 
> in the settings file so that all requests for artifacts go through one 
> location only.
> Example usage:
> {noformat}
> $SPARK_HOME/bin/spark-submit --conf 
> spark.ivy.settings=/path/to/ivysettings.xml  myapp.jar
> {noformat}






[jira] [Assigned] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17568:


Assignee: (was: Apache Spark)

> Add spark-submit option for user to override ivy settings used to resolve 
> packages/artifacts
> 
>
> Key: SPARK-17568
> URL: https://issues.apache.org/jira/browse/SPARK-17568
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Bryan Cutler
>
> The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven 
> coordinates to package jars. Currently, the IvySettings are hard-coded with 
> Maven Central as the last repository in the chain of resolvers. 
> At IBM, we have heard from several enterprise clients that are frustrated 
> with lack of control over their local Spark installations. These clients want 
> to ensure that certain artifacts can be excluded or patched due to security 
> or license issues. For example, a package may use a vulnerable SSL protocol; 
> or a package may link against an AGPL library written by a litigious 
> competitor.
> While additional repositories and exclusions can be added on the spark-submit 
> command line, this falls short of what is needed. With Maven Central always 
> as a fall-back repository, it is difficult to ensure only approved artifacts 
> are used and it is often the exclusions that site admins are not aware of 
> that can cause problems. Also, known exclusions are better handled through a 
> centralized managed repository rather than as command line arguments.
> To resolve these issues, we propose the following change: allow the user to 
> specify an Ivy Settings XML file to pass in as an optional argument to 
> {{spark-submit}} (or specify in a config file) to define alternate 
> repositories used to resolve artifacts instead of the hard-coded defaults. 
> The use case for this would be to define a managed repository (such as Nexus) 
> in the settings file so that all requests for artifacts go through one 
> location only.
> Example usage:
> {noformat}
> $SPARK_HOME/bin/spark-submit --conf 
> spark.ivy.settings=/path/to/ivysettings.xml  myapp.jar
> {noformat}






[jira] [Commented] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497502#comment-15497502
 ] 

Apache Spark commented on SPARK-17568:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/15119

> Add spark-submit option for user to override ivy settings used to resolve 
> packages/artifacts
> 
>
> Key: SPARK-17568
> URL: https://issues.apache.org/jira/browse/SPARK-17568
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Bryan Cutler
>
> The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven 
> coordinates to package jars. Currently, the IvySettings are hard-coded with 
> Maven Central as the last repository in the chain of resolvers. 
> At IBM, we have heard from several enterprise clients that are frustrated 
> with lack of control over their local Spark installations. These clients want 
> to ensure that certain artifacts can be excluded or patched due to security 
> or license issues. For example, a package may use a vulnerable SSL protocol; 
> or a package may link against an AGPL library written by a litigious 
> competitor.
> While additional repositories and exclusions can be added on the spark-submit 
> command line, this falls short of what is needed. With Maven Central always 
> as a fall-back repository, it is difficult to ensure only approved artifacts 
> are used and it is often the exclusions that site admins are not aware of 
> that can cause problems. Also, known exclusions are better handled through a 
> centralized managed repository rather than as command line arguments.
> To resolve these issues, we propose the following change: allow the user to 
> specify an Ivy Settings XML file to pass in as an optional argument to 
> {{spark-submit}} (or specify in a config file) to define alternate 
> repositories used to resolve artifacts instead of the hard-coded defaults. 
> The use case for this would be to define a managed repository (such as Nexus) 
> in the settings file so that all requests for artifacts go through one 
> location only.
> Example usage:
> {noformat}
> $SPARK_HOME/bin/spark-submit --conf 
> spark.ivy.settings=/path/to/ivysettings.xml  myapp.jar
> {noformat}






[jira] [Created] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts

2016-09-16 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-17568:


 Summary: Add spark-submit option for user to override ivy settings 
used to resolve packages/artifacts
 Key: SPARK-17568
 URL: https://issues.apache.org/jira/browse/SPARK-17568
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Reporter: Bryan Cutler


The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven coordinates 
to package jars. Currently, the IvySettings are hard-coded with Maven Central 
as the last repository in the chain of resolvers. 

At IBM, we have heard from several enterprise clients that are frustrated with 
lack of control over their local Spark installations. These clients want to 
ensure that certain artifacts can be excluded or patched due to security or 
license issues. For example, a package may use a vulnerable SSL protocol; or a 
package may link against an AGPL library written by a litigious competitor.

While additional repositories and exclusions can be added on the spark-submit 
command line, this falls short of what is needed. With Maven Central always as 
a fall-back repository, it is difficult to ensure only approved artifacts are 
used and it is often the exclusions that site admins are not aware of that can 
cause problems. Also, known exclusions are better handled through a centralized 
managed repository rather than as command line arguments.

To resolve these issues, we propose the following change: allow the user to 
specify an Ivy Settings XML file to pass in as an optional argument to 
{{spark-submit}} (or specify in a config file) to define alternate repositories 
used to resolve artifacts instead of the hard-coded defaults. The use case for 
this would be to define a managed repository (such as Nexus) in the settings 
file so that all requests for artifacts go through one location only.

Example usage:
{noformat}
$SPARK_HOME/bin/spark-submit --conf spark.ivy.settings=/path/to/ivysettings.xml 
 myapp.jar
{noformat}
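
A sketch of how the proposal could be wired up, assuming the user-supplied XML
is handed straight to Ivy; the property name comes from the example above, and
the helper itself is hypothetical.

{code}
import java.io.File
import org.apache.ivy.core.settings.IvySettings

// If spark.ivy.settings points at an ivysettings.xml, load it so its resolver
// chain (e.g. a single managed Nexus repository) replaces the hard-coded
// default chain that ends at Maven Central.
def buildIvySettings(settingsFile: Option[String]): IvySettings = {
  val settings = new IvySettings()
  settingsFile.foreach(path => settings.load(new File(path)))
  settings
}
{code}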







[jira] [Commented] (SPARK-17565) Janino exception when calculating metrics for large generated class

2016-09-16 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497368#comment-15497368
 ] 

Marcelo Vanzin commented on SPARK-17565:


The fix for SPARK-17549 adds a workaround for this issue (the exception doesn't 
cause task failures anymore), but the underlying problem still exists and 
should be investigated.

> Janino exception when calculating metrics for large generated class
> ---
>
> Key: SPARK-17565
> URL: https://issues.apache.org/jira/browse/SPARK-17565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: generated_code.txt
>
>
> This was found when investigating SPARK-17549:
> {noformat}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> at java.util.ArrayList.get(ArrayList.java:411)
> at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
> at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
> at 
> org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
> at 
> org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
> at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
> at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
> ... 54 more
> {noformat}
> Attaching file with full driver-side exception, which includes the generated 
> code (which is huge).






[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17549:
-
Assignee: Marcelo Vanzin

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, meaning it has either a very large number of columns, or a very 
> large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
>   ... 54 more
> {noformat}
> And basically a lot of that going on 
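
A minimal sketch of the single-accumulator idea described in the report,
assuming only the total cached size is needed on the driver (names
illustrative): one long accumulator replaces the per-batch stats rows, so
driver memory no longer grows with the number of partitions.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-size-sketch").getOrCreate()

// One running byte count on the driver instead of one stats row per batch.
val totalBytes = spark.sparkContext.longAccumulator("cachedBatchSizeInBytes")

// Executor side (illustrative): each cached batch only reports its size.
// totalBytes.add(batchSizeInBytes)
{code}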

[jira] [Resolved] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17549.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15112
[https://github.com/apache/spark/pull/15112]

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, meaning it has either a very large number of columns, or a very 
> large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> 

[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17561:

Fix Version/s: (was: 2.0.1)

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.0
>
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue






[jira] [Resolved] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17561.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0
   2.0.1

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.1, 2.1.0
>
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue






[jira] [Created] (SPARK-17567) Broken link to Spark paper

2016-09-16 Thread Ondrej Galbavy (JIRA)
Ondrej Galbavy created SPARK-17567:
--

 Summary: Broken link to Spark paper
 Key: SPARK-17567
 URL: https://issues.apache.org/jira/browse/SPARK-17567
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.0.0
Reporter: Ondrej Galbavy


Documentation 
(http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) 
contains broken link to Spark paper 
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I found it 
elsewhere 
(https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) and 
I hope it is the same one. It should be uploaded to and linked from some Apache 
controlled storage, so it won't break again.






[jira] [Created] (SPARK-17566) "--master yarn --deploy-mode cluster" gives "Launching Python applications through spark-submit is currently only supported for local files"

2016-09-16 Thread Zhenhua Xu (JIRA)
Zhenhua Xu created SPARK-17566:
--

 Summary: "--master yarn --deploy-mode cluster" gives "Launching 
Python applications through spark-submit is currently only supported for local 
files"
 Key: SPARK-17566
 URL: https://issues.apache.org/jira/browse/SPARK-17566
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.0
Reporter: Zhenhua Xu


In Spark 1.6, the following command runs fine with both primary and additional 
python files in hdfs.
/bin/spark-submit --py-files hdfs:///tmp/base.py --master yarn-cluster 
hdfs:///tmp/pi.py

In Spark 2.0.0, the following command fails:
/bin/spark-submit --py-files hdfs:///tmp/base.py --master yarn --deploy-mode 
cluster hdfs:///tmp/pi.py

Error:
Launching Python applications through spark-submit is currently only supported 
for local files: hdfs:///tmp/base.py








[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived form the same DataFrame

2016-09-16 Thread Mijung Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497086#comment-15497086
 ] 

Mijung Kim commented on SPARK-14948:


I have come across the same problem. A workaround I found is to create a clone 
of the table through "toDF" with renamed columns.

// It has an error
val tt1 = tt.toDF()
tt.join(tt1, expr("tt.salaryAvg > tt1.salaryAvg")).show()

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve 'tt.salaryAvg' given input columns salary, depName, depName, empNo, 
empNo, salary, salaryMax, salaryMax;

==>

// This works!
val tt1 = tt.toDF("depName1", "empNo1", "salary1", "salaryMax1")
tt1.join(tt, tt.col("salaryMax") <= tt1.col("salaryMax1")).show()


> Exception when joining DataFrames derived form the same DataFrame
> -
>
> Key: SPARK-14948
> URL: https://issues.apache.org/jira/browse/SPARK-14948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Saurabh Santhosh
>
> h2. Spark Analyser is throwing the following exception in a specific scenario 
> :
> h2. Exception :
> org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing 
> from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext()
>     .parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> 
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations :
> * This issue is related to the Data Type of Fields of the initial Data 
> Frame.(If the Data Type is not String, it will work.)
> * It works fine if the data frame is registered as a temporary table and an 
> sql (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written.






[jira] [Commented] (SPARK-14358) Change SparkListener from a trait to an abstract class, and remove JavaSparkListener

2016-09-16 Thread Oleksiy Sayankin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496924#comment-15496924
 ] 

Oleksiy Sayankin commented on SPARK-14358:
--

[~rxin], please see my comments here: 
https://issues.apache.org/jira/browse/SPARK-17563, but I think yes, I can.
In the end I decided to create an issue: 
https://issues.apache.org/jira/browse/HIVE-14777

> Change SparkListener from a trait to an abstract class, and remove 
> JavaSparkListener
> 
>
> Key: SPARK-14358
> URL: https://issues.apache.org/jira/browse/SPARK-14358
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Scala traits are difficult to maintain binary compatibility on, and as a 
> result we had to introduce JavaSparkListener. In Spark 2.0 we can change 
> SparkListener from a trait to an abstract class and then remove 
> JavaSparkListener.






[jira] [Commented] (SPARK-17565) Janino exception when calculating metrics for large generated class

2016-09-16 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496860#comment-15496860
 ] 

Marcelo Vanzin commented on SPARK-17565:


Note the code was generated with a recent 2.1 snapshot, not 2.0.0, but I'm 
pretty sure I also ran into this with 2.0.

> Janino exception when calculating metrics for large generated class
> ---
>
> Key: SPARK-17565
> URL: https://issues.apache.org/jira/browse/SPARK-17565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: generated_code.txt
>
>
> This was found when investigating SPARK-17549:
> {noformat}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> at java.util.ArrayList.get(ArrayList.java:411)
> at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
> at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
> at 
> org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
> at 
> org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
> at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
> at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
> ... 54 more
> {noformat}
> Attaching file with full driver-side exception, which includes the generated 
> code (which is huge).






[jira] [Updated] (SPARK-17565) Janino exception when calculating metrics for large generated class

2016-09-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17565:
---
Attachment: generated_code.txt

> Janino exception when calculating metrics for large generated class
> ---
>
> Key: SPARK-17565
> URL: https://issues.apache.org/jira/browse/SPARK-17565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: generated_code.txt
>
>
> This was found when investigating SPARK-17549:
> {noformat}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> at java.util.ArrayList.get(ArrayList.java:411)
> at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
> at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
> at 
> org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
> at 
> org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
> at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
> at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
> ... 54 more
> {noformat}
> Attaching file with full driver-side exception, which includes the generated 
> code (which is huge).






[jira] [Created] (SPARK-17565) Janino exception when calculating metrics for large generated class

2016-09-16 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-17565:
--

 Summary: Janino exception when calculating metrics for large 
generated class
 Key: SPARK-17565
 URL: https://issues.apache.org/jira/browse/SPARK-17565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor
 Attachments: generated_code.txt

This was found when investigating SPARK-17549:

{noformat}
Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at 
org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
at 
org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
... 54 more
{noformat}

Attaching file with full driver-side exception, which includes the generated 
code (which is huge).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-09-16 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496857#comment-15496857
 ] 

ding commented on SPARK-5484:
-

Thank you for the kind reminder. However, as the code is almost ready, I will 
still send a PR in case someone is interested in reviewing it.
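
For anyone reviewing, a minimal sketch (not the actual PR) of the kind of periodic 
checkpointing being discussed, assuming GraphX's {{Graph.checkpoint()}}; the helper 
name and interval are illustrative only:

{code}
import org.apache.spark.graphx.Graph

// Hedged sketch: truncate the lineage every `interval` iterations by
// checkpointing the graph and forcing materialization. Assumes
// SparkContext.setCheckpointDir(...) has already been called.
def maybeCheckpoint[VD, ED](g: Graph[VD, ED], iteration: Int, interval: Int = 25): Graph[VD, ED] = {
  if (interval > 0 && iteration % interval == 0) {
    g.checkpoint()
    // Materialize both sides so the checkpoint actually replaces the lineage.
    g.vertices.count()
    g.edges.count()
  }
  g
}
{code}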

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-09-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496823#comment-15496823
 ] 

Reynold Xin commented on SPARK-16534:
-

[~maver1ck] you don't need to wait on this, do you? You can just build a module 
outside Spark to use for now.


> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17079) broadcast decision based on cbo

2016-09-16 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17079:
-
Description: We decide if broadcast join should be used based on the 
cardinality and size of join input side rather than the initial size of the 
join base relation. We also decide build side of HashJoin based on cbo.  (was: 
We decide if broadcast join is used based on the cardinality and size of join 
input side rather than the initial size of the join base relation.)
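
For illustration, a minimal hypothetical sketch of the decision rule described above; 
the Stats case class and the threshold are assumptions, not Spark's actual CBO API:

{code}
// Compare the estimated size of each join input (after filters/projections)
// against a broadcast threshold, and pick the smaller side as the build side.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

val broadcastThreshold: BigInt = BigInt(10L * 1024 * 1024) // e.g. 10 MB

def canBroadcast(side: Stats): Boolean = side.sizeInBytes <= broadcastThreshold

def buildSide(left: Stats, right: Stats): String =
  if (right.sizeInBytes <= left.sizeInBytes) "BuildRight" else "BuildLeft"
{code}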

> broadcast decision based on cbo
> ---
>
> Key: SPARK-17079
> URL: https://issues.apache.org/jira/browse/SPARK-17079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We decide if broadcast join should be used based on the cardinality and size 
> of join input side rather than the initial size of the join base relation. We 
> also decide build side of HashJoin based on cbo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14358) Change SparkListener from a trait to an abstract class, and remove JavaSparkListener

2016-09-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496798#comment-15496798
 ] 

Reynold Xin commented on SPARK-14358:
-

[~osayankin] we thought about creating a compatibility package for backward 
compatibility, but in this case you can trivially add a JavaSparkListener in 
Hive, can't you?
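
For illustration only (not Spark's or Hive's actual code), a shim of roughly this 
shape could live in Hive's own source tree, assuming Spark 2.0's 
org.apache.spark.scheduler.SparkListener abstract class, which already provides 
no-op implementations of every callback:

{code}
package org.apache.spark

import org.apache.spark.scheduler.SparkListener

// Drop-in stand-in for the removed class: every listener callback inherits
// its no-op default from the Spark 2.0 SparkListener abstract class.
class JavaSparkListener extends SparkListener
{code}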


> Change SparkListener from a trait to an abstract class, and remove 
> JavaSparkListener
> 
>
> Key: SPARK-14358
> URL: https://issues.apache.org/jira/browse/SPARK-14358
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Scala traits are difficult to maintain binary compatibility on, and as a 
> result we had to introduce JavaSparkListener. In Spark 2.0 we can change 
> SparkListener from a trait to an abstract class and then remove 
> JavaSparkListener.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2016-09-16 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496699#comment-15496699
 ] 

Adam Roberts commented on SPARK-17564:
--

callback1.failure is sometimes null, and because this fails only intermittently 
I'm sure it's related to a timing window

We are supposed to get an IOException when the test passes:
{code}
callback1.failure: java.io.IOException: Connection from /*some ip*:35581 closed
callback1.failure.getClass: class java.io.IOException
{code}

but sometimes we get the following instead, so the assertion fails (and we should 
improve the assertion message too):
{code}
callback1.failure: null
{code}

[~zsxwing] your expertise is welcome here. There is also the CountDownLatch 
constructor to experiment with (the count could be increased from 1), as well as 
the 1.2 second timeouts; the goal is to improve this test's robustness.
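
A minimal sketch of the kind of hardening being discussed: wait on the latch with a 
generous timeout and fail with a descriptive message instead of a bare 
AssertionError. The TestCallback stand-in below is an assumption, not the suite's 
actual type:

{code}
import java.io.IOException
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Stand-in for the suite's TestCallback: a completion latch plus the recorded failure.
class TestCallback {
  val latch = new CountDownLatch(1)
  @volatile var failure: Throwable = _
}

def awaitIOFailure(callback: TestCallback, seconds: Long = 60): Unit = {
  val completed = callback.latch.await(seconds, TimeUnit.SECONDS)
  assert(completed, s"callback did not complete within $seconds seconds")
  assert(callback.failure.isInstanceOf[IOException],
    s"expected IOException but failure was: ${callback.failure}")
}
{code}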

> Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
> --
>
> Key: SPARK-17564
> URL: https://issues.apache.org/jira/browse/SPARK-17564
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Could be related to [SPARK-10680]
> This is the test and one fix would be to increase the timeouts from 1.2 
> seconds to 5 seconds
> {code}
> // The timeout is relative to the LAST request sent, which is kinda weird, 
> but still.
>   // This test also makes sure the timeout works for Fetch requests as well 
> as RPCs.
>   @Test
>   public void furtherRequestsDelay() throws Exception {
> final byte[] response = new byte[16];
> final StreamManager manager = new StreamManager() {
>   @Override
>   public ManagedBuffer getChunk(long streamId, int chunkIndex) {
> Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
> return new NioManagedBuffer(ByteBuffer.wrap(response));
>   }
> };
> RpcHandler handler = new RpcHandler() {
>   @Override
>   public void receive(
>   TransportClient client,
>   ByteBuffer message,
>   RpcResponseCallback callback) {
> throw new UnsupportedOperationException();
>   }
>   @Override
>   public StreamManager getStreamManager() {
> return manager;
>   }
> };
> TransportContext context = new TransportContext(conf, handler);
> server = context.createServer();
> clientFactory = context.createClientFactory();
> TransportClient client = 
> clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());
> // Send one request, which will eventually fail.
> TestCallback callback0 = new TestCallback();
> client.fetchChunk(0, 0, callback0);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // Send a second request before the first has failed.
> TestCallback callback1 = new TestCallback();
> client.fetchChunk(0, 1, callback1);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // not complete yet, but should complete soon
> assertEquals(-1, callback0.successLength);
> assertNull(callback0.failure);
> callback0.latch.await(60, TimeUnit.SECONDS);
> assertTrue(callback0.failure instanceof IOException);
> // failed at same time as previous
> assertTrue(callback1.failure instanceof IOException); // This is where we 
> fail because callback1.failure is null
>   }
> {code}
> If there are better suggestions for improving this test let's take them 
> onboard, I think using 5 sec timeout periods would be a place to start so 
> folks don't need to needlessly triage this failure. Will add a few prints and 
> report back



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2016-09-16 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-17564:
-
Description: 
Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

{code}
// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);

// not complete yet, but should complete soon
assertEquals(-1, callback0.successLength);
assertNull(callback0.failure);
callback0.latch.await(60, TimeUnit.SECONDS);
assertTrue(callback0.failure instanceof IOException);

// failed at same time as previous
assertTrue(callback1.failure instanceof IOException); // This is where we 
fail because callback1.failure is null
  }
{code}

If there are better suggestions for improving this test let's take them 
onboard, I think using 5 sec timeout periods would be a place to start so folks 
don't need to needlessly triage this failure. Will add a few prints and report 
back

  was:
Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

{code}
// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be one timeout to increase

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be another timeout to increase

synchronized (callback0) {
  // not complete yet, but should complete soon
  assertEquals(-1, callback0.successLength);
  assertNull(callback0.failure);
  callback0.wait(2 * 1000);
  assertTrue(callback0.failure instanceof IOException);
}

synchronized (callback1) {
  // failed at same time as previous
  assert (callback0.failure instanceof IOException);
}
  }
{code}

The suite fails with this 1/3 of the time:

{code}
Tests run: 3, Failures: 1, Errors: 

[jira] [Resolved] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Oleksiy Sayankin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksiy Sayankin resolved SPARK-17563.
--
Resolution: Won't Fix

> Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with 
> Hive-2.X.X
> ---
>
> Key: SPARK-17563
> URL: https://issues.apache.org/jira/browse/SPARK-17563
> Project: Spark
>  Issue Type: Bug
>Reporter: Oleksiy Sayankin
>
> According to https://issues.apache.org/jira/browse/SPARK-14358 
> JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
> JavaSparkListener
> {code}
> package org.apache.hadoop.hive.ql.exec.spark.status.impl;
> import ...
> public class JobMetricsListener extends JavaSparkListener {
> {code}
> Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:
> {code}
> 2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
> org/apache/spark/JavaSparkListener
> {code}
> Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Oleksiy Sayankin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496622#comment-15496622
 ] 

Oleksiy Sayankin commented on SPARK-17563:
--

Created https://issues.apache.org/jira/browse/HIVE-14777

> Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with 
> Hive-2.X.X
> ---
>
> Key: SPARK-17563
> URL: https://issues.apache.org/jira/browse/SPARK-17563
> Project: Spark
>  Issue Type: Bug
>Reporter: Oleksiy Sayankin
>
> According to https://issues.apache.org/jira/browse/SPARK-14358 
> JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
> JavaSparkListener
> {code}
> package org.apache.hadoop.hive.ql.exec.spark.status.impl;
> import ...
> public class JobMetricsListener extends JavaSparkListener {
> {code}
> Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:
> {code}
> 2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
> org/apache/spark/JavaSparkListener
> {code}
> Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Oleksiy Sayankin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496614#comment-15496614
 ] 

Oleksiy Sayankin commented on SPARK-17563:
--

After three hours of fixing I have found out that there are too many changes in the 
Spark-2.0.0 API compared to the Spark-1.6.1 API to make the fix easy. I was able to 
fix the Spark Remote Client subproject, but the Hive Query Language module gives me 
a lot of errors.

{code}
[INFO] Hive ... SUCCESS [  0.883 s]
[INFO] Hive Shims Common .. SUCCESS [  2.424 s]
[INFO] Hive Shims 0.23  SUCCESS [  1.132 s]
[INFO] Hive Shims Scheduler ... SUCCESS [  0.299 s]
[INFO] Hive Shims . SUCCESS [  0.199 s]
[INFO] Hive Storage API ... SUCCESS [  0.851 s]
[INFO] Hive ORC ... SUCCESS [  2.346 s]
[INFO] Hive Common  SUCCESS [  3.567 s]
[INFO] Hive Serde . SUCCESS [  2.513 s]
[INFO] Hive Metastore . SUCCESS [ 10.782 s]
[INFO] Hive Ant Utilities . SUCCESS [  0.818 s]
[INFO] Hive Llap Common ... SUCCESS [  0.859 s]
[INFO] Hive Llap Client ... SUCCESS [  0.337 s]
[INFO] Hive Llap Tez .. SUCCESS [  0.525 s]
[INFO] Spark Remote Client  SUCCESS [  1.547 s]
[INFO] Hive Query Language  FAILURE [ 19.686 s]
[INFO] Hive Service ... SKIPPED
[INFO] Hive Accumulo Handler .. SKIPPED
[INFO] Hive JDBC .. SKIPPED
[INFO] Hive Beeline ... SKIPPED
[INFO] Hive CLI ... SKIPPED
[INFO] Hive Contrib ... SKIPPED
[INFO] Hive HBase Handler . SKIPPED
[INFO] Hive HCatalog .. SKIPPED
[INFO] Hive HCatalog Core . SKIPPED
[INFO] Hive HCatalog Pig Adapter .. SKIPPED
[INFO] Hive HCatalog Server Extensions  SKIPPED
[INFO] Hive HCatalog Webhcat Java Client .. SKIPPED
[INFO] Hive HCatalog Webhcat .. SKIPPED
[INFO] Hive HCatalog Streaming  SKIPPED
[INFO] Hive HPL/SQL ... SKIPPED
[INFO] Hive HWI ... SKIPPED
[INFO] Hive ODBC .. SKIPPED
[INFO] Hive Llap Server ... SKIPPED
[INFO] Hive Shims Aggregator .. SKIPPED
[INFO] Hive TestUtils . SKIPPED
[INFO] Hive Packaging . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 49.643 s
[INFO] Finished at: 2016-09-16T18:27:24+03:00
[INFO] Final Memory: 154M/2994M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project hive-exec: Compilation failure: Compilation failure:
[ERROR] 
/home/osayankin/git/myrepo/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java:[28,8]
 org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunction is not abstract and 
does not override abstract method 
call(java.util.Iterator>)
 in org.apache.spark.api.java.function.PairFlatMapFunction
[ERROR] 
/home/osayankin/git/myrepo/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java:[40,3]
 
call(java.util.Iterator>)
 in org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunction cannot implement 
call(T) in org.apache.spark.api.java.function.PairFlatMapFunction
[ERROR] return type 
java.lang.Iterable>
 is not compatible with 
java.util.Iterator>
[ERROR] 
/home/osayankin/git/myrepo/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java:[38,3]
 method does not override or implement a method from a supertype
[ERROR] 

[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496579#comment-15496579
 ] 

Hyukjin Kwon commented on SPARK-17545:
--

Thank you both for your feedback. I am okay with fixing this and, in general, with 
supporting some common quirky cases.

However, I'd like to note that we might have to avoid supporting the other quirky 
cases currently handled in {{DateTimeUtils.stringToTime}}.
More specifically, we should avoid using 
{{DatatypeConverter.parseDateTime(...)}} because an issue was identified with 
it - https://github.com/apache/spark/pull/14279#issuecomment-233887751

If we only allow the strict ISO 8601 format by default and handle other cases via 
{{timestampFormat}} and {{dateFormat}}, the problematic call above would not be 
reached; but if we allow other quirky cases there, this would introduce potential 
problems from 2.0 as well.

I left those usages only for backward compatibility and would personally prefer to 
avoid adding new logic there.

To cut this short, I am okay with adding this case there if we also fix the issue 
above, or if this case is fairly common.
Otherwise, I'd rather stay against it (although I am not the one to decide what 
should be added to Spark) and instead promote the use of 
{{timestampFormat}} and {{dateFormat}}.
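
For anyone hitting this in the meantime, a hedged sketch of that workaround; the 
path and the format pattern are illustrative only, and {{timestampFormat}} is the 
CSV reader option referenced above:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-timestamps").getOrCreate()

// SimpleDateFormat's 'Z' accepts RFC 822 style offsets such as -0500 (no colon),
// so the quirky values parse without relying on the default ISO 8601 handling.
val df = spark.read
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
  .option("inferSchema", "true")
  .csv("/path/to/splunk_export.csv") // hypothetical path
{code}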



> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496525#comment-15496525
 ] 

Sean Owen commented on SPARK-17545:
---

I think it depends on how hard it is to handle the quirk and whether it causes 
other problems. This one is probably not too hard to handle as well, and it's not 
that dangerous. If you want to open a PR, we can look at it.

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-16 Thread Nathan Beyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496505#comment-15496505
 ] 

Nathan Beyer commented on SPARK-17545:
--

I agree that the data is quirky and it's not how I'd personally serialize it, but 
it is a valid ISO 8601 format, even if a discouraged one. Also, the code already 
has precedent for dealing with "quirks".

{code}
val indexOfGMT = s.indexOf("GMT")
if (indexOfGMT != -1) {
  // ISO8601 with a weird time zone specifier (2000-01-01T00:00GMT+01:00)
  val s0 = s.substring(0, indexOfGMT)
  val s1 = s.substring(indexOfGMT + 3)
  // Mapped to 2000-01-01T00:00+01:00
  stringToTime(s0 + s1)
}
{code}
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L126

If the converters are going to handle some quirks, why wouldn't they handle this 
one? If no quirks are going to be handled, then I would suggest making the 
documentation explicit that only the W3C note's profile of ISO 8601 is supported.

FWIW - I'm getting this data via a CSV export from Splunk.

BTW - Thanks for the link to the PR, that's actually another issue that I was 
wondering about.
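
To make the shape of such a fix concrete, here is a hedged sketch (not Spark's 
implementation) that normalizes the missing-colon offset before delegating to the 
existing parsing; the helper name and regex are illustrative only:

{code}
// Insert the missing colon so "2015-11-21T23:15:01.499-0600"
// becomes "2015-11-21T23:15:01.499-06:00"; strings that already
// carry a colon (or no offset at all) pass through unchanged.
def normalizeOffset(s: String): String = {
  val NoColonOffset = """(.*[T ]\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?)([+-]\d{2})(\d{2})$""".r
  s match {
    case NoColonOffset(prefix, offsetHours, offsetMinutes) =>
      s"$prefix$offsetHours:$offsetMinutes"
    case _ => s
  }
}
{code}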

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2016-09-16 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-17564:
-
Description: 
Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

{code}
// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be one timeout to increase

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be another timeout to increase

synchronized (callback0) {
  // not complete yet, but should complete soon
  assertEquals(-1, callback0.successLength);
  assertNull(callback0.failure);
  callback0.wait(2 * 1000);
  assertTrue(callback0.failure instanceof IOException);
}

synchronized (callback1) {
  // failed at same time as previous
  assert (callback0.failure instanceof IOException);
}
  }
{code}

The suite fails with this 1/3 of the time:

{code}
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.487 sec <<< 
FAILURE! - in org.apache.spark.network.RequestTimeoutIntegrationSuite
furtherRequestsDelay(org.apache.spark.network.RequestTimeoutIntegrationSuite)  
Time elapsed: 11.297 sec  <<< FAILURE!
java.lang.AssertionError
at 
org.apache.spark.network.RequestTimeoutIntegrationSuite.furtherRequestsDelay(RequestTimeoutIntegrationSuite.java:230)
{code}

If there are better suggestions for improving this test let's take them 
onboard, I think using 5 sec timeout periods would be a place to start so folks 
don't need to needlessly triage this failure. Will add a few prints and report 
back

  was:
Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be one timeout to increase

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, 

[jira] [Updated] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2016-09-16 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-17564:
-
Description: 
Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be one timeout to increase

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be another timeout to increase

synchronized (callback0) {
  // not complete yet, but should complete soon
  assertEquals(-1, callback0.successLength);
  assertNull(callback0.failure);
  callback0.wait(2 * 1000);
  assertTrue(callback0.failure instanceof IOException);
}

synchronized (callback1) {
  // failed at same time as previous
  assert (callback0.failure instanceof IOException);
}
  }

The suite fails with this 1/3 of the time

Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.487 sec <<< 
FAILURE! - in org.apache.spark.network.RequestTimeoutIntegrationSuite
furtherRequestsDelay(org.apache.spark.network.RequestTimeoutIntegrationSuite)  
Time elapsed: 11.297 sec  <<< FAILURE!
java.lang.AssertionError
at 
org.apache.spark.network.RequestTimeoutIntegrationSuite.furtherRequestsDelay(RequestTimeoutIntegrationSuite.java:230)

If there are better suggestions for improving this test let's take them 
onboard, I think using 5 sec timeout periods would be a place to start so folks 
don't need to needlessly triage this failure. Will add a few prints and report 
back

  was:
Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be one timeout to increase

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);

[jira] [Updated] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2016-09-16 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-17564:
-
Component/s: Tests

> Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
> --
>
> Key: SPARK-17564
> URL: https://issues.apache.org/jira/browse/SPARK-17564
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Could be related to [SPARK-10680]
> This is the test and one fix would be to increase the timeouts from 1.2 
> seconds to 5 seconds
> // The timeout is relative to the LAST request sent, which is kinda weird, 
> but still.
>   // This test also makes sure the timeout works for Fetch requests as well 
> as RPCs.
>   @Test
>   public void furtherRequestsDelay() throws Exception {
> final byte[] response = new byte[16];
> final StreamManager manager = new StreamManager() {
>   @Override
>   public ManagedBuffer getChunk(long streamId, int chunkIndex) {
> Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
> return new NioManagedBuffer(ByteBuffer.wrap(response));
>   }
> };
> RpcHandler handler = new RpcHandler() {
>   @Override
>   public void receive(
>   TransportClient client,
>   ByteBuffer message,
>   RpcResponseCallback callback) {
> throw new UnsupportedOperationException();
>   }
>   @Override
>   public StreamManager getStreamManager() {
> return manager;
>   }
> };
> TransportContext context = new TransportContext(conf, handler);
> server = context.createServer();
> clientFactory = context.createClientFactory();
> TransportClient client = 
> clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());
> // Send one request, which will eventually fail.
> TestCallback callback0 = new TestCallback();
> client.fetchChunk(0, 0, callback0);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // This would be one timeout to increase
> // Send a second request before the first has failed.
> TestCallback callback1 = new TestCallback();
> client.fetchChunk(0, 1, callback1);
> Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
> // This would be another timeout to increase
> synchronized (callback0) {
>   // not complete yet, but should complete soon
>   assertEquals(-1, callback0.successLength);
>   assertNull(callback0.failure);
>   callback0.wait(2 * 1000);
>   assertTrue(callback0.failure instanceof IOException);
> }
> synchronized (callback1) {
>   // failed at same time as previous
>   assert (callback0.failure instanceof IOException);
> }
>   }
> If there are better suggestions for improving this test let's take them 
> onboard, I'll create a pull request using 5 sec timeout periods as a place to 
> start so folks don't need to needlessly triage this failure



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17564) Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay

2016-09-16 Thread Adam Roberts (JIRA)
Adam Roberts created SPARK-17564:


 Summary: Flaky RequestTimeoutIntegrationSuite, furtherRequestsDelay
 Key: SPARK-17564
 URL: https://issues.apache.org/jira/browse/SPARK-17564
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.1, 2.1.0
Reporter: Adam Roberts
Priority: Minor


Could be related to [SPARK-10680]

This is the test and one fix would be to increase the timeouts from 1.2 seconds 
to 5 seconds

// The timeout is relative to the LAST request sent, which is kinda weird, but 
still.
  // This test also makes sure the timeout works for Fetch requests as well as 
RPCs.
  @Test
  public void furtherRequestsDelay() throws Exception {
final byte[] response = new byte[16];
final StreamManager manager = new StreamManager() {
  @Override
  public ManagedBuffer getChunk(long streamId, int chunkIndex) {
Uninterruptibles.sleepUninterruptibly(FOREVER, TimeUnit.MILLISECONDS);
return new NioManagedBuffer(ByteBuffer.wrap(response));
  }
};
RpcHandler handler = new RpcHandler() {
  @Override
  public void receive(
  TransportClient client,
  ByteBuffer message,
  RpcResponseCallback callback) {
throw new UnsupportedOperationException();
  }

  @Override
  public StreamManager getStreamManager() {
return manager;
  }
};

TransportContext context = new TransportContext(conf, handler);
server = context.createServer();
clientFactory = context.createClientFactory();
TransportClient client = 
clientFactory.createClient(TestUtils.getLocalHost(), server.getPort());

// Send one request, which will eventually fail.
TestCallback callback0 = new TestCallback();
client.fetchChunk(0, 0, callback0);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be one timeout to increase

// Send a second request before the first has failed.
TestCallback callback1 = new TestCallback();
client.fetchChunk(0, 1, callback1);
Uninterruptibles.sleepUninterruptibly(1200, TimeUnit.MILLISECONDS);
// This would be another timeout to increase

synchronized (callback0) {
  // not complete yet, but should complete soon
  assertEquals(-1, callback0.successLength);
  assertNull(callback0.failure);
  callback0.wait(2 * 1000);
  assertTrue(callback0.failure instanceof IOException);
}

synchronized (callback1) {
  // failed at same time as previous
  assert (callback0.failure instanceof IOException);
}
  }

If there are better suggestions for improving this test let's take them 
onboard, I'll create a pull request using 5 sec timeout periods as a place to 
start so folks don't need to needlessly triage this failure



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianfei Wang updated SPARK-17562:
-
Comment: was deleted

(was: Happy Mid-Autumn Festival, thank you.)

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496324#comment-15496324
 ] 

Jianfei Wang commented on SPARK-17562:
--

Happy Mid-Autumn Festival, thank you.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496323#comment-15496323
 ] 

Jianfei Wang commented on SPARK-17562:
--

Happy Mid-Autumn Festival, thank you.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496326#comment-15496326
 ] 

Sean Owen commented on SPARK-2352:
--

Maybe, but that's also too broad to be actionable. There are still several 
related JIRAs open.

I wouldn't pay attention to the 'In Progress' status; that's automatically set 
when anybody opens a PR. It doesn't mean it's active or going to be merged. I 
don't know that any of the other related JIRAs are being worked on either.

In general, I don't think there's a goal to implement everything under the sun 
in MLlib, but just the basics. These things will probably exist outside MLlib 
in third party packages, which is as-intended. I'd rather be able to eventually 
use TensorFlow or something, ideally, rather than have another different impl 
here.

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>Assignee: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianfei Wang updated SPARK-17562:
-
Comment: was deleted

(was: Happy Mid-Autumn Festival, thank you.)

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2016-09-16 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496312#comment-15496312
 ] 

Alessio commented on SPARK-2352:


Dear Sean, I am well aware of the MultiLayer Perceptron classifier.
But you must agree that MLP is just a small branch of the ANN world. I reckon 
that's what the OP wanted to stress: not just MLPs or feedforward NNs, but also 
recursive networks, Boltzmann Machines and so on...

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>Assignee: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496302#comment-15496302
 ] 

Wenchen Fan commented on SPARK-17562:
-

I'm not familiar with this part, cc [~joshrosen]

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496283#comment-15496283
 ] 

Jianfei Wang edited comment on SPARK-17562 at 9/16/16 1:03 PM:
---

This function reverts writes that haven't been committed yet.
If an exception is thrown, we go into the finally block, which already handles the 
revert. Besides, if 0 objects were written, we needn't call this to revert; it would 
do nothing.


was (Author: codlife):
this func is to revert writes that haven't been committed yet.
If there are exception,we will go into finally.
Besides,if there are 0 object written,we needn't to call this to revert,it will 
do nothing.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496283#comment-15496283
 ] 

Jianfei Wang edited comment on SPARK-17562 at 9/16/16 1:03 PM:
---

This function reverts writes that haven't been committed yet.
If an exception is thrown, we go into the finally block, which already handles the 
revert. Besides, if 0 objects were written, we needn't call this to revert; it would 
do nothing.


was (Author: codlife):
this func is to revert writes that haven't been committed yet.
If there are exceptions,we will go into finally.
Besides,if there are 0 object written,we needn't to call this to revert,it will 
do nothing.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496283#comment-15496283
 ] 

Jianfei Wang commented on SPARK-17562:
--

This function reverts writes that haven't been committed yet.
If an exception is thrown, we go into the finally block, which already handles the 
revert. Besides, if 0 objects were written, we needn't call this to revert; it would 
do nothing.
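
(For illustration only, a minimal, self-contained Scala sketch of the control flow 
being proposed; DummyWriter and spill are hypothetical stand-ins, not Spark classes. 
With the else branch dropped, the revert happens only when an exception prevented 
success from being set.)

{code}
object SpillSketch {
  final class DummyWriter {
    def write(x: Int): Unit = println(s"write $x")
    def flush(): Unit = println("flush")
    def close(): Unit = println("close")
    def revertPartialWritesAndClose(): Unit = println("revert partial writes and close")
  }

  // Mirrors the proposed shape: no "else revert" after the loop; the finally
  // block reverts only when success was never set, i.e. an exception occurred.
  def spill(records: Iterator[Int], writer: DummyWriter): Unit = {
    var objectsWritten = 0
    var success = false
    try {
      records.foreach { r =>
        writer.write(r)
        objectsWritten += 1
      }
      if (objectsWritten > 0) writer.flush()
      success = true
    } finally {
      if (success) writer.close()
      else writer.revertPartialWritesAndClose()
    }
  }

  def main(args: Array[String]): Unit = {
    spill(Iterator.empty, new DummyWriter)     // 0 objects: close() is called, never revert
    spill(Iterator(1, 2, 3), new DummyWriter)  // normal path: flush() then close()
  }
}
{code}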

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496276#comment-15496276
 ] 

Sean Owen commented on SPARK-16121:
---

You can look at the branch/tag yourself: 
https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala
 
Yes, the change seems to have been applied.

> ListingFileCatalog does not list in parallel anymore
> 
>
> Key: SPARK-16121
> URL: https://issues.apache.org/jira/browse/SPARK-16121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 2.0.0
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown 
> below. When the number of user-provided paths is less than the value of 
> {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we 
> will not use parallel listing, which is different from what 1.6 does (for 
> 1.6, if the number of children of any inner dir is larger than the threshold, 
> we will use the parallel listing).
> {code}
> protected def listLeafFiles(paths: Seq[Path]): 
> mutable.LinkedHashSet[FileStatus] = {
> if (paths.length >= 
> sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>   HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, 
> sparkSession)
> } else {
>   // Dummy jobconf to get to the pathFilter defined in configuration
>   val jobConf = new JobConf(hadoopConf, this.getClass)
>   val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>   val statuses: Seq[FileStatus] = paths.flatMap { path =>
> val fs = path.getFileSystem(hadoopConf)
> logInfo(s"Listing $path on driver")
> Try {
>   HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), 
> pathFilter)
> }.getOrElse(Array.empty[FileStatus])
>   }
>   mutable.LinkedHashSet(statuses: _*)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496251#comment-15496251
 ] 

Sean Owen commented on SPARK-17562:
---

If you do that, revertPartialWritesAndClose() is never called in that case. I am 
not sure that's correct.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Oleksiy Sayankin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496232#comment-15496232
 ] 

Oleksiy Sayankin edited comment on SPARK-17563 at 9/16/16 12:40 PM:


{quote} JobMetricsListener is not part of Spark, right?{quote}
Yes. 

Well I can change JavaSparkListener --> SparkListener and 

1.6.1 --> 2.0.0 

in pom.xml in Hive-2.X.X. I guess this will work.


was (Author: osayankin):
Well I can change JavaSparkListener --> SparkListener and 

1.6.1 --> 2.0.0 

in pom.xml. I guess this will work.

> Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with 
> Hive-2.X.X
> ---
>
> Key: SPARK-17563
> URL: https://issues.apache.org/jira/browse/SPARK-17563
> Project: Spark
>  Issue Type: Bug
>Reporter: Oleksiy Sayankin
>
> According to https://issues.apache.org/jira/browse/SPARK-14358 
> JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
> JavaSparkListener
> {code}
> package org.apache.hadoop.hive.ql.exec.spark.status.impl;
> import ...
> public class JobMetricsListener extends JavaSparkListener {
> {code}
> Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:
> {code}
> 2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
> org/apache/spark/JavaSparkListener
> {code}
> Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Oleksiy Sayankin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496232#comment-15496232
 ] 

Oleksiy Sayankin commented on SPARK-17563:
--

Well I can change JavaSparkListener --> SparkListener and 

1.6.1 --> 2.0.0 

in pom.xml. I guess this will work.

> Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with 
> Hive-2.X.X
> ---
>
> Key: SPARK-17563
> URL: https://issues.apache.org/jira/browse/SPARK-17563
> Project: Spark
>  Issue Type: Bug
>Reporter: Oleksiy Sayankin
>
> According to https://issues.apache.org/jira/browse/SPARK-14358 
> JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
> JavaSparkListener
> {code}
> package org.apache.hadoop.hive.ql.exec.spark.status.impl;
> import ...
> public class JobMetricsListener extends JavaSparkListener {
> {code}
> Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:
> {code}
> 2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
> org/apache/spark/JavaSparkListener
> {code}
> Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore

2016-09-16 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496234#comment-15496234
 ] 

Gaurav Shah commented on SPARK-16121:
-

[~mengxr] was this fixed in 2.0.0, or is it planned for 2.0.1? My partition 
discovery takes about 10 minutes, and I guess this fix should help.
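
(In case it helps in the meantime, a spark-shell style sketch of a possible 
workaround, not verified against 2.0.0; the config key and data path below are 
assumptions, not confirmed in this ticket.)

{code}
import org.apache.spark.sql.SparkSession

// Lower the threshold so that even a handful of root paths takes the parallel
// listing branch in the quoted listLeafFiles code. The key name is assumed to be
// "spark.sql.sources.parallelPartitionDiscovery.threshold".
val spark = SparkSession.builder()
  .appName("parallel-listing-workaround")
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "1")
  .getOrCreate()

// Hypothetical heavily partitioned dataset.
val df = spark.read.parquet("/data/events")
{code}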

> ListingFileCatalog does not list in parallel anymore
> 
>
> Key: SPARK-16121
> URL: https://issues.apache.org/jira/browse/SPARK-16121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 2.0.0
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown 
> below. When the number of user-provided paths is less than the value of 
> {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we 
> will not use parallel listing, which is different from what 1.6 does (for 
> 1.6, if the number of children of any inner dir is larger than the threshold, 
> we will use the parallel listing).
> {code}
> protected def listLeafFiles(paths: Seq[Path]): 
> mutable.LinkedHashSet[FileStatus] = {
> if (paths.length >= 
> sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>   HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, 
> sparkSession)
> } else {
>   // Dummy jobconf to get to the pathFilter defined in configuration
>   val jobConf = new JobConf(hadoopConf, this.getClass)
>   val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>   val statuses: Seq[FileStatus] = paths.flatMap { path =>
> val fs = path.getFileSystem(hadoopConf)
> logInfo(s"Listing $path on driver")
> Try {
>   HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), 
> pathFilter)
> }.getOrElse(Array.empty[FileStatus])
>   }
>   mutable.LinkedHashSet(statuses: _*)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496233#comment-15496233
 ] 

Jagadeesan A S commented on SPARK-17561:


It's okay.. :)

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Priority: Trivial
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496186#comment-15496186
 ] 

Sean Owen commented on SPARK-17561:
---

Oh, whoops I didn't actually see this until now. Sorry about that, I had 
already replied and made a PR.

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Priority: Trivial
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496183#comment-15496183
 ] 

Jianfei Wang commented on SPARK-17562:
--

If 0 objects are written, we should just set the success flag and then go into 
the finally block; we needn't call this.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496167#comment-15496167
 ] 

Sean Owen commented on SPARK-17562:
---

Can you explain why? It looks possible that 0 objects are written.

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496162#comment-15496162
 ] 

Sean Owen commented on SPARK-17563:
---

JobMetricsListener is not part of Spark, right?
It needs to change to be compatible with Spark 2.0.
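
(For context, a minimal Scala sketch of the kind of change required on the caller's 
side, assuming Spark 2.0 where SparkListener is an abstract class with no-op default 
callbacks; HiveJobMetricsListenerSketch is a hypothetical stand-in, not actual Hive 
code.)

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerTaskEnd}

// Extend SparkListener directly instead of the removed JavaSparkListener and
// override only the callbacks that are needed.
class HiveJobMetricsListenerSketch extends SparkListener {

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // e.g. accumulate taskEnd.taskMetrics per job/stage here
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    // e.g. mark jobEnd.jobId as finished here
  }
}
{code}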

> Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with 
> Hive-2.X.X
> ---
>
> Key: SPARK-17563
> URL: https://issues.apache.org/jira/browse/SPARK-17563
> Project: Spark
>  Issue Type: Bug
>Reporter: Oleksiy Sayankin
>
> According to https://issues.apache.org/jira/browse/SPARK-14358 
> JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
> JavaSparkListener
> {code}
> package org.apache.hadoop.hive.ql.exec.spark.status.impl;
> import ...
> public class JobMetricsListener extends JavaSparkListener {
> {code}
> Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:
> {code}
> 2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
> org/apache/spark/JavaSparkListener
> {code}
> Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496147#comment-15496147
 ] 

Jianfei Wang commented on SPARK-17562:
--

[~cloud_fan] Can you check this? Thank you!

> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianfei Wang updated SPARK-17562:
-
Description: 
In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will never 
be executed, so we can remove them
{code}
else  {
writer.revertPartialWritersAndClose()
}
{code}
the source code is as below:
{code}
try {
  while (inMemoryIterator.hasNext) {
val partitionId = inMemoryIterator.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions,
  s"partition Id: ${partitionId} should be in the range [0, 
${numPartitions})")
inMemoryIterator.writeNext(writer)
elementsPerPartition(partitionId) += 1
objectsWritten += 1

if (objectsWritten == serializerBatchSize) {
  flush()
}
  }
  if (objectsWritten > 0) {
flush()
  } else {
writer.revertPartialWritesAndClose()
   }
  success = true
} finally {
  if (success) {
writer.close()
  } else {
// This code path only happens if an exception was thrown above before 
we set success;
// close our stuff and let the exception be thrown further
writer.revertPartialWritesAndClose()
if (file.exists()) {
  if (!file.delete()) {
logWarning(s"Error deleting ${file}")
  }
}
  }
}
{code}

  was:
In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will never 
be executed, so we can remove them
else  {
writer.revertPartialWritersAndClose()
}

the source code is as below:
{code}
try {
  while (inMemoryIterator.hasNext) {
val partitionId = inMemoryIterator.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions,
  s"partition Id: ${partitionId} should be in the range [0, 
${numPartitions})")
inMemoryIterator.writeNext(writer)
elementsPerPartition(partitionId) += 1
objectsWritten += 1

if (objectsWritten == serializerBatchSize) {
  flush()
}
  }
  if (objectsWritten > 0) {
flush()
  } else {
writer.revertPartialWritesAndClose()
   }
  success = true
} finally {
  if (success) {
writer.close()
  } else {
// This code path only happens if an exception was thrown above before 
we set success;
// close our stuff and let the exception be thrown further
writer.revertPartialWritesAndClose()
if (file.exists()) {
  if (!file.delete()) {
logWarning(s"Error deleting ${file}")
  }
}
  }
}
{code}


> I think a little code is unnecessary to exist in 
> ExternalSorter.spillMemoryIteratorToDisk
> -
>
> Key: SPARK-17562
> URL: https://issues.apache.org/jira/browse/SPARK-17562
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix, performance
>
> In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will 
> never be executed, so we can remove them
> {code}
> else  {
> writer.revertPartialWritersAndClose()
> }
> {code}
> the source code is as below:
> {code}
> try {
>   while (inMemoryIterator.hasNext) {
> val partitionId = inMemoryIterator.nextPartition()
> require(partitionId >= 0 && partitionId < numPartitions,
>   s"partition Id: ${partitionId} should be in the range [0, 
> ${numPartitions})")
> inMemoryIterator.writeNext(writer)
> elementsPerPartition(partitionId) += 1
> objectsWritten += 1
> if (objectsWritten == serializerBatchSize) {
>   flush()
> }
>   }
>   if (objectsWritten > 0) {
> flush()
>   } else {
> writer.revertPartialWritesAndClose()
>}
>   success = true
> } finally {
>   if (success) {
> writer.close()
>   } else {
> // This code path only happens if an exception was thrown above 
> before we set success;
> // close our stuff and let the exception be thrown further
> writer.revertPartialWritesAndClose()
> if (file.exists()) {
>   if (!file.delete()) {
> logWarning(s"Error deleting ${file}")
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.X.X

2016-09-16 Thread Oleksiy Sayankin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksiy Sayankin updated SPARK-17563:
-
Summary: Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work 
with Hive-2.X.X  (was: Add org/apache/spark/JavaSparkListener to make 
Spark-2.0.0 work with Hive-2.0.0)

> Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with 
> Hive-2.X.X
> ---
>
> Key: SPARK-17563
> URL: https://issues.apache.org/jira/browse/SPARK-17563
> Project: Spark
>  Issue Type: Bug
>Reporter: Oleksiy Sayankin
>
> According to https://issues.apache.org/jira/browse/SPARK-14358 
> JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
> JavaSparkListener
> {code}
> package org.apache.hadoop.hive.ql.exec.spark.status.impl;
> import ...
> public class JobMetricsListener extends JavaSparkListener {
> {code}
> Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:
> {code}
> 2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
> org/apache/spark/JavaSparkListener
> {code}
> Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17563) Add org/apache/spark/JavaSparkListener to make Spark-2.0.0 work with Hive-2.0.0

2016-09-16 Thread Oleksiy Sayankin (JIRA)
Oleksiy Sayankin created SPARK-17563:


 Summary: Add org/apache/spark/JavaSparkListener to make 
Spark-2.0.0 work with Hive-2.0.0
 Key: SPARK-17563
 URL: https://issues.apache.org/jira/browse/SPARK-17563
 Project: Spark
  Issue Type: Bug
Reporter: Oleksiy Sayankin


According to https://issues.apache.org/jira/browse/SPARK-14358 
JavaSparkListener was deleted from Spark-2.0.0, but Hive-2.X.X uses 
JavaSparkListener

{code}
package org.apache.hadoop.hive.ql.exec.spark.status.impl;

import ...

public class JobMetricsListener extends JavaSparkListener {
{code}

Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:

{code}
2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
(SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
org/apache/spark/JavaSparkListener
{code}

Please add JavaSparkListener into Spark-2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17562) I think a little code is unnecessary to exist in ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread Jianfei Wang (JIRA)
Jianfei Wang created SPARK-17562:


 Summary: I think a little code is unnecessary to exist in 
ExternalSorter.spillMemoryIteratorToDisk
 Key: SPARK-17562
 URL: https://issues.apache.org/jira/browse/SPARK-17562
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Jianfei Wang
Priority: Trivial


In ExternalSorter.spillMemoryIteratorToDisk, I think the code below will never 
be executed, so we can remove them
else  {
writer.revertPartialWritersAndClose()
}

the source code is as below:
{code}
try {
  while (inMemoryIterator.hasNext) {
val partitionId = inMemoryIterator.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions,
  s"partition Id: ${partitionId} should be in the range [0, 
${numPartitions})")
inMemoryIterator.writeNext(writer)
elementsPerPartition(partitionId) += 1
objectsWritten += 1

if (objectsWritten == serializerBatchSize) {
  flush()
}
  }
  if (objectsWritten > 0) {
flush()
  } else {
writer.revertPartialWritesAndClose()
   }
  success = true
} finally {
  if (success) {
writer.close()
  } else {
// This code path only happens if an exception was thrown above before 
we set success;
// close our stuff and let the exception be thrown further
writer.revertPartialWritesAndClose()
if (file.exists()) {
  if (!file.delete()) {
logWarning(s"Error deleting ${file}")
  }
}
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14358) Change SparkListener from a trait to an abstract class, and remove JavaSparkListener

2016-09-16 Thread Oleksiy Sayankin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496130#comment-15496130
 ] 

Oleksiy Sayankin commented on SPARK-14358:
--

Hive-2.X.X uses JavaSparkListener

{code}
package org.apache.hadoop.hive.ql.exec.spark.status.impl;

import ...

public class JobMetricsListener extends JavaSparkListener {
{code}

Configuring Hive-2.X.X on Spark-2.0.0 will give an exception:

{code}

2016-09-16T11:20:57,474 INFO  [stderr-redir-1]: client.SparkClientImpl 
(SparkClientImpl.java:run(593)) - java.lang.NoClassDefFoundError: 
org/apache/spark/JavaSparkListener
{code}

Please add JavaSparkListener into Spark-2.0.0

> Change SparkListener from a trait to an abstract class, and remove 
> JavaSparkListener
> 
>
> Key: SPARK-14358
> URL: https://issues.apache.org/jira/browse/SPARK-14358
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Scala traits are difficult to maintain binary compatibility on, and as a 
> result we had to introduce JavaSparkListener. In Spark 2.0 we can change 
> SparkListener from a trait to an abstract class and then remove 
> JavaSparkListener.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17426:

Assignee: Sean Zhong

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON make take huge 
> memory, which may trigger out of memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17426:

Fix Version/s: (was: 2.2.0)
   2.1.0

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON make take huge 
> memory, which may trigger out of memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17426.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 14990
[https://github.com/apache/spark/pull/14990]

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.2.0
>
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON make take huge 
> memory, which may trigger out of memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17561:


Assignee: (was: Apache Spark)

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Priority: Trivial
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496022#comment-15496022
 ] 

Apache Spark commented on SPARK-17561:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15117

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Priority: Trivial
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17561:


Assignee: Apache Spark

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Assignee: Apache Spark
>Priority: Trivial
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-09-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495984#comment-15495984
 ] 

Takeshi Yamamuro commented on SPARK-5484:
-

This component seems fairly inactive, so I think there is little chance your 
code will get reviewed if you take this on.

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.
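
(For reference, a generic, hedged sketch of the periodic-checkpointing technique the 
description above asks for; the checkpoint directory, interval, and toy computation 
are made up, and this is not the actual Pregel code.)

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("periodic-checkpoint-sketch").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory

var ranks = sc.parallelize(1 to 1000).map(i => (i.toLong, 1.0))
val checkpointInterval = 25
for (iter <- 1 to 100) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15).cache()
  if (iter % checkpointInterval == 0) {
    ranks.checkpoint()  // truncates the lineage chain
    ranks.count()       // forces materialization so the checkpoint actually happens
  }
}
{code}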



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495968#comment-15495968
 ] 

Sean Owen commented on SPARK-17560:
---

You just set it with --conf like other options. However, I don't think you want 
to set this in general. I do not see what would need to be changed in Spark.
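
(A spark-shell style sketch, assuming the option in question is spark.sql.caseSensitive, 
which this thread does not name explicitly; whether it changes what tableNames() 
reports is not confirmed here.)

{code}
import org.apache.spark.sql.SparkSession

// Submit-time form (assumption): spark-submit --conf spark.sql.caseSensitive=true ...
// Runtime form on an existing session:
val spark = SparkSession.builder().appName("case-sensitivity-demo").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")

spark.range(1).createOrReplaceTempView("TestTable")
spark.sqlContext.tableNames().foreach(println)
{code}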

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495960#comment-15495960
 ] 

Aseem Bansal commented on SPARK-17560:
--

Can you share where this option needs to be set? Maybe I can try to add a pull 
request, unless it is easier for you to just open a PR yourself instead of 
explaining.

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495955#comment-15495955
 ] 

Sean Owen commented on SPARK-17560:
---

It doesn't exist in Spark 1.4, and it is an undocumented option. Behavior in 2.x 
is not necessarily the same as in 1.x, and as I understand it, SQL is generally 
case insensitive now.

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17561:
--
Affects Version/s: 2.0.0
 Priority: Trivial  (was: Major)
  Component/s: SQL

See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and 
the changes I just made. This was not "Major".

Yes, the HTML in the scaladoc isn't valid. It's easy to fix.

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>Priority: Trivial
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw  that the docs have formatting problems
> !screenshot-1.png!
> Tried with browser cache disabled. Same issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13210) NPE in Sort

2016-09-16 Thread Tzach Zohar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495921#comment-15495921
 ] 

Tzach Zohar commented on SPARK-13210:
-

I've just seen this happening on Spark 1.6.2 - very similar stack trace (the line 
number changed slightly, but it is the same stack).

I'm guessing the fix wasn't ported to 1.6.1 eventually? 
If so, maybe the 1.6.* versions should be added to "affected versions"?

{code:none}
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:351)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:56)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
at 
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:235)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:122)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:201)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:332)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:373)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:139)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$4.apply$mcV$sp(WriterContainer.scala:377)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$4.apply(WriterContainer.scala:343)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$4.apply(WriterContainer.scala:343)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1277)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:409)
... 8 more
{code}



> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at 

[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495906#comment-15495906
 ] 

Aseem Bansal commented on SPARK-17560:
--

Looked through 
https://spark.apache.org/docs/2.0.0/sql-programming-guide.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkConf.html

and none of them say anything about this parameter

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495899#comment-15495899
 ] 

Jagadeesan A S commented on SPARK-17561:


[~anshbansal] Thanks for reporting. Started working on this.

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>Reporter: Aseem Bansal
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw that the docs have formatting problems.
> !screenshot-1.png!
> Tried with the browser cache disabled; same issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Aseem Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aseem Bansal updated SPARK-17561:
-
Description: 
I visited this page
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

and saw that the docs have formatting problems.

!screenshot-1.png!

  was:
I visited this page
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

and saw that the docs have formatting problems.


> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>Reporter: Aseem Bansal
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw that the docs have formatting problems.
> !screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Aseem Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aseem Bansal updated SPARK-17561:
-
Description: 
I visited this page
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

and saw that the docs have formatting problems.

!screenshot-1.png!

Tried with the browser cache disabled; same issue.

  was:
I visited this page
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

and saw that the docs have formatting problems.

!screenshot-1.png!


> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>Reporter: Aseem Bansal
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw that the docs have formatting problems.
> !screenshot-1.png!
> Tried with the browser cache disabled; same issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Aseem Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aseem Bansal updated SPARK-17561:
-
Attachment: screenshot-1.png

> DataFrameWriter documentation formatting problems
> -
>
> Key: SPARK-17561
> URL: https://issues.apache.org/jira/browse/SPARK-17561
> Project: Spark
>  Issue Type: Documentation
>Reporter: Aseem Bansal
> Attachments: screenshot-1.png
>
>
> I visited this page
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
> and saw that the docs have formatting problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17561) DataFrameWriter documentation formatting problems

2016-09-16 Thread Aseem Bansal (JIRA)
Aseem Bansal created SPARK-17561:


 Summary: DataFrameWriter documentation formatting problems
 Key: SPARK-17561
 URL: https://issues.apache.org/jira/browse/SPARK-17561
 Project: Spark
  Issue Type: Documentation
Reporter: Aseem Bansal
 Attachments: screenshot-1.png

I visited this page
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

and saw that the docs have formatting problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495862#comment-15495862
 ] 

Aseem Bansal commented on SPARK-17560:
--

No I did not. Where?

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495862#comment-15495862
 ] 

Aseem Bansal edited comment on SPARK-17560 at 9/16/16 9:38 AM:
---

No, I did not. Where? I had not set that in Spark 1.4 either.


was (Author: anshbansal):
No I did not. Where?

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17534) Increase timeouts for DirectKafkaStreamSuite tests

2016-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17534.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15094
[https://github.com/apache/spark/pull/15094]

> Increase timeouts for DirectKafkaStreamSuite tests
> --
>
> Key: SPARK-17534
> URL: https://issues.apache.org/jira/browse/SPARK-17534
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Trivial
> Fix For: 2.1.0
>
>
> The current tests in DirectKafkaStreamSuite rely on a certain number of 
> messages being received within a given timeframe
> On machines with four+ cores and better clock speeds, this doesn't pose a 
> problem, but on a two core x86 box I regularly see timeouts within two tests 
> within this suite.
> To avoid other users hitting the same problem and needlessly doing their own 
> investigations, let's increase the timeouts. By using 1 sec instead of 0.2 
> sec batch durations and increasing the timeout to be 100 seconds not 20, we 
> consistently see the tests passing even on less powerful hardware
> I see the problem consistently using a machine with 2x
> Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD
> Pull request to follow
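
As a rough illustration of the change described above (this is not the actual 
pull request), here is a minimal ScalaTest sketch that widens the patience 
settings to a 100-second timeout with a 1-second polling interval; the suite 
name, the simulated data source and the threshold are all hypothetical:

{code}
import java.util.concurrent.atomic.AtomicInteger

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import org.scalatest.FunSuite
import org.scalatest.concurrent.Eventually
import org.scalatest.time.SpanSugar._

// Hypothetical stand-in for a streaming suite; only the patience values
// illustrate the change (generous timeout, coarser polling).
class GenerousTimeoutSketch extends FunSuite with Eventually {
  test("messages show up within a generous window") {
    val received = new AtomicInteger(0)
    // Simulate data trickling in slowly, as on an underpowered test machine.
    Future { Thread.sleep(3000); received.set(5) }

    // 100-second timeout and 1-second polling interval instead of 20s / 0.2s.
    eventually(timeout(100.seconds), interval(1.second)) {
      assert(received.get() >= 5)
    }
  }
}
{code}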



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17543:
--
Assignee: Jagadeesan A S

> Missing log4j config file for tests in common/network-shuffle
> -
>
> Key: SPARK-17543
> URL: https://issues.apache.org/jira/browse/SPARK-17543
> Project: Spark
>  Issue Type: Bug
>Reporter: Frederick Reiss
>Assignee: Jagadeesan A S
>Priority: Trivial
>  Labels: starter
> Fix For: 2.1.0
>
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> The Maven module {{common/network-shuffle}} does not have a log4j 
> configuration file for its test cases. Usually these configuration files are 
> located inside each module, in the directory {{src/test/resources}}. The 
> missing configuration file leads to a scary-looking but harmless series of 
> errors and stack traces in Spark build logs:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
> at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at org.apache.log4j.Logger.getLogger(Logger.java:104)
> at 
> io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34)
> ...
> {noformat}
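
For reference, a minimal log4j 1.x test configuration of the kind usually 
placed at {{src/test/resources/log4j.properties}} looks roughly like the 
sketch below; the exact file added by the pull request may differ, and the 
{{target/unit-tests.log}} destination is only an assumption borrowed from the 
convention used in other Spark modules:

{code:none}
# Sketch of a log4j 1.x test config; not necessarily the file the PR adds.
log4j.rootCategory=INFO, file

log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
{code}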



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17543.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15108
[https://github.com/apache/spark/pull/15108]

> Missing log4j config file for tests in common/network-shuffle
> -
>
> Key: SPARK-17543
> URL: https://issues.apache.org/jira/browse/SPARK-17543
> Project: Spark
>  Issue Type: Bug
>Reporter: Frederick Reiss
>Priority: Trivial
>  Labels: starter
> Fix For: 2.1.0
>
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> The Maven module {{common/network-shuffle}} does not have a log4j 
> configuration file for its test cases. Usually these configuration files are 
> located inside each module, in the directory {{src/test/resources}}. The 
> missing configuration file leads to a scary-looking but harmless series of 
> errors and stack traces in Spark build logs:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
> at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at org.apache.log4j.Logger.getLogger(Logger.java:104)
> at 
> io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495832#comment-15495832
 ] 

Sean Owen commented on SPARK-17560:
---

Did you set spark.sql.caseSensitive=true?
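
For what it's worth, a minimal sketch of trying that setting against the 
reported scenario (the app name and data are made up, and whether the flag 
actually changes what {{tableNames()}} returns is exactly the open question 
here):

{code}
import org.apache.spark.sql.SparkSession

object TableNameCaseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-17560-sketch")
      .master("local[*]")
      .config("spark.sql.caseSensitive", "true") // the setting asked about
      .getOrCreate()
    import spark.implicits._

    // Mirror the report: register a temp view with a mixed-case name...
    Seq((1, "a"), (2, "b")).toDF("id", "value").createOrReplaceTempView("TestTable")

    // ...then list the registered names. The report says this prints "testtable".
    spark.sqlContext.tableNames().foreach(println)

    spark.stop()
  }
}
{code}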

> SQLContext tables returns table names in lower case only
> 
>
> Key: SPARK-17560
> URL: https://issues.apache.org/jira/browse/SPARK-17560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I registered a table using
> dataSet.createOrReplaceTempView("TestTable");
> Then I tried to get the list of tables using 
> sparkSession.sqlContext().tableNames()
> but the name that I got was testtable. It used to give table names in proper 
> case in Spark 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17560) SQLContext tables returns table names in lower case only

2016-09-16 Thread Aseem Bansal (JIRA)
Aseem Bansal created SPARK-17560:


 Summary: SQLContext tables returns table names in lower case only
 Key: SPARK-17560
 URL: https://issues.apache.org/jira/browse/SPARK-17560
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Aseem Bansal


I registered a table using

dataSet.createOrReplaceTempView("TestTable");

Then I tried to get the list of tables using 

sparkSession.sqlContext().tableNames()

but the name that I got was testtable. It used to return table names in their 
original case in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2016-09-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6385.
--
Resolution: Duplicate

> ISO 8601 timestamp parsing does not support arbitrary precision second 
> fractions
> 
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
> not support arbitrary precision fractions of seconds, only millisecond 
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in 
> [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66].
>  I'm willing to implement a fix, but pointers on the direction would be 
> appreciated.
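
Even though this is resolved as a duplicate, a quick sketch of what arbitrary 
precision fraction parsing looks like may still be useful: {{java.time}} 
formatters accept between 0 and 9 fractional-second digits, so both variants 
from the description parse. (The strings below use a plain {{+00:00}} offset; 
the {{GMT-00:00}} suffix in the report would need separate handling.)

{code}
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

object FractionPrecisionSketch {
  def main(args: Array[String]): Unit = {
    // ISO_OFFSET_DATE_TIME allows 0 to 9 fractional-second digits,
    // so 3-digit and 4-digit fractions both parse.
    val fmt = DateTimeFormatter.ISO_OFFSET_DATE_TIME
    println(OffsetDateTime.parse("2015-02-02T00:00:07.900+00:00", fmt))
    println(OffsetDateTime.parse("2015-02-02T00:00:07.9000+00:00", fmt))
  }
}
{code}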



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-09-16 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495496#comment-15495496
 ] 

Maciej Bryński commented on SPARK-16534:


[~rxin]
I understand your point of view, but I think that DStreams are sometimes the 
only option (especially since there is no Dataset support in the Python world).
I'm using df.rdd.map a lot in my ETLs, and DStreams are the natural 
continuation when moving from the batch world to streaming.

Right now I have a production streaming environment developed in Python, and 
I'm looking forward to using the new Kafka API (mostly because of SSL support, 
but for other features too). And I hope that I'm not the only one.

Could we get back to work on this feature?

> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


