[jira] [Created] (SPARK-18042) OutputWriter needs to return the path of the file written

2016-10-21 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18042:
---

 Summary: OutputWriter needs to return the path of the file written
 Key: SPARK-18042
 URL: https://issues.apache.org/jira/browse/SPARK-18042
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Updated] (SPARK-18042) OutputWriter needs to return the path of the file written

2016-10-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18042:

Description: 
Without this we won't be able to actually use the normal OutputWriter in 
streaming.


> OutputWriter needs to return the path of the file written
> -
>
> Key: SPARK-18042
> URL: https://issues.apache.org/jira/browse/SPARK-18042
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Without this we won't be able to actually use the normal OutputWriter in 
> streaming.






[jira] [Commented] (SPARK-18042) OutputWriter needs to return the path of the file written

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594353#comment-15594353
 ] 

Apache Spark commented on SPARK-18042:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15580

> OutputWriter needs to return the path of the file written
> -
>
> Key: SPARK-18042
> URL: https://issues.apache.org/jira/browse/SPARK-18042
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Without this we won't be able to actually use the normal OutputWriter in 
> streaming.






[jira] [Assigned] (SPARK-18042) OutputWriter needs to return the path of the file written

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18042:


Assignee: Reynold Xin  (was: Apache Spark)

> OutputWriter needs to return the path of the file written
> -
>
> Key: SPARK-18042
> URL: https://issues.apache.org/jira/browse/SPARK-18042
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Without this we won't be able to actually use the normal OutputWriter in 
> streaming.






[jira] [Assigned] (SPARK-18042) OutputWriter needs to return the path of the file written

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18042:


Assignee: Apache Spark  (was: Reynold Xin)

> OutputWriter needs to return the path of the file written
> -
>
> Key: SPARK-18042
> URL: https://issues.apache.org/jira/browse/SPARK-18042
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Without this we won't be able to actually use the normal OutputWriter in 
> streaming.






[jira] [Updated] (SPARK-17924) Consolidate streaming and batch write path

2016-10-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17924:

Description: 
Structured streaming and normal SQL operations currently have two separate write 
paths, leading to a lot of duplicated (similar-looking) functions and if 
branches. The purpose of this ticket is to consolidate the two as much as 
possible to make the write path clearer.

A side-effect of this change is that streaming will automatically support all 
the file formats.

  was:
Structured streaming and normal SQL operation currently have two separate write 
path, leading to a lot of duplicated functions (that look similar) and if 
branches. The purpose of this ticket is to consolidate the two as much as 
possible to make the write path more clear.



> Consolidate streaming and batch write path
> --
>
> Key: SPARK-17924
> URL: https://issues.apache.org/jira/browse/SPARK-17924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Structured streaming and normal SQL operations currently have two separate 
> write paths, leading to a lot of duplicated (similar-looking) functions and 
> if branches. The purpose of this ticket is to consolidate the two as much as 
> possible to make the write path clearer.
> A side-effect of this change is that streaming will automatically support all 
> the file formats.






[jira] [Updated] (SPARK-18042) OutputWriter needs to return the path of the file written

2016-10-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18042:

Description: 
This patch adds a new "path" method on OutputWriter that returns the path of 
the file written by the OutputWriter. This is part of the necessary work to 
consolidate structured streaming and batch write paths.

The batch write path has a nice feature in that each data source can define the 
extension of the files while allowing Spark to specify the staging directory and 
the prefix for the files. However, in the streaming path we need to collect the 
list of files written, and there is no interface right now to do that.



  was:
Without this we won't be able to actually use the normal OutputWriter in 
streaming.



> OutputWriter needs to return the path of the file written
> -
>
> Key: SPARK-18042
> URL: https://issues.apache.org/jira/browse/SPARK-18042
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This patch adds a new "path" method on OutputWriter that returns the path of 
> the file written by the OutputWriter. This is part of the necessary work to 
> consolidate structured streaming and batch write paths.
> The batch write path has a nice feature in that each data source can define the 
> extension of the files while allowing Spark to specify the staging directory and 
> the prefix for the files. However, in the streaming path we need to collect 
> the list of files written, and there is no interface right now to do that.
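For illustration, a minimal Scala sketch of the kind of interface change described above. This is schematic only (names and signatures are assumptions, not Spark's actual OutputWriter API): the writer exposes the path of the file it produces so a streaming sink can record it.

{code}
// Schematic sketch only, not Spark's actual OutputWriter API: the writer
// exposes the path of the single file it produces so that a streaming sink
// can collect the list of files written in each batch.
abstract class OutputWriterSketch {
  /** Path of the file this writer writes to. */
  def path: String

  def write(row: Seq[Any]): Unit
  def close(): Unit
}

class CsvOutputWriterSketch(stagingDir: String, prefix: String) extends OutputWriterSketch {
  // The data source chooses the extension; Spark supplies the staging
  // directory and the file prefix.
  override val path: String = s"$stagingDir/$prefix.csv"

  override def write(row: Seq[Any]): Unit = ()   // append the row to the file
  override def close(): Unit = ()                // flush; the sink records `path`
}
{code}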






[jira] [Commented] (SPARK-15472) Add support for writing partitioned `csv`, `json`, `text` formats in Structured Streaming

2016-10-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594369#comment-15594369
 ] 

Reynold Xin commented on SPARK-15472:
-

This actually will be subsumed by SPARK-17924.

> Add support for writing partitioned `csv`, `json`, `text` formats in 
> Structured Streaming
> -
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin(Inactive)
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, and `text` 
> formats.
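For reference, once this is supported the usage would presumably mirror the existing parquet path. A rough Scala sketch; the schema and all paths here are placeholders, not anything from this ticket:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Rough sketch of the requested feature, mirroring the partitioned parquet
// support from SPARK-14716. All paths here are placeholders.
object PartitionedJsonStreamSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-json-sink").getOrCreate()

    val schema = StructType(Seq(
      StructField("date", StringType),
      StructField("value", IntegerType)))

    val events = spark.readStream.schema(schema).json("/tmp/input")

    val query = events.writeStream
      .format("json")                          // csv and text would work the same way
      .partitionBy("date")                     // one output directory per date value
      .option("checkpointLocation", "/tmp/checkpoints")
      .start("/tmp/output")

    query.awaitTermination()
  }
}
{code}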






[jira] [Assigned] (SPARK-15472) Add support for writing partitioned `csv`, `json`, `text` formats in Structured Streaming

2016-10-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-15472:
---

Assignee: Reynold Xin

> Add support for writing partitioned `csv`, `json`, `text` formats in 
> Structured Streaming
> -
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin(Inactive)
>Assignee: Reynold Xin
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, and `text` 
> formats.






[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594371#comment-15594371
 ] 

Reynold Xin commented on SPARK-17829:
-

I like option 3! (in reality it is a more general version of option 2).


> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use Java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across Spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user-readable serialization and 
> use that instead.  JSON is probably a good option.
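A minimal Scala sketch of the kind of interface being proposed, assuming one JSON string per offset; the names here are illustrative only, not the actual design:

{code}
// Illustrative sketch only: each offset provides a stable, human-readable
// JSON form that the offset WAL can store instead of Java-serialized bytes.
trait JsonOffset {
  /** A user-readable, version-stable serialization of this offset. */
  def json: String
}

case class FileStreamOffsetSketch(maxFileId: Long) extends JsonOffset {
  override def json: String = s"""{"maxFileId":$maxFileId}"""
}

object FileStreamOffsetSketch {
  // Parsing kept deliberately trivial for the sketch; a real implementation
  // would use a proper JSON library.
  def fromJson(s: String): FileStreamOffsetSketch =
    FileStreamOffsetSketch("""\d+""".r.findFirstIn(s).map(_.toLong).getOrElse(0L))
}
{code}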






[jira] [Created] (SPARK-18043) Java example for Broadcasting

2016-10-21 Thread Akash Sethi (JIRA)
Akash Sethi created SPARK-18043:
---

 Summary: Java example for Broadcasting
 Key: SPARK-18043
 URL: https://issues.apache.org/jira/browse/SPARK-18043
 Project: Spark
  Issue Type: Task
  Components: Examples
Reporter: Akash Sethi
Priority: Critical


I have created a Java example for broadcasting, similar to the one in Scala. I 
would like to contribute the code for it.






[jira] [Updated] (SPARK-18043) Java example for Broadcasting

2016-10-21 Thread Akash Sethi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akash Sethi updated SPARK-18043:

Attachment: JavaBroadcastTest.java

> Java example for Broadcasting
> -
>
> Key: SPARK-18043
> URL: https://issues.apache.org/jira/browse/SPARK-18043
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Akash Sethi
>Priority: Critical
>  Labels: patch
> Attachments: JavaBroadcastTest.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I have created a Java example for broadcasting, similar to the one in Scala. I 
> would like to contribute the code for it.






[jira] [Commented] (SPARK-18043) Java example for Broadcasting

2016-10-21 Thread Akash Sethi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594410#comment-15594410
 ] 

Akash Sethi commented on SPARK-18043:
-

A broadcasting example is missing from the Java examples, so I have created and 
uploaded a Java file in the attachments. If you like, I can work on more 
examples; if changes are needed, let me know.
Thanks

> Java example for Broadcasting
> -
>
> Key: SPARK-18043
> URL: https://issues.apache.org/jira/browse/SPARK-18043
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Akash Sethi
>Priority: Critical
>  Labels: patch
> Attachments: JavaBroadcastTest.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I have created a Java example for broadcasting, similar to the one in Scala. I 
> would like to contribute the code for it.






[jira] [Comment Edited] (SPARK-18043) Java example for Broadcasting

2016-10-21 Thread Akash Sethi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594410#comment-15594410
 ] 

Akash Sethi edited comment on SPARK-18043 at 10/21/16 8:15 AM:
---

A broadcasting example is missing from the Java examples, so I have created and 
uploaded a Java file in the attachments. If you like, I can work on more 
examples, or if changes are needed, let me know.
Thanks


was (Author: akashsethi24):
Broadcasting example is missing in from java examples so, i have created and 
uploaded a java file in attachments. i you like i can work on more example let 
me know. if changes needed then let me know .
Thanks

> Java example for Broadcasting
> -
>
> Key: SPARK-18043
> URL: https://issues.apache.org/jira/browse/SPARK-18043
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Akash Sethi
>Priority: Critical
>  Labels: patch
> Attachments: JavaBroadcastTest.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I have created a Java example for broadcasting, similar to the one in Scala. I 
> would like to contribute the code for it.






[jira] [Created] (SPARK-18044) FileStreamSource should not infer partitions in every batch

2016-10-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18044:
---

 Summary: FileStreamSource should not infer partitions in every 
batch
 Key: SPARK-18044
 URL: https://issues.apache.org/jira/browse/SPARK-18044
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Assigned] (SPARK-18044) FileStreamSource should not infer partitions in every batch

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18044:


Assignee: Apache Spark  (was: Wenchen Fan)

> FileStreamSource should not infer partitions in every batch
> ---
>
> Key: SPARK-18044
> URL: https://issues.apache.org/jira/browse/SPARK-18044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-18044) FileStreamSource should not infer partitions in every batch

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18044:


Assignee: Wenchen Fan  (was: Apache Spark)

> FileStreamSource should not infer partitions in every batch
> ---
>
> Key: SPARK-18044
> URL: https://issues.apache.org/jira/browse/SPARK-18044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-18044) FileStreamSource should not infer partitions in every batch

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594485#comment-15594485
 ] 

Apache Spark commented on SPARK-18044:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15581

> FileStreamSource should not infer partitions in every batch
> ---
>
> Key: SPARK-18044
> URL: https://issues.apache.org/jira/browse/SPARK-18044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Resolved] (SPARK-17960) Upgrade to Py4J 0.10.4

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17960.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15514
[https://github.com/apache/spark/pull/15514]

> Upgrade to Py4J 0.10.4
> --
>
> Key: SPARK-17960
> URL: https://issues.apache.org/jira/browse/SPARK-17960
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
> Fix For: 2.1.0
>
>
> In general we should try and keep up to date with Py4J's new releases. The 
> changes in this one are small ( 
> https://github.com/bartdag/py4j/milestone/21?closed=1 ) and shouldn't impact 
> Spark in any significant way so I'm going to tag this as a starter issue for 
> someone looking to get a deeper understanding of how PySpark works.
> Upgrading Py4J can be a bit tricky compared to updating other packages; in 
> general the steps are:
> 1) Upgrade the Py4J version on the Java side
> 2) Update the py4j src zip file we bundle with Spark
> 3) Make sure everything still works (especially the streaming tests, because 
> we do weird things to make streaming work and it's the most likely place to 
> break during a Py4J upgrade).
> You can see how these bits have been done in past releases by looking in the 
> git log for the last time we changed the Py4J version numbers. Sometimes even 
> for "compatible" releases like this one we may need to make some small code 
> changes inside of PySpark because we hook into Py4J's internals, but I don't 
> think this should be the case here.






[jira] [Updated] (SPARK-17960) Upgrade to Py4J 0.10.4

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17960:
--
Assignee: Jagadeesan A S

> Upgrade to Py4J 0.10.4
> --
>
> Key: SPARK-17960
> URL: https://issues.apache.org/jira/browse/SPARK-17960
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Assignee: Jagadeesan A S
>Priority: Trivial
>  Labels: starter
> Fix For: 2.1.0
>
>
> In general we should try and keep up to date with Py4J's new releases. The 
> changes in this one are small ( 
> https://github.com/bartdag/py4j/milestone/21?closed=1 ) and shouldn't impact 
> Spark in any significant way so I'm going to tag this as a starter issue for 
> someone looking to get a deeper understanding of how PySpark works.
> Upgrading Py4J can be a bit tricky compared to updating other packages; in 
> general the steps are:
> 1) Upgrade the Py4J version on the Java side
> 2) Update the py4j src zip file we bundle with Spark
> 3) Make sure everything still works (especially the streaming tests, because 
> we do weird things to make streaming work and it's the most likely place to 
> break during a Py4J upgrade).
> You can see how these bits have been done in past releases by looking in the 
> git log for the last time we changed the Py4J version numbers. Sometimes even 
> for "compatible" releases like this one we may need to make some small code 
> changes inside of PySpark because we hook into Py4J's internals, but I don't 
> think this should be the case here.






[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594517#comment-15594517
 ] 

Sean Owen commented on SPARK-9219:
--

{{mvn dependency:tree}} will be easier to read and more relevant, since it will 
show what Maven thinks the situation is. It sounds like you have mismatched 
classes on the classpath, one way or another. You can also inspect your app 
JAR with 'jar tf' to see what's really inside it. Have a look at the classpath 
reported in your app's environment tab too.

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(Obj

[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-21 Thread Piotr Smolinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594539#comment-15594539
 ] 

Piotr Smolinski commented on SPARK-17904:
-

My bad. It is about deploying the packages, not loading them. It can still be 
tricky: there is no guarantee that executors are running on all possible nodes, 
so Sun's comment 
(https://issues.apache.org/jira/browse/SPARK-17904?focusedCommentId=15571785&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15571785)
 makes sense.

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function which was passed 
> into {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executors and check it successfully, 
> we can run similar code like following:
> (Note: The code is just for example, not the prototype of this proposal. The 
> detail implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(part) { install.packages("Matrix") })
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test )
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library, since SparkR is an interactive analytics tool and users may call lots 
> of libraries during the analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the name of packages. If repos = NULL, this can be set with a 
> local/hdfs path, then SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install the packages on 
> each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense; feel free to correct me if there is any 
> misunderstanding.  






[jira] [Created] (SPARK-18045) Move `HiveDataFrameAnalyticsSuite` to package `sql`

2016-10-21 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-18045:


 Summary: Move `HiveDataFrameAnalyticsSuite` to package `sql`
 Key: SPARK-18045
 URL: https://issues.apache.org/jira/browse/SPARK-18045
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Jiang Xingbo
Priority: Minor


The test suite `HiveDataFrameAnalyticsSuite` has nothing to do with Hive, so we 
should move it to package `sql`.






[jira] [Created] (SPARK-18046) Spark gives error(Initial Job has not been accepted)

2016-10-21 Thread Farman Ali (JIRA)
Farman Ali created SPARK-18046:
--

 Summary: Spark gives error(Initial Job has not been accepted)
 Key: SPARK-18046
 URL: https://issues.apache.org/jira/browse/SPARK-18046
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 2.0.0, 1.6.2, 1.6.0
Reporter: Farman Ali


I am trying Spark on Amazon EC2 but I am facing this issue: when I try to run 
with a local Spark worker it works fine, but when I try with only an EC2 instance 
it gives an "Initial job has not been accepted" error. How do I resolve this error?






[jira] [Resolved] (SPARK-18046) Spark gives error(Initial Job has not been accepted)

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18046.
---
  Resolution: Invalid
Target Version/s:   (was: 2.0.0)

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA; this is at best a question.

It only means you don't have enough resources for your job, and isn't a Spark 
problem.

> Spark gives error(Initial Job has not been accepted)
> 
>
> Key: SPARK-18046
> URL: https://issues.apache.org/jira/browse/SPARK-18046
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.0, 1.6.2, 2.0.0
>Reporter: Farman Ali
>  Labels: features
>
> I am trying Spark on Amazon EC2 but I am facing this issue: when I try to 
> run with a local Spark worker it works fine, but when I try with only an EC2 
> instance it gives an "Initial job has not been accepted" error. How do I resolve 
> this error?






[jira] [Commented] (SPARK-17906) MulticlassClassificationEvaluator support target label

2016-10-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594569#comment-15594569
 ] 

zhengruifeng commented on SPARK-17906:
--

Yes. I think it is useful to expose metrics computed for one label vs. the others.

> MulticlassClassificationEvaluator support target label
> --
>
> Key: SPARK-17906
> URL: https://issues.apache.org/jira/browse/SPARK-17906
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> In practice, I sometimes only focus on the metric of one specific label.
> For example, in CTR prediction, I usually only care about the F1 of the positive class.
> In sklearn, this is supported:
> {code}
> >>> from sklearn.metrics import classification_report
> >>> y_true = [0, 1, 2, 2, 2]
> >>> y_pred = [0, 0, 2, 2, 1]
> >>> target_names = ['class 0', 'class 1', 'class 2']
> >>> print(classification_report(y_true, y_pred, target_names=target_names))
>              precision    recall  f1-score   support
> class 0   0.50  1.00  0.67 1
> class 1   0.00  0.00  0.00 1
> class 2   1.00  0.67  0.80 3
> avg / total   0.70  0.60  0.61 5
> {code}
> Now, ML only supports `weightedXXX` metrics, so I think there is room to 
> improve.
> The API may be designed like this:
> {code}
> val dataset = ...
> val evaluator = new MulticlassClassificationEvaluator
> evaluator.setMetricName("f1")
> evaluator.evaluate(dataset)   // weightedF1 of all classes
> evaluator.setTarget(0.0).setMetricName("f1")
> evaluator.evaluate(dataset)   // F1 of class "0"
> {code}
> what's your opinion? [~yanboliang][~josephkb][~sethah][~srowen] 
> If this is useful and acceptable, I'm happy to work on this. 
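As a stop-gap, per-label metrics can already be computed with the RDD-based {{MulticlassMetrics}}. A rough Scala sketch; the "prediction" and "label" column names are assumptions about the caller's DataFrame, not part of this proposal:

{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.DataFrame

// Rough workaround sketch: compute F1 for a single class from a predictions
// DataFrame. Assumes double-typed "prediction" and "label" columns, e.g. the
// output of a fitted classifier's transform().
def f1ForLabel(predictions: DataFrame, label: Double): Double = {
  val predictionAndLabels = predictions
    .select("prediction", "label")
    .rdd
    .map(r => (r.getDouble(0), r.getDouble(1)))

  val metrics = new MulticlassMetrics(predictionAndLabels)
  metrics.fMeasure(label)   // precision(label) and recall(label) also exist
}
{code}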






[jira] [Commented] (SPARK-18045) Move `HiveDataFrameAnalyticsSuite` to package `sql`

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594559#comment-15594559
 ] 

Apache Spark commented on SPARK-18045:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15582

> Move `HiveDataFrameAnalyticsSuite` to package `sql`
> ---
>
> Key: SPARK-18045
> URL: https://issues.apache.org/jira/browse/SPARK-18045
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> The test suite `HiveDataFrameAnalyticsSuite` has nothing to do with Hive, so we 
> should move it to package `sql`.






[jira] [Assigned] (SPARK-18045) Move `HiveDataFrameAnalyticsSuite` to package `sql`

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18045:


Assignee: Apache Spark

> Move `HiveDataFrameAnalyticsSuite` to package `sql`
> ---
>
> Key: SPARK-18045
> URL: https://issues.apache.org/jira/browse/SPARK-18045
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>
> The test suite `HiveDataFrameAnalyticsSuite` has nothing to do with Hive, so we 
> should move it to package `sql`.






[jira] [Commented] (SPARK-18041) activedrivers section in http:sparkMasterurl/json is missing Main class information

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594560#comment-15594560
 ] 

Sean Owen commented on SPARK-18041:
---

The ID identifies the driver right?
Do you mean you want to know the application's main class?
It wouldn't help differentiate things in case you were running, say, the shell, 
or multiple copies of one app, so I don't know how useful this is.

> activedrivers section in http:sparkMasterurl/json is missing Main class 
> information
> ---
>
> Key: SPARK-18041
> URL: https://issues.apache.org/jira/browse/SPARK-18041
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: sudheesh k s
>Priority: Minor
>
> http:sparkMaster_Url/json gives the status of running applications as well as 
> drivers, but it is missing information such as the driver's main class. 
> To identify which driver is running, the driver-class information is needed. 
> eg:
>   "activedrivers" : [ {
> "id" : "driver-20161020173528-0032",
> "starttime" : "1476965128734",
> "state" : "RUNNING",
> "cores" : 1,
> "memory" : 1024
>   } ],






[jira] [Assigned] (SPARK-18045) Move `HiveDataFrameAnalyticsSuite` to package `sql`

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18045:


Assignee: (was: Apache Spark)

> Move `HiveDataFrameAnalyticsSuite` to package `sql`
> ---
>
> Key: SPARK-18045
> URL: https://issues.apache.org/jira/browse/SPARK-18045
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> The test suite `HiveDataFrameAnalyticsSuite` has nothing to do with Hive, so we 
> should move it to package `sql`.






[jira] [Commented] (SPARK-17906) MulticlassClassificationEvaluator support target label

2016-10-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594570#comment-15594570
 ] 

zhengruifeng commented on SPARK-17906:
--

Yes. I think it is useful to expose metrics computed for one label vs. the others.

> MulticlassClassificationEvaluator support target label
> --
>
> Key: SPARK-17906
> URL: https://issues.apache.org/jira/browse/SPARK-17906
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> In practice, I sometimes only focus on the metric of one specific label.
> For example, in CTR prediction, I usually only care about the F1 of the positive class.
> In sklearn, this is supported:
> {code}
> >>> from sklearn.metrics import classification_report
> >>> y_true = [0, 1, 2, 2, 2]
> >>> y_pred = [0, 0, 2, 2, 1]
> >>> target_names = ['class 0', 'class 1', 'class 2']
> >>> print(classification_report(y_true, y_pred, target_names=target_names))
>              precision    recall  f1-score   support
> class 0   0.50  1.00  0.67 1
> class 1   0.00  0.00  0.00 1
> class 2   1.00  0.67  0.80 3
> avg / total   0.70  0.60  0.61 5
> {code}
> Now, ML only supports `weightedXXX` metrics, so I think there is room to 
> improve.
> The API may be designed like this:
> {code}
> val dataset = ...
> val evaluator = new MulticlassClassificationEvaluator
> evaluator.setMetricName("f1")
> evaluator.evaluate(dataset)   // weightedF1 of all classes
> evaluator.setTarget(0.0).setMetricName("f1")
> evaluator.evaluate(dataset)   // F1 of class "0"
> {code}
> what's your opinion? [~yanboliang][~josephkb][~sethah][~srowen] 
> If this is useful and acceptable, I'm happy to work on this. 






[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594585#comment-15594585
 ] 

Sean Owen commented on SPARK-18022:
---

Yes, that code block should handle the case where cause is null. That will help 
a bit. 
It isn't the cause of the NPE here, which is probably about not handling the 
case of a null taskMetrics()?
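The guard being suggested for the null cause amounts to the standard suppressed-exception pattern with a null check. A schematic Scala sketch only; this is not the actual JdbcUtils code, and the helper name is made up:

{code}
// Schematic only, not the real JdbcUtils.savePartition. Throwable.addSuppressed
// throws "Cannot suppress a null exception" when the primary cause is null,
// which hides the real error (e.g. a duplicate primary key from MySQL).
def withConnection[T](conn: java.sql.Connection)(body: => T): T = {
  var cause: Throwable = null
  try {
    body
  } catch {
    case t: Throwable =>
      cause = t
      throw t
  } finally {
    try {
      conn.close()
    } catch {
      case closeError: Throwable =>
        if (cause != null) cause.addSuppressed(closeError) // guard against null
        else throw closeError
    }
  }
}
{code}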

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a dataframe to MySQL, I'm unable to get 
> it.
> Instead, I'm getting the following stack trace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> This makes it very difficult to debug apps.






[jira] [Created] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread darion yaphet (JIRA)
darion yaphet created SPARK-18047:
-

 Summary: Spark worker port should be greater than 1023
 Key: SPARK-18047
 URL: https://issues.apache.org/jira/browse/SPARK-18047
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1, 2.0.0
Reporter: darion yaphet


The port numbers in the range from 0 to 1023 are the well-known (system) ports. 
They are widely used by system network services, such as Telnet (23), Simple Mail 
Transfer Protocol (25), and Domain Name System (53). The worker port should 
avoid using these ports.
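What is being proposed amounts to a validation check along these lines. This is a hypothetical Scala sketch, not the actual patch, and the function name is made up:

{code}
// Hypothetical sketch of the proposed check: reject privileged ports for the
// worker while still allowing 0 ("pick a random ephemeral port").
def validateWorkerPort(port: Int): Unit = {
  require(port == 0 || (port > 1023 && port < 65536),
    s"Worker port must be 0 or in the range 1024-65535, got $port")
}
{code}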






[jira] [Commented] (SPARK-17906) MulticlassClassificationEvaluator support target label

2016-10-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594607#comment-15594607
 ] 

zhengruifeng commented on SPARK-17906:
--

It seems that in that PR, we should first obtain a model, then use it to 
{{evaluate}} on some dataframe to generate a {{summary}}.
This MetricPerLabel could maybe also be added into {{MulticlassClassificationEvaluator}} for 
general purposes.

> MulticlassClassificationEvaluator support target label
> --
>
> Key: SPARK-17906
> URL: https://issues.apache.org/jira/browse/SPARK-17906
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> In practice, I sometimes only focus on the metric of one specific label.
> For example, in CTR prediction, I usually only care about the F1 of the positive class.
> In sklearn, this is supported:
> {code}
> >>> from sklearn.metrics import classification_report
> >>> y_true = [0, 1, 2, 2, 2]
> >>> y_pred = [0, 0, 2, 2, 1]
> >>> target_names = ['class 0', 'class 1', 'class 2']
> >>> print(classification_report(y_true, y_pred, target_names=target_names))
>              precision    recall  f1-score   support
> class 0   0.50  1.00  0.67 1
> class 1   0.00  0.00  0.00 1
> class 2   1.00  0.67  0.80 3
> avg / total   0.70  0.60  0.61 5
> {code}
> Now, ML only supports `weightedXXX` metrics, so I think there is room to 
> improve.
> The API may be designed like this:
> {code}
> val dataset = ...
> val evaluator = new MulticlassClassificationEvaluator
> evaluator.setMetricName("f1")
> evaluator.evaluate(dataset)   // weightedF1 of all classes
> evaluator.setTarget(0.0).setMetricName("f1")
> evaluator.evaluate(dataset)   // F1 of class "0"
> {code}
> what's your opinion? [~yanboliang][~josephkb][~sethah][~srowen] 
> If this is useful and acceptable, I'm happy to work on this. 






[jira] [Updated] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-18047:
--
Description: 
The port numbers in the range from 0 to 1023 are the well-known (system) ports. 

They are widely used by system network services, such as Telnet (23), Simple 
Mail Transfer Protocol (25), and Domain Name System (53). 

The worker port should avoid using these ports. 

  was:The port numbers in the range from 0 to 1023 are the well-known ports 
(system ports) . They are widely used by system network services. Such as 
Telnet(23), Simple Mail Transfer Protocol(25) and Domain Name System(53). Work 
port should avoid using this ports . 


> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>
> The port numbers in the range from 0 to 1023 are the well-known (system) ports. 
> They are widely used by system network services, such as Telnet (23), Simple 
> Mail Transfer Protocol (25), and Domain Name System (53). 
> The worker port should avoid using these ports. 






[jira] [Commented] (SPARK-17910) Allow users to update the comment of a column

2016-10-21 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594610#comment-15594610
 ] 

Jiang Xingbo commented on SPARK-17910:
--

[~yhuai] It seems we don't support `ALTER TABLE CHANGE COLUMN` statements 
currently; do we plan to support that? Are there any discussions I can refer 
to? Thank you!

> Allow users to update the comment of a column
> -
>
> Key: SPARK-17910
> URL: https://issues.apache.org/jira/browse/SPARK-17910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, once a user sets the comment of a column with the create table command, 
> he/she cannot update the comment. It would be useful to provide a public 
> interface (e.g. SQL) to do that. 






[jira] [Commented] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594612#comment-15594612
 ] 

Sean Owen commented on SPARK-18047:
---

You can just not specify those ports of course, and they're not the default. If 
you want to use those ports for some reason, is it really something to 
prohibit? I don't see a problem.

> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>
> The port numbers in the range from 0 to 1023 are the well-known (system) ports. 
> They are widely used by system network services, such as Telnet (23), Simple 
> Mail Transfer Protocol (25), and Domain Name System (53). 
> The worker port should avoid using these ports. 






[jira] [Updated] (SPARK-18043) Java example for Broadcasting

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18043:
--
 Flags:   (was: Patch)
Labels:   (was: patch)
  Priority: Minor  (was: Critical)
Issue Type: Improvement  (was: Task)

[~akashsethi24] please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark to 
understand how to file a JIRA. This can't be "Critical", for example, and we 
don't use patches.

There is already an example of broadcast in Java in 
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables 
and I don't see much value in adding more. It's a very simple API.

> Java example for Broadcasting
> -
>
> Key: SPARK-18043
> URL: https://issues.apache.org/jira/browse/SPARK-18043
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Akash Sethi
>Priority: Minor
> Attachments: JavaBroadcastTest.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I have created a Java example for broadcasting, similar to the one in Scala. I 
> would like to contribute the code for it.






[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594619#comment-15594619
 ] 

Sean Owen commented on SPARK-882:
-

On second thought I'd just link to the 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark wiki as 
I'd rather funnel any contributions through that first.

> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?






[jira] [Updated] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-882:

Priority: Minor  (was: Major)

> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>Priority: Minor
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?






[jira] [Issue Comment Deleted] (SPARK-17910) Allow users to update the comment of a column

2016-10-21 Thread Jiang Xingbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-17910:
-
Comment: was deleted

(was: [~yhuai] Seems we don't support `ALTER TABLE CHANGE COLUMN` statements 
currently, do we plan to support that? Are there any discussions I can refer 
to? Thank you!)

> Allow users to update the comment of a column
> -
>
> Key: SPARK-17910
> URL: https://issues.apache.org/jira/browse/SPARK-17910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, once a user sets the comment of a column with the create table command, 
> he/she cannot update the comment. It would be useful to provide a public 
> interface (e.g. SQL) to do that. 






[jira] [Commented] (SPARK-17910) Allow users to update the comment of a column

2016-10-21 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594623#comment-15594623
 ] 

Jiang Xingbo commented on SPARK-17910:
--

 [~yhuai] It seems we don't support `ALTER TABLE CHANGE COLUMN` statements 
currently; do we plan to support that? Are there any discussions I can refer 
to? Thank you! 

> Allow users to update the comment of a column
> -
>
> Key: SPARK-17910
> URL: https://issues.apache.org/jira/browse/SPARK-17910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, once a user sets the comment of a column with the create table command, 
> he/she cannot update the comment. It would be useful to provide a public 
> interface (e.g. SQL) to do that. 






[jira] [Assigned] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18047:


Assignee: Apache Spark

> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>Assignee: Apache Spark
>
> The port numbers in the range from 0 to 1023 are the well-known ports (system
> ports). They are widely used by system network services such as Telnet (23),
> Simple Mail Transfer Protocol (25) and Domain Name System (53).
> The worker port should avoid using these ports.
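
A minimal sketch of the kind of guard being proposed, using a hypothetical helper rather than the actual change in the linked pull request:

{code}
// Sketch only: reject configured worker ports that fall in the well-known range (1-1023).
// Port 0 is left alone because it conventionally means "bind to a random ephemeral port".
def validateWorkerPort(port: Int): Unit = {
  require(port == 0 || port > 1023,
    s"Invalid worker port $port: ports 1-1023 are reserved for system services")
}
{code}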



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594641#comment-15594641
 ] 

Apache Spark commented on SPARK-18047:
--

User 'darionyaphet' has created a pull request for this issue:
https://github.com/apache/spark/pull/15583

> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>
> The port numbers in the range from 0 to 1023 are the well-known ports (system
> ports). They are widely used by system network services such as Telnet (23),
> Simple Mail Transfer Protocol (25) and Domain Name System (53).
> The worker port should avoid using these ports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18047:


Assignee: (was: Apache Spark)

> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>
> The port numbers in the range from 0 to 1023 are the well-known ports (system
> ports). They are widely used by system network services such as Telnet (23),
> Simple Mail Transfer Protocol (25) and Domain Name System (53).
> The worker port should avoid using these ports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594708#comment-15594708
 ] 

Sean Owen commented on SPARK-18039:
---

I am not sure I understand the scenario given this description. Can you 
describe more carefully the sequence of events that leads to the scheduling, 
what scheduling would be better, and what kind of change might improve it if 
possible?

> ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Priority: Minor
>
> Receiver scheduling balance is important for me.
> For instance, if I have 2 executors and each executor hosts 1 receiver, the calc time is 0.1s per
> batch. But if I have 2 executors and one executor hosts 2 receivers while the other hosts 0,
> the calc time increases by about 3s per batch.
> In my cluster, executor initialization is slow and I need to wait about 30s,
> but the dummy job only waits about 4s. I added the conf
> spark.scheduler.maxRegisteredResourcesWaitingTime but it does not work.
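
For reference, the registration-wait settings that are sometimes combined for this kind of delay are sketched below; the values are illustrative, and whether they influence the ReceiverTracker's dummy job is exactly what this issue questions.

{code}
import org.apache.spark.SparkConf

// Sketch: wait up to 60s, or until all requested executors have registered,
// before the first job is scheduled. Values are illustrative only.
val conf = new SparkConf()
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
{code}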



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18046) Spark gives error(Initial Job has not been accepted)

2016-10-21 Thread Farman Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594746#comment-15594746
 ] 

Farman Ali commented on SPARK-18046:


Could you share the link to the Amazon EC2 configuration documentation for Spark?

> Spark gives error(Initial Job has not been accepted)
> 
>
> Key: SPARK-18046
> URL: https://issues.apache.org/jira/browse/SPARK-18046
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.0, 1.6.2, 2.0.0
>Reporter: Farman Ali
>  Labels: features
>
> I am trying Spark on Amazon EC2 but I am facing this issue. When I run with a
> local Spark worker it works fine, but when I run only on an EC2 instance it
> gives the error "Initial job has not been accepted". How do I resolve
> this error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13275.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15536
[https://github.com/apache/spark/pull/15536]

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered it looks like the executors started before the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13275:
--
Assignee: Alex Bozarth

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Assignee: Alex Bozarth
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered it looks like the executors started before the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17709.
---
Resolution: Not A Problem

Provisionally closing as not a problem, or possibly a duplicate, as of 2.0.1

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2
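
One workaround that is often suggested for this kind of ambiguity, sketched here against the df1/df2 from the snippet above and not verified against this exact report, is to alias both sides and join on fully qualified columns instead of the Seq("key1","key2") form:

{code}
// Sketch of a commonly suggested workaround (untested against this report):
// alias both dataframes and spell out the join condition explicitly.
import org.apache.spark.sql.functions.col

val left = df1.as("l")
val right = df2.as("r")
val joined = left.join(right,
  col("l.key1") === col("r.key1") && col("l.key2") === col("r.key2"))
{code}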



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17908) Column names corrupted in PySpark dataframe groupBy

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17908.
---
Resolution: Cannot Reproduce

OK, provisionally closing as cannot reproduce

> Column names corrupted in PySpark dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DF, say df
> df1 = df.groupBy('key1', 'key2',
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I am getting an error that key2 is not present in df1, which is not true because
> df1.show() shows the data including key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 'key2'),
> i.e. renamed the column to the same name. Then it works.
> The stack trace says the column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17898) --repositories needs username and password

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17898:
--
  Assignee: Sean Owen
  Priority: Trivial  (was: Major)
Issue Type: Documentation  (was: Wish)

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 2.0.1
>Reporter: lichenglin
>Assignee: Sean Owen
>Priority: Trivial
>
> My private repositories need a username and password for access.
> I can't find a way to declare the username and password when submitting a
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username
> and password.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17898) --repositories needs username and password

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594773#comment-15594773
 ] 

Apache Spark commented on SPARK-17898:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15584

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 2.0.1
>Reporter: lichenglin
>Assignee: Sean Owen
>Priority: Trivial
>
> My private repositories need a username and password for access.
> I can't find a way to declare the username and password when submitting a
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username
> and password.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594789#comment-15594789
 ] 

Apache Spark commented on SPARK-2885:
-

User 'rezazadeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/1778

> All-pairs similarity via DIMSUM
> ---
>
> Key: SPARK-2885
> URL: https://issues.apache.org/jira/browse/SPARK-2885
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Reza Zadeh
>Assignee: Reza Zadeh
> Fix For: 1.2.0
>
> Attachments: SimilarItemsSmallTest.java
>
>
> Build all-pairs similarity algorithm via DIMSUM. 
> Given a dataset of sparse vector data, the all-pairs similarity problem is to 
> find all similar vector pairs according to a similarity function such as 
> cosine similarity, and a given similarity score threshold. Sometimes, this 
> problem is called a “similarity join”.
> The brute force approach of considering all pairs quickly breaks, since it 
> scales quadratically. For example, for a million vectors, it is not feasible 
> to check all roughly trillion pairs to see if they are above the similarity 
> threshold. Having said that, there exist clever sampling techniques to focus 
> the computational effort on those pairs that are above the similarity 
> threshold, which makes the problem feasible.
> DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
> accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
> computation vs accuracy from low computation to high accuracy. For a very 
> large gamma, all cosine similarities are computed exactly with no sampling.
> Current PR:
> https://github.com/apache/spark/pull/1778
> Justification for adding to MLlib:
> - All-pairs similarity is missing from MLlib and has been requested several 
> times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
> https://github.com/apache/spark/pull/1778#issuecomment-51300825)
> - Algorithm is used in large-scale production at Twitter. e.g. see 
> https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
>  . Twitter also open-sourced their version in scalding: 
> https://github.com/twitter/scalding/pull/833
> - When used with the gamma parameter set high, this algorithm becomes the 
> normalized gramian matrix, which is useful in RowMatrix alongside the 
> computeGramianMatrix method already in RowMatrix
> More details about usage at Twitter: 
> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
> For correctness proof, see Theorem 4.3 in 
> http://stanford.edu/~rezab/papers/dimsum.pdf
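
For context, the DIMSUM implementation described above is exposed through RowMatrix.columnSimilarities; a minimal usage sketch (the data and the 0.1 threshold are made up, and `sc` is assumed to be an existing SparkContext) is:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: build a RowMatrix and compute approximate column similarities
// with the DIMSUM sampling scheme described above.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(4.0, 5.0, 0.0)))
val mat = new RowMatrix(rows)
val similarities = mat.columnSimilarities(0.1)  // CoordinateMatrix of similar column pairs
{code}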



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-21 Thread Priyanka Garg (JIRA)
Priyanka Garg created SPARK-18048:
-

 Summary: If expression behaves differently if true and false 
expression are interchanged in case of different data types.
 Key: SPARK-18048
 URL: https://issues.apache.org/jira/browse/SPARK-18048
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Priyanka Garg


The If expression behaves differently when the true and false expressions are
interchanged and have different data types.

For example,
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType)) throws an error, while

If(Literal.create(geo != null, BooleanType),
Literal.create(null, TimestampType),
Literal.create(null, DateType)) works fine.

The reason is that the If expression's dataType only considers
trueValue.dataType.

Also,
If(Literal.create(geo != null, BooleanType),
Literal.create(null, DateType),
Literal.create(null, TimestampType))
breaks only in the case of generated mutable projection and unsafe
projection. For all other projection types it works fine.
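
A compact way to see the reported asymmetry, as a sketch that assumes the Catalyst classes are on the classpath and uses a constant condition in place of `geo != null`:

{code}
import org.apache.spark.sql.catalyst.expressions.{If, Literal}
import org.apache.spark.sql.types.{BooleanType, DateType, TimestampType}

val cond = Literal.create(true, BooleanType)  // stands in for `geo != null`
val a = If(cond, Literal.create(null, DateType), Literal.create(null, TimestampType))
val b = If(cond, Literal.create(null, TimestampType), Literal.create(null, DateType))
// Per the report, If's dataType is taken from the true branch only,
// so a.dataType is DateType while b.dataType is TimestampType.
{code}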



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-10-21 Thread Priyanka Garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594818#comment-15594818
 ] 

Priyanka Garg commented on SPARK-18048:
---

I am working on it.

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> The If expression behaves differently when the true and false expressions are
> interchanged and have different data types.
> For example,
> If(Literal.create(geo != null, BooleanType),
> Literal.create(null, DateType),
> Literal.create(null, TimestampType)) throws an error, while
> If(Literal.create(geo != null, BooleanType),
> Literal.create(null, TimestampType),
> Literal.create(null, DateType)) works fine.
> The reason is that the If expression's dataType only considers
> trueValue.dataType.
> Also,
> If(Literal.create(geo != null, BooleanType),
> Literal.create(null, DateType),
> Literal.create(null, TimestampType))
> breaks only in the case of generated mutable projection and unsafe
> projection. For all other projection types it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-21 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594989#comment-15594989
 ] 

Ashish Shrowty commented on SPARK-17709:


[~sowen], [~smilegator] - I confirmed that this is not a problem in 2.0.1. 
Sorry .. forgot to come back and post my finding. Thanks for your help guys!

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18049) Add missing tests for truePositiveRate and weightedTruePositiveRate

2016-10-21 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18049:


 Summary: Add missing tests for truePositiveRate and 
weightedTruePositiveRate
 Key: SPARK-18049
 URL: https://issues.apache.org/jira/browse/SPARK-18049
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Tests
Reporter: zhengruifeng
Priority: Trivial


Tests for {{truePositiveRate}} and {{weightedTruePositiveRate}} are missing in 
{{MulticlassMetricsSuite}}.
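
A sketch of what such a test might assert, relying on the fact that the true positive rate of a label coincides with its recall (the data values are made up and `sc` is assumed to be a SparkContext):

{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// (prediction, label) pairs; values are illustrative only.
val predictionAndLabels = sc.parallelize(Seq(
  (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 1.0)))
val metrics = new MulticlassMetrics(predictionAndLabels)

// truePositiveRate(label) should agree with recall(label) for every label,
// and the weighted variants should agree as well.
metrics.labels.foreach { l =>
  assert(metrics.truePositiveRate(l) == metrics.recall(l))
}
assert(metrics.weightedTruePositiveRate == metrics.weightedRecall)
{code}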



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18049) Add missing tests for truePositiveRate and weightedTruePositiveRate

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18049:


Assignee: Apache Spark

> Add missing tests for truePositiveRate and weightedTruePositiveRate
> ---
>
> Key: SPARK-18049
> URL: https://issues.apache.org/jira/browse/SPARK-18049
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Tests
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> Tests for {{truePositiveRate}} and {{weightedTruePositiveRate}} are missing 
> in {{MulticlassMetricsSuite}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18049) Add missing tests for truePositiveRate and weightedTruePositiveRate

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18049:


Assignee: (was: Apache Spark)

> Add missing tests for truePositiveRate and weightedTruePositiveRate
> ---
>
> Key: SPARK-18049
> URL: https://issues.apache.org/jira/browse/SPARK-18049
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Tests
>Reporter: zhengruifeng
>Priority: Trivial
>
> Tests for {{truePositiveRate}} and {{weightedTruePositiveRate}} are missing 
> in {{MulticlassMetricsSuite}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18049) Add missing tests for truePositiveRate and weightedTruePositiveRate

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595007#comment-15595007
 ] 

Apache Spark commented on SPARK-18049:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15585

> Add missing tests for truePositiveRate and weightedTruePositiveRate
> ---
>
> Key: SPARK-18049
> URL: https://issues.apache.org/jira/browse/SPARK-18049
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Tests
>Reporter: zhengruifeng
>Priority: Trivial
>
> Tests for {{truePositiveRate}} and {{weightedTruePositiveRate}} are missing 
> in {{MulticlassMetricsSuite}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16606) Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an existing SparkContext, some configuration may not take effect."

2016-10-21 Thread Chris Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595098#comment-15595098
 ] 

Chris Brown commented on SPARK-16606:
-

Hi [~sowen], I noticed your PR #14533 and am excited about a little less noise 
in my logs, but it seems your PR didn't address the same typo in SparkSession? 
See 
https://github.com/apache/spark/search?utf8=%E2%9C%93&q=%22Use+an+existing%22&type=Code
 and 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L817-L829

Is that a no-fix?

> Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an 
> existing SparkContext, some configuration may not take effect."
> --
>
> Key: SPARK-16606
> URL: https://issues.apache.org/jira/browse/SPARK-16606
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{SparkContext.getOrCreate}} should really be checking whether the code gets 
> the already-created instance or creating a new one.
> Just a nit-pick: the warning message should also be "Using..." not "Use"
> {code}
> scala> sc.version
> res2: String = 2.1.0-SNAPSHOT
> scala> sc
> res3: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> scala> SparkContext.getOrCreate
> 16/07/18 14:40:31 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16606) Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an existing SparkContext, some configuration may not take effect."

2016-10-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595112#comment-15595112
 ] 

Sean Owen commented on SPARK-16606:
---

Oops, yes I missed that. It's just a typo but worth touching up. One sec...

> Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an 
> existing SparkContext, some configuration may not take effect."
> --
>
> Key: SPARK-16606
> URL: https://issues.apache.org/jira/browse/SPARK-16606
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{SparkContext.getOrCreate}} should really be checking whether the code gets 
> the already-created instance or creating a new one.
> Just a nit-pick: the warning message should also be "Using..." not "Use"
> {code}
> scala> sc.version
> res2: String = 2.1.0-SNAPSHOT
> scala> sc
> res3: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> scala> SparkContext.getOrCreate
> 16/07/18 14:40:31 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16606) Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an existing SparkContext, some configuration may not take effect."

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595124#comment-15595124
 ] 

Apache Spark commented on SPARK-16606:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15586

> Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an 
> existing SparkContext, some configuration may not take effect."
> --
>
> Key: SPARK-16606
> URL: https://issues.apache.org/jira/browse/SPARK-16606
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{SparkContext.getOrCreate}} should really be checking whether the code gets 
> the already-created instance or creating a new one.
> Just a nit-pick: the warning message should also be "Using..." not "Use"
> {code}
> scala> sc.version
> res2: String = 2.1.0-SNAPSHOT
> scala> sc
> res3: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> scala> SparkContext.getOrCreate
> 16/07/18 14:40:31 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18050) spark 2.0.1 enable hive throw AlreadyExistsException(message:Database default already exists)

2016-10-21 Thread todd.chen (JIRA)
todd.chen created SPARK-18050:
-

 Summary: spark 2.0.1 enable hive throw 
AlreadyExistsException(message:Database default already exists)
 Key: SPARK-18050
 URL: https://issues.apache.org/jira/browse/SPARK-18050
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: jdk1.8, macOS, Spark 2.0.1
Reporter: todd.chen


In Spark 2.0.1, I enable Hive support, and when the sqlContext is initialized, an
AlreadyExistsException(message:Database default already exists) is thrown, the same as in
https://www.mail-archive.com/dev@spark.apache.org/msg15306.html. My code is
```scala
  private val master = "local[*]"
  private val appName = "xqlServerSpark"
  val fileSystem = FileSystem.get()
  val sparkConf = new SparkConf().setMaster(master).
setAppName(appName).set("spark.sql.warehouse.dir", 
s"${fileSystem.getUri.toASCIIString}/user/hive/warehouse")
  val   hiveContext = 
SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate().sqlContext
print(sparkConf.get("spark.sql.warehouse.dir"))
hiveContext.sql("show tables").show()
```

The result is correct, but an exception is also thrown by the code.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18050) spark 2.0.1 enable hive throw AlreadyExistsException(message:Database default already exists)

2016-10-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18050:
--
Description: 
In Spark 2.0.1, I enable Hive support, and when the sqlContext is initialized, an
AlreadyExistsException(message:Database default already exists) is thrown, the same as in
https://www.mail-archive.com/dev@spark.apache.org/msg15306.html. My code is

{code}
  private val master = "local[*]"
  private val appName = "xqlServerSpark"
  val fileSystem = FileSystem.get()
  val sparkConf = new SparkConf().setMaster(master).
setAppName(appName).set("spark.sql.warehouse.dir", 
s"${fileSystem.getUri.toASCIIString}/user/hive/warehouse")
  val   hiveContext = 
SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate().sqlContext
print(sparkConf.get("spark.sql.warehouse.dir"))
hiveContext.sql("show tables").show()
{code}

The result is correct, but an exception is also thrown by the code.


  was:
in spark 2.0.1 ,I enable hive support and when init the sqlContext ,throw a 
AlreadyExistsException(message:Database default already exists),same as 
https://www.mail-archive.com/dev@spark.apache.org/msg15306.html ,my code is 
```scala
  private val master = "local[*]"
  private val appName = "xqlServerSpark"
  val fileSystem = FileSystem.get()
  val sparkConf = new SparkConf().setMaster(master).
setAppName(appName).set("spark.sql.warehouse.dir", 
s"${fileSystem.getUri.toASCIIString}/user/hive/warehouse")
  val   hiveContext = 
SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate().sqlContext
print(sparkConf.get("spark.sql.warehouse.dir"))
hiveContext.sql("show tables").show()
```

the result is correct,but a exception also throwBy the code



> spark 2.0.1 enable hive throw AlreadyExistsException(message:Database default 
> already exists)
> -
>
> Key: SPARK-18050
> URL: https://issues.apache.org/jira/browse/SPARK-18050
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: jdk1.8, macOS, Spark 2.0.1
>Reporter: todd.chen
>
> In Spark 2.0.1, I enable Hive support, and when the sqlContext is initialized, an
> AlreadyExistsException(message:Database default already exists) is thrown, the same as in
> https://www.mail-archive.com/dev@spark.apache.org/msg15306.html. My code is
> {code}
>   private val master = "local[*]"
>   private val appName = "xqlServerSpark"
>   val fileSystem = FileSystem.get()
>   val sparkConf = new SparkConf().setMaster(master).
> setAppName(appName).set("spark.sql.warehouse.dir", 
> s"${fileSystem.getUri.toASCIIString}/user/hive/warehouse")
>   val   hiveContext = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate().sqlContext
> print(sparkConf.get("spark.sql.warehouse.dir"))
> hiveContext.sql("show tables").show()
> {code}
> The result is correct, but an exception is also thrown by the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-10-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595184#comment-15595184
 ] 

Wenchen Fan edited comment on SPARK-17990 at 10/21/16 2:03 PM:
---

The root cause is that Hive tables are not case-preserving. Although you are 
writing `CREATE TABLE... PARTITIONED BY (partCol int)`, the partition column is 
actually `partcol`. You then use `DataFrameWriter` to write data files to 
the table location directly, with the wrong partition column name.

If we accept this defect that Hive tables are not case-preserving, I think it's 
not a problem.

cc [~yhuai] [~rxin]


was (Author: cloud_fan):
The root cause is that, hive table is not case-preserving. Although you are 
writing `CREATE TABLE... PARTITIONED BY (partCol int)`, the partition column is 
actually `partcol`. And then you use `DataFrameWriter` to write data files to 
the table location directly, with the wrong partition column.

If we admit this defect that hive table is not case-preserving, I think it's 
not a problem.

cc @yhuai @rxin

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016

[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-10-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595184#comment-15595184
 ] 

Wenchen Fan commented on SPARK-17990:
-

The root cause is that Hive tables are not case-preserving. Although you are 
writing `CREATE TABLE... PARTITIONED BY (partCol int)`, the partition column is 
actually `partcol`. You then use `DataFrameWriter` to write data files to 
the table location directly, with the wrong partition column name.

If we accept this defect that Hive tables are not case-preserving, I think it's 
not a problem.

cc @yhuai @rxin
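
If one accepts that explanation, a hedged workaround is to write the data files with the lower-cased column name that Hive actually records, so the directory names match the partition metadata; the sketch below is adapted from the reproduction in the issue and has not been verified here.

{code}
// Sketch: use the lower-cased partition column name ("partcol") when writing,
// so the directories line up with what the Hive metastore stores.
spark.range(10).selectExpr("id as a", "id as partcol")
  .write.partitionBy("partcol").mode("overwrite")
  .parquet("/tmp/mixed_case_partitioning")
{code}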

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergro

[jira] [Reopened] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-21 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson reopened SPARK-18016:
-

After digging through the stacktraces more closely, it appears that for certain 
very wide/nested schemas, the generated code does attempt to allocate a number 
of variables larger than the Janino compiler will allow, 0xFFFF, or 65536. This 
convinces me that the error is not in the class of "64 KB" errors found in 
other Jiras (e.g. SPARK-17702, SPARK-16845), although it is related to the size 
of code generation in the sense of the number of variables declared. 

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
> 

[jira] [Commented] (SPARK-16606) Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an existing SparkContext, some configuration may not take effect."

2016-10-21 Thread Chris Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595256#comment-15595256
 ] 

Chris Brown commented on SPARK-16606:
-

Awesome, thanks! Just a typo, yes, but a significant one for newbs like me who 
aren't familiar with Spark internals and we get a message that translates, due 
to the typo, to something like:

> Warning! You just tried to retrieve a global singleton, but you should use an 
> existing SparkContext!

When in fact, what it actually means to say, once you get the verb 
tense/modality right, is more like:

> You tried to get or create a new Spark context with some configuration, but 
> there's already a Spark context registered, and we're gonna give you back 
> that one (because THERE CAN BE ONLY ONE) regardless of whatever configuration 
> you just supplied.
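
A small sketch of the behaviour being described, assuming a fresh JVM with no SparkContext yet:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// The first call creates the context.
val sc1 = SparkContext.getOrCreate(new SparkConf().setMaster("local[*]").setAppName("first"))
// A second call returns the very same instance; the new conf is largely ignored,
// which is what the re-worded warning is trying to convey.
val sc2 = SparkContext.getOrCreate(new SparkConf().setMaster("local[*]").setAppName("second"))
assert(sc1 eq sc2)
{code}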

> Misleading warning for SparkContext.getOrCreate "WARN SparkContext: Use an 
> existing SparkContext, some configuration may not take effect."
> --
>
> Key: SPARK-16606
> URL: https://issues.apache.org/jira/browse/SPARK-16606
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{SparkContext.getOrCreate}} should really be checking whether the code gets 
> the already-created instance or creating a new one.
> Just a nit-pick: the warning message should also be "Using..." not "Use"
> {code}
> scala> sc.version
> res2: String = 2.1.0-SNAPSHOT
> scala> sc
> res3: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> scala> SparkContext.getOrCreate
> 16/07/18 14:40:31 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1186374c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-21 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595315#comment-15595315
 ] 

Herman van Hovell commented on SPARK-18016:
---

The Java constant pool cannot hold more than 65536 items. It is a hard JVM 
limit.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.j

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-21 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595346#comment-15595346
 ] 

Aleksander Eskilson commented on SPARK-18016:
-

Yep, understood. This would imply that the size of our Datasets is truly 
limited to what can be captured in a single SpecificSafe/UnsafeProjection 
class, unless there were a way for the projection itself to be performed 
piecewise.
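
For a concrete sense of that ceiling, here is a minimal sketch (my own 
construction, not from this ticket) that probes it with a flat schema of 
literal columns; the object name, column names, and the width of 20000 are 
all illustrative, and the exact threshold varies with Spark version and 
column types:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WideProjectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-projection-sketch")
      .master("local[*]")
      .getOrCreate()
    // Every column adds entries (field names, accessors, constants) to the one
    // generated projection class, so a wide enough select can exceed the
    // 0xFFFF constant-pool limit.
    val cols = (0 until 20000).map(i => lit(i).as(s"c$i"))
    spark.range(1).select(cols: _*).collect()
    spark.stop()
  }
}
{code}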

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-21 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595377#comment-15595377
 ] 

Herman van Hovell commented on SPARK-18016:
---

Yeah, there are limitations. Splitting code generation in this case is 
non-trivial (it would require generating actual separate classes).

Could you share something about the use case you are trying to solve? There 
might be better ways that do not hit this limitation.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompile

[jira] [Created] (SPARK-18051) Custom PartitionCoalescer cause serialization exception

2016-10-21 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-18051:
--

 Summary: Custom PartitionCoalescer cause serialization exception
 Key: SPARK-18051
 URL: https://issues.apache.org/jira/browse/SPARK-18051
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Weichen Xu


For example, the following code causes an exception:

{code:title=code}
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}
import org.apache.spark.sql.SparkSession

class MyCoalescer extends PartitionCoalescer {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): 
Array[PartitionGroup] = {
val pglist = Array.fill(2)(new PartitionGroup())
pglist(0).partitions.append(parent.partitions(0), parent.partitions(1), 
parent.partitions(2))
pglist(1).partitions.append(parent.partitions(3), parent.partitions(4), 
parent.partitions(5))
pglist
  }
}
object Test1 {
  def main(args: Array[String]) = {
val spark = SparkSession.builder().appName("test").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 6, 6)
rdd.coalesce(2, false, Some(new MyCoalescer)).count()
spark.stop()
  }
}
{code}

It throws the following exception:
Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
at java.lang.Exception.<init>(Exception.java:102)

at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
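
A guess at a workaround (not necessarily the eventual fix): the coalescer 
instance is held by the resulting CoalescedRDD and so ends up in the 
serialized task closure, so declaring it Serializable may avoid the failure. 
The class name below is illustrative.

{code}
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Same coalescer as above, marked Serializable (sketch under the assumption
// that the non-serializable coalescer is what trips the serializer).
class MySerializableCoalescer extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    val pglist = Array.fill(2)(new PartitionGroup())
    pglist(0).partitions.append(parent.partitions(0), parent.partitions(1), parent.partitions(2))
    pglist(1).partitions.append(parent.partitions(3), parent.partitions(4), parent.partitions(5))
    pglist
  }
}
{code}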




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-10-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595424#comment-15595424
 ] 

Yin Huai commented on SPARK-17990:
--

Btw, if possible, it would be good for Hive tables and data source tables to 
have consistent semantics. It is very hard for users to figure out why tables 
stored using different formats behave differently.

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=4
> drwxr-xr-x   - msa superg

[jira] [Commented] (SPARK-18051) Custom PartitionCoalescer cause serialization exception

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595436#comment-15595436
 ] 

Apache Spark commented on SPARK-18051:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15587

> Custom PartitionCoalescer cause serialization exception
> ---
>
> Key: SPARK-18051
> URL: https://issues.apache.org/jira/browse/SPARK-18051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, the following code cause exception:
> {code: title=code}
> class MyCoalescer extends PartitionCoalescer{
>   override def coalesce(maxPartitions: Int, parent: RDD[_]): 
> Array[PartitionGroup] = {
> val pglist = Array.fill(2)(new PartitionGroup())
> pglist(0).partitions.append(parent.partitions(0), parent.partitions(1), 
> parent.partitions(2))
> pglist(1).partitions.append(parent.partitions(3), parent.partitions(4), 
> parent.partitions(5))
> pglist
>   }
> }
> object Test1 {
>   def main(args: Array[String]) = {
> val spark = SparkSession.builder().appName("test").getOrCreate()
> val sc = spark.sparkContext
> val rdd = sc.parallelize(1 to 6, 6)
> rdd.coalesce(2, false, Some(new MyCoalescer)).count()
> spark.stop()
>   }
> }
> {code}
> it throws exception:
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
> at java.lang.Exception.<init>(Exception.java:102)
> 
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18051) Custom PartitionCoalescer cause serialization exception

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18051:


Assignee: Apache Spark

> Custom PartitionCoalescer cause serialization exception
> ---
>
> Key: SPARK-18051
> URL: https://issues.apache.org/jira/browse/SPARK-18051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, the following code cause exception:
> {code: title=code}
> class MyCoalescer extends PartitionCoalescer{
>   override def coalesce(maxPartitions: Int, parent: RDD[_]): 
> Array[PartitionGroup] = {
> val pglist = Array.fill(2)(new PartitionGroup())
> pglist(0).partitions.append(parent.partitions(0), parent.partitions(1), 
> parent.partitions(2))
> pglist(1).partitions.append(parent.partitions(3), parent.partitions(4), 
> parent.partitions(5))
> pglist
>   }
> }
> object Test1 {
>   def main(args: Array[String]) = {
> val spark = SparkSession.builder().appName("test").getOrCreate()
> val sc = spark.sparkContext
> val rdd = sc.parallelize(1 to 6, 6)
> rdd.coalesce(2, false, Some(new MyCoalescer)).count()
> spark.stop()
>   }
> }
> {code}
> it throws exception:
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
> at java.lang.Exception.<init>(Exception.java:102)
> 
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18051) Custom PartitionCoalescer cause serialization exception

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18051:


Assignee: (was: Apache Spark)

> Custom PartitionCoalescer cause serialization exception
> ---
>
> Key: SPARK-18051
> URL: https://issues.apache.org/jira/browse/SPARK-18051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, the following code cause exception:
> {code: title=code}
> class MyCoalescer extends PartitionCoalescer{
>   override def coalesce(maxPartitions: Int, parent: RDD[_]): 
> Array[PartitionGroup] = {
> val pglist = Array.fill(2)(new PartitionGroup())
> pglist(0).partitions.append(parent.partitions(0), parent.partitions(1), 
> parent.partitions(2))
> pglist(1).partitions.append(parent.partitions(3), parent.partitions(4), 
> parent.partitions(5))
> pglist
>   }
> }
> object Test1 {
>   def main(args: Array[String]) = {
> val spark = SparkSession.builder().appName("test").getOrCreate()
> val sc = spark.sparkContext
> val rdd = sc.parallelize(1 to 6, 6)
> rdd.coalesce(2, false, Some(new MyCoalescer)).count()
> spark.stop()
>   }
> }
> {code}
> it throws exception:
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
> at java.lang.Exception.<init>(Exception.java:102)
> 
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18052) Spark Job failing with org.apache.spark.rpc.RpcTimeoutException

2016-10-21 Thread Srikanth (JIRA)
Srikanth created SPARK-18052:


 Summary: Spark Job failing with 
org.apache.spark.rpc.RpcTimeoutException
 Key: SPARK-18052
 URL: https://issues.apache.org/jira/browse/SPARK-18052
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.0
 Environment: 3 node spark cluster, all AWS r3.xlarge instances running 
on ubuntu.
Reporter: Srikanth


Spark submit jobs are failing with org.apache.spark.rpc.RpcTimeoutException. We 
increased the spark.executor.heartbeatInterval value from 10s to 60s, but the 
issue persists.

This happens while saving a DataFrame to a mounted network drive; we are not 
using HDFS here. We are able to write smaller files (under 10G) successfully, 
but the data we are reading here is nearly 20G.

driver memory = 10G
executor memory = 25G

Please see the attached log file.
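
For reference, a sketch of the timeout settings that are commonly raised 
together in this situation. The values below are only illustrative, and this 
assumes the timeouts stem from long pauses (e.g. GC) during the large write 
rather than a crashed executor; the attached log should tell which it is.

{code}
import org.apache.spark.sql.SparkSession

// Illustrative values; spark.executor.heartbeatInterval must stay well below
// spark.network.timeout for the heartbeats to be meaningful.
val spark = SparkSession.builder()
  .appName("large-write")
  .config("spark.network.timeout", "600s")
  .config("spark.executor.heartbeatInterval", "60s")
  .getOrCreate()
{code}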



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18052) Spark Job failing with org.apache.spark.rpc.RpcTimeoutException

2016-10-21 Thread Srikanth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth updated SPARK-18052:
-
Attachment: sparkErrorLog.txt

> Spark Job failing with org.apache.spark.rpc.RpcTimeoutException
> ---
>
> Key: SPARK-18052
> URL: https://issues.apache.org/jira/browse/SPARK-18052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
> Environment: 3 node spark cluster, all AWS r3.xlarge instances 
> running on ubuntu.
>Reporter: Srikanth
> Attachments: sparkErrorLog.txt
>
>
> Spark submit jobs are failing with org.apache.spark.rpc.RpcTimeoutException. 
> increased the spark.executor.heartbeatInterval value from 10s to 60s, but 
> still the same issue.
> This is happening while saving a dataframe to a mounted network drive. Not 
> using HDFS here. We are able to write successfully for smaller size files 
> under 10G, the data here we are reading is nearly 20G
> driver memory = 10G
> executor memory = 25G
> Please see the attached log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-21 Thread Alberto Andreotti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595471#comment-15595471
 ] 

Alberto Andreotti commented on SPARK-9219:
--

Hello guys,
I'm currently facing the same exception when running against a standalone, 
custom-compiled Spark 1.6.1 (Scala 2.11). The UDF seems to be the problem; it 
fails when deserializing.
Here's a simple sbt project that shows the issue:

https://github.com/albertoandreottiATgmail/spark_udf

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.Obje

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-21 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595479#comment-15595479
 ] 

Aleksander Eskilson commented on SPARK-18016:
-

Thanks, yeah, I definitely recognize that splitting at this level would be 
quite a bit more complicated than the method splitting used in the interior of 
generated classes (which all seemed to be about getting around the 64 KB 
method-size limit).

We have some schemas from existing Hadoop workflows that we'd like to encode to 
Datasets, as well as larger, aggregate Datasets we'd like to create. We 
recognize this as an existing limitation, though, and have discussed some 
workarounds.
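
For context, a rough sketch of the idea behind that existing method-level 
splitting (not Spark's actual implementation; the helper name and chunk size 
are made up): long runs of generated statements get chunked into private 
helper methods so no single method exceeds the 64 KB bytecode limit, but every 
chunk still adds entries to the same class's constant pool, which is why the 
same trick does not help with this ticket.

{code}
// Hypothetical illustration: emits Java source split into helper methods.
def splitIntoMethods(statements: Seq[String], chunkSize: Int = 100): String = {
  val helpers = statements.grouped(chunkSize).zipWithIndex.map { case (chunk, i) =>
    s"private void apply_$i(InternalRow row) {\n  ${chunk.mkString("\n  ")}\n}"
  }.toSeq
  val calls = helpers.indices.map(i => s"apply_$i(row);").mkString("\n  ")
  (helpers :+ s"public void apply(InternalRow row) {\n  $calls\n}").mkString("\n\n")
}
{code}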

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.

[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-10-21 Thread Jon Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595521#comment-15595521
 ] 

Jon Maurer commented on SPARK-17097:


If ding's solution resolves the issue, should this be marked "not a bug"?

> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 
> and then it behaves as expected.
> (Apologies if this has been fixed in later versions; we're unfortunately 
> pegged to 2.10.5 / 1.6.0 for now.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2016-10-21 Thread Ian Hummel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595524#comment-15595524
 ] 

Ian Hummel commented on SPARK-5158:
---

I'm running into this now and have done some digging.  My setup is

- Small, dedicated standalone spark cluster
-- using spark.authenticate.secret
-- spark-env.sh sets HADOOP_CONF_DIR correctly on each node
-- core-site.xml has
--- hadoop.security.authentication = kerberos
--- hadoop.security.authorization = true
- Kerberized HDFS cluster

Reading and writing to HDFS in local mode works fine, provided I have run 
{{kinit}} beforehand.  Running a distributed job via the standalone cluster 
does not, seemingly because clients connecting to standalone clusters don't 
attempt to fetch/forward HDFS delegation tokens.

What I had hoped would work is ssh'ing onto each standalone worker node 
individually and running kinit out-of-process before submitting my job. I 
figured that since the executors are launched as my unix user, they would 
inherit my kerberos context and be able to talk to HDFS, just as they can in 
local mode.

I verified with a debugger that the {{UserGroupInformation}} in the worker JVMs 
correctly picks up the fact that the user the process is running as can access 
the kerberos ticket cache.

But it still doesn't work.

The reason is that the executor process ({{CoarseGrainedExecutorBackend}}) does 
something like this:

{code}
SparkHadoopUtil.get.runAsSparkUser { () =>
...
env.rpcEnv.setupEndpoint("Executor", new 
CoarseGrainedExecutorBackend(env.rpcEnv, driverUrl, executorId, hostname, 
cores, userClassPath, env))  
...
}
{code}

{{runAsSparkUser}} does this:

{code}
  def runAsSparkUser(func: () => Unit) {
val user = Utils.getCurrentUserName()
logDebug("running as user: " + user)
val ugi = UserGroupInformation.createRemoteUser(user)
transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
ugi.doAs(new PrivilegedExceptionAction[Unit] {
  def run: Unit = func()
})
  }
{code}

{{createRemoteUser}} does this:

{code}
  public static UserGroupInformation createRemoteUser(String user) {
if (user == null || user.isEmpty()) {
  throw new IllegalArgumentException("Null user");
}
Subject subject = new Subject();
subject.getPrincipals().add(new User(user));
UserGroupInformation result = new UserGroupInformation(subject);
result.setAuthenticationMethod(AuthenticationMethod.SIMPLE);
return result;
  }
{code}

So effectively, if we had an HDFS delegation token, we would have copied it 
over in {{transferCredentials}}, but since there is no way for the client to 
include one when the task is submitted over the wire, we are creating a 
_blank_ UGI from scratch and losing the Kerberos context. Subsequent calls to 
HDFS are attempted with "simple" authentication and everything fails.


One workaround is that you can obtain an HDFS delegation token out of band, 
store it in a file, make it available on all worker nodes and then ensure 
executors are launched with {{HADOOP_TOKEN_FILE_LOCATION}} set.  To be more 
specific:

On client machine:
- ensure {{core-site.xml}}, {{hdfs-site.xml}} and {{yarn-site.xml}} are 
configured properly
- ensure {{HDFS_CONF_DIR}} is set
- Run {{spark-submit --class 
org.apache.hadoop.hdfs.tools.DelegationTokenFetcher "" --renewer null 
/nfs/path/to/TOKEN}}

On worker machines:
- ensure {{/nfs/path/to/TOKEN}} is readable

On client machine:
- submit job adding {{--conf 
"spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/nfs/path/to/TOKEN"}}

There are obviously issues with this in terms of expiration, renewal, etc... 
just wanted to mention it for the record.


Another workaround is a build of Spark which simply comments out the 
{{runAsSparkUser}} call.  In this case users can simply have a cron job running 
kinit in the background (using a keytab) and the spawned executor will use the 
inherited kerberos context to talk to HDFS.

It seems like {{CoarseGrainedExecutorBackend}} is also used by Mesos, and I 
noticed SPARK-12909. If security doesn't even work for Mesos or Standalone, why 
do we even try the {{runAsSparkUser}} call? It honestly seems like there is no 
reason for it... proxy users are not useful outside of a kerberized context 
(right?). There is no real secured user identity when running in a standalone 
cluster (or, from what I can tell, when running under Mesos), only whatever 
comes from the unix user the workers are running as.

As it stands, we actually _de-escalate_ that user's privileges (by wiping the 
Kerberos context). Shouldn't we just keep them as they are? That would make it 
a lot easier for standalone clusters to interact with a kerberized HDFS.
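
A sketch of the alternative argued for above (my own illustration; the method 
name is made up): reuse the process's current UGI, which may already carry a 
Kerberos TGT from the ticket cache, instead of constructing a blank remote 
user.

{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runAsCurrentUser(func: () => Unit): Unit = {
  // Keep whatever credentials the process already has (e.g. a kinit'd TGT)
  // rather than wiping them with createRemoteUser.
  val ugi = UserGroupInformation.getCurrentUser()
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = func()
  })
}
{code}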

I know this ticket is more about forwarding keytabs to the executors, but the 
scenario outlined above also gets to that use case.

Thoughts?

> Allow for keytab-based HDFS security in Standalon

[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-10-21 Thread Seth Bromberger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595562#comment-15595562
 ] 

Seth Bromberger commented on SPARK-17097:
-

OK by me. It's confusing behavior but definitely not a bug :)

> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 
> and then it behaves as expected.
> (Apologies if this has been fixed in later versions; we're unfortunately 
> pegged to 2.10.5 / 1.6.0 for now.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-10-21 Thread Seth Bromberger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Bromberger resolved SPARK-17097.
-
Resolution: Not A Bug

> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 
> and then it behaves as expected.
> (Apologies if this has been fixed in later versions; we're unfortunately 
> pegged to 2.10.5 / 1.6.0 for now.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-10-21 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595569#comment-15595569
 ] 

Michael Allman commented on SPARK-17990:


The main problem as I see it is one of user experience. If the user puts an 
upper case letter in the partition column name, then things won't work as 
expected. If we're not going to support partition column names with upper case 
letters, I think we should throw an informative error when the user tries to 
create one, preferably in planning. What do you think?
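
In the meantime, a possible workaround sketch (an assumption on my part, not a 
verified fix): give each ADD PARTITION an explicit LOCATION pointing at the 
mixed-case directory, so the lowercased partition column name in the metastore 
does not dictate the path.

{code}
// Paths follow the /tmp example in the description; adjust to the real location.
(0 to 9).foreach { p =>
  sql(s"ALTER TABLE mixed_case_partitioning ADD PARTITION (partCol=$p) " +
      s"LOCATION '/tmp/mixed_case_partitioning/partCol=$p'")
}
{code}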

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/

[jira] [Comment Edited] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-10-21 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595619#comment-15595619
 ] 

Barry Becker edited comment on SPARK-16216 at 10/21/16 4:41 PM:


If a timezone is not specified, the date should be interpreted as being in local 
time. Adding a time zone when none was specified is not the right thing to do, 
since it makes an assumption that is not necessarily true. I think JSON is doing 
the right thing above by leaving off the timezone. I just updated to 2.0.1 and 
see that one of my tests broke because of this.
Here is my test case. I create a DataFrame containing this data:
{code}
val ISO_DATE_FORMAT = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss")
val columnData = List(
  new 
Timestamp(ISO_DATE_FORMAT.parseDateTime("2012-01-03T09:12:00").getMillis),
  null,
  new 
Timestamp(ISO_DATE_FORMAT.parseDateTime("2015-02-23T18:00:00").getMillis))
{code}
then write it to a file using
{code}
dataframe.write.format("csv") 
.option("delimiter", "\t")
.option("header", "false")
.option("nullValue", NULL_VALUE)
.option("dateFormat", "-MM-dd'T'HH:mm:ss")
.option("escape", "\\") 
.save(tempFileName)
{code}
Note that I specifically do not want a time zone when I write my dateTimes to 
the file. They are in local time, not UTC or GMT. I do not want a time zone added.

The dataFile used to contain
{code}
2012-01-03T09:12:00
?
2015-02-23T18:00:00
{code}
which was correct with Spark 1.6.2. Now, with 2.0.1, it contains
{code}
2012-01-03T09:12:00.000-08:00
?
2015-02-23T18:00:00.000-08:00
{code}
which is not correct. I think the previous behavior was correct. Can we reopen 
this issue? If I actually wanted the timezone to be treated as UTC, I could add 
an explicit Z at the end.
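
A hedged workaround sketch, not part of the original report (the column name 
{{ts}} is hypothetical): format the timestamp into a plain local-time string 
before writing, so the CSV writer emits exactly this pattern with no zone offset 
appended. Note that {{date_format}} renders in the JVM's default (local) time zone.
{code}
import org.apache.spark.sql.functions.{col, date_format}

// Render the timestamp ourselves as a local-time string; the CSV writer then
// writes the string column verbatim, with no zone offset appended.
val withString = dataframe.withColumn(
  "ts_string", date_format(col("ts"), "yyyy-MM-dd'T'HH:mm:ss"))
{code}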



was (Author: barrybecker4):
If a timezone is not specified, the date should be interpreted as being in local 
time. Adding a time zone when none was specified is not the right thing to do, 
since it makes an assumption that is not necessarily true. I think JSON is doing 
the right thing above by leaving off the timezone. I just updated to 2.0.1 and 
see that one of my tests broke because of this.
Here is my test case. I create a DataFrame containing this data:
{code}
val ISO_DATE_FORMAT = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss")
val columnData = List(
  new 
Timestamp(ISO_DATE_FORMAT.parseDateTime("2012-01-03T09:12:00").getMillis),
  null,
  new 
Timestamp(ISO_DATE_FORMAT.parseDateTime("2015-02-23T18:00:00").getMillis))
{code}
then write it to a file using
{code}
dataframe.write.format("csv") 
.option("delimiter", "\t")
.option("header", "false")
.option("nullValue", NULL_VALUE)
.option("dateFormat", "-MM-dd'T'HH:mm:ss")
.option("escape", "\\") 
.save(tempFileName)
{code}
Note that I specifically do not want a time zone when I write my dateTimes to 
the file. They are in local time, not UTC or GMT. I do not want a time zone added.

The dataFile used to contain
{code}
2012-01-03T09:12:00
?
2015-02-23T18:00:00
{code}
which was correct with Spark 1.6.2. Now, with 2.0.1, it contains
{code}
2012-01-03T09:12:00.000-08:00
?
2015-02-23T18:00:00.000-08:00
{code}
which is not correct. I think the previous behavior was correct. Can we reopen?


> CSV data source does not write date and timestamp correctly
> ---
>
> Key: SPARK-16216
> URL: https://issues.apache.org/jira/browse/SPARK-16216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
> {code}
> ++
> |date|
> ++
> |14406372|
> |14144598|
> |14540400|
> ++
> {code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just 
> like the JSON data source does.
> Also, the CSV data source currently supports the {{dateFormat}} option to read 
> dates and timestamps in a custom format. It might be better if this option 
> could be applied when writing as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-10-21 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595619#comment-15595619
 ] 

Barry Becker commented on SPARK-16216:
--

If a timezone is not specified, the date should be interpreted as being in local 
time. Adding a time zone when none was specified is not the right thing to do, 
since it makes an assumption that is not necessarily true. I think JSON is doing 
the right thing above by leaving off the timezone. I just updated to 2.0.1 and 
see that one of my tests broke because of this.
Here is my test case. I create a DataFrame containing this data:
{code}
val ISO_DATE_FORMAT = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss")
val columnData = List(
  new 
Timestamp(ISO_DATE_FORMAT.parseDateTime("2012-01-03T09:12:00").getMillis),
  null,
  new 
Timestamp(ISO_DATE_FORMAT.parseDateTime("2015-02-23T18:00:00").getMillis))
{code}
then write it to a file using
{code}
dataframe.write.format("csv") 
.option("delimiter", "\t")
.option("header", "false")
.option("nullValue", NULL_VALUE)
.option("dateFormat", "-MM-dd'T'HH:mm:ss")
.option("escape", "\\") 
.save(tempFileName)
{code}
Note that I specifically do not want a time zone when I write my dateTimes to 
the file. They are in local time, not UTC or GMT. I do not want a time zone added.

The dataFile used to contain
{code}
2012-01-03T09:12:00
?
2015-02-23T18:00:00
{code}
which was correct with Spark 1.6.2. Now, with 2.0.1, it contains
{code}
2012-01-03T09:12:00.000-08:00
?
2015-02-23T18:00:00.000-08:00
{code}
which is not correct. I think the previous behavior was correct. Can we reopen?


> CSV data source does not write date and timestamp correctly
> ---
>
> Key: SPARK-16216
> URL: https://issues.apache.org/jira/browse/SPARK-16216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
> {code}
> ++
> |date|
> ++
> |14406372|
> |14144598|
> |14540400|
> ++
> {code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just 
> like the JSON data source does.
> Also, the CSV data source currently supports the {{dateFormat}} option to read 
> dates and timestamps in a custom format. It might be better if this option 
> could be applied when writing as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18053:
--

 Summary: ARRAY equality is broken in Spark 2.0
 Key: SPARK-18053
 URL: https://issues.apache.org/jira/browse/SPARK-18053
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18053:
---
Labels: correctness  (was: )

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell session reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18053:
---
Description: 
The following Spark shell session reproduces this issue:
{code}
case class Test(a: Seq[Int])
Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")

sql("SELECT a FROM t WHERE a = array(1)").show()
// +---+
// |  a|
// +---+
// +---+

sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
// +---+
// |  a|
// +---+
// |[1]|
// +---+
{code}
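
As a diagnostic sketch (not from the report): the same predicate can be 
expressed element-wise, which exercises a different code path than whole-array 
equality and can help narrow down where the comparison goes wrong.
{code}
import org.apache.spark.sql.functions.{col, size}

// Element-wise check with the same intent as "a = array(1)" for
// single-element arrays, avoiding the array-equality comparison above.
spark.table("t")
  .where(size(col("a")) === 1 && col("a").getItem(0) === 1)
  .show()
{code}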

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell session reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18039:


Assignee: (was: Apache Spark)

> ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Priority: Minor
>
> Receiver scheduling balance is important for me. For instance, if I have 2 
> executors and each executor has 1 receiver, the calculation time is 0.1s per 
> batch, but if I have 2 executors and one executor has 2 receivers while the 
> other has 0 receivers, the calculation time increases by about 3s per batch.
> In my cluster, executor initialization is slow and I need to wait about 30s, 
> but the dummy job only waits about 4s. I added the conf 
> spark.scheduler.maxRegisteredResourcesWaitingTime, but it does not help.
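
For reference, a minimal sketch of how such settings are applied (the values 
and app name are illustrative only; the reporter notes the waiting-time conf 
did not help in their case):
{code}
import org.apache.spark.sql.SparkSession

// Illustrative values: wait up to 60s, or until all expected executors have
// registered, before the scheduler starts running jobs.
val spark = SparkSession.builder()
  .appName("receiver-scheduling-test")
  .config("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
  .config("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .getOrCreate()
{code}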



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595695#comment-15595695
 ] 

Apache Spark commented on SPARK-18039:
--

User 'Astralidea' has created a pull request for this issue:
https://github.com/apache/spark/pull/15588

> ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Priority: Minor
>
> Receiver scheduling balance is important for me. For instance, if I have 2 
> executors and each executor has 1 receiver, the calculation time is 0.1s per 
> batch, but if I have 2 executors and one executor has 2 receivers while the 
> other has 0 receivers, the calculation time increases by about 3s per batch.
> In my cluster, executor initialization is slow and I need to wait about 30s, 
> but the dummy job only waits about 4s. I added the conf 
> spark.scheduler.maxRegisteredResourcesWaitingTime, but it does not help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18039:


Assignee: Apache Spark

> ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Assignee: Apache Spark
>Priority: Minor
>
> Receiver scheduling balance is important for me. For instance, if I have 2 
> executors and each executor has 1 receiver, the calculation time is 0.1s per 
> batch, but if I have 2 executors and one executor has 2 receivers while the 
> other has 0 receivers, the calculation time increases by about 3s per batch.
> In my cluster, executor initialization is slow and I need to wait about 30s, 
> but the dummy job only waits about 4s. I added the conf 
> spark.scheduler.maxRegisteredResourcesWaitingTime, but it does not help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-21 Thread astralidea (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595709#comment-15595709
 ] 

astralidea commented on SPARK-18039:


I have written a PR to solve and explain this problem. Could you help me review 
it?

> ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Priority: Minor
>
> Receiver scheduling balance is important for me. For instance, if I have 2 
> executors and each executor has 1 receiver, the calculation time is 0.1s per 
> batch, but if I have 2 executors and one executor has 2 receivers while the 
> other has 0 receivers, the calculation time increases by about 3s per batch.
> In my cluster, executor initialization is slow and I need to wait about 30s, 
> but the dummy job only waits about 4s. I added the conf 
> spark.scheduler.maxRegisteredResourcesWaitingTime, but it does not help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18053:
---
Assignee: Wenchen Fan

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
>
> The following Spark shell session reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-21 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595716#comment-15595716
 ] 

Michael Armbrust commented on SPARK-17829:
--

Yeah, I agree.  I think it could be really simple.  We can have one that just 
holds json, and sources can convert to something more specific internally.

{code}
abstract class Offset {
  def json: String
}

/** Used when loading */
case class SerializedOffset(json: String) extends Offset

/** Used to convert to a more specific type */
object LongOffset {
  def apply(serialized: Offset): LongOffset = LongOffset(serialized.json.toLong)
}

case class LongOffset(value: Long) extends Offset {
  def json = value.toString
}
{code}
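
For illustration, a minimal sketch (not from the original comment) of how an 
offset defined this way might round-trip through a JSON-based log:
{code}
// Persist: write the offset's JSON form as the log entry.
val current: Offset = LongOffset(42L)
val entry: String = current.json  // "42"

// Recover: load the raw JSON, then convert it to the source-specific type.
val recovered: LongOffset = LongOffset(SerializedOffset(entry))
{code}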

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use Java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across Spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user-readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-10-21 Thread Barry Becker (JIRA)
Barry Becker created SPARK-18054:


 Summary: Unexpected error from UDF that gets an element of a 
vector: argument 1 requires vector type, however, '`_column_`' is of vector type
 Key: SPARK-18054
 URL: https://issues.apache.org/jira/browse/SPARK-18054
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.1
Reporter: Barry Becker


Not sure if this is a bug in ML or in a more core part of Spark.
It used to work in Spark 1.6.2, but now it gives me an error.

I have a pipeline that contains a NaiveBayesModel which I created like this
{code}
val nbModel = new NaiveBayes()
  .setLabelCol(target)
  .setFeaturesCol(FEATURES_COL)
  .setPredictionCol(PREDICTION_COLUMN)
  .setProbabilityCol("_probability_column_")
  .setModelType("multinomial")
{code}
When I apply that pipeline to some data, there will be a "_probability_column_" 
column of vector type. I want to extract the probability for a specific class 
label using the following, but it no longer works.

{code}
var newDf = pipeline.transform(df)
val extractProbability = udf((vector: DenseVector) => vector(1))
val dfWithProbability = newDf.withColumn("foo", 
extractProbability(col("_probability_column_")))
{code}
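
One possible cause, not confirmed in this report: in Spark 2.x the ML pipeline 
emits vectors from the new {{org.apache.spark.ml.linalg}} package, so a UDF 
written against the old {{org.apache.spark.mllib.linalg.DenseVector}} fails with 
a type-mismatch error like the one quoted below. A minimal sketch of the same 
UDF against the new package:
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Use the spark.ml Vector type, which is what ML pipelines produce in 2.x,
// rather than the older spark.mllib DenseVector.
val extractProbability = udf((v: Vector) => v(1))
val dfWithProbability = newDf.withColumn(
  "foo", extractProbability(col("_probability_column_")))
{code}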

The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. I 
consider this a strange error because it is basically saying "argument 1 requires 
a vector, but we got a vector instead". That does not make any sense to me. It 
wants a vector, and a vector was given. Why does it fail?
{code}

org.apache.spark.sql.AnalysisException: cannot resolve 
'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
requires vector type, however, '`_class_probability_column__`' is of vector 
type.;

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)

[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-21 Thread Nick Orka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595770#comment-15595770
 ] 

Nick Orka commented on SPARK-9219:
--

Maybe it's better to create another JIRA ticket then, one related to UDFs but 
not to the error in this ticket's title?

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObjec
