[jira] [Updated] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-17 Thread Olivier Blanvillain (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Blanvillain updated SPARK-22036:

Description: 
The multiplication of two BigDecimal numbers sometimes returns null. Here is a 
minimal reproduction:

{code:java}
object Main extends App {
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SparkSession

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("REPL")
    .set("spark.ui.enabled", "false")
  val spark = SparkSession.builder().config(conf).appName("REPL").getOrCreate()
  implicit val sqlContext = spark.sqlContext
  import spark.implicits._

  case class X2(a: BigDecimal, b: BigDecimal)
  val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), BigDecimal(-1000.1))))
  val result = ds.select(ds("a") * ds("b")).collect.head
  println(result) // [null]
}
{code}


  was:
The multiplication of two BigDecimal numbers sometimes returns null. We 
discovered this issue while doing property-based testing for the frameless 
project. Here is a minimal reproduction:

{code:java}
object Main extends App {
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SparkSession

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("REPL")
    .set("spark.ui.enabled", "false")
  val spark = SparkSession.builder().config(conf).appName("REPL").getOrCreate()
  implicit val sqlContext = spark.sqlContext
  import spark.implicits._

  case class X2(a: BigDecimal, b: BigDecimal)
  val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), BigDecimal(-1000.1))))
  val result = ds.select(ds("a") * ds("b")).collect.head
  println(result) // [null]
}
{code}



> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. Here is 
> a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>
>   val conf = new SparkConf()
>     .setMaster("local[*]")
>     .setAppName("REPL")
>     .set("spark.ui.enabled", "false")
>   val spark = SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   import spark.implicits._
>
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}
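
A quick way to see what is going on (a follow-up sketch, not part of the 
original report; the exact precision and scale are assumptions based on the 
default schema inference for Scala BigDecimal fields) is to inspect the type 
Spark assigns to the product column:

{code:java}
// Sketch: inspect the decimal type and the plan of the product column. With
// the default decimal(38, 18) inference for BigDecimal fields, multiplication
// can produce a result type whose scale leaves too few integer digits, and a
// value that overflows the declared type comes back as null.
val product = ds.select(ds("a") * ds("b"))
product.printSchema()   // shows the declared DecimalType of the result column
product.explain(true)   // shows the promote-precision / check-overflow steps in the plan
{code}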



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22040) current_date function with timezone id

2017-09-17 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169215#comment-16169215
 ] 

Jacek Laskowski commented on SPARK-22040:
-

That'd be awesome! It's yours, [~mgaido]

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The {{current_date}} function creates a {{CurrentDate}} expression that accepts 
> an optional timezone id, but there's no function variant that exposes it.
> This is to add another {{current_date}} that takes a timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}
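
A sketch of how such an overload could be wired up (assumed shape based on how 
other functions delegate to expressions; not the actual patch):

{code}
// In org.apache.spark.sql.functions (sketch): reuse the existing CurrentDate
// expression, which already takes an optional timezone id.
def current_date(timeZoneId: String): Column = withExpr {
  CurrentDate(Some(timeZoneId))
}
{code}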



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22041) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-09-17 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169220#comment-16169220
 ] 

Yuming Wang commented on SPARK-22041:
-

I know the reason, will fix it soon.

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-22041
> URL: https://issues.apache.org/jira/browse/SPARK-22041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false 
> (OracleIntegrationSuite.scala:158)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22039) Spark 2.1.1 Driver OOM when use interaction for large scale Sparse Vector

2017-09-17 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169228#comment-16169228
 ] 

Hyukjin Kwon commented on SPARK-22039:
--

Please check out https://spark.apache.org/community.html.

> Spark 2.1.1 Driver OOM when use interaction for large scale Sparse Vector
> -
>
> Key: SPARK-22039
> URL: https://issues.apache.org/jira/browse/SPARK-22039
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: wuhaibo
>
> I'm working on large-scale logistic regression for CTR prediction, and the 
> driver OOMs when I use Interaction for feature engineering. Specifically, I 
> interact userid (one-hot encoded, roughly 300,000 dimensions, sparse) with the 
> base features (60 dimensions, dense), and driver memory is set to 40g.
> I attached a remote debugger and found that the Interaction step builds a very 
> large schema and that a lot of the work happens on the driver.
> Two questions:
> Reading the source, Interaction is implemented with sparse vectors, so it 
> should not need this much memory; why does this work have to happen on the 
> driver? The interaction result is a sparse DataFrame of roughly 18,000,000 
> dimensions; why is a schema of 18,000,000 StructFields so big? This is a heap 
> dump taken while the schema was being created (it is too big to dump 
> completely): https://i.stack.imgur.com/h0XBf.jpg
> I reimplemented the interaction with the RDD API and the job finishes in about 
> 5 minutes, so I am wondering whether something is wrong here.
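
For reference, a minimal sketch of the kind of ML pipeline being described 
(column names here are illustrative assumptions, not taken from the report):

{code}
// Sketch: one-hot encode the id column, assemble the dense base features, and
// interact the two with org.apache.spark.ml.feature.Interaction.
import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, VectorAssembler}

val encoder = new OneHotEncoder()
  .setInputCol("userid_index")
  .setOutputCol("userid_vec")
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))   // ... the 60 dense base feature columns
  .setOutputCol("base_features")
val interaction = new Interaction()
  .setInputCols(Array("userid_vec", "base_features"))
  .setOutputCol("interacted_features")
{code}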



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16625) Oracle JDBC table creation fails with ORA-00902: invalid datatype

2017-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169232#comment-16169232
 ] 

Apache Spark commented on SPARK-16625:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/19259

> Oracle JDBC table creation fails with ORA-00902: invalid datatype
> -
>
> Key: SPARK-16625
> URL: https://issues.apache.org/jira/browse/SPARK-16625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Daniel Darabos
>Assignee: Yuming Wang
> Fix For: 2.1.0
>
>
> Unfortunately I know very little about databases, but I figure this is a bug.
> I have a DataFrame with the following schema: 
> {noformat}
> StructType(StructField(dst,StringType,true), StructField(id,LongType,true), 
> StructField(src,StringType,true))
> {noformat}
> I am trying to write it to an Oracle database like this:
> {code:java}
> String url = "jdbc:oracle:thin:root/rootroot@:1521:db";
> java.util.Properties p = new java.util.Properties();
> p.setProperty("driver", "oracle.jdbc.OracleDriver");
> df.write().mode("overwrite").jdbc(url, "my_table", p);
> {code}
> And I get:
> {noformat}
> Exception in thread "main" java.sql.SQLSyntaxErrorException: ORA-00902: 
> invalid datatype
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1108)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:541)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:264)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:598)
>   at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:213)
>   at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:26)
>   at 
> oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1241)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1558)
>   at 
> oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:2498)
>   at 
> oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:2431)
>   at 
> oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:975)
>   at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:302)
> {noformat}
> The Oracle server I am running against is the one I get on Amazon RDS for 
> engine type {{oracle-se}}. The same code (with the right driver) against the 
> RDS instance with engine type {{MySQL}} works.
> The error message is the same as in 
> https://issues.apache.org/jira/browse/SPARK-12941. Could it be that {{Long}} 
> is also translated into the wrong data type? Thanks.
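
One direction for this kind of type mismatch (a sketch built on the public 
JdbcDialect extension point; not necessarily the fix that was merged for this 
ticket) is to register a dialect that maps the offending Catalyst type to a 
native Oracle type:

{code}
// Sketch: map Spark's LongType to Oracle's NUMBER(19), since Oracle has no
// BIGINT type. Registering the dialect makes DataFrameWriter.jdbc use it for
// matching JDBC URLs.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object OracleLongDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case LongType => Some(JdbcType("NUMBER(19)", java.sql.Types.NUMERIC))
    case _        => None
  }
}

JdbcDialects.registerDialect(OracleLongDialect)
{code}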



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22041) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169231#comment-16169231
 ] 

Apache Spark commented on SPARK-22041:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/19259

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-22041
> URL: https://issues.apache.org/jira/browse/SPARK-22041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false 
> (OracleIntegrationSuite.scala:158)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169230#comment-16169230
 ] 

Apache Spark commented on SPARK-19318:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/19259

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-19318
> URL: https://issues.apache.org/jira/browse/SPARK-19318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Suresh Thalamati
> Fix For: 2.2.0
>
>
> = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General 
> data types to be mapped to Oracle' =
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals("class java.sql.Date") was false 
> (OracleIntegrationSuite.scala:136)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22041) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22041:


Assignee: (was: Apache Spark)

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-22041
> URL: https://issues.apache.org/jira/browse/SPARK-22041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false 
> (OracleIntegrationSuite.scala:158)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22041) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22041:


Assignee: Apache Spark

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-22041
> URL: https://issues.apache.org/jira/browse/SPARK-22041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false 
> (OracleIntegrationSuite.scala:158)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-22041) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-09-17 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-22041:

Comment: was deleted

(was: I know the reason, will fix it soon.)

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-22041
> URL: https://issues.apache.org/jira/browse/SPARK-22041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false 
> (OracleIntegrationSuite.scala:158)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22043) Python profile, show_profiles() and dump_profiles(), should throw an error with a better message

2017-09-17 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-22043:


 Summary: Python profile, show_profiles() and dump_profiles(), 
should throw an error with a better message
 Key: SPARK-22043
 URL: https://issues.apache.org/jira/browse/SPARK-22043
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Trivial


I mistakenly ran a profiling session today without {{spark.python.profile}} 
enabled and was met with these unfriendly messages:

{code}
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
    self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
    self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
{code}

It looks like we should give a better message that says {{spark.python.profile}} 
should be enabled.
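
A sketch of the kind of guard being asked for (assumed shape, not the actual 
patch):

{code}
# In pyspark/context.py (sketch): fail with a descriptive error when the
# profiler collector was never created because spark.python.profile is off.
def show_profiles(self):
    if self.profiler_collector is None:
        raise RuntimeError("'spark.python.profile' configuration must be set to "
                           "'true' to enable Python profiling.")
    self.profiler_collector.show_profiles()
{code}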



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22043) Python profile, show_profiles() and dump_profiles(), should throw an error with a better message

2017-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22043:


Assignee: (was: Apache Spark)

> Python profile, show_profiles() and dump_profiles(), should throw an error 
> with a better message
> 
>
> Key: SPARK-22043
> URL: https://issues.apache.org/jira/browse/SPARK-22043
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> I mistakenly ran a profiling session today without {{spark.python.profile}} 
> enabled and was met with these unfriendly messages:
> {code}
> >>> sc.show_profiles()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
>     self.profiler_collector.show_profiles()
> AttributeError: 'NoneType' object has no attribute 'show_profiles'
> >>> sc.dump_profiles("/tmp/abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
>     self.profiler_collector.dump_profiles(path)
> AttributeError: 'NoneType' object has no attribute 'dump_profiles'
> {code}
> It looks like we should give a better message that says {{spark.python.profile}} 
> should be enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22043) Python profile, show_profiles() and dump_profiles(), should throw an error with a better message

2017-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169267#comment-16169267
 ] 

Apache Spark commented on SPARK-22043:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19260

> Python profile, show_profiles() and dump_profiles(), should throw an error 
> with a better message
> 
>
> Key: SPARK-22043
> URL: https://issues.apache.org/jira/browse/SPARK-22043
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> I mistakenly ran a profiling session today without {{spark.python.profile}} 
> enabled and was met with these unfriendly messages:
> {code}
> >>> sc.show_profiles()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
>     self.profiler_collector.show_profiles()
> AttributeError: 'NoneType' object has no attribute 'show_profiles'
> >>> sc.dump_profiles("/tmp/abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
>     self.profiler_collector.dump_profiles(path)
> AttributeError: 'NoneType' object has no attribute 'dump_profiles'
> {code}
> It looks like we should give a better message that says {{spark.python.profile}} 
> should be enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22043) Python profile, show_profiles() and dump_profiles(), should throw an error with a better message

2017-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22043:


Assignee: Apache Spark

> Python profile, show_profiles() and dump_profiles(), should throw an error 
> with a better message
> 
>
> Key: SPARK-22043
> URL: https://issues.apache.org/jira/browse/SPARK-22043
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> I mistakenly ran a profiling session today without {{spark.python.profile}} 
> enabled and was met with these unfriendly messages:
> {code}
> >>> sc.show_profiles()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
>     self.profiler_collector.show_profiles()
> AttributeError: 'NoneType' object has no attribute 'show_profiles'
> >>> sc.dump_profiles("/tmp/abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
>     self.profiler_collector.dump_profiles(path)
> AttributeError: 'NoneType' object has no attribute 'dump_profiles'
> {code}
> It looks like we should give a better message that says {{spark.python.profile}} 
> should be enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22040) current_date function with timezone id

2017-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169272#comment-16169272
 ] 

Apache Spark commented on SPARK-22040:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19261

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The {{current_date}} function creates a {{CurrentDate}} expression that accepts 
> an optional timezone id, but there's no function variant that exposes it.
> This is to add another {{current_date}} that takes a timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22040) current_date function with timezone id

2017-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22040:


Assignee: Apache Spark

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> The {{current_date}} function creates a {{CurrentDate}} expression that accepts 
> an optional timezone id, but there's no function variant that exposes it.
> This is to add another {{current_date}} that takes a timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22040) current_date function with timezone id

2017-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22040:


Assignee: (was: Apache Spark)

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The {{current_date}} function creates a {{CurrentDate}} expression that accepts 
> an optional timezone id, but there's no function variant that exposes it.
> This is to add another {{current_date}} that takes a timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22044) explain function with codegen and cost parameters

2017-09-17 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-22044:
---

 Summary: explain function with codegen and cost parameters
 Key: SPARK-22044
 URL: https://issues.apache.org/jira/browse/SPARK-22044
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Minor


The {{explain}} operator creates an {{ExplainCommand}} runnable command that 
accepts (among other things) {{codegen}} and {{cost}} arguments.

There's no variant of {{explain}} that exposes these. It is, however, possible 
through SQL, which is kind of surprising given how much focus is devoted to the 
Dataset API.

This is to add another {{explain}} with {{codegen}} and {{cost}} arguments, i.e.

{code}
def explain(codegen: Boolean = false, cost: Boolean = false): Unit
{code}
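
A sketch of how the overload could delegate to the existing command, mirroring 
what the current {{explain(extended)}} does (assumed wiring, not an actual 
patch):

{code}
// In Dataset (sketch): build an ExplainCommand with the requested flags and
// print its output rows, the same way explain(extended) already works.
def explain(codegen: Boolean = false, cost: Boolean = false): Unit = {
  val cmd = ExplainCommand(queryExecution.logical, extended = false,
    codegen = codegen, cost = cost)
  sparkSession.sessionState.executePlan(cmd).executedPlan
    .executeCollect().foreach(r => println(r.getString(0)))
}
{code}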



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22045) creating and starting a NetCat in test case programmatically

2017-09-17 Thread bluejoe (JIRA)
bluejoe created SPARK-22045:
---

 Summary: creating and starting a NetCat in test case 
programmatically
 Key: SPARK-22045
 URL: https://issues.apache.org/jira/browse/SPARK-22045
 Project: Spark
  Issue Type: New Feature
  Components: Tests
Affects Versions: 2.2.0
Reporter: bluejoe


Hi all,

I have written a MockNetCat class, which helps developers start a netcat server 
programmatically, instead of manually launching the
{code:java}
nc -lk 
{code}
command for tests.

I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
MockNetCat.

Using MockNetCat is very simple:

{code:java}
var nc: MockNetCat = MockNetCat.start();
{code}

This starts a NetCat server, and data can be generated using the following code:

{code:java}
nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
{code}
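
A minimal sketch of the kind of helper described above (names and shape are 
assumptions for illustration; this is not the reporter's actual MockNetCat 
implementation):

{code:java}
// A tiny socket server that test code can start programmatically and push
// lines into, instead of running "nc -lk" by hand.
import java.io.PrintWriter
import java.net.ServerSocket

class MockNetCat(port: Int) {
  private val server = new ServerSocket(port)
  @volatile private var writer: PrintWriter = _

  // Accept a single client (e.g. Spark's "socket" source) in the background.
  new Thread(new Runnable {
    override def run(): Unit = {
      val socket = server.accept()
      writer = new PrintWriter(socket.getOutputStream)
    }
  }).start()

  // Push raw text to the connected client, as if typed into an nc shell.
  // (A real implementation would wait for the connection instead of assuming it.)
  def writeData(data: String): Unit = {
    writer.print(data)
    writer.flush()
  }

  def stop(): Unit = server.close()
}

object MockNetCat {
  // The default port is an arbitrary example value, not from the report.
  def start(port: Int = 9999): MockNetCat = new MockNetCat(port)
}
{code}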



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22045) creating and starting a NetCat in test case programmatically

2017-09-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169298#comment-16169298
 ] 

Sean Owen commented on SPARK-22045:
---

What is the use of this in Spark?

> creating and starting a NetCat in test case programmatically
> 
>
> Key: SPARK-22045
> URL: https://issues.apache.org/jira/browse/SPARK-22045
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: bluejoe
>  Labels: netcat
>
> Hi all,
> I have written a MockNetCat class, which helps developers start a netcat 
> server programmatically, instead of manually launching the
> {code:java}
> nc -lk 
> {code}
> command for tests.
> I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
> MockNetCat.
> Using MockNetCat is very simple:
> {code:java}
> var nc: MockNetCat = MockNetCat.start();
> {code}
> This starts a NetCat server, and data can be generated using the following 
> code:
> {code:java}
> nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22045) creating and starting a NetCat in test case programmatically

2017-09-17 Thread bluejoe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169302#comment-16169302
 ] 

bluejoe commented on SPARK-22045:
-

When testing Spark streaming code like:

{code:java}
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", )
  .load()
{code}

we have to start netcat on the command line first:

{code:java}
$ nc -lk 
{code}

Also, our test program cannot capture what users input in the nc shell, so I 
wrote this MockNetCat class.

With it, developers can start a mock NetCat server and send data to the server 
(simulating the text a user would type into the nc shell).


> creating and starting a NetCat in test case programmatically
> 
>
> Key: SPARK-22045
> URL: https://issues.apache.org/jira/browse/SPARK-22045
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: bluejoe
>  Labels: netcat
>
> Hi all,
> I have written a MockNetCat class, which helps developers start a netcat 
> server programmatically, instead of manually launching the
> {code:java}
> nc -lk 
> {code}
> command for tests.
> I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
> MockNetCat.
> Using MockNetCat is very simple:
> {code:java}
> var nc: MockNetCat = MockNetCat.start();
> {code}
> This starts a NetCat server, and data can be generated using the following 
> code:
> {code:java}
> nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22045) creating and starting a NetCat in test case programmatically

2017-09-17 Thread bluejoe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169302#comment-16169302
 ] 

bluejoe edited comment on SPARK-22045 at 9/17/17 1:44 PM:
--

When testing Spark streaming code like:

{code:java}
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", )
  .load()
{code}

we have to start netcat on the command line first:

{code:java}
$ nc -lk 
{code}

Also, our test program cannot capture what users input in the nc shell, so I 
wrote this MockNetCat class.

With it, developers can start a mock NetCat server in test cases and send data 
to the server (simulating the text a user would type into the nc shell).



was (Author: bluejoe2008):
When testing Spark streaming code like:

{code:java}
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", )
  .load()
{code}

we have to start netcat on the command line first:

{code:java}
$ nc -lk 
{code}

Also, our test program cannot capture what users input in the nc shell, so I 
wrote this MockNetCat class.

With it, developers can start a mock NetCat server and send data to the server 
(simulating the text a user would type into the nc shell).


> creating and starting a NetCat in test case programmatically
> 
>
> Key: SPARK-22045
> URL: https://issues.apache.org/jira/browse/SPARK-22045
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: bluejoe
>  Labels: netcat
>
> Hi all,
> I have written a MockNetCat class, which helps developers start a netcat 
> server programmatically, instead of manually launching the
> {code:java}
> nc -lk 
> {code}
> command for tests.
> I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
> MockNetCat.
> Using MockNetCat is very simple:
> {code:java}
> var nc: MockNetCat = MockNetCat.start();
> {code}
> This starts a NetCat server, and data can be generated using the following 
> code:
> {code:java}
> nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22045) creating and starting a NetCat in test cases programmatically

2017-09-17 Thread bluejoe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bluejoe updated SPARK-22045:

Summary: creating and starting a NetCat in test cases programmatically  
(was: creating and starting a NetCat in test case programmatically)

> creating and starting a NetCat in test cases programmatically
> -
>
> Key: SPARK-22045
> URL: https://issues.apache.org/jira/browse/SPARK-22045
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: bluejoe
>  Labels: netcat
>
> Hi all,
> I have written a MockNetCat class, which helps developers start a netcat 
> server programmatically, instead of manually launching the
> {code:java}
> nc -lk 
> {code}
> command for tests.
> I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
> MockNetCat.
> Using MockNetCat is very simple:
> {code:java}
> var nc: MockNetCat = MockNetCat.start();
> {code}
> This starts a NetCat server, and data can be generated using the following 
> code:
> {code:java}
> nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22045) creating and starting a NetCat in test cases programmatically

2017-09-17 Thread bluejoe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bluejoe updated SPARK-22045:

Flags: Patch,Important  (was: Patch)

> creating and starting a NetCat in test cases programmatically
> -
>
> Key: SPARK-22045
> URL: https://issues.apache.org/jira/browse/SPARK-22045
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: bluejoe
>  Labels: netcat
>
> Hi all,
> I have written a MockNetCat class, which helps developers start a netcat 
> server programmatically, instead of manually launching the
> {code:java}
> nc -lk 
> {code}
> command for tests.
> I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
> MockNetCat.
> Using MockNetCat is very simple:
> {code:java}
> var nc: MockNetCat = MockNetCat.start();
> {code}
> This starts a NetCat server, and data can be generated using the following 
> code:
> {code:java}
> nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22045) creating and starting a NetCat in test cases programmatically

2017-09-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22045.
---
Resolution: Not A Problem

The test harness doesn't require this; it's just an example. If someone really 
did want to use netcat in a real job, they wouldn't want to use a test mock. If 
this code is useful, it doesn't need to be in Spark.

> creating and starting a NetCat in test cases programmatically
> -
>
> Key: SPARK-22045
> URL: https://issues.apache.org/jira/browse/SPARK-22045
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: bluejoe
>  Labels: netcat
>
> Hi all,
> I have written a MockNetCat class, which helps developers start a netcat 
> server programmatically, instead of manually launching the
> {code:java}
> nc -lk 
> {code}
> command for tests.
> I have also completed a MockNetCatTest class, a JUnit 4 test case that tests 
> MockNetCat.
> Using MockNetCat is very simple:
> {code:java}
> var nc: MockNetCat = MockNetCat.start();
> {code}
> This starts a NetCat server, and data can be generated using the following 
> code:
> {code:java}
> nc.writeData("hello\r\nworld\r\nbye\r\nworld\r\n");
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations

2017-09-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169318#comment-16169318
 ] 

Kazuaki Ishizaki commented on SPARK-22033:
--

I think {{ColumnVector}} and {{HashMapGrowthStrategy}} may have a similar issue.
What do you think?

> BufferHolder size checks should account for the specific VM array size 
> limitations
> --
>
> Key: SPARK-22033
> URL: https://issues.apache.org/jira/browse/SPARK-22033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Vadim Semenov
>Priority: Minor
>
> User may get the following OOM Error while running a job with heavy 
> aggregations
> ```
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:33)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ```
> The [`BufferHolder.grow` tries to create a byte array of `Integer.MAX_VALUE` 
> here](https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L72)
>  but the maximum size of an array depends on specifics of a VM.
> The safest value seems to be `Integer.MAX_VALUE - 8` 
> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229
> In my JVM:
> ```
> java -version
> openjdk version "1.8.0_141"
> OpenJDK Runtime Environment (build 1.8.0_141-b16)
> OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)
> ```
> the max is `new Array[Byte](Integer.MAX_VALUE - 2)`
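
A small illustration of the clamp being suggested (a sketch of the general 
technique, not the patch that was eventually merged):

{code}
// Grow a buffer geometrically but never request more than the VM-safe bound.
// Integer.MAX_VALUE - 8 is the conservative constant used by java.util.ArrayList.
val MaxArraySize: Int = Int.MaxValue - 8

def boundedGrowth(currentSize: Int, neededSize: Int): Int = {
  val required = currentSize.toLong + neededSize
  if (required > MaxArraySize) {
    throw new UnsupportedOperationException(
      s"Cannot grow buffer to $required bytes: exceeds the VM array size limit")
  }
  // Double when possible, otherwise cap at the limit.
  math.min(required * 2, MaxArraySize.toLong).toInt
}
{code}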



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22032) Speed up StructType.fromInternal

2017-09-17 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22032.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19249
[https://github.com/apache/spark/pull/19249]

> Speed up StructType.fromInternal
> 
>
> Key: SPARK-22032
> URL: https://issues.apache.org/jira/browse/SPARK-22032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Fix For: 2.3.0
>
>
> StructType.fromInternal is calling f.fromInternal(v) for every field.
> We can use needConversion method to limit the number of function calls.
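
A sketch of the optimization being described (assumed shape, not necessarily 
the merged change):

{code}
# In pyspark/sql/types.py, StructType (sketch): only call fromInternal() on
# fields that actually need conversion, instead of on every field.
def fromInternal(self, obj):
    if obj is None:
        return obj
    if not any(f.needConversion() for f in self.fields):
        # fast path: no field needs conversion, reuse the values as-is
        return _create_row(self.names, obj)
    values = [f.fromInternal(v) if f.needConversion() else v
              for f, v in zip(self.fields, obj)]
    return _create_row(self.names, values)
{code}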



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22032) Speed up StructType.fromInternal

2017-09-17 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22032:


Assignee: Maciej Bryński

> Speed up StructType.fromInternal
> 
>
> Key: SPARK-22032
> URL: https://issues.apache.org/jira/browse/SPARK-22032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Assignee: Maciej Bryński
> Fix For: 2.3.0
>
>
> StructType.fromInternal is calling f.fromInternal(v) for every field.
> We can use needConversion method to limit the number of function calls.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-17 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21985.
--
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 19226
[https://github.com/apache/spark/pull/19226]

> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart Berg
>  Labels: bug
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", line 329, in _load_stream_without_unbatching
>     if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?
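
One way to make the check robust rather than removing it (a sketch; the merged 
fix may differ in detail):

{code:}
# In PairDeserializer._load_stream_without_unbatching (sketch): materialize
# batches that have no len() (e.g. itertools.izip objects from re-zipped RDDs)
# before comparing their sizes.
key_batch = key_batch if hasattr(key_batch, '__len__') else list(key_batch)
val_batch = val_batch if hasattr(val_batch, '__len__') else list(val_batch)
if len(key_batch) != len(val_batch):
    raise ValueError("Can not deserialize PairRDD with different number of items"
                     " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}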



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-17 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21985:


Assignee: Andrew Ray

> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart Berg
>Assignee: Andrew Ray
>  Labels: bug
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", line 329, in _load_stream_without_unbatching
>     if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21842) Support Kerberos ticket renewal and creation in Mesos

2017-09-17 Thread Arthur Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168278#comment-16168278
 ] 

Arthur Rand edited comment on SPARK-21842 at 9/18/17 12:42 AM:
---

Hey [~kalvinnchau]

I'm currently of the mind that using the RPC/broadcast approach is better (for 
Mesos) for a couple reasons.
1. We recently added Secret support to Spark on Mesos, this uses a temporary 
file system to put the keytab or TGT in the sandbox of the Spark driver. They 
are packed into the SparkAppConfig (CoarseGrainedSchedulerBackend.scala:L236) 
which is broadcast to the executors, so using RPC/broadcast is consistent with 
this.
2. Keeps all transfers of secure information within Spark.
3. Doesn't _require_ HDFS. 

However I understand that there is a potential risk with executors falsely 
registering with the Driver and getting tokens. I know in the case of DC/OS 
this is less of a concern (we have some protections around this). But this 
could still happen today due to the code mentioned above. We could prevent this 
by keeping track of the executor IDs and only allowing executors to register 
when they have an expected ID..? 


was (Author: arand):
Hey [~kalvinnchau]

I'm currently of the mind that using the RPC/broadcast approach is better (for 
Mesos) for a couple reasons.
1. We recently added Secret support to Spark on Mesos, this uses a temporary 
file system to put the keytab or TGT in the sandbox of the Spark driver. They 
are packed into the SparkAppConfig (CoarseGrainedSchedulerBackend.scala:L236) 
which is broadcast to the executors, so using RPC/broadcast is consistent with 
this.
2. Keeps all transfers of secure information within Spark.
3. Doesn't require HDFS. There is a little bit of a chicken-and-egg situation 
here w.r.t. YARN, but I'm obviously not familiar enough with how 
Spark-YARN-HDFS work together.  

However I understand that there is a potential risk with executors falsely 
registering with the Driver and getting tokens. I know in the case of DC/OS 
this is less of a concern (we have some protections around this). But this 
could still happen today due to the code mentioned above. We could prevent this 
by keeping track of the executor IDs and only allowing executors to register 
when they have an expected ID..? 

> Support Kerberos ticket renewal and creation in Mesos 
> --
>
> Key: SPARK-21842
> URL: https://issues.apache.org/jira/browse/SPARK-21842
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>
> We at Mesosphere have written Kerberos support for Spark on Mesos. The code 
> to use Kerberos on a Mesos cluster has been added to Apache Spark 
> (SPARK-16742). This ticket is to complete the implementation and allow for 
> ticket renewal and creation. Specifically for long running and streaming jobs.
> Mesosphere design doc (needs revision, wip): 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21802) Make sparkR MLP summary() expose probability column

2017-09-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169487#comment-16169487
 ] 

Felix Cheung commented on SPARK-21802:
--

Can you clarify where you see it? I just ran R's spark.mlp against the latest 
master branch and don't see any probability:
{code}
   summary <- summary(model)
>
> summar
Error: object 'summar' not found
> summary
$numOfInputs
[1] 4

$numOfOutputs
[1] 3

$layers
[1] 4 5 4 3

$weights
$weights[[1]]
[1] -0.878743

$weights[[2]]
[1] 0.2154151

$weights[[3]]
[1] -1.16304

$weights[[4]]
[1] -0.6583214

$weights[[5]]
[1] 1.009825

$weights[[6]]
[1] 0.2934758

$weights[[7]]
[1] -0.9528391

$weights[[8]]
[1] 0.4029029

$weights[[9]]
[1] -1.038043

$weights[[10]]
[1] 0.05164362

$weights[[11]]
[1] 0.9349549

$weights[[12]]
[1] -0.4283766

$weights[[13]]
[1] -0.5082246

$weights[[14]]
[1] -0.09600512

$weights[[15]]
[1] -0.7843158

$weights[[16]]
[1] -1.199724

$weights[[17]]
[1] 0.6001083

$weights[[18]]
[1] 0.1102863

$weights[[19]]
[1] 0.8259955

$weights[[20]]
[1] -0.4428631

$weights[[21]]
[1] 0.9691921

$weights[[22]]
[1] -0.8472953

$weights[[23]]
[1] -0.8521915

$weights[[24]]
[1] -0.770886

$weights[[25]]
[1] 0.7276595

$weights[[26]]
[1] -0.7675585

$weights[[27]]
[1] 0.1299603

$weights[[28]]
[1] -1.056605

$weights[[29]]
[1] 0.4421284

$weights[[30]]
[1] -0.3245397

$weights[[31]]
[1] -0.904001

$weights[[32]]
[1] 0.2793773

$weights[[33]]
[1] 1.045579

$weights[[34]]
[1] -0.5379433

$weights[[35]]
[1] -1.006988

$weights[[36]]
[1] -0.9652683

$weights[[37]]
[1] 0.8719215

$weights[[38]]
[1] -0.917228

$weights[[39]]
[1] 1.020896

$weights[[40]]
[1] 0.4951883

$weights[[41]]
[1] 0.7487854

$weights[[42]]
[1] -0.7130144

$weights[[43]]
[1] 0.598029

$weights[[44]]
[1] 0.8097242

$weights[[45]]
[1] -1.056401

$weights[[46]]
[1] -0.2041643

$weights[[47]]
[1] -0.9605507

$weights[[48]]
[1] -0.2151837

$weights[[49]]
[1] 0.9075675

$weights[[50]]
[1] 0.004306968

$weights[[51]]
[1] -0.4778498

$weights[[52]]
[1] 0.3312689

$weights[[53]]
[1] 0.6160091

$weights[[54]]
[1] 0.431806

$weights[[55]]
[1] -0.6039096

$weights[[56]]
[1] -0.008508999

$weights[[57]]
[1] 0.7539017

$weights[[58]]
[1] -1.186487

$weights[[59]]
[1] -0.8660557

$weights[[60]]
[1] 0.4443504

$weights[[61]]
[1] 0.5170843

$weights[[62]]
[1] 0.08373222

$weights[[63]]
[1] -1.039143

$weights[[64]]
[1] -0.4787311
{code}

This isn't the summary(), right? It's the prediction, I think.

> Make sparkR MLP summary() expose probability column
> ---
>
> Key: SPARK-21802
> URL: https://issues.apache.org/jira/browse/SPARK-21802
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Make sparkR MLP summary() expose probability column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20221) Port pyspark.mllib.linalg tests in pyspark/mllib/tests.py to pyspark.ml.linalg

2017-09-17 Thread Daniel Imberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169502#comment-16169502
 ] 

Daniel Imberman commented on SPARK-20221:
-

Hi [~josephkb],

I would like to complete this ticket but have a few clarifying questions (to be 
certain that I am porting over the right work + don't miss anything).

1. I see that there are a few missing tests in ChiSquareTestTests, 
DimensionalityReductionTests, and MlUtilsTests, are there any other tests I 
should be porting over?

2. for VectorTests, it appears that all the tests are equivalent except for a 
few conversion mllib -> ml functions. I assume these would not be ported over?

3. Since there is no distributed "RowMatrix" class in spark/ml, would you want 
me to simply turn these into dataframes?

Thank you,

Daniel

> Port pyspark.mllib.linalg tests in pyspark/mllib/tests.py to pyspark.ml.linalg
> --
>
> Key: SPARK-20221
> URL: https://issues.apache.org/jira/browse/SPARK-20221
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark, Tests
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> There are several linear algebra unit tests in pyspark/mllib/tests.py which 
> should be ported over to pyspark/ml/tests.py.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21953:
---

Assignee: Andrew Ash

> Show both memory and disk bytes spilled if either is present
> 
>
> Key: SPARK-21953
> URL: https://issues.apache.org/jira/browse/SPARK-21953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>Priority: Minor
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
>  should be {{||}} not {{&&}}
> As written now, there must be both memory and disk bytes spilled to show 
> either of them.  If there is only one of those types of spill recorded, it 
> will be hidden.
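
The gist of the fix (an illustrative condition with hypothetical variable 
names, not the exact diff):

{code}
// Show the spill metrics if either counter is non-zero, not only when both are.
val hasBytesSpilled = memoryBytesSpilled > 0 || diskBytesSpilled > 0
{code}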



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21953.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 19164
[https://github.com/apache/spark/pull/19164]

> Show both memory and disk bytes spilled if either is present
> 
>
> Key: SPARK-21953
> URL: https://issues.apache.org/jira/browse/SPARK-21953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Minor
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>
> https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
>  should be {{||}} not {{&&}}
> As written now, there must be both memory and disk bytes spilled to show 
> either of them.  If there is only one of those types of spill recorded, it 
> will be hidden.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21802) Make sparkR MLP summary() expose probability column

2017-09-17 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169535#comment-16169535
 ] 

Weichen Xu commented on SPARK-21802:


[~felixcheung] The probability cannot be added to the result of 
`summary(model)` in SparkR. The probability column is generated in the result 
of `predict(df, model)`. So could you clarify your original purpose?

> Make sparkR MLP summary() expose probability column
> ---
>
> Key: SPARK-21802
> URL: https://issues.apache.org/jira/browse/SPARK-21802
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Make sparkR MLP summary() expose probability column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21802) Make sparkR MLP summary() expose probability column

2017-09-17 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169535#comment-16169535
 ] 

Weichen Xu edited comment on SPARK-21802 at 9/18/17 3:42 AM:
-

[~felixcheung] The probability cannot be added to the result of 
`summary(model)` in SparkR. The probability column is generated in the result 
of `predict(model, df)`. So could you clarify your original purpose?


was (Author: weichenxu123):
[~felixcheung] The probability cannot be added to the result of 
`summary(model)` in SparkR. The probability column is generated in the result 
of `predict(df, model)`. So could you clarify your original purpose?

> Make sparkR MLP summary() expose probability column
> ---
>
> Key: SPARK-21802
> URL: https://issues.apache.org/jira/browse/SPARK-21802
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Make sparkR MLP summary() expose probability column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22043) Python profile, show_profiles() and dump_profiles(), should throw an error with a better message

2017-09-17 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22043.
--
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 19260
[https://github.com/apache/spark/pull/19260]

> Python profile, show_profiles() and dump_profiles(), should throw an error 
> with a better message
> 
>
> Key: SPARK-22043
> URL: https://issues.apache.org/jira/browse/SPARK-22043
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>
> I mistakenly missed enabling {{spark.python.profile}} today while profiling 
> and met these unfriendly messages:
> {code}
> >>> sc.show_profiles()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
> self.profiler_collector.show_profiles()
> AttributeError: 'NoneType' object has no attribute 'show_profiles'
> >>> sc.dump_profiles("/tmp/abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
> self.profiler_collector.dump_profiles(path)
> AttributeError: 'NoneType' object has no attribute 'dump_profiles'
> {code}
> It looks like we should give better information saying that {{spark.python.profile}} 
> should be enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22043) Python profile, show_profiles() and dump_profiles(), should throw an error with a better message

2017-09-17 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22043:


Assignee: Hyukjin Kwon

> Python profile, show_profiles() and dump_profiles(), should throw an error 
> with a better message
> 
>
> Key: SPARK-22043
> URL: https://issues.apache.org/jira/browse/SPARK-22043
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> I mistakenly missed enabling {{spark.python.profile}} today while profiling 
> and met these unfriendly messages:
> {code}
> >>> sc.show_profiles()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
> self.profiler_collector.show_profiles()
> AttributeError: 'NoneType' object has no attribute 'show_profiles'
> >>> sc.dump_profiles("/tmp/abc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
> self.profiler_collector.dump_profiles(path)
> AttributeError: 'NoneType' object has no attribute 'dump_profiles'
> {code}
> It looks like we should give better information saying that {{spark.python.profile}} 
> should be enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21802) Make sparkR MLP summary() expose probability column

2017-09-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169575#comment-16169575
 ] 

Felix Cheung commented on SPARK-21802:
--

Yes, if this is from the prediction output (with rawPrediction, etc.), it should 
come from predict() rather than summary(). Sorry, I misspoke.
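
For reference, a minimal Scala-side sketch (assuming Spark 2.3.0+, where the 
MLP model exposes the probability column; the toy data below is made up) 
showing that the probability lives in the per-row prediction output rather 
than in the fitted model's summary:

{code}
// A minimal sketch, assuming Spark 2.3.0+ and a made-up toy dataset.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("mlp-probability").getOrCreate()
import spark.implicits._

// Toy training data: 2 features, 2 classes.
val data = Seq(
  (Vectors.dense(0.0, 0.0), 0.0),
  (Vectors.dense(0.0, 1.0), 1.0),
  (Vectors.dense(1.0, 0.0), 1.0),
  (Vectors.dense(1.0, 1.0), 0.0)
).toDF("features", "label")

val mlp = new MultilayerPerceptronClassifier().setLayers(Array(2, 4, 2)).setSeed(42L)
val model = mlp.fit(data)

// The probability column appears in the per-row prediction output.
model.transform(data).select("prediction", "probability").show(truncate = false)
{code}

So on the SparkR side the column would naturally be surfaced through 
predict(model, df), as noted above.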

> Make sparkR MLP summary() expose probability column
> ---
>
> Key: SPARK-21802
> URL: https://issues.apache.org/jira/browse/SPARK-21802
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Make sparkR MLP summary() expose probability column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2017-09-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169579#comment-16169579
 ] 

Felix Cheung commented on SPARK-18131:
--

Bump. I think this is a really big problem: results from MLlib are basically 
unusable for R users:
{code}
> head(predict(model, test))$probability
[[1]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 130

[[2]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 131

[[3]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 132

[[4]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 133

[[5]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 134

[[6]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 135

> head(predict(model, test))$feature
[[1]]
Java ref type org.apache.spark.ml.linalg.SparseVector id 161

[[2]]
Java ref type org.apache.spark.ml.linalg.SparseVector id 162

[[3]]
Java ref type org.apache.spark.ml.linalg.SparseVector id 163

[[4]]
Java ref type org.apache.spark.ml.linalg.SparseVector id 164

[[5]]
Java ref type org.apache.spark.ml.linalg.SparseVector id 165

[[6]]
Java ref type org.apache.spark.ml.linalg.SparseVector id 166

> head(predict(model, test))$rawPrediction
[[1]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 210

[[2]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 211

[[3]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 212

[[4]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 213
...

{code}
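
Until the backend serializes these properly, a Scala-side workaround sketch 
(toy data below; in practice the input would be the prediction DataFrame, and 
the column name is just an example):

{code}
// A workaround sketch with made-up data; "probability" stands in for the
// Vector-typed column produced by predict().
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("vector-to-array").getOrCreate()
import spark.implicits._

val predictions = Seq(
  (0L, Vectors.dense(0.9, 0.1)),
  (1L, Vectors.dense(0.2, 0.8))
).toDF("id", "probability")

// Flatten the Vector into array<double> before collecting, so the driver
// receives plain numeric arrays instead of JVM object handles.
val vecToArray = udf((v: Vector) => v.toArray)
predictions
  .withColumn("probability", vecToArray(col("probability")))
  .collect()
  .foreach(println)
{code}

Collected this way, the column should come back to R as plain lists of doubles 
rather than Java ref handles.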

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-09-17 Thread Yongqin Xiao (JIRA)

[ 
https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167180#comment-16167180
 ] 

Yongqin Xiao commented on SPARK-18406:
--

[~cloud_fan], I see there are 3 check-ins for this issue, touching multiple 
files. You mentioned the fix will be backported to Spark 2.1.0. Can you let me 
know which single commit to Spark 2.1.0 will address the issue?
The reason I am asking is that my company may not update to Spark 2.2 very 
soon, so I will have to port your fix to our company's versions of Spark 2.1.0 
and 2.0.1. I cannot just use the latest Spark 2.1.0 even after you backport the 
fix, because we have other patches on top of Spark 2.1.0, some of which we 
fixed ourselves.
Thanks for your help.

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues-test.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO Coar

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169586#comment-16169586
 ] 

Hadoop QA commented on SPARK-18406:
---


[ 
https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167180#comment-16167180
 ] 

Yongqin Xiao commented on SPARK-18406:
--

[~cloud_fan], I see there are 3 check-ins for this issue, touching multiple 
files. You mentioned the fix will be backported to Spark 2.1.0. Can you let me 
know which single commit to Spark 2.1.0 will address the issue?
The reason I am asking is that my company may not update to Spark 2.2 very 
soon, so I will have to port your fix to our company's versions of Spark 2.1.0 
and 2.0.1. I cannot just use the latest Spark 2.1.0 even after you backport the 
fix, because we have other patches on top of Spark 2.1.0, some of which we 
fixed ourselves.
Thanks for your help.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoMan

[jira] [Resolved] (SPARK-21113) Support for read ahead input stream to amortize disk IO cost in the Spill reader

2017-09-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21113.
--
   Resolution: Fixed
 Assignee: Sital Kedia
Fix Version/s: 2.3.0

> Support for read ahead input stream to amortize disk IO cost in the Spill 
> reader
> 
>
> Key: SPARK-21113
> URL: https://issues.apache.org/jira/browse/SPARK-21113
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>Priority: Minor
> Fix For: 2.3.0
>
>
> Profiling some of our big jobs, we see that around 30% of the time is being 
> spent in reading the spill files from disk. In order to amortize the disk IO 
> cost, the idea is to implement a read ahead input stream which asynchronously 
> reads ahead from the underlying input stream when a specified amount of data 
> has been read from the current buffer. It does this by maintaining two 
> buffers - an active buffer and a read ahead buffer. The active buffer 
> contains data which should be returned when a read() call is issued. The read 
> ahead buffer is used to asynchronously read from the underlying input stream 
> and once the current active buffer is exhausted, we flip the two buffers so 
> that we can start reading from the read ahead buffer without being blocked in 
> disk I/O.
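
As a rough illustration of the two-buffer idea, here is a simplified sketch 
(not the actual implementation in this change; class and method names, the 
buffer size, and the threading model are all assumptions, and the real version 
handles locking, errors, and bulk reads far more carefully):

{code}
// Simplified sketch only -- names and sizes are assumptions, not Spark's code.
import java.io.InputStream
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

class SimpleReadAheadInputStream(in: InputStream, bufSize: Int = 1 << 20) extends InputStream {
  private val pool = Executors.newSingleThreadExecutor()
  private implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  // Blocking read of one block from the underlying stream, run off the caller's thread.
  private def readBlock(): Array[Byte] = {
    val buf = new Array[Byte](bufSize)
    val n = in.read(buf)
    if (n <= 0) Array.emptyByteArray else java.util.Arrays.copyOf(buf, n)
  }

  private var readAhead: Future[Array[Byte]] = Future(readBlock()) // read ahead buffer
  private var active: Array[Byte] = Array.emptyByteArray           // active buffer
  private var pos = 0

  override def read(): Int = {
    if (pos == active.length) {
      active = Await.result(readAhead, Duration.Inf) // flip: take the prefetched block
      pos = 0
      if (active.isEmpty) return -1                  // underlying stream is exhausted
      readAhead = Future(readBlock())                // immediately prefetch the next block
    }
    val b = active(pos) & 0xff
    pos += 1
    b
  }

  override def close(): Unit = {
    pool.shutdown()
    in.close()
  }
}
{code}

The flip happens inside read(): once the active buffer is drained, the caller 
takes the prefetched block and immediately schedules the next prefetch, so the 
following flip is likely already satisfied by the time it is needed.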



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-09-17 Thread Wenchen Fan (JIRA)

[ 
https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167181#comment-16167181
 ] 

Wenchen Fan commented on SPARK-18406:
-

https://github.com/apache/spark/pull/18099 is the PR that backported the fix to 
2.1

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues-test.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught exception in thread stdout writ

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169661#comment-16169661
 ] 

Hadoop QA commented on SPARK-18406:
---


[ 
https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167181#comment-16167181
 ] 

Wenchen Fan commented on SPARK-18406:
-

https://github.com/apache/spark/pull/18099 is the PR that backported the fix to 
2.1




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 1