[jira] [Created] (SPARK-22303) [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE

2017-10-17 Thread Kohki Nishio (JIRA)
Kohki Nishio created SPARK-22303:


 Summary: [SQL] Getting java.sql.SQLException: Unsupported type 101 
for BINARY_DOUBLE
 Key: SPARK-22303
 URL: https://issues.apache.org/jira/browse/SPARK-22303
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kohki Nishio
Priority: Minor


When a table contains columns such as BINARY_DOUBLE or BINARY_FLOAT, the JDBC 
connector throws a SQLException:

{code}
java.sql.SQLException: Unsupported type 101
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:235)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:291)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
{code}

These types are Oracle-specific; they are described here:
https://docs.oracle.com/cd/E11882_01/timesten.112/e21642/types.htm#TTSQL148
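
Until the built-in Oracle dialect handles these type codes, one possible user-side 
workaround is to register a custom JdbcDialect that maps them to Catalyst types. The 
sketch below is only an illustration, not a proposed fix: the dialect object name is 
mine, 101 is the BINARY_DOUBLE code reported in the stack trace above, and 100 for 
BINARY_FLOAT is taken from Oracle's driver constants.

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Hypothetical workaround: map Oracle's BINARY_FLOAT (100) and BINARY_DOUBLE (101)
// type codes to Catalyst FloatType/DoubleType via a custom dialect.
object OracleBinaryTypesDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case 100 => Some(FloatType)   // BINARY_FLOAT
      case 101 => Some(DoubleType)  // BINARY_DOUBLE
      case _   => None              // fall back to the default mappings
    }
}

JdbcDialects.registerDialect(OracleBinaryTypesDialect)
{code}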






[jira] [Resolved] (SPARK-22278) Expose current event time watermark and current processing time in GroupState

2017-10-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22278.
---
   Resolution: Fixed
Fix Version/s: (was: 2.2.0)
   3.0.0

Issue resolved by pull request 19495
[https://github.com/apache/spark/pull/19495]

> Expose current event time watermark and current processing time in GroupState
> -
>
> Key: SPARK-22278
> URL: https://issues.apache.org/jira/browse/SPARK-22278
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 3.0.0
>
>
> Complex state-updating and/or timeout-handling logic in mapGroupsWithState 
> functions may require taking decisions based on the current event-time 
> watermark and/or processing time. Currently, you can use the SQL function 
> `current_timestamp` to get the current processing time, but it has to be 
> inserted into every row with a select and then passed through the encoder, 
> which isn't efficient. Furthermore, there is no way to get the current 
> watermark.
> This JIRA is to expose them through the GroupState API.
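
A rough sketch of how the exposed values could be used inside a mapGroupsWithState 
function follows. The accessor names (getCurrentProcessingTimeMs / getCurrentWatermarkMs) 
are what the pull request appears to add, and the state type and logic here are 
illustrative assumptions, not the actual API documentation.

{code}
import org.apache.spark.sql.streaming.GroupState

// Sketch only: read processing time and event-time watermark directly from GroupState.
def updateAcrossEvents(
    key: String,
    eventTimesMs: Iterator[Long],
    state: GroupState[Long]): Long = {
  // Both values are available on GroupState, without piping current_timestamp
  // through every input row and the encoder.
  val processingTimeMs = state.getCurrentProcessingTimeMs()
  val watermarkMs      = state.getCurrentWatermarkMs()  // needs a watermark to be defined

  val maxEventTimeMs = (eventTimesMs ++ state.getOption.iterator).foldLeft(0L)(math.max)
  state.update(maxEventTimeMs)
  // Example decision based on the exposed values: drop state that fell behind the watermark.
  if (maxEventTimeMs < watermarkMs) state.remove()
  processingTimeMs - maxEventTimeMs  // e.g. how stale this group is in processing time
}
{code}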






[jira] [Commented] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208722#comment-16208722
 ] 

Apache Spark commented on SPARK-22302:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19524

> Remove manual backports for subprocess.check_output and check_call
> --
>
> Key: SPARK-22302
> URL: https://issues.apache.org/jira/browse/SPARK-22302
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> This JIRA is loosely related to SPARK-21573. To the best of my knowledge, based 
> on past cases and investigations, Python 2.6 can still end up being used in 
> Jenkins, and it then fails to execute some other scripts.
> In this particular case, it was:
> {code}
> cd dev && python2.6
> {code}
> {code}
> >>> from sparktestsupport import shellutils
> >>> shellutils.subprocess_check_call("ls")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
> retcode = call(*popenargs, **kwargs)
> NameError: global name 'call' is not defined
> {code}
> Please see 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console
> Since we dropped Python 2.6.x support, it looks better to remove those 
> workarounds and print out explicit error messages, so that we do not duplicate 
> the effort of finding the root causes of such cases.






[jira] [Assigned] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22302:


Assignee: (was: Apache Spark)

> Remove manual backports for subprocess.check_output and check_call
> --
>
> Key: SPARK-22302
> URL: https://issues.apache.org/jira/browse/SPARK-22302
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> This JIRA is loosely related to SPARK-21573. To the best of my knowledge, based 
> on past cases and investigations, Python 2.6 can still end up being used in 
> Jenkins, and it then fails to execute some other scripts.
> In this particular case, it was:
> {code}
> cd dev && python2.6
> {code}
> {code}
> >>> from sparktestsupport import shellutils
> >>> shellutils.subprocess_check_call("ls")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
> retcode = call(*popenargs, **kwargs)
> NameError: global name 'call' is not defined
> {code}
> Please see 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console
> Since we dropped Python 2.6.x support, it looks better to remove those 
> workarounds and print out explicit error messages, so that we do not duplicate 
> the effort of finding the root causes of such cases.






[jira] [Assigned] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22302:


Assignee: Apache Spark

> Remove manual backports for subprocess.check_output and check_call
> --
>
> Key: SPARK-22302
> URL: https://issues.apache.org/jira/browse/SPARK-22302
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> This JIRA is loosely related to SPARK-21573. To the best of my knowledge, based 
> on past cases and investigations, Python 2.6 can still end up being used in 
> Jenkins, and it then fails to execute some other scripts.
> In this particular case, it was:
> {code}
> cd dev && python2.6
> {code}
> {code}
> >>> from sparktestsupport import shellutils
> >>> shellutils.subprocess_check_call("ls")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
> retcode = call(*popenargs, **kwargs)
> NameError: global name 'call' is not defined
> {code}
> Please see 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console
> Since we dropped Python 2.6.x support, it looks better to remove those 
> workarounds and print out explicit error messages, so that we do not duplicate 
> the effort of finding the root causes of such cases.






[jira] [Created] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call

2017-10-17 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-22302:


 Summary: Remove manual backports for subprocess.check_output and 
check_call
 Key: SPARK-22302
 URL: https://issues.apache.org/jira/browse/SPARK-22302
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Trivial


This JIRA is loosely related to SPARK-21573. To the best of my knowledge, based on 
past cases and investigations, Python 2.6 can still end up being used in Jenkins, 
and it then fails to execute some other scripts.

In this particular case, it was:

{code}
cd dev && python2.6
{code}

{code}
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_call("ls")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
retcode = call(*popenargs, **kwargs)
NameError: global name 'call' is not defined
{code}

Please see 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console

Since we dropped Python 2.6.x support, it looks better to remove those 
workarounds and print out explicit error messages, so that we do not duplicate 
the effort of finding the root causes of such cases.






[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208708#comment-16208708
 ] 

Peng Meng commented on SPARK-22295:
---

Hi [~cheburakshu], thanks for reporting this bug and for the helpful code.
This is caused by a problem similar to SPARK-22277, but it is not the same thing.
The reason is that when a dataframe is transformed, the field/attribute is not 
correctly set.

Maybe there are some other similar bugs in the code; we can solve them 
separately or together.

[~yanboliang] [~mlnick] [~srowen]

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> ChiSqSelector does not recognize the field 'class', which is present in 
> the data frame, while fitting the model. I am using the PIMA Indians diabetes 
> dataset from UCI. The complete code and output are below for reference. However, 
> when some rows of the input file are turned into a dataframe manually, it 
> works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol='class').fit(df)
> except:
> print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This will work, but not the frame picked up from the file.
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
> "class"])
> df.show(1)
> assembler = 

[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-17 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208685#comment-16208685
 ] 

Hyukjin Kwon commented on SPARK-17902:
--

Hi [~falaki] and [~shivaram], I was thinking of just a simple way, such as:

{quote}
if (stringsAsFactors) {
  df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], 
as.factor)
}
{quote}

Would it make sense?

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> The `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems to be completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}






[jira] [Comment Edited] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-17 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208685#comment-16208685
 ] 

Hyukjin Kwon edited comment on SPARK-17902 at 10/18/17 1:29 AM:


Hi [~falaki] and [~shivaram], I was thinking of just a simple way, such as:

{code}
if (stringsAsFactors) {
  df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], 
as.factor)
}
{code}

Would it make sense?


was (Author: hyukjin.kwon):
Hi [~falaki] and [~shivaram], I was thinking of just a simple way, such as:

{quote}
if (stringsAsFactors) {
  df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], 
as.factor)
}
{quote}

Would it make sense?

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> The `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems to be completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}






[jira] [Commented] (SPARK-22204) Explain output for SQL with commands shows no optimization

2017-10-17 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208676#comment-16208676
 ] 

Andrew Ash commented on SPARK-22204:


One way to work around this issue could be by getting the child of the command 
node and running explain on that.  This does do the query planning twice though.

See also discussion at 
https://github.com/apache/spark/pull/19269#discussion_r139841435
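
As a simpler user-level approximation of that idea, one can also explain the inner 
SELECT on its own before issuing the command; the table and query below are just the 
ones from the description, and this likewise plans the query twice.

{code}
// Sketch: explain the inner SELECT separately, then run the command.
val select = "SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d"
spark.sql(select).explain(true)                               // shows the optimized plan
spark.sql(s"CREATE TABLE IF NOT EXISTS tmptable AS $select")  // then execute the command
{code}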

> Explain output for SQL with commands shows no optimization
> --
>
> Key: SPARK-22204
> URL: https://issues.apache.org/jira/browse/SPARK-22204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>
> When displaying the explain output for a basic SELECT query, the query plan 
> changes as expected from the analyzed to the optimized stage. But when putting 
> that same query into a command, for example {{CREATE TABLE}}, it appears that 
> the optimization doesn't take place.
> In Spark shell:
> Explain output for a {{SELECT}} statement shows optimization:
> {noformat}
> scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) 
> AS b) AS c) AS d").explain(true)
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'SubqueryAlias d
>+- 'Project ['a]
>   +- 'SubqueryAlias c
>  +- 'Project ['a]
> +- SubqueryAlias b
>+- Project [1 AS a#29]
>   +- OneRowRelation
> == Analyzed Logical Plan ==
> a: int
> Project [a#29]
> +- SubqueryAlias d
>+- Project [a#29]
>   +- SubqueryAlias c
>  +- Project [a#29]
> +- SubqueryAlias b
>+- Project [1 AS a#29]
>   +- OneRowRelation
> == Optimized Logical Plan ==
> Project [1 AS a#29]
> +- OneRowRelation
> == Physical Plan ==
> *Project [1 AS a#29]
> +- Scan OneRowRelation[]
> scala> 
> {noformat}
> But the same command run inside {{CREATE TABLE}} does not:
> {noformat}
> scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM 
> (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true)
> == Parsed Logical Plan ==
> 'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> Ignore
> +- 'Project ['a]
>+- 'SubqueryAlias d
>   +- 'Project ['a]
>  +- 'SubqueryAlias c
> +- 'Project ['a]
>+- SubqueryAlias b
>   +- Project [1 AS a#33]
>  +- OneRowRelation
> == Analyzed Logical Plan ==
> CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
> InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> == Optimized Logical Plan ==
> CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, 
> InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> == Physical Plan ==
> CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand 
> [Database:default}, TableName: tmptable, InsertIntoHiveTable]
>+- Project [a#33]
>   +- SubqueryAlias d
>  +- Project [a#33]
> +- SubqueryAlias c
>+- Project [a#33]
>   +- SubqueryAlias b
>  +- Project [1 AS a#33]
> +- OneRowRelation
> scala>
> {noformat}
> Note that there is no change between the analyzed and optimized plans when 
> run in a command.
> This is misleading my users, as they think that there is no optimization 
> happening in the query!






[jira] [Commented] (SPARK-22301) Add rule to Optimizer for In with empty list of values

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208640#comment-16208640
 ] 

Apache Spark commented on SPARK-22301:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19523

> Add rule to Optimizer for In with empty list of values
> --
>
> Key: SPARK-22301
> URL: https://issues.apache.org/jira/browse/SPARK-22301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> For performance reasons, we should resolve an In operation on an empty list 
> of values to false during the optimization phase.
> For further reference, please look at the discussion on PRs: 
> https://github.com/apache/spark/pull/19522 and 
> https://github.com/apache/spark/pull/19494.






[jira] [Assigned] (SPARK-22301) Add rule to Optimizer for In with empty list of values

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22301:


Assignee: (was: Apache Spark)

> Add rule to Optimizer for In with empty list of values
> --
>
> Key: SPARK-22301
> URL: https://issues.apache.org/jira/browse/SPARK-22301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> For performance reasons, we should resolve an In operation on an empty list 
> of values to false during the optimization phase.
> For further reference, please look at the discussion on PRs: 
> https://github.com/apache/spark/pull/19522 and 
> https://github.com/apache/spark/pull/19494.






[jira] [Assigned] (SPARK-22301) Add rule to Optimizer for In with empty list of values

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22301:


Assignee: Apache Spark

> Add rule to Optimizer for In with empty list of values
> --
>
> Key: SPARK-22301
> URL: https://issues.apache.org/jira/browse/SPARK-22301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>
> For performance reasons, we should resolve an In operation on an empty list 
> of values to false during the optimization phase.
> For further reference, please look at the discussion on PRs: 
> https://github.com/apache/spark/pull/19522 and 
> https://github.com/apache/spark/pull/19494.






[jira] [Created] (SPARK-22301) Add rule to Optimizer for In with empty list of values

2017-10-17 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22301:
---

 Summary: Add rule to Optimizer for In with empty list of values
 Key: SPARK-22301
 URL: https://issues.apache.org/jira/browse/SPARK-22301
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


For performance reasons, we should resolve an In operation on an empty list of 
values to false during the optimization phase.
For further reference, please look at the discussion on PRs: 
https://github.com/apache/spark/pull/19522 and 
https://github.com/apache/spark/pull/19494.
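
For illustration, a minimal sketch of what such an optimizer rule could look like 
(this is not the actual patch in the PRs above, and it ignores the null-handling 
subtleties discussed there):

{code}
import org.apache.spark.sql.catalyst.expressions.{In, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Rewrite `expr IN ()` to a false literal during optimization so the predicate
// can be pruned instead of being evaluated per row.
object OptimizeEmptyIn extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case In(_, Seq()) => Literal(false)
  }
}
{code}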






[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208614#comment-16208614
 ] 

yuhao yang commented on SPARK-22289:


Thanks for the reply. I'll start composing a PR.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}






[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208607#comment-16208607
 ] 

Joseph K. Bradley commented on SPARK-13030:
---

Does multi-column support need to be put in this same task?  Isn't that an 
orthogonal issue?  Or is the proposal to use this chance to break the API to 
add simpler multi-column support?

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.






[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208601#comment-16208601
 ] 

Yanbo Liang commented on SPARK-22289:
-

+1 for option 2. Please feel free to send a PR. Thanks.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}






[jira] [Comment Edited] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208585#comment-16208585
 ] 

Liang-Chi Hsieh edited comment on SPARK-22283 at 10/17/17 11:43 PM:


[~kitbellew] {{withColumn}} adds a column or replaces the existing columns that 
have the same name. In the ambiguous-columns case, it sounds reasonable that it 
replaces all the columns with that name.

We should let an API do one thing. {{withColumn}} adds/replaces columns with 
the same name. It sounds weird to implicitly drop one column when there are 
ambiguous columns.

For the use case {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), 
b("c"))).select(..., "c")}}, if you just want to get one of the ambiguous 
columns, a simple workaround is to select the column directly, like 
{{a.join(b, ..., "left").select(..., a("c"))}}.



was (Author: viirya):
[~kitbellew] I didn't mean you're doing a select. I meant you can't select the 
ambiguous columns by name, so isn't it reasonable that you also can't 
withColumn the ambiguous columns by name? They follow the same behavior.

For the use case {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), 
b("c"))).select(..., "c")}}, if you just want to get one of the ambiguous 
columns, a simple workaround is to select the column directly, like 
{{a.join(b, ..., "left").select(..., a("c"))}}.


> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}






[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208585#comment-16208585
 ] 

Liang-Chi Hsieh commented on SPARK-22283:
-

[~kitbellew] I didn't mean you're doing a select. I meant you can't select the 
ambiguous columns by name, so isn't it reasonable that you also can't 
withColumn the ambiguous columns by name? They follow the same behavior.

For the use case {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), 
b("c"))).select(..., "c")}}, if you just want to get one of the ambiguous 
columns, a simple workaround is to select the column directly, like 
{{a.join(b, ..., "left").select(..., a("c"))}}.
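
A small sketch of the ambiguity and of that workaround, runnable in the shell (the 
frames, column names, and the `spark` session in scope are illustrative assumptions):

{code}
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

// Two frames that both carry a column named "c".
val a = Seq((1, "from-a"), (2, "only-a")).toDF("id", "c")
val b = Seq((1, "from-b")).toDF("id", "c")

val joined = a.join(b, Seq("id"), "left")
// joined now has two columns named "c"; withColumn("c", ...) would replace both.

// Workaround: refer to each parent's column explicitly and select the result once.
val result = joined.select($"id", coalesce(a("c"), b("c")).as("c"))
{code}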


> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}






[jira] [Commented] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208462#comment-16208462
 ] 

Apache Spark commented on SPARK-22249:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19522

> UnsupportedOperationException: empty.reduceLeft when caching a dataframe
> 
>
> Key: SPARK-22249
> URL: https://issues.apache.org/jira/browse/SPARK-22249
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: $ uname -a
> Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 
> 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
> $ pyspark --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>       /_/
> 
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92
> Branch 
> Compiled by user jenkins on 2017-06-30T22:58:04Z
> Revision 
> Url 
>Reporter: Andreas Maier
>Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
>
> It seems that the {{isin()}} method with an empty list as argument only 
> works if the dataframe is not cached. If it is cached, it results in an 
> exception. To reproduce:
> {code:java}
> $ pyspark
> >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
> >>> df.where(df["KEY"].isin([])).show()
> +---+
> |KEY|
> +---+
> +---+
> >>> df.cache()
> DataFrame[KEY: string]
> >>> df.where(df["KEY"].isin([])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py",
>  line 336, in show
> print(self._jdf.showString(n, 20))
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
> : java.lang.UnsupportedOperationException: empty.reduceLeft
>   at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
>   at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
>   at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
>   at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.<init>(InMemoryTableScanExec.scala:111)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 
> 

[jira] [Resolved] (SPARK-22050) Allow BlockUpdated events to be optionally logged to the event log

2017-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22050.

   Resolution: Fixed
 Assignee: Michael Mior
Fix Version/s: 2.3.0

> Allow BlockUpdated events to be optionally logged to the event log
> --
>
> Key: SPARK-22050
> URL: https://issues.apache.org/jira/browse/SPARK-22050
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Minor
> Fix For: 2.3.0
>
>
> I see that block updates are not logged to the event log.
> This makes sense as a default for performance reasons.
> However, I find it helpful when trying to get a better understanding of 
> caching for a job to be able to log these updates.
> This PR adds a configuration setting {{spark.eventLog.blockUpdates}} 
> (defaulting to false) which allows block updates to be recorded in the log.
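
For illustration, enabling the proposed setting could look like the sketch below; the 
property name is the one given in the description, and the event log directory is an 
assumption.

{code}
import org.apache.spark.sql.SparkSession

// Sketch: turn on event logging and the proposed block-update logging flag.
val spark = SparkSession.builder()
  .appName("block-update-logging-demo")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "/tmp/spark-events")   // assumed event log directory
  .config("spark.eventLog.blockUpdates", "true")       // setting proposed in this issue
  .getOrCreate()
{code}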






[jira] [Updated] (SPARK-22298) SparkUI executor URL encode appID

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22298:
--
Issue Type: Improvement  (was: Bug)

> SparkUI executor URL encode appID
> -
>
> Key: SPARK-22298
> URL: https://issues.apache.org/jira/browse/SPARK-22298
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Alexander Naspo
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The Spark Executors page will return a blank list when the application id 
> contains a forward slash. You can see the /allexecutors API failing with a 404. 
> This can be fixed trivially by URL-encoding the appId before making the call to 
> `/api/v1/applications//allexecutors` in executorspage.js.






[jira] [Resolved] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

2017-10-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22271.
-
   Resolution: Fixed
 Assignee: Huaxin Gao
Fix Version/s: 2.3.0
   2.2.1

> Describe results in "null" for the value of "mean" of a numeric variable
> 
>
> Key: SPARK-22271
> URL: https://issues.apache.org/jira/browse/SPARK-22271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: 
>Reporter: Shafique Jamal
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
> Attachments: decimalNumbers.zip
>
>
> Please excuse me if this issue was addressed already - I was unable to find 
> it.
> Calling .describe().show() on my dataframe results in a value of null for the 
> row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +-------+--------------------+
> |summary|     numericvariable|
> +-------+--------------------+
> |  count|                 299|
> |   mean|                null|
> | stddev|  0.2376438793946738|
> |    min|0.037815489727642...|
> |    max|2.138189366554511...|
> +-------+--------------------+
> {noformat}
> But all of the rows for this seem OK (I can attach a parquet file). When I 
> round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +-------+---------------------------+
> |summary|bround(numericvariable, 31)|
> +-------+---------------------------+
> |  count|                        299|
> |   mean|       0.139522503183236...|
> | stddev|         0.2376438793946738|
> |    min|       0.037815489727642...|
> |    max|       2.138189366554511...|
> +-------+---------------------------+
> {noformat}
> Rounding with 32 digits also gives null, though.






[jira] [Updated] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs.

2017-10-17 Thread Randy Tidd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randy Tidd updated SPARK-22296:
---
Summary: CodeGenerator - failed to compile when constructor has 
scala.collection.mutable.Seq vs.   (was: CodeGenerator - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java')

> CodeGenerator - failed to compile when constructor has 
> scala.collection.mutable.Seq vs. 
> 
>
> Key: SPARK-22296
> URL: https://issues.apache.org/jira/browse/SPARK-22296
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Randy Tidd
>
> This is with Scala 2.11.
> We have a case class that has a constructor with 85 args, the last two of 
> which are:
>  var chargesInst : 
> scala.collection.mutable.Seq[ChargeInstitutional] = 
> scala.collection.mutable.Seq.empty[ChargeInstitutional],
>  var chargesProf : 
> scala.collection.mutable.Seq[ChargeProfessional] = 
> scala.collection.mutable.Seq.empty[ChargeProfessional]
> A mutable Seq in the constructor of a case class is probably poor form, but 
> Scala allows it. When we run this job we get this error:
> build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
> worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 8217, Column 70: No applicable constructor/method found for actual parameters 
> "java.lang.String, java.lang.String, long, java.lang.String, long, long, 
> long, java.lang.String, long, long, double, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, long, long, long, long, 
> scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, java.lang.String, int, double, 
> double, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
> com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
> java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
> scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, scala.collection.Seq"; candidates are: 
> "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
> java.lang.String, long, long, long, java.lang.String, long, long, double, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, long, scala.Option, scala.Option, scala.Option, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, int, double, double, java.lang.String, java.lang.String, 
> java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, long, long, long, long, java.lang.String, 
> com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, 
> scala.collection.Seq, scala.collection.Seq, java.lang.String, long, 
> java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, boolean, scala.collection.mutable.Seq, 
> scala.collection.mutable.Seq)"
> The relevant lines are:
> build   17-Oct-2017 05:30:50/* 093 */   private scala.collection.Seq 
> argValue84;
> build   17-Oct-2017 05:30:50/* 094 */   private scala.collection.Seq 
> argValue85;
> and
> build   17-Oct-2017 05:30:54/* 8217 */ final 
> com.xyz.xyz.xyz.domain.Account value1 = false ? null : new 
> com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, 
> argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, 
> argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, 
> argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, 
> argValue24, argValue25, argValue26, argValue27, argValue28, 

[jira] [Assigned] (SPARK-22300) Update ORC to 1.4.1

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22300:


Assignee: (was: Apache Spark)

> Update ORC to 1.4.1
> ---
>
> Key: SPARK-22300
> URL: https://issues.apache.org/jira/browse/SPARK-22300
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> Apache ORC 1.4.1 was released yesterday.
> - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/
> It contains several important fixes, such as ORC-233 (allow `orc.include.columns` 
> to be empty).
> We had better upgrade to this latest release.






[jira] [Assigned] (SPARK-22300) Update ORC to 1.4.1

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22300:


Assignee: Apache Spark

> Update ORC to 1.4.1
> ---
>
> Key: SPARK-22300
> URL: https://issues.apache.org/jira/browse/SPARK-22300
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> Apache ORC 1.4.1 was released yesterday.
> - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/
> It contains several important fixes, such as ORC-233 (allow `orc.include.columns` 
> to be empty).
> We had better upgrade to this latest release.






[jira] [Commented] (SPARK-22300) Update ORC to 1.4.1

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208180#comment-16208180
 ] 

Apache Spark commented on SPARK-22300:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19521

> Update ORC to 1.4.1
> ---
>
> Key: SPARK-22300
> URL: https://issues.apache.org/jira/browse/SPARK-22300
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> Apache ORC 1.4.1 was released yesterday.
> - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/
> Like ORC-233 (Allow `orc.include.columns` to be empty), there are several 
> important fixes.
> We had better use the latest one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22296) CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-10-17 Thread Randy Tidd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randy Tidd updated SPARK-22296:
---
Description: 
This is with Scala 2.11.

We have a case class that has a constructor with 85 args, the last two of which 
are:

 var chargesInst : 
scala.collection.mutable.Seq[ChargeInstitutional] = 
scala.collection.mutable.Seq.empty[ChargeInstitutional],
 var chargesProf : 
scala.collection.mutable.Seq[ChargeProfessional] = 
scala.collection.mutable.Seq.empty[ChargeProfessional]

A mutable Seq in the constructor of a case class is probably poor form, but 
Scala allows it.  When we run this job we get this error:

build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
8217, Column 70: No applicable constructor/method found for actual parameters 
"java.lang.String, java.lang.String, long, java.lang.String, long, long, long, 
java.lang.String, long, long, double, scala.Option, scala.Option, 
java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, long, long, long, long, long, scala.Option, 
scala.Option, scala.Option, scala.Option, scala.Option, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, long, java.lang.String, int, double, double, 
java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, long, long, long, long, 
java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
scala.collection.Seq, scala.collection.Seq"; candidates are: 
"com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
java.lang.String, long, long, long, java.lang.String, long, long, double, 
scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
long, long, long, long, scala.Option, scala.Option, scala.Option, scala.Option, 
scala.Option, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, 
int, double, double, java.lang.String, java.lang.String, java.lang.String, 
long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
long, long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
scala.collection.mutable.Seq, scala.collection.mutable.Seq)"

The relevant lines are:

build   17-Oct-2017 05:30:50/* 093 */   private scala.collection.Seq 
argValue84;
build   17-Oct-2017 05:30:50/* 094 */   private scala.collection.Seq 
argValue85;

and

build   17-Oct-2017 05:30:54/* 8217 */ final 
com.xyz.xyz.xyz.domain.Account value1 = false ? null : new 
com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, 
argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, argValue12, 
argValue13, argValue14, argValue15, argValue16, argValue17, argValue18, 
argValue19, argValue20, argValue21, argValue22, argValue23, argValue24, 
argValue25, argValue26, argValue27, argValue28, argValue29, argValue30, 
argValue31, argValue32, argValue33, argValue34, argValue35, argValue36, 
argValue37, argValue38, argValue39, argValue40, argValue41, argValue42, 
argValue43, argValue44, argValue45, argValue46, argValue47, argValue48, 
argValue49, argValue50, argValue51, argValue52, argValue53, argValue54, 
argValue55, argValue56, argValue57, argValue58, argValue59, argValue60, 
argValue61, argValue62, argValue63, argValue64, argValue65, argValue66, 
argValue67, argValue68, argValue69, argValue70, argValue71, argValue72, 
argValue73, argValue74, argValue75, argValue76, argValue77, argValue78, 
argValue79, argValue80, argValue81, argValue82, argValue83, argValue84, 
argValue85);

In short, Spark uses scala.collection.Seq in the generated code, which does not match the scala.collection.mutable.Seq parameters declared in the case class constructor, so no applicable constructor is found.
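
To make the mismatch concrete, here is a minimal, hedged Scala sketch of the same shape. The names (Account, Charge) are illustrative stand-ins for the reporter's 85-field class, and because the failure is intermittent the snippet is not guaranteed to reproduce it:

{code:scala}
// Minimal sketch of the reported shape. Account/Charge are illustrative names,
// not the reporter's real 85-field class.
import scala.collection.mutable

import org.apache.spark.sql.SparkSession

case class Charge(amount: Double)

case class Account(
    id: String,
    // the last two constructor parameters use mutable.Seq, as in the report
    chargesInst: mutable.Seq[Charge] = mutable.Seq.empty[Charge],
    chargesProf: mutable.Seq[Charge] = mutable.Seq.empty[Charge])

object MutableSeqRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mutable-seq-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Account("a1", mutable.Seq(Charge(1.0)), mutable.Seq(Charge(2.0)))).toDS()

    // Forcing deserialization back to Account exercises the generated constructor
    // call, where the report says the last two arguments are declared as
    // scala.collection.Seq rather than scala.collection.mutable.Seq.
    ds.map(a => a.copy(id = a.id + "-x")).show()

    spark.stop()
  }
}
{code}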

[jira] [Updated] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq

2017-10-17 Thread Randy Tidd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randy Tidd updated SPARK-22296:
---
Summary: CodeGenerator - failed to compile when constructor has 
scala.collection.mutable.Seq vs. scala.collection.Seq  (was: CodeGenerator - 
failed to compile when constructor has scala.collection.mutable.Seq vs. )

> CodeGenerator - failed to compile when constructor has 
> scala.collection.mutable.Seq vs. scala.collection.Seq
> 
>
> Key: SPARK-22296
> URL: https://issues.apache.org/jira/browse/SPARK-22296
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Randy Tidd
>
> This is with Scala 2.11.
> We have a case class that has a constructor with 85 args, the last two of 
> which are:
>  var chargesInst : 
> scala.collection.mutable.Seq[ChargeInstitutional] = 
> scala.collection.mutable.Seq.empty[ChargeInstitutional],
>  var chargesProf : 
> scala.collection.mutable.Seq[ChargeProfessional] = 
> scala.collection.mutable.Seq.empty[ChargeProfessional]
> A mutable Seq in the constructor of a case class is probably poor form, but 
> Scala allows it.  When we run this job we get this error:
> build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
> worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 8217, Column 70: No applicable constructor/method found for actual parameters 
> "java.lang.String, java.lang.String, long, java.lang.String, long, long, 
> long, java.lang.String, long, long, double, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, long, long, long, long, 
> scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, java.lang.String, int, double, 
> double, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
> com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
> java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
> scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, scala.collection.Seq"; candidates are: 
> "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
> java.lang.String, long, long, long, java.lang.String, long, long, double, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, long, scala.Option, scala.Option, scala.Option, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, int, double, double, java.lang.String, java.lang.String, 
> java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, long, long, long, long, java.lang.String, 
> com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, 
> scala.collection.Seq, scala.collection.Seq, java.lang.String, long, 
> java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, boolean, scala.collection.mutable.Seq, 
> scala.collection.mutable.Seq)"
> The relevant lines are:
> build   17-Oct-2017 05:30:50/* 093 */   private scala.collection.Seq 
> argValue84;
> build   17-Oct-2017 05:30:50/* 094 */   private scala.collection.Seq 
> argValue85;
> and
> build   17-Oct-2017 05:30:54/* 8217 */ final 
> com.xyz.xyz.xyz.domain.Account value1 = false ? null : new 
> com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, 
> argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, 
> argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, 
> argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, 
> argValue24, 

[jira] [Created] (SPARK-22300) Update ORC to 1.4.1

2017-10-17 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-22300:
-

 Summary: Update ORC to 1.4.1
 Key: SPARK-22300
 URL: https://issues.apache.org/jira/browse/SPARK-22300
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


Apache ORC 1.4.1 was released yesterday.
- https://orc.apache.org/news/2017/10/16/ORC-1.4.1/

Like ORC-233 (Allow `orc.include.columns` to be empty), there are several 
important fixes.

We had better use the latest one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21840) Allow multiple SparkSubmit invocations in same JVM without polluting system properties

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21840:


Assignee: (was: Apache Spark)

> Allow multiple SparkSubmit invocations in same JVM without polluting system 
> properties
> --
>
> Key: SPARK-21840
> URL: https://issues.apache.org/jira/browse/SPARK-21840
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Filing this as a sub-task of SPARK-11035; this feature was discussed as part 
> of the PR currently attached to that bug.
> Basically, to allow the launcher library to run applications in-process, the 
> easiest way is for it to run the {{SparkSubmit}} class. But that class 
> currently propagates configuration to applications by modifying system 
> properties.
> That means that when launching multiple applications in that manner in the 
> same JVM, the configuration of the first application may leak into the second 
> application (or to any other invocation of `new SparkConf()` for that matter).
> This feature is about breaking out the fix for this particular issue from the 
> PR linked to SPARK-11035. With the changes in SPARK-21728, the implementation 
> can even be further enhanced by providing an actual {{SparkConf}} instance to 
> the application, instead of opaque maps.
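
An illustrative Scala sketch of the leak mechanism described above (this is not Spark's actual SparkSubmit code; the property value and names are made up): any configuration written to JVM-wide system properties is silently inherited by a later `new SparkConf()` in the same JVM.

{code:scala}
// Illustrative only; shows why pushing config into system properties leaks
// between applications launched in the same JVM.
import org.apache.spark.SparkConf

object ConfLeakSketch {
  def main(args: Array[String]): Unit = {
    // What launching a "first" application in-process effectively does today.
    System.setProperty("spark.app.name", "first-app")

    // A second application (or any other code) creating a SparkConf afterwards
    // inherits the setting, because SparkConf(loadDefaults = true) reads spark.* properties.
    val conf = new SparkConf()
    println(conf.get("spark.app.name")) // prints "first-app"
  }
}
{code}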



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22298) SparkUI executor URL encode appID

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208172#comment-16208172
 ] 

Apache Spark commented on SPARK-22298:
--

User 'alexnaspo' has created a pull request for this issue:
https://github.com/apache/spark/pull/19520

> SparkUI executor URL encode appID
> -
>
> Key: SPARK-22298
> URL: https://issues.apache.org/jira/browse/SPARK-22298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Alexander Naspo
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark Executor Page will return a blank list when the application id contains 
> a forward slash. You can see the /allexecutors api failing with a 404. This 
> can be fixed trivially by url encoding the appId before making the call to 
> `/api/v1/applications//allexecutors` in executorspage.js.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21840) Allow multiple SparkSubmit invocations in same JVM without polluting system properties

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208171#comment-16208171
 ] 

Apache Spark commented on SPARK-21840:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19519

> Allow multiple SparkSubmit invocations in same JVM without polluting system 
> properties
> --
>
> Key: SPARK-21840
> URL: https://issues.apache.org/jira/browse/SPARK-21840
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Filing this as a sub-task of SPARK-11035; this feature was discussed as part 
> of the PR currently attached to that bug.
> Basically, to allow the launcher library to run applications in-process, the 
> easiest way is for it to run the {{SparkSubmit}} class. But that class 
> currently propagates configuration to applications by modifying system 
> properties.
> That means that when launching multiple applications in that manner in the 
> same JVM, the configuration of the first application may leak into the second 
> application (or to any other invocation of `new SparkConf()` for that matter).
> This feature is about breaking out the fix for this particular issue from the 
> PR linked to SPARK-11035. With the changes in SPARK-21728, the implementation 
> can even be further enhanced by providing an actual {{SparkConf}} instance to 
> the application, instead of opaque maps.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21840) Allow multiple SparkSubmit invocations in same JVM without polluting system properties

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21840:


Assignee: Apache Spark

> Allow multiple SparkSubmit invocations in same JVM without polluting system 
> properties
> --
>
> Key: SPARK-21840
> URL: https://issues.apache.org/jira/browse/SPARK-21840
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Filing this as a sub-task of SPARK-11035; this feature was discussed as part 
> of the PR currently attached to that bug.
> Basically, to allow the launcher library to run applications in-process, the 
> easiest way is for it to run the {{SparkSubmit}} class. But that class 
> currently propagates configuration to applications by modifying system 
> properties.
> That means that when launching multiple applications in that manner in the 
> same JVM, the configuration of the first application may leak into the second 
> application (or to any other invocation of `new SparkConf()` for that matter).
> This feature is about breaking out the fix for this particular issue from the 
> PR linked to SPARK-11035. With the changes in SPARK-21728, the implementation 
> can even be further enhanced by providing an actual {{SparkConf}} instance to 
> the application, instead of opaque maps.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22298) SparkUI executor URL encode appID

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22298:


Assignee: Apache Spark

> SparkUI executor URL encode appID
> -
>
> Key: SPARK-22298
> URL: https://issues.apache.org/jira/browse/SPARK-22298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Alexander Naspo
>Assignee: Apache Spark
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark Executor Page will return a blank list when the application id contains 
> a forward slash. You can see the /allexecutors api failing with a 404. This 
> can be fixed trivially by url encoding the appId before making the call to 
> `/api/v1/applications//allexecutors` in executorspage.js.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22298) SparkUI executor URL encode appID

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22298:


Assignee: (was: Apache Spark)

> SparkUI executor URL encode appID
> -
>
> Key: SPARK-22298
> URL: https://issues.apache.org/jira/browse/SPARK-22298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Alexander Naspo
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark Executor Page will return a blank list when the application id contains 
> a forward slash. You can see the /allexecutors api failing with a 404. This 
> can be fixed trivially by url encoding the appId before making the call to 
> `/api/v1/applications//allexecutors` in executorspage.js.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22299) Use OFFSET and LIMIT for JDBC DataFrameReader striping

2017-10-17 Thread Zack Behringer (JIRA)
Zack Behringer created SPARK-22299:
--

 Summary: Use OFFSET and LIMIT for JDBC DataFrameReader striping
 Key: SPARK-22299
 URL: https://issues.apache.org/jira/browse/SPARK-22299
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 1.4.0
Reporter: Zack Behringer
Priority: Minor


Loading a large table (300M rows) from JDBC can be partitioned into tasks using 
the column, numPartitions, lowerBound and upperBound parameters on 
DataFrameReader.jdbc(), but that becomes troublesome if the column is 
skewed/fragmented (as in somebody used a global sequence for the partition 
column instead of a sequence specific to the table, or if the table becomes 
fragmented by deletes, etc.).
This can be worked around by using a modulus operation on the column, but that 
will be slow unless there is already an index on the modulus expression with the 
exact numPartitions value, so that doesn't scale well if you want to change the 
number of partitions. Another way would be to use an expression index 
on a hash of the partition column, but I'm not sure if JDBC striping is smart 
enough to create hash ranges for each stripe using hashes of the lower and 
upper bound parameters. If it is, that is great, but still that requires a very 
large index just for this use case.

A less invasive approach would be to use the table's physical ordering along 
with OFFSET and LIMIT so that only the total number of records to read would 
need to be known beforehand in order to evenly distribute, no indexes needed. I 
realize that OFFSET and LIMIT are not standard SQL keywords.

I also see that a list of custom predicates can be defined. I haven't tried 
that to see if I can embed numPartitions specific predicates each with their 
own OFFSET and LIMIT range.

Some relational databases take quite a long time to count the number of records 
in order to determine the stripe size, though, so this can also be troublesome. 
Could a feature similar to "spark.sql.files.maxRecordsPerFile" be used in 
conjunction with the number of executors to read manageable batches (internally 
using OFFSET and LIMIT) until there are no more available results?
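
A hedged Scala sketch of the modulus workaround mentioned above, wired through the existing custom-predicates overload of DataFrameReader.jdbc; the connection URL, table, column, and credential values are placeholders, and, as noted, performance still depends on an index that can serve the modulus predicate:

{code:scala}
// Sketch of the modulus workaround via the predicates overload of DataFrameReader.jdbc.
// URL, table, column and credentials are placeholders.
import java.util.Properties

import org.apache.spark.sql.SparkSession

object JdbcModuloStriping {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-modulo-striping").getOrCreate()

    val numPartitions = 16
    // One predicate per partition; each becomes the WHERE clause of its own JDBC query.
    val predicates = (0 until numPartitions)
      .map(i => s"MOD(id, $numPartitions) = $i")
      .toArray

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")

    val df = spark.read.jdbc("jdbc:postgresql://dbhost:5432/db", "big_table", predicates, props)
    println(df.rdd.getNumPartitions) // one partition per predicate

    spark.stop()
  }
}
{code}

The same predicates mechanism is what the description speculates could carry per-partition OFFSET and LIMIT ranges instead.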



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22298) SparkUI executor URL encode appID

2017-10-17 Thread Alexander Naspo (JIRA)
Alexander Naspo created SPARK-22298:
---

 Summary: SparkUI executor URL encode appID
 Key: SPARK-22298
 URL: https://issues.apache.org/jira/browse/SPARK-22298
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.2.0, 2.1.0
Reporter: Alexander Naspo
Priority: Trivial


Spark Executor Page will return a blank list when the application id contains a 
forward slash. You can see the /allexecutors api failing with a 404. This can 
be fixed trivially by url encoding the appId before making the call to 
`/api/v1/applications//allexecutors` in executorspage.js.
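
For illustration only (the actual fix lives in executorspage.js, which is JavaScript): a small Scala sketch, with a made-up appId, of why an unencoded slash breaks the REST path and what the percent-encoded form looks like.

{code:scala}
// Illustration only; the appId value is hypothetical.
import java.net.URLEncoder

object AppIdEncoding {
  def main(args: Array[String]): Unit = {
    val appId = "team/my-app-20171017" // hypothetical application id containing a slash

    // Unencoded, the slash adds an extra path segment, so the API call 404s:
    println(s"/api/v1/applications/$appId/allexecutors")

    // Percent-encoded, the id stays a single path segment ("/" becomes "%2F"):
    val encoded = URLEncoder.encode(appId, "UTF-8")
    println(s"/api/v1/applications/$encoded/allexecutors")
  }
}
{code}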





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"

2017-10-17 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22297:
--

 Summary: Flaky test: BlockManagerSuite "Shuffle registration 
timeout and maxAttempts conf"
 Key: SPARK-22297
 URL: https://issues.apache.org/jira/browse/SPARK-22297
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Minor


Ran into this locally; the test code seems to rely on timeouts, which generally 
leads to flakiness like this.

{noformat}
[info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are 
working *** FAILED *** (1 second, 203 milliseconds)
[info]   "Unable to register with external shuffle server due to : 
java.util.concurrent.TimeoutException: Timeout waiting for task." did not 
contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info]   at 
org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370)
[info]   at 
org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
[info]   at 
org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22296) CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-10-17 Thread Randy Tidd (JIRA)
Randy Tidd created SPARK-22296:
--

 Summary: CodeGenerator - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java'
 Key: SPARK-22296
 URL: https://issues.apache.org/jira/browse/SPARK-22296
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Randy Tidd


We intermittently get this error when running Spark jobs; I would estimate it happens 
in one of every 50-100 runs. We can't always capture a full log file, our code is 
complex (thousands of lines), and the log file is 533 MB, so I can't post it all. 
When it occurs we just run the exact same job again and it runs fine, so we do not 
have a repeatable case. I believe it only happens in client mode, and not cluster mode.

Sorry, I know this isn't a lot to go on, but I wanted to report it since I see 
other similar bugs like SPARK-19984 and SPARK-17936. If there is any other way 
to collect more info or diagnostics, please let me know.

build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
8217, Column 70: No applicable constructor/method found for actual parameters 
"java.lang.String, java.lang.String, long, java.lang.String, long, long, long, 
java.lang.String, long, long, double, scala.Option, scala.Option, 
java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, long, long, long, long, long, scala.Option, 
scala.Option, scala.Option, scala.Option, scala.Option, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, long, java.lang.String, int, double, double, 
java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, long, long, long, long, 
java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
scala.collection.Seq, scala.collection.Seq"; candidates are: 
"com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
java.lang.String, long, long, long, java.lang.String, long, long, double, 
scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
long, long, long, long, scala.Option, scala.Option, scala.Option, scala.Option, 
scala.Option, java.lang.String, java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, 
int, double, double, java.lang.String, java.lang.String, java.lang.String, 
long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
long, long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
scala.collection.mutable.Seq, scala.collection.mutable.Seq)"

The relevant line is:

build   17-Oct-2017 05:30:54/* 8217 */ final 
com.xyz.xyz.xyz.domain.Account value1 = false ? null : new 
com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, 
argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, argValue12, 
argValue13, argValue14, argValue15, argValue16, argValue17, argValue18, 
argValue19, argValue20, argValue21, argValue22, argValue23, argValue24, 
argValue25, argValue26, argValue27, argValue28, argValue29, argValue30, 
argValue31, argValue32, argValue33, argValue34, argValue35, argValue36, 
argValue37, argValue38, argValue39, argValue40, argValue41, argValue42, 
argValue43, argValue44, argValue45, argValue46, argValue47, argValue48, 
argValue49, argValue50, argValue51, argValue52, argValue53, argValue54, 
argValue55, argValue56, argValue57, argValue58, argValue59, argValue60, 
argValue61, argValue62, argValue63, argValue64, argValue65, argValue66, 
argValue67, argValue68, argValue69, argValue70, argValue71, argValue72, 
argValue73, argValue74, argValue75, argValue76, argValue77, argValue78, 

[jira] [Resolved] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Cheburakshu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheburakshu resolved SPARK-22295.
-
Resolution: Invalid

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector does not recognize the field 'class', which is present in 
> the data frame, while fitting the model. I am using the PIMA Indians diabetes 
> dataset from UCI. The complete code and output are below for reference. However, 
> when some rows of the input file are turned into a dataframe manually, it works. 
> I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                       outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> ++-+-+-+-+-+-++--+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> ++-+-+-+-+-+-++--+
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
> ++-+-+-+-+-+-++--+
> only showing top 1 row
> ++-+-+-+-+-+-++--++
> |preg| plas| pres| skin| test| mass| pedi| age| class|features|
> ++-+-+-+-+-+-++--++
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
> ++-+-+-+-+-+-++--++
> only showing top 1 row
> (, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> *The below code works fine:
> *
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This 
> will work, but not the frame picked up from the file
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
> "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                       outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> print(css.selectedFeatures)
> {code}
> Output:
> 

[jira] [Updated] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Cheburakshu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheburakshu updated SPARK-22295:

Description: 
The ChiSquare selector does not recognize the field 'class', which is present in the 
data frame, while fitting the model. I am using the PIMA Indians diabetes dataset from 
UCI. The complete code and output are below for reference. However, when some rows 
of the input file are turned into a dataframe manually, it works. I couldn't 
understand the pattern here.

Kindly help.

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name='data/pima-indians-diabetes.data'

df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()

df.show(1)
assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
test', ' mass', ' pedi', ' age'],outputCol="features")
df=assembler.transform(df)
df.show(1)
try:
    css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
                      outputCol="selected", labelCol='class').fit(df)
except:
    print(sys.exc_info())
{code}

Output:

++-+-+-+-+-+-++--+
|preg| plas| pres| skin| test| mass| pedi| age| class|
++-+-+-+-+-+-++--+
|   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
++-+-+-+-+-+-++--+
only showing top 1 row

++-+-+-+-+-+-++--++
|preg| plas| pres| skin| test| mass| pedi| age| class|features|
++-+-+-+-+-+-++--++
|   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
++-+-+-+-+-+-++--++
only showing top 1 row

(, 
IllegalArgumentException('Field "class" does not exist.', 
'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
 at 
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
 at 
org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
 at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at 
org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
 at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
py4j.Gateway.invoke(Gateway.java:280)\n\t at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
java.lang.Thread.run(Thread.java:745)'), )


*The below code works fine:
*
{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name='data/pima-indians-diabetes.data'

#df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()

# Just pasted a few rows from the input file and created a data frame. This 
will work, but not the frame picked up from the file
df = spark.createDataFrame([
[6,148,72,35,0,33.6,0.627,50,1],
[1,85,66,29,0,26.6,0.351,31,0],
[8,183,64,0,0,23.3,0.672,32,1],
], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
"class"])


df.show(1)
assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
test', ' mass', ' pedi', ' age'],outputCol="features")
df=assembler.transform(df)
df.show(1)
try:
    css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
                      outputCol="selected", labelCol="class").fit(df)
except:
    print(sys.exc_info())

print(css.selectedFeatures)

{code}

Output:

++-+-+-+-+-+-++-+
|preg| plas| pres| skin| test| mass| pedi| age|class|
++-+-+-+-+-+-++-+
|   6|  148|   72|   35|0| 33.6|0.627|  50|1|
++-+-+-+-+-+-++-+
only showing top 1 row

++-+-+-+-+-+-++-++
|preg| plas| pres| skin| test| mass| pedi| age|class|features|
++-+-+-+-+-+-++-++
|   6|  148|   72|   35|0| 

[jira] [Updated] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Cheburakshu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheburakshu updated SPARK-22295:

Description: 
The ChiSquare selector does not recognize the field 'class', which is present in the 
data frame, while fitting the model. I am using the PIMA Indians diabetes dataset from 
UCI. The complete code and output are below for reference. However, when some rows 
of the input file are turned into a dataframe manually, it works. I couldn't 
understand the pattern here.

Kindly help.

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name='data/pima-indians-diabetes.data'

df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()

df.show(1)
assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
test', ' mass', ' pedi', ' age'],outputCol="features")
df=assembler.transform(df)
df.show(1)
try:
    css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
                      outputCol="selected", labelCol='class').fit(df)
except:
    print(sys.exc_info())
{code}

Output:

++-+-+-+-+-+-++--+
|preg| plas| pres| skin| test| mass| pedi| age| class|
++-+-+-+-+-+-++--+
|   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
++-+-+-+-+-+-++--+
only showing top 1 row

++-+-+-+-+-+-++--++
|preg| plas| pres| skin| test| mass| pedi| age| class|features|
++-+-+-+-+-+-++--++
|   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
++-+-+-+-+-+-++--++
only showing top 1 row

(, 
IllegalArgumentException('Field "class" does not exist.', 
'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
 at 
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
 at 
org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
 at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at 
org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
 at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
py4j.Gateway.invoke(Gateway.java:280)\n\t at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
java.lang.Thread.run(Thread.java:745)'), )

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name='data/pima-indians-diabetes.data'

#df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()

# Just pasted a few rows from the input file and created a data frame. This 
will work, but not the frame picked up from the file
df = spark.createDataFrame([
[6,148,72,35,0,33.6,0.627,50,1],
[1,85,66,29,0,26.6,0.351,31,0],
[8,183,64,0,0,23.3,0.672,32,1],
], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
"class"])


df.show(1)
assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
test', ' mass', ' pedi', ' age'],outputCol="features")
df=assembler.transform(df)
df.show(1)
try:
    css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
                      outputCol="selected", labelCol="class").fit(df)
except:
    print(sys.exc_info())

print(css.selectedFeatures)

{code}

Output:

++-+-+-+-+-+-++-+
|preg| plas| pres| skin| test| mass| pedi| age|class|
++-+-+-+-+-+-++-+
|   6|  148|   72|   35|0| 33.6|0.627|  50|1|
++-+-+-+-+-+-++-+
only showing top 1 row

++-+-+-+-+-+-++-++
|preg| plas| pres| skin| test| mass| pedi| age|class|features|
++-+-+-+-+-+-++-++
|   6|  148|   72|   35|0| 33.6|0.627|  50|

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208078#comment-16208078
 ] 

Apache Spark commented on SPARK-18016:
--

User 'bdrillard' has created a pull request for this issue:
https://github.com/apache/spark/pull/19518

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Updated] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Cheburakshu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheburakshu updated SPARK-22295:

Description: 
The ChiSquare selector does not recognize the field 'class', which is present in the 
data frame, while fitting the model. I am using the PIMA Indians diabetes dataset from 
UCI. The complete code and output are below for reference.

Kindly help.

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name='data/pima-indians-diabetes.data'

df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()

df.show(1)
assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
test', ' mass', ' pedi', ' age'],outputCol="features")
df=assembler.transform(df)
df.show(1)
try:
    css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
                      outputCol="selected", labelCol='class').fit(df)
except:
    print(sys.exc_info())
{code}

Output:

++-+-+-+-+-+-++--+
|preg| plas| pres| skin| test| mass| pedi| age| class|
++-+-+-+-+-+-++--+
|   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
++-+-+-+-+-+-++--+
only showing top 1 row

++-+-+-+-+-+-++--++
|preg| plas| pres| skin| test| mass| pedi| age| class|features|
++-+-+-+-+-+-++--++
|   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
++-+-+-+-+-+-++--++
only showing top 1 row

(, 
IllegalArgumentException('Field "class" does not exist.', 
'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
 at 
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
 at 
org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
 at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at 
org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
 at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
py4j.Gateway.invoke(Gateway.java:280)\n\t at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
java.lang.Thread.run(Thread.java:745)'), )

  was:
There is a difference in behavior between using the ChiSquare selector and using the 
features directly with the decision tree classifier. 
In the code below, I have used the ChiSquare selector as a pass-through, but the 
decision tree classifier is unable to process its output. It does work when the 
features are used directly.

The example is taken directly from the Apache Spark Python documentation.

Kindly help.

{code:python}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
import sys

df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])

# ChiSq selector will just be a pass-through. All four features in the input will
# be in the output as well.
selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
 outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % 
selector.getNumTopFeatures())

from pyspark.ml.classification import DecisionTreeClassifier

try:
    # Fails
    dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
    model = dt.fit(result)
except:
    print(sys.exc_info())
# Works
dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
model = dt.fit(df)

# Make predictions. Using same dataset, not splitting!!
predictions = model.transform(result)

# Select example rows to display.
predictions.select("prediction", "clicked", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = 

[jira] [Created] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Cheburakshu (JIRA)
Cheburakshu created SPARK-22295:
---

 Summary: Chi Square selector not recognizing field in Data frame
 Key: SPARK-22295
 URL: https://issues.apache.org/jira/browse/SPARK-22295
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.1
Reporter: Cheburakshu


There is a difference in behavior between using the ChiSquare selector and using the 
features directly with the decision tree classifier. 
In the code below, I have used the ChiSquare selector as a pass-through, but the 
decision tree classifier is unable to process its output. It does work when the 
features are used directly.

The example is taken directly from the Apache Spark Python documentation.

Kindly help.

{code:python}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
import sys

df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])

# ChiSq selector will just be a pass-through. All four features in the input will
# be in the output as well.
selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
 outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % 
selector.getNumTopFeatures())

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

try:
    # Fails
    dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
    model = dt.fit(result)
except:
    print(sys.exc_info())
# Works
dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
model = dt.fit(df)

# Make predictions. Using same dataset, not splitting!!
predictions = model.transform(result)

# Select example rows to display.
predictions.select("prediction", "clicked", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="clicked", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
{code}

Output:

ChiSqSelector output with top 4 features selected
(, 
IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it 
does not have the number of values specified.', 
'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
 at 
org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
 at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
 at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
 at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
 at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
py4j.Gateway.invoke(Gateway.java:280)\n\t at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
java.lang.Thread.run(Thread.java:745)'), )
+----------+-------+------------------+
|prediction|clicked|          features|
+----------+-------+------------------+
|       1.0|    1.0|[0.0,0.0,18.0,1.0]|
|       0.0|    0.0|[0.0,1.0,12.0,0.0]|
|       0.0|    0.0|[1.0,0.0,15.0,0.1]|
+----------+-------+------------------+

Test Error = 0 
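
The error text points at the ML attribute metadata on the selector's output column. Below is a sketch, not from the original report and not verified against this exact setup, of one way to look at that metadata and, as a possible workaround, strip it by passing the vector through an identity UDF (the names {{strip_metadata}} and {{cleaned}} are made up):

{code:python}
# Hedged sketch, not from the report: inspect the ML attribute metadata that
# ChiSqSelector attaches to its output column, then strip it with an identity
# UDF (a UDF result column carries no metadata).
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT

for f in result.schema.fields:
    if f.name in ("features", "selectedFeatures"):
        print(f.name, f.metadata)   # selector output carries nominal-attribute metadata

strip_metadata = udf(lambda v: v, VectorUDT())
cleaned = result.withColumn("selectedFeatures", strip_metadata("selectedFeatures"))
model = DecisionTreeClassifier(labelCol="clicked",
                               featuresCol="selectedFeatures").fit(cleaned)
{code}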



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe

2017-10-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22249:

Component/s: (was: PySpark)
 SQL

> UnsupportedOperationException: empty.reduceLeft when caching a dataframe
> 
>
> Key: SPARK-22249
> URL: https://issues.apache.org/jira/browse/SPARK-22249
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: $ uname -a
> Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 
> 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
> $ pyspark --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>   /_/
> 
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92
> Branch 
> Compiled by user jenkins on 2017-06-30T22:58:04Z
> Revision 
> Url 
>Reporter: Andreas Maier
>Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
>
> It seems that the {{isin()}} method with an empty list as argument only 
> works, if the dataframe is not cached. If it is cached, it results in an 
> exception. To reproduce
> {code:java}
> $ pyspark
> >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
> >>> df.where(df["KEY"].isin([])).show()
> +---+
> |KEY|
> +---+
> +---+
> >>> df.cache()
> DataFrame[KEY: string]
> >>> df.where(df["KEY"].isin([])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py",
>  line 336, in show
> print(self._jdf.showString(n, 20))
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
> : java.lang.UnsupportedOperationException: empty.reduceLeft
>   at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
>   at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
>   at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
>   at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>  
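
For affected versions, a minimal user-side guard (a sketch assuming the {{df}} from the reproduction above; the empty-list check is the only addition) avoids calling {{isin}} with an empty list in the first place:

{code:python}
# Not from the ticket: short-circuit the empty-list case so isin([]) is never
# built against a cached DataFrame.
from pyspark.sql import functions as F

values = []                      # hypothetical filter values
cond = F.col("KEY").isin(values) if values else F.lit(False)
df.where(cond).show()            # empty result either way
{code}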

[jira] [Commented] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes

2017-10-17 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208027#comment-16208027
 ] 

Ruslan Dautkhanov commented on SPARK-21213:
---

Would the partition-level stats be compatible with Hive/Impala partition-level 
stats? Thanks.

> Support collecting partition-level statistics: rowCount and sizeInBytes
> ---
>
> Key: SPARK-21213
> URL: https://issues.apache.org/jira/browse/SPARK-21213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>Assignee: Maria
> Fix For: 2.3.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Support ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
> [NOSCAN] SQL command to compute and store in Hive Metastore number of rows 
> and total size in bytes for individual partitions.
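
For reference, the command shape described above can be issued through {{spark.sql}} once the feature lands; a sketch with made-up table and partition names:

{code:python}
# Hypothetical table and partition names; shows the ANALYZE TABLE ... PARTITION
# syntax described in this ticket (targeted for 2.3.0).
spark.sql("""
    ANALYZE TABLE sales PARTITION (year = 2017, month = 10)
    COMPUTE STATISTICS NOSCAN
""")
{code}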



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-17 Thread Albert Meltzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208020#comment-16208020
 ] 

Albert Meltzer commented on SPARK-22283:


[~cjm] thank you for finding the new implementation, hopefully it addresses 
this problem; also, it would have been nice to expose {{withColumns(columns: 
Map[String, Column])}} as public and with a different signature.
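
In the meantime, a user-side approximation of that suggestion is straightforward to sketch (purely illustrative; it simply chains {{withColumn}}, so it inherits the duplicate-name behavior this ticket describes):

{code:python}
# Illustrative only: a user-side stand-in for the suggested withColumns(...)
# helper. It chains withColumn calls one by one.
def with_columns(df, columns):
    """columns: dict mapping column name -> Column expression."""
    for name, col in columns.items():
        df = df.withColumn(name, col)
    return df
{code}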

> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-17 Thread Albert Meltzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208011#comment-16208011
 ] 

Albert Meltzer commented on SPARK-22283:


[~viirya] I'm not doing a select, I'm trying to replace the column with another 
value.

The use case is rather simple: {{a.join(b, ..., "left").withColumn("c", 
coalesce(a("c"), b("c"))).select(..., "c")}}.

So I'm explicitly differentiating between the two sides, but then I can't 
select the presumably unique column in the result, since it's now there 
multiple times.
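
A tiny PySpark sketch of that situation (made-up data, not from the ticket) shows the reported behavior: both pre-existing {{c}} columns are replaced, two identical {{c}} columns remain, and selecting {{c}} by name is ambiguous:

{code:python}
# Made-up data to illustrate the reported behavior; the ambiguity error text
# may vary between versions.
from pyspark.sql import functions as F

a = spark.createDataFrame([(1, "x"), (2, None)], ["id", "c"])
b = spark.createDataFrame([(1, "y"), (3, "z")], ["id", "c"])

joined = a.join(b, a["id"] == b["id"], "left")
patched = joined.withColumn("c", F.coalesce(a["c"], b["c"]))
patched.printSchema()      # still two "c" columns, now holding the same value
# patched.select("c")      # raises an "ambiguous reference" error
{code}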

> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21459) Some aggregation functions change the case of nested field names

2017-10-17 Thread David Allsopp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Allsopp updated SPARK-21459:
--
Description: 
When working with DataFrames with nested schemas, the behavior of the 
aggregation functions is inconsistent with respect to preserving the case of 
the nested field names.

For example, {{first()}} preserves the case of the field names, but 
{{collect_set()}} and {{collect_list()}} force the field names to lowercase.

Expected behavior: Field name case is preserved (or is at least consistent and 
documented)

Spark-shell session to reproduce:

*Update*: After trying different versions, I discovered that this problem 
occurs in the version of Spark 1.6.0 shipped with Cloudera CDH, not plain 
Spark. 
The plain Spark 1.6.0 does not support structs in aggregation operations such 
as {{collect_set}} at all.

{code:java}
case class Inner(Key: String, Value: String)
case class Outer(ID: Long, Pairs: Array[Inner])

val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
val df = sqlContext.createDataFrame(rdd)

scala> df
... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]

scala> df.groupBy("ID").agg(first("Pairs"))
... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
// Note that Key and Value preserve their original case

scala> df.groupBy("ID").agg(collect_set("Pairs"))
... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>]
// Note that key and value are now lowercased

{code}

Additionally, the column name (generated during aggregation) is inconsistent: 
{{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses 
in the first name.

  was:
When working with DataFrames with nested schemas, the behavior of the 
aggregation functions is inconsistent with respect to preserving the case of 
the nested field names.

For example, {{first()}} preserves the case of the field names, but 
{{collect_set()}} and {{collect_list()}} force the field names to lowercase.

Expected behavior: Field name case is preserved (or is at least consistent and 
documented)

Spark-shell session to reproduce:


{code:java}
case class Inner(Key:String, Value:String)
case class Outer(ID:Long, Pairs:Array[Inner])

val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")
val df = sqlContext.createDataFrame(rdd)

scala> df
... = [ID: bigint, Pairs: array>]

scala>df.groupBy("ID").agg(first("Pairs"))
... = [ID: bigint, first(Pairs)(): array>]
// Note that Key and Value preserve their original case

scala>df.groupBy("ID").agg(collect_set("Pairs"))
... = [ID: bigint, collect_set(Pairs): array>]
// Note that key and value are now lowercased

{code}

Additionally, the column name (generated during aggregation) is inconsistent: 
{{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses 
in the first name.


> Some aggregation functions change the case of nested field names
> 
>
> Key: SPARK-21459
> URL: https://issues.apache.org/jira/browse/SPARK-21459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: David Allsopp
>Priority: Minor
>
> When working with DataFrames with nested schemas, the behavior of the 
> aggregation functions is inconsistent with respect to preserving the case of 
> the nested field names.
> For example, {{first()}} preserves the case of the field names, but 
> {{collect_set()}} and {{collect_list()}} force the field names to lowercase.
> Expected behavior: Field name case is preserved (or is at least consistent 
> and documented)
> Spark-shell session to reproduce:
> *Update*: After trying different versions, I discovered that this problem 
> occurs in the version of Spark 1.6.0 shipped with Cloudera CDH, not plain 
> Spark. 
> The plain Spark 1.6.0 does not support structs in aggregation operations such 
> as {{collect_set}} at all.
> {code:java}
> case class Inner(Key:String, Value:String)
> case class Outer(ID:Long, Pairs:Array[Inner])
> val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")
> val df = sqlContext.createDataFrame(rdd)
> scala> df
> ... = [ID: bigint, Pairs: array>]
> scala>df.groupBy("ID").agg(first("Pairs"))
> ... = [ID: bigint, first(Pairs)(): array>]
> // Note that Key and Value preserve their original case
> scala>df.groupBy("ID").agg(collect_set("Pairs"))
> ... = [ID: bigint, collect_set(Pairs): array>]
> // Note that key and value are now lowercased
> {code}
> Additionally, the column name (generated during aggregation) is inconsistent: 
> {{first(Pairs)()}} versus 

[jira] [Comment Edited] (SPARK-21459) Some aggregation functions change the case of nested field names

2017-10-17 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207270#comment-16207270
 ] 

David Allsopp edited comment on SPARK-21459 at 10/17/17 4:54 PM:
-

Just trying to see when this problem was resolved:
* *Update*: It is present in the _Cloudera_ distribution of 1.6.0 (CDH 5.8+), not 
plain 1.6.0
* In 1.6.0 and 1.6.3 (currently the latest 1.6.x version), the {{collect_set}} 
aggregation operation fails with an {{org.apache.spark.sql.AnalysisException: 
No handler for Hive udf class 
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet because: Only 
primitive type arguments are accepted but 
array<struct<Key:string,Value:string>> was passed as parameter 1..;}}
* In 2.0.2 and 2.2.0 it works as expected


was (Author: dallsoppuk):
Just trying to see when this problem was resolved:
* It is present in 1.6.0, as originally reported
* In 1.6.3 (currently the latest 1.6.x version), the {{collect_set}} 
aggregation operation fails with an {{org.apache.spark.sql.AnalysisException: 
No handler for Hive udf class 
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet because: Only 
primitive type arguments are accepted but 
array> was passed as parameter 1..;}}
* In 2.0.2 and 2.2.0 the problem has gone.

> Some aggregation functions change the case of nested field names
> 
>
> Key: SPARK-21459
> URL: https://issues.apache.org/jira/browse/SPARK-21459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: David Allsopp
>Priority: Minor
>
> When working with DataFrames with nested schemas, the behavior of the 
> aggregation functions is inconsistent with respect to preserving the case of 
> the nested field names.
> For example, {{first()}} preserves the case of the field names, but 
> {{collect_set()}} and {{collect_list()}} force the field names to lowercase.
> Expected behavior: Field name case is preserved (or is at least consistent 
> and documented)
> Spark-shell session to reproduce:
> {code:java}
> case class Inner(Key:String, Value:String)
> case class Outer(ID:Long, Pairs:Array[Inner])
> val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")
> val df = sqlContext.createDataFrame(rdd)
> scala> df
> ... = [ID: bigint, Pairs: array>]
> scala>df.groupBy("ID").agg(first("Pairs"))
> ... = [ID: bigint, first(Pairs)(): array>]
> // Note that Key and Value preserve their original case
> scala>df.groupBy("ID").agg(collect_set("Pairs"))
> ... = [ID: bigint, collect_set(Pairs): array>]
> // Note that key and value are now lowercased
> {code}
> Additionally, the column name (generated during aggregation) is inconsistent: 
> {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses 
> in the first name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22181) ReplaceExceptWithFilter if one or both of the datasets are fully derived out of Filters from a same parent

2017-10-17 Thread Sathiya Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-22181:
--
Description: 
When the Except operator is applied to two datasets and one or both of them are 
derived purely through filter operations, then instead of rewriting the Except 
operator as an expensive join, we can rewrite it as a filter by negating the 
filter condition of the right node.

Example:

{code:sql}
   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND (a1 IS NULL OR a1 <> 5)
{code}

For more details please refer to [this 
post|https://github.com/sathiyapk/Blog-Posts/blob/master/SparkOptimizer.md]

  was:
While applying Except operator between two datasets, if one or both of the 
datasets are purely transformed using filter operations, then instead of 
rewriting the Except operator using expensive join operation, we can rewrite it 
using filter operation by flipping the filter condition of the right node.

Example:

{code:sql}
   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE 
a1 = 5
   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
{code}

For more details please refer: [this 
post|https://github.com/sathiyapk/Blog-Posts/blob/master/SparkOptimizer.md]


> ReplaceExceptWithFilter if one or both of the datasets are fully derived out 
> of Filters from a same parent
> --
>
> Key: SPARK-22181
> URL: https://issues.apache.org/jira/browse/SPARK-22181
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> While applying Except operator between two datasets, if one or both of the 
> datasets are purely transformed using filter operations, then instead of 
> rewriting the Except operator using expensive join operation, we can rewrite 
> it using filter operation by flipping the filter condition of the right node.
> Example:
> {code:sql}
>SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE 
> a1 = 5
>==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND (a1 is null OR a1 
> <> 5)
> {code}
> For more details please refer: [this 
> post|https://github.com/sathiyapk/Blog-Posts/blob/master/SparkOptimizer.md]
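
Since the rewrite above is a pure equivalence, it can be sanity-checked from PySpark even without the optimizer rule; a minimal sketch with made-up rows for {{Tab1}}:

{code:python}
# Made-up rows for Tab1; both queries should return the same result whether or
# not the ReplaceExceptWithFilter rule is applied, since the rewrite is an
# equivalence.
tab1 = spark.createDataFrame(
    [(5, 12), (6, 12), (5, 7), (None, 12)], ["a1", "a2"])
tab1.createOrReplaceTempView("Tab1")

except_df = spark.sql(
    "SELECT a1, a2 FROM Tab1 WHERE a2 = 12 "
    "EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5")
rewritten = spark.sql(
    "SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 "
    "AND (a1 IS NULL OR a1 <> 5)")

assert set(except_df.collect()) == set(rewritten.collect())
{code}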



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-10-17 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207798#comment-16207798
 ] 

Fernando Pereira commented on SPARK-22250:
--

I did some tests, and even though verifySchema=False might help in some cases, it 
is still not enough for handling numpy arrays. Arrays from the array module work 
nicely (even without disabling verifySchema); I think that is because elements of 
array.array are automatically converted to their corresponding Python type.

So I think the problem mentioned involves two issues:
# Accepting ints in float fields. Apparently it is just a matter of schema 
verification, so _acceptable_types should be updated.
# Accepting a numpy array for an ArrayType field. In this case, even with 
verifySchema=False there is still an exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for numpy.core.multiarray._reconstruct)

I believe numpy array support, even in this simple case, would be extremely 
valuable for a lot of people. In our case we work with large hdf5 files where 
the data interface is numpy.
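
For context, the conversion that does satisfy the current type check is exactly the one the description calls prohibitively expensive; a minimal sketch (schema and field names are made up):

{code:python}
# Made-up schema: shows the numpy -> plain Python conversion that the current
# _verify_type/_acceptable_types check requires, i.e. the cost this ticket
# would like to avoid.
import numpy as np
from pyspark.sql.types import StructType, StructField, FloatType, ArrayType

schema = StructType([
    StructField("x", FloatType()),
    StructField("xs", ArrayType(FloatType())),
])

rows = [(np.float32(1.5), np.arange(3, dtype=np.float64))]
converted = [(float(x), xs.tolist()) for x, xs in rows]   # numpy -> Python
df = spark.createDataFrame(converted, schema)
df.show()
{code}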

> Be less restrictive on type checking
> 
>
> Key: SPARK-22250
> URL: https://issues.apache.org/jira/browse/SPARK-22250
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Fernando Pereira
>Priority: Minor
>
> I find types.py._verify_type() often too restrictive. E.g. 
> {code}
> TypeError: FloatType can not accept object 0 in type 
> {code}
> I believe it would be globally acceptable to fill a float field with an int, 
> especially since in some formats (json) you don't have a way of inferring the 
> type correctly.
> Another situation relates to other equivalent numerical types, like 
> array.array or numpy. A numpy scalar int is not accepted as an int, and these 
> arrays have always to be converted down to plain lists, which can be 
> prohibitively large and computationally expensive.
> Any thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one

2017-10-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207780#comment-16207780
 ] 

Liang-Chi Hsieh commented on SPARK-22283:
-

When joined result has duplicate column name, you can't select any of the 
ambiguous columns by just name. Doesn't {{withColumn}} current behavior simply 
follow it?


> withColumn should replace multiple instances with a single one
> --
>
> Key: SPARK-22283
> URL: https://issues.apache.org/jira/browse/SPARK-22283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Albert Meltzer
>
> Currently, {{withColumn}} claims to do the following: _"adding a column or 
> replacing the existing column that has the same name."_
> Unfortunately, if multiple existing columns have the same name (which is a 
> normal occurrence after a join), this results in multiple replaced -- and 
> retained --
>  columns (with the same value), and messages about an ambiguous column.
> The current implementation of {{withColumn}} contains this:
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val shouldReplace = output.exists(f => resolver(f.name, colName))
> if (shouldReplace) {
>   val columns = output.map { field =>
> if (resolver(field.name, colName)) {
>   col.as(colName)
> } else {
>   Column(field)
> }
>   }
>   select(columns : _*)
> } else {
>   select(Column("*"), col.as(colName))
> }
>   }
> {noformat}
> Instead, suggest something like this (which replaces all matching fields with 
> a single instance of the new one):
> {noformat}
>   def withColumn(colName: String, col: Column): DataFrame = {
> val resolver = sparkSession.sessionState.analyzer.resolver
> val output = queryExecution.analyzed.output
> val existing = output.filterNot(f => resolver(f.name, colName)).map(new 
> Column(_))
> select(existing :+ col.as(colName): _*)
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22062) BlockManager does not account for memory consumed by remote fetches

2017-10-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22062:
---

Assignee: Saisai Shao

> BlockManager does not account for memory consumed by remote fetches
> ---
>
> Key: SPARK-22062
> URL: https://issues.apache.org/jira/browse/SPARK-22062
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.2.0
>Reporter: Sergei Lebedev
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.3.0
>
>
> We use Spark exclusively with {{StorageLevel.DiskOnly}} as our workloads are 
> very sensitive to memory usage. Recently, we've spotted that the jobs 
> sometimes OOM leaving lots of byte[] arrays on the heap. Upon further 
> investigation, we've found that the arrays come from 
> {{BlockManager.getRemoteBytes}}, which 
> [calls|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L638]
>  {{BlockTransferService.fetchBlockSync}}, which in its turn would 
> [allocate|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L99]
>  an on-heap {{ByteBuffer}} of the same size as the block (e.g. full 
> partition), if the block was successfully retrieved over the network.
> This memory is not accounted towards Spark storage/execution memory and could 
> potentially lead to OOM if {{BlockManager}} fetches too many partitions in 
> parallel. I wonder if this is intentional behaviour, or in fact a bug?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22062) BlockManager does not account for memory consumed by remote fetches

2017-10-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22062.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19476
[https://github.com/apache/spark/pull/19476]

> BlockManager does not account for memory consumed by remote fetches
> ---
>
> Key: SPARK-22062
> URL: https://issues.apache.org/jira/browse/SPARK-22062
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.2.0
>Reporter: Sergei Lebedev
>Priority: Minor
> Fix For: 2.3.0
>
>
> We use Spark exclusively with {{StorageLevel.DiskOnly}} as our workloads are 
> very sensitive to memory usage. Recently, we've spotted that the jobs 
> sometimes OOM leaving lots of byte[] arrays on the heap. Upon further 
> investigation, we've found that the arrays come from 
> {{BlockManager.getRemoteBytes}}, which 
> [calls|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L638]
>  {{BlockTransferService.fetchBlockSync}}, which in its turn would 
> [allocate|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L99]
>  an on-heap {{ByteBuffer}} of the same size as the block (e.g. full 
> partition), if the block was successfully retrieved over the network.
> This memory is not accounted towards Spark storage/execution memory and could 
> potentially lead to OOM if {{BlockManager}} fetches too many partitions in 
> parallel. I wonder if this is intentional behaviour, or in fact a bug?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19606) Support constraints in spark-dispatcher

2017-10-17 Thread Pascal GILLET (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207722#comment-16207722
 ] 

Pascal GILLET edited comment on SPARK-19606 at 10/17/17 2:39 PM:
-

* _If "spark.mesos.constraints" is passed with the job then it will wind up 
overriding the value specified in the "driverDefault" property._: *False*. 
"spark.mesos.constraints" still applies for executors only, while the 
"driverDefault" will apply for the driver.
* _If "spark.mesos.constraints" is not passed with the job, then the value 
specified in the "driverDefault" property will get passed to the executors - 
which we definitely don't want._: *True*

OK then to add the "spark.mesos.constraints.driver" property.


was (Author: pgillet):
* _If "spark.mesos.constraints" is passed with the job then it will wind up 
overriding the value specified in the "driverDefault" property._: False. 
"spark.mesos.constraints" still applies for executors only, while the 
"driverDefault" will apply for the driver.
* _If "spark.mesos.constraints" is not passed with the job, then the value 
specified in the "driverDefault" property will get passed to the executors - 
which we definitely don't want._: True

OK then to add the "spark.mesos.constraints.driver" property.

> Support constraints in spark-dispatcher
> ---
>
> Key: SPARK-19606
> URL: https://issues.apache.org/jira/browse/SPARK-19606
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Philipp Hoffmann
>
> The `spark.mesos.constraints` configuration is ignored by the 
> spark-dispatcher. The constraints need to be passed in the Framework 
> information when registering with Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19606) Support constraints in spark-dispatcher

2017-10-17 Thread Pascal GILLET (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207722#comment-16207722
 ] 

Pascal GILLET edited comment on SPARK-19606 at 10/17/17 2:38 PM:
-

* _If "spark.mesos.constraints" is passed with the job then it will wind up 
overriding the value specified in the "driverDefault" property._: False. 
"spark.mesos.constraints" still applies for executors only, while the 
"driverDefault" will apply for the driver.
* _If "spark.mesos.constraints" is not passed with the job, then the value 
specified in the "driverDefault" property will get passed to the executors - 
which we definitely don't want._: True

OK then to add the "spark.mesos.constraints.driver" property.


was (Author: pgillet):
* _If "spark.mesos.constraints" is passed with the job then it will wind up 
overriding the value specified in the "driverDefault" property. _: False. 
"spark.mesos.constraints" still applies for executors only, while the 
"driverDefault" will apply for the driver.
* _If "spark.mesos.constraints" is not passed with the job, then the value 
specified in the "driverDefault" property will get passed to the executors - 
which we definitely don't want._: True

OK then to add the "spark.mesos.constraints.driver" property.

> Support constraints in spark-dispatcher
> ---
>
> Key: SPARK-19606
> URL: https://issues.apache.org/jira/browse/SPARK-19606
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Philipp Hoffmann
>
> The `spark.mesos.constraints` configuration is ignored by the 
> spark-dispatcher. The constraints need to be passed in the Framework 
> information when registering with Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19606) Support constraints in spark-dispatcher

2017-10-17 Thread Pascal GILLET (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207722#comment-16207722
 ] 

Pascal GILLET commented on SPARK-19606:
---

* _If "spark.mesos.constraints" is passed with the job then it will wind up 
overriding the value specified in the "driverDefault" property. _: False. 
"spark.mesos.constraints" still applies for executors only, while the 
"driverDefault" will apply for the driver.
* _If "spark.mesos.constraints" is not passed with the job, then the value 
specified in the "driverDefault" property will get passed to the executors - 
which we definitely don't want._: True

OK then to add the "spark.mesos.constraints.driver" property.

> Support constraints in spark-dispatcher
> ---
>
> Key: SPARK-19606
> URL: https://issues.apache.org/jira/browse/SPARK-19606
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Philipp Hoffmann
>
> The `spark.mesos.constraints` configuration is ignored by the 
> spark-dispatcher. The constraints need to be passed in the Framework 
> information when registering with Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20396) groupBy().apply() with pandas udf in pyspark

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207711#comment-16207711
 ] 

Apache Spark commented on SPARK-20396:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19517

> groupBy().apply() with pandas udf in pyspark
> 
>
> Key: SPARK-20396
> URL: https://issues.apache.org/jira/browse/SPARK-20396
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Li Jin
>Assignee: Li Jin
> Fix For: 2.3.0
>
>
> split-apply-merge is a common pattern when analyzing data. It is implemented 
> in many popular data analyzing libraries such as Spark, Pandas, R, and etc. 
> Split and merge operations in these libraries are similar to each other, 
> mostly implemented by certain grouping operators. For instance, Spark 
> DataFrame has groupBy, Pandas DataFrame has groupby. Therefore, for users 
> familiar with either Spark DataFrame or pandas DataFrame, it is not difficult 
> for them to understand how grouping works in the other library. However, 
> apply is more native to different libraries and therefore, quite different 
> between libraries. A pandas user knows how to use apply to do curtain 
> transformation in pandas might not know how to do the same using pyspark. 
> Also, the current implementation of passing data from the java executor to 
> python executor is not efficient, there is opportunity to speed it up using 
> Apache Arrow. This feature can enable use cases that uses Spark's grouping 
> operators such as groupBy, rollUp, cube, window and Pandas's native apply 
> operator.
> Related work:
> SPARK-13534
> This enables faster data serialization between Pyspark and Pandas using 
> Apache Arrow. Our work will be on top of this and use the same serialization 
> for pandas udf.
> SPARK-12919 and SPARK-12922
> These implemented two functions: dapply and gapply in Spark R which 
> implements the similar split-apply-merge pattern that we want to implement 
> with Pyspark. 
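
For reference, a sketch of the shape this API took as released in Spark 2.3 (grouped-map pandas UDFs); the decorator arguments reflect that release and may differ from the intermediate pull requests. It requires pandas and pyarrow on the workers.

{code:python}
# Grouped-map pandas UDF sketch, roughly as released with this work in 2.3.
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame containing all rows of one group
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
{code}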



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22288) Tricky interaction between closure-serialization and inheritance results in confusing failure

2017-10-17 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207706#comment-16207706
 ] 

Ryan Williams commented on SPARK-22288:
---

Makes sense, fine with me to "Won't Fix".

bq. You can always use a different serializer like kryo

NB: this is closure-serialization, where [only Java has ever worked 
afaik|https://issues.apache.org/jira/browse/SPARK-12414]

> Tricky interaction between closure-serialization and inheritance results in 
> confusing failure
> -
>
> Key: SPARK-22288
> URL: https://issues.apache.org/jira/browse/SPARK-22288
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> Documenting this since I've run into it a few times; [full repro / discussion 
> here|https://github.com/ryan-williams/spark-bugs/tree/serde].
> Given 3 possible super-classes:
> {code}
> class Super1(n: Int)
> class Super2(n: Int) extends Serializable
> class Super3
> {code}
> A subclass that passes a closure to an RDD operation (e.g. {{map}} or 
> {{filter}}), where the closure references one of the subclass's fields, will 
> throw an {{java.io.InvalidClassException: …; no valid constructor}} exception 
> when the subclass extends {{Super1}} but not {{Super2}} or {{Super3}}. 
> Referencing method-local variables (instead of fields) is fine in all cases:
> {code}
> class App extends Super1(4) with Serializable {
>   val s = "abc"
>   def run(): Unit = {
> val sc = new SparkContext(new SparkConf().set("spark.master", 
> "local[4]").set("spark.app.name", "serde-test"))
> try {
>   sc
> .parallelize(1 to 10)
> .filter(Main.fn(_, s))  // danger! closure references `s`, crash 
> ensues
> .collect()  // driver stack-trace points here
> } finally {
>   sc.stop()
> }
>   }
> }
> object App {
>   def main(args: Array[String]): Unit = { new App().run() }
>   def fn(i: Int, s: String): Boolean = i % 2 == 0
> }
> {code}
> The task-failure stack trace looks like:
> {code}
> java.io.InvalidClassException: com.MyClass; no valid constructor
>   at 
> java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
>   at 
> java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1782)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
> {code}
> and a driver stack-trace will point to the first line that initiates a Spark 
> job that exercises the closure/RDD-operation in question.
> Not sure how much this should be considered a problem with Spark's 
> closure-serialization logic vs. Java serialization, but maybe if the former 
> gets looked at or improved (e.g. with 2.12 support), this kind of interaction 
> can be improved upon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207618#comment-16207618
 ] 

Apache Spark commented on SPARK-22277:
--

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19516

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when Chisquare selector is used v direct 
> feature use in decision tree classifier. 
> In the below code, I have used chisquare selector as a thru' pass but the 
> decision tree classifier is unable to process it. But, it is able to process 
> when the features are used directly.
> The example is pulled out directly from Apache spark python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four featuresin the i/p 
> will be in output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> +--+---+--+
> |prediction|clicked|  features|
> +--+---+--+
> |   1.0|1.0|[0.0,0.0,18.0,1.0]|
> |   0.0|0.0|[0.0,1.0,12.0,0.0]|
> |   0.0|

[jira] [Assigned] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22277:


Assignee: (was: Apache Spark)

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when Chisquare selector is used v direct 
> feature use in decision tree classifier. 
> In the below code, I have used chisquare selector as a thru' pass but the 
> decision tree classifier is unable to process it. But, it is able to process 
> when the features are used directly.
> The example is pulled out directly from Apache spark python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four featuresin the i/p 
> will be in output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> +--+---+--+
> |prediction|clicked|  features|
> +--+---+--+
> |   1.0|1.0|[0.0,0.0,18.0,1.0]|
> |   0.0|0.0|[0.0,1.0,12.0,0.0]|
> |   0.0|0.0|[1.0,0.0,15.0,0.1]|
> +--+---+--+
> Test Error = 0 



--
This message 

[jira] [Assigned] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22277:


Assignee: Apache Spark

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>Assignee: Apache Spark
>
> There is a difference in behavior when Chisquare selector is used v direct 
> feature use in decision tree classifier. 
> In the below code, I have used chisquare selector as a thru' pass but the 
> decision tree classifier is unable to process it. But, it is able to process 
> when the features are used directly.
> The example is pulled out directly from Apache spark python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four featuresin the i/p 
> will be in output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> +--+---+--+
> |prediction|clicked|  features|
> +--+---+--+
> |   1.0|1.0|[0.0,0.0,18.0,1.0]|
> |   0.0|0.0|[0.0,1.0,12.0,0.0]|
> |   0.0|0.0|[1.0,0.0,15.0,0.1]|
> +--+---+--+
> Test 

[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2017-10-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207573#comment-16207573
 ] 

Steve Loughran commented on SPARK-2984:
---

bq. multiple batches writing to same location simultaneously


Hadoop {{FileOutputCommitter}} cleans up $dest/_temporary while committing or 
aborting a job.

If you are writing more than one job to the same directory tree simultaneously, expect the 
job cleanup in one of them to break the others. You could try overriding 
ParquetOutputCommitter.cleanupJob() to stop this, but it's probably safer to 
work out why output to the same path is happening in parallel and stop it.
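
For illustration, a minimal sketch of that escape hatch (illustrative only, not something 
Parquet or Spark ships): a committer whose job-level cleanup is a no-op, so one job's commit 
or abort no longer deletes $dest/_temporary out from under another. The _temporary 
directories then have to be removed out of band, and the class would be wired in through 
{{spark.sql.parquet.output.committer.class}} where that option is available.

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.parquet.hadoop.ParquetOutputCommitter

// Sketch only: skip the job-level cleanup that deletes $dest/_temporary, so a
// concurrent job writing to the same path is not broken by this job's commit/abort.
class NoCleanupParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
  extends ParquetOutputCommitter(outputPath, context) {

  override def cleanupJob(jobContext: JobContext): Unit = {
    // intentionally empty: _temporary must be cleaned up out of band
  }
}
{code}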

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
> 

[jira] [Assigned] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22287:


Assignee: (was: Apache Spark)

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> -
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: paul mackles
>Priority: Minor
>
> There does not appear to be a way to control the heap size used by 
> MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
> honored for that particular daemon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207541#comment-16207541
 ] 

Apache Spark commented on SPARK-22287:
--

User 'pmackles' has created a pull request for this issue:
https://github.com/apache/spark/pull/19515

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> -
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: paul mackles
>Priority: Minor
>
> There does not appear to be a way to control the heap size used by 
> MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
> honored for that particular daemon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22287:


Assignee: Apache Spark

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> -
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: paul mackles
>Assignee: Apache Spark
>Priority: Minor
>
> There does not appear to be a way to control the heap size used by 
> MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
> honored for that particular daemon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207540#comment-16207540
 ] 

Apache Spark commented on SPARK-21551:
--

User 'FRosner' has created a pull request for this issue:
https://github.com/apache/spark/pull/19514

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Assignee: peay
>Priority: Critical
> Fix For: 2.3.0
>
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]
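
For illustration only (a rough sketch, not the code from the pull request above): the 
hardcoded 3-second window is an accept timeout on the ephemeral server socket, and making 
it a parameter, or reading it from a config key, looks roughly like this.

{code}
import java.net.{InetAddress, ServerSocket, Socket}

// Sketch: an ephemeral local server whose accept timeout is passed in rather than
// hardcoded to 3 seconds. Returns the port the client should dial.
def serveOnce(timeoutMs: Int)(handle: Socket => Unit): Int = {
  val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
  server.setSoTimeout(timeoutMs)                // e.g. 15000 instead of 3000
  val worker = new Thread("serve-once") {
    setDaemon(true)
    override def run(): Unit = {
      try {
        val sock = server.accept()              // SocketTimeoutException if the client is too slow
        try handle(sock) finally sock.close()
      } finally server.close()
    }
  }
  worker.start()
  server.getLocalPort                           // the client (e.g. the Python side) connects here
}
{code}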



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207539#comment-16207539
 ] 

Apache Spark commented on SPARK-21551:
--

User 'FRosner' has created a pull request for this issue:
https://github.com/apache/spark/pull/19513

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Assignee: peay
>Priority: Critical
> Fix For: 2.3.0
>
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207538#comment-16207538
 ] 

Apache Spark commented on SPARK-21551:
--

User 'FRosner' has created a pull request for this issue:
https://github.com/apache/spark/pull/19512

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Assignee: peay
>Priority: Critical
> Fix For: 2.3.0
>
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2017-10-17 Thread Soumitra Sulav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207519#comment-16207519
 ] 

Soumitra Sulav commented on SPARK-2984:
---

I'm facing the same issue with Spark 2.0.2 on DC/OS with HDFS [Parquet files].
After a few hours of streaming, when the load increases and forces multiple 
batches to write to the same location simultaneously, this error is reproduced:

{noformat}
java.io.IOException: Failed to rename 
FileStatus{path=hdfs://namenode/xxx/xxx/parquet/snappy/customer_demographics/datetime=2017101710/_temporary/0/task_201710171034_0026_m_02/part-r-2-59deea14-b221-4a91-bcd3-78e2dbaf6b91.snappy.parquet;
 isDirectory=false; length=2614; replication=3; blocksize=134217728; 
modification_time=1508236440393; access_time=1508236440331; owner=root; 
group=hdfs; permission=rw-r--r--; isSymlink=false} to 
hdfs://namenode/xxx/xxx/parquet/snappy/customer_demographics//datetime=2017101710/part-r-2-59deea14-b221-4a91-bcd3-78e2dbaf6b91.snappy.parquet
{noformat}

Spark configurations used:
{code}
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.speculation","false")
.set("spark.sql.parquet.mergeSchema", "false")
.set("spark.mapreduce.fileoutputcommitter.algorithm.version", "2")
{code}

Config in the properties file used with spark-submit:
{noformat}
spark.streaming.concurrentJobs 4
{noformat}
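
With {{spark.streaming.concurrentJobs 4}}, several batches can be committing into the same 
{{datetime=...}} directory at once, which is exactly the overlapping job cleanup described 
earlier in this thread. A sketch of one possible workaround (the path layout and batch-time 
suffix are placeholders, not a recommendation for this exact schema): route each batch into 
its own directory so no two jobs share a {{_temporary}} folder.

{code}
import org.apache.spark.sql.DataFrame

// Sketch (placeholder paths): each batch writes to its own directory, so two
// concurrently running jobs never share one _temporary folder.
def writeBatch(df: DataFrame, batchTimeMs: Long): Unit = {
  df.write
    .mode("overwrite")
    .parquet(s"hdfs://namenode/xxx/xxx/parquet/snappy/customer_demographics/batch=$batchTimeMs")
}
{code}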



> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>

[jira] [Commented] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207509#comment-16207509
 ] 

Apache Spark commented on SPARK-22294:
--

User 'ssaavedra' has created a pull request for this issue:
https://github.com/apache/spark/pull/19427

> Reset spark.driver.bindAddress when starting a Checkpoint
> -
>
> Key: SPARK-22294
> URL: https://issues.apache.org/jira/browse/SPARK-22294
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Santiago Saavedra
>  Labels: newbie
>
> On SPARK-4563 support for binding the driver to a different address than the 
> spark.driver.host was provided so that the driver could be running under a 
> differently-routed network. However, when the driver fails, the Checkpoint 
> restoring function expects that the {{spark.driver.bindAddress}} remains the 
> same even if the {{spark.driver.host}} variable may change. That limits the 
> capabilities of recovery under several cluster configurations, and we propose 
> that {{spark.driver.bindAddress}} should have the same replacement behaviour 
> as {{spark.driver.host}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22294:


Assignee: (was: Apache Spark)

> Reset spark.driver.bindAddress when starting a Checkpoint
> -
>
> Key: SPARK-22294
> URL: https://issues.apache.org/jira/browse/SPARK-22294
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Santiago Saavedra
>  Labels: newbie
>
> On SPARK-4563 support for binding the driver to a different address than the 
> spark.driver.host was provided so that the driver could be running under a 
> differently-routed network. However, when the driver fails, the Checkpoint 
> restoring function expects that the {{spark.driver.bindAddress}} remains the 
> same even if the {{spark.driver.host}} variable may change. That limits the 
> capabilities of recovery under several cluster configurations, and we propose 
> that {{spark.driver.bindAddress}} should have the same replacement behaviour 
> as {{spark.driver.host}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22294:


Assignee: Apache Spark

> Reset spark.driver.bindAddress when starting a Checkpoint
> -
>
> Key: SPARK-22294
> URL: https://issues.apache.org/jira/browse/SPARK-22294
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Santiago Saavedra
>Assignee: Apache Spark
>  Labels: newbie
>
> On SPARK-4563 support for binding the driver to a different address than the 
> spark.driver.host was provided so that the driver could be running under a 
> differently-routed network. However, when the driver fails, the Checkpoint 
> restoring function expects that the {{spark.driver.bindAddress}} remains the 
> same even if the {{spark.driver.host}} variable may change. That limits the 
> capabilities of recovery under several cluster configurations, and we propose 
> that {{spark.driver.bindAddress}} should have the same replacement behaviour 
> as {{spark.driver.host}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint

2017-10-17 Thread Santiago Saavedra (JIRA)
Santiago Saavedra created SPARK-22294:
-

 Summary: Reset spark.driver.bindAddress when starting a Checkpoint
 Key: SPARK-22294
 URL: https://issues.apache.org/jira/browse/SPARK-22294
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Affects Versions: 2.2.0, 2.1.0
Reporter: Santiago Saavedra


In SPARK-4563, support for binding the driver to a different address than 
spark.driver.host was added so that the driver could run on a 
differently-routed network. However, when the driver fails, the Checkpoint 
restoring function expects {{spark.driver.bindAddress}} to remain the 
same even though {{spark.driver.host}} may change. That limits the 
capabilities of recovery under several cluster configurations, and we propose 
that {{spark.driver.bindAddress}} should have the same replacement behaviour as 
{{spark.driver.host}}.
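
A rough sketch of the proposed replacement behaviour (illustrative only; this is not the 
actual Checkpoint code, and the {{SPARK_LOCAL_IP}} lookup stands in for whatever source 
provides the current address):

{code}
import org.apache.spark.SparkConf

// Sketch: when rebuilding a SparkConf from a checkpoint, take the bind address from
// the current environment instead of reusing the checkpointed value, mirroring the
// way spark.driver.host is replaced on recovery.
def restoreConf(checkpointed: SparkConf): SparkConf = {
  val conf = checkpointed.clone()
  sys.env.get("SPARK_LOCAL_IP") match {
    case Some(addr) => conf.set("spark.driver.bindAddress", addr)
    case None       => conf.remove("spark.driver.bindAddress")
  }
  conf
}
{code}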



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21459) Some aggregation functions change the case of nested field names

2017-10-17 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207270#comment-16207270
 ] 

David Allsopp commented on SPARK-21459:
---

Just trying to see when this problem was resolved:
* It is present in 1.6.0, as originally reported
* In 1.6.3 (currently the latest 1.6.x version), the {{collect_set}} 
aggregation operation fails with an {{org.apache.spark.sql.AnalysisException: 
No handler for Hive udf class 
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet because: Only 
primitive type arguments are accepted but 
array<struct<Key:string,Value:string>> was passed as parameter 1..;}}
* In 2.0.2 and 2.2.0 the problem has gone.

> Some aggregation functions change the case of nested field names
> 
>
> Key: SPARK-21459
> URL: https://issues.apache.org/jira/browse/SPARK-21459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: David Allsopp
>Priority: Minor
>
> When working with DataFrames with nested schemas, the behavior of the 
> aggregation functions is inconsistent with respect to preserving the case of 
> the nested field names.
> For example, {{first()}} preserves the case of the field names, but 
> {{collect_set()}} and {{collect_list()}} force the field names to lowercase.
> Expected behavior: Field name case is preserved (or is at least consistent 
> and documented)
> Spark-shell session to reproduce:
> {code:java}
> case class Inner(Key:String, Value:String)
> case class Outer(ID:Long, Pairs:Array[Inner])
> val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
> val df = sqlContext.createDataFrame(rdd)
> scala> df
> ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]
> scala> df.groupBy("ID").agg(first("Pairs"))
> ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
> // Note that Key and Value preserve their original case
> scala> df.groupBy("ID").agg(collect_set("Pairs"))
> ... = [ID: bigint, collect_set(Pairs): array<array<struct<key:string,value:string>>>]
> // Note that key and value are now lowercased
> {code}
> Additionally, the column name (generated during aggregation) is inconsistent: 
> {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses 
> in the first name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22224) Override toString of KeyValueGroupedDataset & RelationalGroupedDataset

2017-10-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22224:
---

Assignee: Kent Yao

> Override toString of KeyValueGroupedDataset & RelationalGroupedDataset 
> ---
>
> Key: SPARK-22224
> URL: https://issues.apache.org/jira/browse/SPARK-22224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> scala> val words = spark.read.textFile("README.md").flatMap(_.split(" "))
> words: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> val grouped = words.groupByKey(identity)
> grouped: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = 
> org.apache.spark.sql.KeyValueGroupedDataset@65214862



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22224) Override toString of KeyValueGroupedDataset & RelationalGroupedDataset

2017-10-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22224.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19363
[https://github.com/apache/spark/pull/19363]

> Override toString of KeyValueGroupedDataset & RelationalGroupedDataset 
> ---
>
> Key: SPARK-22224
> URL: https://issues.apache.org/jira/browse/SPARK-22224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> scala> val words = spark.read.textFile("README.md").flatMap(_.split(" "))
> words: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> val grouped = words.groupByKey(identity)
> grouped: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = 
> org.apache.spark.sql.KeyValueGroupedDataset@65214862



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-10-17 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207220#comment-16207220
 ] 

Frank Rosner commented on SPARK-21551:
--

Do you guys mind if I backport this also to 2.0.x, 2.1.x, and 2.2.x? We have 
some jobs that we don't want to upgrade to 2.3.0 but that are failing 
regularly because of this problem.

Which branches would that have to go to? branch-2.0, branch-2.1, and branch-2.2?

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Assignee: peay
>Priority: Critical
> Fix For: 2.3.0
>
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files

2017-10-17 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207217#comment-16207217
 ] 

Frank Rosner commented on SPARK-18649:
--

Looks like in SPARK-21551 they increased the hard-coded timeout to 15 seconds.

> sc.textFile(my_file).collect() raises socket.timeout on large files
> ---
>
> Key: SPARK-18649
> URL: https://issues.apache.org/jira/browse/SPARK-18649
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: PySpark version 1.6.2
>Reporter: Erik Cederstrand
>
> I'm trying to load a file into the driver with this code:
> contents = sc.textFile('hdfs://path/to/big_file.csv').collect()
> Loading into the driver instead of creating a distributed RDD is intentional 
> in this case. The file is ca. 6GB, and I have adjusted driver memory 
> accordingly to fit the local data. After some time, my spark/submitted job 
> crashes with the stack trace below.
> I have traced this to pyspark/rdd.py where the _load_from_socket() method 
> creates a socket with a hard-coded timeout of 3 seconds (this code is also 
> present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value 
> to e.g. 600 lets me read the entire file.
> Is there any reason that this value does not use e.g. the 
> 'spark.network.timeout' setting instead?
> Traceback (most recent call last):
>   File "my_textfile_test.py", line 119, in 
> contents = sc.textFile('hdfs://path/to/file.csv').collect()
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 772, in collect
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 142, in _load_from_socket
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 517, in load_stream
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 511, in loads
>   File "/usr/lib/python2.7/socket.py", line 380, in read
> data = self._sock.recv(left)
> socket.timeout: timed out
> 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
>   Suppressed: java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at 
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>   ... 3 more
> 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> 

[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-10-17 Thread Pranav Singhania (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207204#comment-16207204
 ] 

Pranav Singhania commented on SPARK-16599:
--

[~srowen] I have observed this happening with my code as well. One common 
observation: it always occurred with the last job, which failed every single 
time. Perhaps that helps narrow down the cause of the bug.

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207183#comment-16207183
 ] 

Liang-Chi Hsieh commented on SPARK-22284:
-

Btw, we use {{UnsafeProjection}} in many places for expression codegen. 
Disabling whole-stage codegen doesn't help here; even the non-wholestage code 
path still uses expression codegen.

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes are until Spark 2.1.0 so I don't know if running it on 
> Spark 2.2.0 would fix it. In any case I cannot change the version of Spark 
> since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207146#comment-16207146
 ] 

Peng Meng commented on SPARK-22277:
---

This seems to be a bug. If no one is working on it, I can work on it. A possible workaround is sketched below.
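
Until it is fixed, one possible workaround (a sketch only, shown here in Scala and not 
verified against every affected version; {{result}} stands for the selector output, as in 
the PySpark reproducer below) is to pass the selected-features column through an identity 
UDF, which yields a fresh column without the stale attribute metadata that 
{{DecisionTreeClassifier}} trips over.

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Sketch of a workaround, not a fix: an identity UDF produces a new column with no
// ML attribute metadata, so the tree treats all selected features as continuous.
val dropMeta = udf((v: Vector) => v)
val cleaned = result.withColumn("selectedFeatures", dropMeta(col("selectedFeatures")))
// A DecisionTreeClassifier with featuresCol = "selectedFeatures" can then be fit on `cleaned`.
{code}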

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when Chisquare selector is used v direct 
> feature use in decision tree classifier. 
> In the below code, I have used chisquare selector as a thru' pass but the 
> decision tree classifier is unable to process it. But, it is able to process 
> when the features are used directly.
> The example is pulled out directly from Apache spark python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input 
> will be in the output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
> dt = 
> DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
> model = dt.fit(result)
> except:
> print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> +--+---+--+
> |prediction|clicked|  features|
> +--+---+--+
> |   1.0|1.0|[0.0,0.0,18.0,1.0]|
> |   0.0|0.0|[0.0,1.0,12.0,0.0]|
> |   0.0|0.0|[1.0,0.0,15.0,0.1]|
> 

[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207133#comment-16207133
 ] 

Nick Pentreath commented on SPARK-22289:


I think option (2) is the more general fix here.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22249.
---
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/19494

> UnsupportedOperationException: empty.reduceLeft when caching a dataframe
> 
>
> Key: SPARK-22249
> URL: https://issues.apache.org/jira/browse/SPARK-22249
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: $ uname -a
> Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 
> 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
> $ pyspark --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>   /_/
> 
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92
> Branch 
> Compiled by user jenkins on 2017-06-30T22:58:04Z
> Revision 
> Url 
>Reporter: Andreas Maier
>Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
>
> It seems that the {{isin()}} method with an empty list as argument only 
> works, if the dataframe is not cached. If it is cached, it results in an 
> exception. To reproduce
> {code:java}
> $ pyspark
> >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
> >>> df.where(df["KEY"].isin([])).show()
> +---+
> |KEY|
> +---+
> +---+
> >>> df.cache()
> DataFrame[KEY: string]
> >>> df.where(df["KEY"].isin([])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py",
>  line 336, in show
> print(self._jdf.showString(n, 20))
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
> : java.lang.UnsupportedOperationException: empty.reduceLeft
>   at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
>   at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
>   at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
>   at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 

[jira] [Assigned] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22249:
-

 Assignee: Marco Gaido
Fix Version/s: 2.3.0
   2.2.1

> UnsupportedOperationException: empty.reduceLeft when caching a dataframe
> 
>
> Key: SPARK-22249
> URL: https://issues.apache.org/jira/browse/SPARK-22249
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
> Environment: $ uname -a
> Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 
> 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
> $ pyspark --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>   /_/
> 
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92
> Branch 
> Compiled by user jenkins on 2017-06-30T22:58:04Z
> Revision 
> Url 
>Reporter: Andreas Maier
>Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
>
> It seems that the {{isin()}} method with an empty list as argument only 
> works if the DataFrame is not cached. If it is cached, it results in an 
> exception. To reproduce:
> {code:java}
> $ pyspark
> >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
> >>> df.where(df["KEY"].isin([])).show()
> +---+
> |KEY|
> +---+
> +---+
> >>> df.cache()
> DataFrame[KEY: string]
> >>> df.where(df["KEY"].isin([])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py",
>  line 336, in show
> print(self._jdf.showString(n, 20))
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
> : java.lang.UnsupportedOperationException: empty.reduceLeft
>   at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
>   at 
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
>   at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
>   at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.<init>(InMemoryTableScanExec.scala:111)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>   at 
> org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>   at 

[jira] [Resolved] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19317.
---
Resolution: Duplicate

> UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized
> -
>
> Key: SPARK-19317
> URL: https://issues.apache.org/jira/browse/SPARK-19317
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Priority: Minor
>
> I wish I had a simpler reproducible case to give, but I got the below 
> exception while selecting null values in one of the columns of a DataFrame.
> My client code that failed was 
> df.filter(filterExp).count()
> where the filter expression was something like "someCall.isin() || 
> someCall.isNull". Based on the stack trace, I think the problem may be the 
> "isin()" in the expression, since there are no values to match in it. 
> There were 412 nulls out of 716,000 total rows for the column being filtered.
> The exception seems to indicate that Spark is trying to do reduceLeft on an 
> empty list, but the dataset is not empty.
> {code}
> java.lang.UnsupportedOperationException: empty.reduceLeft
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:137)
>  scala.collection.immutable.List.reduceLeft(List.scala:84) 
> scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) 
> scala.collection.AbstractTraversable.reduce(Traversable.scala:104) 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:90)
>  
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:54)
>  
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:61)
>  
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:54)
>  scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) 
> scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:95)
>  
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:94)
>  
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  scala.collection.immutable.List.foreach(List.scala:381) 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
> scala.collection.immutable.List.flatMap(List.scala:344) 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.<init>(InMemoryTableScanExec.scala:94)
>  
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$6.apply(SparkStrategies.scala:306)
>  
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$6.apply(SparkStrategies.scala:306)
>  
> org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:96)
>  
> org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:302)
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>  scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>  
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>  
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>  scala.collection.Iterator$class.foreach(Iterator.scala:893) 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) 
> scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>  scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
> 

[jira] [Assigned] (SPARK-20992) Link to Nomad scheduler backend in docs

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20992:
-

Assignee: Ben Barnard

> Link to Nomad scheduler backend in docs
> ---
>
> Key: SPARK-20992
> URL: https://issues.apache.org/jira/browse/SPARK-20992
> Project: Spark
>  Issue Type: Documentation
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Ben Barnard
>Assignee: Ben Barnard
>Priority: Trivial
> Fix For: 2.3.0
>
>
> It is convenient to have scheduler backend support for running applications 
> on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so 
> that users can run Spark applications on a Nomad cluster without the need to 
> bring up a Spark Standalone cluster in the Nomad cluster.
> Both client and cluster deploy modes should be supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20992) Link to Nomad scheduler backend in docs

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20992.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19354
[https://github.com/apache/spark/pull/19354]

> Link to Nomad scheduler backend in docs
> ---
>
> Key: SPARK-20992
> URL: https://issues.apache.org/jira/browse/SPARK-20992
> Project: Spark
>  Issue Type: Documentation
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Ben Barnard
>Priority: Trivial
> Fix For: 2.3.0
>
>
> It is convenient to have scheduler backend support for running applications 
> on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so 
> that users can run Spark applications on a Nomad cluster without the need to 
> bring up a Spark Standalone cluster in the Nomad cluster.
> Both client and cluster deploy modes should be supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22288) Tricky interaction between closure-serialization and inheritance results in confusing failure

2017-10-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207116#comment-16207116
 ] 

Sean Owen commented on SPARK-22288:
---

I think this is a Java serialization question rather than a Spark one. Still, I'm curious 
why it fails here, since it looks like the child class has a no-arg constructor and 
is serializable, and it's odd that case 1 works but not case 2. I think this 
might somehow come down to the difference between scala.Serializable and 
java.io.Serializable, as per https://stackoverflow.com/a/27566890/64174

I don't think this is something Spark can do anything about, though. You can 
always use a different serializer such as Kryo.
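
For reference, switching the serializer as suggested is a one-line configuration change. A minimal sketch only (app settings are placeholders; this swaps the data serializer and is not claimed to change how closures themselves are serialized):

{code}
import org.apache.spark.SparkConf

// Switch the data serializer to Kryo (placeholder app name and master).
val conf = new SparkConf()
  .setAppName("serde-test")
  .setMaster("local[4]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Optionally: conf.registerKryoClasses(Array(classOf[SomeFrequentlySerializedClass]))
{code}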

> Tricky interaction between closure-serialization and inheritance results in 
> confusing failure
> -
>
> Key: SPARK-22288
> URL: https://issues.apache.org/jira/browse/SPARK-22288
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> Documenting this since I've run into it a few times; [full repro / discussion 
> here|https://github.com/ryan-williams/spark-bugs/tree/serde].
> Given 3 possible super-classes:
> {code}
> class Super1(n: Int)
> class Super2(n: Int) extends Serializable
> class Super3
> {code}
> A subclass that passes a closure to an RDD operation (e.g. {{map}} or 
> {{filter}}), where the closure references one of the subclass's fields, will 
> throw a {{java.io.InvalidClassException: …; no valid constructor}} exception 
> when the subclass extends {{Super1}} but not {{Super2}} or {{Super3}}. 
> Referencing method-local variables (instead of fields) is fine in all cases:
> {code}
> class App extends Super1(4) with Serializable {
>   val s = "abc"
>   def run(): Unit = {
> val sc = new SparkContext(new SparkConf().set("spark.master", 
> "local[4]").set("spark.app.name", "serde-test"))
> try {
>   sc
> .parallelize(1 to 10)
> .filter(App.fn(_, s))  // danger! closure references `s`, crash 
> ensues
> .collect()  // driver stack-trace points here
> } finally {
>   sc.stop()
> }
>   }
> }
> object App {
>   def main(args: Array[String]): Unit = { new App().run() }
>   def fn(i: Int, s: String): Boolean = i % 2 == 0
> }
> {code}
> The task-failure stack trace looks like:
> {code}
> java.io.InvalidClassException: com.MyClass; no valid constructor
>   at 
> java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
>   at 
> java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1782)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
> {code}
> and a driver stack-trace will point to the first line that initiates a Spark 
> job that exercises the closure/RDD-operation in question.
> Not sure how much this should be considered a problem with Spark's 
> closure-serialization logic vs. Java serialization, but maybe if the former 
> gets looked at or improved (e.g. with 2.12 support), this kind of interaction 
> can be improved upon.
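
As a concrete illustration of the method-local-variable workaround mentioned in the description, a hedged sketch reusing the {{Super1}}/{{App}} names from the repro (not taken from the linked repository):

{code}
import org.apache.spark.{SparkConf, SparkContext}

class App extends Super1(4) with Serializable {
  val s = "abc"
  def run(): Unit = {
    val localS = s  // copy the field into a local val: the closure below then
                    // captures only this String, not the App instance and its
                    // non-serializable Super1 parent
    val sc = new SparkContext(
      new SparkConf().set("spark.master", "local[4]").set("spark.app.name", "serde-test"))
    try {
      sc.parallelize(1 to 10)
        .filter(App.fn(_, localS))
        .collect()
    } finally {
      sc.stop()
    }
  }
}
{code}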



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207115#comment-16207115
 ] 

yuhao yang commented on SPARK-22289:


cc [~yanboliang] [~dbtsai]

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
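
The stack trace below is then triggered by fitting the model and writing it out; a hedged sketch of the triggering call, not taken from the report (the training DataFrame and output path are placeholders):

{code}
// trainingData and the output path are placeholders for this sketch.
val model = calibrator.fit(trainingData)
model.write.overwrite().save("/tmp/calibrated-lr")  // fails while JSON-encoding the matrix-valued bound params
{code}
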
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21459) Some aggregation functions change the case of nested field names

2017-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21459.
---
Resolution: Cannot Reproduce

> Some aggregation functions change the case of nested field names
> 
>
> Key: SPARK-21459
> URL: https://issues.apache.org/jira/browse/SPARK-21459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: David Allsopp
>Priority: Minor
>
> When working with DataFrames with nested schemas, the behavior of the 
> aggregation functions is inconsistent with respect to preserving the case of 
> the nested field names.
> For example, {{first()}} preserves the case of the field names, but 
> {{collect_set()}} and {{collect_list()}} force the field names to lowercase.
> Expected behavior: Field name case is preserved (or is at least consistent 
> and documented)
> Spark-shell session to reproduce:
> {code:java}
> case class Inner(Key:String, Value:String)
> case class Outer(ID:Long, Pairs:Array[Inner])
> val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
> val df = sqlContext.createDataFrame(rdd)
> scala> df
> ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]
> scala> df.groupBy("ID").agg(first("Pairs"))
> ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
> // Note that Key and Value preserve their original case
> scala> df.groupBy("ID").agg(collect_set("Pairs"))
> ... = [ID: bigint, collect_set(Pairs): array<array<struct<key:string,value:string>>>]
> // Note that key and value are now lowercased
> {code}
> Additionally, the column name (generated during aggregation) is inconsistent: 
> {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses 
> in the first name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS

2017-10-17 Thread Bang Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207098#comment-16207098
 ] 

Bang Xiao commented on SPARK-21697:
---

In the case described above, I added "spark.jars file:///xxx.jar" to 
conf/spark-defaults.conf; then, when I use "add jar" through the Spark SQL CLI in 
yarn-client mode, the error occurs.
If I instead add "spark.jars file:///xxx.jar,hdfs://xxx.jar" to 
conf/spark-defaults.conf, I can run "add jar hdfs:///.jar" successfully 
through the Spark SQL CLI in yarn-client mode.
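
For illustration, the working setup described above would look roughly like this in conf/spark-defaults.conf (the JAR paths are placeholders):

{code}
# conf/spark-defaults.conf
# Listing both a local and an HDFS copy of the JAR is what made
# "ADD JAR hdfs://..." succeed from the Spark SQL CLI in yarn-client mode.
spark.jars  file:///path/to/xxx.jar,hdfs:///path/to/xxx.jar
{code}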

> NPE & ExceptionInInitializerError trying to load UTF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS is triggering an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} is 
> happening on a classpath that includes the HDFS JAR; this triggers a download 
> of the JAR, which in turn forces commons-logging in, and, as that is not 
> initialized yet, it NPEs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22264) History server will be unavailable if there is an event log file with large size

2017-10-17 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang resolved SPARK-22264.
--
Resolution: Duplicate

> History server will be unavailable if there is an event log file with large 
> size
> 
>
> Key: SPARK-22264
> URL: https://issues.apache.org/jira/browse/SPARK-22264
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: not-found.png
>
>
> The history server becomes unavailable if there is an event log file so large 
> that replaying it takes too long.
> *We can fix this by adding a timeout for event log replaying.*
> *Here is an example:*
> Every application submitted after the restart cannot open its history UI.
> !not-found.png!
> *In the event log directory we can find an event log file bigger than 
> 130GB.*
> {code:java}
> hadoop *144149840801* 2017-08-29 14:03 
> /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> {code}
> *From jstack and the server log we can see the replay task blocked on this 
> event log:*
> *server log:*
> {code:java}
> 2017-10-12,16:00:12,151 INFO 
> org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress
> 2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: 
> Begin to replay 
> hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress!
> {code}
> *jstack*
> {code:java}
> "log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 
> runnable [0x7f0f4f6f5000]
>java.lang.Thread.State: RUNNABLE
> at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method)
> at 
> net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37)
> at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
> at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:154)
> at java.io.BufferedReader.readLine(BufferedReader.java:317)
> - locked <0x0005f0096948> (a java.io.InputStreamReader)
> at java.io.BufferedReader.readLine(BufferedReader.java:382)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
> at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
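
A rough, purely illustrative sketch of the proposed timeout ({{replayLog}} and the 30-minute limit are hypothetical placeholders, not existing FsHistoryProvider code):

{code}
import java.util.concurrent.TimeoutException

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

// Placeholder standing in for the existing replay logic.
def replayLog(path: String): Unit = ???

def replayWithTimeout(path: String, limit: FiniteDuration = 30.minutes): Unit =
  try {
    Await.result(Future(replayLog(path)), limit)
  } catch {
    case _: TimeoutException =>
      // Give up on this log instead of blocking the whole checkForLogs pass.
      // Note: the future itself keeps running; real code would also need cancellation.
      println(s"Replay of $path exceeded $limit, skipping")
  }
{code}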



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22293) Avoid unnecessary traversal in ResolveReferences

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22293:


Assignee: (was: Apache Spark)

> Avoid unnecessary traversal in ResolveReferences
> 
>
> Key: SPARK-22293
> URL: https://issues.apache.org/jira/browse/SPARK-22293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> We don't need to traverse the child expressions to determine whether there is 
> a `Star` when expanding a `Star` expression.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22293) Avoid unnecessary traversal in ResolveReferences

2017-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22293:


Assignee: Apache Spark

> Avoid unnecessary traversal in ResolveReferences
> 
>
> Key: SPARK-22293
> URL: https://issues.apache.org/jira/browse/SPARK-22293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>
> We don't need to traverse the child expressions to determine whether there is 
> a `Star` when expanding a `Star` expression.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22293) Avoid unnecessary traversal in ResolveReferences

2017-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207093#comment-16207093
 ] 

Apache Spark commented on SPARK-22293:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19511

> Avoid unnecessary traversal in ResolveReferences
> 
>
> Key: SPARK-22293
> URL: https://issues.apache.org/jira/browse/SPARK-22293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> We don't need to traverse the child expressions to determine whether there is 
> a `Star` when expanding a `Star` expression.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22293) Avoid unnecessary traversal in ResolveReferences

2017-10-17 Thread Xianyang Liu (JIRA)
Xianyang Liu created SPARK-22293:


 Summary: Avoid unnecessary traversal in ResolveReferences
 Key: SPARK-22293
 URL: https://issues.apache.org/jira/browse/SPARK-22293
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xianyang Liu


We don't need to traverse the child expressions to determine whether there is a 
`Star` when expanding a `Star` expression.
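
A hedged sketch of the idea (simplified helpers, not the actual analyzer code): if, as argued above, a {{Star}} only needs to be detected at the top level of the expressions being expanded, the recursive collect over all children can be replaced with a shallow check:

{code}
import org.apache.spark.sql.catalyst.analysis.Star
import org.apache.spark.sql.catalyst.expressions.Expression

// Before: walks every node of every expression tree.
def containsStarDeep(exprs: Seq[Expression]): Boolean =
  exprs.exists(_.collect { case _: Star => true }.nonEmpty)

// After: only the expressions themselves need to be inspected.
def containsStar(exprs: Seq[Expression]): Boolean =
  exprs.exists(_.isInstanceOf[Star])
{code}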



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


