[jira] [Updated] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19779:
-
Affects Version/s: 2.1.1
   2.0.3

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.3, 2.1.1, 2.2.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but one 
> problem remains: the tmp file written for a delta file is still left behind 
> in HDFS. Structured Streaming never deletes this tmp file generated during a 
> restart, so we need to delete it after the streaming job is restarted.
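For illustration only, a minimal sketch (not the actual fix; the checkpoint layout and 
the ".tmp" suffix are assumptions) of how leftover temporary delta files could be removed 
with the Hadoop FileSystem API after a restart:

{code}
// Hypothetical cleanup sketch: delete leftover ".tmp" state-store delta files
// under a checkpoint directory after a restart. Directory layout and file
// suffix are assumptions, not taken from the actual fix.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def cleanupTmpDeltaFiles(checkpointDir: String): Unit = {
  val dir = new Path(checkpointDir)
  val fs = FileSystem.get(dir.toUri, new Configuration())
  val files = fs.listFiles(dir, /* recursive = */ true)
  while (files.hasNext) {
    val status = files.next()
    if (status.getPath.getName.endsWith(".tmp")) {
      fs.delete(status.getPath, /* recursive = */ false)
    }
  }
}
{code}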






[jira] [Assigned] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19792:


Assignee: Apache Spark

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>Assignee: Apache Spark
>
> On the Master page of the Spark web UI there are two tables, Running 
> Applications and Completed Applications, and both have a column named 
> “Memory per Node”. I think this is misleading, because a node may run more 
> than one executor. The column should be renamed “Memory per Executor”; 
> otherwise it is easy for users to misunderstand it.






[jira] [Assigned] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19792:


Assignee: (was: Apache Spark)

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>
> On the Master page of the Spark web UI there are two tables, Running 
> Applications and Completed Applications, and both have a column named 
> “Memory per Node”. I think this is misleading, because a node may run more 
> than one executor. The column should be renamed “Memory per Executor”; 
> otherwise it is easy for users to misunderstand it.






[jira] [Commented] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891731#comment-15891731
 ] 

Apache Spark commented on SPARK-19792:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/17132

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>
> On the Master page of the Spark web UI there are two tables, Running 
> Applications and Completed Applications, and both have a column named 
> “Memory per Node”. I think this is misleading, because a node may run more 
> than one executor. The column should be renamed “Memory per Executor”; 
> otherwise it is easy for users to misunderstand it.






[jira] [Updated] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19779:
-
Affects Version/s: (was: 2.1.0)
   2.2.0

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but one 
> problem remains: the tmp file written for a delta file is still left behind 
> in HDFS. Structured Streaming never deletes this tmp file generated during a 
> restart, so we need to delete it after the streaming job is restarted.






[jira] [Comment Edited] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891722#comment-15891722
 ] 

Shixiong Zhu edited comment on SPARK-19788 at 3/2/17 7:04 AM:
--

I remember that it's because we want to support both Scala and Python (maybe 
also R). If it accepts user-defined types, we don't know how to convert Python 
options to Scala options.


was (Author: zsxwing):
I remember that we want to support both Scala and Python. If it accepts 
user-defined types, we don't know how to convert Python options to Scala 
options.

> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources/sinks whose configuration looks very 
> different from Kafka, FileSystem, etc. 
> The expected type of a configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type (for example, the one I am working on).
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]. My suggestion is that we accept 
> Map[String, Any], and that the type of 'parameters' in 
> SourceProvider.createSource also become Map[String, Any]; this would give 
> the user much more flexibility.
> The drawback is that it is a breaking change (we can mitigate this by 
> deprecating the current API and progressively evolving to the new one if the 
> proposal is accepted).
> [~zsxwing] what do you think?
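A sketch of the workaround this limitation forces today (the source name and keys are 
illustrative, and {{spark}} is assumed to be a SparkSession): flatten the nested 
configuration into plain String options before handing it to the reader.

{code}
// Flatten a nested config into dotted String keys, since option(...) accepts
// only String/Boolean/Long/Double values and options(...) a Map[String, String].
val nested: Map[String, Map[String, String]] =
  Map("endpoint" -> Map("host" -> "localhost", "port" -> "9092"))

val flattened: Map[String, String] =
  for {
    (outer, inner) <- nested
    (key, value)   <- inner
  } yield s"$outer.$key" -> value

val reader = spark.readStream
  .format("my-custom-source")   // hypothetical source name
  .options(flattened)
{code}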






[jira] [Commented] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891722#comment-15891722
 ] 

Shixiong Zhu commented on SPARK-19788:
--

I remember that we want to support both Scala and Python. If it accepts 
user-defined types, we don't know how to convert Python options to Scala 
options.

> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources/sinks whose configuration looks very 
> different from Kafka, FileSystem, etc. 
> The expected type of a configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type (for example, the one I am working on).
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]. My suggestion is that we accept 
> Map[String, Any], and that the type of 'parameters' in 
> SourceProvider.createSource also become Map[String, Any]; this would give 
> the user much more flexibility.
> The drawback is that it is a breaking change (we can mitigate this by 
> deprecating the current API and progressively evolving to the new one if the 
> proposal is accepted).
> [~zsxwing] what do you think?






[jira] [Resolved] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19734.
-
   Resolution: Fixed
 Assignee: Mark Grover
Fix Version/s: 2.2.0

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Assignee: Mark Grover
>Priority: Minor
>  Labels: documentation, easyfix
> Fix For: 2.2.0
>
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keyword arguments, so following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or the output of {{help}} in Python results in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}






[jira] [Assigned] (SPARK-19583) CTAS for data source tables with an created location does not work

2017-03-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19583:
---

Assignee: Song Jun

> CTAS for data source tables with an created location does not work
> --
>
> Key: SPARK-19583
> URL: https://issues.apache.org/jira/browse/SPARK-19583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> {noformat}
> spark.sql(
>   s"""
>  |CREATE TABLE t
>  |USING parquet
>  |PARTITIONED BY(a, b)
>  |LOCATION '$dir'
>  |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d
>""".stripMargin)
> {noformat}
> Failed with the error message:
> {noformat}
> path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
> org.apache.spark.sql.AnalysisException: path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)
> {noformat}






[jira] [Resolved] (SPARK-19583) CTAS for data source tables with an created location does not work

2017-03-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19583.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16938
[https://github.com/apache/spark/pull/16938]

> CTAS for data source tables with an created location does not work
> --
>
> Key: SPARK-19583
> URL: https://issues.apache.org/jira/browse/SPARK-19583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
> Fix For: 2.2.0
>
>
> {noformat}
> spark.sql(
>   s"""
>  |CREATE TABLE t
>  |USING parquet
>  |PARTITIONED BY(a, b)
>  |LOCATION '$dir'
>  |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d
>""".stripMargin)
> {noformat}
> Failed with the error message:
> {noformat}
> path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
> org.apache.spark.sql.AnalysisException: path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)
> {noformat}






[jira] [Assigned] (SPARK-19745) SVCAggregator serializes coefficients

2017-03-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19745:
---

Shepherd: Yanbo Liang
Assignee: Seth Hendrickson

> SVCAggregator serializes coefficients
> -
>
> Key: SPARK-19745
> URL: https://issues.apache.org/jira/browse/SPARK-19745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Similar to [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], 
> the SVC aggregator captures the coefficients in the class closure, and 
> therefore ships them around during optimization. We can prevent this with a 
> bit of reorganization of the aggregator class.
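For context, a generic sketch of the pattern such a reorganization typically uses 
(illustrative names only, not the real SVCAggregator): the coefficients are broadcast and 
passed to the aggregator explicitly, so the task closure no longer captures them from an 
enclosing class.

{code}
// Illustrative pattern, not the actual ML code: keep the coefficients in a
// Broadcast and read them via .value on the executors instead of capturing a
// field of the enclosing optimizer object in the task closure.
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

class HingeAggregator(bcCoefficients: Broadcast[Array[Double]]) extends Serializable {
  private var lossSum = 0.0
  def add(label: Double, features: Array[Double]): this.type = {
    val coef = bcCoefficients.value                       // resolved executor-side
    val margin = features.zip(coef).map { case (x, w) => x * w }.sum
    lossSum += math.max(0.0, 1.0 - (2.0 * label - 1.0) * margin)
    this
  }
  def merge(other: HingeAggregator): this.type = { lossSum += other.lossSum; this }
  def loss: Double = lossSum
}

def totalLoss(data: RDD[(Double, Array[Double])], bc: Broadcast[Array[Double]]): Double =
  data.treeAggregate(new HingeAggregator(bc))(
    seqOp = (agg, point) => agg.add(point._1, point._2),
    combOp = (a, b) => a.merge(b)
  ).loss
{code}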






[jira] [Created] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-01 Thread liuxian (JIRA)
liuxian created SPARK-19792:
---

 Summary: In the Master Page,the column named “Memory per Node” ,I 
think  it is not all right
 Key: SPARK-19792
 URL: https://issues.apache.org/jira/browse/SPARK-19792
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: liuxian


On the Master page of the Spark web UI there are two tables, Running 
Applications and Completed Applications, and both have a column named 
“Memory per Node”. I think this is misleading, because a node may run more 
than one executor. The column should be renamed “Memory per Executor”; 
otherwise it is easy for users to misunderstand it.






[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

2017-03-01 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891699#comment-15891699
 ] 

Kazuaki Ishizaki commented on SPARK-19503:
--

If it is fine to leave the global sort intact for now, should we prune only the 
local sort (i.e. {{Sort(_, false, _)}})?
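A minimal sketch of the kind of rule being discussed (written against the Catalyst API, 
but not an actual Spark rule): drop a partition-local sort under a grouping-free aggregate 
such as {{count()}}, since the ordering cannot affect the result.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule only: an aggregate with no grouping expressions ignores the
// ordering of its child, so a local sort (global = false) below it can be removed.
object PruneLocalSortUnderAggregate extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(groupingExprs, _, Sort(_, false, child)) if groupingExprs.isEmpty =>
      agg.copy(child = child)
  }
}
{code}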

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
>Reporter: R
>Priority: Minor
>  Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort and then the count! This is wasteful, since the 
> sort is not required here, and it makes me wonder how smart the algebraic 
> optimiser really is. The data may already be partitioned with a known count 
> (such as parquet files), and we should not shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder 
> what else it is missing, especially in more complex operations.






[jira] [Commented] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891692#comment-15891692
 ] 

Apache Spark commented on SPARK-19766:
--

User 'stanzhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/17131

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Critical
>  Labels: Correctness
> Fix For: 2.1.1, 2.2.0
>
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The result, which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}






[jira] [Resolved] (SPARK-13931) Resolve stage hanging up problem in a particular case

2017-03-01 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-13931.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Resolve stage hanging up problem in a particular case
> -
>
> Key: SPARK-13931
> URL: https://issues.apache.org/jira/browse/SPARK-13931
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1
>Reporter: ZhengYaofeng
> Fix For: 2.2.0
>
>
> Suppose the following steps:
> 1. Speculation is enabled in the application. 
> 2. Run the app and suppose the last task of shuffleMapStage 1 finishes. To 
> get the record straight: from the DAG's point of view this stage really 
> finishes, and from the TaskSetManager's point of view the variable 'isZombie' 
> is set to true, but runningTasksSet isn't empty because of speculation.
> 3. Suddenly, executor 3 is lost. On receiving this signal, the TaskScheduler 
> invokes the executorLost functions of all of rootPool's taskSetManagers, and 
> the DAG removes all of this executor's outputLocs.
> 4. The TaskSetManager adds all of this executor's tasks to pendingTasks and 
> tells the DAG they will be resubmitted (attention: possibly not on time).
> 5. The DAG starts to submit a new waiting stage, say shuffleMapStage 2, and 
> finds that shuffleMapStage 1 is its missing parent because some outputLocs 
> were removed due to the lost executor. The DAG then submits shuffleMapStage 1 
> again.
> 6. The DAG still receives Task 'Resubmitted' signals from the old 
> taskSetManager and increases the number of pendingTasks of shuffleMapStage 1 
> each time. However, the old taskSetManager won't offer new tasks to submit 
> because its variable 'isZombie' is set to true.
> 7. Finally, shuffleMapStage 1 never finishes in the DAG, together with all 
> stages depending on it.






[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles

2017-03-01 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891658#comment-15891658
 ] 

Kazuaki Ishizaki commented on SPARK-19468:
--

Interesting.

For {{val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1"))}}, I thought 
that the shuffle occurs because the optimized logical plan has a `Project` (I 
know this should not, by itself, cause a shuffle).

{code}
== Optimized Logical Plan ==
Join Inner, (_1#105._1 = _2#106._1)
:- Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105]
:  +- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 
replicas)
:+- *Sort [_1#83 ASC NULLS FIRST], false, 0
:   +- Exchange hashpartitioning(_1#83, 5)
:  +- LocalTableScan [_1#83, _2#84]
+- Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106]
   +- InMemoryRelation [_1#100, _2#101], true, 1, StorageLevel(disk, 1 
replicas)
 +- *Sort [_1#83 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(_1#83, 5)
   +- LocalTableScan [_1#83, _2#84]

== Physical Plan ==
*SortMergeJoin [_1#105._1], [_2#106._1], Inner
:- *Sort [_1#105._1 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(_1#105._1, 5)
: +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105]
:+- InMemoryTableScan [_1#83, _2#84]
:  +- InMemoryRelation [_1#83, _2#84], true, 1, 
StorageLevel(disk, 1 replicas)
:+- *Sort [_1#83 ASC NULLS FIRST], false, 0
:   +- Exchange hashpartitioning(_1#83, 5)
:  +- LocalTableScan [_1#83, _2#84]
+- *Sort [_2#106._1 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(_2#106._1, 5)
  +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106]
 +- InMemoryTableScan [_1#100, _2#101]
   +- InMemoryRelation [_1#100, _2#101], true, 1, 
StorageLevel(disk, 1 replicas)
 +- *Sort [_1#83 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(_1#83, 5)
   +- LocalTableScan [_1#83, _2#84]
{code}

For {{val joined1 = ds1.join(ds2, ds1("_1") === ds2("_1"))}}, I saw that a 
shuffle still occurs even though we do not see a `Project` in the optimized 
logical plan.

{code}
== Optimized Logical Plan ==
Join Inner, (_1#83 = _1#100)
:- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 replicas)
: +- *Sort [_1#83 ASC NULLS FIRST], false, 0
:+- Exchange hashpartitioning(_1#83, 5)
:   +- LocalTableScan [_1#83, _2#84]
+- InMemoryRelation [_1#100, _2#101], true, 1, StorageLevel(disk, 1 
replicas)
  +- *Sort [_1#83 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(_1#83, 5)
+- LocalTableScan [_1#83, _2#84]

== Physical Plan ==
*SortMergeJoin [_1#83], [_1#100], Inner
:- InMemoryTableScan [_1#83, _2#84]
: +- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 
replicas)
:   +- *Sort [_1#83 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(_1#83, 5)
: +- LocalTableScan [_1#83, _2#84]
+- *Sort [_1#100 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(_1#100, 5)
  +- InMemoryTableScan [_1#100, _2#101]
+- InMemoryRelation [_1#100, _2#101], true, 1, 
StorageLevel(disk, 1 replicas)
  +- *Sort [_1#83 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(_1#83, 5)
+- LocalTableScan [_1#83, _2#84]
{code}

In summary, I do not know why this unnecessary shuffle is inserted.
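For reference, a sketch of the two cases compared above (assumes a SparkSession named 
{{spark}} and {{spark.sql.autoBroadcastJoinThreshold}} set to -1, as in the original 
report):

{code}
import org.apache.spark.storage.StorageLevel
import spark.implicits._

val ds1 = Seq((0, 0), (1, 1)).toDS()
  .repartition($"_1").sortWithinPartitions($"_1").persist(StorageLevel.DISK_ONLY)
val ds2 = Seq((0, 0), (1, 1)).toDS()
  .repartition($"_1").sortWithinPartitions($"_1").persist(StorageLevel.DISK_ONLY)

// joinWith wraps each side in named_struct(...), and the plan above shows both
// sides being exchanged again on the struct field:
ds1.joinWith(ds2, ds1("_1") === ds2("_1")).explain()

// The DataFrame-style join on the same column keeps the cached partitioning for
// the left side; only the right side is re-shuffled in the reported plan:
ds1.join(ds2, ds1("_1") === ds2("_1")).explain()
{code}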


> Dataset slow because of unnecessary shuffles
> 
>
> Key: SPARK-19468
> URL: https://issues.apache.org/jira/browse/SPARK-19468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>
> we noticed that some algos we ported from rdd to dataset are significantly 
> slower, and the main reason seems to be more shuffles that we successfully 
> avoid for rdds by careful partitioning. this seems to be dataset specific as 
> it works ok for dataframe.
> see also here:
> http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/
> it kind of boils down to this... if i partition and sort dataframes that get 
> used for joins repeatedly i can avoid shuffles:
> {noformat}
> System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1")
> val df1 = Seq((0, 0), (1, 1)).toDF("key", "value")
>   
> .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY)
> val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2")
>   
> .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY)
> val joined = df1.join(df2, col("key") === col("key2"))
> joined.explain
> == Physical Plan ==
> *SortMer

[jira] [Resolved] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.

2017-03-01 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19777.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.2.0

> Scan runningTasksSet when check speculatable tasks in TaskSetManager.
> -
>
> Key: SPARK-19777
> URL: https://issues.apache.org/jira/browse/SPARK-19777
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Minor
> Fix For: 2.2.0
>
>
> When checking for speculatable tasks in TaskSetManager, scan only 
> runningTasksSet instead of scanning all taskInfos.






[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled

2017-03-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891610#comment-15891610
 ] 

Xuefu Zhang commented on SPARK-18769:
-

Just as an FYI, the problem is real: allocation attempts are made as long as 
there are pending tasks (which is in line with dynamic allocation). However, 
this is pointless when all containers are already taken and further attempts 
are very likely in vain; it creates pressure on event processing in the Spark 
driver and may also have an impact on the YARN RM.

I don't know what the best solution for this is. Maybe Spark can just try to 
allocate everything it needs upfront and update (tune down) the allocation 
request as the job progresses, when necessary. 

As a workaround, we have to set an artificial upper limit (something like 
2500), which helps a lot.
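For reference, the workaround described above is just an explicit cap on dynamic 
allocation; a sketch (the value 2500 is the figure quoted above, not a recommendation):

{code}
// Set in spark-defaults.conf or via --conf on spark-submit:
//   spark.dynamicAllocation.enabled=true
//   spark.dynamicAllocation.maxExecutors=2500
// or programmatically:
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("capped-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "2500")
  .getOrCreate()
{code}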

>  Spark to be smarter about what the upper bound is and to restrict number of 
> executor when dynamic allocation is enabled
> 
>
> Key: SPARK-18769
> URL: https://issues.apache.org/jira/browse/SPARK-18769
> Project: Spark
>  Issue Type: New Feature
>Reporter: Neerja Khattar
>
> Currently, when dynamic allocation is enabled, the maximum number of 
> executors is effectively infinite, and Spark creates so many executors that 
> it can even exceed the YARN NodeManager memory limit and vcores.
> There should be a check so that it does not exceed the YARN resource limits.






[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2017-03-01 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891575#comment-15891575
 ] 

Jiang Xingbo commented on SPARK-18389:
--

I've just figured out a way to work this out and will try to submit a PR later.

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}
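A generic sketch of the check such a fix needs (not the actual PR; {{viewDependencies}} is 
a hypothetical lookup from a view name to the views its definition references): when 
altering a view, reject the new definition if the view can reach itself.

{code}
// Hypothetical cycle detection for ALTER VIEW: depth-first search over the
// views referenced by the new definition.
def createsCycle(alteredView: String, viewDependencies: String => Seq[String]): Boolean = {
  def reaches(current: String, seen: Set[String]): Boolean =
    viewDependencies(current).exists { dep =>
      dep == alteredView || (!seen(dep) && reaches(dep, seen + dep))
    }
  reaches(alteredView, Set(alteredView))
}
{code}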






[jira] [Comment Edited] (SPARK-19741) ClassCastException when using Dataset with type containing value types

2017-03-01 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891548#comment-15891548
 ] 

Kazuaki Ishizaki edited comment on SPARK-19741 at 3/2/17 3:17 AM:
--

I am not sure whether my sample program reproduced the original issue, because 
the issue title says {{ClassCastException}} while I got a different exception.
Otherwise, I agree that this is similar to SPARK-17368.


was (Author: kiszk):
I am not sure whether my sample program reproduced the original issue, because 
the issue title says {{ClassCastException}} while I got a different exception.

> ClassCastException when using Dataset with type containing value types
> --
>
> Key: SPARK-19741
> URL: https://issues.apache.org/jira/browse/SPARK-19741
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
> Environment: JDK 8 on Ubuntu
> Scala 2.11.8
> Spark 2.1.0
>Reporter: Lior Regev
>
> The following code reproduces the error
> {code}
> final case class Foo(id: Int) extends AnyVal
> final case class Bar(foo: Foo)
> val foo = Foo(5)
> val bar = Bar(foo)
> import spark.implicits._
> spark.sparkContext.parallelize(0 to 10).toDS().map(_ => bar).collect()
> {code}






[jira] [Commented] (SPARK-19741) ClassCastException when using Dataset with type containing value types

2017-03-01 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891548#comment-15891548
 ] 

Kazuaki Ishizaki commented on SPARK-19741:
--

I am not sure whether my sample program reproduced the original issue, because 
the issue title says {{ClassCastException}} while I got a different exception.

> ClassCastException when using Dataset with type containing value types
> --
>
> Key: SPARK-19741
> URL: https://issues.apache.org/jira/browse/SPARK-19741
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
> Environment: JDK 8 on Ubuntu
> Scala 2.11.8
> Spark 2.1.0
>Reporter: Lior Regev
>
> The following code reproduces the error
> {code}
> final case class Foo(id: Int) extends AnyVal
> final case class Bar(foo: Foo)
> val foo = Foo(5)
> val bar = Bar(foo)
> import spark.implicits._
> spark.sparkContext.parallelize(0 to 10).toDS().map(_ => bar).collect()
> {code}






[jira] [Comment Edited] (SPARK-19741) ClassCastException when using Dataset with type containing value types

2017-03-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891463#comment-15891463
 ] 

Hyukjin Kwon edited comment on SPARK-19741 at 3/2/17 2:00 AM:
--

I just tried the code above on the current master and it still does not work, 
whereas the reproducer in SPARK-17368 works fine. I tested both via an IDE, not 
the Spark REPL.

Apparently the two JIRAs are similar. Maybe we should mark this one as a 
duplicate and reopen SPARK-17368 (if both are the same issue but only the 
specific case was resolved), or describe the differences between them in more 
detail.


was (Author: hyukjin.kwon):
I just tried the code above on the current master and it still does not work, 
whereas the reproducer in SPARK-17368 works fine. I tested both via an IDE, not 
the Spark REPL.

Apparently the two JIRAs are similar. Maybe we should mark this one as a 
duplicate or describe the differences between them in more detail.

> ClassCastException when using Dataset with type containing value types
> --
>
> Key: SPARK-19741
> URL: https://issues.apache.org/jira/browse/SPARK-19741
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
> Environment: JDK 8 on Ubuntu
> Scala 2.11.8
> Spark 2.1.0
>Reporter: Lior Regev
>
> The following code reproduces the error
> {code}
> final case class Foo(id: Int) extends AnyVal
> final case class Bar(foo: Foo)
> val foo = Foo(5)
> val bar = Bar(foo)
> import spark.implicits._
> spark.sparkContext.parallelize(0 to 10).toDS().map(_ => bar).collect()
> {code}






[jira] [Commented] (SPARK-19741) ClassCastException when using Dataset with type containing value types

2017-03-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891463#comment-15891463
 ] 

Hyukjin Kwon commented on SPARK-19741:
--

I just tried the code above on the current master and it still does not work, 
whereas the reproducer in SPARK-17368 works fine. I tested both via an IDE, not 
the Spark REPL.

Apparently the two JIRAs are similar. Maybe we should mark this one as a 
duplicate or describe the differences between them in more detail.

> ClassCastException when using Dataset with type containing value types
> --
>
> Key: SPARK-19741
> URL: https://issues.apache.org/jira/browse/SPARK-19741
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
> Environment: JDK 8 on Ubuntu
> Scala 2.11.8
> Spark 2.1.0
>Reporter: Lior Regev
>
> The following code reproduces the error
> {code}
> final case class Foo(id: Int) extends AnyVal
> final case class Bar(foo: Foo)
> val foo = Foo(5)
> val bar = Bar(foo)
> import spark.implicits._
> spark.sparkContext.parallelize(0 to 10).toDS().map(_ => bar).collect()
> {code}






[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891422#comment-15891422
 ] 

Takeshi Yamamuro commented on SPARK-19754:
--

cc: [~cloud_fan]

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document and then casting it to an 
> integer, Hive simply truncates it, while Spark rounds up when the first 
> decimal digit is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}
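One workaround worth trying (a sketch; that the explicit DOUBLE cast truncates toward 
zero here, matching Hive, is an assumption to verify): cast the JSON string to DOUBLE 
first and then to INT.

{code}
// Hypothetical workaround: go through DOUBLE explicitly instead of casting the
// string returned by get_json_object directly to INT.
spark.sql(
  """SELECT CAST(CAST(get_json_object('{"a": 1.6}', '$.a') AS DOUBLE) AS INT) AS a"""
).show()
{code}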






[jira] [Commented] (SPARK-19741) ClassCastException when using Dataset with type containing value types

2017-03-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891348#comment-15891348
 ] 

Sean Owen commented on SPARK-19741:
---

Duplicate of SPARK-17368 ?

> ClassCastException when using Dataset with type containing value types
> --
>
> Key: SPARK-19741
> URL: https://issues.apache.org/jira/browse/SPARK-19741
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
> Environment: JDK 8 on Ubuntu
> Scala 2.11.8
> Spark 2.1.0
>Reporter: Lior Regev
>
> The following code reproduces the error
> {code}
> final case class Foo(id: Int) extends AnyVal
> final case class Bar(foo: Foo)
> val foo = Foo(5)
> val bar = Bar(foo)
> import spark.implicits._
> spark.sparkContext.parallelize(0 to 10).toDS().map(_ => bar).collect()
> {code}






[jira] [Assigned] (SPARK-19791) Add doc and example for fpgrowth

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19791:


Assignee: (was: Apache Spark)

> Add doc and example for fpgrowth
> 
>
> Key: SPARK-19791
> URL: https://issues.apache.org/jira/browse/SPARK-19791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Add a new section for fpm
> Add Example for FPGrowth in scala and Java.
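A minimal Scala sketch of the kind of example being requested (column names and parameter 
values are illustrative; assumes a SparkSession named {{spark}}):

{code}
import org.apache.spark.ml.fpm.FPGrowth

// Each row is a transaction: an array of item identifiers.
val dataset = spark.createDataFrame(Seq(
  "1 2 5", "1 2 3 5", "1 2"
).map(_.split(" ")).map(Tuple1.apply)).toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.freqItemsets.show()        // frequent itemsets with their frequencies
model.associationRules.show()    // rules derived from the frequent itemsets
{code}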






[jira] [Assigned] (SPARK-19791) Add doc and example for fpgrowth

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19791:


Assignee: Apache Spark

> Add doc and example for fpgrowth
> 
>
> Key: SPARK-19791
> URL: https://issues.apache.org/jira/browse/SPARK-19791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Apache Spark
>
> Add a new section for fpm
> Add Example for FPGrowth in scala and Java.






[jira] [Commented] (SPARK-19791) Add doc and example for fpgrowth

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891337#comment-15891337
 ] 

Apache Spark commented on SPARK-19791:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17130

> Add doc and example for fpgrowth
> 
>
> Key: SPARK-19791
> URL: https://issues.apache.org/jira/browse/SPARK-19791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Add a new section for fpm
> Add Example for FPGrowth in scala and Java.






[jira] [Created] (SPARK-19791) Add doc and example for fpgrowth

2017-03-01 Thread yuhao yang (JIRA)
yuhao yang created SPARK-19791:
--

 Summary: Add doc and example for fpgrowth
 Key: SPARK-19791
 URL: https://issues.apache.org/jira/browse/SPARK-19791
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


Add a new section for fpm

Add Example for FPGrowth in scala and Java.






[jira] [Commented] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores

2017-03-01 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891332#comment-15891332
 ] 

Michael Gummelt commented on SPARK-19373:
-

[~skonto] Either decline or hoard.

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registerd cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 2.1.1, 2.2.0
>
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files
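For context, the setting under discussion is the ratio of registered resources the 
scheduler waits for before launching tasks; a sketch of how it is typically used (values 
are illustrative):

{code}
// Wait until all requested cores have registered, or the wait timeout expires,
// before scheduling the first tasks (useful for node-local placement).
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("wait-for-executors")
  .config("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .config("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
  .getOrCreate()
{code}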






[jira] [Resolved] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-03-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19775.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17106
[https://github.com/apache/spark/pull/17106]

> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.2.0
>
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-14459|https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e]
>  and was superseded by 
> [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}






[jira] [Assigned] (SPARK-19775) Remove an obsolete `partitionBy().insertInto()` test case

2017-03-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19775:
-

Assignee: Dongjoon Hyun

> Remove an obsolete `partitionBy().insertInto()` test case
> -
>
> Key: SPARK-19775
> URL: https://issues.apache.org/jira/browse/SPARK-19775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.2.0
>
>
> This issue removes [a test 
> case|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298]
>  which was introduced by 
> [SPARK-14459|https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e]
>  and was superseded by 
> [SPARK-16033|https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371].
>  Basically, we cannot use `partitionBy` and `insertInto` together.
> {code}
>   test("Reject partitioning that does not match table") {
> withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
>   sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY 
> (part string)")
>   val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" 
> else "odd"))
>   .toDF("id", "data", "part")
>   intercept[AnalysisException] {
> // cannot partition by 2 fields when there is only one in the table 
> definition
> data.write.partitionBy("part", "data").insertInto("partitioned")
>   }
> }
>   }
> {code}






[jira] [Updated] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores

2017-03-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19373:
--
Fix Version/s: 2.1.1

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registerd cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 2.1.1, 2.2.0
>
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files






[jira] [Commented] (SPARK-19776) Is the JavaKafkaWordCount example correct for Spark version 2.1?

2017-03-01 Thread Russell Abedin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891280#comment-15891280
 ] 

Russell Abedin commented on SPARK-19776:


Great - thanks for the answer [~srowen] - much appreciated - I've seen the 
examples JAR reference at 
https://github.com/apache/spark/blob/master/examples/pom.xml now.  I'll direct 
future questions to the mailing list for sure.

> Is the JavaKafkaWordCount example correct for Spark version 2.1?
> 
>
> Key: SPARK-19776
> URL: https://issues.apache.org/jira/browse/SPARK-19776
> Project: Spark
>  Issue Type: Question
>  Components: Examples, ML
>Affects Versions: 2.1.0
>Reporter: Russell Abedin
>
> My question is: is 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  correct?
> I'm pretty new to both Spark and Java.  I wanted to find an example of Spark 
> Streaming using Java, streaming from Kafka. The JavaKafkaWordCount at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  looked to be perfect.
> However, when I tried running it, I found a couple of issues that I needed to 
> overcome.
> 1. This line was unnecessary:
> {code}
> StreamingExamples.setStreamingLogLevels();
> {code}
> Having this line in there (and the associated import) sent me looking for a 
> dependency, spark-examples_2.10, which is of no real use to me.
> 2. After running it, this line: 
> {code}
> JavaPairReceiverInputDStream messages = 
> KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
> {code}
> Appeared to throw an error around logging:
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/Logging
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66
> at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:11
> at 
> org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
> at 
> main.java.com.cm.JavaKafkaWordCount.main(JavaKafkaWordCount.java:72)
> {code}
> To get around this, I found that the code sample in 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html 
> helped me to come up with the right lines to see streaming from Kafka in 
> action. Specifically, it calls createDirectStream instead of createStream.
> So is the example at 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>  correct, or is there something I could have done differently to get that 
> example working?
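For reference, a Scala sketch of the 0.10 direct-stream approach the reporter ended up 
using (broker address, group id, and topic name are placeholders):

{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaWordCount010"), Seconds(2))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "kafka-word-count")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("wordcount"), kafkaParams))

stream.map(_.value).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
{code}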






[jira] [Commented] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891244#comment-15891244
 ] 

Apache Spark commented on SPARK-19373:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17129

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registerd cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 2.2.0
>
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files






[jira] [Commented] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2017-03-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891180#comment-15891180
 ] 

Dongjoon Hyun commented on SPARK-15848:
---

Hi, [~pratik.shah2462].
It doesn't happen in Spark 2.1.
For Spark 1.6.3, I can do the following.
{code}
spark-sql> ALTER TABLE avro_table_uppercase SET TBLPROPERTIES 
('avro.schema.literal'='{"namespace": "com.rishav.avro", "name": 
"student_marks", "type": "record", "fields": [ { 
"name":"student_id","aliases":["STUDENT_ID"],"type":"int"}, { 
"name":"subject_id","aliases":["SUBJECT_ID"],"type":"int"}, { 
"name":"marks","type":"int"}]}');
spark-sql> select * from avro_table_uppercase;
5   300 100 2000
7   650 20  2000
8   780 160 2000
1   340 963 2000
9   780 142 2000
2   110 430 2000
0   38  91  2002
0   65  28  2002
0   78  16  2002
1   34  96  2002
1   78  14  2002
1   11  43  2002
Time taken: 0.241 seconds, Fetched 12 row(s)
{code}

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Zhan Zhang
>
> For external partitioned Hive tables created in Avro format,
> Spark returns "null" values if column names are in uppercase in the 
> Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-01 Thread Nick Afshartous (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888572#comment-15888572
 ] 

Nick Afshartous edited comment on SPARK-19767 at 3/1/17 10:15 PM:
--

Yes, the code examples on the Integration Page are current.  The issue with the 
linked API pages looks like more than incompleteness, because the package name 
{code}org.apache.spark.streaming.kafka{code} should be 
{code}org.apache.spark.streaming.kafka10{code}.  

I'd be happy to help.  I tried to build the docs by running "jekyll build" from the 
docs dir and got the error below.  Is this target broken, or is it my env?  
{code}
[info] Note: Custom tags that could override future standard tags:  @todo, 
@note, @tparam, @constructor, @groupname, @example, @group. To avoid potential 
overrides, use at least one period character (.) in custom tag names.
[info] Note: Custom tags that were not seen:  @todo, @tparam, @constructor, 
@groupname, @group
[info] 1 error
[info] 100 warnings
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] Total time: 198 s, completed Feb 28, 2017 11:56:20 AM
jekyll 3.4.0 | Error:  Unidoc generation failed
{code}


was (Author: nafshartous):
Yes the code examples on the Integration Page are current.  The issue with the 
linked API pages looks like more than incompleteness because the package names
name {code}org.apache.spark.streaming.kafka{code} should be 
{code}org.apache.spark.streaming.kafka10{code}.  

I'd be happy to help.  Tried to build the doc running "jekyll build" from the 
docs dir and got the error below.  Is this target broken or my env ?  
{code}
[info] Note: Custom tags that could override future standard tags:  @todo, 
@note, @tparam, @constructor, @groupname, @example, @group. To avoid potential 
overrides, use at least one period character (.) in custom tag names.
[info] Note: Custom tags that were not seen:  @todo, @tparam, @constructor, 
@groupname, @group
[info] 1 error
[info] 100 warnings
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] Total time: 198 s, completed Feb 28, 2017 11:56:20 AM
jekyll 3.4.0 | Error:  Unidoc generation failed
{code}

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> lead to API pages that do not have the class ConsumerStrategies.  The API doc 
> package names  also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Juan Pumarino (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891069#comment-15891069
 ] 

Juan Pumarino commented on SPARK-19754:
---

Thank you both for your replies and for looking into the issue.
After git bisecting I found the 
[commit|https://github.com/apache/spark/commit/6b34e745bb8bdcf5a8bb78359fa39bbe8c6563cc]
 and the original JIRA: SPARK-19178.
It'd be nice if it could be backported to 1.6

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark is rounding up when the 
> decimal value is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}
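
For context, a spark-shell sketch of the behaviour described above, plus an explicit-truncation workaround for non-negative values; this is an illustration added here, not part of the ticket, and it assumes the predefined SparkSession {{spark}}.

{code}
// Same query as in the ticket, run through the SQL API.
spark.sql("""SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)""").show()
// Hive truncates to 1; the affected Spark versions round to 2.

// Forcing truncation explicitly (valid for non-negative values):
spark.sql("""SELECT CAST(FLOOR(get_json_object('{"a": 1.6}', '$.a')) AS INT)""").show()
{code}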



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890930#comment-15890930
 ] 

Nicholas Chammas commented on SPARK-19578:
--

Makes sense to me. I suppose the Apache Arrow integration work that is 
currently ongoing (SPARK-13534) will not help RDD.count() since that will only 
benefit DataFrames. (Granted, in this specific example you can always read the 
file using spark.read.csv() or spark.read.text() which will avoid this problem.)

So it sounds like the "poor PySpark performance" part of this issue is 
"Won't/Can't fix" at this time. The incorrect UI input-size metrics sounds like 
a separate issue that should be split out. 

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> Simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce contains multiple millions of lines of one word 
> "yes"
>  (it might be the cause of poor performance)
> {code}
> var a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this  to generate big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 3.5 KB, free 378.6 KB)
> 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on localhost:43389 (size: 3.5 KB, free: 517.4 MB)
> 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1008
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from 
> ResultStage 0 (PythonRDD[2] at count at :1)
> 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
> 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, partition 0,ANY, 2149 bytes)
> 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/02/13 11:13:03 INFO HadoopRDD: Input split: 
> hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728
> 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 17/02/13 11:13:03 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 17/02/13 11:13:03 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 17/02/13 11:16:37 INFO PythonRunner: Times: total = 213335, boot = 317, init 
> = 445, finish = 212573
> 17/02/13 11:16:37 INFO Executor: F

[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890907#comment-15890907
 ] 

Apache Spark commented on SPARK-18352:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17128

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nathan Howell
>  Labels: releasenotes
> Fix For: 2.2.0
>
>
> Spark currently can only parse JSON files that are JSON lines, i.e. each 
> record has an entire line and records are separated by new line. In reality, 
> a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and rather stream through them to parse the JSON files.
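
As a sketch of the usage being discussed (assuming a SparkSession named {{spark}}); the option name below is a placeholder, since the ticket itself only floats wholeJsonFile, so treat it as an assumption rather than a final API.

{code}
// Today: JSON Lines only - each line must be a complete record.
val jsonLines = spark.read.json("/path/to/records.jsonl")

// Proposed: do not split files; parse each file as one (possibly multi-line) JSON document.
// "wholeFile" is a placeholder option name, not a confirmed one.
val multiLineDf = spark.read
  .option("wholeFile", "true")
  .json("/path/to/documents/")
{code}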



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Corey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890849#comment-15890849
 ] 

Corey commented on SPARK-19734:
---

Not at all, thanks for doing it.

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keywords, and following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890846#comment-15890846
 ] 

holdenk commented on SPARK-19578:
-

[~nchammas] It's an interesting idea, but I don't think it would work, and it 
wouldn't help as much as you would expect.
First, why it won't work in the general case: we frequently use batch 
serialization inside of Python, so one JRDD object could correspond to a 
different number of Python RDD objects.

As for not helping as much as you'd expect: even if we kept the count part in the 
JVM (assuming we didn't use batch serialization or something else), we would 
still need to round-trip the data through Python to do any meaningful Python 
work. Computing the sum in Python isn't the expensive part - it's copying the 
data from the JVM into Python, which leads nicely into our next problem.

Now as for skipping Python entirely, that's a bit more tricky. In the past people 
have suggested optimizing count in Scala with a similar mechanism (e.g. if we 
know the number of input records and we know we didn't drop or add any records, 
we can just return the number of input records). The problem with that is that 
people depend on count to force evaluation of their RDD, to do things like store 
it in cache. This is further compounded in Python, since the Scala code doesn't 
know whether we've done a `map` (won't change the number of records) or a 
`flatMap` (could change the number of records), so implementing the optimization 
would be even more involved for Python. (Plumbing some of this information 
through to the Scala side could still improve the UI and debugging experience, 
but breaking a lot of people's workflows is the bigger issue.)

Now in this very specific case (data loaded in the JVM and no Python 
transformations) we _could_ optimize it, since our input data hasn't had any 
Python transformations applied to it - but I'm not super sure that is something 
worth doing.

This is only after my first cup of coffee, so if some of my explanation doesn't 
make sense let me know and I can drink more coffee and explain that part 
specifically.

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> Simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce contains multiple millions of lines of one word 
> "yes"
>  (it might be the cause of poor performance)
> {code}
> var a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this  to generate big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0

[jira] [Commented] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890825#comment-15890825
 ] 

Mark Grover commented on SPARK-19734:
-

Don't mean to step on any toes but since there wasn't any activity here for the 
past few days, I decided to issue a PR 
(https://github.com/apache/spark/pull/17127/).

Corey, if you already have a PR - I would gladly have it supersede mine.

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keywords, and following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19734:


Assignee: (was: Apache Spark)

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keywords, and following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19734:


Assignee: Apache Spark

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Assignee: Apache Spark
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keywords, and following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890823#comment-15890823
 ] 

Apache Spark commented on SPARK-19734:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/17127

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keywords, and following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19787.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17121
[https://github.com/apache/spark/pull/17121]

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the one in the train() method to 0.1. The method is marked with 
> DeveloperApi so it should not affect the users. Based on what I saw, whenever 
> we use the particular method we provide all parameters, so the default does 
> not matter. Only exception is the unit-tests on ALSSuite but the change does 
> not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.
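
Whichever default ends up winning, callers that care about regularization can sidestep the mismatch by setting regParam explicitly through the public ML API; a minimal sketch (the value and column names are placeholders):

{code}
import org.apache.spark.ml.recommendation.ALS

// Setting regParam explicitly avoids relying on either default value.
val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
{code}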



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19773) SparkDataFrame should not allow duplicate names

2017-03-01 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-19773.
---
Resolution: Not A Problem

> SparkDataFrame should not allow duplicate names
> ---
>
> Key: SPARK-19773
> URL: https://issues.apache.org/jira/browse/SPARK-19773
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> SparkDataFrame in SparkR seems to accept duplicate names at creation, but 
> incurs errors when calling methods downstream. For example, we can do: 
> {code}
> l <- list(list(1, 2), list(3, 4))
> df <- createDataFrame(l, c("a", "a"))
> head(df)
> {code}
> But an error occurs when we do df$a = df$a * 2.0. 
> I suggest we add a validity check for duplicate names at initialization.
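
As an illustration of the suggested check (a Scala sketch only; the actual validation would live in SparkR's createDataFrame path, and the function name here is made up):

{code}
// Minimal duplicate-name check over the proposed column names.
def assertNoDuplicateNames(names: Seq[String]): Unit = {
  val dups = names.groupBy(identity).collect { case (n, occurrences) if occurrences.size > 1 => n }
  require(dups.isEmpty, s"Duplicate column names are not allowed: ${dups.mkString(", ")}")
}

// Example: assertNoDuplicateNames(Seq("a", "a")) throws IllegalArgumentException.
{code}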



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890749#comment-15890749
 ] 

Imran Rashid commented on SPARK-17931:
--

[~gbloisi] thanks for reporting the issue.  Can you go ahead and open another 
jira for this?  Please ping me on it.  Since you have a reproduction, it would 
also be helpful if you could tell us what the offending property is that is 
going over 64KB.  E.g., you could do something like this (untested):

{code}
// untested sketch (as noted above); assumes java.io.{File, PrintWriter} are imported
// and Spark's logWarning is in scope
if (value.size > 16*1024) {
  val f = File.createTempFile(s"long_property_$key", ".txt")
  logWarning(s"Value for $key has length ${value.size}, writing to $f")
  val out = new PrintWriter(f)
  out.println(value)
  out.close()
}
{code}

and then attach the generated file.

Your workaround looks pretty dangerous -- if that property were actually 
important, then just randomly truncating it would be a big problem.  This 
method should be robust to long strings, but we might also want to find the 
source of that long string and avoid it.

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.
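
As a rough illustration of the proposed layout (simplified and hypothetical field names; not the actual Spark classes):

{code}
import java.nio.ByteBuffer

// Hypothetical shape after the change: the JAR/file/property metadata travels as
// plain fields on the description, and only the Task object itself stays opaque.
case class TaskDescriptionSketch(
    taskId: Long,
    attemptNumber: Int,
    executorId: String,
    name: String,
    addedFiles: Map[String, Long],    // file name -> timestamp
    addedJars: Map[String, Long],     // jar name -> timestamp
    properties: Map[String, String],  // task-local properties
    serializedTask: ByteBuffer)       // the serialized Task object only
{code}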



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19773) SparkDataFrame should not allow duplicate names

2017-03-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890740#comment-15890740
 ] 

Felix Cheung commented on SPARK-19773:
--

Let's close this unless you want to look into getting mutate to support 
duplicated columns? ;)

> SparkDataFrame should not allow duplicate names
> ---
>
> Key: SPARK-19773
> URL: https://issues.apache.org/jira/browse/SPARK-19773
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> SparkDataFrame in SparkR seems to accept duplicate names at creation, but 
> incurs errors when calling methods downstream. For example, we can do: 
> {code}
> l <- list(list(1, 2), list(3, 4))
> df <- createDataFrame(l, c("a", "a"))
> head(df)
> {code}
> But an error occurs when we do df$a = df$a * 2.0. 
> I suggest we add a validity check for duplicate names at initialization.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19741) ClassCastException when using Dataset with type containing value types

2017-03-01 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890712#comment-15890712
 ] 

Kazuaki Ishizaki commented on SPARK-19741:
--

The following program causes an exception due to a compilation error in Janino 
when using the latest Spark master branch.

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

final case class Foo(id: Int) extends AnyVal
final case class Bar(foo: Foo)

object SPARK19741 {
  def main(args: Array[String]): Unit = {
    val foo = Foo(5)
    val bar = Bar(foo)

    val conf = new SparkConf().setAppName("test").setMaster("local")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    import spark.implicits._
    spark.sparkContext.parallelize(0 to 10).toDS().map(_ => bar).collect()
  }
}
{code}

Exception
{code:java}
03:06:23.080 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 75, Column 29: Assignment conversion not possible from type "int" to type 
"org.apache.spark.sql.Foo"
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private int mapelements_argValue;
/* 010 */   private UnsafeRow mapelements_result;
/* 011 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
mapelements_holder;
/* 012 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
mapelements_rowWriter;
/* 013 */   private Object[] serializefromobject_values;
/* 014 */   private UnsafeRow serializefromobject_result;
/* 015 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
serializefromobject_holder;
/* 016 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
serializefromobject_rowWriter;
/* 017 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
serializefromobject_rowWriter1;
/* 018 */
/* 019 */   public GeneratedIterator(Object[] references) {
/* 020 */ this.references = references;
/* 021 */   }
/* 022 */
/* 023 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 024 */ partitionIndex = index;
/* 025 */ this.inputs = inputs;
/* 026 */ inputadapter_input = inputs[0];
/* 027 */
/* 028 */ mapelements_result = new UnsafeRow(1);
/* 029 */ this.mapelements_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result,
 32);
/* 030 */ this.mapelements_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder,
 1);
/* 031 */ this.serializefromobject_values = null;
/* 032 */ serializefromobject_result = new UnsafeRow(1);
/* 033 */ this.serializefromobject_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result,
 32);
/* 034 */ this.serializefromobject_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder,
 1);
/* 035 */ this.serializefromobject_rowWriter1 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder,
 1);
/* 036 */
/* 037 */   }
/* 038 */
/* 039 */   protected void processNext() throws java.io.IOException {
/* 040 */ while (inputadapter_input.hasNext() && !stopEarly()) {
/* 041 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 042 */   int inputadapter_value = inputadapter_row.getInt(0);
/* 043 */
/* 044 */   boolean mapelements_isNull = true;
/* 045 */   org.apache.spark.sql.Bar mapelements_value = null;
/* 046 */   if (!false) {
/* 047 */ mapelements_argValue = inputadapter_value;
/* 048 */
/* 049 */ mapelements_isNull = false;
/* 050 */ if (!mapelements_isNull) {
/* 051 */   Object mapelements_funcResult = null;
/* 052 */   mapelements_funcResult = ((scala.Function1) 
references[0]).apply(mapelements_argValue);
/* 053 */   if (mapelements_funcResult == null) {
/* 054 */ mapelements_isNull = true;
/* 055 */   } else {
/* 056 */ mapelements_value = (org.apache.spark.sql.Bar) 
mapelements_funcResult;
/* 057 */   }
/* 058 */
/* 059 */ }
/* 060 */ mapelements_isNull = mapelements_value == null;
/* 061 */   }
/* 062 */
/* 063 */   if (mapelements_isNull) {
/* 064 */ throw new RuntimeException(((java.lang.String) 
references[1]));
/* 065 */   }
/* 066 */
/* 067 */   if (false) {
/* 068 */ throw new RuntimeException(((java.lang.String) 
r

[jira] [Commented] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

2017-03-01 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890654#comment-15890654
 ] 

Imran Rashid commented on SPARK-19790:
--

cc [~kayousterhout] [~markhamstra] [~mridulm80] [~pwoody]

> OutputCommitCoordinator should not allow another task to commit after an 
> ExecutorFailure
> 
>
> Key: SPARK-19790
> URL: https://issues.apache.org/jira/browse/SPARK-19790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> The OutputCommitCoordinator resets the allowed committer when the task fails. 
>  
> https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no 
> idea what the status is of the task.  The task may actually still be running, 
> and perhaps successfully commit its output.  By allowing another task to 
> commit its output, there is a chance that multiple tasks commit, which can 
> result in corrupt output.  This would be particularly problematic when 
> commit() is an expensive operation, eg. moving files on S3.
> For other task failures, we can allow other tasks to commit.  But with an 
> ExecutorFailure, it's not clear what the right thing to do is.  The only safe 
> thing to do may be to fail the job.
> This is related to SPARK-19631, and was discovered during discussion on that 
> PR https://github.com/apache/spark/pull/16959#discussion_r103549134



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

2017-03-01 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-19790:


 Summary: OutputCommitCoordinator should not allow another task to 
commit after an ExecutorFailure
 Key: SPARK-19790
 URL: https://issues.apache.org/jira/browse/SPARK-19790
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Imran Rashid


The OutputCommitCoordinator resets the allowed committer when the task fails.  
https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
However, if a task fails because of an ExecutorFailure, we actually have no 
idea what the status is of the task.  The task may actually still be running, 
and perhaps successfully commit its output.  By allowing another task to commit 
its output, there is a chance that multiple tasks commit, which can result in 
corrupt output.  This would be particularly problematic when commit() is an 
expensive operation, eg. moving files on S3.

For other task failures, we can allow other tasks to commit.  But with an 
ExecutorFailure, it's not clear what the right thing to do is.  The only safe 
thing to do may be to fail the job.

This is related to SPARK-19631, and was discovered during discussion on that PR 
https://github.com/apache/spark/pull/16959#discussion_r103549134
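
A hedged sketch of the decision being discussed (this is not Spark's actual coordinator code; the names are made up): on a task failure, the commit permission should only be released when we are sure the failed task cannot still commit, and with an executor loss we cannot be sure.

{code}
object CommitDecisionSketch {
  sealed trait FailureReason
  case object ExecutorLost extends FailureReason
  case object OtherTaskFailure extends FailureReason

  // Returns which attempt (if any) remains authorized to commit after a failure.
  def onTaskFailed(authorized: Option[Int], failedAttempt: Int,
                   reason: FailureReason): Option[Int] = reason match {
    case OtherTaskFailure if authorized.contains(failedAttempt) =>
      None        // safe: another attempt may now ask for permission to commit
    case ExecutorLost =>
      authorized  // unsafe to reassign; the lost task may still be running and commit
    case _ =>
      authorized
  }
}
{code}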



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19789) Add the shortcut of .format("parquet").option("path", "/hdfs/path").partitionBy("col1", "col2").start()

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19789:


Assignee: (was: Apache Spark)

> Add the shortcut of .format("parquet").option("path", 
> "/hdfs/path").partitionBy("col1", "col2").start()
> ---
>
> Key: SPARK-19789
> URL: https://issues.apache.org/jira/browse/SPARK-19789
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> As said in the title, every time I use parquet I need to type a long string. 
> To be consistent with DataFrameReader/Writer and DataStreamReader, add a parquet 
> shortcut for this case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19789) Add the shortcut of .format("parquet").option("path", "/hdfs/path").partitionBy("col1", "col2").start()

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19789:


Assignee: Apache Spark

> Add the shortcut of .format("parquet").option("path", 
> "/hdfs/path").partitionBy("col1", "col2").start()
> ---
>
> Key: SPARK-19789
> URL: https://issues.apache.org/jira/browse/SPARK-19789
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>Assignee: Apache Spark
>
> As said in the title, every time I use parquet I need to type a long string. 
> To be consistent with DataFrameReader/Writer and DataStreamReader, add a parquet 
> shortcut for this case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19789) Add the shortcut of .format("parquet").option("path", "/hdfs/path").partitionBy("col1", "col2").start()

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890647#comment-15890647
 ] 

Apache Spark commented on SPARK-19789:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/17126

> Add the shortcut of .format("parquet").option("path", 
> "/hdfs/path").partitionBy("col1", "col2").start()
> ---
>
> Key: SPARK-19789
> URL: https://issues.apache.org/jira/browse/SPARK-19789
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> As said in the title, every time I use parquet I need to type a long string. 
> To be consistent with DataFrameReader/Writer and DataStreamReader, add a parquet 
> shortcut for this case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19789) Add the shortcut of .format("parquet").option("path", "/hdfs/path").partitionBy("col1", "col2").start()

2017-03-01 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-19789:
---

 Summary: Add the shortcut of .format("parquet").option("path", 
"/hdfs/path").partitionBy("col1", "col2").start()
 Key: SPARK-19789
 URL: https://issues.apache.org/jira/browse/SPARK-19789
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Nan Zhu


As said in the title, every time I use parquet I need to type a long string.

To be consistent with DataFrameReader/Writer and DataStreamReader, add a parquet 
shortcut for this case.
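
To make the request concrete, a sketch of the verbose form today versus the kind of shortcut being asked for. Here {{df}} is assumed to be a streaming Dataset that actually has columns col1 and col2, and the parquet(...) method on DataStreamWriter is the proposal, not an existing API.

{code}
// Today (verbose):
val q1 = df.writeStream
  .format("parquet")
  .option("path", "/hdfs/path")
  .option("checkpointLocation", "/hdfs/checkpoints")
  .partitionBy("col1", "col2")
  .start()

// Proposed shortcut, mirroring DataFrameWriter.parquet (hypothetical here):
val q2 = df.writeStream
  .option("checkpointLocation", "/hdfs/checkpoints")
  .partitionBy("col1", "col2")
  .parquet("/hdfs/path")
{code}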




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890639#comment-15890639
 ] 

Nicholas Chammas commented on SPARK-15474:
--

There is a related discussion on ORC-152 which suggests that this is an issue 
with Spark's DataFrame writer for ORC. If there is evidence that this is not 
the case, it would be good to post it directly on ORC-152 so we can get input 
from people on that project.

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, any writer is not initialised and created. (no calls of 
> {{WriterContainer.writeRows()}}.
> For Parquet and JSON, it works but ORC does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled

2017-03-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890629#comment-15890629
 ] 

Thomas Graves commented on SPARK-18769:
---

[~yuming] I already made a comment on that; I don't think we should be looking 
at queue configs to do this.  We can use the resources returned from the call 
to allocate to get a more accurate picture without having to know the internals 
of the RM queues.  The max capacity of the queue could be the entire cluster, 
which could be much, much larger than a user can actually get based on other 
queue configs and who else is using the queue.  

[~vanzin] that makes sense, but I'm not sure there is any way around it, or 
that it matters that much. The YARN API requires 1 container request per 
container.  It makes sense to try to limit them if we don't really need those or 
the cluster has nowhere near those resources.  The reply from the resource 
manager isn't based on the # of requests; it just updates the requests/releases 
and returns what the scheduler has allocated to it in between heartbeats.  The 
scheduler runs async to that and updates the allocations for that application.

Ignoring the cluster capacity issue, the hard thing about dynamic allocation is 
determining whether the tasks will be quick and thus might not need all the 
containers, because tasks can finish faster than we can allocate executors and 
use them.

>  Spark to be smarter about what the upper bound is and to restrict number of 
> executor when dynamic allocation is enabled
> 
>
> Key: SPARK-18769
> URL: https://issues.apache.org/jira/browse/SPARK-18769
> Project: Spark
>  Issue Type: New Feature
>Reporter: Neerja Khattar
>
> Currently, when dynamic allocation is enabled, the maximum number of executors is 
> effectively infinite, and Spark can create so many executors that it exceeds the 
> YARN NodeManager memory limit and vcores.
> It should have a check so that it does not exceed the YARN resource limits.
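
Until Spark itself caps the request, the usual mitigation is to bound dynamic allocation explicitly; a minimal sketch (the values are placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bounded-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")        // required for dynamic allocation on YARN
  .config("spark.dynamicAllocation.maxExecutors", "200")  // explicit upper bound
  .getOrCreate()
{code}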



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890588#comment-15890588
 ] 

Nicholas Chammas commented on SPARK-19578:
--

[~holdenk] - Would it make sense to have PySpark's {{RDD.count()}} simply 
delegate to the underlying {{_jrdd}}? I'm not seeing why Python needs to be 
involved at all to return a count.

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> Simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce contains multiple millions of lines of one word 
> "yes"
>  (it might be the cause of poor performance)
> {code}
> var a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this  to generate big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 3.5 KB, free 378.6 KB)
> 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on localhost:43389 (size: 3.5 KB, free: 517.4 MB)
> 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1008
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from 
> ResultStage 0 (PythonRDD[2] at count at :1)
> 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
> 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, partition 0,ANY, 2149 bytes)
> 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/02/13 11:13:03 INFO HadoopRDD: Input split: 
> hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728
> 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 17/02/13 11:13:03 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 17/02/13 11:13:03 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 17/02/13 11:16:37 INFO PythonRunner: Times: total = 213335, boot = 317, init 
> = 445, finish = 212573
> 17/02/13 11:16:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2182 
> bytes result sent to driver
> 17/02/13 11:16:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, partition 1,ANY, 2149 bytes)
> 17/02/13 11:16:37 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 17/02/13 11:16:37 INFO HadoopRDD: Input split: 
> h

[jira] [Commented] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890545#comment-15890545
 ] 

Apache Spark commented on SPARK-19211:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/17125

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19768) Error for both aggregate and non-aggregate queries in Structured Streaming - "This query does not support recovering from checkpoint location"

2017-03-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890546#comment-15890546
 ] 

Shixiong Zhu commented on SPARK-19768:
--

Yeah, I just recalled that I fixed the error message in 
https://github.com/apache/spark/pull/16970/files#diff-cb5dbc84ef906b6de2b6e36da45a86c3L78

The supported operations and output modes are documented here: 
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes

> Error for both  aggregate  and  non-aggregate queries in Structured Streaming 
> - "This query does not support recovering from checkpoint location"
> -
>
> Key: SPARK-19768
> URL: https://issues.apache.org/jira/browse/SPARK-19768
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Amit Baghel
>
> I am running the JavaStructuredKafkaWordCount.java example with a 
> checkpointLocation. The output mode is "complete". Below is the relevant code.
> {code}
>  // Generate running word count
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
> return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).groupBy("value").count();
> // Start running the query that prints the running counts to the console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("complete")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This example runs successfully and writes data to the checkpoint directory. 
> When I re-run the program, it throws the exception below:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
> Then I modified JavaStructuredKafkaWordCount.java to use a non-aggregate 
> query with output mode "append". Please see the code below.
> {code}
> // no aggregations
> Dataset<Row> wordCounts = lines.flatMap(new FlatMapFunction<String, String>() {
>   @Override
>   public Iterator<String> call(String x) {
> return Arrays.asList(x.split(" ")).iterator();
>   }
> }, Encoders.STRING()).select("value");
> // append mode with console
> StreamingQuery query = wordCounts.writeStream()
>   .outputMode("append")
>   .format("console")
>   .option("checkpointLocation", "/tmp/checkpoint-data")
>   .start();
> {code}
> This modified code also runs successfully and writes data to the checkpoint 
> directory. When I re-run the program, it throws the same exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: This query 
> does not support recovering from checkpoint location. Delete 
> /tmp/checkpoint-data/offsets to start over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:219)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
>   at 
> com.spark.JavaStructuredKafkaWordCount.main(JavaStructuredKafkaWordCount.java:85)
> {code}
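For comparison, here is a sketch of the same word count written to a file sink, which can be restarted from a checkpoint (this assumes the spark-sql-kafka-0-10 package is on the classpath; the Kafka settings and paths are placeholders, not taken from the report above):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCountFileSink").getOrCreate()
import spark.implicits._

// Kafka connection details below are placeholders.
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]

val words = lines.flatMap(_.split(" ")).toDF("value")

// The file (parquet) sink keeps its own metadata under the checkpoint
// location, so this query can be stopped and restarted from
// /tmp/checkpoint-data without the "does not support recovering" error.
val query = words.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/wordcount-output")
  .option("checkpointLocation", "/tmp/checkpoint-data")
  .start()

query.awaitTermination()
{code}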



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19211:


Assignee: (was: Apache Spark)

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19211:


Assignee: Apache Spark

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19779:


Assignee: Apache Spark

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Feng Gui
>Assignee: Apache Spark
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> temporary delta file is still left behind in HDFS. Structured Streaming never 
> deletes this temporary file after the restart, so we need to clean it up once 
> the streaming job has been restarted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19779:


Assignee: (was: Apache Spark)

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> temporary delta file is still left behind in HDFS. Structured Streaming never 
> deletes this temporary file after the restart, so we need to clean it up once 
> the streaming job has been restarted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-01 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-19788:

Description: 
There are many other data sources/sinks which has very different configuration 
ways than Kafka, FileSystem, etc. 

The expected type of the configuration entry passed to them might be a nested 
collection type, e.g. Map[String, Map[String, String]], or even a user-defined 
type(for example, the one I am working on)

Right now, option can only accept String -> String/Boolean/Long/Double OR a 
complete Map[String, String]...my suggestion is that we can accept Map[String, 
Any], and the type of 'parameters' in SourceProvider.createSource can also be 
Map[String, Any], this will create much more flexibility to the user

The drawback is that, it is a breaking change ( we can mitigate this by 
deprecating the current one, and progressively evolve to the new one if the 
proposal is accepted)

[~zsxwing] what do you think?


  was:
There are many other data sources/sinks which has very different configuration 
ways than Kafka, FileSystem, etc. 

The expected type of the configuration entry passed to them might be a nested 
collection type, e.g. Map[String, Map[String, String]], or even a user-defined 
type(for example, the one I am working on)

Right now, option can only accept String -> String/Boolean/Long/Double OR a 
complete Map[String, String]...my suggestion is that we can accept Map[String, 
Any], and the type of 'parameters' in SourceProvider.createSource can also be 
Map[String, Any], this will create much more flexibility to the user

The drawback is that, it is a breaking change ( we can mitigate this by 
deprecate the current one, and progressively evolve to the new one if the 
proposal is accepted)

[~zsxwing] what do you think?



> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources/sinks which has very different 
> configuration ways than Kafka, FileSystem, etc. 
> The expected type of the configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type(for example, the one I am working on)
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]...my suggestion is that we can accept 
> Map[String, Any], and the type of 'parameters' in SourceProvider.createSource 
> can also be Map[String, Any], this will create much more flexibility to the 
> user
> The drawback is that, it is a breaking change ( we can mitigate this by 
> deprecating the current one, and progressively evolve to the new one if the 
> proposal is accepted)
> [~zsxwing] what do you think?
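Until something like that is accepted, the usual workaround is to serialize the nested configuration into a single String option and decode it inside the custom source; a rough sketch using json4s (the option key and values are made up for illustration):
{code}
import org.json4s._
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.{read, write}

implicit val formats: Formats = Serialization.formats(NoTypeHints)

// Caller side: collapse the nested configuration into one String option,
// e.g. .option("nestedConf", encoded) on a DataStreamReader.
val nested: Map[String, Map[String, String]] =
  Map("endpoint" -> Map("host" -> "example.com", "port" -> "443"))
val encoded: String = write(nested)

// Source side: a custom Source would decode parameters("nestedConf") back
// into the nested structure.
val decoded = read[Map[String, Map[String, String]]](encoded)
assert(decoded("endpoint")("port") == "443")
{code}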



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890531#comment-15890531
 ] 

Apache Spark commented on SPARK-19779:
--

User 'gf53520' has created a pull request for this issue:
https://github.com/apache/spark/pull/17124

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> temporary delta file is still left behind in HDFS. Structured Streaming never 
> deletes this temporary file after the restart, so we need to clean it up once 
> the streaming job has been restarted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Feng Gui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890528#comment-15890528
 ] 

Feng Gui commented on SPARK-19779:
--

[~srowen] The background maintenance task doesn't clean up files whose names 
start with `temp`, so I think the temp file is never deleted. However, the 
leftover temp file does not cause incorrect results for the Structured 
Streaming job.
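Until a fix lands, a leftover temp file can be removed by hand from a spark-shell; a rough sketch with the Hadoop FileSystem API (the checkpoint path and the `temp` prefix are assumptions taken from the comment above, not a documented layout):
{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// One-off cleanup of leftover temp delta files under a state-store checkpoint
// directory. Adjust the path and prefix for your own checkpoint layout.
val stateDir = new Path("hdfs:///tmp/checkpoint-data/state")
val fs = stateDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

if (fs.exists(stateDir)) {
  val files = fs.listFiles(stateDir, /* recursive = */ true)
  while (files.hasNext) {
    val status = files.next()
    if (status.getPath.getName.startsWith("temp")) {
      fs.delete(status.getPath, /* recursive = */ false)
    }
  }
}
{code}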

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> temporary delta file is still left behind in HDFS. Structured Streaming never 
> deletes this temporary file after the restart, so we need to clean it up once 
> the streaming job has been restarted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-01 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-19788:

Description: 
There are many other data sources/sinks which has very different configuration 
ways than Kafka, FileSystem, etc. 

The expected type of the configuration entry passed to them might be a nested 
collection type, e.g. Map[String, Map[String, String]], or even a user-defined 
type(for example, the one I am working on)

Right now, option can only accept String -> String/Boolean/Long/Double OR a 
complete Map[String, String]...my suggestion is that we can accept Map[String, 
Any], and the type of 'parameters' in SourceProvider.createSource can also be 
Map[String, Any], this will create much more flexibility to the user

The drawback is that, it is a breaking change ( we can mitigate this by 
deprecate the current one, and progressively evolve to the new one if the 
proposal is accepted)

[~zsxwing] what do you think?


  was:
There are many other data sources which has very different configuration ways 
than Kafka, FileSystem, etc. 

The expected type of the configuration entry passed to them might be a nested 
collection type, e.g. Map[String, Map[String, String]], or even a user-defined 
type(for example, the one I am working on)

Right now, option can only accept String -> String/Boolean/Long/Double OR a 
complete Map[String, String]...my suggestion is that we can accept Map[String, 
Any], and the type of 'parameters' in SourceProvider.createSource can also be 
Map[String, Any], this will create much more flexibility to the user

The drawback is that, it is a breaking change ( we can mitigate this by 
deprecate the current one, and progressively evolve to the new one if the 
proposal is accepted)

[~zsxwing] what do you think?



> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources/sinks which has very different 
> configuration ways than Kafka, FileSystem, etc. 
> The expected type of the configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type(for example, the one I am working on)
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]...my suggestion is that we can accept 
> Map[String, Any], and the type of 'parameters' in SourceProvider.createSource 
> can also be Map[String, Any], this will create much more flexibility to the 
> user
> The drawback is that, it is a breaking change ( we can mitigate this by 
> deprecate the current one, and progressively evolve to the new one if the 
> proposal is accepted)
> [~zsxwing] what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-01 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890522#comment-15890522
 ] 

Nan Zhu edited comment on SPARK-19788 at 3/1/17 4:45 PM:
-

another drawback is that it might look like incompatible with 
DataFrameReader/DataFrameWriter (we can also change that?)


was (Author: codingcat):
another drawback is that it might look like incompatible with DataFrameReader 
(we can also change that?)

> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources/sinks which has very different 
> configuration ways than Kafka, FileSystem, etc. 
> The expected type of the configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type(for example, the one I am working on)
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]...my suggestion is that we can accept 
> Map[String, Any], and the type of 'parameters' in SourceProvider.createSource 
> can also be Map[String, Any], this will create much more flexibility to the 
> user
> The drawback is that, it is a breaking change ( we can mitigate this by 
> deprecate the current one, and progressively evolve to the new one if the 
> proposal is accepted)
> [~zsxwing] what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-01 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-19788:

Summary: DataStreamReader/DataStreamWriter.option shall accept user-defined 
type  (was: DataStreamReader.option shall accept user-defined type)

> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources which has very different configuration ways 
> than Kafka, FileSystem, etc. 
> The expected type of the configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type(for example, the one I am working on)
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]...my suggestion is that we can accept 
> Map[String, Any], and the type of 'parameters' in SourceProvider.createSource 
> can also be Map[String, Any], this will create much more flexibility to the 
> user
> The drawback is that, it is a breaking change ( we can mitigate this by 
> deprecate the current one, and progressively evolve to the new one if the 
> proposal is accepted)
> [~zsxwing] what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19788) DataStreamReader.option shall accept user-defined type

2017-03-01 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890522#comment-15890522
 ] 

Nan Zhu commented on SPARK-19788:
-

another drawback is that it might look like incompatible with DataFrameReader 
(we can also change that?)

> DataStreamReader.option shall accept user-defined type
> --
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources which has very different configuration ways 
> than Kafka, FileSystem, etc. 
> The expected type of the configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type(for example, the one I am working on)
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]...my suggestion is that we can accept 
> Map[String, Any], and the type of 'parameters' in SourceProvider.createSource 
> can also be Map[String, Any], this will create much more flexibility to the 
> user
> The drawback is that, it is a breaking change ( we can mitigate this by 
> deprecate the current one, and progressively evolve to the new one if the 
> proposal is accepted)
> [~zsxwing] what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19788) DataStreamReader.option shall accept user-defined type

2017-03-01 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-19788:
---

 Summary: DataStreamReader.option shall accept user-defined type
 Key: SPARK-19788
 URL: https://issues.apache.org/jira/browse/SPARK-19788
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Nan Zhu


There are many other data sources which has very different configuration ways 
than Kafka, FileSystem, etc. 

The expected type of the configuration entry passed to them might be a nested 
collection type, e.g. Map[String, Map[String, String]], or even a user-defined 
type(for example, the one I am working on)

Right now, option can only accept String -> String/Boolean/Long/Double OR a 
complete Map[String, String]...my suggestion is that we can accept Map[String, 
Any], and the type of 'parameters' in SourceProvider.createSource can also be 
Map[String, Any], this will create much more flexibility to the user

The drawback is that, it is a breaking change ( we can mitigate this by 
deprecate the current one, and progressively evolve to the new one if the 
proposal is accepted)

[~zsxwing] what do you think?
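As a rough illustration of the proposed shape (hypothetical code only, neither this overload nor an Any-typed parameters map exists in Spark today), an `Any`-typed option could carry a nested map straight through to the source:
{code}
// Hypothetical sketch of the proposal, not an existing Spark API.
class OptionsSketch {
  private var extraOptions: Map[String, Any] = Map.empty

  // Proposed shape: accept any value, not just String/Boolean/Long/Double.
  def option(key: String, value: Any): this.type = {
    extraOptions += (key -> value)
    this
  }

  // What a provider receiving Map[String, Any] might do with a nested entry.
  def nestedConf(key: String): Map[String, String] =
    extraOptions.get(key) match {
      case Some(m: Map[_, _]) => m.asInstanceOf[Map[String, String]]
      case _                  => Map.empty
    }
}

val r = new OptionsSketch()
  .option("endpoint", Map("host" -> "example.com", "port" -> "443"))
println(r.nestedConf("endpoint")("port"))   // prints 443
{code}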




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19779) structured streaming exist needless tmp file

2017-03-01 Thread Feng Gui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Gui updated SPARK-19779:
-
Summary: structured streaming exist needless tmp file   (was: structured 
streaming exist residual tmp file )

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Feng Gui
>Priority: Minor
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> temporary delta file is still left behind in HDFS. Structured Streaming never 
> deletes this temporary file after the restart, so we need to clean it up once 
> the streaming job has been restarted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-03-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19766:
---

Assignee: StanZhai

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Critical
>  Labels: Correctness
> Fix For: 2.1.1, 2.2.0
>
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The results which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-03-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19766.
-
Resolution: Fixed

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Critical
>  Labels: Correctness
> Fix For: 2.1.1, 2.2.0
>
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The results which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-03-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19766:

Fix Version/s: 2.2.0
   2.1.1

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: Correctness
> Fix For: 2.1.1, 2.2.0
>
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The results which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-03-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19766:

Labels: Correctness  (was: )

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Critical
>  Labels: Correctness
> Fix For: 2.1.1, 2.2.0
>
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The results which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19781) Bucketizer's handleInvalid leave null values untouched unlike the NaNs

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890373#comment-15890373
 ] 

Apache Spark commented on SPARK-19781:
--

User 'crackcell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17123

> Bucketizer's handleInvalid leave null values untouched unlike the NaNs
> --
>
> Key: SPARK-19781
> URL: https://issues.apache.org/jira/browse/SPARK-19781
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Menglong TAN
>Priority: Minor
>  Labels: MLlib
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Bucketizer can put NaN values into a special bucket when handleInvalid is on, 
> but it leaves null values untouched.
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val data = sc.parallelize(Seq(("crackcell", 
> null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
> val bucketizer = new 
> Bucketizer().setInputCol("number").setOutputCol("number_output").setSplits(Array(Double.NegativeInfinity,
>  0, 10, Double.PositiveInfinity)).setHandleInvalid("keep")
> val res = bucketizer.transform(data)
> res.show(1)
> {code}
> will output:
> {quote}
> +-+--+-+
> | name|number|number_output|
> +-+--+-+
> |crackcell|  null| null|
> +-+--+-+
> {quote}
> If we change null to NaN:
> {code}
> val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", 
> "number")
> data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
> bucketizer.transform(data2).show(1)
> {code}
> will output:
> {quote}
> +-+--+-+
> | name|number|number_output|
> +-+--+-+
> |crackcell|   NaN|  3.0|
> +-+--+-+
> {quote}
> Maybe we should unify the behaviours? Is it reasonable to process nulls as 
> well? If so, maybe my code can help. :-)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19781) Bucketizer's handleInvalid leave null values untouched unlike the NaNs

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19781:


Assignee: Apache Spark

> Bucketizer's handleInvalid leave null values untouched unlike the NaNs
> --
>
> Key: SPARK-19781
> URL: https://issues.apache.org/jira/browse/SPARK-19781
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Menglong TAN
>Assignee: Apache Spark
>Priority: Minor
>  Labels: MLlib
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Bucketizer can put NaN values into a special bucket when handleInvalid is on, 
> but it leaves null values untouched.
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val data = sc.parallelize(Seq(("crackcell", 
> null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
> val bucketizer = new 
> Bucketizer().setInputCol("number").setOutputCol("number_output").setSplits(Array(Double.NegativeInfinity,
>  0, 10, Double.PositiveInfinity)).setHandleInvalid("keep")
> val res = bucketizer.transform(data)
> res.show(1)
> {code}
> will output:
> {quote}
> +-+--+-+
> | name|number|number_output|
> +-+--+-+
> |crackcell|  null| null|
> +-+--+-+
> {quote}
> If we change null to NaN:
> {code}
> val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", 
> "number")
> data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
> bucketizer.transform(data2).show(1)
> {code}
> will output:
> {quote}
> +-+--+-+
> | name|number|number_output|
> +-+--+-+
> |crackcell|   NaN|  3.0|
> +-+--+-+
> {quote}
> Maybe we should unify the behaviours? Is it reasonable to process nulls as 
> well? If so, maybe my code can help. :-)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19781) Bucketizer's handleInvalid leave null values untouched unlike the NaNs

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19781:


Assignee: (was: Apache Spark)

> Bucketizer's handleInvalid leave null values untouched unlike the NaNs
> --
>
> Key: SPARK-19781
> URL: https://issues.apache.org/jira/browse/SPARK-19781
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Menglong TAN
>Priority: Minor
>  Labels: MLlib
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Bucketizer can put NaN values into a special bucket when handleInvalid is on, 
> but it leaves null values untouched.
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val data = sc.parallelize(Seq(("crackcell", 
> null.asInstanceOf[java.lang.Double]))).toDF("name", "number")
> val bucketizer = new 
> Bucketizer().setInputCol("number").setOutputCol("number_output").setSplits(Array(Double.NegativeInfinity,
>  0, 10, Double.PositiveInfinity)).setHandleInvalid("keep")
> val res = bucketizer.transform(data)
> res.show(1)
> {code}
> will output:
> {quote}
> +-+--+-+
> | name|number|number_output|
> +-+--+-+
> |crackcell|  null| null|
> +-+--+-+
> {quote}
> If we change null to NaN:
> {code}
> val data2 = sc.parallelize(Seq(("crackcell", Double.NaN))).toDF("name", 
> "number")
> data2: org.apache.spark.sql.DataFrame = [name: string, number: double]
> bucketizer.transform(data2).show(1)
> {code}
> will output:
> {quote}
> +-+--+-+
> | name|number|number_output|
> +-+--+-+
> |crackcell|   NaN|  3.0|
> +-+--+-+
> {quote}
> Maybe we should unify the behaviours? Is it reasonable to process nulls as 
> well? If so, maybe my code can help. :-)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19786) Facilitate loop optimizations in a JIT compiler regarding range()

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19786:


Assignee: (was: Apache Spark)

> Facilitate loop optimizations in a JIT compiler regarding range()
> -
>
> Key: SPARK-19786
> URL: https://issues.apache.org/jira/browse/SPARK-19786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{range()}} to facilitate loop 
> optimizations in a JIT compiler for achieving better performance.
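The kind of query this targets is a tight scan over {{range()}}; a quick, informal way to observe the effect from a spark-shell (timings will vary with hardware and JIT warm-up, so this is only a rough check, not a proper benchmark):
{code}
// Rough micro-benchmark sketch: sum one billion ids produced by range().
// Run it several times so the JIT compiler has warmed up before comparing.
val start = System.nanoTime()
spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").collect()
println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")
{code}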



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19786) Facilitate loop optimizations in a JIT compiler regarding range()

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890271#comment-15890271
 ] 

Apache Spark commented on SPARK-19786:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17122

> Facilitate loop optimizations in a JIT compiler regarding range()
> -
>
> Key: SPARK-19786
> URL: https://issues.apache.org/jira/browse/SPARK-19786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{range()}} to facilitate loop 
> optimizations in a JIT compiler for achieving better performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890272#comment-15890272
 ] 

Hyukjin Kwon commented on SPARK-19754:
--

I see. It'd be great if anyone could identify the JIRA that fixed this and 
backport it. I am resolving this one because I can't find that JIRA and I 
can't reproduce the problem on the current master.

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark is rounding up when the 
> decimal value is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890287#comment-15890287
 ] 

Takeshi Yamamuro commented on SPARK-19754:
--

Good.

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark is rounding up when the 
> decimal value is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890261#comment-15890261
 ] 

Takeshi Yamamuro commented on SPARK-19754:
--

Aha, I see. I also checked this in v2.0.2;
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("""SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)""").show
+---------------------------------------------+
|CAST(get_json_object({"a": 1.6}, $.a) AS INT)|
+---------------------------------------------+
|                                            2|
+---------------------------------------------+
{code}

It seems a later patch changed this behaviour.

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark is rounding up when the 
> decimal value is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19786) Facilitate loop optimizations in a JIT compiler regarding range()

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19786:


Assignee: Apache Spark

> Facilitate loop optimizations in a JIT compiler regarding range()
> -
>
> Key: SPARK-19786
> URL: https://issues.apache.org/jira/browse/SPARK-19786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{range()}} to facilitate loop 
> optimizations in a JIT compiler for achieving better performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19754.
--
Resolution: Cannot Reproduce

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark is rounding up when the 
> decimal value is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890250#comment-15890250
 ] 

Sean Owen commented on SPARK-19767:
---

Sounds fine to me, you can add a note and link to the source tree which at 
least has the docs in scaladoc.

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> leads to API pages that do not have class ConsumerStrategies.  The API doc 
> package names also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19754) Casting to int from a JSON-parsed float rounds instead of truncating

2017-03-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890249#comment-15890249
 ] 

Hyukjin Kwon commented on SPARK-19754:
--

Thank you for cc'ing me. It seems it returns as below in the current master

{code}
scala> sql("SELECT CAST(1.6 AS INT)").show()
+----------------+
|CAST(1.6 AS INT)|
+----------------+
|               1|
+----------------+
{code}

{code}
scala> sql("""SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS 
INT)""").show()
+---------------------------------------------+
|CAST(get_json_object({"a": 1.6}, $.a) AS INT)|
+---------------------------------------------+
|                                            1|
+---------------------------------------------+
{code}

Both seem consistent and the result is {{1}}. Could you check and confirm 
whether I missed something, [~jipumarino]?

> Casting to int from a JSON-parsed float rounds instead of truncating
> 
>
> Key: SPARK-19754
> URL: https://issues.apache.org/jira/browse/SPARK-19754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Juan Pumarino
>Priority: Minor
>
> When retrieving a float value from a JSON document, and then casting it to an 
> integer, Hive simply truncates it, while Spark is rounding up when the 
> decimal value is >= 5.
> In Hive, the following query returns {{1}}, whereas in a Spark shell the 
> result is {{2}}.
> {code}
> SELECT CAST(get_json_object('{"a": 1.6}', '$.a') AS INT)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19787:


Assignee: (was: Apache Spark)

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>Priority: Trivial
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the default of the train() method to 0.1. The method is marked 
> with DeveloperApi, so it should not affect users. From what I saw, whenever 
> we use that particular method we provide all parameters explicitly, so the 
> default does not matter. The only exception is the unit tests in ALSSuite, 
> but the change does not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.
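Until the defaults are aligned, callers can sidestep the mismatch by always setting regParam explicitly on the ml API; a minimal sketch (the column names and the input DataFrame are placeholders, not taken from this ticket):
{code}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRegParam(0.1)        // set explicitly instead of relying on a default
  .setRank(10)
  .setMaxIter(10)
  .setUserCol("userId")    // placeholder column names
  .setItemCol("movieId")
  .setRatingCol("rating")

// val model = als.fit(ratings)   // `ratings` is a placeholder DataFrame
{code}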



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890239#comment-15890239
 ] 

Apache Spark commented on SPARK-19787:
--

User 'datumbox' has created a pull request for this issue:
https://github.com/apache/spark/pull/17121

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>Priority: Trivial
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the default of the train() method to 0.1. The method is marked 
> with DeveloperApi, so it should not affect users. From what I saw, whenever 
> we use that particular method we provide all parameters explicitly, so the 
> default does not matter. The only exception is the unit tests in ALSSuite, 
> but the change does not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19787:


Assignee: Apache Spark

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>Assignee: Apache Spark
>Priority: Trivial
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the default of the train() method to 0.1. The method is marked 
> with DeveloperApi, so it should not affect users. Based on what I saw, whenever 
> we use this particular method we provide all parameters, so the default does 
> not matter. The only exception is the unit tests in ALSSuite, but the change 
> does not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890226#comment-15890226
 ] 

Giambattista commented on SPARK-17931:
--

I just wanted to report that after this change Spark fails to execute 
long SQL statements (in my case, long INSERT INTO table statements).
The problem I was facing is very well described in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
 dataOut.writeInt(taskDescription.properties.size())
 taskDescription.properties.asScala.foreach { case (key, value) =>
   dataOut.writeUTF(key)
-  dataOut.writeUTF(value)
+  dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
 }
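
For context, the limit comes from DataOutputStream.writeUTF, which prefixes the encoded string with an unsigned 16-bit length. Instead of truncating property values, one alternative is to write the UTF-8 bytes with an explicit int length; a minimal sketch (helper names are hypothetical, not the change that eventually landed):

{code}
import java.io.{DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

// Write/read a string of arbitrary length as int-length-prefixed UTF-8 bytes,
// sidestepping DataOutputStream.writeUTF's 64 KB limit.
def writeLongUTF(out: DataOutputStream, s: String): Unit = {
  val bytes = s.getBytes(StandardCharsets.UTF_8)
  out.writeInt(bytes.length)
  out.write(bytes)
}

def readLongUTF(in: DataInputStream): String = {
  val bytes = new Array[Byte](in.readInt())
  in.readFully(bytes)
  new String(bytes, StandardCharsets.UTF_8)
}
{code}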



> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.
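As a rough illustration of the single-layer layout described above (with a heavily simplified field set; the real TaskDescription carries more metadata), the JARs, files and properties can be written ahead of the already-serialized task bytes, so an executor can read them without deserializing the Task itself:

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.ByteBuffer

// Simplified stand-in for TaskDescription, for illustration only.
case class SimpleTaskDescription(
    taskId: Long,
    name: String,
    jars: Map[String, Long],            // path -> timestamp
    files: Map[String, Long],
    properties: Map[String, String],
    serializedTask: ByteBuffer)

def encode(td: SimpleTaskDescription): ByteBuffer = {
  val buffer = new ByteArrayOutputStream()
  val out = new DataOutputStream(buffer)
  out.writeLong(td.taskId)
  out.writeUTF(td.name)

  def writeMap(m: Map[String, Long]): Unit = {
    out.writeInt(m.size)
    m.foreach { case (k, v) => out.writeUTF(k); out.writeLong(v) }
  }
  writeMap(td.jars)                      // dependencies first...
  writeMap(td.files)
  out.writeInt(td.properties.size)
  td.properties.foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }

  val taskBytes = new Array[Byte](td.serializedTask.remaining())
  td.serializedTask.duplicate().get(taskBytes)
  out.write(taskBytes)                   // ...then the opaque serialized Task
  out.flush()
  ByteBuffer.wrap(buffer.toByteArray)
}
{code}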



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890226#comment-15890226
 ] 

Giambattista edited comment on SPARK-17931 at 3/1/17 2:06 PM:
--

I just wanted to report that after this change Spark fails to execute 
long SQL statements (in my case, long INSERT INTO table statements).
The problem I was facing is very well described in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

{noformat}
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
 dataOut.writeInt(taskDescription.properties.size())
 taskDescription.properties.asScala.foreach { case (key, value) =>
   dataOut.writeUTF(key)
-  dataOut.writeUTF(value)
+  dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
 }
{noformat}



was (Author: gbloisi):
I just wanted to report that after this change Spark is failing in executing 
long SQL statements (my case they were long insert into table statements).
The problem I was facing is very well described in this article 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
 dataOut.writeInt(taskDescription.properties.size())
 taskDescription.properties.asScala.foreach { case (key, value) =>
   dataOut.writeUTF(key)
-  dataOut.writeUTF(value)
+  dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
 }



> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Vasilis Vryniotis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasilis Vryniotis updated SPARK-19787:
--
Priority: Trivial  (was: Major)

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>Priority: Trivial
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the default of the train() method to 0.1. The method is marked 
> with DeveloperApi, so it should not affect users. Based on what I saw, whenever 
> we use this particular method we provide all parameters, so the default does 
> not matter. The only exception is the unit tests in ALSSuite, but the change 
> does not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Vasilis Vryniotis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasilis Vryniotis updated SPARK-19787:
--
Description: 
In the ALS method the default values of regParam do not match within the same 
file (lines 
[224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
 and 
[714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
 In one place we set it to 1.0 and in the other to 0.1.

We can change the default of the train() method to 0.1. The method is marked with 
DeveloperApi, so it should not affect users. Based on what I saw, whenever 
we use this particular method we provide all parameters, so the default does not 
matter. The only exception is the unit tests in ALSSuite, but the change does not 
break them.

This change was discussed on a separate PR and [~mlnick] 
[suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
to create a separate PR & ticket.

  was:
In the ALS method the default values of regParam do not match within the same 
file (lines 
[224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224)
 and 
[714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)).
 In one place we set it to 1.0 and in the other to 0.1.

We can change the one of train() method to 0.1. The method is marked with 
DeveloperApi so it should not affect the users. Based on what I saw, whenever 
we use the particular method we provide all parameters, so the default does not 
matter. Only exception is the unit-tests on ALSSuite but the change does not 
break them.


> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the default of the train() method to 0.1. The method is marked 
> with DeveloperApi, so it should not affect users. Based on what I saw, whenever 
> we use this particular method we provide all parameters, so the default does 
> not matter. The only exception is the unit tests in ALSSuite, but the change 
> does not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Vasilis Vryniotis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasilis Vryniotis updated SPARK-19787:
--
Issue Type: Improvement  (was: Bug)

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the default of the train() method to 0.1. The method is marked 
> with DeveloperApi, so it should not affect users. Based on what I saw, whenever 
> we use this particular method we provide all parameters, so the default does 
> not matter. The only exception is the unit tests in ALSSuite, but the change 
> does not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19786) Facilitate loop optimizations in a JIT compiler regarding range()

2017-03-01 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-19786:
-
Summary: Facilitate loop optimizations in a JIT compiler regarding range()  
(was: Facilitate loop optimization in a JIT compiler regarding range())

> Facilitate loop optimizations in a JIT compiler regarding range()
> -
>
> Key: SPARK-19786
> URL: https://issues.apache.org/jira/browse/SPARK-19786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{range()}} to facilitate loop 
> optimizations in a JIT compiler for achieving better performance.
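A simple way to observe the effect of such codegen changes is to time an aggregation over range() in the spark-shell; a rough micro-benchmark sketch (results depend entirely on the build, JIT warm-up and hardware):

{code}
// Crude timing helper; `spark` is the SparkSession provided by spark-shell.
def time[T](body: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

val (sum, seconds) = time {
  spark.range(0L, 1000L * 1000 * 1000)
    .selectExpr("sum(id)")
    .collect()(0)
    .getLong(0)
}
println(s"sum of range(1e9) = $sum, took $seconds s")
{code}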



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


