[jira] [Resolved] (SPARK-23081) Add colRegex API to PySpark

2018-01-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23081.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Fixed in https://github.com/apache/spark/pull/20390

> Add colRegex API to PySpark
> ---
>
> Key: SPARK-23081
> URL: https://issues.apache.org/jira/browse/SPARK-23081
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Created] (SPARK-23222) Flaky test: DataFrameRangeSuite

2018-01-25 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23222:
--

 Summary: Flaky test: DataFrameRangeSuite
 Key: SPARK-23222
 URL: https://issues.apache.org/jira/browse/SPARK-23222
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


I've seen this test fail a few times in unrelated PRs, e.g.:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/

{noformat}
Error Message
org.scalatest.exceptions.TestFailedException: Expected exception 
org.apache.spark.SparkException to be thrown, but no exception was thrown
Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Expected 
exception org.apache.spark.SparkException to be thrown, but no exception was 
thrown
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
at 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168)
at 
org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176)
at 
org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33)
at 
org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167)
at 
org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33)
at 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166)
at 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165)
at 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154)
at 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154)
{noformat}
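
For readers unfamiliar with the failure mode: this message comes from ScalaTest's {{intercept}}, which fails the test when the wrapped block completes without throwing the expected exception. A minimal sketch (not the suite's actual code) of the shape involved:

{code:scala}
import org.scalatest.FunSuite
import org.apache.spark.SparkException

class InterceptShapeSuite extends FunSuite {
  test("intercept fails when the expected exception is never thrown") {
    // intercept[T] returns the thrown T; if the block finishes normally,
    // ScalaTest raises TestFailedException with the message quoted above.
    intercept[SparkException] {
      // e.g. a cancelled job that, due to a race, completes before the cancellation lands
    }
  }
}
{code}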






[jira] [Resolved] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23205.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20389
[https://github.com/apache/spark/pull/20389]

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Assignee: Siddharth Murching
>Priority: Critical
> Fix For: 2.3.0
>
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))
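
For context, a minimal standalone sketch (independent of the Spark patch) of the difference between the two {{java.awt.Color}} constructors; the ARGB value below is just an example:

{code:scala}
import java.awt.Color

val argb = 0x80FF0000                    // alpha = 0x80 (128), red = 0xFF, ARGB layout
val dropsAlpha = new Color(argb)         // one-arg constructor: alpha defaulted to 255
val keepsAlpha = new Color(argb, true)   // hasalpha = true: alpha read from bits 24-31

println(dropsAlpha.getAlpha)             // 255
println(keepsAlpha.getAlpha)             // 128
{code}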






[jira] [Assigned] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23205:
-

Assignee: Siddharth Murching

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Assignee: Siddharth Murching
>Priority: Critical
> Fix For: 2.3.0
>
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))






[jira] [Resolved] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23032.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add a per-query codegenStageId to WholeStageCodegenExec
> ---
>
> Key: SPARK-23032
> URL: https://issues.apache.org/jira/browse/SPARK-23032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Priority: Major
> Fix For: 2.3.0
>
>
> Proposing to add a per-query ID to the codegen stages as represented by 
> {{WholeStageCodegenExec}} operators. This ID will be used in
> * the explain output of the physical plan, and in
> * the generated class name.
> Specifically, this ID will be stable within a query, counting up from 1 in 
> depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
> plan.
> The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
> objects, which may have been created for one-off purposes, e.g. for fallback 
> handling of codegen stages that failed to codegen the whole stage and wishes 
> to codegen a subset of the children operators.
> Example: for the following query:
> {code:none}
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
> scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
> 'y).orderBy('x).select('x + 1 as 'z, 'y)
> df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]
> scala> val df2 = spark.range(5)
> df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val query = df1.join(df2, 'z === 'id)
> query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more 
> field]
> {code}
> The explain output before the change is:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *SortMergeJoin [z#9L], [id#13L], Inner
> :- *Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *Range (0, 10, step=1, splits=8)
> +- *Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *Range (0, 5, step=1, splits=8)
> {code}
> Note how codegen'd operators are annotated with a prefix {{"*"}}.
> and after this change it'll be:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *(6) SortMergeJoin [z#9L], [id#13L], Inner
> :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *(1) Range (0, 10, step=1, splits=8)
> +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *(4) Range (0, 5, step=1, splits=8)
> {code}
> Note that the annotated prefix becomes {{"*(id) "}}
> It'll also show up in the name of the generated class, as a suffix in the 
> format of
> {code:none}
> GeneratedClass$GeneratedIterator$id
> {code}
> for example, note how {{GeneratedClass$GeneratedIteratorForCodegenStage3}} 
> and {{GeneratedClass$GeneratedIteratorForCodegenStage6}} in the following 
> stack trace corresponds to the IDs shown in the explain output above:
> {code:none}
> "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 
> nid=NA runnable
>   java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(Whole
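
As a side note, a toy sketch (not Spark's implementation) of the numbering scheme described above, where each {{WholeStageCodegenExec}} wrapper is assigned an ID in depth-first post-order; with the sample plan's shape it reproduces IDs 1 through 6, with the root stage last:

{code:scala}
// A node is either a codegen stage wrapper (codegen = true) or a boundary such as Exchange.
case class Op(name: String, codegen: Boolean, children: List[Op] = Nil)

def numberStages(root: Op): List[(String, Int)] = {
  var nextId = 0
  def visit(op: Op): List[(String, Int)] = {
    val fromChildren = op.children.flatMap(visit)   // visit children first (post-order)
    if (op.codegen) { nextId += 1; fromChildren :+ (op.name -> nextId) }
    else fromChildren
  }
  visit(root)
}

// Rough shape of the sample plan above (operator details elided):
val stageRange10 = Op("WSC(Project+Range 0-10)", codegen = true)
val stageProject = Op("WSC(Project+Sort x)", codegen = true,
  List(Op("Exchange", codegen = false, List(stageRange10))))
val stageSortZ   = Op("WSC(Sort z)", codegen = true,
  List(Op("Exchange", codegen = false, List(stageProject))))
val stageRange5  = Op("WSC(Range 0-5)", codegen = true)
val stageSortId  = Op("WSC(Sort id)", codegen = true,
  List(Op("Exchange", codegen = false, List(stageRange5))))
val plan         = Op("WSC(SortMergeJoin)", codegen = true, List(stageSortZ, stageSortId))

// Prints (name, id) pairs for stages 1 to 6, ending with (WSC(SortMergeJoin),6).
numberStages(plan).foreach(println)
{code}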

[jira] [Assigned] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23032:
---

Assignee: Kris Mok

> Add a per-query codegenStageId to WholeStageCodegenExec
> ---
>
> Key: SPARK-23032
> URL: https://issues.apache.org/jira/browse/SPARK-23032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.3.0
>
>
> Proposing to add a per-query ID to the codegen stages as represented by 
> {{WholeStageCodegenExec}} operators. This ID will be used in
> * the explain output of the physical plan, and in
> * the generated class name.
> Specifically, this ID will be stable within a query, counting up from 1 in 
> depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
> plan.
> The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
> objects, which may have been created for one-off purposes, e.g. for fallback 
> handling of codegen stages that failed to codegen the whole stage and wishes 
> to codegen a subset of the children operators.
> Example: for the following query:
> {code:none}
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
> scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
> 'y).orderBy('x).select('x + 1 as 'z, 'y)
> df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]
> scala> val df2 = spark.range(5)
> df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val query = df1.join(df2, 'z === 'id)
> query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more 
> field]
> {code}
> The explain output before the change is:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *SortMergeJoin [z#9L], [id#13L], Inner
> :- *Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *Range (0, 10, step=1, splits=8)
> +- *Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *Range (0, 5, step=1, splits=8)
> {code}
> Note how codegen'd operators are annotated with a prefix {{"*"}}.
> and after this change it'll be:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *(6) SortMergeJoin [z#9L], [id#13L], Inner
> :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *(1) Range (0, 10, step=1, splits=8)
> +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *(4) Range (0, 5, step=1, splits=8)
> {code}
> Note that the annotated prefix becomes {{"*(id) "}}
> It'll also show up in the name of the generated class, as a suffix in the 
> format of
> {code:none}
> GeneratedClass$GeneratedIterator$id
> {code}
> for example, note how {{GeneratedClass$GeneratedIteratorForCodegenStage3}} 
> and {{GeneratedClass$GeneratedIteratorForCodegenStage6}} in the following 
> stack trace corresponds to the IDs shown in the explain output above:
> {code:none}
> "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 
> nid=NA runnable
>   java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.has

[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-01-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340344#comment-16340344
 ] 

Sean Owen commented on SPARK-22809:
---

Might duplicate SPARK-23159 but I wasn't sure.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}






[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340399#comment-16340399
 ] 

Bago Amirbekian commented on SPARK-23109:
-

[~bryanc] One reason the Python API might differ is that in Python we can use 
`imageRow.height` in place of `getHeight(imageRow)`, so the getters don't add 
much value. Also, `toNDArray` doesn't make sense in Scala. I think we should add 
`columnSchema` to the Python API, but it doesn't need to block the release IMHO.

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet

2018-01-25 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340412#comment-16340412
 ] 

Henry Robinson commented on SPARK-23157:


I'm not sure if this should actually be expected to work. {{Dataset.map()}} 
will always return a dataset with a logical plan that's different to the 
original, so {{ds.map(a => a).col("id")}} has an expression that refers to an 
attribute ID that isn't produced by the original dataset. It seems like the 
requirement for {{ds.withColumn()}} is that the column argument is an 
expression over {{ds}}'s logical plan.

You get the same error doing the following, which is more explicit about these 
being two separate datasets.
{code:java}
scala> val ds = spark.createDataset(Seq(R("1")))
ds: org.apache.spark.sql.Dataset[R] = [id: string]

scala> val ds2 = spark.createDataset(Seq(R("1")))
ds2: org.apache.spark.sql.Dataset[R] = [id: string]

scala> ds.withColumn("id2", ds2.col("id"))
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#113 missing 
from id#1 in operator !Project [id#1, id#113 AS id2#115]. Attribute(s) with the 
same name appear in the operation: id. Please check if the right attribute(s) 
are used.;;
!Project [id#1, id#113 AS id2#115]
+- LocalRelation [id#1]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:297)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3286)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1303)
  at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2185)
  at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2152)
  ... 49 elided
{code}
If the {{map}} function weren't the identity, would you expect this still to 
work?
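
For completeness, a sketch (under the interpretation above, assuming the same {{case class R}} and a spark-shell style session) of forms that do resolve, because the column expression comes from the same Dataset it is applied to:

{code:scala}
import org.apache.spark.sql.functions.col
import spark.implicits._

case class R(id: String)
val ds = spark.createDataset(Seq(R("1")))

// Take the column from the very Dataset that withColumn is called on:
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id"))      // resolves against `mapped`'s own plan

// Or refer to the column by name and let it resolve against the receiver:
ds.map(a => a).withColumn("n", col("id"))
{code}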

> withColumn fails for a column that is a result of mapped DataSet
> 
>
> Key: SPARK-23157
> URL: https://issues.apache.org/jira/browse/SPARK-23157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Having 
> {code:java}
> case class R(id: String)
> val ds = spark.createDataset(Seq(R("1")))
> {code}
> This works:
> {code}
> scala> ds.withColumn("n", ds.col("id"))
> res16: org.apache.spark.sql.DataFrame = [id: string, n: string]
> {code}
> but when we map over ds it fails:
> {code}
> scala> ds.withColumn("n", ds.map(a => a).col("id"))
> org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing 
> from id#4 in operator !Project [id#4, id#55 AS n#57];;
> !Project [id#4, id#55 AS n#57]
> +- LocalRelation [id#4]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1150)
>   at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905)
>   ... 48 elided
> {code}






[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340448#comment-16340448
 ] 

Bago Amirbekian commented on SPARK-23105:
-

[~mlnick] We can update the sub-tasks to target 2.3 if you think it's 
appropriate. I don't know how involved Joseph can be for this release, so we 
might need another committer to shepherd these tasks; I can take on some of them.

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2018-01-25 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340449#comment-16340449
 ] 

Edwina Lu commented on SPARK-9103:
--

We (at LinkedIn) are interested in gathering more memory metrics as well. The 
pull requests for exposing netty and java.nio bufferedPool metrics via the 
Metrics System have been merged. Are the changes for adding the metrics to the 
heartbeat and exposing them via the web UI still being worked on? From pull 
request 17762, it sounds like this may have been replaced by SPARK-21157. 
Having both total memory and netty memory information would be useful.

 [LIHADOOP-34243|https://jira01.corp.linkedin.com:8443/browse/LIHADOOP-34243] 
proposes adding executor-level metrics for JVM used memory, storage memory, and 
execution memory. It also uses the heartbeat to send executor metrics and 
exposes the metrics via the web UI, and could share some of the same 
infrastructure.

 

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>Priority: Major
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only limited memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea of the memory consumption 
> when they run Spark applications that use a lot of memory in the executors. 
> Especially when they encounter an OOM, it's really hard to know the cause of 
> the problem. So it would be helpful to expose detailed memory consumption 
> information for each part of Spark, so that users can see clearly where the 
> memory is actually used. 
> The memory usage info to expose should include, but not be limited to, shuffle, 
> cache, network, serializer, etc.
> Users can optionally choose to enable this functionality since it is mainly 
> for debugging and tuning.






[jira] [Comment Edited] (SPARK-9103) Tracking spark's memory usage

2018-01-25 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340449#comment-16340449
 ] 

Edwina Lu edited comment on SPARK-9103 at 1/26/18 2:08 AM:
---

We (at LinkedIn) are interested in gathering more memory metrics as well. The 
pull requests for exposing netty and java.nio bufferedPool metrics via the 
Metrics System have been merged. Are the changes for adding the metrics to the 
heartbeat and exposing them via the web UI still being worked on? From pull 
request 17762, it sounds like this may have been replaced by SPARK-21157. 
Having both total memory and netty memory information would be useful.

SPARK-23206 proposes adding executor-level metrics for JVM used memory, storage 
memory, and execution memory. It also uses the heartbeat to send executor 
metrics and exposes the metrics via the web UI, and could share some of the 
same infrastructure.

 


was (Author: elu):
We (at LinkedIn) are interested in gathering more memory metrics as well. The 
pull requests for exposing netty and java.nio bufferedPool metrics via Metrics 
System have been merged. Are the changes for adding the metrics to the 
heartbeat and exposing it via the web UI still being worked on? From pull 
request 17762, it sounds like this may have been replaced by SPARK-21157. 
Having both total memory and netty memory information would be useful.

 [LIHADOOP-34243|https://jira01.corp.linkedin.com:8443/browse/LIHADOOP-34243] 
proposes adding executor level metrics for JVM used memory, storage memory, and 
execution memory. It is also using the heartbeat to send executor metrics, and 
exposing the metrics via the web UI, and could share some of the same 
infrastructure.

 

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>Priority: Major
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only limited memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea of the memory consumption 
> when they run Spark applications that use a lot of memory in the executors. 
> Especially when they encounter an OOM, it's really hard to know the cause of 
> the problem. So it would be helpful to expose detailed memory consumption 
> information for each part of Spark, so that users can see clearly where the 
> memory is actually used. 
> The memory usage info to expose should include, but not be limited to, shuffle, 
> cache, network, serializer, etc.
> Users can optionally choose to enable this functionality since it is mainly 
> for debugging and tuning.






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-25 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340474#comment-16340474
 ] 

Edwina Lu commented on SPARK-23206:
---

[~jerryshao], yes  SPARK-9103 is similar, and this proposal (SPARK-23206) seems 
complementary. We are also interested in netty and total memory metrics from  
SPARK-9103 and SPARK-21157. It would be great to share the Heartbeat and 
history logging infrastructure mentioned in both tickets. We'd also like to 
have the metrics exposed via web UI and REST API.

Adding JVM used memory metrics would give an idea about Spark's JVM process, 
and adding executor-level information about execution and storage memory would 
give insight into the unified memory region. Peak execution memory is available 
at the task level, so users could estimate peak execution memory for an executor 
by multiplying by the number of concurrent tasks. Storage memory is shown in the 
storage tab while the application is running, but it isn't visible once the 
application finishes, so it can be hard to examine, and it would be difficult to 
combine with execution memory for an overall view of the unified memory region 
at a per-executor and per-stage level.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
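
To make the wastage formula above concrete, a sketch with invented numbers (not the ticket's actual data):

{code:scala}
// unused = numExecutors * (spark.executor.memory - max JVM used memory)
val executorMemoryGb = 8.0        // spark.executor.memory
val maxJvmUsedGb     = 2.8        // max JVM used memory across executors (~35%)
val numExecutors     = 100
val unusedGb = numExecutors * (executorMemoryGb - maxJvmUsedGb)
println(f"$unusedGb%.0f GB allocated but never used")   // 520 GB
{code}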






[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2018-01-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340488#comment-16340488
 ] 

holdenk commented on SPARK-4502:


[~sameerag] I understand this is a pretty big change to try and get in at this 
point for 2.3.0, but given that it's improving existing functionality, how would 
we feel about a 2.3.1 & 2.4.0 target? (cc [~marmbrus])

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, Spark SQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 






[jira] [Commented] (SPARK-23220) broadcast hint not applied in a streaming left anti join

2018-01-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340495#comment-16340495
 ] 

Liang-Chi Hsieh commented on SPARK-23220:
-

I can't reproduce it locally. I joined a stream with a static DataFrame in a 
similar way, and the broadcast hint does work as expected. Can you provide a 
small reproducible example? Thanks.

> broadcast hint not applied in a streaming left anti join
> 
>
> Key: SPARK-23220
> URL: https://issues.apache.org/jira/browse/SPARK-23220
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Mathieu DESPRIEE
>Priority: Major
> Attachments: Screenshot from 2018-01-25 17-32-45.png
>
>
> We have a structured streaming app doing a left anti-join between a stream 
> and a static DataFrame. The latter is quite small (a few hundred rows), but 
> the query plan by default is a sort-merge join.
>   
>  Sometimes we need to re-process some historical data, so we feed the same app 
> with a FileSource pointing to our S3 storage with all archives. In that 
> situation, the first mini-batch is quite heavy (several hundred thousand 
> input files), and the time spent in the sort-merge join is unacceptable. 
> Additionally, the join is highly skewed, so partition sizes are completely 
> uneven and executors tend to crash with OOMs.
> I tried to switch to a broadcast join, but Spark still applies a sort-merge.
> {noformat}
> ds.join(broadcast(hostnames), Seq("hostname"), "leftanti")
> {noformat}
> !Screenshot from 2018-01-25 17-32-45.png!
> The logical plan is :
> {noformat}
> Project [app_id#5203, <--- snip ---> ... 18 more fields]
> +- Project ...
> <-- snip -->
>  +- Join LeftAnti, (hostname#3584 = hostname#190)
> :- Project [app_id, ...
> <-- snip -->
>+- StreamingExecutionRelation 
> FileStreamSource[s3://{/2018/{01,02}/*/*/,/2017/{08,09,10,11,12}/*/*/}], 
> [app_id
>  <--snip--> ... 62 more fields]
> +- ResolvedHint isBroadcastable=true
>+- Relation[hostname#190,descr#191] 
> RedshiftRelation("PUBLIC"."hostname_filter")
> {noformat}






[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2018-01-25 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340515#comment-16340515
 ] 

Simeon Simeonov commented on SPARK-4502:


+1 [~holdenk] this should be a big boost for any Spark user that is not working 
with flat data. In tests I did a while back, the performance difference between 
a nested and a flat schema was > 3x.

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, Spark SQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 






[jira] [Created] (SPARK-23223) Stacking dataset transforms performs poorly

2018-01-25 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-23223:
-

 Summary: Stacking dataset transforms performs poorly
 Key: SPARK-23223
 URL: https://issues.apache.org/jira/browse/SPARK-23223
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


It is a common pattern to apply multiple transforms to a {{Dataset}} (using 
{{Dataset.withColumn}}, for example). This is currently quite expensive because 
we run {{CheckAnalysis}} on the full plan and create an encoder for each 
intermediate {{Dataset}}.

{{CheckAnalysis}} only needs to be run for the newly added plan components, and 
not for the full plan. The addition of the {{AnalysisBarrier}} created this 
issue.
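
A minimal sketch of the stacking pattern in question (sizes are arbitrary, assuming an active spark session):

{code:scala}
import org.apache.spark.sql.functions.lit

// Each withColumn call produces a new intermediate Dataset, so the analyzed plan
// (and the work done re-analyzing it) grows with every stacked transform.
val base = spark.range(10).toDF("id")
val wide = (1 to 200).foldLeft(base) { (df, i) => df.withColumn(s"c$i", lit(i)) }
wide.explain(true)   // the analyzed plan contains one nested Project per withColumn call
{code}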






[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340521#comment-16340521
 ] 

Saisai Shao commented on SPARK-23187:
-

I'm going to close this JIRA, as there's no issue in reporting accumulators via 
the heartbeat; according to my verification, they are reported periodically. 
Please feel free to reopen it if you have further issues.

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent 
> to the driver (on the receiving side all values are zero).
> I wrote a unit test to illustrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 are both going to fail.






[jira] [Commented] (SPARK-23223) Stacking dataset transforms performs poorly

2018-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340520#comment-16340520
 ] 

Apache Spark commented on SPARK-23223:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/20402

> Stacking dataset transforms performs poorly
> ---
>
> Key: SPARK-23223
> URL: https://issues.apache.org/jira/browse/SPARK-23223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>
> It is a common pattern to apply multiple transforms to a {{Dataset}} (using 
> {{Dataset.withColumn}}, for example). This is currently quite expensive because 
> we run {{CheckAnalysis}} on the full plan and create an encoder for each 
> intermediate {{Dataset}}.
> {{CheckAnalysis}} only needs to be run for the newly added plan components, 
> and not for the full plan. The addition of the {{AnalysisBarrier}} created 
> this issue.






[jira] [Assigned] (SPARK-23223) Stacking dataset transforms performs poorly

2018-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23223:


Assignee: Herman van Hovell  (was: Apache Spark)

> Stacking dataset transforms performs poorly
> ---
>
> Key: SPARK-23223
> URL: https://issues.apache.org/jira/browse/SPARK-23223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>
> It is a common pattern to apply multiple transforms to a {{Dataset}} (using 
> {{Dataset.withColumn}}, for example). This is currently quite expensive because 
> we run {{CheckAnalysis}} on the full plan and create an encoder for each 
> intermediate {{Dataset}}.
> {{CheckAnalysis}} only needs to be run for the newly added plan components, 
> and not for the full plan. The addition of the {{AnalysisBarrier}} created 
> this issue.






[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340522#comment-16340522
 ] 

Saisai Shao commented on SPARK-23187:
-

I'm going to close this JIRA, as there's no issue in reporting accumulators via 
the heartbeat; according to my verification, they are reported periodically. 
Please feel free to reopen it if you have further issues.

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent 
> to the driver (on the receiving side all values are zero).
> I wrote a unit test to illustrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 are both going to fail.






[jira] [Assigned] (SPARK-23223) Stacking dataset transforms performs poorly

2018-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23223:


Assignee: Apache Spark  (was: Herman van Hovell)

> Stacking dataset transforms performs poorly
> ---
>
> Key: SPARK-23223
> URL: https://issues.apache.org/jira/browse/SPARK-23223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>
> It is a common pattern to apply multiple transforms to a {{Dataset}} (using 
> {{Dataset.withColumn}}, for example). This is currently quite expensive because 
> we run {{CheckAnalysis}} on the full plan and create an encoder for each 
> intermediate {{Dataset}}.
> {{CheckAnalysis}} only needs to be run for the newly added plan components, 
> and not for the full plan. The addition of the {{AnalysisBarrier}} created 
> this issue.






[jira] [Resolved] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-25 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23187.
-
Resolution: Not A Problem

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent 
> to the driver (on the receiving side all values are zero).
> I wrote a unit test to illustrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340536#comment-16340536
 ] 

Saisai Shao commented on SPARK-23206:
-

Would you please summarize which metrics you want to monitor and how you plan 
to expose them? I believe most of the metrics you mentioned above are already 
fully or partly tracked, either via accumulators (heartbeat) or the metrics 
system, and are exposed through the REST API / web UI or a metrics sink.

 

Currently we don't have a table listing all the existing metrics. It would help 
to enumerate the existing metrics, what you want to add on top of them, and how 
the new metrics should be exposed. 

 

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and the 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data 
> across the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
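> As a back-of-the-envelope illustration of the unused-memory estimate above (all 
> numbers below are hypothetical and only exercise the formula; they are not 
> measurements from our clusters):
> {code:scala}
> // unused ~ numExecutors * (spark.executor.memory - max JVM used memory)
> val executorMemoryGb = 8.0   // configured spark.executor.memory
> val maxJvmUsedGb     = 2.8   // ~35% of the configured executor memory
> val numExecutors     = 200
> val unusedGb = numExecutors * (executorMemoryGb - maxJvmUsedGb)  // ~1040 GB
> {code}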



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2018-01-25 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340544#comment-16340544
 ] 

Yin Huai commented on SPARK-4502:
-

I think it makes sense to target 2.4.0. Since this is not a bug fix, it is not 
suitable for a maintenance release like 2.3.1.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, Spark SQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In both 
> cases, the same number of bytes (99365194 bytes) was read. 
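> For reference, a minimal sketch of the access pattern discussed above (the 
> Parquet path is made up and the modern DataFrame/SQL API is used; this only 
> shows the shape of the query, not a fix):
> {code:scala}
> // Load the tweets dataset and query a single leaf field of the nested User struct.
> val tweets = spark.read.parquet("/data/tweets")
> tweets.createOrReplaceTempView("Tweets")
>
> // Ideally only User.contributors_enabled would be read from Parquet,
> // but the entire User struct is currently read and assembled.
> spark.sql("SELECT User.contributors_enabled FROM Tweets").count()
> {code}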



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21601) Modify the JDK version of the Maven compilation

2018-01-25 Thread jifei_yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340568#comment-16340568
 ] 

jifei_yang commented on SPARK-21601:


Thanks.

> Modify the JDK version of the Maven compilation
> ---
>
> Key: SPARK-21601
> URL: https://issues.apache.org/jira/browse/SPARK-21601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: jifei_yang
>Priority: Minor
>
> When using Maven to compile Spark, I would like to be able to override the JDK 
> version via a build property. That would be more user-friendly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23224) union all will throw grammar exception

2018-01-25 Thread chenyukang (JIRA)
chenyukang created SPARK-23224:
--

 Summary: union all will throw grammar exception
 Key: SPARK-23224
 URL: https://issues.apache.org/jira/browse/SPARK-23224
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: chenyukang


when keyword "limit " in first sub query , this query will fail with gramma 
exception
{code:java}
spark-sql>
 >
 > insert overwrite table tmp_wjmdb.tmp_cyk_test1
 > select * from tmp_wjmdb.abctest limit 10
 > union all
 > select * from tmp_wjmdb.abctest limit 20;
18/01/26 12:18:58 INFO SparkSqlParser: Parsing command: insert overwrite table 
tmp_wjmdb.tmp_cyk_test1
select * from tmp_wjmdb.abctest limit 10
union all
select * from tmp_wjmdb.abctest limit 20
Error in query:
mismatched input 'union' expecting {, '.', '[', 'OR', 'AND', 'IN', NOT, 
'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', 
'-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 3, pos 0)

== SQL ==
insert overwrite table tmp_wjmdb.tmp_cyk_test1
select * from tmp_wjmdb.abctest limit 10
union all
^^^
select * from tmp_wjmdb.abctest limit 20
{code}
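A common workaround (untested here, shown only as a sketch via {{spark.sql}} in a 
spark-shell session) is to parenthesize each sub-query so that every LIMIT binds 
to its own query block rather than to the whole UNION:
{code:scala}
// Equivalent query with parenthesized sub-queries (table names as in the report).
spark.sql("""
  INSERT OVERWRITE TABLE tmp_wjmdb.tmp_cyk_test1
  (SELECT * FROM tmp_wjmdb.abctest LIMIT 10)
  UNION ALL
  (SELECT * FROM tmp_wjmdb.abctest LIMIT 20)
""")
{code}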



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Priority: Blocker  (was: Major)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Blocker
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049
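> A rough sketch of the kind of misuse this is about (the column names and split 
> values are made up; the snippet only illustrates the intended failure mode, not 
> the implementation of the check):
> {code:scala}
> import org.apache.spark.ml.feature.Bucketizer
>
> // Both the single-column and multi-column params are set; under this proposal
> // fitting/transforming should fail fast with a clear exception instead of
> // silently preferring one set of params.
> val bucketizer = new Bucketizer()
>   .setInputCol("a")                      // single-column param
>   .setInputCols(Array("a", "b"))         // multi-column param
>   .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
> {code}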



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-01-25 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23200.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23163.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23163:
--

Assignee: Bryan Cutler

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23109:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-01-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340648#comment-16340648
 ] 

Saisai Shao commented on SPARK-23200:
-

Issue resolved by pull request 20383

https://github.com/apache/spark/pull/20383

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340645#comment-16340645
 ] 

Nick Pentreath commented on SPARK-23106:


Thanks [~bago.amirbekian]. However, running MiMa is not enough for this task, 
since some merged PRs add MiMa exclusions. So typically, to be safe, we would 
also double-check the MiMa exclusions added for ML during the release cycle, to 
ensure they are valid, i.e. genuinely false positives, most commonly due to 
changes made to private classes that MiMa picks up even though they are not 
part of the public API.
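For context, the entries being audited live in project/MimaExcludes.scala and 
look roughly like the following (a made-up exclusion, not one actually added 
this cycle):
{code:scala}
import com.typesafe.tools.mima.core._

// Hypothetical exclusion: suppresses a MiMa report about a method removed from
// a class that is not part of the public API.
ProblemFilters.exclude[DirectMissingMethodProblem](
  "org.apache.spark.ml.feature.SomeInternalHelper.someRemovedMethod")
{code}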

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23106:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Target Version/s: 2.3.0  (was: 2.4.0)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-01-25 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-23200:
---

Assignee: Anirudh Ramanathan

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340653#comment-16340653
 ] 

Nick Pentreath commented on SPARK-23109:


[~bryanc] can you add a Jira for adding {{columnSchema}} to Python?

Then, if there is nothing else here, I can resolve this ticket (note this ticket 
is for auditing, not for fixing all the issues, so anything outstanding won't 
block the release).

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


