[jira] [Assigned] (SPARK-8240) string function: concat
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin reassigned SPARK-8240:
    Assignee: Reynold Xin (was: Cheng Hao)

> string function: concat
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order. For example, concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8240) string function: concat
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632294#comment-14632294 ]

Reynold Xin commented on SPARK-8240:
[~adrian-wang] I had some time tonight and wrote a version of this that has codegen and avoids conversion back and forth between String and UTF8String.

> string function: concat
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order. For example, concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.
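The concat semantics described in this ticket can be sketched in plain Python. This is only an illustration of the documented behavior, not Spark's implementation (which is Scala with codegen); the None-propagation rule mirrors Hive's concat and is an assumption here, not something stated in the ticket:

```python
def concat(*args):
    # Sketch of the described semantics: join all string arguments in order.
    # Null propagation (None in -> None out) follows Hive's concat behavior;
    # this rule is an assumption, not taken from the ticket text.
    if any(a is None for a in args):
        return None
    return "".join(args)

print(concat("foo", "bar"))  # -> foobar
```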
[jira] [Updated] (SPARK-7218) Create a real iterator with open/close for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7218:
    Target Version/s: 1.6.0

> Create a real iterator with open/close for Spark SQL
>
> Key: SPARK-7218
> URL: https://issues.apache.org/jira/browse/SPARK-7218
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
[jira] [Commented] (SPARK-8240) string function: concat
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632292#comment-14632292 ]

Apache Spark commented on SPARK-8240:
User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7486

> string function: concat
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Cheng Hao
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order. For example, concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.
[jira] [Commented] (SPARK-9149) Add an example of spark.ml KMeans
[ https://issues.apache.org/jira/browse/SPARK-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632286#comment-14632286 ]

Yu Ishikawa commented on SPARK-9149:
Please assign this to me?

> Add an example of spark.ml KMeans
>
> Key: SPARK-9149
> URL: https://issues.apache.org/jira/browse/SPARK-9149
> Project: Spark
> Issue Type: Sub-task
> Components: Examples, ML
> Reporter: Yu Ishikawa
> Fix For: 1.5.0
>
> Create an example of KMeans API for spark.ml.
[jira] [Updated] (SPARK-9149) Add an example of spark.ml KMeans
[ https://issues.apache.org/jira/browse/SPARK-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Ishikawa updated SPARK-9149:
    Component/s: Examples

> Add an example of spark.ml KMeans
>
> Key: SPARK-9149
> URL: https://issues.apache.org/jira/browse/SPARK-9149
> Project: Spark
> Issue Type: Sub-task
> Components: Examples, ML
> Reporter: Yu Ishikawa
> Fix For: 1.5.0
>
> Create an example of KMeans API for spark.ml.
[jira] [Created] (SPARK-9149) Add an example of spark.ml KMeans
Yu Ishikawa created SPARK-9149:

Summary: Add an example of spark.ml KMeans
Key: SPARK-9149
URL: https://issues.apache.org/jira/browse/SPARK-9149
Project: Spark
Issue Type: Sub-task
Components: ML
Reporter: Yu Ishikawa

Create an example of KMeans API for spark.ml.
[jira] [Commented] (SPARK-8916) Add @since tags to mllib.regression
[ https://issues.apache.org/jira/browse/SPARK-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632262#comment-14632262 ]

Prayag Chandran Nirmala commented on SPARK-8916:
I would like to take this up, if that's okay.

> Add @since tags to mllib.regression
>
> Key: SPARK-8916
> URL: https://issues.apache.org/jira/browse/SPARK-8916
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, MLlib
> Reporter: Xiangrui Meng
> Priority: Minor
> Labels: starter
> Original Estimate: 1h
> Remaining Estimate: 1h
[jira] [Resolved] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
[ https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-9118.
    Resolution: Fixed
    Fix Version/s: 1.5.0 (was: 1.4.0)

Issue resolved by pull request 7481
[https://github.com/apache/spark/pull/7481]

> Implement integer array parameters for ml.param as IntArrayParam
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.0
> Reporter: Alexander Ulanov
> Priority: Minor
> Fix For: 1.5.0
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter. It is needed for some models, such as the multilayer perceptron, to specify the layer sizes. I suggest implementing it as IntArrayParam, similarly to DoubleArrayParam.
[jira] [Updated] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
[ https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9118:
    Assignee: Rekha Joshi

> Implement integer array parameters for ml.param as IntArrayParam
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.0
> Reporter: Alexander Ulanov
> Assignee: Rekha Joshi
> Priority: Minor
> Fix For: 1.5.0
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter. It is needed for some models, such as the multilayer perceptron, to specify the layer sizes. I suggest implementing it as IntArrayParam, similarly to DoubleArrayParam.
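As a rough illustration of what the ticket asks for, here is a hypothetical Python sketch of a validated integer-array parameter, mirroring the DoubleArrayParam pattern the description refers to. The real change lives in Scala (ml/param/params.scala); every name below is invented for illustration:

```python
class IntArrayParam:
    """Hypothetical sketch of a typed array parameter: it validates that
    every element is an int before accepting the value, e.g. the layer
    sizes of a multilayer perceptron."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc

    def validate(self, value):
        # Reject anything that is not a list of ints.
        if not isinstance(value, list) or not all(isinstance(v, int) for v in value):
            raise TypeError("%s must be a list of ints, got %r" % (self.name, value))
        return value

layers = IntArrayParam("layers", "sizes of network layers").validate([784, 100, 10])
```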
[jira] [Updated] (SPARK-8246) string function: get_json_object
[ https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-8246:
    Assignee: Nathan Howell (was: Cheng Hao)

> string function: get_json_object
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Nathan Howell
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.
[jira] [Assigned] (SPARK-8246) string function: get_json_object
[ https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8246:
    Assignee: Apache Spark (was: Cheng Hao)

> string function: get_json_object
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Apache Spark
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.
[jira] [Assigned] (SPARK-8246) string function: get_json_object
[ https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8246:
    Assignee: Cheng Hao (was: Apache Spark)

> string function: get_json_object
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Cheng Hao
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.
[jira] [Commented] (SPARK-8246) string function: get_json_object
[ https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632218#comment-14632218 ]

Apache Spark commented on SPARK-8246:
User 'NathanHowell' has created a pull request for this issue: https://github.com/apache/spark/pull/7485

> string function: get_json_object
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Cheng Hao
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.
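For a sense of what get_json_object does, here is a minimal Python sketch that handles only simple '$.a.b' paths. The real Hive/Spark function supports a much richer path syntax (array indexing, wildcards), so this is an approximation of the semantics, not the implementation:

```python
import json

def get_json_object(json_string, path):
    # Minimal sketch: parse the JSON, walk dotted keys after '$',
    # return None (SQL null) on invalid JSON or a missing key.
    try:
        obj = json.loads(json_string)
    except ValueError:
        return None
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    # Nested structures come back re-serialized, scalars as strings,
    # matching the string return type in the signature.
    return json.dumps(obj) if isinstance(obj, (dict, list)) else str(obj)

print(get_json_object('{"a": {"b": "c"}}', "$.a.b"))  # -> c
```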
[jira] [Created] (SPARK-9148) User-facing documentation for NaN handling semantics
Josh Rosen created SPARK-9148:

Summary: User-facing documentation for NaN handling semantics
Key: SPARK-9148
URL: https://issues.apache.org/jira/browse/SPARK-9148
Project: Spark
Issue Type: Sub-task
Components: Documentation, SQL
Reporter: Josh Rosen

Once we've finalized our NaN changes for Spark 1.5, we need to create user-facing documentation to explain our chosen semantics.
[jira] [Assigned] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin reassigned SPARK-8159:
    Assignee: Reynold Xin

> Improve SQL/DataFrame expression coverage
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions support).
> 4. If applicable, add a new function for DataFrame in org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for Python.
> For date/time functions, put them in expressions/datetime.scala, and create a DateTimeFunctionSuite.scala for testing.
[jira] [Assigned] (SPARK-8947) Improve expression type coercion, casting & checking
[ https://issues.apache.org/jira/browse/SPARK-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin reassigned SPARK-8947:
    Assignee: Reynold Xin

> Improve expression type coercion, casting & checking
>
> Key: SPARK-8947
> URL: https://issues.apache.org/jira/browse/SPARK-8947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> This is an umbrella ticket to improve type casting & checking.
[jira] [Commented] (SPARK-8846) Maintain binary compatibility for in function
[ https://issues.apache.org/jira/browse/SPARK-8846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632208#comment-14632208 ]

Reynold Xin commented on SPARK-8846:
[~yuu.ishik...@gmail.com] just a reminder.

> Maintain binary compatibility for in function
>
> Key: SPARK-8846
> URL: https://issues.apache.org/jira/browse/SPARK-8846
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
>
> In order to maintain binary compatibility, we should add a new "in" function that takes Any, rather than changing the existing one.
> cc [~yuu.ishik...@gmail.com] can you work on this?
[jira] [Assigned] (SPARK-9146) NaN should be greater than all other values
[ https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9146:
    Assignee: Josh Rosen (was: Apache Spark)

> NaN should be greater than all other values
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Josh Rosen
> Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other non-NaN numeric values.
[jira] [Assigned] (SPARK-9076) Improve NaN value handling
[ https://issues.apache.org/jira/browse/SPARK-9076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin reassigned SPARK-9076:
    Assignee: Reynold Xin

> Improve NaN value handling
>
> Key: SPARK-9076
> URL: https://issues.apache.org/jira/browse/SPARK-9076
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> This is an umbrella ticket for handling NaN values.
> For general design, please see https://issues.apache.org/jira/browse/SPARK-9079
[jira] [Updated] (SPARK-9076) Improve NaN value handling
[ https://issues.apache.org/jira/browse/SPARK-9076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9076:
    Description:
        This is an umbrella ticket for handling NaN values.
        For general design, please see https://issues.apache.org/jira/browse/SPARK-9079
    (was: This is an umbrella ticket for handling NaN values.)

> Improve NaN value handling
>
> Key: SPARK-9076
> URL: https://issues.apache.org/jira/browse/SPARK-9076
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
>
> This is an umbrella ticket for handling NaN values.
> For general design, please see https://issues.apache.org/jira/browse/SPARK-9079
[jira] [Assigned] (SPARK-9146) NaN should be greater than all other values
[ https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9146:
    Assignee: Apache Spark (was: Josh Rosen)

> NaN should be greater than all other values
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Apache Spark
> Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other non-NaN numeric values.
[jira] [Commented] (SPARK-9146) NaN should be greater than all other values
[ https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632207#comment-14632207 ]

Apache Spark commented on SPARK-9146:
User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7194

> NaN should be greater than all other values
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Josh Rosen
> Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other non-NaN numeric values.
[jira] [Updated] (SPARK-8797) Sorting float/double column containing NaNs can lead to "Comparison method violates its general contract!" errors
[ https://issues.apache.org/jira/browse/SPARK-8797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-8797:
    Assignee: Josh Rosen

> Sorting float/double column containing NaNs can lead to "Comparison method violates its general contract!" errors
>
> Key: SPARK-8797
> URL: https://issues.apache.org/jira/browse/SPARK-8797
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 1.3.0, 1.4.0, 1.5.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Critical
>
> When sorting a float or double column that contains NaN (not a number) values, TimSort may throw a "Comparison method violates its general contract!" error.
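The contract violation is easy to reproduce outside Spark: a comparator built naively on `<` and `>` reports NaN as "equal" to every value, and that inconsistency (loss of transitivity) is what makes Java's TimSort throw. A minimal Python sketch of the broken comparator:

```python
def naive_cmp(a, b):
    # NaN compares False to everything under <, >, and ==, so this
    # comparator falls through to 0 ("equal") for NaN against any value.
    # That claim is inconsistent with 1.0 < 2.0, which is exactly the
    # kind of contract violation TimSort detects and reports.
    if a < b:
        return -1
    if a > b:
        return 1
    return 0

nan = float("nan")
print(naive_cmp(nan, 1.0), naive_cmp(nan, 2.0), naive_cmp(1.0, 2.0))
```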
[jira] [Updated] (SPARK-6573) Convert inbound NaN values as null
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6573:
    Assignee: Davies Liu

> Convert inbound NaN values as null
>
> Key: SPARK-6573
> URL: https://issues.apache.org/jira/browse/SPARK-6573
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Fabian Boehnlein
> Assignee: Davies Liu
>
> In pandas it is common to use numpy.nan as the null value, for missing data or whatever.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame however only works with None as null values, parsing them as None in the RDD.
> I suggest to add support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a DataFrame with object-type columns containing np.nan values (which are floats):
> {code}
> TypeError                       Traceback (most recent call last)
> in ()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339         schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340
> --> 341         return self.applySchema(data, schema)
>     342
>     343     def registerDataFrameAsTable(self, rdd, tableName):
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249
>     250         # convert python objects to sql data
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                 "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067
>    1068 _cached_cls = weakref.WeakValueDictionary()
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051
>    1052     if isinstance(dataType, ArrayType):
>
> TypeError: StringType can not accept object in type
> {code}
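The conversion the ticket proposes can be sketched as a simple inbound scrub, shown here in plain Python. This illustrates the idea only; it is not the actual pyspark code path, and `nan_to_none` is an invented helper name:

```python
import math

def nan_to_none(value):
    # Sketch of the proposed inbound conversion: treat a float NaN coming
    # from a pandas DataFrame as SQL null (None) instead of letting it
    # fail the per-column type check shown in the traceback above.
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

row = ["a", float("nan"), 1.5]
print([nan_to_none(v) for v in row])  # -> ['a', None, 1.5]
```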
[jira] [Updated] (SPARK-9146) NaN should be greater than all other values
[ https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9146:
    Assignee: Josh Rosen

> NaN should be greater than all other values
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Josh Rosen
> Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other non-NaN numeric values.
[jira] [Resolved] (SPARK-7879) KMeans API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-7879.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 6756
[https://github.com/apache/spark/pull/6756]

> KMeans API for spark.ml Pipelines
>
> Key: SPARK-7879
> URL: https://issues.apache.org/jira/browse/SPARK-7879
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Yu Ishikawa
> Priority: Critical
> Fix For: 1.5.0
>
> Create a K-Means API for the spark.ml Pipelines API. This should wrap the existing KMeans implementation in spark.mllib.
> This should be the first clustering method added to Pipelines, and it will be important to consider [SPARK-7610] and think about designing the clustering API. We do not have to have abstractions from the beginning (and probably should not) but should think far enough ahead so we can add abstractions later on.
[jira] [Resolved] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8281.
    Resolution: Fixed
    Assignee: Yijie Shen
    Fix Version/s: 1.5.0

> udf_asin and udf_acos test failure
>
> Key: SPARK-8281
> URL: https://issues.apache.org/jira/browse/SPARK-8281
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Yijie Shen
> Priority: Blocker
> Fix For: 1.5.0
>
> acos/asin in Hive returns NaN for "not a number" inputs, whereas we always return null.
[jira] [Resolved] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8280.
    Resolution: Fixed
    Assignee: Yijie Shen
    Fix Version/s: 1.5.0

> udf7 failed due to null vs nan semantics
>
> Key: SPARK-8280
> URL: https://issues.apache.org/jira/browse/SPARK-8280
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Yijie Shen
> Priority: Blocker
> Fix For: 1.5.0
>
> To execute:
> {code}
> sbt/sbt -Phive -Dspark.hive.whitelist="udf7.*" "hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
> {code}
> If we want to be consistent with Hive, we need to special-case our log function.
[jira] [Created] (SPARK-9147) UnsafeRow should canonicalize NaN values
Reynold Xin created SPARK-9147:

Summary: UnsafeRow should canonicalize NaN values
Key: SPARK-9147
URL: https://issues.apache.org/jira/browse/SPARK-9147
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

NaN has many different representations in raw bytes. When we set a double/float value, we should check whether it is NaN and, if so, store a canonicalized binary representation, so we can do comparisons on bytes directly.
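The underlying problem and the proposed fix can be demonstrated with Python's struct module: IEEE 754 permits many NaN bit patterns, so byte-wise row comparison only works if writes are funneled through one canonical pattern. The canonical constant below (the common quiet-NaN pattern) is an assumption for illustration, not a value taken from the ticket:

```python
import math
import struct

CANONICAL_NAN_BITS = 0x7FF8000000000000  # assumed canonical quiet-NaN pattern

def double_bits(x):
    # Raw IEEE 754 bit pattern of a Python float (a C double).
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def canonicalize(x):
    # Sketch of the proposed fix: rewrite any NaN to one canonical
    # bit pattern before storing, so bytes can be compared directly.
    if math.isnan(x):
        return struct.unpack("<d", struct.pack("<Q", CANONICAL_NAN_BITS))[0]
    return x

# A NaN with a non-canonical payload: same "value", different bytes.
other_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]
```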
[jira] [Created] (SPARK-9146) NaN should be greater than all other values
Reynold Xin created SPARK-9146:

Summary: NaN should be greater than all other values
Key: SPARK-9146
URL: https://issues.apache.org/jira/browse/SPARK-9146
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Priority: Critical

Based on the design in SPARK-9079, NaN should be greater than all other non-NaN numeric values.
[jira] [Created] (SPARK-9145) Equality test on NaN = NaN should return true
Reynold Xin created SPARK-9145:

Summary: Equality test on NaN = NaN should return true
Key: SPARK-9145
URL: https://issues.apache.org/jira/browse/SPARK-9145
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Priority: Critical

Based on the design in SPARK-9079, we want NaN = NaN to return true in SQL/DataFrame.
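The gap between IEEE 754 equality and the proposed SQL/DataFrame semantics can be shown in a few lines of Python. This is a sketch of the rule only, not Spark's implementation, and `sql_equals` is an invented name:

```python
import math

def sql_equals(a, b):
    # Sketch of the proposed semantics: NaN = NaN is true, unlike
    # IEEE 754 comparison where NaN != NaN.
    if isinstance(a, float) and isinstance(b, float) \
            and math.isnan(a) and math.isnan(b):
        return True
    return a == b

nan = float("nan")
print(nan == nan)            # IEEE 754: False
print(sql_equals(nan, nan))  # proposed SQL semantics: True
```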
[jira] [Resolved] (SPARK-9079) Design NaN semantics
[ https://issues.apache.org/jira/browse/SPARK-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-9079.
    Resolution: Fixed
    Assignee: Michael Armbrust
    Fix Version/s: 1.5.0

> Design NaN semantics
>
> Key: SPARK-9079
> URL: https://issues.apache.org/jira/browse/SPARK-9079
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Michael Armbrust
> Fix For: 1.5.0
>
> 1. What should NaN = NaN return?
>    NaN = NaN should return true.
> 2. If we see NaN in the group by key column, should we group NaN values into one group, or into different groups?
>    All NaN values should be grouped together.
> 3. What about NaN in join keys?
>    NaN should be treated as a normal value in join keys.
> 4. When aggregating over columns containing NaN, should the result be NaN, or should the result exclude NaN values (treating them like nulls)?
>    This is TO BE DECIDED. By default, the behavior is to return NaN.
> 5. Where should NaN go in sorting?
>    NaN should go last when in ascending order, larger than any other numeric value.
> Note that 5 is much more important than the other 4, since right now the sorter throws exceptions on NaN values. See SPARK-8797.
[jira] [Updated] (SPARK-9079) Design NaN semantics
[ https://issues.apache.org/jira/browse/SPARK-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9079:
    Description:
        1. What should NaN = NaN return?
           NaN = NaN should return true.
        2. If we see NaN in the group by key column, should we group NaN values into one group, or into different groups?
           All NaN values should be grouped together.
        3. What about NaN in join keys?
           NaN should be treated as a normal value in join keys.
        4. When aggregating over columns containing NaN, should the result be NaN, or should the result exclude NaN values (treating them like nulls)?
        5. Where should NaN go in sorting?
           NaN should go last when in ascending order, larger than any other numeric value.
        Note that 5 is much more important than the other 4 since right now the sorter throws exceptions on NaN values. See SPARK-8797.
    (was:
        1. What should NaN = NaN return?
        2. If we see NaN in the group by key column, should we group NaN values into one group, or into different groups?
        3. What about NaN in join keys?
        4. When aggregating over columns containing NaN, should the result be NaN, or should the result exclude NaN values (treating them like nulls)?
        5. Where should NaN go in sorting?
        Note that 5 is much more important than the other 4 since right now the sorter throws exceptions on NaN values. See SPARK-8797.)

> Design NaN semantics
>
> Key: SPARK-9079
> URL: https://issues.apache.org/jira/browse/SPARK-9079
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Fix For: 1.5.0
>
> 1. What should NaN = NaN return?
>    NaN = NaN should return true.
> 2. If we see NaN in the group by key column, should we group NaN values into one group, or into different groups?
>    All NaN values should be grouped together.
> 3. What about NaN in join keys?
>    NaN should be treated as a normal value in join keys.
> 4. When aggregating over columns containing NaN, should the result be NaN, or should the result exclude NaN values (treating them like nulls)?
> 5. Where should NaN go in sorting?
>    NaN should go last when in ascending order, larger than any other numeric value.
> Note that 5 is much more important than the other 4 since right now the sorter throws exceptions on NaN values. See SPARK-8797.
[jira] [Updated] (SPARK-9079) Design NaN semantics
[ https://issues.apache.org/jira/browse/SPARK-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9079: --- Description: 1. What should NaN = NaN return? NaN = NaN should return true. 2. If we see NaN in the group by key column, should we group NaN values into one group, or into different groups? All NaN values should be grouped together. 3. What about NaN in join keys? NaN should be treated as a normal value in join keys. 4. When aggregating over columns containing NaN, should the result be NaN, or should the result exclude NaN values (treating them like nulls)? This is TO BE DECIDED. By default, the behavior is to return NaN. 5. Where should NaN go in sorting? NaN should go last when in ascending order, larger than any other numeric value. Note that 5 is much more important than the other 4 since right now the sorter throws exceptions on NaN values. See SPARK-8797. was: 1. What should NaN = NaN return? NaN = NaN should return true. 2. If we see NaN in the group by key column, should we group NaN values into one group, or into different groups? All NaN values should be grouped together. 3. What about NaN in join keys? NaN should be treated as a normal value in join keys. 4. When aggregating over columns containing NaN, should the result be NaN, or should the result exclude NaN values (treating them like nulls)? 5. Where should NaN go in sorting? NaN should go last when in ascending order, larger than any other numeric value. Note that 5 is much more important than the other 4 since right now the sorter throws exceptions on NaN values. See SPARK-8797. > Design NaN semantics > > > Key: SPARK-9079 > URL: https://issues.apache.org/jira/browse/SPARK-9079 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 1.5.0 > > > 1. What should NaN = NaN return? > NaN = NaN should return true. > 2. 
If we see NaN in the group by key column, should we group NaN values into > one group, or into different groups? > All NaN values should be grouped together. > 3. What about NaN in join keys? > NaN should be treated as a normal value in join keys. > 4. When aggregating over columns containing NaN, should the result be NaN, or > should the result exclude NaN values (treating them like nulls)? > This is TO BE DECIDED. By default, the behavior is to return NaN. > 5. Where should NaN go in sorting? > NaN should go last when in ascending order, larger than any other numeric > value. > Note that 5 is much more important than the other 4 since right now the > sorter throws exceptions on NaN values. See SPARK-8797. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7026) LeftSemiJoin can not work when it has both equal condition and not equal condition.
[ https://issues.apache.org/jira/browse/SPARK-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7026. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5643 [https://github.com/apache/spark/pull/5643] > LeftSemiJoin can not work when it has both equal condition and not equal > condition. > - > > Key: SPARK-7026 > URL: https://issues.apache.org/jira/browse/SPARK-7026 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Zhongshuai Pei >Assignee: Adrian Wang > Fix For: 1.5.0 > > > Run sql like that > {panel} > select * > from > web_sales ws1 > left semi join > web_sales ws2 > on ws1.ws_order_number = ws2.ws_order_number > and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk > {panel} > then get an exception > {panel} > Couldn't find ws_warehouse_sk#287 in > {ws_sold_date_sk#237,ws_sold_time_sk#238,ws_ship_date_sk#239,ws_item_sk#240,ws_bill_customer_sk#241,ws_bill_cdemo_sk#242,ws_bill_hdemo_sk#243,ws_bill_addr_sk#244,ws_ship_customer_sk#245,ws_ship_cdemo_sk#246,ws_ship_hdemo_sk#247,ws_ship_addr_sk#248,ws_web_page_sk#249,ws_web_site_sk#250,ws_ship_mode_sk#251,ws_warehouse_sk#252,ws_promo_sk#253,ws_order_number#254,ws_quantity#255,ws_wholesale_cost#256,ws_list_price#257,ws_sales_price#258,ws_ext_discount_amt#259,ws_ext_sales_price#260,ws_ext_wholesale_cost#261,ws_ext_list_price#262,ws_ext_tax#263,ws_coupon_amt#264,ws_ext_ship_cost#265,ws_net_paid#266,ws_net_paid_inc_tax#267,ws_net_paid_inc_ship#268,ws_net_paid_inc_ship_tax#269,ws_net_profit#270,ws_sold_date#236} > at scala.sys.package$.error(package.scala:27) > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
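The semantics the failing query expects can be sketched in plain Python (a toy model, not Spark's physical operator): a left semi join keeps each left row that has at least one right match satisfying both the equality condition and the inequality condition. Column names follow the ticket's query.

```python
def left_semi_join(left, right):
    """Keep left rows with >= 1 right match where the order numbers are
    equal AND the warehouse keys differ (the ticket's join condition)."""
    out = []
    for l in left:
        if any(l["ws_order_number"] == r["ws_order_number"]
               and l["ws_warehouse_sk"] != r["ws_warehouse_sk"]
               for r in right):
            out.append(l)
    return out

ws1 = [{"ws_order_number": 1, "ws_warehouse_sk": 10},
       {"ws_order_number": 2, "ws_warehouse_sk": 20}]
ws2 = [{"ws_order_number": 1, "ws_warehouse_sk": 11},
       {"ws_order_number": 2, "ws_warehouse_sk": 20}]
print(left_semi_join(ws1, ws2))  # only order 1 qualifies
```

The bug arose because the non-equality part of the condition references right-side columns after the equality keys had been extracted, so the planner could not resolve `ws_warehouse_sk` on the streamed side.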
[jira] [Commented] (SPARK-9081) fillna/dropna should also fill/drop NaN values in addition to null values
[ https://issues.apache.org/jira/browse/SPARK-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632107#comment-14632107 ] Reynold Xin commented on SPARK-9081: [~yijieshen] can you take this one? > fillna/dropna should also fill/drop NaN values in addition to null values > - > > Key: SPARK-9081 > URL: https://issues.apache.org/jira/browse/SPARK-9081 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
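SQL NULL and floating-point NaN are distinct values in Spark; the proposal is for fillna/dropna to treat both as missing. A hedged pure-Python sketch of that unified "missing" test:

```python
import math

def is_missing(v):
    # Treat both null (None) and float NaN as missing, per the proposal.
    return v is None or (isinstance(v, float) and math.isnan(v))

def fillna(values, fill):
    return [fill if is_missing(v) else v for v in values]

def dropna(values):
    return [v for v in values if not is_missing(v)]

col = [1.0, None, float("nan"), 4.0]
print(fillna(col, 0.0))  # [1.0, 0.0, 0.0, 4.0]
print(dropna(col))       # [1.0, 4.0]
```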
[jira] [Resolved] (SPARK-9117) fix BooleanSimplification in case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-9117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9117. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7452 [https://github.com/apache/spark/pull/7452] > fix BooleanSimplification in case-insensitive > - > > Key: SPARK-9117 > URL: https://issues.apache.org/jira/browse/SPARK-9117 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9113) enable analysis check code for self join
[ https://issues.apache.org/jira/browse/SPARK-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9113. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7449 [https://github.com/apache/spark/pull/7449] > enable analysis check code for self join > > > Key: SPARK-9113 > URL: https://issues.apache.org/jira/browse/SPARK-9113 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Trivial > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
[ https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9144: -- Component/s: Scheduler > Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled > --- > > Key: SPARK-9144 > URL: https://issues.apache.org/jira/browse/SPARK-9144 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark has an option called {{spark.localExecution.enabled}}; according to the > docs: > {quote} > Enables Spark to run certain jobs, such as first() or take() on the driver, > without sending tasks to the cluster. This can make certain jobs execute very > quickly, but may require shipping a whole partition of data to the driver. > {quote} > This feature ends up adding quite a bit of complexity to DAGScheduler, > especially in the {{runLocallyWithinThread}} method, but as far as I know > nobody uses this feature (I searched the mailing list and haven't seen any > recent mentions of the configuration nor stacktraces including the runLocally > method). As a step towards scheduler complexity reduction, I propose that we > remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
[ https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9144: -- Issue Type: Improvement (was: New Feature) > Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled > --- > > Key: SPARK-9144 > URL: https://issues.apache.org/jira/browse/SPARK-9144 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark has an option called {{spark.localExecution.enabled}}; according to the > docs: > {quote} > Enables Spark to run certain jobs, such as first() or take() on the driver, > without sending tasks to the cluster. This can make certain jobs execute very > quickly, but may require shipping a whole partition of data to the driver. > {quote} > This feature ends up adding quite a bit of complexity to DAGScheduler, > especially in the {{runLocallyWithinThread}} method, but as far as I know > nobody uses this feature (I searched the mailing list and haven't seen any > recent mentions of the configuration nor stacktraces including the runLocally > method). As a step towards scheduler complexity reduction, I propose that we > remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
[ https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632075#comment-14632075 ] Apache Spark commented on SPARK-9144: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7484 > Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled > --- > > Key: SPARK-9144 > URL: https://issues.apache.org/jira/browse/SPARK-9144 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark has an option called {{spark.localExecution.enabled}}; according to the > docs: > {quote} > Enables Spark to run certain jobs, such as first() or take() on the driver, > without sending tasks to the cluster. This can make certain jobs execute very > quickly, but may require shipping a whole partition of data to the driver. > {quote} > This feature ends up adding quite a bit of complexity to DAGScheduler, > especially in the {{runLocallyWithinThread}} method, but as far as I know > nobody uses this feature (I searched the mailing list and haven't seen any > recent mentions of the configuration nor stacktraces including the runLocally > method). As a step towards scheduler complexity reduction, I propose that we > remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
[ https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9144: --- Assignee: Josh Rosen (was: Apache Spark) > Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled > --- > > Key: SPARK-9144 > URL: https://issues.apache.org/jira/browse/SPARK-9144 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark has an option called {{spark.localExecution.enabled}}; according to the > docs: > {quote} > Enables Spark to run certain jobs, such as first() or take() on the driver, > without sending tasks to the cluster. This can make certain jobs execute very > quickly, but may require shipping a whole partition of data to the driver. > {quote} > This feature ends up adding quite a bit of complexity to DAGScheduler, > especially in the {{runLocallyWithinThread}} method, but as far as I know > nobody uses this feature (I searched the mailing list and haven't seen any > recent mentions of the configuration nor stacktraces including the runLocally > method). As a step towards scheduler complexity reduction, I propose that we > remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
[ https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9144: --- Assignee: Apache Spark (was: Josh Rosen) > Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled > --- > > Key: SPARK-9144 > URL: https://issues.apache.org/jira/browse/SPARK-9144 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark > > Spark has an option called {{spark.localExecution.enabled}}; according to the > docs: > {quote} > Enables Spark to run certain jobs, such as first() or take() on the driver, > without sending tasks to the cluster. This can make certain jobs execute very > quickly, but may require shipping a whole partition of data to the driver. > {quote} > This feature ends up adding quite a bit of complexity to DAGScheduler, > especially in the {{runLocallyWithinThread}} method, but as far as I know > nobody uses this feature (I searched the mailing list and haven't seen any > recent mentions of the configuration nor stacktraces including the runLocally > method). As a step towards scheduler complexity reduction, I propose that we > remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8794) Column pruning isn't applied beneath sample
[ https://issues.apache.org/jira/browse/SPARK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632074#comment-14632074 ] Michael Armbrust commented on SPARK-8794: - Unfortunately, we typically avoid backporting anything that is not a bug fix to release branches. We really want to avoid unintended regressions so that it is very safe for people to upgrade. > Column pruning isn't applied beneath sample > --- > > Key: SPARK-8794 > URL: https://issues.apache.org/jira/browse/SPARK-8794 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Eron Wright >Assignee: Liang-Chi Hsieh > Fix For: 1.5.0 > > > I observe that certain transformations (e.g. sample) on DataFrame cause the > underlying relation's support for column pruning to be disregarded in > subsequent queries. > I encountered this issue while using an ML pipeline with a typical dataset of > (label, features). For my particular data source (which implements > PrunedScan), the 'features' column is expensive to compute while the 'label' > column is cheap. The first stage of the pipeline - StringIndexer - operates > only on the label and so should be quick. Yet I found that the 'features' > column would be materialized. Upon investigation, the issue occurs when > the dataset is split into train/test with sampling. The sampling > transformation causes the pruning optimization to be lost. > See this gist for a sample program demonstrating the issue: > [https://gist.github.com/EronWright/cb5fb9af46fd810194f8] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
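Why losing pruning hurts can be shown with a toy PrunedScan-style source in plain Python (illustrative names; this is not Spark's API): the source only computes the columns the query asks for, so an operator like sample that forwards the required-column set never triggers the expensive column.

```python
import random

def build_scan(required_columns, num_rows=4):
    """Toy PrunedScan-like source: computes only the requested columns.
    'features' stands in for the ticket's expensive column."""
    def cheap(i):
        return float(i)                     # 'label': cheap to compute
    def expensive(i):
        return [float(i)] * 1000            # 'features': costly to compute
    makers = {"label": cheap, "features": expensive}
    return [{c: makers[c](i) for c in required_columns}
            for i in range(num_rows)]

# With pruning preserved through sample, only 'label' is materialized.
random.seed(0)
sampled = [r for r in build_scan(["label"]) if random.random() < 0.5]
print(all(set(r) == {"label"} for r in sampled))  # True
```

When the sample transformation drops the pruning information, the equivalent call becomes `build_scan(["label", "features"])`, paying for the expensive column even though only the label is used downstream.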
[jira] [Resolved] (SPARK-9080) IsNaN expression
[ https://issues.apache.org/jira/browse/SPARK-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9080. Resolution: Fixed Assignee: Yijie Shen Fix Version/s: 1.5.0 > IsNaN expression > > > Key: SPARK-9080 > URL: https://issues.apache.org/jira/browse/SPARK-9080 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Yijie Shen >Priority: Critical > Fix For: 1.5.0 > > > Add IsNaN expression to return true if the input double/float value is NaN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
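A small note on why this is expressible without a library call: under IEEE 754, NaN is the only value for which `x != x` holds, which is one common way an IsNaN expression is implemented. A sketch:

```python
import math

def is_nan(x):
    # NaN is the only float for which x != x under IEEE 754.
    return x != x

nan = float("nan")
print(is_nan(nan), is_nan(1.5))   # True False
print(math.isnan(nan))            # stdlib equivalent: True
```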
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632061#comment-14632061 ] Nick Buroojy commented on SPARK-8418: - I like this idea a lot, and think it would solve one of our main performance issues with the ml api. Our data set has hundreds of string features that we need to convert into binary vectors. We have found the latency overhead of processing the features one at-a-time with a StringVectorizer (SPARK-7290) to be unbearable. We wrote a custom Estimator to vectorize all string columns with only a couple passes over the data set and found significant performance gains. I suspect that we aren't the only users with many columns, so we would love to fix this issue upstream with some sort of multi-column interface to transformers and estimators. I suppose we could make do with the Vector or Array interface using the VectorAssembler as described in this ticket; however, I think the cleanest interface for us would be a Map from source column to dest column. As far as sharing code, there are at least two strategies: 1) Use the single value implementation as it is today, and add a multi-value view on top of it. For example, StringVectorizer.setInputCols(Array[A, B]) would return a pipeline of [StringVectorizer.setInputCol(A), StringVectorizer(B)] 2) Reimplement each transformer to support a multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. For example StringVectorizer.setInputCol(A) would invoke StringVectorizer.setInputCols(Array[A]) The obvious downside of 1 is that it wouldn't address the performance issues we ran into with hundreds of columns. The upsides are minimal implementation effort and simpler code to maintain. The main downside of 2 is more upfront effort to implement multi-value transformations, but the upside is reasonable performance with "wide" data sets. I don't think 1 and 2 are mutually exclusive. 
Maybe the multi-value interface could be solidified first with the 1 implementation, then over time the key transformers, like StringVectorizer, could be rewritten to 2? You mentioned that this would require a short design doc. Can I help with that? > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. > Estimators under {{ml.feature}} should also support this. > This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
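Strategy 2 above (a genuinely multi-column implementation) can be sketched in plain Python: fit the per-column value-to-index dictionaries for all columns in a single pass over the data, then transform in a second pass. This is an illustration of the two-pass idea, not Spark ML code, and `StringVectorizer` itself is the hypothetical transformer named in the comment.

```python
def fit_string_indexers(rows, input_cols):
    """One pass over the data: collect distinct values of every column at once."""
    seen = {c: {} for c in input_cols}
    for row in rows:
        for c in input_cols:
            v = row[c]
            if v not in seen[c]:
                seen[c][v] = len(seen[c])
    return seen

def transform(rows, indexers):
    """Second pass: replace each string with its index, for every column."""
    return [{c: idx[row[c]] for c, idx in indexers.items()} for row in rows]

rows = [{"color": "red", "size": "L"},
        {"color": "blue", "size": "S"},
        {"color": "red", "size": "S"}]
idx = fit_string_indexers(rows, ["color", "size"])
print(transform(rows, idx))
```

The contrast with strategy 1 is the pass count: a pipeline of single-column indexers scans the data once per column, while this fits all columns in one scan, which is what makes "wide" data sets with hundreds of string features tractable.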
[jira] [Resolved] (SPARK-9142) Removing unnecessary self types in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9142. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7479 [https://github.com/apache/spark/pull/7479] > Removing unnecessary self types in Catalyst > --- > > Key: SPARK-9142 > URL: https://issues.apache.org/jira/browse/SPARK-9142 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > A small change, based on code review and offline discussion with [~dragos]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632052#comment-14632052 ] Joseph K. Bradley commented on SPARK-5571: -- Stemmer: We'll need to be careful about adding dependencies on other libraries. We strongly prefer avoiding that if possible. If code can be copied and modified (assuming the license is friendly to copying), that might be preferable if the code is relatively simple. Stopwords: Sounds good. LDA.runText: I'd prefer this handle everything automatically: A user gives an unfiltered corpus and LDA handles it. This actually probably requires a quick design doc since I have not thought through the complexities. Pipeline: I agree this might work well under the Pipelines API. Here's what I propose: * For now, we focus on adding the necessary transformers individually: stemmer, stopwords filter. * For the next release, we design a good way to provide this functionality under Pipelines. If that sounds good, we can create & link JIRAs for those transformers, and I'll move the target version for this JIRA to 1.6. What do you think? > LDA should handle text as well > -- > > Key: SPARK-5571 > URL: https://issues.apache.org/jira/browse/SPARK-5571 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Latent Dirichlet Allocation (LDA) currently operates only on vectors of word > counts. It should also support training and prediction using text > (Strings). > This plan is sketched in the [original LDA design > doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. > There should be: > * runWithText() method which takes an RDD with a collection of Strings (bags > of words). This will also index terms and compute a dictionary. 
> * dictionary parameter for when LDA is run with word count vectors > * prediction/feedback methods returning Strings (such as > describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
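The preprocessing that a `runWithText()` method would need — index terms and compute a dictionary before LDA proper runs on count vectors — can be sketched in plain Python (an illustration of the indexing step only, not MLlib code):

```python
from collections import Counter

def index_corpus(docs):
    """Build a term dictionary and per-document count vectors from
    tokenized documents (bags of words)."""
    vocab = {}
    for doc in docs:
        for term in doc:
            vocab.setdefault(term, len(vocab))  # assign next index on first sight
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [0] * len(vocab)
        for term, n in counts.items():
            vec[vocab[term]] = n
        vectors.append(vec)
    return vocab, vectors

docs = [["spark", "lda", "spark"], ["lda", "topics"]]
vocab, vectors = index_corpus(docs)
print(vocab)     # {'spark': 0, 'lda': 1, 'topics': 2}
print(vectors)   # [[2, 1, 0], [0, 1, 1]]
```

The inverse mapping (index to term) is exactly what the prediction/feedback methods such as `describeTopicsAsStrings` would consume.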
[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
[ https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha updated SPARK-7690: - Shepherd: Ram Sriharsha > MulticlassClassificationEvaluator for tuning Multiclass Classifiers > --- > > Key: SPARK-7690 > URL: https://issues.apache.org/jira/browse/SPARK-7690 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Ram Sriharsha >Assignee: Eron Wright > > Provide a MulticlassClassificationEvaluator with weighted F1-score to tune > multiclass classifiers using Pipeline API. > MLLib already provides a MulticlassMetrics functionality which can be wrapped > around a MulticlassClassificationEvaluator to expose weighted F1-score as > metric. > The functionality could be similar to > scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) > in that we can support micro, macro and weighted versions of the F1-score > (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
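The three averaging modes referenced from scikit-learn can be sketched in plain Python (a reference computation for illustration, not the proposed evaluator's code):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro, weighted and micro averages,
    following the scikit-learn f1_score definitions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class has no true positives.
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    # For single-label multiclass, micro-averaged F1 reduces to accuracy.
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return per_class, macro, weighted, micro
```

With weighted as the default, classes are averaged in proportion to their support, which matches scikit-learn's behavior for imbalanced label distributions.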
[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
[ https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha updated SPARK-7690: - Assignee: Eron Wright (was: Ram Sriharsha) > MulticlassClassificationEvaluator for tuning Multiclass Classifiers > --- > > Key: SPARK-7690 > URL: https://issues.apache.org/jira/browse/SPARK-7690 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Ram Sriharsha >Assignee: Eron Wright > > Provide a MulticlassClassificationEvaluator with weighted F1-score to tune > multiclass classifiers using Pipeline API. > MLLib already provides a MulticlassMetrics functionality which can be wrapped > around a MulticlassClassificationEvaluator to expose weighted F1-score as > metric. > The functionality could be similar to > scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) > in that we can support micro, macro and weighted versions of the F1-score > (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Josh Rosen created SPARK-9144: - Summary: Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled Key: SPARK-9144 URL: https://issues.apache.org/jira/browse/SPARK-9144 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark has an option called {{spark.localExecution.enabled}}; according to the docs: {quote} Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver. {quote} This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the {{runLocallyWithinThread}} method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631975#comment-14631975 ] Reynold Xin commented on SPARK-8668: Yes exactly! > expr function to convert SQL expression into a Column > - > > Key: SPARK-8668 > URL: https://issues.apache.org/jira/browse/SPARK-8668 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > selectExpr uses the expression parser to parse a string expressions. would be > great to create an "expr" function in functions.scala/functions.py that > converts a string into an expression (or a list of expressions separated by > comma). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
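A toy stand-in for the proposed `expr` function, sketched in plain Python: turn a string expression into a function over a row. Real Spark would route the string through its SQL expression parser and return a Column; here a restricted `eval` over Python-compatible arithmetic is used purely for illustration.

```python
import ast

def expr(expression):
    """Toy expr(): compile a string expression into a function of a row
    (a dict of column name -> value). Illustrative only; Spark's version
    would parse SQL syntax into a Column."""
    tree = ast.parse(expression, mode="eval")
    code = compile(tree, "<expr>", "eval")
    # Evaluate with no builtins; only the row's columns are in scope.
    return lambda row: eval(code, {"__builtins__": {}}, dict(row))

col = expr("a + 1")
print(col({"a": 41}))   # 42
```

The convenience mirrors `selectExpr`: the same parser, exposed as a standalone function so a string can be used anywhere a Column is accepted.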
[jira] [Updated] (SPARK-8593) History Server doesn't show complete application when one attempt inprogress
[ https://issues.apache.org/jira/browse/SPARK-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8593: - Assignee: Rekha Joshi > History Server doesn't show complete application when one attempt inprogress > > > Key: SPARK-8593 > URL: https://issues.apache.org/jira/browse/SPARK-8593 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.4.0 >Reporter: Thomas Graves >Assignee: Rekha Joshi > Fix For: 1.4.2, 1.5.0 > > > The Spark history server doesn't show an application if the first attempt of > the application is still in progress. > Here are the files in hdfs: > -rwxrwx--- 3 tgraves hdfs234 2015-06-24 15:49 > sparkhistory/application_1433751980223_18926_1.inprogress > -rwxrwx--- 3 tgraves hdfs9609450 2015-06-24 15:51 > sparkhistory/application_1433751980223_18926_2 > The UI shows them if I set the showIncomplete=true. > Removing the inprogress file allows it to show up when showIncomplete is > false. > It should be smart enough to at least show the second successful attempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8593) History Server doesn't show complete application when one attempt inprogress
[ https://issues.apache.org/jira/browse/SPARK-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8593. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7253 [https://github.com/apache/spark/pull/7253] > History Server doesn't show complete application when one attempt inprogress > > > Key: SPARK-8593 > URL: https://issues.apache.org/jira/browse/SPARK-8593 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.4.0 >Reporter: Thomas Graves > Fix For: 1.4.2, 1.5.0 > > > The Spark history server doesn't show an application if the first attempt of > the application is still in progress. > Here are the files in hdfs: > -rwxrwx--- 3 tgraves hdfs234 2015-06-24 15:49 > sparkhistory/application_1433751980223_18926_1.inprogress > -rwxrwx--- 3 tgraves hdfs9609450 2015-06-24 15:51 > sparkhistory/application_1433751980223_18926_2 > The UI shows them if I set the showIncomplete=true. > Removing the inprogress file allows it to show up when showIncomplete is > false. > It should be smart enough to at least show the second successful attempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
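The fixed visibility rule can be sketched in plain Python (a model of the decision only, using the `.inprogress` log-name convention from the ticket): an application is listed among completed applications when any of its attempts is complete, regardless of an earlier attempt still being in progress.

```python
def visible_applications(apps, show_incomplete=False):
    """apps: {app_id: [event-log filenames]}. An attempt is complete unless
    its log name ends with '.inprogress'. List an app when at least one
    attempt is complete (or unconditionally if show_incomplete is set)."""
    visible = []
    for app_id, logs in apps.items():
        has_complete = any(not l.endswith(".inprogress") for l in logs)
        if has_complete or show_incomplete:
            visible.append(app_id)
    return visible

apps = {
    "application_1433751980223_18926": [
        "application_1433751980223_18926_1.inprogress",
        "application_1433751980223_18926_2",
    ],
}
print(visible_applications(apps))  # app is shown: attempt 2 completed
```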
[jira] [Assigned] (SPARK-6805) ML Pipeline API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6805: --- Assignee: Apache Spark > ML Pipeline API in SparkR > - > > Key: SPARK-6805 > URL: https://issues.apache.org/jira/browse/SPARK-6805 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Critical > > SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API > in SparkR. The implementation should be similar to the pipeline API > implementation in Python. > For Spark 1.5, we want to support linear/logistic regression in SparkR, with > basic support for R formula and elastic-net regularization. The design doc > can be viewed at > https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6805) ML Pipeline API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6805: --- Assignee: (was: Apache Spark) > ML Pipeline API in SparkR > - > > Key: SPARK-6805 > URL: https://issues.apache.org/jira/browse/SPARK-6805 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API > in SparkR. The implementation should be similar to the pipeline API > implementation in Python. > For Spark 1.5, we want to support linear/logistic regression in SparkR, with > basic support for R formula and elastic-net regularization. The design doc > can be viewed at > https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6805) ML Pipeline API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631962#comment-14631962 ] Apache Spark commented on SPARK-6805: - User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/7483 > ML Pipeline API in SparkR > - > > Key: SPARK-6805 > URL: https://issues.apache.org/jira/browse/SPARK-6805 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API > in SparkR. The implementation should be similar to the pipeline API > implementation in Python. > For Spark 1.5, we want to support linear/logistic regression in SparkR, with > basic support for R formula and elastic-net regularization. The design doc > can be viewed at > https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters
[ https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9143: --- Assignee: Josh Rosen (was: Apache Spark) > Add planner rule for automatically inserting Unsafe <-> Safe row format > converters > -- > > Key: SPARK-9143 > URL: https://issues.apache.org/jira/browse/SPARK-9143 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > Now that we have two different internal row formats, UnsafeRow and the old > Java-object-based row format, we end up having to perform conversions between > these two formats. These conversions should not be performed by the operators > themselves; instead, the planner should be responsible for inserting > appropriate format conversions when they are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters
[ https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631956#comment-14631956 ] Apache Spark commented on SPARK-9143: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7482 > Add planner rule for automatically inserting Unsafe <-> Safe row format > converters > -- > > Key: SPARK-9143 > URL: https://issues.apache.org/jira/browse/SPARK-9143 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > Now that we have two different internal row formats, UnsafeRow and the old > Java-object-based row format, we end up having to perform conversions between > these two formats. These conversions should not be performed by the operators > themselves; instead, the planner should be responsible for inserting > appropriate format conversions when they are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters
[ https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9143: --- Assignee: Apache Spark (was: Josh Rosen) > Add planner rule for automatically inserting Unsafe <-> Safe row format > converters > -- > > Key: SPARK-9143 > URL: https://issues.apache.org/jira/browse/SPARK-9143 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > Now that we have two different internal row formats, UnsafeRow and the old > Java-object-based row format, we end up having to perform conversions between > these two formats. These conversions should not be performed by the operators > themselves; instead, the planner should be responsible for inserting > appropriate format conversions when they are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters
[ https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9143: -- Shepherd: Michael Armbrust > Add planner rule for automatically inserting Unsafe <-> Safe row format > converters > -- > > Key: SPARK-9143 > URL: https://issues.apache.org/jira/browse/SPARK-9143 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > Now that we have two different internal row formats, UnsafeRow and the old > Java-object-based row format, we end up having to perform conversions between > these two formats. These conversions should not be performed by the operators > themselves; instead, the planner should be responsible for inserting > appropriate format conversions when they are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters
Josh Rosen created SPARK-9143: - Summary: Add planner rule for automatically inserting Unsafe <-> Safe row format converters Key: SPARK-9143 URL: https://issues.apache.org/jira/browse/SPARK-9143 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Now that we have two different internal row formats, UnsafeRow and the old Java-object-based row format, we end up having to perform conversions between these two formats. These conversions should not be performed by the operators themselves; instead, the planner should be responsible for inserting appropriate format conversions when they are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
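The planner rule described above, inserting format converters between operators rather than letting operators convert themselves, can be rendered as a toy pass. This is plain Python for illustration only; Spark's actual rule pattern-matches on SparkPlan operators, and the operator and format names here are assumptions:

```python
def insert_converters(ops):
    """Toy planner pass over a linear operator chain.

    ops: list of (name, input_format, output_format) tuples, leaf first,
    where formats are "safe" or "unsafe". Splice in a converter wherever
    an operator's required input format differs from its child's output.
    """
    result = [ops[0]]
    for name, in_fmt, out_fmt in ops[1:]:
        child_out = result[-1][2]
        if child_out != in_fmt:
            conv = "ConvertToUnsafe" if in_fmt == "unsafe" else "ConvertToSafe"
            result.append((conv, child_out, in_fmt))
        result.append((name, in_fmt, out_fmt))
    return result

# A scan producing safe rows, an unsafe-only join, then a safe-only project:
plan = [("Scan", "safe", "safe"),
        ("UnsafeJoin", "unsafe", "unsafe"),
        ("Project", "safe", "safe")]
```

Running the pass on `plan` inserts a ConvertToUnsafe before the join and a ConvertToSafe after it, which is exactly the responsibility the JIRA assigns to the planner instead of the operators.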
[jira] [Commented] (SPARK-9075) DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631931#comment-14631931 ] Joseph K. Bradley commented on SPARK-9075: -- I agree there are ways to deal with very high-arity categories, but I think dealing with it is lower priority than some other improvements (such as providing predicted class probabilities) which we're working on. In general, one should throw out that high-arity categorical feature, if you have so few examples. It's true the check does not ensure all values are covered; that would be good to refine in the future. It sounds like we're discussing 3 possibilities, 2 short-term and 1 long-term: * Run without exceptions no matter what is given. ** Short-term: Run as is. This could mean giving meaningless results. ** Long-term: We should implement a better way to handle many categories. * Short-term: Throw an exception and notify the user of the problem. I prefer this for now, until we can do the long-term solution. > DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect. > > > Key: SPARK-9075 > URL: https://issues.apache.org/jira/browse/SPARK-9075 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Les Selecky >Priority: Minor > > In > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala > there's a statement that sets maxPossibleBins to numExamples when > numExamples is less than strategy.maxBins. > This can cause an error when training small partitions; the error is > triggered further down in the logic where it's required that > maxCategoriesPerFeature be less than or equal to maxPossibleBins. > Here's an example of how it was manifested: the partition contained 49 > rows (i.e., numExamples=49) but strategy.maxBins was 57.
> The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore > reduced maxPossibleBins to 49 causing the "require(maxCategoriesPerFeature <= > maxPossibleBins" to throw an error. > In short, this will be a problem when training small datasets with a feature > that contains more categories than numExamples. > In our local testing we commented out the "math.min(strategy.maxBins, > numExamples)" line and the decision tree succeeded where it had failed > previously. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
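The failing check can be reproduced in miniature. Below is a Python rendering of the Scala logic quoted in the report, not the real DecisionTreeMetadata API; function names are invented for illustration:

```python
def max_possible_bins(max_bins, num_examples):
    # The line under discussion: bins are capped at the number of examples.
    return min(max_bins, num_examples)

def require_bins(max_categories_per_feature, max_bins, num_examples):
    cap = max_possible_bins(max_bins, num_examples)
    # Mirrors the require(maxCategoriesPerFeature <= maxPossibleBins) check
    # that throws for small partitions.
    if max_categories_per_feature > cap:
        raise ValueError(
            "maxCategoriesPerFeature (%d) > maxPossibleBins (%d)"
            % (max_categories_per_feature, cap))
    return cap
```

With the reported numbers (49 rows, maxBins = 57), any categorical feature with more than 49 categories trips the check even though 57 bins were requested, which is the failure the reporter worked around by removing the `math.min`.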
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631909#comment-14631909 ] Dan McClary commented on SPARK-8668: So, if I understand, this would parse a string -- the same way selectExpr does -- and return a list of expressions? > expr function to convert SQL expression into a Column > - > > Key: SPARK-8668 > URL: https://issues.apache.org/jira/browse/SPARK-8668 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > selectExpr uses the expression parser to parse string expressions. It would be > great to create an "expr" function in functions.scala/functions.py that > converts a string into an expression (or a list of expressions separated by > comma). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
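One wrinkle in returning "a list of expressions separated by comma" is that commas also appear inside function calls, so a naive split breaks. A toy top-level splitter (pure Python, not Spark's parser) illustrates the distinction:

```python
def split_top_level(s):
    """Split 'a, f(b, c), d' into ['a', 'f(b, c)', 'd'].

    Only commas at parenthesis depth 0 separate expressions; commas
    inside a call like concat(b, c) are part of one expression.
    """
    parts, buf, depth = [], [], 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf).strip())
    return parts
```

A real `expr` function would of course reuse the SQL expression parser rather than anything like this; the sketch only shows why the comma handling has to be parser-aware.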
[jira] [Commented] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
[ https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631905#comment-14631905 ] Apache Spark commented on SPARK-9118: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/7481 > Implement integer array parameters for ml.param as IntArrayParam > > > Key: SPARK-9118 > URL: https://issues.apache.org/jira/browse/SPARK-9118 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov >Priority: Minor > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > ml/param/params.scala lacks an integer array parameter. It is needed for some > models such as multilayer perceptron to specify the layer sizes. I suggest > implementing it as IntArrayParam, similarly to DoubleArrayParam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
[ https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9118: --- Assignee: Apache Spark > Implement integer array parameters for ml.param as IntArrayParam > > > Key: SPARK-9118 > URL: https://issues.apache.org/jira/browse/SPARK-9118 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov >Assignee: Apache Spark >Priority: Minor > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > ml/param/params.scala lacks an integer array parameter. It is needed for some > models such as multilayer perceptron to specify the layer sizes. I suggest > implementing it as IntArrayParam, similarly to DoubleArrayParam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7127) Broadcast spark.ml tree ensemble models for predict
[ https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7127. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6300 [https://github.com/apache/spark/pull/6300] > Broadcast spark.ml tree ensemble models for predict > --- > > Key: SPARK-7127 > URL: https://issues.apache.org/jira/browse/SPARK-7127 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Minor > Fix For: 1.5.0 > > > GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast > models and then predict. This will mean overriding transform(). > Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
[ https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9118: --- Assignee: (was: Apache Spark) > Implement integer array parameters for ml.param as IntArrayParam > > > Key: SPARK-9118 > URL: https://issues.apache.org/jira/browse/SPARK-9118 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov >Priority: Minor > Fix For: 1.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > ml/param/params.scala lacks an integer array parameter. It is needed for some > models such as multilayer perceptron to specify the layer sizes. I suggest > implementing it as IntArrayParam, similarly to DoubleArrayParam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-8792. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7190 [https://github.com/apache/spark/pull/7190] > Add Python API for PCA transformer > -- > > Key: SPARK-8792 > URL: https://issues.apache.org/jira/browse/SPARK-8792 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 1.5.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 1.5.0 > > > Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9090) Fix definition of residual in LinearRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9090. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7435 [https://github.com/apache/spark/pull/7435] > Fix definition of residual in LinearRegressionSummary > - > > Key: SPARK-9090 > URL: https://issues.apache.org/jira/browse/SPARK-9090 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Feynman Liang >Assignee: Feynman Liang >Priority: Trivial > Fix For: 1.5.0 > > > Residual is defined as label - prediction > (https://en.wikipedia.org/wiki/Least_squares); we need to update > {{LinearRegressionSummary}} to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely
[ https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5681: - Assignee: Shixiong Zhu > Calling graceful stop() immediately after start() on StreamingContext should > not get stuck indefinitely > --- > > Key: SPARK-5681 > URL: https://issues.apache.org/jira/browse/SPARK-5681 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Liang-Chi Hsieh >Assignee: Shixiong Zhu > Fix For: 1.5.0 > > > Sometimes the receiver will be registered into tracker after ssc.stop is > called. Especially when stop() is called immediately after start(). So the > receiver doesn't get the StopReceiver message from the tracker. In this case, > when you call stop() in graceful mode, stop() would get stuck indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely
[ https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-5681. -- Resolution: Fixed Fix Version/s: 1.5.0 > Calling graceful stop() immediately after start() on StreamingContext should > not get stuck indefinitely > --- > > Key: SPARK-5681 > URL: https://issues.apache.org/jira/browse/SPARK-5681 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Liang-Chi Hsieh >Assignee: Shixiong Zhu > Fix For: 1.5.0 > > > Sometimes the receiver will be registered into tracker after ssc.stop is > called. Especially when stop() is called immediately after start(). So the > receiver doesn't get the StopReceiver message from the tracker. In this case, > when you call stop() in graceful mode, stop() would get stuck indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9137) Unified label verification for Predictor
[ https://issues.apache.org/jira/browse/SPARK-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631890#comment-14631890 ] Joseph K. Bradley commented on SPARK-9137: -- More notes: Changing title to be for Classifier only since we don't really need to check it for regression. Also, just noting that this should be as lightweight as possible, happening in a UDF or map so that it can be pipelined without causing an extra RDD action. > Unified label verification for Predictor > > > Key: SPARK-9137 > URL: https://issues.apache.org/jira/browse/SPARK-9137 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > We should check that labels are valid before training models for ml.predictor such as > LogisticRegression, NaiveBayes, etc. We can make this check at > extractLabeledPoints. Some models do this check during the training step at > present, and we need to unify them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
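The "lightweight, happening in a UDF or map" idea from the comment might look like this sketch. It is hypothetical Python for illustration; the real check lives in Scala inside the Classifier abstraction, and the function names here are invented:

```python
def label_checker(num_classes):
    """Return a per-row validator suitable for use inside a map(),
    so validation pipelines with training and adds no extra pass
    over the data (no separate RDD action)."""
    def check(label):
        # Classifier labels must be integers in [0, num_classes).
        if label != int(label) or not (0 <= label < num_classes):
            raise ValueError(
                "Invalid label %r: expected an integer in [0, %d)"
                % (label, num_classes))
        return label
    return check
```

Because the check runs lazily per record, a bad label surfaces during the training pass itself rather than in an up-front scan, which is the pipelining property the comment asks for.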
[jira] [Updated] (SPARK-9137) Unified label verification for Predictor
[ https://issues.apache.org/jira/browse/SPARK-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9137: - Assignee: Yanbo Liang > Unified label verification for Predictor > > > Key: SPARK-9137 > URL: https://issues.apache.org/jira/browse/SPARK-9137 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > We should check that labels are valid before training models for ml.predictor such as > LogisticRegression, NaiveBayes, etc. We can make this check at > extractLabeledPoints. Some models do this check during the training step at > present, and we need to unify them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9136) fix several bugs in DateTimeUtils.stringToTimestamp
[ https://issues.apache.org/jira/browse/SPARK-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9136. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7473 [https://github.com/apache/spark/pull/7473] > fix several bugs in DateTimeUtils.stringToTimestamp > --- > > Key: SPARK-9136 > URL: https://issues.apache.org/jira/browse/SPARK-9136 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-8600. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7284 [https://github.com/apache/spark/pull/7284] > Naive Bayes API for spark.ml Pipelines > -- > > Key: SPARK-8600 > URL: https://issues.apache.org/jira/browse/SPARK-8600 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 1.5.0 > > > Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the > existing NaiveBayes implementation under spark.mllib package. Should also > keep the parameter names consistent. The output columns could include both > the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9062) Change output type of Tokenizer to Array(String, true)
[ https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9062: - Shepherd: Joseph K. Bradley > Change output type of Tokenizer to Array(String, true) > -- > > Key: SPARK-9062 > URL: https://issues.apache.org/jira/browse/SPARK-9062 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 1.5.0 > > > Currently output type of Tokenizer is Array(String, false), which is not > compatible with Word2Vec and Other transformers since their input type is > Array(String, true). Seq[String] in udf will be treated as Array(String, > true) by default. > I'm also thinking for Nullable columns, maybe tokenizer should return > Array(null) for null value in the input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9062) Change output type of Tokenizer to Array(String, true)
[ https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9062: - Assignee: yuhao yang > Change output type of Tokenizer to Array(String, true) > -- > > Key: SPARK-9062 > URL: https://issues.apache.org/jira/browse/SPARK-9062 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 1.5.0 > > > Currently output type of Tokenizer is Array(String, false), which is not > compatible with Word2Vec and Other transformers since their input type is > Array(String, true). Seq[String] in udf will be treated as Array(String, > true) by default. > I'm also thinking for Nullable columns, maybe tokenizer should return > Array(null) for null value in the input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9062) Change output type of Tokenizer to Array(String, true)
[ https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9062. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7414 [https://github.com/apache/spark/pull/7414] > Change output type of Tokenizer to Array(String, true) > -- > > Key: SPARK-9062 > URL: https://issues.apache.org/jira/browse/SPARK-9062 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > Fix For: 1.5.0 > > > Currently output type of Tokenizer is Array(String, false), which is not > compatible with Word2Vec and Other transformers since their input type is > Array(String, true). Seq[String] in udf will be treated as Array(String, > true) by default. > I'm also thinking for Nullable columns, maybe tokenizer should return > Array(null) for null value in the input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9062) Change output type of Tokenizer to Array(String, true)
[ https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631865#comment-14631865 ] Joseph K. Bradley commented on SPARK-9062: -- I guess we will be forced to support nullable types. Looking at Catalyst schema inference, it looks like the assumption of nullability is buried pretty deep. I agree with you that Tokenizer (and any other transformers which use Array/Seq) will need to be changed to use nullable = true. Thanks for looking into this! > Change output type of Tokenizer to Array(String, true) > -- > > Key: SPARK-9062 > URL: https://issues.apache.org/jira/browse/SPARK-9062 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Currently output type of Tokenizer is Array(String, false), which is not > compatible with Word2Vec and Other transformers since their input type is > Array(String, true). Seq[String] in udf will be treated as Array(String, > true) by default. > I'm also thinking for Nullable columns, maybe tokenizer should return > Array(null) for null value in the input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
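The incompatibility being discussed, Array(String, false) versus Array(String, true), can be modeled with a minimal stand-in for Catalyst's ArrayType. This is a toy Python dataclass, not the real pyspark.sql.types class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyArrayType:
    """Stand-in for an array type with a containsNull flag."""
    element_type: str
    contains_null: bool

# Tokenizer currently declares Array(String, false); Word2Vec and other
# downstream transformers expect Array(String, true).
tokenizer_out = ToyArrayType("string", contains_null=False)
word2vec_in = ToyArrayType("string", contains_null=True)

# Schema validation that compares types for equality rejects the pair,
# so the transformers don't compose until Tokenizer also emits
# contains_null=True (the change this JIRA makes).
assert tokenizer_out != word2vec_in
assert ToyArrayType("string", contains_null=True) == word2vec_in
```

The nullability flag is part of the type's identity here, which is why flipping Tokenizer's output to nullable, rather than special-casing the comparison, is the simpler fix.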
[jira] [Assigned] (SPARK-9024) Unsafe HashJoin
[ https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9024: --- Assignee: Apache Spark > Unsafe HashJoin > --- > > Key: SPARK-9024 > URL: https://issues.apache.org/jira/browse/SPARK-9024 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and > outputs UnsafeRow as outputs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8999) Support non-temporal sequence in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631857#comment-14631857 ] Joseph K. Bradley commented on SPARK-8999: -- I also wonder if that generalization to non-temporal sequences could be supported more easily within the Pipelines API, where we could start accepting generalized inputs without breaking public APIs. In that case, this decision could be deferred. > Support non-temporal sequence in PrefixSpan > --- > > Key: SPARK-8999 > URL: https://issues.apache.org/jira/browse/SPARK-8999 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Priority: Critical > > In SPARK-6487, we assume that all items are ordered. However, we should > support non-temporal sequences in PrefixSpan. This should be done before 1.5 > because it changes PrefixSpan APIs. > We can use `Array[Array[Int]]` or follow SPMF to use `Array[Int]` and use -1 > to mark itemset boundaries. The latter is more efficient for storage. If we > support generic item type, we can use null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
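The SPMF-style flat encoding mentioned in the description, an `Array[Int]` with -1 marking itemset boundaries, round-trips as follows. This is a Python sketch of the encoding idea only, not PrefixSpan's API:

```python
def encode(sequence):
    """Flatten [[1, 2], [3]] to [1, 2, -1, 3], using -1 as the
    itemset boundary (the SPMF convention discussed above)."""
    flat = []
    for i, itemset in enumerate(sequence):
        if i > 0:
            flat.append(-1)
        flat.extend(itemset)
    return flat

def decode(flat):
    """Inverse of encode(): rebuild the list of itemsets."""
    sequence, current = [], []
    for item in flat:
        if item == -1:
            sequence.append(current)
            current = []
        else:
            current.append(item)
    sequence.append(current)
    return sequence
```

Compared with `Array[Array[Int]]`, the flat form costs one extra Int per boundary instead of one nested array object per itemset, which is the storage-efficiency argument made in the description.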
[jira] [Assigned] (SPARK-9024) Unsafe HashJoin
[ https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9024: --- Assignee: (was: Apache Spark) > Unsafe HashJoin > --- > > Key: SPARK-9024 > URL: https://issues.apache.org/jira/browse/SPARK-9024 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and > outputs UnsafeRow as outputs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9024) Unsafe HashJoin
[ https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631855#comment-14631855 ] Apache Spark commented on SPARK-9024: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7480 > Unsafe HashJoin > --- > > Key: SPARK-9024 > URL: https://issues.apache.org/jira/browse/SPARK-9024 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and > outputs UnsafeRow as outputs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9024) Unsafe HashJoin
[ https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-9024: -- Summary: Unsafe HashJoin (was: UnsafeBroadcastJoin) > Unsafe HashJoin > --- > > Key: SPARK-9024 > URL: https://issues.apache.org/jira/browse/SPARK-9024 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and > outputs UnsafeRow as outputs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9142) Removing unnecessary self types in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9142: --- Assignee: Reynold Xin (was: Apache Spark) > Removing unnecessary self types in Catalyst > --- > > Key: SPARK-9142 > URL: https://issues.apache.org/jira/browse/SPARK-9142 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > A small change, based on code review and offline discussion with [~dragos]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9142) Removing unnecessary self types in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631841#comment-14631841 ] Apache Spark commented on SPARK-9142: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7479 > Removing unnecessary self types in Catalyst > --- > > Key: SPARK-9142 > URL: https://issues.apache.org/jira/browse/SPARK-9142 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > A small change, based on code review and offline discussion with [~dragos]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9142) Removing unnecessary self types in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9142: --- Assignee: Apache Spark (was: Reynold Xin) > Removing unnecessary self types in Catalyst > --- > > Key: SPARK-9142 > URL: https://issues.apache.org/jira/browse/SPARK-9142 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > A small change, based on code review and offline discussion with [~dragos]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9142) Removing unnecessary self types in Catalyst
Reynold Xin created SPARK-9142: -- Summary: Removing unnecessary self types in Catalyst Key: SPARK-9142 URL: https://issues.apache.org/jira/browse/SPARK-9142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin A small change, based on code review and offline discussion with [~dragos]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9141) DataFrame recomputed instead of using cached parent.
Nick Pritchard created SPARK-9141: - Summary: DataFrame recomputed instead of using cached parent. Key: SPARK-9141 URL: https://issues.apache.org/jira/browse/SPARK-9141 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.4.1 Reporter: Nick Pritchard As I understand, DataFrame.cache() is supposed to work the same as RDD.cache(), so that repeated operations on it will use the cached results and not recompute the entire lineage. However, it seems that some DataFrame operations (e.g. withColumn) change the underlying RDD lineage so that cache doesn't work as expected. Below is a Scala example that demonstrates this. First, I define two UDFs that use println so that it is easy to see when they are being called. Next, I create a simple data frame with one row and two columns. Next, I add a column, cache it, and call count() to force the computation. Lastly, I add another column, cache it, and call count(). I would have expected the last statement to only compute the last column, since everything else was cached. However, because withColumn() changes the lineage, the whole data frame is recomputed. {code:scala} // Example UDFs that println when called val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } // Initial dataset val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") // Add column by applying twice udf val df2 = df1.withColumn("twice", twice($"value")) df2.cache() df2.count() // prints Computed: twice(1) // Add column by applying triple udf val df3 = df2.withColumn("triple", triple($"value")) df3.cache() df3.count() // prints Computed: twice(1)\nComputed: triple(1) {code} I found a workaround, which helped me understand what was going on behind the scenes, but doesn't seem like an ideal solution. Basically, I convert to an RDD and then back to a DataFrame, which seems to freeze the lineage. 
The code below shows the workaround for creating the second data frame so cache will work as expected. {code:scala} val df2 = { val tmp = df1.withColumn("twice", twice($"value")) sqlContext.createDataFrame(tmp.rdd, tmp.schema) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8007: --- Assignee: (was: Apache Spark) > Support resolving virtual columns in DataFrames > --- > > Key: SPARK-8007 > URL: https://issues.apache.org/jira/browse/SPARK-8007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to > SparkPartitionID expression. > A cool use case is to understand physical data skew: > {code} > df.groupBy("SPARK__PARTITION__ID").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631835#comment-14631835 ] Apache Spark commented on SPARK-8007: - User 'JDrit' has created a pull request for this issue: https://github.com/apache/spark/pull/7478 > Support resolving virtual columns in DataFrames > --- > > Key: SPARK-8007 > URL: https://issues.apache.org/jira/browse/SPARK-8007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to > SparkPartitionID expression. > A cool use case is to understand physical data skew: > {code} > df.groupBy("SPARK__PARTITION__ID").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8007: --- Assignee: Apache Spark > Support resolving virtual columns in DataFrames > --- > > Key: SPARK-8007 > URL: https://issues.apache.org/jira/browse/SPARK-8007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to > SparkPartitionID expression. > A cool use case is to understand physical data skew: > {code} > df.groupBy("SPARK__PARTITION__ID").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance
[ https://issues.apache.org/jira/browse/SPARK-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631820#comment-14631820 ] Matt Cheah commented on SPARK-5269: --- Sweet - working with someone else on this actually, but assigning to me is good. I expect that using the Kryo resource pool will provide a fairly elegant solution. > BlockManager.dataDeserialize always creates a new serializer instance > - > > Key: SPARK-5269 > URL: https://issues.apache.org/jira/browse/SPARK-5269 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ivan Vergiliev >Assignee: Matt Cheah > Labels: performance, serializers > > BlockManager.dataDeserialize always creates a new instance of the serializer, > which is pretty slow in some cases. I'm using Kryo serialization and have a > custom registrator, and its register method is showing up as taking about 15% > of the execution time in my profiles. This started happening after I > increased the number of keys in a job with a shuffle phase by a factor of 40. > One solution I can think of is to create a ThreadLocal SerializerInstance for > the defaultSerializer, and only create a new one if a custom serializer is > passed in. AFAICT a custom serializer is passed only from > DiskStore.getValues, and that, on the other hand, depends on the serializer > passed to ExternalSorter. I don't know how often this is used, but I think > this can still be a good solution for the standard use case. > Oh, and also - ExternalSorter already has a SerializerInstance, so if the > getValues method is called from a single thread, maybe we can pass that > directly? > I'd be happy to try a patch but would probably need a confirmation from > someone that this approach would indeed work (or an idea for another). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
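The thread-local idea proposed above can be sketched in plain Scala (illustration only; `SerializerInstance`, `ExpensiveSerializer`, and `CachedSerializers` are hypothetical stand-ins, not Spark's actual classes): the default serializer gets one cached instance per thread, while a custom serializer still produces a fresh instance.

```scala
// Minimal interface standing in for a serializer instance.
trait SerializerInstance { def deserialize(bytes: Array[Byte]): Any }

// Stand-in for a serializer whose newInstance() is costly
// (e.g. a Kryo serializer re-running a custom registrator).
class ExpensiveSerializer {
  var instancesCreated = 0
  def newInstance(): SerializerInstance = {
    instancesCreated += 1  // counts the expensive setup work
    new SerializerInstance {
      def deserialize(bytes: Array[Byte]): Any = new String(bytes, "UTF-8")
    }
  }
}

object CachedSerializers {
  val default = new ExpensiveSerializer

  // One default instance per thread, created lazily on first use.
  private val local = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = default.newInstance()
  }

  // Custom serializers (the DiskStore.getValues path) still get a fresh
  // instance; the common default path reuses the thread-local one.
  def instanceFor(custom: Option[ExpensiveSerializer]): SerializerInstance =
    custom.map(_.newInstance()).getOrElse(local.get())
}

object Demo {
  def main(args: Array[String]): Unit = {
    val a = CachedSerializers.instanceFor(None)
    val b = CachedSerializers.instanceFor(None)
    assert(a eq b)  // same thread reuses the same cached instance
    assert(CachedSerializers.default.instancesCreated == 1)
    println(a.deserialize("ok".getBytes("UTF-8")))  // ok
  }
}
```

The Kryo resource pool mentioned in the comment is an alternative to a plain `ThreadLocal`: a pool bounds the number of live instances while still amortizing the setup cost across calls.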
[jira] [Created] (SPARK-9140) Replace TimeTracker by Stopwatch
Xiangrui Meng created SPARK-9140: Summary: Replace TimeTracker by Stopwatch Key: SPARK-9140 URL: https://issues.apache.org/jira/browse/SPARK-9140 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Minor We can replace TimeTracker in tree implementations by Stopwatch. The initial PR could use local stopwatches only. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
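A local stopwatch of the kind the ticket suggests could look roughly like this (a sketch with hypothetical names, not the actual Spark `Stopwatch` API): it accumulates elapsed nanoseconds across repeated start/stop cycles, one stopwatch per timed phase of the tree training.

```scala
// Sketch of a local (single-JVM, non-distributed) stopwatch that accumulates
// elapsed time across repeated start/stop cycles.
class LocalStopwatch(val name: String) {
  private var startTime = 0L
  private var total = 0L
  private var running = false

  def start(): Unit = {
    require(!running, s"stopwatch '$name' is already running")
    running = true
    startTime = System.nanoTime()
  }

  // Stops the watch, returns the duration of this cycle, and adds it to the total.
  def stop(): Long = {
    require(running, s"stopwatch '$name' is not running")
    running = false
    val elapsed = System.nanoTime() - startTime
    total += elapsed
    elapsed
  }

  def elapsedNanos: Long = total
}

object StopwatchDemo {
  def main(args: Array[String]): Unit = {
    val sw = new LocalStopwatch("findSplits")  // phase name is illustrative
    for (_ <- 1 to 3) { sw.start(); math.sqrt(2.0); sw.stop() }
    println(s"${sw.name}: ${sw.elapsedNanos} ns over 3 runs")
  }
}
```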
[jira] [Resolved] (SPARK-9138) Vectors.dense() in Python should accept numbers directly
[ https://issues.apache.org/jira/browse/SPARK-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9138. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7476 [https://github.com/apache/spark/pull/7476] > Vectors.dense() in Python should accept numbers directly > > > Key: SPARK-9138 > URL: https://issues.apache.org/jira/browse/SPARK-9138 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > We already use this feature in doctests -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()
[ https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9139: -- Description: SQL's DataType.fromJson is a public API and thus must be backwards-compatible; there are also backwards-compatibility concerns related to persistence of DataType JSON in metastores. Unfortunately, we do not have any backwards-compatibility tests which attempt to read old JSON values that were written by earlier versions of Spark. DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this doesn't ensure compatibility. I think that we should address this by capturing the JSON strings produced in Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes from those strings. This might be a good starter task for someone who wants to contribute to SQL tests. was: SQL's DataType.fromJson is a public API and thus must be backwards-compatible; there are also backwards-compatibility concerns related to persistence of DataType JSON in metastores. Unfortunately, we do not have any backwards-compatibility tests which attempt to read old JSON values that were written by earlier versions of Spark. DataTypeSuite has "roundtrip" tests that test fromJson(toJson(x)), but this doesn't ensure compatibility. I think that we should address this by capturing the JSON strings produced in Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes from those strings. This might be a good starter task for someone who wants to contribute to SQL tests. > Add backwards-compatibility tests for DataType.fromJson() > - > > Key: SPARK-9139 > URL: https://issues.apache.org/jira/browse/SPARK-9139 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Josh Rosen >Priority: Critical > > SQL's DataType.fromJson is a public API and thus must be > backwards-compatible; there are also backwards-compatibility concerns related > to persistence of DataType JSON in metastores. 
> Unfortunately, we do not have any backwards-compatibility tests which attempt > to read old JSON values that were written by earlier versions of Spark. > DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this > doesn't ensure compatibility. > I think that we should address this by capturing the JSON strings produced in > Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes > from those strings. > This might be a good starter task for someone who wants to contribute to SQL > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()
[ https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9139: -- Component/s: SQL > Add backwards-compatibility tests for DataType.fromJson() > - > > Key: SPARK-9139 > URL: https://issues.apache.org/jira/browse/SPARK-9139 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Josh Rosen >Priority: Critical > > SQL's DataType.fromJson is a public API and thus must be > backwards-compatible; there are also backwards-compatibility concerns related > to persistence of DataType JSON in metastores. > Unfortunately, we do not have any backwards-compatibility tests which attempt > to read old JSON values that were written by earlier versions of Spark. > DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this > doesn't ensure compatibility. > I think that we should address this by capturing the JSON strings produced in > Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes > from those strings. > This might be a good starter task for someone who wants to contribute to SQL > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()
Josh Rosen created SPARK-9139: - Summary: Add backwards-compatibility tests for DataType.fromJson() Key: SPARK-9139 URL: https://issues.apache.org/jira/browse/SPARK-9139 Project: Spark Issue Type: Test Reporter: Josh Rosen Priority: Critical SQL's DataType.fromJson is a public API and thus must be backwards-compatible; there are also backwards-compatibility concerns related to persistence of DataType JSON in metastores. Unfortunately, we do not have any backwards-compatibility tests which attempt to read old JSON values that were written by earlier versions of Spark. DataTypeSuite has "roundtrip" tests that test fromJson(toJson(x)), but this doesn't ensure compatibility. I think that we should address this by capturing the JSON strings produced in Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes from those strings. This might be a good starter task for someone who wants to contribute to SQL tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org