[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611571#comment-14611571 ] Sean Owen commented on SPARK-8781:
--
Does this affect release artifacts or just the snapshot? That commit doesn't look related, since it doesn't touch the lines you reference here. Are you sure? If it's 'fixed' by changing it, maybe something else is at work?

> Published POMs are no longer effective POMs
> -------------------------------------------
>
> Key: SPARK-8781
> URL: https://issues.apache.org/jira/browse/SPARK-8781
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 1.3.2, 1.4.1, 1.5.0
> Reporter: Konstantin Shaposhnikov
>
> POMs published to the Maven repository are no longer effective POMs. E.g. in
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom:
> {noformat}
> ...
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-launcher_${scala.binary.version}</artifactId>
>   <version>${project.version}</version>
> </dependency>
> ...
> {noformat}
> while it should be
> {noformat}
> ...
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-launcher_2.11</artifactId>
>   <version>${project.version}</version>
> </dependency>
> ...
> {noformat}
> The following commits are most likely the cause of it:
> - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129
> - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78
> - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724
> On branch-1.4 reverting the commit fixed the issue.
> See SPARK-3812 for additional details

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
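The symptom above (an unflattened `${scala.binary.version}` surviving into a published artifactId) can be checked mechanically. A minimal sketch with the Python standard library, not a Spark or Maven tool; the function name and the sample coordinates are illustrative:

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by Maven 4.0.0 POMs, as published to repository.apache.org.
POM_NS = "{http://maven.apache.org/POM/4.0.0}"

def unresolved_dependencies(pom_xml: str):
    """Return (groupId, artifactId) pairs whose artifactId still contains an
    unresolved ${...} property placeholder, i.e. the published POM is not an
    effective POM."""
    root = ET.fromstring(pom_xml)
    hits = []
    for dep in root.iter(POM_NS + "dependency"):
        group = dep.findtext(POM_NS + "groupId", default="")
        artifact = dep.findtext(POM_NS + "artifactId", default="")
        if re.search(r"\$\{[^}]+\}", artifact):
            hits.append((group, artifact))
    return hits
```

Run against the spark-core POM cited above, this would flag the `spark-launcher_${scala.binary.version}` dependency while leaving properly resolved coordinates like `spark-launcher_2.11` alone.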
[jira] [Commented] (SPARK-6573) Convert inbound NaN values as null
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611565#comment-14611565 ] Josh Rosen commented on SPARK-6573:
---
NaN can lead to confusing exceptions during sorting if it appears in a column. I just ran into an issue where Sort threw a "Comparison method violates its general contract!" error for data containing NaN columns. See my comments at https://github.com/apache/spark/pull/7179#discussion_r33749911

> Convert inbound NaN values as null
> ----------------------------------
>
> Key: SPARK-6573
> URL: https://issues.apache.org/jira/browse/SPARK-6573
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Fabian Boehnlein
>
> In pandas it is common to use numpy.nan as the null value for missing data.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only works with None as the null value, parsing it as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> The current stack trace when calling createDataFrame on a DataFrame with object-type columns containing np.nan values (which are floats):
> {code}
> TypeError                                 Traceback (most recent call last)
> in ()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340
> --> 341         return self.applySchema(data, schema)
>     342
>     343     def registerDataFrameAsTable(self, rdd, tableName):
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249
>     250         # convert python objects to sql data
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                             "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067
>    1068 _cached_cls = weakref.WeakValueDictionary()
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048         if type(obj) not in _acceptable_types[_type]:
>    1049             raise TypeError("%s can not accept object in type %s"
> -> 1050                             % (dataType, type(obj)))
>    1051
>    1052     if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type <type 'float'>
> {code}
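Until createDataFrame accepts NaN directly, the usual workaround is to replace NaN cells with None before handing the rows to Spark. A minimal pure-Python sketch (the function name is ours, not a pandas or PySpark API):

```python
import math

def nan_to_none(rows):
    """Replace float('nan') cells with None so that Spark's schema
    verification (e.g. for StringType columns) never sees a stray float."""
    def convert(cell):
        if isinstance(cell, float) and math.isnan(cell):
            return None
        return cell
    return [tuple(convert(c) for c in row) for row in rows]
```

One would then call something like `sqlCtx.createDataFrame(nan_to_none(rows), schema=schema)`; the call site here is hypothetical and depends on how the pandas DataFrame is turned into rows.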
[jira] [Assigned] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8783:
---
Assignee: Apache Spark

> CTAS with WITH clause does not work
> -----------------------------------
>
> Key: SPARK-8783
> URL: https://issues.apache.org/jira/browse/SPARK-8783
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Keuntae Park
> Assignee: Apache Spark
> Priority: Minor
>
> The following CTAS query with a WITH clause
> {code}
> CREATE TABLE with_table1 AS
> WITH T AS (
>   SELECT *
>   FROM table1
> )
> SELECT *
> FROM T
> {code}
> produces the following error:
> {code}
> no such table T; line 7 pos 5
> org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5
> ...
> {code}
> I think the WITH clause within CTAS is not handled properly.
[jira] [Commented] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611556#comment-14611556 ] Apache Spark commented on SPARK-8783:
-
User 'sirpkt' has created a pull request for this issue: https://github.com/apache/spark/pull/7180
[jira] [Assigned] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8783:
---
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611555#comment-14611555 ] Antony Mayi commented on SPARK-8708:
-
bq. Antony Mayi In your real case, how many partitions did ALS.predictAll return?
512 partitions, of which 511 are empty and a single one holds all 13M ratings.

> MatrixFactorizationModel.predictAll() populates single partition only
> ---------------------------------------------------------------------
>
> Key: SPARK-8708
> URL: https://issues.apache.org/jira/browse/SPARK-8708
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Antony Mayi
>
> When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades the performance of further processing. I can obviously run .partitionBy() to balance it, but that's still too costly (e.g. if running .predictAll() in a loop for thousands of products); it should rather be possible to do it on the model (automatically).
> Below is an example on a tiny sample (same on a large dataset):
> {code:title=pyspark}
> >>> r1 = (1, 1, 1.0)
> >>> r2 = (1, 2, 2.0)
> >>> r3 = (2, 1, 2.0)
> >>> r4 = (2, 2, 2.0)
> >>> r5 = (3, 1, 1.0)
> >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
> >>> ratings.getNumPartitions()
> 5
> >>> users = ratings.map(itemgetter(0)).distinct()
> >>> model = ALS.trainImplicit(ratings, 1, seed=10)
> >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
> >>> predictions_for_2.glom().map(len).collect()
> [0, 0, 3, 0, 0]
> {code}
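The .partitionBy() workaround mentioned in the description amounts to hash-partitioning the prediction keys. A pure-Python sketch of the idea, with no Spark involved (the bucket count and key function are illustrative):

```python
def hash_partition(records, num_partitions, key=lambda r: r[0]):
    """Distribute records into num_partitions buckets by key hash,
    mimicking what RDD.partitionBy(num_partitions) does for pair RDDs."""
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        buckets[hash(key(rec)) % num_partitions].append(rec)
    return buckets

# All predictions land in one partition (the reported behaviour) ...
skewed = [[(u, 2, 0.5) for u in range(1, 4)], [], [], [], []]
# ... and get spread across buckets after repartitioning by user id:
balanced = hash_partition([p for part in skewed for p in part], 5)
```

This rebalances downstream work, but as the reporter notes it costs a full shuffle per predictAll() call, which is why doing it inside the model would be preferable.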
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611553#comment-14611553 ] hujiayin commented on SPARK-5682:
-
Since the encrypted shuffle in Spark focuses on the common module, it may not be good to use the Hadoop API. On the other side, the AES solution is a bit heavy for encoding/decoding live streaming data.

> Add encrypted shuffle in spark
> ------------------------------
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the ones OpenSSL provides.
> Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework.
[jira] [Created] (SPARK-8783) CTAS with WITH clause does not work
Keuntae Park created SPARK-8783:
---
Summary: CTAS with WITH clause does not work
Key: SPARK-8783
URL: https://issues.apache.org/jira/browse/SPARK-8783
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Keuntae Park
Priority: Minor

The following CTAS query with a WITH clause
{code}
CREATE TABLE with_table1 AS
WITH T AS (
  SELECT *
  FROM table1
)
SELECT *
FROM T
{code}
produces the following error:
{code}
no such table T; line 7 pos 5
org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5
...
{code}
I think the WITH clause within CTAS is not handled properly.
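For reference, the expected semantics (the WITH-clause alias being visible to the CTAS body) can be demonstrated with SQLite, which supports CTEs inside CREATE TABLE ... AS. The table names mirror the report; the data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "a"), (2, "b")])

# The same shape of query that fails in Spark SQL: a CTAS whose body
# references a WITH-clause alias T.
conn.execute("""
    CREATE TABLE with_table1 AS
    WITH T AS (
        SELECT * FROM table1
    )
    SELECT * FROM T
""")

rows = conn.execute("SELECT * FROM with_table1 ORDER BY id").fetchall()
# rows == [(1, 'a'), (2, 'b')]
```

So the analyzer should resolve T from the CTE before looking it up in the catalog, which is exactly what the bug report says Spark's CTAS path fails to do.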
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611547#comment-14611547 ] liyunzhang_intel commented on SPARK-5682:
-
[~hujiayin]: thanks for your comment. This feature is not based on Hadoop 2.6; it was based on Hadoop 2.6 only in the original design. The latest design doc (20150506) shows that there are now two ways to implement encrypted shuffle in Spark. Currently we only implement it on the Spark-on-YARN framework. One way is based on [Chimera (a project which strips the code related to CryptoInputStream/CryptoOutputStream from Hadoop to facilitate AES-NI based data encryption in other projects)|https://github.com/intel-hadoop/chimera] (see https://github.com/apache/spark/pull/5307). In the other way, we implement all the crypto classes like CryptoInputStream/CryptoOutputStream in Scala under the core/src/main/scala/org/apache/spark/crypto/ package (see https://github.com/apache/spark/pull/4491). As for the problem of importing a Hadoop API into Spark: if the interface of a Hadoop class is public and stable, it can be used in Spark. https://hadoop.apache.org/docs/current/api/org/apache/hadoop/classification/InterfaceStability.html says:
{quote}
Incompatible changes must not be made to classes marked as stable.
{quote}
which means that once a class is marked stable, later releases will not change it incompatibly.
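The CTR mode discussed throughout this thread turns a block cipher into a stream cipher: a counter is encrypted to produce a keystream, and encryption and decryption are the same XOR. A toy illustration of that structure in pure Python, using SHA-256 as a stand-in PRF since the standard library has no AES (this is NOT AES and NOT the JceAesCtrCryptoCodec/OpensslAesCtrCryptoCodec implementation, just the CTR shape):

```python
import hashlib

def ctr_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Generate a keystream by hashing key || nonce || counter.
    (Toy PRF standing in for the AES block cipher used by the codecs.)"""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # CTR mode: encrypt and decrypt are the same XOR against the keystream,
    # which is why it suits streaming shuffle data (no padding, seekable).
    stream = ctr_keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))
```

The "no padding, same operation both ways" property is one reason CTR is the mode chosen for shuffle encryption, though as hujiayin notes the AES computation itself still adds per-byte cost.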
[jira] [Assigned] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8782:
---
Assignee: Josh Rosen (was: Apache Spark)

> GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
> ----------------------------------------------------------------
>
> Key: SPARK-8782
> URL: https://issues.apache.org/jira/browse/SPARK-8782
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
>
> Queries containing ORDER BY NULL currently result in a code generation exception:
> {code}
> public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>   return new SpecificOrdering(expr);
> }
>
> class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null;
>
>   public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>     expressions = expr;
>   }
>
>   @Override
>   public int compare(InternalRow a, InternalRow b) {
>     InternalRow i = null; // Holds current row being evaluated.
>
>     i = a;
>     final Object primitive1 = null;
>     i = b;
>     final Object primitive3 = null;
>     if (true && true) {
>       // Nothing
>     } else if (true) {
>       return -1;
>     } else if (true) {
>       return 1;
>     } else {
>       int comp = primitive1.compare(primitive3);
>       if (comp != 0) {
>         return comp;
>       }
>     }
>
>     return 0;
>   }
> }
>
> org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named "compare" is not declared in any enclosing class nor any supertype, nor through a static import
> at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
> {code}
[jira] [Assigned] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8782:
---
Assignee: Apache Spark (was: Josh Rosen)
[jira] [Commented] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611543#comment-14611543 ] Apache Spark commented on SPARK-8782:
-
User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7179
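Conceptually, what the generated compare(a, b) needs for NullType is a null-safe total ordering: NULL equals NULL and sorts before every value, so the comparator contract ("Comparison method violates its general contract!") is never violated. A Python sketch of that ordering, not Spark's actual codegen:

```python
from functools import cmp_to_key

def null_safe_compare(a, b):
    """Order None before any value; None compares equal to None.
    The shape the generated compare(a, b) should have for NullType columns."""
    if a is None and b is None:
        return 0
    if a is None:
        return -1
    if b is None:
        return 1
    return (a > b) - (a < b)

ordered = sorted([3, None, 1, None, 2], key=cmp_to_key(null_safe_compare))
# ordered == [None, None, 1, 2, 3]
```

Because the comparator is total (every pair of inputs gets a consistent answer), sort algorithms that assume transitivity, like Java's TimSort, cannot throw the contract-violation error mentioned in the SPARK-6573 thread above.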
[jira] [Updated] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8687:
-
Fix Version/s: 1.4.2

> Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
> -------------------------------------------------------------------------------
>
> Key: SPARK-8687
> URL: https://issues.apache.org/jira/browse/SPARK-8687
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.5.0
> Reporter: SaintBacchus
> Assignee: SaintBacchus
> Fix For: 1.5.0, 1.4.2
>
> YARN sets +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor fetches the old configuration, which causes the problem.
[jira] [Updated] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8687:
-
Target Version/s: 1.5.0, 1.4.2 (was: 1.5.0)
[jira] [Closed] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8687.
-
Resolution: Fixed
Assignee: SaintBacchus
Fix Version/s: 1.5.0
Target Version/s: 1.5.0
[jira] [Closed] (SPARK-3071) Increase default driver memory
[ https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3071.
-
Resolution: Fixed
Fix Version/s: 1.5.0

> Increase default driver memory
> ------------------------------
>
> Key: SPARK-3071
> URL: https://issues.apache.org/jira/browse/SPARK-3071
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.4.2
> Reporter: Xiangrui Meng
> Assignee: Ilya Ganelin
> Fix For: 1.5.0
>
> The current default is 512M, which is usually too small because the user often also uses the driver to do some computation. In local mode, the executor memory setting is ignored and only driver memory is used, which provides even more incentive to increase the default driver memory.
> I suggest:
> 1. 2GB in local mode, and warn users if executor memory is set to a bigger value
> 2. same as worker memory on an EC2 standalone server
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:10 AM:
-
Steps were added to encode and decode the data, so the performance will not be faster than before. At the same time, the code also has a security issue, for example saving plain text in a configuration file that is finally used as part of the key. If you use a better cipher solution, the performance downgrade will be minimized; I think AES is a bit heavy. Also, the feature was based on Hadoop 2.6, which is a limitation; that is why I said it relies on Hadoop. Though the API is marked public/stable, you cannot be sure the API will never change, since it is not commercial software.
[jira] [Closed] (SPARK-8740) Support GitHub OAuth tokens in dev/merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8740.
-
Resolution: Fixed
Fix Version/s: 1.5.0
Target Version/s: 1.5.0

> Support GitHub OAuth tokens in dev/merge_spark_pr.py
> ----------------------------------------------------
>
> Key: SPARK-8740
> URL: https://issues.apache.org/jira/browse/SPARK-8740
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Minor
> Fix For: 1.5.0
>
> We should allow dev/merge_spark_pr.py to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues.
[jira] [Closed] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8769.
-
Resolution: Fixed
Assignee: holdenk
Fix Version/s: 1.4.2, 1.5.0
Target Version/s: 1.5.0, 1.4.2

> toLocalIterator should mention it results in many jobs
> ------------------------------------------------------
>
> Key: SPARK-8769
> URL: https://issues.apache.org/jira/browse/SPARK-8769
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Reporter: holdenk
> Assignee: holdenk
> Priority: Trivial
> Fix For: 1.5.0, 1.4.2
>
> toLocalIterator on RDDs should mention that it results in multiple jobs, and that to avoid re-computation, if the input was the result of a wide transformation, the input should be cached.
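The behaviour this documentation fix describes: toLocalIterator pulls one partition at a time, so each partition fetch is a separate job, and an uncached wide-transformation input gets recomputed for each of them. A pure-Python sketch of that access pattern (the job counter is illustrative; Spark's scheduler is not involved):

```python
def to_local_iterator(partitions, job_log):
    """Yield elements partition by partition, recording one 'job'
    per partition fetch, the way RDD.toLocalIterator behaves."""
    for idx, part in enumerate(partitions):
        job_log.append(idx)          # one job submitted per partition
        for element in part:
            yield element

jobs = []
data = list(to_local_iterator([[1, 2], [3], [4, 5]], jobs))
# data == [1, 2, 3, 4, 5]; three jobs ran, one per partition
```

This is why caching the input matters: without it, the lineage up to each partition is re-executed once per job rather than once in total.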
[jira] [Closed] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8771.
-
Resolution: Fixed
Assignee: holdenk
Fix Version/s: 1.5.0
Target Version/s: 1.5.0

> Actor system deprecation tag uses deprecated deprecation tag
> ------------------------------------------------------------
>
> Key: SPARK-8771
> URL: https://issues.apache.org/jira/browse/SPARK-8771
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.4.0
> Reporter: holdenk
> Assignee: holdenk
> Priority: Trivial
> Fix For: 1.5.0
>
> The deprecation of the actor system adds a spurious build warning:
> {quote}
> @deprecated now takes two arguments; see the scaladoc.
> [warn] @deprecated("Actor system is no longer supported as of 1.4")
> {quote}
[jira] [Updated] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8771:
-
Affects Version/s: 1.4.0
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:03 AM: - Extra steps were added to encode and decode the data, so performance will be slower than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and finally used as part of the key. In addition, the feature is based on Hadoop 2.6, which is a limitation; that is why I said it relies on Hadoop. Though the API is public and stable, you cannot guarantee that it will not change, since it is not commercial software. was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in hadoop 2.6 which makes the process of shuffle > data safer. This feature is necessary in spark. AES is a specification for > the encryption of electronic data. There are 5 common modes in AES. CTR is > one of the modes. 
We use two codec JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used > in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk > provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl > provides. > Because ugi credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8688. Resolution: Fixed Assignee: SaintBacchus Fix Version/s: 1.5.0 Target Version/s: 1.5.0 > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus >Assignee: SaintBacchus > Fix For: 1.5.0 > > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark writes and reads the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then when the old token has expired, it can't obtain the authorization to > read/write HDFS. > (I only tested for a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
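As a sketch of the workaround the description implies (this assumes Spark's generic `spark.hadoop.*` prefix for forwarding arbitrary Hadoop properties; verify against your Spark version before relying on it), the FileSystem client cache could be disabled cluster-wide via `spark-defaults.conf`:

```
# spark-defaults.conf -- assumed spark.hadoop.* forwarding of Hadoop
# properties; disables the cached FileSystem that holds the old token
spark.hadoop.fs.hdfs.impl.disable.cache   true
```

The fix referenced by this ticket instead disables the cache only around the token read/write paths, which avoids paying the cost of uncached FileSystem instances everywhere else.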
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:02 AM: - Extra steps were added to encode and decode the data, so performance will be slower than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and finally used as part of the key. In addition, the feature is based on Hadoop 2.6, which is a limitation; that is why I said it relies on Hadoop. Though it is public and stable, you cannot guarantee that the API will not change, since it is not commercial software. was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said reply on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in hadoop 2.6 which makes the process of shuffle > data safer. This feature is necessary in spark. AES is a specification for > the encryption of electronic data. There are 5 common modes in AES. CTR is > one of the modes. 
We use two codec JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used > in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk > provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl > provides. > Because ugi credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611527#comment-14611527 ] hujiayin commented on SPARK-5682: - Extra steps were added to encode and decode the data, so performance will be slower than before. At the same time, the code also has a security issue: for example, the plain text is saved in a configuration file and finally used as part of the key. In addition, the feature is based on Hadoop 2.6, which is a limitation; that is why I said it relies on Hadoop. Though it is public and stable, you cannot guarantee that the API will not change, since it is not commercial software. > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in hadoop 2.6 which makes the process of shuffle > data safer. This feature is necessary in spark. AES is a specification for > the encryption of electronic data. There are 5 common modes in AES. CTR is > one of the modes. We use two codec JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used > in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk > provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl > provides. > Because ugi credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on spark-on-yarn framework. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8754) YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
[ https://issues.apache.org/jira/browse/SPARK-8754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8754. Resolution: Fixed Fix Version/s: 1.4.2 1.5.0 Target Version/s: 1.5.0, 1.4.2 > YarnClientSchedulerBackend doesn't stop gracefully in failure conditions > > > Key: SPARK-8754 > URL: https://issues.apache.org/jira/browse/SPARK-8754 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.4.0 >Reporter: Devaraj K >Priority: Minor > Fix For: 1.5.0, 1.4.2 > > > {code:xml} > java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:151) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:421) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1447) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1651) > at org.apache.spark.SparkContext.(SparkContext.scala:572) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:621) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > If the application has FINISHED/FAILED/KILLED or failed to launch application > master, monitorThread is not getting initialized but > monitorThread.interrupt() is getting invoked as part of stop() 
without any > null check, which causes an NPE and prevents the client from > stopping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611505#comment-14611505 ] Xiangrui Meng commented on SPARK-8708: -- [~antonymayi] In your real case, how many partitions did ALS.predictAll return? > MatrixFactorizationModel.predictAll() populates single partition only > - > > Key: SPARK-8708 > URL: https://issues.apache.org/jira/browse/SPARK-8708 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Antony Mayi > > When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all > values pushed into a single partition despite quite high parallelism being used. > This degrades the performance of further processing. (I can obviously run > .partitionBy() to balance it, but that's still too costly, e.g. if running > .predictAll() in a loop for thousands of products; it should rather be possible > to do it somehow on the model, automatically.) > Below is an example on a tiny sample (same on a large dataset): > {code:title=pyspark} > >>> r1 = (1, 1, 1.0) > >>> r2 = (1, 2, 2.0) > >>> r3 = (2, 1, 2.0) > >>> r4 = (2, 2, 2.0) > >>> r5 = (3, 1, 1.0) > >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) > >>> ratings.getNumPartitions() > 5 > >>> users = ratings.map(itemgetter(0)).distinct() > >>> model = ALS.trainImplicit(ratings, 1, seed=10) > >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) > >>> predictions_for_2.glom().map(len).collect() > [0, 0, 3, 0, 0] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
Josh Rosen created SPARK-8782: - Summary: GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true && true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named "compare" is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
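For reference, the comparator semantics the generated code should reduce to for NullType can be sketched in plain Python. This is a hypothetical model of the intended branch structure (two NULLs compare equal, and for ORDER BY NULL that branch always fires), not Spark's actual codegen:

```python
def null_safe_compare(a, b):
    """NULL-aware three-way compare: NULLs are mutually equal and sort
    before non-NULL values. For a NullType column both sides are always
    None, so the first branch should always return 0 instead of ever
    reaching a compare() call on a null "primitive"."""
    if a is None and b is None:
        return 0          # the case ORDER BY NULL always hits
    if a is None:
        return -1
    if b is None:
        return 1
    return (a > b) - (a < b)  # standard three-way compare for non-nulls

print(null_safe_compare(None, None))  # 0
print(null_safe_compare(None, 5))     # -1
print(null_safe_compare(7, 5))        # 1
```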
[jira] [Created] (SPARK-8781) Published POMs are no longer effective POMs
Konstantin Shaposhnikov created SPARK-8781: -- Summary: Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the Maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... org.apache.spark spark-launcher_${scala.binary.version} ${project.version} ... {noformat} while it should be {noformat} ... org.apache.spark spark-launcher_2.11 ${project.version} ... {noformat} The following commits are the most likely cause: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4, reverting the commit fixed the issue. See SPARK-3812 for additional details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8780) Move Python doctest code example from models to algorithms
Yanbo Liang created SPARK-8780: -- Summary: Move Python doctest code example from models to algorithms Key: SPARK-8780 URL: https://issues.apache.org/jira/browse/SPARK-8780 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Almost all doctest code examples in PySpark MLlib are in the models. Since users usually start with algorithms rather than models, we should move them from the models to the algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611491#comment-14611491 ] liyunzhang_intel commented on SPARK-5682: - [~hujiayin]: thanks for your comment {quote} The solution relied on hadoop API and maybe downgrade the performance.  {quote} Regarding "the solution relied on hadoop API": you mean that I use org.apache.hadoop.io.Text in [CommonConfigurationKeys |https://github.com/apache/spark/pull/4491/files#diff-a76c55d0e8f2e4e1a6cb5848826585fe]. But I have a different view: {code} @Stringable @InterfaceAudience.Public @InterfaceStability.Stable public class Text extends BinaryComparable org.apache.hadoop.io.Text {code} This shows that org.apache.hadoop.io.Text is stable, which means the interfaces it provides will not change much in later releases. Regarding downgraded performance: do you have any test results to show this? > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in hadoop 2.6 which makes the process of shuffle > data safer. This feature is necessary in spark. AES is a specification for > the encryption of electronic data. There are 5 common modes in AES. CTR is > one of the modes. We use two codec JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used > in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk > provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl > provides. 
> Because ugi credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8227) math function: unhex
[ https://issues.apache.org/jira/browse/SPARK-8227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8227. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7113 [https://github.com/apache/spark/pull/7113] > math function: unhex > > > Key: SPARK-8227 > URL: https://issues.apache.org/jira/browse/SPARK-8227 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > Fix For: 1.5.0 > > > unhex(STRING a): BINARY > Inverse of hex. Interprets each pair of characters as a hexadecimal number > and converts to the byte representation of the number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
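The unhex semantics described above (each pair of characters interpreted as a hexadecimal number and converted to its byte value) can be sketched in a few lines of Python; this is an illustration of the described behavior, not Spark's implementation:

```python
def unhex(s: str) -> bytes:
    """Inverse of hex: interpret each pair of characters as a hexadecimal
    number and return the byte representation (sketch of the semantics
    quoted in SPARK-8227)."""
    return bytes.fromhex(s)

print(unhex("4D7953514C"))  # b'MySQL'
```

Note that inputs of odd length or with non-hex characters are edge cases the ticket doesn't specify; `bytes.fromhex` raises `ValueError` for them, whereas the SQL function's behavior (e.g. returning NULL) would need to be checked against the actual implementation.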
[jira] [Created] (SPARK-8779) Add documentation for Python's FP-growth
Hrishikesh created SPARK-8779: - Summary: Add documentation for Python's FP-growth Key: SPARK-8779 URL: https://issues.apache.org/jira/browse/SPARK-8779 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark Reporter: Hrishikesh Priority: Minor We need to add documentation for Python FP-Growth in the MLlib Programming Guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8224) math function: shiftright
[ https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611459#comment-14611459 ] Apache Spark commented on SPARK-8224: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/7035 > math function: shiftright > - > > Key: SPARK-8224 > URL: https://issues.apache.org/jira/browse/SPARK-8224 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) > Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, > smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8223) math function: shiftleft
[ https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611458#comment-14611458 ] Apache Spark commented on SPARK-8223: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/7035 > math function: shiftleft > > > Key: SPARK-8223 > URL: https://issues.apache.org/jira/browse/SPARK-8223 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > shiftleft(INT a) > shiftleft(BIGINT a) > Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and > int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
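The shift semantics quoted in these two tickets can be sketched in Python. Widths are modeled explicitly, since Python ints are unbounded; the 32-bit truncation below mirrors the described "returns int for int a" behavior (64 would correspond to bigint). This is an illustrative sketch of the Hive-style semantics, not Spark's implementation:

```python
def shiftleft(a: int, n: int, bits: int = 32) -> int:
    """Bitwise left shift truncated to the column width (32 bits for int,
    64 for bigint), with the result reinterpreted as signed."""
    mask = (1 << bits) - 1
    r = (a << n) & mask
    # reinterpret the top bit as a sign bit
    return r - (1 << bits) if r >= (1 << (bits - 1)) else r

def shiftright(a: int, n: int) -> int:
    """Arithmetic (sign-preserving) right shift; Python's >> already
    behaves this way for negative numbers."""
    return a >> n

print(shiftleft(1, 3))           # 8
print(shiftleft(2**31 - 1, 1))   # -2: overflow wraps within a 32-bit int
print(shiftright(-8, 1))         # -4
```

The unsigned variant mentioned in SPARK-8224's description (shiftrightunsigned) would instead mask to the width first and then shift, so the sign bit is not propagated.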
[jira] [Assigned] (SPARK-8223) math function: shiftleft
[ https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8223: --- Assignee: Apache Spark (was: zhichao-li) > math function: shiftleft > > > Key: SPARK-8223 > URL: https://issues.apache.org/jira/browse/SPARK-8223 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > shiftleft(INT a) > shiftleft(BIGINT a) > Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and > int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8223) math function: shiftleft
[ https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611448#comment-14611448 ] Apache Spark commented on SPARK-8223: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/7178 > math function: shiftleft > > > Key: SPARK-8223 > URL: https://issues.apache.org/jira/browse/SPARK-8223 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > shiftleft(INT a) > shiftleft(BIGINT a) > Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and > int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8224) math function: shiftright
[ https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611449#comment-14611449 ] Apache Spark commented on SPARK-8224: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/7178 > math function: shiftright > - > > Key: SPARK-8224 > URL: https://issues.apache.org/jira/browse/SPARK-8224 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) > Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, > smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8224) math function: shiftright
[ https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8224: --- Assignee: Apache Spark (was: zhichao-li) > math function: shiftright > - > > Key: SPARK-8224 > URL: https://issues.apache.org/jira/browse/SPARK-8224 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) > Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, > smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8223) math function: shiftleft
[ https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8223: --- Assignee: zhichao-li (was: Apache Spark) > math function: shiftleft > > > Key: SPARK-8223 > URL: https://issues.apache.org/jira/browse/SPARK-8223 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > shiftleft(INT a) > shiftleft(BIGINT a) > Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and > int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8224) math function: shiftright
[ https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8224: --- Assignee: zhichao-li (was: Apache Spark) > math function: shiftright > - > > Key: SPARK-8224 > URL: https://issues.apache.org/jira/browse/SPARK-8224 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) > Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, > smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8770. Resolution: Fixed > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8765) Flaky PySpark PowerIterationClustering test
[ https://issues.apache.org/jira/browse/SPARK-8765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611440#comment-14611440 ] Apache Spark commented on SPARK-8765: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7177 > Flaky PySpark PowerIterationClustering test > --- > > Key: SPARK-8765 > URL: https://issues.apache.org/jira/browse/SPARK-8765 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Critical > Labels: flaky-test > > See failure: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console] > {code} > ** > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/clustering.py", > line 291, in __main__.PowerIterationClusteringModel > Failed example: > sorted(model.assignments().collect()) > Expected: > [Assignment(id=0, cluster=1), Assignment(id=1, cluster=0), ... > Got: > [Assignment(id=0, cluster=1), Assignment(id=1, cluster=1), > Assignment(id=2, cluster=1), Assignment(id=3, cluster=1), Assignment(id=4, > cluster=0)] > ** > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/clustering.py", > line 299, in __main__.PowerIterationClusteringModel > Failed example: > sorted(sameModel.assignments().collect()) > Expected: > [Assignment(id=0, cluster=1), Assignment(id=1, cluster=0), ... > Got: > [Assignment(id=0, cluster=1), Assignment(id=1, cluster=1), > Assignment(id=2, cluster=1), Assignment(id=3, cluster=1), Assignment(id=4, > cluster=0)] > ** >2 of 13 in __main__.PowerIterationClusteringModel > ***Test Failed*** 2 failures. > Had test failures in pyspark.mllib.clustering with python2.6; see logs. > {code} > CC: [~mengxr] [~yanboliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8778) The item of "Scheduler delay" is not consistent between "Event Timeline" and "Task List"
[ https://issues.apache.org/jira/browse/SPARK-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangxiongfei updated SPARK-8778: - Description: In the page "Details for Stage" of Spark Web UI, "Scheduler delay" of some running tasks showed in "Event Timeline" is not consistent with that showed in "Tasks".I attached 2 snapshots.In "Event Timeline" section ,almost all the time is about "Scheduler delay",however, the "Scheduler delay" is 0 in the "Task" section. (was: In the page "Details for Stage" of Spark Web UI, "Scheduler delay" of some running tasks showed in "Event Timeline" is not displayed consistent with that showed in "Tasks".I attached 2 snapshots.In "Event Timeline" section ,almost all the time is about "Scheduler delay",however, the "Scheduler delay" is 0 in the "Task" section.) > The item of "Scheduler delay" is not consistent between "Event Timeline" and > "Task List" > > > Key: SPARK-8778 > URL: https://issues.apache.org/jira/browse/SPARK-8778 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: zhangxiongfei >Priority: Minor > Attachments: Event Timeline.png, Tasks.png > > > In the page "Details for Stage" of Spark Web UI, "Scheduler delay" of some > running tasks showed in "Event Timeline" is not consistent with that showed > in "Tasks".I attached 2 snapshots.In "Event Timeline" section ,almost all the > time is about "Scheduler delay",however, the "Scheduler delay" is 0 in the > "Task" section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8778) The item of "Scheduler delay" is not consistent between "Event Timeline" and "Task List"
[ https://issues.apache.org/jira/browse/SPARK-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangxiongfei updated SPARK-8778: - Attachment: Event Timeline.png Tasks.png > The item of "Scheduler delay" is not consistent between "Event Timeline" and > "Task List" > > > Key: SPARK-8778 > URL: https://issues.apache.org/jira/browse/SPARK-8778 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: zhangxiongfei >Priority: Minor > Attachments: Event Timeline.png, Tasks.png > > > In the page "Details for Stage" of Spark Web UI, "Scheduler delay" of some > running tasks showed in "Event Timeline" is not displayed consistent with > that showed in "Tasks".I attached 2 snapshots.In "Event Timeline" section > ,almost all the time is about "Scheduler delay",however, the "Scheduler > delay" is 0 in the "Task" section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8778) The item of "Scheduler delay" is not consistent between "Event Timeline" and "Task List"
zhangxiongfei created SPARK-8778: Summary: The item of "Scheduler delay" is not consistent between "Event Timeline" and "Task List" Key: SPARK-8778 URL: https://issues.apache.org/jira/browse/SPARK-8778 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: zhangxiongfei Priority: Minor On the "Details for Stage" page of the Spark Web UI, the "Scheduler delay" of some running tasks shown in "Event Timeline" is not displayed consistently with that shown in "Tasks". I attached 2 snapshots. In the "Event Timeline" section, almost all the time is attributed to "Scheduler delay"; however, the "Scheduler delay" is 0 in the "Tasks" section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8777) Add random data generation test utilities to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8777: --- Assignee: Apache Spark (was: Josh Rosen) > Add random data generation test utilities to Spark SQL > -- > > Key: SPARK-8777 > URL: https://issues.apache.org/jira/browse/SPARK-8777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > We should add utility functions for generating data that conforms to a given > SparkSQL DataType or Schema. This would make it significantly easier to write > certain types of tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8777) Add random data generation test utilities to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8777: --- Assignee: Josh Rosen (was: Apache Spark) > Add random data generation test utilities to Spark SQL > -- > > Key: SPARK-8777 > URL: https://issues.apache.org/jira/browse/SPARK-8777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should add utility functions for generating data that conforms to a given > SparkSQL DataType or Schema. This would make it significantly easier to write > certain types of tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8777) Add random data generation test utilities to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611404#comment-14611404 ] Apache Spark commented on SPARK-8777: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7176 > Add random data generation test utilities to Spark SQL > -- > > Key: SPARK-8777 > URL: https://issues.apache.org/jira/browse/SPARK-8777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should add utility functions for generating data that conforms to a given > SparkSQL DataType or Schema. This would make it significantly easier to write > certain types of tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8777) Add random data generation test utilities to Spark SQL
Josh Rosen created SPARK-8777: - Summary: Add random data generation test utilities to Spark SQL Key: SPARK-8777 URL: https://issues.apache.org/jira/browse/SPARK-8777 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen We should add utility functions for generating data that conforms to a given SparkSQL DataType or Schema. This would make it significantly easier to write certain types of tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8776) Increase the default MaxPermSize
Yin Huai created SPARK-8776: --- Summary: Increase the default MaxPermSize Key: SPARK-8776 URL: https://issues.apache.org/jira/browse/SPARK-8776 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Yin Huai Since 1.4.0, Spark SQL has isolated class loaders for separating Hive dependencies of the metastore and execution, which increases the memory consumption of PermGen. How about we increase the default size from 128m to 256m? It seems the change we need to make is https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L139. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
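The launcher-side change proposed above can be pictured with a small sketch. This is a hypothetical Python illustration, not the actual Java AbstractCommandBuilder code; the function name and command layout are assumptions. The key point is that the larger default is applied only when the user has not set MaxPermSize explicitly.

```python
# Hypothetical sketch of a launcher applying a default -XX:MaxPermSize=256m.
# This is NOT the real AbstractCommandBuilder; names here are illustrative.
def build_java_command(user_jvm_opts):
    """Return a java command list, defaulting MaxPermSize to 256m."""
    cmd = ["java"]
    cmd.extend(user_jvm_opts)
    # Respect an explicit user setting; only fill in the default otherwise.
    if not any(opt.startswith("-XX:MaxPermSize") for opt in user_jvm_opts):
        cmd.append("-XX:MaxPermSize=256m")
    cmd.append("org.apache.spark.deploy.SparkSubmit")
    return cmd
```

Checking for a user-supplied value first avoids silently overriding an explicit setting such as -XX:MaxPermSize=512m.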
[jira] [Created] (SPARK-8775) Move expression specific type coercion into expression themselves
Reynold Xin created SPARK-8775: -- Summary: Move expression specific type coercion into expression themselves Key: SPARK-8775 URL: https://issues.apache.org/jira/browse/SPARK-8775 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8772) Implement implicit type cast for expressions that define expected input types
[ https://issues.apache.org/jira/browse/SPARK-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8772: --- Assignee: Reynold Xin (was: Apache Spark) > Implement implicit type cast for expressions that define expected input types > - > > Key: SPARK-8772 > URL: https://issues.apache.org/jira/browse/SPARK-8772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We should have a engine-wide implicit cast rule defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8772) Implement implicit type cast for expressions that define expected input types
[ https://issues.apache.org/jira/browse/SPARK-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611360#comment-14611360 ] Apache Spark commented on SPARK-8772: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7175 > Implement implicit type cast for expressions that define expected input types > - > > Key: SPARK-8772 > URL: https://issues.apache.org/jira/browse/SPARK-8772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We should have a engine-wide implicit cast rule defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8772) Implement implicit type cast for expressions that define expected input types
[ https://issues.apache.org/jira/browse/SPARK-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8772: --- Assignee: Apache Spark (was: Reynold Xin) > Implement implicit type cast for expressions that define expected input types > - > > Key: SPARK-8772 > URL: https://issues.apache.org/jira/browse/SPARK-8772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We should have a engine-wide implicit cast rule defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-8770: Reverted the commit since it failed Python tests. > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
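The BinaryExpression/BinaryOperator split described in the issue can be sketched in a few lines. The class names follow the proposal, but the checking logic below is an illustrative assumption written in Python, not Spark's Scala implementation:

```python
# Illustrative sketch of the proposed split: a generic BinaryExpression puts
# no constraint on its children, while a BinaryOperator requires both
# children to share a type. Logic here is assumed, not Spark's actual code.
class BinaryExpression:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def check_input_types(self):
        # Generic binary expressions (e.g. Sha-like ones) accept mixed types.
        return None

class BinaryOperator(BinaryExpression):
    def check_input_types(self):
        # Operators such as +, -, AND require same-typed operands.
        if type(self.left) is not type(self.right):
            raise TypeError("BinaryOperator operands must share a type")
        return None
```

With this split, the analyzer's type-casting rule can target BinaryOperator only, leaving other binary expressions untouched.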
[jira] [Commented] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611351#comment-14611351 ] Apache Spark commented on SPARK-8770: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7174 > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception
[ https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611347#comment-14611347 ] S commented on SPARK-7483: -- I encountered the same bug. Adding sparkConf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]])) seems to fix the problem. > [MLLib] Using Kryo with FPGrowth fails with an exception > > > Key: SPARK-7483 > URL: https://issues.apache.org/jira/browse/SPARK-7483 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1 >Reporter: Tomasz Bartczak >Priority: Minor > > When using FPGrowth algorithm with KryoSerializer - Spark fails with > {code} > Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): > com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: > Can not set final scala.collection.mutable.ListBuffer field > org.apache.spark.mllib.fpm.FPTree$Summary.nodes to > scala.collection.mutable.ArrayBuffer > Serialization trace: > nodes (org.apache.spark.mllib.fpm.FPTree$Summary) > org$apache$spark$mllib$fpm$FPTree$$summaries > (org.apache.spark.mllib.fpm.FPTree) > {code} > This can be easily reproduced in spark codebase by setting > {code} > conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > {code} and running FPGrowthSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611328#comment-14611328 ] hujiayin commented on SPARK-5682: - The solution relies on the Hadoop API and may degrade performance. The AES algorithm is used for block data encryption in many cases. I think RC4 could be used to encrypt the stream, or a simple solution with an authentication header could be used. :) > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffle > data safer. This feature is necessary in spark. AES is a specification for > the encryption of electronic data. There are 5 common modes in AES. CTR is > one of the modes. We use two codecs, JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle, as is also done > in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the jdk > provides while OpensslAesCtrCryptoCodec uses encryption algorithms openssl > provides. > Because ugi credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7835) Refactor HeartbeatReceiverSuite for coverage and clean up
[ https://issues.apache.org/jira/browse/SPARK-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611315#comment-14611315 ] Apache Spark commented on SPARK-7835: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/7173 > Refactor HeartbeatReceiverSuite for coverage and clean up > - > > Key: SPARK-7835 > URL: https://issues.apache.org/jira/browse/SPARK-7835 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or > > As of the writing of this description, the existing test suite has a lot of > duplicate code and doesn't even cover the most fundamental feature of the > HeartbeatReceiver, which is expiring hosts that have not responded in a while. > https://github.com/apache/spark/blob/31d5d463e76b6611c854c6cf27059fec8198adc9/core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala > We should rewrite this test suite to increase coverage and decrease duplicate > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611304#comment-14611304 ] Antony Mayi commented on SPARK-8708: I suspect this issue of all data ending up in a single partition is also causing the following problem of mine: At some point (it seems to be random) my whole execution gets stuck after one of the .predictAll() calls and there is a message in the error log saying: bq. 15/07/01 21:54:15 INFO shuffle.ShuffleMemoryManager: Thread 16468 waiting for at least 1/2N of shuffle memory pool to be free I then have to kill the processing. Is there any simple workaround for this? > MatrixFactorizationModel.predictAll() populates single partition only > - > > Key: SPARK-8708 > URL: https://issues.apache.org/jira/browse/SPARK-8708 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Antony Mayi > > When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all > values pushed into a single partition despite using quite high parallelism. > This degrades performance of further processing (I can obviously run > .partitionBy() to balance it, but that's still too costly, e.g. if running > .predictAll() in a loop for thousands of products; it should be possible to do > this somehow on the model, automatically).
> Below is an example on a tiny sample (same on a large dataset): > {code:title=pyspark} > >>> r1 = (1, 1, 1.0) > >>> r2 = (1, 2, 2.0) > >>> r3 = (2, 1, 2.0) > >>> r4 = (2, 2, 2.0) > >>> r5 = (3, 1, 1.0) > >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) > >>> ratings.getNumPartitions() > 5 > >>> users = ratings.map(itemgetter(0)).distinct() > >>> model = ALS.trainImplicit(ratings, 1, seed=10) > >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) > >>> predictions_for_2.glom().map(len).collect() > [0, 0, 3, 0, 0] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
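What the .partitionBy() workaround mentioned in the description does can be shown with a plain-Python sketch (no Spark needed): spread the (user, prediction) pairs across partitions by key hash instead of leaving them in one. The helper name is made up for illustration.

```python
# Simulate hash partitioning of key-value pairs across N buckets -- the
# effect .partitionBy(numPartitions) would have on the predictAll() RDD.
# Illustrative only; not Spark's partitioner implementation.
def repartition_by_key(pairs, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

# Five users, each with a prediction for product 2, spread over 5 buckets.
predictions = [(user, 2.0) for user in range(1, 6)]
buckets = repartition_by_key(predictions, 5)
```

In PySpark the equivalent would be keying the RDD and calling .partitionBy(n), at the cost of a shuffle, which is exactly the overhead the reporter would like to avoid when calling .predictAll() in a loop.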
[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611258#comment-14611258 ] Tathagata Das commented on SPARK-8743: -- I have assigned this to you. > Deregister Codahale metrics for streaming when StreamingContext is closed > -- > > Key: SPARK-8743 > URL: https://issues.apache.org/jira/browse/SPARK-8743 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Assignee: Neelesh Srinivas Salian > Labels: starter > > Currently, when the StreamingContext is closed, the registered metrics are > not deregistered. If another streaming context is started, it throws a > warning saying that the metrics are already registered. > The solution is to deregister the metrics when streamingcontext is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8743: - Assignee: Neelesh Srinivas Salian > Deregister Codahale metrics for streaming when StreamingContext is closed > -- > > Key: SPARK-8743 > URL: https://issues.apache.org/jira/browse/SPARK-8743 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Assignee: Neelesh Srinivas Salian > Labels: starter > > Currently, when the StreamingContext is closed, the registered metrics are > not deregistered. If another streaming context is started, it throws a > warning saying that the metrics are already registered. > The solution is to deregister the metrics when streamingcontext is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611244#comment-14611244 ] Joseph K. Bradley commented on SPARK-8069: -- I like the idea of including it in an abstraction like ClassificationModel and ProbabilisticClassificationModel, unless it is too difficult. If a developer does not want to support thresholds/cutoffs (or wants to modify the API), the developer does not have to use the abstraction. The main difficulty I see is in trying to specify thresholds in a uniform way: * Thresholding rawPrediction vs. probability: It would be easy to mimic the R randomForest package for thresholding probabilities, for which we know which values are in the range [0,1]. That won't work well for rawPrediction values, which could be negative. ** We could initially only support thresholding for ProbabilisticClassificationModel. I expect to modify trees & tree ensembles to subclass ProbabilisticClassificationModel in release 1.5 (WIP). ** Do you have ideas for thresholding for rawPrediction? * Binary vs. multiclass: It would be nice to think of a way to naturally support binary, though it might mean modifying or deprecating HasThreshold. Once we decide on a good way to specify thresholds, then perhaps the binary case can be handled by providing a setter as in HasThreshold ({{setThreshold(value: Double)}}) but returning the generalized threshold in the getter ({{Vector getThreshold}}). 
> Add support for cutoff to RandomForestClassifier > > > Key: SPARK-8069 > URL: https://issues.apache.org/jira/browse/SPARK-8069 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > Original Estimate: 240h > Remaining Estimate: 240h > > Consider adding support for cutoffs similar to > http://cran.r-project.org/web/packages/randomForest/randomForest.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
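The R-style cutoff semantics under discussion can be sketched as follows. This is an assumption about the intended behavior, mimicking R's randomForest (the class with the largest probability-to-cutoff ratio wins), not Spark ML code:

```python
# Hypothetical cutoff-based prediction, following R's randomForest "cutoff"
# semantics: the winning class maximizes probability / cutoff rather than
# the raw probability. Illustrative sketch, not Spark ML.
def predict_with_cutoff(probabilities, cutoffs):
    ratios = [p / c for p, c in zip(probabilities, cutoffs)]
    return max(range(len(ratios)), key=ratios.__getitem__)
```

With equal cutoffs this reduces to an ordinary argmax over probabilities, so the default behavior is unchanged; and as the comment notes, rawPrediction values can be negative, so this ratio trick does not transfer to them directly.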
[jira] [Commented] (SPARK-5222) YARN client and cluster modes have different app name behaviors
[ https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611241#comment-14611241 ] Andrew Or commented on SPARK-5222: -- I'm inclined to close this as a "Won't Fix" because the user can set the app name in his/her main method, in which case there's nothing we can do because cluster mode will not have access to the name until later. I'll let this sit for a few more days to see if others have any objections to this resolution. > YARN client and cluster modes have different app name behaviors > --- > > Key: SPARK-5222 > URL: https://issues.apache.org/jira/browse/SPARK-5222 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Tao Wang >Priority: Minor > > The behavior is summarized in a table produced by [~WangTaoTheTonic] here: > https://github.com/apache/spark/pull/3557 > SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. > This results in the strange behavior where the app name changes if the user > runs the same application but uses a different deploy mode from before. We > should make sure the app name behavior is consistent across deploy modes > regardless of what variable or config is set. > Additionally, it should be noted that because "spark.app.name" is required of > all applications, the setting of "SPARK_YARN_APP_NAME" will not take effect > unless we handle it preemptively in Spark submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8016) YARN cluster / client modes have different app names for python
[ https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8016: - Priority: Minor (was: Major) > YARN cluster / client modes have different app names for python > --- > > Key: SPARK-8016 > URL: https://issues.apache.org/jira/browse/SPARK-8016 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.4.0 >Reporter: Andrew Or >Priority: Minor > Attachments: python.png > > > See screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors
[ https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5222: - Priority: Minor (was: Major) > YARN client and cluster modes have different app name behaviors > --- > > Key: SPARK-5222 > URL: https://issues.apache.org/jira/browse/SPARK-5222 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Tao Wang >Priority: Minor > > The behavior is summarized in a table produced by [~WangTaoTheTonic] here: > https://github.com/apache/spark/pull/3557 > SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. > This results in the strange behavior where the app name changes if the user > runs the same application but uses a different deploy mode from before. We > should make sure the app name behavior is consistent across deploy modes > regardless of what variable or config is set. > Additionally, it should be noted that because "spark.app.name" is required of > all applications, the setting of "SPARK_YARN_APP_NAME" will not take effect > unless we handle it preemptively in Spark submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8774) Add R model formula with basic support as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-8774: Assignee: Xiangrui Meng > Add R model formula with basic support as a transformer > --- > > Key: SPARK-8774 > URL: https://issues.apache.org/jira/browse/SPARK-8774 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > To have better integration with SparkR, we can add a feature transformer to > support R formula. A list of operators R supports can be find here: > http://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html > The initial version should support "~", "+", and "." on numeric columns and > we can expand it in the future. > {code} > val formula = new RModelFormula() > .setFormula("y ~ x + z") > {code} > The output should append two new columns: features and label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611224#comment-14611224 ] holdenk commented on SPARK-8069: So I started working on doing this for the RandomForestClassifier. I think we could maybe add this as a trait for multiclass classification models which return all of the scores (but the current API of the multiclass classification models seems to just return the winning class). What about if we added a trait to do this and then the other models could implement it as needed? Or I could change the base PredictionModel but that would probably be a pretty large change. > Add support for cutoff to RandomForestClassifier > > > Key: SPARK-8069 > URL: https://issues.apache.org/jira/browse/SPARK-8069 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > Original Estimate: 240h > Remaining Estimate: 240h > > Consider adding support for cutoffs similar to > http://cran.r-project.org/web/packages/randomForest/randomForest.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8774) Add R model formula with basic support as a transformer
Xiangrui Meng created SPARK-8774: Summary: Add R model formula with basic support as a transformer Key: SPARK-8774 URL: https://issues.apache.org/jira/browse/SPARK-8774 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Xiangrui Meng To have better integration with SparkR, we can add a feature transformer to support R formula. A list of operators R supports can be found here: http://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html The initial version should support "~", "+", and "." on numeric columns and we can expand it in the future. {code} val formula = new RModelFormula() .setFormula("y ~ x + z") {code} The output should append two new columns: features and label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
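The minimal formula subset in the description ("~" and "+") is simple enough to sketch. RModelFormula is the name from the proposal, but this toy parser is an illustrative assumption written in Python, not SparkR code, and it ignores the "." operator:

```python
# Toy parser for the minimal R-formula subset "label ~ term + term".
# Illustrative only; SparkR's actual formula handling may differ.
def parse_formula(formula):
    label, rhs = (side.strip() for side in formula.split("~"))
    features = [term.strip() for term in rhs.split("+")]
    return label, features

label, features = parse_formula("y ~ x + z")  # -> "y", ["x", "z"]
```

The transformer would then assemble the listed columns into a features vector column and copy the label column, appending both to the DataFrame as the description says.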
[jira] [Updated] (SPARK-6805) ML Pipeline API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6805: - Assignee: (was: Xiangrui Meng) > ML Pipeline API in SparkR > - > > Key: SPARK-6805 > URL: https://issues.apache.org/jira/browse/SPARK-6805 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng > > SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API > in SparkR. The implementation should be similar to the pipeline API > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6805) ML Pipeline API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-6805: Assignee: Xiangrui Meng > ML Pipeline API in SparkR > - > > Key: SPARK-6805 > URL: https://issues.apache.org/jira/browse/SPARK-6805 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API > in SparkR. The implementation should be similar to the pipeline API > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8770. Resolution: Fixed Fix Version/s: 1.5.0 > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8773) Throw type mismatch in check analysis for expressions with expected input types defined
Reynold Xin created SPARK-8773: -- Summary: Throw type mismatch in check analysis for expressions with expected input types defined Key: SPARK-8773 URL: https://issues.apache.org/jira/browse/SPARK-8773 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8772) Implement implicit type cast for expressions that define expected input types
Reynold Xin created SPARK-8772: -- Summary: Implement implicit type cast for expressions that define expected input types Key: SPARK-8772 URL: https://issues.apache.org/jira/browse/SPARK-8772 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We should have an engine-wide implicit cast rule defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it
[ https://issues.apache.org/jira/browse/SPARK-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8766. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7165 [https://github.com/apache/spark/pull/7165] > DataFrame Python API should work with column which has non-ascii character in > it > > > Key: SPARK-8766 > URL: https://issues.apache.org/jira/browse/SPARK-8766 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes
[ https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6707: - Target Version/s: 1.5.0 > Mesos Scheduler should allow the user to specify constraints based on slave > attributes > -- > > Key: SPARK-6707 > URL: https://issues.apache.org/jira/browse/SPARK-6707 > Project: Spark > Issue Type: Improvement > Components: Mesos, Scheduler >Affects Versions: 1.3.0 >Reporter: Ankur Chauhan > Labels: mesos, scheduler > > Currently, the Mesos scheduler only looks at the 'cpu' and 'mem' resources > when trying to determine the usability of a resource offer from a Mesos > slave node. It may be preferable for the user to be able to ensure that the > Spark jobs are only started on a certain set of nodes (based on attributes). > For example, if the user sets a property, say > {code}spark.mesos.constraints{code}, to > {code}tachyon=true;us-east-1=false{code}, then the resource offers will be > checked to see if they meet both these constraints, and only then will they be > accepted to start new executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
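The constraint check described above can be sketched as follows. This is a plain-Python illustration of the proposed `key=value;key=value` format; `parse_constraints` and `offer_matches` are hypothetical names, not Spark's or Mesos's API:

```python
def parse_constraints(spec):
    """Parse a constraint string like 'tachyon=true;us-east-1=false'
    into a dict of attribute name -> required value."""
    constraints = {}
    for pair in spec.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            constraints[key.strip()] = value.strip()
    return constraints

def offer_matches(offer_attributes, constraints):
    """An offer is usable only if every constraint matches the slave's
    attributes; otherwise it is declined and no executor is started."""
    return all(offer_attributes.get(key) == value
               for key, value in constraints.items())
```

With the example from the issue, an offer from a slave advertising `tachyon=true` and `us-east-1=false` would be accepted, while one missing either attribute would be declined.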
[jira] [Commented] (SPARK-7079) Cache-aware external sort
[ https://issues.apache.org/jira/browse/SPARK-7079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611141#comment-14611141 ] Apache Spark commented on SPARK-7079: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6444 > Cache-aware external sort > - > > Key: SPARK-7079 > URL: https://issues.apache.org/jira/browse/SPARK-7079 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Josh Rosen > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7078) Cache-aware binary processing in-memory sort
[ https://issues.apache.org/jira/browse/SPARK-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7078: --- Assignee: Apache Spark (was: Josh Rosen) > Cache-aware binary processing in-memory sort > > > Key: SPARK-7078 > URL: https://issues.apache.org/jira/browse/SPARK-7078 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: Reynold Xin >Assignee: Apache Spark > > A cache-friendly sort algorithm that can be used eventually for: > * sort-merge join > * shuffle > See the old alpha sort paper: > http://research.microsoft.com/pubs/68249/alphasort.doc > Note that state-of-the-art for sorting has improved quite a bit, but we can > easily optimize the sorting algorithm itself later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7078) Cache-aware binary processing in-memory sort
[ https://issues.apache.org/jira/browse/SPARK-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611140#comment-14611140 ] Apache Spark commented on SPARK-7078: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6444 > Cache-aware binary processing in-memory sort > > > Key: SPARK-7078 > URL: https://issues.apache.org/jira/browse/SPARK-7078 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: Reynold Xin >Assignee: Josh Rosen > > A cache-friendly sort algorithm that can be used eventually for: > * sort-merge join > * shuffle > See the old alpha sort paper: > http://research.microsoft.com/pubs/68249/alphasort.doc > Note that state-of-the-art for sorting has improved quite a bit, but we can > easily optimize the sorting algorithm itself later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7079) Cache-aware external sort
[ https://issues.apache.org/jira/browse/SPARK-7079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7079: --- Assignee: Josh Rosen (was: Apache Spark) > Cache-aware external sort > - > > Key: SPARK-7079 > URL: https://issues.apache.org/jira/browse/SPARK-7079 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Josh Rosen > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7079) Cache-aware external sort
[ https://issues.apache.org/jira/browse/SPARK-7079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7079: --- Assignee: Apache Spark (was: Josh Rosen) > Cache-aware external sort > - > > Key: SPARK-7079 > URL: https://issues.apache.org/jira/browse/SPARK-7079 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7078) Cache-aware binary processing in-memory sort
[ https://issues.apache.org/jira/browse/SPARK-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7078: --- Assignee: Josh Rosen (was: Apache Spark) > Cache-aware binary processing in-memory sort > > > Key: SPARK-7078 > URL: https://issues.apache.org/jira/browse/SPARK-7078 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: Reynold Xin >Assignee: Josh Rosen > > A cache-friendly sort algorithm that can be used eventually for: > * sort-merge join > * shuffle > See the old alpha sort paper: > http://research.microsoft.com/pubs/68249/alphasort.doc > Note that state-of-the-art for sorting has improved quite a bit, but we can > easily optimize the sorting algorithm itself later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1460#comment-1460 ] Apache Spark commented on SPARK-8771: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/7172 > Actor system deprecation tag uses deprecated deprecation tag > > > Key: SPARK-8771 > URL: https://issues.apache.org/jira/browse/SPARK-8771 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Trivial > > The deprecation of the actor system adds a spurious build warning: > {quote} > @deprecated now takes two arguments; see the scaladoc. > [warn] @deprecated("Actor system is no longer supported as of 1.4") > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8771: --- Assignee: (was: Apache Spark) > Actor system deprecation tag uses deprecated deprecation tag > > > Key: SPARK-8771 > URL: https://issues.apache.org/jira/browse/SPARK-8771 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Trivial > > The deprecation of the actor system adds a spurious build warning: > {quote} > @deprecated now takes two arguments; see the scaladoc. > [warn] @deprecated("Actor system is no longer supported as of 1.4") > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8771: --- Assignee: Apache Spark > Actor system deprecation tag uses deprecated deprecation tag > > > Key: SPARK-8771 > URL: https://issues.apache.org/jira/browse/SPARK-8771 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > The deprecation of the actor system adds a spurious build warning: > {quote} > @deprecated now takes two arguments; see the scaladoc. > [warn] @deprecated("Actor system is no longer supported as of 1.4") > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611107#comment-14611107 ] Apache Spark commented on SPARK-8769: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/7171 > toLocalIterator should mention it results in many jobs > -- > > Key: SPARK-8769 > URL: https://issues.apache.org/jira/browse/SPARK-8769 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: holdenk >Priority: Trivial > > toLocalIterator on RDDs should mention that it results in multiple jobs, and > that to avoid re-computing, if the input was the result of a > wide transformation, the input should be cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8769: --- Assignee: (was: Apache Spark) > toLocalIterator should mention it results in many jobs > -- > > Key: SPARK-8769 > URL: https://issues.apache.org/jira/browse/SPARK-8769 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: holdenk >Priority: Trivial > > toLocalIterator on RDDs should mention that it results in multiple jobs, and > that to avoid re-computing, if the input was the result of a > wide transformation, the input should be cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8769: --- Assignee: Apache Spark > toLocalIterator should mention it results in many jobs > -- > > Key: SPARK-8769 > URL: https://issues.apache.org/jira/browse/SPARK-8769 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > toLocalIterator on RDDs should mention that it results in multiple jobs, and > that to avoid re-computing, if the input was the result of a > wide transformation, the input should be cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8770: --- Assignee: Reynold Xin (was: Apache Spark) > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8770: --- Assignee: Apache Spark (was: Reynold Xin) > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8770: Shepherd: Michael Armbrust > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8770) BinaryOperator expression
[ https://issues.apache.org/jira/browse/SPARK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611101#comment-14611101 ] Apache Spark commented on SPARK-8770: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7170 > BinaryOperator expression > - > > Key: SPARK-8770 > URL: https://issues.apache.org/jira/browse/SPARK-8770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Our current BinaryExpression abstract class is not for generic binary > expressions, i.e. it requires left/right children to have the same type. > However, due to its name, contributors build new binary expressions that > don't have that assumption (e.g. Sha) and still extend BinaryExpression. > We should create a new BinaryOperator abstract class with this assumption, > and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8770) BinaryOperator expression
Reynold Xin created SPARK-8770: -- Summary: BinaryOperator expression Key: SPARK-8770 URL: https://issues.apache.org/jira/browse/SPARK-8770 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression. We should create a new BinaryOperator abstract class with this assumption, and update the analyzer to only apply type casting rule there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
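The same-type requirement described above can be illustrated with a minimal sketch. The class names mirror the proposal but this is plain Python, not Spark's Catalyst code:

```python
class Expression:
    """Minimal stand-in for a Catalyst expression with a data type."""
    def __init__(self, data_type):
        self.data_type = data_type

class BinaryOperator(Expression):
    """Sketch of the proposed abstract class. Unlike a generic binary
    expression (e.g. Sha, whose children may differ in type), an operator
    requires both children to share one type; this shared-type contract is
    also where the analyzer's implicit type-casting rule would apply."""
    def __init__(self, left, right):
        if left.data_type != right.data_type:
            raise TypeError(
                "BinaryOperator children must have the same type: %s vs %s"
                % (left.data_type, right.data_type))
        super().__init__(left.data_type)
        self.left, self.right = left, right
```

The point of the split is that the analyzer can enforce (and cast toward) the shared type only for `BinaryOperator` subclasses, leaving mixed-type binary expressions untouched.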
[jira] [Created] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
holdenk created SPARK-8771: -- Summary: Actor system deprecation tag uses deprecated deprecation tag Key: SPARK-8771 URL: https://issues.apache.org/jira/browse/SPARK-8771 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Trivial The deprecation of the actor system adds a spurious build warning: {quote} @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated("Actor system is no longer supported as of 1.4") {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it
[ https://issues.apache.org/jira/browse/SPARK-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8766: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-6116 > DataFrame Python API should work with column which has non-ascii character in > it > > > Key: SPARK-8766 > URL: https://issues.apache.org/jira/browse/SPARK-8766 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8769: - Component/s: Documentation > toLocalIterator should mention it results in many jobs > -- > > Key: SPARK-8769 > URL: https://issues.apache.org/jira/browse/SPARK-8769 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: holdenk >Priority: Trivial > > toLocalIterator on RDDs should mention that it results in multiple jobs, and > that to avoid re-computing, if the input was the result of a > wide transformation, the input should be cached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8766) DataFrame Python API should work with column which has non-ascii character in it
[ https://issues.apache.org/jira/browse/SPARK-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8766: - Component/s: PySpark > DataFrame Python API should work with column which has non-ascii character in > it > > > Key: SPARK-8766 > URL: https://issues.apache.org/jira/browse/SPARK-8766 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8753) Create an IntervalType data type
[ https://issues.apache.org/jira/browse/SPARK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611067#comment-14611067 ] holdenk commented on SPARK-8753: I could give this a shot if people want :) > Create an IntervalType data type > > > Key: SPARK-8753 > URL: https://issues.apache.org/jira/browse/SPARK-8753 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We should create an IntervalType data type that represents time intervals. > Internally, we can use a long value to store it, similar to Timestamp (i.e. > 100ns precision). This data type initially cannot be stored externally, but > only used for expressions. > 1. Add IntervalType data type. > 2. Add parser support in our SQL expression, in the form of > {code} > INTERVAL [number] [unit] > {code} > unit can be YEAR[S], MONTH[S], WEEK[S], DAY[S], HOUR[S], MINUTE[S], > SECOND[S], MILLISECOND[S], MICROSECOND[S], or NANOSECOND[S]. > 3. Add in the analyzer to make sure we throw some exception to prevent saving > a dataframe/table with IntervalType out to external systems. > Related Hive ticket: https://issues.apache.org/jira/browse/HIVE-9792 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
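The long-backed, 100ns-precision representation proposed above can be sketched as a unit-conversion table. This Python sketch covers only the fixed-length units: YEAR and MONTH vary in length (they need a calendar), and NANOSECOND is finer than one 100ns tick, so it would need rounding:

```python
TICKS_PER_UNIT = {  # number of 100ns ticks in one of each unit
    "MICROSECOND": 10,
    "MILLISECOND": 10_000,
    "SECOND": 10_000_000,
    "MINUTE": 600_000_000,
    "HOUR": 36_000_000_000,
    "DAY": 864_000_000_000,
    "WEEK": 6_048_000_000_000,
}

def interval_ticks(number, unit):
    """Convert 'INTERVAL [number] [unit]' into the long tick count that
    would back the IntervalType value. Accepts plural unit names
    (SECONDS, MINUTES, ...) as in the proposed parser syntax."""
    unit = unit.upper().rstrip("S")  # SECONDS -> SECOND, etc.
    return number * TICKS_PER_UNIT[unit]
```

For instance, `INTERVAL 5 MINUTES` would be stored as `5 * 600_000_000` ticks, and arithmetic between intervals reduces to long arithmetic on the tick counts.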
[jira] [Updated] (SPARK-8763) executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function
[ https://issues.apache.org/jira/browse/SPARK-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8763: - Assignee: Tomohiko K. > executing run-tests.py with Python 2.6 fails with absence of > subprocess.check_output function > - > > Key: SPARK-8763 > URL: https://issues.apache.org/jira/browse/SPARK-8763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 > Environment: Mac OS X 10.10.3 Python 2.6.9 Java 1.8.0 >Reporter: Tomohiko K. >Assignee: Tomohiko K. > Labels: pyspark, testing > Fix For: 1.5.0 > > > Running run-tests.py with Python 2.6 causes the following error: > {noformat} > Running PySpark tests. Output is in > python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log > Will test against the following Python executables: ['python2.6', > 'python3.4', 'pypy'] > Will test the following Python modules: ['pyspark-core', 'pyspark-ml', > 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] > Traceback (most recent call last): > File "./python/run-tests.py", line 196, in > main() > File "./python/run-tests.py", line 159, in main > python_implementation = subprocess.check_output( > AttributeError: 'module' object has no attribute 'check_output' > ... > {noformat} > The cause of this error is the use of the subprocess.check_output function, which only exists since Python 2.7. > (ref. > https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
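For reference, a backport along the lines of the stdlib's own 2.7 implementation can restore the missing function on 2.6. This is a sketch of that well-known pattern, not necessarily the fix the ticket adopted:

```python
import subprocess

def check_output(*popenargs, **kwargs):
    """Fallback for Python 2.6, which lacks subprocess.check_output
    (added in 2.7): run the command, return its stdout, and raise
    CalledProcessError on a non-zero exit status."""
    process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
    output, _ = process.communicate()
    retcode = process.poll()
    if retcode:
        cmd = kwargs.get("args", popenargs[0])
        raise subprocess.CalledProcessError(retcode, cmd)
    return output

# Install the fallback only when the real function is missing.
if not hasattr(subprocess, "check_output"):
    subprocess.check_output = check_output
```

Alternatively, run-tests.py could simply require Python 2.7+ for the test driver itself, independently of which interpreters it tests against.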