[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.docx Version 2 > HBase as data source to SparkSQL > > > Key: SPARK-3880 > URL: https://issues.apache.org/jira/browse/SPARK-3880 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yan >Assignee: Yan > Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.docx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281236#comment-14281236 ] Apache Spark commented on SPARK-3880: - User 'yzhou2001' has created a pull request for this issue: https://github.com/apache/spark/pull/4084 > HBase as data source to SparkSQL > > > Key: SPARK-3880 > URL: https://issues.apache.org/jira/browse/SPARK-3880 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yan >Assignee: Yan > Attachments: HBaseOnSpark.docx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
Corey J. Nolet created SPARK-5296: - Summary: Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters Key: SPARK-5296 URL: https://issues.apache.org/jira/browse/SPARK-5296 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Currently, the BaseRelation API allows a FilteredRelation to handle an Array[Filter], which represents filter expressions that are implicitly combined with AND. We should support OR operations in a BaseRelation as well. I'm not sure what this would look like in terms of API changes, but it almost seems like a FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
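One way the API change could look (a hedged sketch only; the `TreeFilteredScan` name, the single-tree `buildScan` signature, and the filter case classes below are illustrative, not an actual Spark interface):

```scala
// Illustrative filter algebra: today's Array[Filter] is an implicit AND of
// its elements, so a data source never sees an OR between predicates.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

// Roughly today's shape: filters(0) AND filters(1) AND ...
// (the real Data Sources API returns an RDD[Row]; elided here)
trait FilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Unit
}

// Sketch of an OR-capable variant: hand the source one filter *tree*,
// so ORs (and nested AND/OR combinations) survive pushdown intact.
trait TreeFilteredScan {
  def buildScan(requiredColumns: Array[String], filter: Option[Filter]): Unit
}
```

A source that cannot push a given subtree down could still fall back to returning a superset of rows and letting Spark re-apply the filter, which is how AND-only pushdown already behaves.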
[jira] [Updated] (SPARK-5295) Only expose leaf data types
[ https://issues.apache.org/jira/browse/SPARK-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5295: --- Description: 1. We expose all the stuff in data types right now, including NumericTypes, etc. These should be hidden from users. We should only expose the leaf types. 2. Remove DeveloperAPI tag from the common types. was:We expose all the stuff in data types right now, including NumericTypes, etc. These should be hidden from users. > Only expose leaf data types > --- > > Key: SPARK-5295 > URL: https://issues.apache.org/jira/browse/SPARK-5295 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > 1. We expose all the stuff in data types right now, including NumericTypes, > etc. These should be hidden from users. We should only expose the leaf types. > 2. Remove DeveloperAPI tag from the common types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5295) Stabilize data types
[ https://issues.apache.org/jira/browse/SPARK-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5295: --- Summary: Stabilize data types (was: Only expose leaf data types) > Stabilize data types > > > Key: SPARK-5295 > URL: https://issues.apache.org/jira/browse/SPARK-5295 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > 1. We expose all the stuff in data types right now, including NumericTypes, > etc. These should be hidden from users. We should only expose the leaf types. > 2. Remove DeveloperAPI tag from the common types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5295) Only expose leaf data types
Reynold Xin created SPARK-5295: -- Summary: Only expose leaf data types Key: SPARK-5295 URL: https://issues.apache.org/jira/browse/SPARK-5295 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We expose all the stuff in data types right now, including NumericTypes, etc. These should be hidden from users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5193. Resolution: Fixed Fix Version/s: 1.3.0 > Make Spark SQL API usable in Java and remove the Java-specific API > -- > > Key: SPARK-5193 > URL: https://issues.apache.org/jira/browse/SPARK-5193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Java version of the SchemaRDD API causes high maintenance burden for Spark > SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support > both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it > usable for Java, and then we can remove the Java specific version. > Things to remove include (Java version of): > - data type > - Row > - SQLContext > - HiveContext > Things to consider: > - Scala and Java have a different collection library. > - Scala and Java (8) have different closure interface. > - Scala and Java can have duplicate definitions of common classes, such as > BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) check ambiguous reference to fields in Spark SQL is incomplete
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Summary: check ambiguous reference to fields in Spark SQL is incomplete (was: ambiguous reference to fields in Spark SQL is incomplete) > check ambiguous reference to fields in Spark SQL is incomplete > --- > > Key: SPARK-5278 > URL: https://issues.apache.org/jira/browse/SPARK-5278 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > In HiveContext, > for a JSON string like > {code}{"a": {"b": 1, "B": 2}}{code} > the SQL `SELECT a.b FROM t` will report an error about an ambiguous reference to > fields. > But for a JSON string like > {code}{"a": [{"b": 1, "B": 2}]}{code} > the SQL `SELECT a[0].b FROM t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5294) Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed Stages" when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281173#comment-14281173 ] Apache Spark commented on SPARK-5294: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4083 > Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed > Stages" when they are empty > > > Key: SPARK-5294 > URL: https://issues.apache.org/jira/browse/SPARK-5294 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Related to SPARK-5228, AllStagesPage also should hide the table for > ActiveStages, CompleteStages and FailedStages when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5293: --- Description: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. For example, Spark Streaming might be used as the receiver of Akka messages - but our dependency on Akka requires the upstream Akka actors to also use the identical version of Akka. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. was: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. > Enable Spark user applications to use different versions of Akka > > > Key: SPARK-5293 > URL: https://issues.apache.org/jira/browse/SPARK-5293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Reynold Xin > > A lot of Spark user applications are using (or want to use) Akka. Akka as a > whole can contribute great architectural simplicity and uniformity. However, > because Spark depends on Akka, it is not possible for users to rely on > different versions, and we have received many requests in the past asking for > help about this specific issue. 
For example, Spark Streaming might be used as > the receiver of Akka messages - but our dependency on Akka requires the > upstream Akka actors to also use the identical version of Akka. > Since our usage of Akka is limited (mainly for RPC and single-threaded event > loop), we can replace it with alternative RPC implementations and a common > event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5293: --- Description: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. was: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and unification. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. > Enable Spark user applications to use different versions of Akka > > > Key: SPARK-5293 > URL: https://issues.apache.org/jira/browse/SPARK-5293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Reynold Xin > > A lot of Spark user applications are using (or want to use) Akka. Akka as a > whole can contribute great architectural simplicity and uniformity. However, > because Spark depends on Akka, it is not possible for users to rely on > different versions, and we have received many requests in the past asking for > help about this specific issue. > Since our usage of Akka is limited (mainly for RPC and single-threaded event > loop), we can replace it with alternative RPC implementations and a common > event loop in Spark. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5294) Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed Stages" when they are empty
Kousuke Saruta created SPARK-5294: - Summary: Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed Stages" when they are empty Key: SPARK-5294 URL: https://issues.apache.org/jira/browse/SPARK-5294 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Related to SPARK-5228, AllStagesPage also should hide the table for ActiveStages, CompleteStages and FailedStages when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5251) Using `tableIdentifier` in hive metastore
[ https://issues.apache.org/jira/browse/SPARK-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281170#comment-14281170 ] Apache Spark commented on SPARK-5251: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4062 > Using `tableIdentifier` in hive metastore > -- > > Key: SPARK-5251 > URL: https://issues.apache.org/jira/browse/SPARK-5251 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Using `tableIdentifier` in hive metastore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281168#comment-14281168 ] Xuefu Zhang commented on SPARK-1021: This problem also occurred in Hive on Spark (HIVE-9370). Could we take this forward? > sortByKey() launches a cluster job when it shouldn't > > > Key: SPARK-1021 > URL: https://issues.apache.org/jira/browse/SPARK-1021 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0 >Reporter: Andrew Ash >Assignee: Erik Erlandson > Labels: starter > > The sortByKey() method is listed as a transformation, not an action, in the > documentation. But it launches a cluster job regardless. > http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html > Some discussion on the mailing list suggested that this is a problem with the > rdd.count() call inside Partitioner.scala's rangeBounds method. > https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 > Josh Rosen suggests that rangeBounds should be made into a lazy variable: > {quote} > I wonder whether making RangePartitioner.rangeBounds into a lazy val would > fix this > (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). > We'd need to make sure that rangeBounds() is never called before an action > is performed. This could be tricky because it's called in the > RangePartitioner.equals() method. Maybe it's sufficient to just compare the > number of partitions, the ids of the RDDs used to create the > RangePartitioner, and the sort ordering. This still supports the case where > I range-partition one RDD and pass the same partitioner to a different RDD. 
> It breaks support for the case where two range partitioners created on > different RDDs happened to have the same rangeBounds(), but it seems unlikely > that this would really harm performance since it's probably unlikely that the > range partitioners are equal by chance. > {quote} > Can we please make this happen? I'll send a PR on GitHub to start the > discussion and testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
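Josh Rosen's suggestion can be sketched like this (an illustrative toy, not Spark's actual Partitioner code; every name below is invented for the example):

```scala
// Sketch: defer computing range bounds until the partitioner is actually
// used by an action, and keep equals() from forcing the computation.
class LazyRangePartitioner[K](val partitions: Int,
                              val rddId: Int,
                              sample: () => Array[K])(implicit ord: Ordering[K]) {

  // Computing range bounds samples/counts the RDD, i.e. launches a job,
  // so make it lazy: nothing runs until getPartition is first called.
  lazy val rangeBounds: Array[K] = sample()

  def getPartition(key: K): Int = {
    val i = rangeBounds.indexWhere(b => ord.lteq(key, b))
    if (i == -1) rangeBounds.length else i
  }

  // equals() must not force rangeBounds, so compare cheap identity fields
  // (partition count, source RDD id) instead. The trade-off noted in the
  // quote: partitioners built on different RDDs compare unequal even when
  // their bounds would have matched by chance.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner[_] =>
      o.partitions == partitions && o.rddId == rddId
    case _ => false
  }

  override def hashCode: Int = (partitions, rddId).hashCode
}
```

With this shape, merely constructing the partitioner (as `sortByKey()` does when building its transformation) no longer launches a job; the first action that calls `getPartition` pays the cost instead.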
[jira] [Updated] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop
[ https://issues.apache.org/jira/browse/SPARK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5214: --- Assignee: Shixiong Zhu > Add EventLoop and change DAGScheduler to an EventLoop > - > > Key: SPARK-5214 > URL: https://issues.apache.org/jira/browse/SPARK-5214 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > As per discussion in SPARK-5124, DAGScheduler can simply use a queue & event > loop to process events. It would be great when we want to decouple Akka in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
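The queue-and-event-loop idea can be sketched minimally (assumed shape only; the `EventLoop` eventually added for SPARK-5214 may differ in detail):

```scala
import java.util.concurrent.LinkedBlockingQueue

// A single daemon thread drains a blocking queue, giving the DAGScheduler
// single-threaded event handling without depending on an Akka actor.
abstract class EventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit =
      try {
        while (!stopped) {
          val event = queue.take() // blocks until an event is posted
          if (!stopped) onReceive(event)
        }
      } catch {
        case _: InterruptedException => () // interrupted by stop(); exit quietly
      }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event) // callers just enqueue and return
  def stop(): Unit = { stopped = true; thread.interrupt() }

  protected def onReceive(event: E): Unit // subclass handles one event at a time
}
```

Because all events funnel through one thread, `onReceive` implementations need no locking, which is the property the DAGScheduler relies on today via its actor.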
[jira] [Created] (SPARK-5293) Enable Spark user applications to use different versions of Akka
Reynold Xin created SPARK-5293: -- Summary: Enable Spark user applications to use different versions of Akka Key: SPARK-5293 URL: https://issues.apache.org/jira/browse/SPARK-5293 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Reynold Xin A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and unification. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5292) optimize join for table that are already sharded/support for hive bucket
gagan taneja created SPARK-5292: --- Summary: optimize join for table that are already sharded/support for hive bucket Key: SPARK-5292 URL: https://issues.apache.org/jira/browse/SPARK-5292 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.2.0 Reporter: gagan taneja Currently joins do not consider the locality of the data and perform the shuffle anyway. If the user takes the responsibility of distributing the data based on some hash, or has sharded the data, Spark joins should be able to leverage the sharding to optimize the join calculation and eliminate the shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281143#comment-14281143 ] Apache Spark commented on SPARK-5291: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4082 > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are > added. > I think it's useful if they have timestamp and the reason why an executor is > removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5291: -- Description: Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are added. I think it's useful if they have timestamp and the reason why an executor is removed. was:Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are added but I think it's useful if they have timestamp and the reason why an executor is removed. > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are > added. > I think it's useful if they have timestamp and the reason why an executor is > removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5291: -- Summary: Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved (was: Add timestamp and reason why an executor is removed) > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are > added but I think it's useful if they have timestamp and the reason why an > executor is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5291) Add timestamp and reason why an executor is removed
Kousuke Saruta created SPARK-5291: - Summary: Add timestamp and reason why an executor is removed Key: SPARK-5291 URL: https://issues.apache.org/jira/browse/SPARK-5291 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Kousuke Saruta Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are added but I think it's useful if they have timestamp and the reason why an executor is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5282: - Assignee: yuhao yang > RowMatrix easily gets int overflow in the memory size warning > - > > Key: SPARK-5282 > URL: https://issues.apache.org/jira/browse/SPARK-5282 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The warning in RowMatrix will easily get an int overflow when the number of > columns is larger than 16385. > Minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
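The overflow arithmetic, for reference: the warning presumably estimates a matrix size as something like `cols * cols * 8` bytes, and with 32-bit `Int` math that product exceeds `Int.MaxValue` exactly when `cols >= 16385`:

```scala
val cols = 16385
// 16385 * 16385 * 8 = 2147745800, just past Int.MaxValue (2147483647),
// so 32-bit arithmetic wraps around to a negative number.
val overflowed: Int = cols * cols * 8        // -2147221496
val correct: Long   = cols.toLong * cols * 8 // 2147745800 bytes, ~2.0 GB
```

Promoting the first operand to `Long` before multiplying is the usual one-character fix.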
[jira] [Commented] (SPARK-5287) NativeType.defaultSizeOf should have default sizes of all NativeTypes.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280986#comment-14280986 ] Apache Spark commented on SPARK-5287: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4081 > NativeType.defaultSizeOf should have default sizes of all NativeTypes. > -- > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Otherwise, we will fail to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280927#comment-14280927 ] Apache Spark commented on SPARK-5289: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/4079 > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thritserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. > Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 > release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5290) Executing functions in sparkSQL registered in sqlcontext gives scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection
Manoj Samel created SPARK-5290: -- Summary: Executing functions in sparkSQL registered in sqlcontext gives scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection Key: SPARK-5290 URL: https://issues.apache.org/jira/browse/SPARK-5290 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: Spark 1.2 on centos or Mac Reporter: Manoj Samel Register a function using sqlContext.registerFunction and then use that function in sparkSQL. The execution gives the following stack trace in Spark 1.2 (this works in Spark 1.1.1):
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
at scala.reflect.api.Universe.typeOf(Universe.scala:59)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
at org.apache.spark.sql.UDFRegistration$class.builder$2(UdfRegistration.scala:91)
at org.apache.spark.sql.UDFRegistration$$anonfun$registerFunction$1.apply(UdfRegistration.scala:92)
at org.apache.spark.sql.UDFRegistration$$anonfun$registerFunction$1.apply(UdfRegistration.scala:92)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:220)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:218)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:71)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:85)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:84)
at scala.
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Description: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 release. was: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported. > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. > Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 > release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Description: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. (was: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for yarn and repl). But we should go in branch 1.2 and fix this as well so that we can do a 1.2.1 release with these artifacts.) > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Summary: Backport publishing of repl, yarn into branch-1.2 (was: Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2) > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for > yarn and repl). But we should go in branch 1.2 and fix this as well so that > we can do a 1.2.1 release with these artifacts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Description: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported. was: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. > Those pieces should be backported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5289) Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2
Patrick Wendell created SPARK-5289: -- Summary: Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2 Key: SPARK-5289 URL: https://issues.apache.org/jira/browse/SPARK-5289 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for yarn and repl). But we should go in branch 1.2 and fix this as well so that we can do a 1.2.1 release with these artifacts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) NativeType.defaultSizeOf should have default sizes of all NativeTypes.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Summary: NativeType.defaultSizeOf should have default sizes of all NativeTypes. (was: NativeType.defaultSizeOf should have all data types.) > NativeType.defaultSizeOf should have default sizes of all NativeTypes. > -- > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Otherwise, we will fail to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5288) Stabilize Spark SQL data type API followup
Yin Huai created SPARK-5288: --- Summary: Stabilize Spark SQL data type API followup Key: SPARK-5288 URL: https://issues.apache.org/jira/browse/SPARK-5288 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Several issues we need to address before the 1.3 release: * Do we want to make all classes in org.apache.spark.sql.types.dataTypes.scala public? Seems we do not need to make those abstract classes public. * Seems NativeType is not a very clear and useful concept. Should we just remove it? * We need to stabilize the type hierarchy of our data types. Seems StringType and DecimalType should not be primitive types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) NativeType.defaultSizeOf should have all data types.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Description: Otherwise, we will fail to do stats estimation. (was: NativeType.all and NativeType.defaultSizeOf are missing DecimalType, BinaryType, DateType, and TimestampType. ) > NativeType.defaultSizeOf should have all data types. > > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Otherwise, we will fail to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) NativeType.defaultSizeOf should have all data types.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Summary: NativeType.defaultSizeOf should have all data types. (was: NativeType's companion object should include all native types.) > NativeType.defaultSizeOf should have all data types. > > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > NativeType.all and NativeType.defaultSizeOf are missing DecimalType, > BinaryType, DateType, and TimestampType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-733) Add documentation on use of accumulators in lazy transformation
[ https://issues.apache.org/jira/browse/SPARK-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid closed SPARK-733. -- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/4022 > Add documentation on use of accumulators in lazy transformation > --- > > Key: SPARK-733 > URL: https://issues.apache.org/jira/browse/SPARK-733 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Josh Rosen > Fix For: 1.3.0, 1.2.1 > > > Accumulator updates are side-effects of RDD computations. Unlike RDDs, > accumulators do not carry lineage that would allow them to be computed when > their values are accessed on the master. > This can lead to confusion when accumulators are used in lazy transformations > like `map`: > {code} > val acc = sc.accumulator(0) > data.map { x => acc += x; f(x) } > // Here, acc is 0 because no actions have caused the `map` to be computed. > {code} > As far as I can tell, our documentation only includes examples of using > accumulators in `foreach`, for which this problem does not occur. > This pattern of using accumulators in map() occurs in Bagel and other Spark > code found in the wild. > It might be nice to document this behavior in the accumulators section of the > Spark programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-733) Add documentation on use of accumulators in lazy transformation
[ https://issues.apache.org/jira/browse/SPARK-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-733: --- Fix Version/s: 1.2.1 1.3.0 > Add documentation on use of accumulators in lazy transformation > --- > > Key: SPARK-733 > URL: https://issues.apache.org/jira/browse/SPARK-733 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Josh Rosen > Fix For: 1.3.0, 1.2.1 > > > Accumulator updates are side-effects of RDD computations. Unlike RDDs, > accumulators do not carry lineage that would allow them to be computed when > their values are accessed on the master. > This can lead to confusion when accumulators are used in lazy transformations > like `map`: > {code} > val acc = sc.accumulator(0) > data.map { x => acc += x; f(x) } > // Here, acc is 0 because no actions have caused the `map` to be computed. > {code} > As far as I can tell, our documentation only includes examples of using > accumulators in `foreach`, for which this problem does not occur. > This pattern of using accumulators in map() occurs in Bagel and other Spark > code found in the wild. > It might be nice to document this behavior in the accumulators section of the > Spark programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
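The SPARK-733 pitfall can be sketched end to end. This assumes a live Spark 1.x `SparkContext` named `sc` (e.g. inside spark-shell); it illustrates the behavior described in the ticket and is not wording from the programming guide:

```scala
// Assumes a running Spark 1.x SparkContext `sc`.
val acc = sc.accumulator(0)
val data = sc.parallelize(1 to 10)

// Transformations are lazy: no job has run yet, so the accumulator is untouched.
val mapped = data.map { x => acc += x; x * 2 }
println(acc.value) // still 0 here

// An action forces the computation; only then do the updates arrive on the driver.
mapped.count()
println(acc.value) // 55 for 1 to 10, assuming the stage ran exactly once
```

Note that if `mapped` is recomputed (for example by a second action without caching), the accumulator is incremented again, which is part of why documentation examples tend to pair accumulators with actions like `foreach`.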
[jira] [Created] (SPARK-5287) NativeType's companion object should include all native types.
Yin Huai created SPARK-5287: --- Summary: NativeType's companion object should include all native types. Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai NativeType.all and NativeType.defaultSizeOf are missing DecimalType, BinaryType, DateType, and TimestampType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
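To make the gap SPARK-5287 reports concrete, a completed default-size table might look roughly like the following. The object name and the byte values here are illustrative guesses, not Spark's actual `NativeType.defaultSizeOf` contents:

```scala
// Hypothetical sketch only: names and sizes are illustrative, not Spark's.
object NativeTypeSizes {
  val defaultSizeOf: Map[String, Int] = Map(
    "ByteType"      -> 1,
    "ShortType"     -> 2,
    "IntegerType"   -> 4,
    "LongType"      -> 8,
    "FloatType"     -> 4,
    "DoubleType"    -> 8,
    "BooleanType"   -> 1,
    // The types the ticket reports as missing:
    "DecimalType"   -> 8,    // approximate; decimals have no fixed width
    "BinaryType"    -> 4096, // a guess for variable-length binary
    "DateType"      -> 4,
    "TimestampType" -> 8
  )

  // Stats estimation would consult such a table; a missing entry is the
  // kind of lookup failure the ticket describes.
  def sizeOf(dataType: String): Int =
    defaultSizeOf.getOrElse(dataType, sys.error(s"no default size for $dataType"))
}
```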
[jira] [Commented] (SPARK-5284) Insert into Hive throws NPE when an inner complex type field has a null value
[ https://issues.apache.org/jira/browse/SPARK-5284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280835#comment-14280835 ] Apache Spark commented on SPARK-5284: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4077 > Insert into Hive throws NPE when a inner complex type field has a null value > > > Key: SPARK-5284 > URL: https://issues.apache.org/jira/browse/SPARK-5284 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For a table like the following one, > {code} > CREATE TABLE nullValuesInInnerComplexTypes > (s struct, > innerArray:array, > innerMap: map>) > {code} > When we want to insert a row like this > {code} > Row(Row(null, null, null)) > {code} > Will get a NPE > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in > stage 0.0 (TID 1, localhost): java.lang.NullPointerException > [info]at > scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:105) > [info]at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) > [info]at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) > [info]at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) > [info]at > scala.runtime.Tuple3Zipped$$anonfun$foreach$extension$1.apply(Tuple3Zipped.scala:109) > [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) > [info]at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > [info]at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > [info]at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > [info]at > scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:107) > [info]at > 
org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:108) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105) > [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) > [info]at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) > [info]at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > [info]at org.apache.spark.scheduler.Task.run(Task.scala:64) > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [info]at java.lang.Thread.run(Thread.java:745) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) > [info] at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > [info] at scala.Option.foreach(Option.scala:236) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > [info]
[jira] [Commented] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280827#comment-14280827 ] Apache Spark commented on SPARK-5286: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4076 > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > 
[info] at > org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info]
[jira] [Updated] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5286: Summary: Fail to drop an invalid table when using the data source API (was: Fail to drop a invalid table when using the data source API) > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > [info] at > 
org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > 
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalates
[jira] [Created] (SPARK-5286) Fail to drop a invalid table when using the data source API
Yin Huai created SPARK-5286: --- Summary: Fail to drop a invalid table when using the data source API Key: SPARK-5286 URL: https://issues.apache.org/jira/browse/SPARK-5286 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical Example {code} CREATE TABLE jsonTable USING org.apache.spark.sql.json.DefaultSource OPTIONS ( path 'it is not a path at all!' ) DROP TABLE jsonTable {code} We will get {code} [info] com.google.common.util.concurrent.UncheckedExecutionException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! [info] at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) [info] at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) [info] at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) [info] at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) [info] at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) [info] at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) [info] at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) [info] at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) [info] at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) [info] at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) [info] at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) [info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) [info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) [info] at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) [info] at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at
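The trace shows where the drop goes wrong: `DropTable.run` calls `SQLContext.table`, which resolves the relation through `HiveMetastoreCatalog.lookupRelation` and therefore touches the (invalid) data source before anything is dropped. The shape of a fix is to remove the metadata entry without resolving it first. A minimal sketch with toy stand-ins (these are NOT Spark's actual catalog classes):

```scala
import scala.util.Try

// Toy stand-in: a catalog entry whose load() fails when the underlying
// data source is invalid, e.g. a nonexistent path.
final case class CatalogEntry(name: String, load: () => Unit)

final class ToyCatalog {
  private var tables = Map.empty[String, CatalogEntry]
  def register(e: CatalogEntry): Unit = tables += (e.name -> e)
  def contains(name: String): Boolean = tables.contains(name)

  // Mirrors the reported behaviour: DROP resolves the relation first,
  // so an invalid source makes the drop itself fail.
  def dropResolvingFirst(name: String): Try[Unit] =
    Try { tables(name).load(); tables -= name }

  // Sketch of a fix: remove the metadata entry without touching the source.
  def dropWithoutResolving(name: String): Try[Unit] =
    Try { tables -= name }
}
```

With an entry whose `load` throws (standing in for the `InvalidInputException` above), `dropResolvingFirst` fails and leaves the table behind, while `dropWithoutResolving` succeeds.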
[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ] Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:52 PM: bq. you can make the change and create a pull request. I'd love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. was (Author: sonixbp): bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4502: --- Summary: Spark SQL reads unnecessary nested fields from Parquet (was: Spark SQL reads unnecessary fields from Parquet) > Spark SQL reads unnecessary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Priority: Critical > > When reading a field of a nested column from Parquet, SparkSQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade the performance significantly when a nested column has many fields. > For example, I loaded json tweets data into SparkSQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for Tweets schema, > see: https://dev.twitter.com/overview/api/tweets), here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
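The pruning the report asks for amounts to mapping a requested field path such as `User.contributors_enabled` to the single Parquet leaf column it needs, instead of assembling every leaf under `User`. A self-contained sketch of that projection over a toy schema tree (this is illustrative only, not Spark's actual Parquet integration):

```scala
// A nested schema as a tree of groups and leaf columns.
sealed trait Field
final case class Leaf(name: String) extends Field
final case class Group(name: String, children: Seq[Field]) extends Field

def fieldName(f: Field): String = f match {
  case Leaf(n)     => n
  case Group(n, _) => n
}

// Leaf columns needed for a field path. An empty path means
// "everything under this node" (the current, wasteful behaviour);
// a non-empty path descends into the one matching child.
def leavesUnder(f: Field, path: List[String]): Seq[String] = (f, path) match {
  case (Leaf(n), Nil)            => Seq(n)
  case (Group(n, cs), Nil)       => cs.flatMap(c => leavesUnder(c, Nil).map(l => s"$n.$l"))
  case (Group(n, cs), p :: rest) =>
    cs.filter(fieldName(_) == p).flatMap(c => leavesUnder(c, rest).map(l => s"$n.$l"))
  case (Leaf(_), _ :: _)         => Seq.empty
}
```

For a struct with many primitive fields, selecting one field path yields a single leaf column, which is exactly the read the Parquet format supports but the reported code path does not exploit.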
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ] Corey J. Nolet commented on SPARK-5260: --- bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted, we're quite different in Accumulo- we require fix versions for each ticket. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ] Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:48 PM: bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. was (Author: sonixbp): bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted, we're quite different in Accumulo- we require fix versions for each ticket. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5260: --- Fix Version/s: (was: 1.3.0) > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
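For reference, the kind of helper being proposed can be sketched in a few lines over already-parsed JSON. This is a hedged illustration, not Spark's actual `JsonRDD.allKeysWithValueTypes` (whose internal representation differs), and `JsonSchema` is only the reporter's suggested name for the exposed object:

```scala
// Walk parsed JSON (nested Maps) and collect every key path together
// with the runtime type of its value, as a basis for schema inference.
def allKeysWithValueTypes(json: Map[String, Any], prefix: String = ""): Set[(String, String)] =
  json.toSeq.flatMap {
    case (k, v: Map[_, _]) =>
      // Recurse into nested objects, recording the object key itself too.
      allKeysWithValueTypes(v.asInstanceOf[Map[String, Any]], s"$prefix$k.").toSeq :+
        (s"$prefix$k" -> "object")
    case (k, null) => Seq(s"$prefix$k" -> "null")
    case (k, v)    => Seq(s"$prefix$k" -> v.getClass.getSimpleName)
  }.toSet
```

Given `Map("a" -> 1, "b" -> Map("c" -> "x"))`, this yields the key paths `a`, `b`, and `b.c` with their value types, which is the information needed to merge per-record schemas into one.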
[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5270: --- Target Version/s: 1.3.0 > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
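The efficiency argument can be made concrete with a toy partitioned collection: `count()` must touch every partition, while `take(1)` can stop at the first non-empty one, so an `isEmpty` built on `take(1)` is cheap. (Spark later added `RDD.isEmpty` along these lines; the class below is a stand-in so the idea runs without a SparkContext, not Spark code.)

```scala
// Toy stand-in for a partitioned RDD.
final class ToyRDD[T](partitions: Seq[Seq[T]]) {
  // count() forces every partition.
  def count(): Long = partitions.iterator.map(_.size.toLong).sum

  // take(n) stops as soon as n elements have been seen.
  def take(n: Int): Seq[T] = partitions.iterator.flatten.take(n).toSeq

  // The requested check: cheap, and no exception handling needed
  // (unlike the first()-and-catch workaround).
  def isEmpty: Boolean = take(1).isEmpty
}
```

In a streaming job this lets each batch be skipped with a constant-cost check instead of a full `count()` over data that may be huge.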
[jira] [Resolved] (SPARK-4357) Modify release publishing to work with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4357. Resolution: Fixed Sorry this is actually working now. We now publish artifacts for Scala 2.11. It was fixed a while back. > Modify release publishing to work with Scala 2.11 > - > > Key: SPARK-4357 > URL: https://issues.apache.org/jira/browse/SPARK-4357 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > We'll need to do some effort to make our publishing work with 2.11 since the > current pipeline assumes a single set of artifacts is published. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5285) Removed GroupExpression in catalyst
[ https://issues.apache.org/jira/browse/SPARK-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280767#comment-14280767 ] Apache Spark commented on SPARK-5285: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4075 > Removed GroupExpression in catalyst > > > Key: SPARK-5285 > URL: https://issues.apache.org/jira/browse/SPARK-5285 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Removed GroupExpression in catalyst -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5285) Removed GroupExpression in catalyst
wangfei created SPARK-5285: -- Summary: Removed GroupExpression in catalyst Key: SPARK-5285 URL: https://issues.apache.org/jira/browse/SPARK-5285 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Removed GroupExpression in catalyst -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5284) Insert into Hive throws NPE when an inner complex type field has a null value
Yin Huai created SPARK-5284: --- Summary: Insert into Hive throws NPE when an inner complex type field has a null value Key: SPARK-5284 URL: https://issues.apache.org/jira/browse/SPARK-5284 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai For a table like the following one, {code} CREATE TABLE nullValuesInInnerComplexTypes (s struct<innerStruct: struct<s1:string>, innerArray:array<int>, innerMap: map<string, int>>) {code} When we want to insert a row like this {code} Row(Row(null, null, null)) {code} Will get an NPE {code} [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.NullPointerException [info] at scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:105) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) [info] at scala.runtime.Tuple3Zipped$$anonfun$foreach$extension$1.apply(Tuple3Zipped.scala:109) [info] at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info] at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:54) [info] at scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:107) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:108) [info] at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105) [info] at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info] at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) [info] at org.apache.spark.scheduler.Task.run(Task.scala:64) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) [info] [info] Driver stacktrace: [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) [info] at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) [info] at scala.Option.foreach(Option.scala:236) [info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399) [info] at akka.actor.Actor$class.aroundReceive(Actor.scala:465) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360) [info] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) [info] at akka.actor.ActorCell.invoke(ActorCell.scala:487) [info] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) [info] at akka.dispatch.Mailbox.run(Mailbox.scala:220) [info] at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) [info]
[jira] [Commented] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280731#comment-14280731 ] Andrew Musselman commented on SPARK-4259: - Thinking of picking this up; has there been any work on this already? > Add Spectral Clustering Algorithm with Gaussian Similarity Function > --- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, spectral clustering has become one of the most popular > modern clustering algorithms. It is simple to implement, can be solved > efficiently by standard linear algebra software, and very often outperforms > traditional clustering algorithms such as the k-means algorithm. > We implemented the unnormalized graph Laplacian matrix using a Gaussian similarity > function. A brief design is sketched below: > Unnormalized spectral clustering > Input: raw data points, number k of clusters to construct: > • Compute the similarity matrix S ∈ Rn×n. > • Construct a similarity graph. Let W be its weighted adjacency matrix. > • Compute the unnormalized Laplacian L = D - W, where D is the degree > diagonal matrix. > • Compute the first k eigenvectors u1, . . . , uk of L. > • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. > • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th > row of U. > • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into > clusters C1, . . . , Ck. > Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
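The first steps of the outlined design (Gaussian similarity matrix W, degree matrix D, unnormalized Laplacian L = D - W) can be sketched self-containedly on plain arrays; the eigen-decomposition and k-means steps are omitted here because they need a linear-algebra library, and this is not the proposed MLlib implementation:

```scala
// W(i)(j) = exp(-||p_i - p_j||^2 / (2 * sigma^2)), the Gaussian kernel.
def gaussianSimilarity(points: Array[Array[Double]], sigma: Double): Array[Array[Double]] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  points.map(p => points.map(q => math.exp(-sqDist(p, q) / (2 * sigma * sigma))))
}

// L = D - W, where D is the diagonal degree matrix of row sums of W.
def unnormalizedLaplacian(w: Array[Array[Double]]): Array[Array[Double]] = {
  val degrees = w.map(_.sum)
  w.indices.toArray.map { i =>
    w.indices.toArray.map { j =>
      (if (i == j) degrees(i) else 0.0) - w(i)(j)
    }
  }
}
```

A useful sanity check on any implementation: every row of L sums to zero, identical points have similarity 1, and far-apart points have similarity near 0.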
[jira] [Commented] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280705#comment-14280705 ] Tyler commented on SPARK-4520: -- No rush. Just interested. I figured my problem was something along the lines of: schema != my custom serializer != the spark deserializer But it looks like the problems may lie with the spark deserializer more than my own serialization. > SparkSQL exception when reading certain columns from a parquet file > --- > > Key: SPARK-4520 > URL: https://issues.apache.org/jira/browse/SPARK-4520 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sadhan sood >Assignee: sadhan sood >Priority: Critical > Attachments: part-r-0.parquet > > > I am seeing this issue with spark sql throwing an exception when trying to > read selective columns from a thrift parquet file and also when caching them. > On some further digging, I was able to narrow it down to at least one > particular column type: map<string, array<string>> to be causing this issue. 
To > reproduce this I created a test thrift file with a very basic schema and > stored some sample data in a parquet file: > Test.thrift > === > {code} > typedef binary SomeId > enum SomeExclusionCause { > WHITELIST = 1, > HAS_PURCHASE = 2, > } > struct SampleThriftObject { > 10: string col_a; > 20: string col_b; > 30: string col_c; > 40: optional map> col_d; > } > {code} > = > And loading the data in spark through schemaRDD: > {code} > import org.apache.spark.sql.SchemaRDD > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > val parquetFile = "/path/to/generated/parquet/file" > val parquetFileRDD = sqlContext.parquetFile(parquetFile) > parquetFileRDD.printSchema > root > |-- col_a: string (nullable = true) > |-- col_b: string (nullable = true) > |-- col_c: string (nullable = true) > |-- col_d: map (nullable = true) > ||-- key: string > ||-- value: array (valueContainsNull = true) > |||-- element: string (containsNull = false) > parquetFileRDD.registerTempTable("test") > sqlContext.cacheTable("test") > sqlContext.sql("select col_a from test").collect() <-- see the exception > stack here > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at jav
[jira] [Commented] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280698#comment-14280698 ] sadhan sood commented on SPARK-4520: Tyler, Alex - the problem is not with parquet but how we are reading the parquet columns. Just wanted to make sure that you are seeing this problem with thrift-generated parquet files as well? I am going to submit my fix this weekend now that I have some availability; my apologies for the delay. > SparkSQL exception when reading certain columns from a parquet file > --- > > Key: SPARK-4520 > URL: https://issues.apache.org/jira/browse/SPARK-4520 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sadhan sood >Assignee: sadhan sood >Priority: Critical > Attachments: part-r-0.parquet > > > I am seeing this issue with spark sql throwing an exception when trying to > read selective columns from a thrift parquet file and also when caching them. > On some further digging, I was able to narrow it down to at least one > particular column type: map<string, array<string>> to be causing this issue. 
To > reproduce this I created a test thrift file with a very basic schema and > stored some sample data in a parquet file: > Test.thrift > === > {code} > typedef binary SomeId > enum SomeExclusionCause { > WHITELIST = 1, > HAS_PURCHASE = 2, > } > struct SampleThriftObject { > 10: string col_a; > 20: string col_b; > 30: string col_c; > 40: optional map> col_d; > } > {code} > = > And loading the data in spark through schemaRDD: > {code} > import org.apache.spark.sql.SchemaRDD > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > val parquetFile = "/path/to/generated/parquet/file" > val parquetFileRDD = sqlContext.parquetFile(parquetFile) > parquetFileRDD.printSchema > root > |-- col_a: string (nullable = true) > |-- col_b: string (nullable = true) > |-- col_c: string (nullable = true) > |-- col_d: map (nullable = true) > ||-- key: string > ||-- value: array (valueContainsNull = true) > |||-- element: string (containsNull = false) > parquetFileRDD.registerTempTable("test") > sqlContext.cacheTable("test") > sqlContext.sql("select col_a from test").collect() <-- see the exception > stack here > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.Th
[jira] [Reopened] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-3726: -- Assignee: Manoj Kumar (was: Manish Amde) This wasn't really fixed actually; my mistake. (The option is overridden for RandomForest.) > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > Fix For: 1.2.0 > > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3726. -- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Manish Amde (was: Manoj Kumar) Implemented in PR https://github.com/apache/spark/pull/2607 > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manish Amde >Priority: Minor > Fix For: 1.2.0 > > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280663#comment-14280663 ] Joseph K. Bradley commented on SPARK-3726: -- IMO I think it should be closed. I'll get someone to fix the JIRA-PR links/tags. Sorry for the wasted effort! > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280658#comment-14280658 ] Manoj Kumar commented on SPARK-3726: Ah I see. I had my doubts when I started looking at the code, but was in a hurry to send a Pull Request. So this can be closed? > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280651#comment-14280651 ] Joseph K. Bradley commented on SPARK-3726: -- Sorry! I had forgotten that this was really solved by [https://github.com/apache/spark/commit/8602195510f5821b37746bb7fa24902f43a1bd93]! That commit added subsamplingRate. Thinking more about this, I'm not sure if sampling without replacement is needed (or useful, since it is more expensive and makes for less randomness in the bootstrapped samples). Users can currently set subsamplingRate via Strategy, and I don't think it needs to be added to the train* methods. Let me know if you have a good use case for subsampling without replacement. Thanks! > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
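For context on the two sampling modes being discussed, the difference can be sketched in plain Scala. This is an illustration of the general technique only, not Spark's actual BaggedPoint code; the function names are made up:

```scala
import scala.util.Random

// Sampling WITH replacement at a given rate: each point appears in a bagged
// sample a Poisson(rate)-distributed number of times (Knuth's algorithm).
def withReplacementCount(rate: Double, rng: Random): Int = {
  val limit = math.exp(-rate)
  var k = 0
  var p = 1.0
  do {
    k += 1
    p *= rng.nextDouble()
  } while (p > limit)
  k - 1
}

// Sampling WITHOUT replacement: each point appears at most once,
// kept with probability `rate`.
def withoutReplacementCount(rate: Double, rng: Random): Int =
  if (rng.nextDouble() < rate) 1 else 0
```

With rate = 1.0, sampling with replacement still leaves roughly e^-1 (about 37%) of points out of each sample, which is what makes bootstrap samples differ from each other; without replacement at rate 1.0 every point is kept, which is why it gives less randomness, as noted above.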
[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280604#comment-14280604 ] Joseph K. Bradley commented on SPARK-4766: -- That is a good point, but I'll put it in another JIRA as a separate issue: [https://issues.apache.org/jira/browse/SPARK-5283] > ML Estimator Params should subclass Transformer Params > -- > > Key: SPARK-4766 > URL: https://issues.apache.org/jira/browse/SPARK-4766 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > Currently, in spark.ml, both Transformers and Estimators extend the same > Params classes. There should be one Params class for the Transformer and one > for the Estimator, where the Estimator params class extends the Transformer > one. > E.g., it is weird to be able to do: > {code} > val model: LogisticRegressionModel = ... > model.getMaxIter() > {code} > (This is the only case where this happens currently, but it is worth setting > a precedent.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5283) ML sharedParams should be public
Joseph K. Bradley created SPARK-5283: Summary: ML sharedParams should be public Key: SPARK-5283 URL: https://issues.apache.org/jira/browse/SPARK-5283 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley The many shared Params implemented in sharedParams.scala should be made public.
Pros:
* Easier for developers of outside packages
* Standardized parameter and input/output column names
Cons:
* None? Except that we'd need to make sure that the APIs are good enough
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-5231) History Server shows wrong job submission time.
[ https://issues.apache.org/jira/browse/SPARK-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5231. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Kousuke Saruta > History Server shows wrong job submission time. > --- > > Key: SPARK-5231 > URL: https://issues.apache.org/jira/browse/SPARK-5231 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > History Server doesn't show the correct job submission time. > This is because JobProgressListener updates the job submission time every time the > onJobStart method is invoked from ReplayListenerBus. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280571#comment-14280571 ] Peter Rudenko commented on SPARK-4766: -- Also make the traits that extend Params public. Here's a use case: I want to make custom transformers that take several columns and produce a single one:
{code}
trait HasMultipleInputColumns extends Params {
  val inputColumns: Param[Seq[String]] =
    new Param(this, "input columns", "names of input columns")

  def getInputCols: Seq[String] = paramMap(inputColumns)
}

/*
 * Takes col1, col2, ... and produces a column "features" -> Vector(col1, col2)
 */
class LRFeatureListTransformer extends Transformer
  with HasMultipleInputColumns with HasOutputColumn

/*
 * Takes col1, col2, ... and produces a column "features" -> Vector(col1 + col2 + ...)
 */
class SumFeatureListTransformer extends Transformer
  with HasMultipleInputColumns with HasOutputColumn
{code}
I can't import the HasOutputColumn trait, because it's private to the ml package. > ML Estimator Params should subclass Transformer Params > -- > > Key: SPARK-4766 > URL: https://issues.apache.org/jira/browse/SPARK-4766 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > Currently, in spark.ml, both Transformers and Estimators extend the same > Params classes. There should be one Params class for the Transformer and one > for the Estimator, where the Estimator params class extends the Transformer > one. > E.g., it is weird to be able to do: > {code} > val model: LogisticRegressionModel = ... > model.getMaxIter() > {code} > (This is the only case where this happens currently, but it is worth setting > a precedent.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
[ https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5201: - Affects Version/s: (was: 1.2.0) 1.0.0 > ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing > with inclusive range > -- > > Key: SPARK-5201 > URL: https://issues.apache.org/jira/browse/SPARK-5201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Ye Xianjin >Assignee: Ye Xianjin > Labels: rdd > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 2h > Remaining Estimate: 2h > > {code} > sc.makeRDD(1 to (Int.MaxValue)).count // result = 0 > sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = > Int.MaxValue - 1 > sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = > Int.MaxValue - 1 > {code} > More details on the discussion https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
[ https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5201. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Ye Xianjin Target Version/s: 1.3.0, 1.2.1 (was: 1.0.2, 1.1.1, 1.2.0) > ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing > with inclusive range > -- > > Key: SPARK-5201 > URL: https://issues.apache.org/jira/browse/SPARK-5201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Ye Xianjin >Assignee: Ye Xianjin > Labels: rdd > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 2h > Remaining Estimate: 2h > >
> {code}
> sc.makeRDD(1 to Int.MaxValue).count        // result = 0
> sc.makeRDD(1 to (Int.MaxValue - 1)).count  // result = 2147483646 = Int.MaxValue - 1
> sc.makeRDD(1 until Int.MaxValue).count     // result = 2147483646 = Int.MaxValue - 1
> {code}
> More details on the discussion: https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
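The root cause is ordinary Int overflow when the inclusive range's bound is converted to an exclusive one. A simplified illustration in plain Scala (not Spark's actual slice code):

```scala
// An inclusive range ending at Int.MaxValue cannot be converted to an
// exclusive range with an Int bound: end + 1 wraps around to Int.MinValue.
val r = 1 to Int.MaxValue
val exclusiveEnd = r.end + 1          // Int overflow: Int.MaxValue + 1 == Int.MinValue
val asExclusive = r.start until exclusiveEnd

// The start (1) is already past the wrapped-around end, so the exclusive
// range is empty -- which is why every slice, and hence count, came out as 0.
val sliceIsEmpty = asExclusive.isEmpty
```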
[jira] [Closed] (SPARK-1507) Spark on Yarn: Add support for user to specify # cores for ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-1507. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: WangTaoTheTonic Target Version/s: 1.3.0 > Spark on Yarn: Add support for user to specify # cores for ApplicationMaster > > > Key: SPARK-1507 > URL: https://issues.apache.org/jira/browse/SPARK-1507 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: WangTaoTheTonic > Fix For: 1.3.0 > > > Now that Hadoop 2.x can schedule cores as a resource we should allow the user > to specify the # of cores for the ApplicationMaster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280491#comment-14280491 ] Apache Spark commented on SPARK-5270: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4074 > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280486#comment-14280486 ] Apache Spark commented on SPARK-3726: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/4073 > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280260#comment-14280260 ] Al M commented on SPARK-5270: - I don't mind at all. I'd be really happy to have such a utility method in Spark. > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280173#comment-14280173 ] Apache Spark commented on SPARK-4630: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/4070 > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > >
> Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks: larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to GC pressure and spilling to disk.
> To increase the performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are:
> * incoming partition sizes from the previous stage
> * number of available executors
> * available memory per executor (taking into account spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned on/off with a configuration option.
> To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. 
Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
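The kind of heuristic the proposal describes can be sketched in a few lines. The 64 MB target and the function itself are illustrative assumptions for this digest, not part of the JIRA or its design doc:

```scala
// Pick a task count for the next stage from the byte sizes of the incoming
// partitions, aiming for roughly `targetBytes` of input per task.
// BigInt is used for the sum so the total itself cannot overflow.
def suggestedNumTasks(incomingPartitionBytes: Seq[Long],
                      targetBytes: Long = 64L * 1024 * 1024): Int = {
  val totalBytes = incomingPartitionBytes.map(BigInt(_)).sum
  if (totalBytes == 0) 1
  else ((totalBytes + targetBytes - 1) / targetBytes).toInt.max(1) // ceiling division
}
```

The real scheduler would also need to cap the result by cluster parallelism and fold in executor memory, which is exactly why a design doc is warranted.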
[jira] [Commented] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280167#comment-14280167 ] Apache Spark commented on SPARK-5282: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/4069 > RowMatrix easily gets int overflow in the memory size warning > - > > Key: SPARK-5282 > URL: https://issues.apache.org/jira/browse/SPARK-5282 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The warning in the RowMatrix will easily get int overflow when the cols is > larger than 16385. > minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280159#comment-14280159 ] yuhao yang commented on SPARK-5282: --- typical wrong message: Row matrix: 17000 cloumns will require at least -1982967296 bytes of memory! PR on the way. > RowMatrix easily gets int overflow in the memory size warning > - > > Key: SPARK-5282 > URL: https://issues.apache.org/jira/browse/SPARK-5282 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The warning in the RowMatrix will easily get int overflow when the cols is > larger than 16385. > minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
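The reported figure is exactly what 32-bit wraparound predicts. A quick check in plain Scala (this is just the arithmetic, not the RowMatrix code itself):

```scala
val cols = 17000
// The memory estimate is roughly cols * cols * 8 bytes (one Double per entry).
val intEstimate  = cols * cols * 8        // Int arithmetic wraps around to a negative number
val longEstimate = cols.toLong * cols * 8 // 2312000000 bytes, about 2.3 GB
```

2,312,000,000 exceeds Int.MaxValue (2,147,483,647), so the Int result wraps to -1,982,967,296 -- the exact number in the warning. Promoting one operand to Long before multiplying fixes it.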
[jira] [Created] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
yuhao yang created SPARK-5282: - Summary: RowMatrix easily gets int overflow in the memory size warning Key: SPARK-5282 URL: https://issues.apache.org/jira/browse/SPARK-5282 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Trivial The warning in RowMatrix easily gets int overflow when the number of columns is larger than 16385. Minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sarsol updated SPARK-5281: -- Component/s: SQL > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sarsol >Priority: Critical > > Application crashes on this line rdd.registerTempTable("temp") in 1.2 > version when using sbt or Eclipse SCALA IDE > Stacktrace > Exception in thread "main" scala.reflect.internal.MissingRequirementError: > class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > primordial classloader with boot classpath > [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program > Files\Java\jre7\lib\resources.jar;C:\Program > Files\Java\jre7\lib\rt.jar;C:\Program > Files\Java\jre7\lib\sunrsasign.jar;C:\Program > Files\Java\jre7\lib\jsse.jar;C:\Program > Files\Java\jre7\lib\jce.jar;C:\Program > Files\Java\jre7\lib\charsets.jar;C:\Program > Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
> at > scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) > at > scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) > at > scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) > at > org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) > at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) > at scala.reflect.api.Universe.typeOf(Universe.scala:59) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) > at > org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) > at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) > at > com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) > at scala.Function0$class.apply$mcV$sp(Function0.scala:40) > at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.App$class.main(App.scala:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280145#comment-14280145 ] Sean Owen commented on SPARK-5270: -- I think it would be nice to have a utility method like this indeed since it can wrap up all these options. Check for 0 partitions then check for first element. Mind if I make a PR? > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
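Sean's outline could look roughly like this. This is a sketch against the public RDD API of the time, not the method Spark eventually shipped, which may differ:

```scala
import org.apache.spark.rdd.RDD

// An RDD is empty when it has no partitions at all, or when asking for a
// single element returns nothing. take(1) only scans partitions until it
// finds one element, so this avoids the full pass that count() makes.
def isEmpty[T](rdd: RDD[T]): Boolean =
  rdd.partitions.length == 0 || rdd.take(1).isEmpty
```

Checking the partition count first is a cheap driver-side short-circuit: it avoids launching any job at all for an RDD that was built with zero partitions.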
[jira] [Updated] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sarsol updated SPARK-5281: -- Priority: Critical (was: Major) > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: sarsol >Priority: Critical > > Application crashes on this line rdd.registerTempTable("temp") in 1.2 > version when using sbt or Eclipse SCALA IDE > Stacktrace > Exception in thread "main" scala.reflect.internal.MissingRequirementError: > class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > primordial classloader with boot classpath > [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program > Files\Java\jre7\lib\resources.jar;C:\Program > Files\Java\jre7\lib\rt.jar;C:\Program > Files\Java\jre7\lib\sunrsasign.jar;C:\Program > Files\Java\jre7\lib\jsse.jar;C:\Program > Files\Java\jre7\lib\jce.jar;C:\Program > Files\Java\jre7\lib\charsets.jar;C:\Program > Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
> at > scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) > at > scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) > at > scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) > at > org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) > at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) > at scala.reflect.api.Universe.typeOf(Universe.scala:59) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) > at > org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) > at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) > at > com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) > at scala.Function0$class.apply$mcV$sp(Function0.scala:40) > at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.App$class.main(App.scala:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5234. - fixed > examples for ml don't have sparkContext.stop > > > Key: SPARK-5234 > URL: https://issues.apache.org/jira/browse/SPARK-5234 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 > Environment: all >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Not sure why sc.stop() is not in the > org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, > SimpleTextClassificationPipeline}. > I can prepare a PR if it's not intentional to omit the call to stop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
sarsol created SPARK-5281:
-

Summary: Registering table on RDD is giving MissingRequirementError
Key: SPARK-5281
URL: https://issues.apache.org/jira/browse/SPARK-5281
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: sarsol

The application crashes on the line rdd.registerTempTable("temp") in version 1.2 when run with sbt or the Eclipse Scala IDE.

Stack trace:

Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
	at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
	at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
	at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
	at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
	at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
	at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
	at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
	at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
	at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
	at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
	at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
	at scala.reflect.api.Universe.typeOf(Universe.scala:59)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
	at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
	at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
	at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:71)
	at scala.App$$anonfun$main$1.apply(App.scala:71)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
	at scala.App$class.main(App.scala:71)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
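The boot classpath in the error points at the IDE's plugin jars rather than a normal application classpath, which is how scala-reflect fails to find a class that is clearly present. A commonly suggested workaround for this class of MissingRequirementError under sbt (an assumption on my part; the report itself does not state a fix) is to run the application in a forked JVM and to pin the Scala version Spark 1.2 was built against:

```scala
// build.sbt -- hedged workaround sketch, not taken from the report.
// Forking gives the program an ordinary JVM classpath instead of sbt's
// layered classloaders, which runtime reflection handles poorly.
fork := true

// Spark 1.2.x is built against Scala 2.10, so a mismatched project
// scalaVersion can produce similar reflection failures.
scalaVersion := "2.10.4"
```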
[jira] [Commented] (SPARK-4357) Modify release publishing to work with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280037#comment-14280037 ]

François Garillot commented on SPARK-4357:
--

Scala 2.11.5 [has been released|http://scala-lang.org/news/2.11.5]. What would be the next step, and how can we help with this?

> Modify release publishing to work with Scala 2.11
> -
>
> Key: SPARK-4357
> URL: https://issues.apache.org/jira/browse/SPARK-4357
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Reporter: Patrick Wendell
> Assignee: Patrick Wendell
>
> We'll need to put in some effort to make our publishing work with 2.11, since
> the current pipeline assumes a single set of artifacts is published.
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280034#comment-14280034 ]

François Garillot commented on SPARK-5147:
--

I see. Thanks for your answers!

For the locality issue, how about running recovery from the WAL as if it were replication? In that sense, we would be using the WAL's HDFS write as a transport mechanism (since it replicates to two other executors), and then recreating a block at the endpoint. Perhaps it's worth noting this idea in a JIRA as a possible future enhancement?

> write ahead logs from streaming receiver are not purged because
> cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
> --
>
> Key: SPARK-5147
> URL: https://issues.apache.org/jira/browse/SPARK-5147
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Affects Versions: 1.2.0
> Reporter: Max Xu
> Priority: Blocker
>
> Hi all,
> We are running a Spark Streaming application with ReliableKafkaReceiver. We
> have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write
> ahead logs (WALs) for received data are created under the receivedData/streamId
> folder in the checkpoint directory.
> However, old WALs are never purged over time; receivedBlockMetadata and
> checkpoint files are purged correctly, though. I went through the code: the
> WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is
> responsible for cleaning up the old blocks. It has a method cleanupOldBlocks,
> which is never called by any class. The ReceiverSupervisorImpl class holds a
> WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock
> method to create WALs and never calls cleanupOldBlocks to purge old ones.
> The size of the WAL folder increases constantly on HDFS. This is preventing
> us from running the ReliableKafkaReceiver 24x7. Can somebody please take a
> look?
>
> Thanks,
> Max
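For intuition, the purge that the unused cleanupOldBlocks is meant to perform is a simple age-based sweep over log segments. A minimal sketch of that logic, with illustrative names (LogSegment, purgeOlderThan) that are not Spark's actual classes:

```scala
// Hedged sketch: delete every WAL segment whose end time falls before a
// threshold, keep the rest. Spark's real handler works on files in the
// checkpoint directory; this stand-in just models the selection step.
case class LogSegment(path: String, startTime: Long, endTime: Long)

def purgeOlderThan(
    segments: Seq[LogSegment],
    threshTime: Long): (Seq[LogSegment], Seq[LogSegment]) = {
  // partition returns (matching, non-matching): (to delete, to keep)
  segments.partition(_.endTime < threshTime)
}

val segs = Seq(
  LogSegment("wal-0", 0L, 100L),
  LogSegment("wal-1", 100L, 200L),
  LogSegment("wal-2", 200L, 300L)
)
val (toDelete, kept) = purgeOlderThan(segs, 150L)
// toDelete holds only "wal-0"; the two newer segments are retained
```

The bug reported above is simply that nothing ever invokes this step, so `toDelete` is never computed and the folder grows without bound.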
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280025#comment-14280025 ]

yuhao yang commented on SPARK-5186:
---

I just updated the PR with a hashCode fix. Please help review it when you can.

> Vector.equals and Vector.hashCode are very inefficient and fail on
> SparseVectors with large size
> -
>
> Key: SPARK-5186
> URL: https://issues.apache.org/jira/browse/SPARK-5186
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: Derrick Burns
> Original Estimate: 0.25h
> Remaining Estimate: 0.25h
>
> The implementations of Vector.equals and Vector.hashCode are correct but slow
> for SparseVectors that are truly sparse.
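The inefficiency comes from comparing vectors slot by slot over their full logical size. A sketch of the faster approach, assuming we only walk the stored entries; SparseVec here is an illustrative stand-in, not MLlib's SparseVector:

```scala
// Hedged sketch: compare two sparse vectors by their nonzero (index, value)
// pairs instead of materializing dense arrays. Explicitly stored zeros are
// filtered out so that equal vectors with different storage compare equal.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

def sparseEquals(a: SparseVec, b: SparseVec): Boolean = {
  if (a.size != b.size) return false
  // Nonzero entries as (index, value) pairs, already in index order.
  def active(v: SparseVec): Seq[(Int, Double)] =
    v.indices.zip(v.values).filter(_._2 != 0.0).toSeq
  active(a) == active(b)
}

val x = SparseVec(1000000, Array(3, 7), Array(1.0, 2.0))
val y = SparseVec(1000000, Array(3, 5, 7), Array(1.0, 0.0, 2.0)) // explicit zero at 5
// sparseEquals(x, y) is true, and only the stored entries are touched,
// whereas a dense comparison would scan all 1,000,000 slots.
```

A hashCode consistent with this equality would likewise fold over only the nonzero pairs, which is presumably the shape of the fix in the PR.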
[jira] [Created] (SPARK-5280) Import RDF graphs into GraphX
lukovnikov created SPARK-5280:
-

Summary: Import RDF graphs into GraphX
Key: SPARK-5280
URL: https://issues.apache.org/jira/browse/SPARK-5280
Project: Spark
Issue Type: New Feature
Components: GraphX
Reporter: lukovnikov

RDF (Resource Description Framework) models knowledge in a graph and is heavily used on the Semantic Web and beyond. GraphX should include a way to import RDF data easily.
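As a sense of what such an importer might start from: RDF's simplest serialization, N-Triples, is one (subject, predicate, object) statement per line, which maps naturally onto GraphX edges. A simplified parsing sketch, with illustrative names and no handling of literals containing spaces, comments, or escapes; this is not a proposed GraphX API:

```scala
// Hedged sketch: parse an N-Triples line into a (subject, predicate, object)
// statement. In GraphX terms, subject and object would become vertices and
// the predicate the edge attribute.
case class Triple(subj: String, pred: String, obj: String)

def parseNTriple(line: String): Option[Triple] = {
  // N-Triples statements end with " ." and separate terms with whitespace.
  val body = line.trim.stripSuffix(".").trim
  body.split("\\s+", 3) match {
    case Array(s, p, o) => Some(Triple(s, p, o))
    case _              => None
  }
}

val stmt = parseNTriple(
  "<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> .")
// stmt contains Triple("<http://ex.org/a>", "<http://ex.org/knows>", "<http://ex.org/b>")
```

A real importer would also need IRI/blank-node/literal distinctions and a vertex-ID assignment step before building the GraphX Graph.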
[jira] [Commented] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279983#comment-14279983 ]

Reynold Xin commented on SPARK-4867:
--

BTW, if we plan to implement most SQL functions using this new UDF interface, then we should consider making mutable primitive types first-class citizens. Otherwise we will incur a huge performance hit whenever functions on primitives are invoked.

> UDF clean up
> -
>
> Key: SPARK-4867
> URL: https://issues.apache.org/jira/browse/SPARK-4867
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Priority: Blocker
>
> Right now our support for, and internal implementation of, many functions have
> a few issues. Specifically:
> - UDFs don't know their input types and thus don't do type coercion.
> - We hard-code a bunch of built-in functions into the parser. This is bad
> because in SQL it creates new reserved words for things that aren't actually
> keywords. Also, it means that for each function we need to add support to
> both SQLContext and HiveContext separately.
> For this JIRA I propose we do the following:
> - Change the interfaces for registerFunction and ScalaUdf to include types
> for the input arguments as well as the output type.
> - Add a rule to analysis that does type coercion for UDFs.
> - Add a parse rule for functions to SQLParser.
> - Rewrite all the UDFs that are currently hacked into the various parsers
> using this new functionality.
> Depending on how big this refactoring becomes, we could split parts 1 & 2 from
> part 3 above.
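The first two proposal items amount to a registry in which each UDF declares its input and output types, so an analysis rule can insert coercions instead of failing at runtime. A minimal sketch under that assumption; the types and registry below are illustrative stand-ins, not Catalyst's actual classes:

```scala
// Hedged sketch of typed UDF registration: a UDF carries declared input and
// output types alongside its function, so an analyzer rule can compare each
// argument's type against inputTypes and wrap mismatches in casts.
sealed trait SqlType
case object IntType extends SqlType
case object StringType extends SqlType

case class TypedUdf(inputTypes: Seq[SqlType], outputType: SqlType, fn: Seq[Any] => Any)

object UdfRegistry {
  private var udfs = Map.empty[String, TypedUdf]
  def register(name: String, udf: TypedUdf): Unit = udfs += name -> udf
  def lookup(name: String): Option[TypedUdf] = udfs.get(name)
}

UdfRegistry.register("strlen",
  TypedUdf(Seq(StringType), IntType, args => args.head.asInstanceOf[String].length))

val udf = UdfRegistry.lookup("strlen").get
// udf.inputTypes tells the analyzer to coerce the argument to a string;
// udf.fn(Seq("spark")) evaluates to 5
```

Reynold's comment then points at the evaluation side: with fn taking boxed `Any` values as above, every call on primitives pays boxing costs, which is why mutable primitive types would need first-class treatment.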
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279962#comment-14279962 ]

Al M commented on SPARK-5270:
-

Good point, it's not a catch-all solution. The rdd.partitions.size approach does work well for the empty RDDs created by Spark Streaming.

> Elegantly check if RDD is empty
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.2.0
> Environment: Centos 6
> Reporter: Al M
> Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty. As discussed
> here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams. Sometimes my batches are
> huge in one stream; sometimes I get nothing for hours. Still, I have to run
> count() to check if there is anything in the RDD. I can process my empty RDDs
> like the others, but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor a
> fast solution.
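One way to get the requested check without a full count() is to look at only the first element. A sketch of the idea using a tiny stand-in class so it stays self-contained; on a real RDD the same check would be spelled rdd.take(1).isEmpty, which stops at the first non-empty partition:

```scala
// Hedged sketch: isEmpty as "is there a first element?". MiniRdd is an
// illustrative stand-in for an RDD, modeled as a sequence of partitions.
class MiniRdd[T](partitions: Seq[Seq[T]]) {
  // Lazily walk partitions and stop as soon as n elements are found,
  // mirroring how take(1) avoids scanning the whole dataset.
  def take(n: Int): Seq[T] =
    partitions.iterator.flatMap(_.iterator).take(n).toSeq

  def isEmpty: Boolean = take(1).isEmpty
}

val idleBatch = new MiniRdd[Int](Seq(Seq(), Seq()))   // e.g. a quiet streaming interval
val busyBatch = new MiniRdd[Int](Seq(Seq(), Seq(42)))
// idleBatch.isEmpty is true; busyBatch.isEmpty is false
```

Note the caveat from the comment above: checking partitions alone (the rdd.partitions.size idea) covers the zero-partition RDDs Spark Streaming produces, but an RDD can also have partitions that contain no records, which is why the take(1)-style check is the more general one.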