[jira] [Assigned] (SPARK-6604) Specify IP of Python server socket
[ https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6604: --- Assignee: (was: Apache Spark) > Specify IP of Python server socket > -- > > Key: SPARK-6604 > URL: https://issues.apache.org/jira/browse/SPARK-6604 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Weizhong >Priority: Minor > > Currently the driver starts a server socket bound to a wildcard IP; binding to the loopback address (127.0.0.1) would be more reasonable, since the socket is only used by the local Python process -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6604) Specify IP of Python server socket
[ https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386268#comment-14386268 ] Apache Spark commented on SPARK-6604: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5256 > Specify IP of Python server socket > -- > > Key: SPARK-6604 > URL: https://issues.apache.org/jira/browse/SPARK-6604 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Weizhong >Priority: Minor > > Currently the driver starts a server socket bound to a wildcard IP; binding to the loopback address (127.0.0.1) would be more reasonable, since the socket is only used by the local Python process -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6604) Specify IP of Python server socket
[ https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6604: --- Assignee: Apache Spark > Specify IP of Python server socket > -- > > Key: SPARK-6604 > URL: https://issues.apache.org/jira/browse/SPARK-6604 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Weizhong >Assignee: Apache Spark >Priority: Minor > > Currently the driver starts a server socket bound to a wildcard IP; binding to the loopback address (127.0.0.1) would be more reasonable, since the socket is only used by the local Python process -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
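For concreteness, a minimal Python sketch (generic stdlib sockets, not Spark's actual driver code) of the difference between the current wildcard bind and the proposed loopback-only bind:

{code:python}
import socket

# Current behaviour: binding to the wildcard address exposes the port on every interface.
wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wildcard.bind(("0.0.0.0", 0))        # port 0 = let the OS pick an ephemeral port

# Proposed behaviour: binding to loopback keeps the port reachable only from this host,
# which is all the driver <-> local Python process channel needs.
local_only = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
local_only.bind(("127.0.0.1", 0))
local_only.listen(1)
print(local_only.getsockname())      # e.g. ('127.0.0.1', 53124)
{code}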
[jira] [Assigned] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6600: --- Assignee: (was: Apache Spark) > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark_ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. See linked issue. > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386267#comment-14386267 ] Apache Spark commented on SPARK-6600: - User 'florianverhein' has created a pull request for this issue: https://github.com/apache/spark/pull/5257 > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark_ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. See linked issue. > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6600: --- Assignee: Apache Spark > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein >Assignee: Apache Spark > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark_ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. See linked issue. > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
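As a rough sketch of the kind of change being proposed, assuming the boto 2 security-group objects that ec2/spark_ec2.py already works with (the helper name and the wide-open CIDR below are illustrative, not the actual pull request):

{code:python}
# Hypothetical helper; spark_ec2.py's real rule setup differs in detail.
NFS_GATEWAY_PORTS = [111, 2049, 4242]  # portmapper, NFS, mountd

def open_hdfs_nfs_ports(master_group, cidr_ip="0.0.0.0/0"):
    """Authorize the HDFS NFS gateway ports on the master security group
    for both TCP and UDP."""
    for port in NFS_GATEWAY_PORTS:
        for proto in ("tcp", "udp"):
            master_group.authorize(ip_protocol=proto, from_port=port,
                                   to_port=port, cidr_ip=cidr_ip)
{code}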
[jira] [Created] (SPARK-6604) Specify IP of Python server socket
Weizhong created SPARK-6604: --- Summary: Specify IP of Python server socket Key: SPARK-6604 URL: https://issues.apache.org/jira/browse/SPARK-6604 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Priority: Minor Currently the driver starts a server socket bound to a wildcard IP; binding to the loopback address (127.0.0.1) would be more reasonable, since the socket is only used by the local Python process -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
Reynold Xin created SPARK-6603: -- Summary: SQLContext.registerFunction -> SQLContext.udf.register Key: SPARK-6603 URL: https://issues.apache.org/jira/browse/SPARK-6603 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Davies Liu We didn't change the Python implementation to use that. Maybe the best strategy is to deprecate SQLContext.registerFunction, and just add SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
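To make the proposal concrete, a PySpark sketch assuming an existing SQLContext named sqlContext and a registered people table; the udf.register form is the interface this ticket proposes, not something the Python API exposes yet:

{code:python}
from pyspark.sql.types import IntegerType

# Existing API, which this ticket suggests deprecating:
sqlContext.registerFunction("strLen", lambda s: len(s), IntegerType())

# Proposed API, mirroring the Scala side's sqlContext.udf.register:
sqlContext.udf.register("strLen", lambda s: len(s), IntegerType())

sqlContext.sql("SELECT strLen(name) FROM people").collect()
{code}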
[jira] [Assigned] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5203: --- Assignee: Apache Spark > union with different decimal type report error > -- > > Key: SPARK-5203 > URL: https://issues.apache.org/jira/browse/SPARK-5203 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: guowei >Assignee: Apache Spark > > Test case like this: > {code:sql} > create table test (a decimal(10,1)); > select a from test union all select a*2 from test; > {code} > Exception thown: > {noformat} > 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union > all select a*2 from test] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved > attributes: *, tree: > 'Project [*] > 'Subquery _u1 > 'Union >Project [a#1] > MetastoreRelation default, test, None >Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), > DecimalType())), DecimalType(21,1)) AS _c0#0] > MetastoreRelation default, test, None > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) > at > org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410) > at > org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410) > at > org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411) > at > org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411) > at > org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412) > at > org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369) > at > 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5203: --- Assignee: (was: Apache Spark) > union with different decimal type report error > -- > > Key: SPARK-5203 > URL: https://issues.apache.org/jira/browse/SPARK-5203 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: guowei > > Test case like this: > {code:sql} > create table test (a decimal(10,1)); > select a from test union all select a*2 from test; > {code} > Exception thown: > {noformat} > 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union > all select a*2 from test] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved > attributes: *, tree: > 'Project [*] > 'Subquery _u1 > 'Union >Project [a#1] > MetastoreRelation default, test, None >Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), > DecimalType())), DecimalType(21,1)) AS _c0#0] > MetastoreRelation default, test, None > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) > at > org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410) > at > org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410) > at > org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411) > at > org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411) > at > org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412) > at > org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369) > at > 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
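Until the analyzer widens the decimal types itself, one possible (untested) workaround sketch is to cast both sides of the union to a common decimal type explicitly; the (21,1) precision/scale below simply matches the type the plan above already widens to, and sc is assumed to be an existing SparkContext:

{code:python}
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
rows = sqlContext.sql("""
    SELECT CAST(a AS DECIMAL(21,1)) AS a FROM test
    UNION ALL
    SELECT CAST(a * 2 AS DECIMAL(21,1)) AS a FROM test
""").collect()
{code}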
[jira] [Closed] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-6586. -- Resolution: Not a Problem > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386209#comment-14386209 ] Liang-Chi Hsieh commented on SPARK-6586: I have no problem with your opinion. But if that is true, we don't need to keep the logical plan in queryExecution now. I am closing this, thanks. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
Reynold Xin created SPARK-6602: -- Summary: Replace direct use of Akka with Spark RPC interface Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5124. Resolution: Fixed Fix Version/s: 1.4.0 > Standardize internal RPC interface > -- > > Key: SPARK-5124 > URL: https://issues.apache.org/jira/browse/SPARK-5124 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Fix For: 1.4.0 > > Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf > > > In Spark we use Akka as the RPC layer. It would be great if we can > standardize the internal RPC interface to facilitate testing. This will also > provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386170#comment-14386170 ] Konstantin Shaposhnikov commented on SPARK-6566: Thank you for the update [~lian cheng] > Update Spark to use the latest version of Parquet libraries > --- > > Key: SPARK-6566 > URL: https://issues.apache.org/jira/browse/SPARK-6566 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Konstantin Shaposhnikov > > There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). > E.g. PARQUET-136 > It would be good to update Spark to use the latest parquet version. > The following changes are required: > {code} > diff --git a/pom.xml b/pom.xml > index 5ad39a9..095b519 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -132,7 +132,7 @@ > > 0.13.1 > 10.10.1.1 > -1.6.0rc3 > +1.6.0rc7 > 1.2.3 > 8.1.14.v20131031 > 3.0.0.v201112011016 > {code} > and > {code} > --- > a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala > +++ > b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala > @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat > globalMetaData = new GlobalMetaData(globalMetaData.getSchema, >mergedMetadata, globalMetaData.getCreatedBy) > > -val readContext = getReadSupport(configuration).init( > +val readContext = > ParquetInputFormat.getReadSupportInstance(configuration).init( >new InitContext(configuration, > globalMetaData.getKeyValueMetaData, > globalMetaData.getSchema)) > {code} > I am happy to prepare a pull request if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386167#comment-14386167 ] Michael Armbrust commented on SPARK-6586: - I can see the utility of seeing the original plan in any given run of the optimizer. However, providing it for any arbitrary assembly of query plans feels like unnecessary complexity to me. I think it's only reasonable to add such instrumentation when it is actually useful for solving an issue. Doing so speculatively only leads to code complexity. If you have a concrete example where this information would be useful we can continue to discuss, but otherwise this issue should be closed. Additionally, PRs to add new features should *always* have tests. Otherwise these features will be broken almost immediately. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6559) specify port of python gateway server and let it retry if failed
[ https://issues.apache.org/jira/browse/SPARK-6559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang closed SPARK-6559. --- Resolution: Not a Problem > specify port of python gateway server and let it retry if failed > > > Key: SPARK-6559 > URL: https://issues.apache.org/jira/browse/SPARK-6559 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Tao Wang >Priority: Minor > > Currently the gateway server binds to a random port; we might want to bind it to a specific port so that we can apply firewall rules. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
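Since the ticket is closed this is purely illustrative: a generic Python sketch of the "bind to a chosen port, retry on the next one if it fails" idea the summary describes, using plain sockets rather than the actual Py4J gateway launch code:

{code:python}
import socket

def bind_with_retry(start_port, attempts=10, host="127.0.0.1"):
    """Try start_port, start_port + 1, ... until a bind succeeds."""
    for offset in range(attempts):
        port = start_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock, port
        except socket.error:      # port already in use, try the next one
            sock.close()
    raise RuntimeError("no free port in [%d, %d]" % (start_port, start_port + attempts - 1))
{code}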
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6601: --- Description: Add module "hdfs-nfs-gateway", which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires #6600 was: Add module "hdfs-nfs-gateway", which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires [#6600] > Add HDFS NFS gateway module to spark-ec2 > > > Key: SPARK-6601 > URL: https://issues.apache.org/jira/browse/SPARK-6601 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Add module "hdfs-nfs-gateway", which sets up the gateway for (say, > ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. > Note: For nfs to be available outside AWS, also requires #6600 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Description: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See linked issue. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html was: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See [#6601] for this. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark_ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. See linked issue. > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Description: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See [#6601] for this. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html was: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark_ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. See [#6601] for this. > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6601: --- Description: Add module "hdfs-nfs-gateway", which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires [#6600] was: Add module "hdfs-nfs-gateway", which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires #6600 > Add HDFS NFS gateway module to spark-ec2 > > > Key: SPARK-6601 > URL: https://issues.apache.org/jira/browse/SPARK-6601 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Add module "hdfs-nfs-gateway", which sets up the gateway for (say, > ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. > Note: For nfs to be available outside AWS, also requires [#6600] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
Florian Verhein created SPARK-6601: -- Summary: Add HDFS NFS gateway module to spark-ec2 Key: SPARK-6601 URL: https://issues.apache.org/jira/browse/SPARK-6601 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Add module "hdfs-nfs-gateway", which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires #6600 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Description: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html was: Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark_ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. That should be a separate issue > (TODO). > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Summary: Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway(was: Open ports in spark-ec2.py to allow HDFS NFS gateway ) > Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark-ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. That should be a separate issue > (TODO). > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Description: Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html was: Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html > Open ports in spark-ec2.py to allow HDFS NFS gateway > -- > > Key: SPARK-6600 > URL: https://issues.apache.org/jira/browse/SPARK-6600 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Use case: User has set up the hadoop hdfs nfs gateway service on their > spark-ec2.py launched cluster, and wants to mount that on their local > machine. > Requires the following ports to be opened on incoming rule set for MASTER for > both UDP and TCP: 111, 2049, 4242. > (I have tried this and it works) > Note that this issue *does not* cover the implementation of a hdfs nfs > gateway module in the spark-ec2 project. That should be a separate issue > (TODO). > Reference: > https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway
Florian Verhein created SPARK-6600: -- Summary: Open ports in spark-ec2.py to allow HDFS NFS gateway Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6369: Summary: InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter (was: InsertIntoHiveTable should use logic from SparkHadoopWriter) > InsertIntoHiveTable and Parquet Relation should use logic from > SparkHadoopWriter > > > Key: SPARK-6369 > URL: https://issues.apache.org/jira/browse/SPARK-6369 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > Right now it is possible that we will corrupt the output if there is a race > between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] q79969786 updated SPARK-6594: - Comment: was deleted (was:
1. Create the Kafka topic as follows:
$kafka-topics.sh --create --zookeeper zkServer1:2181,zkServer2:2181,zkServer3:2181 --replication-factor 1 --partitions 5 --topic ORDER
2. I use the Java API to process the data as follows:
SparkConf sparkConf = new SparkConf().setAppName("TestOrder");
sparkConf.set("spark.cleaner.ttl", "600");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(1000));
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", 5);
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, "zkServer1:2181,zkServer2:2181,zkServer3:2181", "test-consumer-group", topicorder);
jPRIDSOrder.map(new Function<Tuple2<String, String>, String>() {
  @Override public String call(Tuple2<String, String> tuple2) { return tuple2._2(); }
}).print();
3. Submit the application as follows:
spark-submit --class com.bigdata.TestOrder --master spark://SPKMASTER:19002 /home/bigdata/test-spark.jar TestOrder
4. It shows five warnings like the following when the application is submitted:
15/03/29 21:23:03 WARN ZookeeperConsumerConnector: [test-consumer-group_work1-1427462582342-5714642d], No broker partitions consumed by consumer thread test-consumer-group_work1-1427462582342-5714642d-0 for topic ORDER .. )
> Spark Streaming can't receive data from kafka > - > > Key: SPARK-6594 > URL: https://issues.apache.org/jira/browse/SPARK-6594 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.1 > Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 >Reporter: q79969786 > > I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows: > Map<String, Integer> topicorder = new HashMap<String, Integer>(); > topicorder.put("order", Integer.valueOf(readThread)); > JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); > It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386070#comment-14386070 ] q79969786 commented on SPARK-6594: --
1. Create the Kafka topic as follows:
$kafka-topics.sh --create --zookeeper zkServer1:2181,zkServer2:2181,zkServer3:2181 --replication-factor 1 --partitions 5 --topic ORDER
2. I use the Java API to process the data as follows:
SparkConf sparkConf = new SparkConf().setAppName("TestOrder");
sparkConf.set("spark.cleaner.ttl", "600");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(1000));
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", 5);
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, "zkServer1:2181,zkServer2:2181,zkServer3:2181", "test-consumer-group", topicorder);
jPRIDSOrder.map(new Function<Tuple2<String, String>, String>() {
  @Override public String call(Tuple2<String, String> tuple2) { return tuple2._2(); }
}).print();
3. Submit the application as follows:
spark-submit --class com.bigdata.TestOrder --master spark://SPKMASTER:19002 /home/bigdata/test-spark.jar TestOrder
4. It shows five warnings like the following when the application is submitted:
15/03/29 21:23:03 WARN ZookeeperConsumerConnector: [test-consumer-group_work1-1427462582342-5714642d], No broker partitions consumed by consumer thread test-consumer-group_work1-1427462582342-5714642d-0 for topic ORDER ..
> Spark Streaming can't receive data from kafka > - > > Key: SPARK-6594 > URL: https://issues.apache.org/jira/browse/SPARK-6594 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.1 > Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 >Reporter: q79969786 > > I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows: > Map<String, Integer> topicorder = new HashMap<String, Integer>(); > topicorder.put("order", Integer.valueOf(readThread)); > JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); > It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386068#comment-14386068 ] Liang-Chi Hsieh commented on SPARK-6586: Even just for debugging purposes, I think it is important to provide a way to access the logical plan. The mutable state this adds is limited to LogicalPlan. If mutable state is bad here, I think I can refactor this into a version without mutable state. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking LDA and how do they scale as we start scaling the topics? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss > Support sparse LDA solutions > > > Key: SPARK-5564 > URL: https://issues.apache.org/jira/browse/SPARK-5564 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Latent Dirichlet Allocation (LDA) currently requires that the priors’ > concentration parameters be > 1.0. It should support values > 0.0, which > should encourage sparser topics (phi) and document-topic distributions > (theta). > For EM, this will require adding a projection to the M-step, as in: Vorontsov > and Potapenko. "Tutorial on Probabilistic Topic Modeling : Additive > Regularization for Stochastic Matrix Factorization." 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases from your LDA JIRA...For recommendation, I know how to construct the testcases... > Support sparse LDA solutions > > > Key: SPARK-5564 > URL: https://issues.apache.org/jira/browse/SPARK-5564 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Latent Dirichlet Allocation (LDA) currently requires that the priors’ > concentration parameters be > 1.0. It should support values > 0.0, which > should encourage sparser topics (phi) and document-topic distributions > (theta). > For EM, this will require adding a projection to the M-step, as in: Vorontsov > and Potapenko. "Tutorial on Probabilistic Topic Modeling : Additive > Regularization for Stochastic Matrix Factorization." 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386049#comment-14386049 ] Debasish Das commented on SPARK-5564: - [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases from your LDA JIRA...For recommendation, I know how to construct the testcases... > Support sparse LDA solutions > > > Key: SPARK-5564 > URL: https://issues.apache.org/jira/browse/SPARK-5564 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Latent Dirichlet Allocation (LDA) currently requires that the priors’ > concentration parameters be > 1.0. It should support values > 0.0, which > should encourage sparser topics (phi) and document-topic distributions > (theta). > For EM, this will require adding a projection to the M-step, as in: Vorontsov > and Potapenko. "Tutorial on Probabilistic Topic Modeling : Additive > Regularization for Stochastic Matrix Factorization." 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
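A rough numpy sketch of the projection the description refers to, assuming the usual MAP/EM update in which the topic-word estimate is proportional to expected counts plus (concentration - 1); with a concentration below 1 some entries go negative, so they are clipped to zero before renormalizing (the array names are illustrative):

{code:python}
import numpy as np

def projected_m_step(expected_counts, concentration):
    """expected_counts: (num_topics, vocab_size) sufficient statistics from the E-step.
    Returns row-normalized topic-word distributions; clipping at zero is what lets
    concentration values in (0, 1) produce sparse topics."""
    unnormalized = np.maximum(expected_counts + concentration - 1.0, 0.0)
    row_sums = unnormalized.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0   # guard against all-zero rows
    return unnormalized / row_sums
{code}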
[jira] [Created] (SPARK-6599) Add Kinesis Direct API
Tathagata Das created SPARK-6599: Summary: Add Kinesis Direct API Key: SPARK-6599 URL: https://issues.apache.org/jira/browse/SPARK-6599 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature
[ https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386036#comment-14386036 ] Kai Sasaki commented on SPARK-6261: --- [~josephkb] I created JIRA for IDFModel here. [SPARK-6598|https://issues.apache.org/jira/browse/SPARK-6598]. Thank you! > Python MLlib API missing items: Feature > --- > > Key: SPARK-6261 > URL: https://issues.apache.org/jira/browse/SPARK-6261 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > This JIRA lists items missing in the Python API for this sub-package of MLlib. > This list may be incomplete, so please check again when sending a PR to add > these features to the Python API. > Also, please check for major disparities between documentation; some parts of > the Python API are less well-documented than their Scala counterparts. Some > items may be listed in the umbrella JIRA linked to this task. > StandardScalerModel > * All functionality except predict() is missing. > IDFModel > * idf > Word2Vec > * setMinCount > Word2VecModel > * getVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6598) Python API for IDFModel
Kai Sasaki created SPARK-6598: - Summary: Python API for IDFModel Key: SPARK-6598 URL: https://issues.apache.org/jira/browse/SPARK-6598 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap IDFModel {{idf}} member function for pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
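For reference, the Scala-side counterpart of the binding requested here (sketch; assumes an existing RDD[Vector] of term-frequency vectors named termFreqs, e.g. produced by HashingTF):
{code}
import org.apache.spark.mllib.feature.IDF

val idfModel = new IDF().fit(termFreqs)
println(idfModel.idf)  // per-term inverse document frequencies, the member to expose in PySpark
{code}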
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386020#comment-14386020 ] Michael Armbrust commented on SPARK-6586: - Okay, but what is the utility of keeping a fully unresolved plan around? You are just complicating data frame with a bunch of mutable state. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6597) Replace `input:checkbox` with `input[type="checkbox"]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386017#comment-14386017 ] Apache Spark commented on SPARK-6597: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/5254 > Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js > -- > > Key: SPARK-6597 > URL: https://issues.apache.org/jira/browse/SPARK-6597 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Kousuke Saruta >Priority: Minor > > In additional-metrics.js, there are some selector notation like > `input:checkbox` but JQuery's official document says `input[type="checkbox"]` > is better. > https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6597) Replace `input:checkbox` with `input[type="checkbox"]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6597: --- Assignee: (was: Apache Spark) > Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js > -- > > Key: SPARK-6597 > URL: https://issues.apache.org/jira/browse/SPARK-6597 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Kousuke Saruta >Priority: Minor > > In additional-metrics.js, there are some selector notation like > `input:checkbox` but JQuery's official document says `input[type="checkbox"]` > is better. > https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6597) Replace `input:checkbox` with `input[type="checkbox"]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6597: --- Assignee: Apache Spark > Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js > -- > > Key: SPARK-6597 > URL: https://issues.apache.org/jira/browse/SPARK-6597 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > In additional-metrics.js, there are some selector notation like > `input:checkbox` but JQuery's official document says `input[type="checkbox"]` > is better. > https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6597) Replace `input:checkbox` with `input[type="checkbox"]` in additional-metrics.js
Kousuke Saruta created SPARK-6597: - Summary: Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Priority: Minor In additional-metrics.js, there are some selector notation like `input:checkbox` but JQuery's official document says `input[type="checkbox"]` is better. https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386009#comment-14386009 ] Liang-Chi Hsieh edited comment on SPARK-6586 at 3/29/15 11:24 PM: -- Not true. Because DataFrame now is given analyzed plan after its many operations, {{df.queryExecution.logical}} is analyzed plan instead of the original logical plan. You can check the pr #5217 for the modification. was (Author: viirya): Not true. Because DataFrame now is given analyzed plan after its many operations, {{df.queryExecution.logical}} is analyzed plan instead of the original logical plan. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reopened SPARK-6586: > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386009#comment-14386009 ] Liang-Chi Hsieh commented on SPARK-6586: Not true. Because DataFrame now is given analyzed plan after its many operations, {{df.queryExecution.logical}} is analyzed plan instead of the original logical plan. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
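A minimal sketch of the two plans under discussion (assumes a SQLContext named sqlContext and a registered table "t"; illustrative only, not a proposed change):
{code}
val df = sqlContext.table("t").filter("a > 1")
println(df.queryExecution.logical)   // the plan the DataFrame carries (after #5217 this is already analyzed)
println(df.queryExecution.analyzed)  // the plan produced by the Analyzer
{code}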
[jira] [Assigned] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6596: --- Assignee: Apache Spark > fix the instruction on building scaladoc > - > > Key: SPARK-6596 > URL: https://issues.apache.org/jira/browse/SPARK-6596 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Nan Zhu >Assignee: Apache Spark > > In README.md under docs/ directory, it says that > You can build just the Spark scaladoc by running build/sbt doc from the > SPARK_PROJECT_ROOT directory. > I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385993#comment-14385993 ] Apache Spark commented on SPARK-6596: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/5253 > fix the instruction on building scaladoc > - > > Key: SPARK-6596 > URL: https://issues.apache.org/jira/browse/SPARK-6596 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Nan Zhu > > In README.md under docs/ directory, it says that > You can build just the Spark scaladoc by running build/sbt doc from the > SPARK_PROJECT_ROOT directory. > I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6596: --- Assignee: (was: Apache Spark) > fix the instruction on building scaladoc > - > > Key: SPARK-6596 > URL: https://issues.apache.org/jira/browse/SPARK-6596 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0 >Reporter: Nan Zhu > > In README.md under docs/ directory, it says that > You can build just the Spark scaladoc by running build/sbt doc from the > SPARK_PROJECT_ROOT directory. > I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6596) fix the instruction on building scaladoc
Nan Zhu created SPARK-6596: -- Summary: fix the instruction on building scaladoc Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu In README.md under docs/ directory, it says that You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory. I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6592: --- Assignee: (was: Apache Spark) > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385991#comment-14385991 ] Apache Spark commented on SPARK-6592: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/5252 > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6592: --- Assignee: Apache Spark > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Assignee: Apache Spark >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6595: Target Version/s: 1.3.1, 1.4.0 (was: 1.3.1) > DataFrame self joins with MetastoreRelations fail > - > > Key: SPARK-6595 > URL: https://issues.apache.org/jira/browse/SPARK-6595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385970#comment-14385970 ] Reynold Xin commented on SPARK-6592: Ok then can't you just add "apache" to it? > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
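To illustrate the path-matching question (hypothetical paths, not the actual SparkBuild.scala edit): matching on "apache/spark/sql/catalyst" hits the catalyst package directory but not Row, whose package is org.apache.spark.sql even though the file lives in the sql/catalyst module.
{code}
import java.io.File

def excludedFromScaladoc(f: File): Boolean =
  f.getCanonicalPath.contains("apache/spark/sql/catalyst")

val row      = new File("sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/Row.class")
val internal = new File("sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/trees/TreeNode.class")

println(excludedFromScaladoc(row))      // false -> Row stays in the docs
println(excludedFromScaladoc(internal)) // true  -> catalyst internals stay hidden
{code}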
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385964#comment-14385964 ] Nan Zhu commented on SPARK-6592: it contains the reason is that the input of that line is "file.getCanonicalPath"...which output the absolute path e.g. {code} scala> val f = new java.io.File("Row.class") f: java.io.File = Row.class scala> f.getCanonicalPath res0: String = /Users/nanzhu/code/spark/sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/Row.class {code} > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385924#comment-14385924 ] Apache Spark commented on SPARK-6595: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/5251 > DataFrame self joins with MetastoreRelations fail > - > > Key: SPARK-6595 > URL: https://issues.apache.org/jira/browse/SPARK-6595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6595: --- Assignee: Michael Armbrust (was: Apache Spark) > DataFrame self joins with MetastoreRelations fail > - > > Key: SPARK-6595 > URL: https://issues.apache.org/jira/browse/SPARK-6595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6595: --- Assignee: Apache Spark (was: Michael Armbrust) > DataFrame self joins with MetastoreRelations fail > - > > Key: SPARK-6595 > URL: https://issues.apache.org/jira/browse/SPARK-6595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
Michael Armbrust created SPARK-6595: --- Summary: DataFrame self joins with MetastoreRelations fail Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385895#comment-14385895 ] Reynold Xin commented on SPARK-6592: Row.html/class doesn't contain the word catalyst, does it? ./api/java/org/apache/spark/sql/Row.html > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385890#comment-14385890 ] Steve Loughran commented on SPARK-1537: --- # I've just tried to see where YARN-2444 stands; I can't replicate it in trunk but I've submitted the tests to verify that it isn't there. # for YARN-2423 Spark seems kind of trapped. It needs an api tagged as public/stable; Robert's patch has the API, except it's being rejected on the basis that "ATSv2 will break it". So it can't be tagged as stable. So there's no API for GET operations until some undefined time {{t1 > now()}} —and then, only for Hadoop versions with it. Which implies it won't get picked up by Spark for a long time. I think we need to talk to the YARN dev team and see what can be done here. Even if there's no API client bundled into YARN, unless the v1 API and its paths beginning {{/ws/v1/timeline/}} are going to go away, then a REST client is possible; it may just have to be done spark-side, where at least it can be made resilient to hadoop versions. > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Attachments: SPARK-1537.txt, spark-1573.patch > > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
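A very rough sketch of what a Spark-side REST client against the v1 paths could look like (hypothetical host, port and entity type; a real client would need retries, authentication and JSON parsing):
{code}
import java.net.URL
import scala.io.Source

def fetchTimelineEntities(webappAddress: String, entityType: String): String = {
  // GET against the ATS v1 REST path discussed above; the address normally comes from
  // yarn.timeline-service.webapp.address.
  val conn = new URL(s"http://$webappAddress/ws/v1/timeline/$entityType").openConnection()
  conn.setConnectTimeout(5000)
  conn.setReadTimeout(5000)
  Source.fromInputStream(conn.getInputStream).mkString  // raw JSON payload
}

// e.g. fetchTimelineEntities("timelinehost:8188", "SPARK_APPLICATION")
{code}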
[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6548: Target Version/s: 1.4.0 > Adding stddev to DataFrame functions > > > Key: SPARK-6548 > URL: https://issues.apache.org/jira/browse/SPARK-6548 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > Labels: DataFrame, starter > > Add it to the list of aggregate functions: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala > Also add it to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala > We can either add a Stddev Catalyst expression, or just compute it using > existing functions like here: > https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
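A sketch of the workaround route the ticket mentions, computing a population standard deviation from existing aggregates rather than adding a new Catalyst expression (assumes a DataFrame df with a numeric column "x"):
{code}
import org.apache.spark.sql.functions._

// sqrt(E[x^2] - E[x]^2), built entirely from existing aggregate functions
val stddevOfX = df.agg(
  sqrt(avg(col("x") * col("x")) - avg(col("x")) * avg(col("x"))).as("stddev_x"))
stddevOfX.show()
{code}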
[jira] [Resolved] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6586. - Resolution: Not a Problem You can already get the original plan from a DataFrame: {{df.queryExecution.logical}}. > Add the capability of retrieving original logical plan of DataFrame > --- > > Key: SPARK-6586 > URL: https://issues.apache.org/jira/browse/SPARK-6586 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan > instead of logical plan. However, by doing that we can't know the logical > plan of a {{DataFrame}}. But it might be still useful and important to > retrieve the original logical plan in some use cases. > In this pr, we introduce the capability of retrieving original logical plan > of {{DataFrame}}. > The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once > {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as > {{true}}. In {{QueryExecution}}, we keep the original logical plan in the > analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to > recursively replace the analyzed logical plan with original logical plan and > retrieve it. > Besides the capability of retrieving original logical plan, this modification > also can avoid do plan analysis if it is already analyzed. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6592: Target Version/s: 1.4.0 > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6592: Priority: Critical (was: Major) > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu >Priority: Critical > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385743#comment-14385743 ] Sean Owen commented on SPARK-6594: -- There's no useful detail here. Can you elaborate what exactly you are running, what you observe? what is the input, what are the states of the topics, what information leads you to believe streaming is not reading? It's certainly not true that they don't work in general. I am successfully using this exact combination now and have had no problems. > Spark Streaming can't receive data from kafka > - > > Key: SPARK-6594 > URL: https://issues.apache.org/jira/browse/SPARK-6594 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.1 > Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 >Reporter: q79969786 > > I use KafkaUtils to receive data from Kafka In my Spark streaming application > as follows: > Map topicorder = new HashMap(); > topicorder.put("order", Integer.valueOf(readThread)); > JavaPairReceiverInputDStream jPRIDSOrder = > KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); > It worked well at fist, but after I submit this application several times, > Spark streaming can‘t receive data anymore(Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6594) Spark Streaming can't receive data from kafka
q79969786 created SPARK-6594: Summary: Spark Streaming can't receive data from kafka Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka In my Spark streaming application as follows: Map topicorder = new HashMap(); topicorder.put("order", Integer.valueOf(readThread)); JavaPairReceiverInputDStream jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); It worked well at fist, but after I submit this application several times, Spark streaming can‘t receive data anymore(Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
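For reference, the equivalent receiver-based stream in Scala (illustrative topic, group and ZooKeeper quorum; assumes spark-streaming-kafka on the classpath and a running StreamingContext ssc):
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Map("order" -> 1)  // topic -> number of receiver threads
val stream = KafkaUtils.createStream(
  ssc, "zk1:2181,zk2:2181", "my-consumer-group", topics, StorageLevel.MEMORY_AND_DISK_SER_2)
stream.map(_._2).print()  // print the message payloads per batch
{code}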
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385724#comment-14385724 ] Nan Zhu commented on SPARK-6592: ? I don't think that makes any difference, as the path of Row.scala still contains "spark/sql/catalyst"? I also tried to rerun build/sbt doc, the same thing... maybe we need to hack SparkBuild.scala to exclude Row.scala? > API of Row trait should be presented in Scala doc > - > > Key: SPARK-6592 > URL: https://issues.apache.org/jira/browse/SPARK-6592 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Nan Zhu > > Currently, the API of Row class is not presented in Scaladoc, though we have > many chances to use it > the reason is that we ignore all files under catalyst directly in > SparkBuild.scala when generating Scaladoc, > (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) > What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Description: When reading a large amount of gzip files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted file and continue the job. was: When reading a large amount of files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted file and continue the job. > Provide option for HadoopRDD to skip corrupted files > > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of gzip files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"), If the hadoop input libraries > report an exception then the entire job is canceled. As default behaviour > this is probably for the best, but it would be nice in some circumstances > where you know it will be ok to have the option to skip the corrupted file > and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
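One user-level workaround in the spirit of the request, not the proposed HadoopRDD option: read each gzip file as its own RDD, force a read, and skip the files whose decompression fails (paths are illustrative; assumes a SparkContext sc):
{code}
import scala.util.Try

val paths = Seq("hdfs:///user/cloudera/logs1.gz", "hdfs:///user/cloudera/logs2.gz")

val readable = paths.flatMap { p =>
  Try { val rdd = sc.textFile(p); rdd.count(); rdd }.toOption  // corrupted file -> Failure -> skipped
}
val allLines = sc.union(readable)  // union of only the files that could be read
{code}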
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Description: When reading a large amount of files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted file and continue the job. was: When reading a large amount of files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. > Provide option for HadoopRDD to skip corrupted files > > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries > report an exception then the entire job is canceled. As default behaviour > this is probably for the best, but it would be nice in some circumstances > where you know it will be ok to have the option to skip the corrupted file > and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed in some env.
[ https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6585: - Assignee: June > FileServerSuite.test ("HttpFileServer should not work with SSL when the > server is untrusted") failed is some evn. > - > > Key: SPARK-6585 > URL: https://issues.apache.org/jira/browse/SPARK-6585 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.3.0 >Reporter: June >Assignee: June >Priority: Minor > Fix For: 1.4.0 > > > In my test machine, FileServerSuite.test ("HttpFileServer should not work > with SSL when the server is untrusted") case throw SSLException not > SSLHandshakeException, suggest change to catch SSLException to improve test > case 's robustness. > [info] - HttpFileServer should not work with SSL when the server is untrusted > *** FAILED *** (69 milliseconds) > [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, > but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) > [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) > [info] at > org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) > [info] at > org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) > [info] at > org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
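A sketch of the suggested loosening (illustrative, not the actual FileServerSuite code): intercepting the broader javax.net.ssl.SSLException also covers environments that throw SSLHandshakeException, since the latter subclasses the former.
{code}
import javax.net.ssl.{SSLException, SSLHandshakeException}
import org.scalatest.FunSuite

class UntrustedServerSuite extends FunSuite {
  test("HttpFileServer should not work with SSL when the server is untrusted") {
    intercept[SSLException] {                              // broader than SSLHandshakeException
      throw new SSLHandshakeException("handshake failed")  // stand-in for the real untrusted fetch
    }
  }
}
{code}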
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Summary: Provide option for HadoopRDD to skip corrupted files (was: Provide option for HadoopRDD to skip bad data splits.) > Provide option for HadoopRDD to skip corrupted files > > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries > report an exception then the entire job is canceled. As default behaviour > this is probably for the best, but it would be nice in some circumstances > where you know it will be ok to have the option to skip the corrupted portion > and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385723#comment-14385723 ] Dale Richardson commented on SPARK-6593: Changed the title and description to focus closer on my particular use case, which is corrupted gzip files. > Provide option for HadoopRDD to skip corrupted files > > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries > report an exception then the entire job is canceled. As default behaviour > this is probably for the best, but it would be nice in some circumstances > where you know it will be ok to have the option to skip the corrupted portion > and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6585) FileServerSuite.test ("HttpFileServer should not work with SSL when the server is untrusted") failed in some env.
[ https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6585. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5239 [https://github.com/apache/spark/pull/5239] > FileServerSuite.test ("HttpFileServer should not work with SSL when the > server is untrusted") failed is some evn. > - > > Key: SPARK-6585 > URL: https://issues.apache.org/jira/browse/SPARK-6585 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.3.0 >Reporter: June >Priority: Minor > Fix For: 1.4.0 > > > In my test machine, FileServerSuite.test ("HttpFileServer should not work > with SSL when the server is untrusted") case throw SSLException not > SSLHandshakeException, suggest change to catch SSLException to improve test > case 's robustness. > [info] - HttpFileServer should not work with SSL when the server is untrusted > *** FAILED *** (69 milliseconds) > [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, > but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) > [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) > [info] at > org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) > [info] at > org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) > [info] at > org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Description: When reading a large amount of files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. was: When reading a large amount of files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If the hadoop input libraries > report an exception then the entire job is canceled. As default behaviour > this is probably for the best, but it would be nice in some circumstances > where you know it will be ok to have the option to skip the corrupted portion > and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name
[ https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6558. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5229 [https://github.com/apache/spark/pull/5229] > Utils.getCurrentUserName returns the full principal name instead of login name > -- > > Key: SPARK-6558 > URL: https://issues.apache.org/jira/browse/SPARK-6558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > Fix For: 1.4.0 > > > Utils.getCurrentUserName returns > UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't > set. It should return > UserGroupInformation.getCurrentUser().getShortUserName() > getUserName() returns the users full principal name (ie us...@corp.com). > getShortUserName() returns just the users login name (user1). > This just happens to work on YARN because the Client code sets: > env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
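To illustrate the difference described above (assumes Hadoop's UserGroupInformation is on the classpath; the principal shown is hypothetical):
{code}
import org.apache.hadoop.security.UserGroupInformation

val ugi = UserGroupInformation.getCurrentUser
println(ugi.getUserName)       // e.g. "user1@CORP.COM" (full Kerberos principal)
println(ugi.getShortUserName)  // e.g. "user1"          (login name)

// The fix described in the ticket: prefer SPARK_USER, then fall back to the short name.
val sparkUser = sys.env.getOrElse("SPARK_USER", ugi.getShortUserName)
{code}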
[jira] [Updated] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6406: - Assignee: Nishkam Ravi > Launcher backward compatibility issues > -- > > Key: SPARK-6406 > URL: https://issues.apache.org/jira/browse/SPARK-6406 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Nishkam Ravi >Assignee: Nishkam Ravi >Priority: Minor > Fix For: 1.4.0 > > > The new launcher library breaks backward compatibility. "hadoop" string in > the spark assembly should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6406. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5085 [https://github.com/apache/spark/pull/5085] > Launcher backward compatibility issues > -- > > Key: SPARK-6406 > URL: https://issues.apache.org/jira/browse/SPARK-6406 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Nishkam Ravi >Priority: Minor > Fix For: 1.4.0 > > > The new launcher library breaks backward compatibility. "hadoop" string in > the spark assembly should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4123) Show dependency changes in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4123. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5093 [https://github.com/apache/spark/pull/5093] > Show dependency changes in pull requests > > > Key: SPARK-4123 > URL: https://issues.apache.org/jira/browse/SPARK-4123 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Brennon York >Priority: Critical > Fix For: 1.4.0 > > > We should inspect the classpath of Spark's assembly jar for every pull > request. This only takes a few seconds in Maven and it will help weed out > dependency changes from the master branch. Ideally we'd post any dependency > changes in the pull request message. > {code} > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath > $ git checkout apache/master > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath > $ diff my-classpath master-classpath > < chill-java-0.3.6.jar > < chill_2.10-0.3.6.jar > --- > > chill-java-0.5.0.jar > > chill_2.10-0.5.0.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385716#comment-14385716 ] Dale Richardson edited comment on SPARK-6593 at 3/29/15 11:35 AM: -- With a gz file, for example, the entire file is a single split, so a corrupted gz file will kill the entire job - with no way of catching and remediating the error. was (Author: tigerquoll): With a gz file, for example, the entire file is a single split, so a corrupted gz file will kill the entire job. > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted > then the entire job is canceled. As default behaviour this is probably for > the best, but it would be nice in some circumstances where you know it will > be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385716#comment-14385716 ] Dale Richardson commented on SPARK-6593: With a gz file, for example, the entire file is a single split, so a corrupted gz file will kill the entire job. > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted > then the entire job is canceled. As default behaviour this is probably for > the best, but it would be nice in some circumstances where you know it will > be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6593: --- Assignee: Apache Spark > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Assignee: Apache Spark >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted > then the entire job is canceled. As default behaviour this is probably for > the best, but it would be nice in some circumstances where you know it will > be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385715#comment-14385715 ] Apache Spark commented on SPARK-6593: - User 'tigerquoll' has created a pull request for this issue: https://github.com/apache/spark/pull/5250 > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted > then the entire job is canceled. As default behaviour this is probably for > the best, but it would be nice in some circumstances where you know it will > be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6593: --- Assignee: (was: Apache Spark) > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted > then the entire job is canceled. As default behaviour this is probably for > the best, but it would be nice in some circumstances where you know it will > be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385706#comment-14385706 ] Sean Owen commented on SPARK-6593: -- At this level, though, what's a bad split? A line of text that doesn't parse as expected? That's application-level logic. Given how little the framework knows, this would amount to ignoring a partition if there was any error in computing it, which seems too coarse to encourage people to use. You can of course handle this in the application logic -- catch the error, return nothing, log it, add to a counter, etc. > Provide option for HadoopRDD to skip bad data splits. > - > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted > then the entire job is canceled. As default behaviour this is probably for > the best, but it would be nice in some circumstances where you know it will > be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
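To make the application-level workaround above concrete, here is a minimal sketch of one way to do it under the Spark 1.x RDD API, using the glob path from the issue description. The object name, the per-file probe, and the log message are assumptions of this example, not part of any proposed patch; the probe is also best-effort, since corruption deeper in a file only surfaces when the data is fully consumed.
{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Sketch of the application-level approach described above: read each gzipped
// file as its own RDD, probe it, and union only the files that can be read.
// Path and names are illustrative.
object SkipBadGzFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skip-bad-gz"))

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.globStatus(new Path("hdfs:///user/cloudera/logs*.gz"))
      .map(_.getPath.toString)

    // Keep only files whose first record can be read; log and skip the rest.
    // Note: this runs one small probe job per file.
    val readable = files.filter { f =>
      val ok = Try(sc.textFile(f).take(1)).isSuccess
      if (!ok) System.err.println(s"Skipping unreadable file: $f")
      ok
    }

    val logs = sc.union(readable.map(f => sc.textFile(f)).toSeq)
    println(s"Readable files: ${readable.length}; total lines: ${logs.count()}")

    sc.stop()
  }
}
{code}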
[jira] [Created] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
Dale Richardson created SPARK-6593: -- Summary: Provide option for HadoopRDD to skip bad data splits. Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile("hdfs:///user/cloudera/logs*.gz"). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385695#comment-14385695 ] Cheng Lian edited comment on SPARK-6587 at 3/29/15 10:32 AM: - This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type should the {{foo}} field in the reflected schema have. was (Author: lian cheng): This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type should the {{foo}} field in the reflected schema have. > Inferring schema for case class hierarchy fails with mysterious message > --- > > Key: SPARK-6587 > URL: https://issues.apache.org/jira/browse/SPARK-6587 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: At least Windows 8, Scala 2.11.2. >Reporter: Spiro Michaylov > > (Don't know if this is a functionality bug, error reporting bug or an RFE ...) > I define the following hierarchy: > {code} > private abstract class MyHolder > private case class StringHolder(s: String) extends MyHolder > private case class IntHolder(i: Int) extends MyHolder > private case class BooleanHolder(b: Boolean) extends MyHolder > {code} > and a top level case class: > {code} > private case class Thing(key: Integer, foo: MyHolder) > {code} > When I try to convert it: > {code} > val things = Seq( > Thing(1, IntHolder(42)), > Thing(2, StringHolder("hello")), > Thing(3, BooleanHolder(false)) > ) > val thingsDF = sc.parallelize(things, 4).toDF() > thingsDF.registerTempTable("things") > val all = sqlContext.sql("SELECT * from things") > {code} > I get the following stack trace: > {noformat} > Exception in thread "main" scala.MatchError: > sql.CaseClassSchemaProblem.MyHolder (of class > scala.reflect.internal.Types$ClassNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) > at scala.collection.immutable.List.map(List.scala:276) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) > at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) > at > org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) > at 
sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) > at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {noformat} > I wrote this to answer [a question on > StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] > which uses a much simpler approach and suffers the same problem. > Looking at what seems to me to be the [relevant unit test > suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] > I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385695#comment-14385695 ] Cheng Lian commented on SPARK-6587: --- This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type should the {{foo}} field in the reflected schema have. > Inferring schema for case class hierarchy fails with mysterious message > --- > > Key: SPARK-6587 > URL: https://issues.apache.org/jira/browse/SPARK-6587 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: At least Windows 8, Scala 2.11.2. >Reporter: Spiro Michaylov > > (Don't know if this is a functionality bug, error reporting bug or an RFE ...) > I define the following hierarchy: > {code} > private abstract class MyHolder > private case class StringHolder(s: String) extends MyHolder > private case class IntHolder(i: Int) extends MyHolder > private case class BooleanHolder(b: Boolean) extends MyHolder > {code} > and a top level case class: > {code} > private case class Thing(key: Integer, foo: MyHolder) > {code} > When I try to convert it: > {code} > val things = Seq( > Thing(1, IntHolder(42)), > Thing(2, StringHolder("hello")), > Thing(3, BooleanHolder(false)) > ) > val thingsDF = sc.parallelize(things, 4).toDF() > thingsDF.registerTempTable("things") > val all = sqlContext.sql("SELECT * from things") > {code} > I get the following stack trace: > {noformat} > Exception in thread "main" scala.MatchError: > sql.CaseClassSchemaProblem.MyHolder (of class > scala.reflect.internal.Types$ClassNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) > at scala.collection.immutable.List.map(List.scala:276) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) > at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) > at > org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) > at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) > at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {noformat} > I wrote this to answer [a question on > 
StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] > which uses a much simpler approach and suffers the same problem. > Looking at what seems to me to be the [relevant unit test > suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] > I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
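Given the two constraints described above, a common workaround is to flatten the hierarchy into a single concrete case class (a Product) so ScalaReflection can infer a schema. Below is a minimal sketch of that idea against the Spark 1.3 API; the Option-based encoding and the names FlatThing and FlatThingExample are assumptions of this example, not something prescribed in the issue.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of a workaround for the limitation described above: one concrete
// case class whose optional fields cover the variants, so schema inference
// sees a single Product type. This encoding is an illustration only.
case class FlatThing(key: Int, s: Option[String], i: Option[Int], b: Option[Boolean])

object FlatThingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flat-thing"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val things = Seq(
      FlatThing(1, None, Some(42), None),
      FlatThing(2, Some("hello"), None, None),
      FlatThing(3, None, None, Some(false))
    )

    // Schema inference now succeeds because FlatThing is a single concrete Product.
    val thingsDF = sc.parallelize(things, 4).toDF()
    thingsDF.registerTempTable("things")
    sqlContext.sql("SELECT * FROM things").show()

    sc.stop()
  }
}
{code}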
[jira] [Updated] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6587: -- Description: (Don't know if this is a functionality bug, error reporting bug or an RFE ...) I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder("hello")), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable("things") val all = sqlContext.sql("SELECT * from things") {code} I get the following stack trace: {noformat} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered. was: (Don't know if this is a functionality bug, error reporting bug or an RFE ...) 
I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder("hello")), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable("things") val all = sqlContext.sql("SELECT * from things") {code} I get the following stack trace: {quote} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
[jira] [Updated] (SPARK-6579) save as parquet with overwrite failed when linking with Hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6579: -- Summary: save as parquet with overwrite failed when linking with Hadoop 1.0.4 (was: save as parquet with overwrite failed) > save as parquet with overwrite failed when linking with Hadoop 1.0.4 > > > Key: SPARK-6579 > URL: https://issues.apache.org/jira/browse/SPARK-6579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0 >Reporter: Davies Liu >Assignee: Michael Armbrust >Priority: Critical > > {code} > df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * > 2,)).toDF(['int', 'str']) > df.save("test_data", source="parquet", mode='overwrite') > df.save("test_data", source="parquet", mode='overwrite') > {code} > it failed with: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in > stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 > (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call > toBytes() more than once without calling reset() > at parquet.Preconditions.checkArgument(Preconditions.java:47) > at > parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254) > at > parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68) > at > parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147) > at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236) > at > parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113) > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at 
scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > run it again, it failed with: > {code} > 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: > file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet. > Ignoring exception: java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readFully(DataInputStream.java:169) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:134) > at > org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283) > at org.apache.ha
[jira] [Commented] (SPARK-6579) save as parquet with overwrite failed
[ https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385694#comment-14385694 ] Cheng Lian commented on SPARK-6579: --- Here's another Parquet issue with Hadoop 1.0.4: SPARK-6581. > save as parquet with overwrite failed > - > > Key: SPARK-6579 > URL: https://issues.apache.org/jira/browse/SPARK-6579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0 >Reporter: Davies Liu >Assignee: Michael Armbrust >Priority: Critical > > {code} > df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * > 2,)).toDF(['int', 'str']) > df.save("test_data", source="parquet", mode='overwrite') > df.save("test_data", source="parquet", mode='overwrite') > {code} > it failed with: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in > stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 > (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call > toBytes() more than once without calling reset() > at parquet.Preconditions.checkArgument(Preconditions.java:47) > at > parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254) > at > parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68) > at > parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147) > at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236) > at > parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113) > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at scala.Option.foreach(Option.scala:236) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > run it again, it failed with: > {code} > 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: > file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet. > Ignoring exception: java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readFully(DataInputStream.java:169) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:134) > at > org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) > at > parquet.hadoop
[jira] [Assigned] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6580: --- Assignee: (was: Apache Spark) > Optimize LogisticRegressionModel.predictPoint > - > > Key: SPARK-6580 > URL: https://issues.apache.org/jira/browse/SPARK-6580 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > LogisticRegressionModel.predictPoint could be optimized some. There are > several checks which could be moved outside loops or even outside > predictPoint to initialization of the model. > Some include: > {code} > require(numFeatures == weightMatrix.size) > val dataWithBiasSize = weightMatrix.size / (numClasses - 1) > val weightsArray = weightMatrix match { ... > if (dataMatrix.size + 1 == dataWithBiasSize) {... > {code} > Also, for multiclass, the 2 loops (over numClasses and margins) could be > combined into 1 loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385664#comment-14385664 ] Apache Spark commented on SPARK-6580: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/5249 > Optimize LogisticRegressionModel.predictPoint > - > > Key: SPARK-6580 > URL: https://issues.apache.org/jira/browse/SPARK-6580 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > LogisticRegressionModel.predictPoint could be optimized some. There are > several checks which could be moved outside loops or even outside > predictPoint to initialization of the model. > Some include: > {code} > require(numFeatures == weightMatrix.size) > val dataWithBiasSize = weightMatrix.size / (numClasses - 1) > val weightsArray = weightMatrix match { ... > if (dataMatrix.size + 1 == dataWithBiasSize) {... > {code} > Also, for multiclass, the 2 loops (over numClasses and margins) could be > combined into 1 loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6580: --- Assignee: Apache Spark > Optimize LogisticRegressionModel.predictPoint > - > > Key: SPARK-6580 > URL: https://issues.apache.org/jira/browse/SPARK-6580 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > LogisticRegressionModel.predictPoint could be optimized some. There are > several checks which could be moved outside loops or even outside > predictPoint to initialization of the model. > Some include: > {code} > require(numFeatures == weightMatrix.size) > val dataWithBiasSize = weightMatrix.size / (numClasses - 1) > val weightsArray = weightMatrix match { ... > if (dataMatrix.size + 1 == dataWithBiasSize) {... > {code} > Also, for multiclass, the 2 loops (over numClasses and margins) could be > combined into 1 loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
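To illustrate the kind of refactoring the description asks for, the sketch below hoists model-only checks and derived values out of the per-point path and folds the multiclass margin computation and max search into a single loop. All class, field, and parameter names are invented; this is not Spark's actual LogisticRegressionModel code and it omits details of the real model.
{code}
import org.apache.spark.mllib.linalg.Vector

// Simplified sketch (not Spark's actual LogisticRegressionModel): work that
// depends only on the model -- argument checks, the flattened weights array,
// the per-class stride -- is done once at construction time instead of on
// every predictPoint call, and the multiclass margins are computed and
// maximized in a single pass.
class SketchLogisticModel(weights: Vector, intercept: Double,
                          numFeatures: Int, numClasses: Int) {

  // Hoisted model-only invariants (previously re-evaluated per point).
  require(numClasses >= 2, "numClasses must be at least 2")
  private val dataWithBiasSize: Int = weights.size / (numClasses - 1)
  require(dataWithBiasSize == numFeatures || dataWithBiasSize == numFeatures + 1,
    "weight vector size is inconsistent with numFeatures and numClasses")
  private val withBias: Boolean = dataWithBiasSize == numFeatures + 1
  private val weightsArray: Array[Double] = weights.toArray

  private def margin(dataMatrix: Vector, classIndex: Int): Double = {
    val offset = classIndex * dataWithBiasSize
    var m = if (withBias) weightsArray(offset + numFeatures) else 0.0
    var j = 0
    while (j < numFeatures) {
      m += weightsArray(offset + j) * dataMatrix(j)
      j += 1
    }
    m
  }

  def predictPoint(dataMatrix: Vector, threshold: Double = 0.5): Double = {
    require(dataMatrix.size == numFeatures, "feature dimension mismatch")
    if (numClasses == 2) {
      // Binary case: one margin, logistic link, threshold.
      val score = 1.0 / (1.0 + math.exp(-(margin(dataMatrix, 0) + intercept)))
      if (score > threshold) 1.0 else 0.0
    } else {
      // Multiclass case: compute each class margin and track the best one in
      // the same loop, instead of filling a margins array and scanning it again.
      var bestClass = 0
      var bestMargin = 0.0
      var k = 0
      while (k < numClasses - 1) {
        val m = margin(dataMatrix, k)
        if (m > bestMargin) { bestMargin = m; bestClass = k + 1 }
        k += 1
      }
      bestClass.toDouble
    }
  }
}
{code}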