[jira] [Commented] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957080#comment-14957080 ] Thomas Graves commented on SPARK-10873: --- Has anyone discussed just using something like jquery datatables or similar, which automatically gives us search (https://issues.apache.org/jira/browse/SPARK-10874), sort, pagination, etc.? I'm not sure how well the row spanning works with datatables, but it seems it's possible: http://www.datatables.net/examples/advanced_init/row_grouping.html I know there are others like jqgrid, but I'm by no means a UI expert and have used datatables some in Hadoop. [~rxin] [~zsxwing] thoughts on using something like the jquery datatables? What about just using it on certain pages, like the history page, first? The downside is pages might look different. As more and more people use Spark, being able to use the history page and debug is becoming a bigger and bigger issue. > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.1 >Reporter: Thomas Graves > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957144#comment-14957144 ] Sandy Ryza commented on SPARK-9999: --- Maybe you all have thought through this as well, but I had some more thoughts on the proposed API. Fundamentally, it seems weird to me that the user is responsible for having a matching Encoder around every time they want to map to a class of a particular type. In 99% of cases, the Encoder used to encode any given type will be the same, and it seems more intuitive to me to specify this up front. To be more concrete, suppose I want to use case classes in my app and have a function that can auto-generate an Encoder from a class object (though this might be a little bit time consuming because it needs to use reflection). With the current proposal, any time I want to map my Dataset to a Dataset of some case class, I need to either have a line of code that generates an Encoder for that case class, or have an Encoder already lying around. If I perform this operation within a method, I need to pass the Encoder down to the method and include it in the signature. Ideally I would be able to register an EncoderSystem up front that caches Encoders and generates new Encoders whenever it sees a new class used. This still of course requires the user to pass in type information when they call map, but it's easier for them to get this information than an actual encoder. If there's not some principled way to get this working implicitly with ClassTags, the user could just pass in classOf[MyCaseClass] as the second argument to map. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. 
Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
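A minimal sketch of the two call shapes Sandy contrasts above: requiring an Encoder for the target type on every map, versus passing a class object and letting a registered EncoderSystem supply the Encoder. The Dataset, Encoder, Person and mapTo below are simplified stand-ins for illustration only, not the API actually proposed in marmbrus/dataset-api.
{code}
// Simplified stand-ins, not Spark classes: just enough to show the two shapes.
trait Encoder[T] {
  def schema: String
}

class Dataset[T](data: Seq[T]) {
  // Shape 1 (current proposal): every map to a new type needs an Encoder
  // for that type, passed explicitly or found implicitly.
  def map[U](f: T => U)(implicit enc: Encoder[U]): Dataset[U] =
    new Dataset(data.map(f))

  // Shape 2 (Sandy's suggestion): pass the class object and let a registered
  // EncoderSystem cache or generate the Encoder behind the scenes.
  def mapTo[U](f: T => U, clazz: Class[U]): Dataset[U] =
    new Dataset(data.map(f))
}

case class Person(name: String, age: Int)

object EncoderShapes {
  // Boilerplate the caller has to carry around (or regenerate) under shape 1.
  implicit val personEncoder: Encoder[Person] = new Encoder[Person] {
    val schema = "name: string, age: int"
  }

  def main(args: Array[String]): Unit = {
    val ds = new Dataset(Seq(("alice", 30), ("bob", 25)))
    val viaEncoder = ds.map(t => Person(t._1, t._2)) // needs an Encoder[Person] in scope
    val viaClass = ds.mapTo((t: (String, Int)) => Person(t._1, t._2), classOf[Person])
    println(s"${viaEncoder.getClass.getSimpleName} / ${viaClass.getClass.getSimpleName}")
  }
}
{code}
Either way the caller supplies type information at the call site; the difference is whether they also have to supply, or thread through, the Encoder itself.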
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957170#comment-14957170 ] Xiao Li commented on SPARK-10925: - Hi, Alexis, The schema of your query results has the duplicate column names. In your test case, you just need to fix one line: val cardinalityDF2 = df4.groupBy("surname") .agg(count("surname").as("cardinality_surname")) --> val cardinalityDF2 = df4.groupBy("surname") .agg(count("surname").as("cardinality_surname")).withColumnRenamed("surname", "surname_new") cardinalityDF2.show() I think Spark SQL should detect the problem in the earlier stage. I will try to fix the problem and output an error message. Let me know if you have more questions. Thanks! Xiao Li > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at
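For readability, here is the one-line change Xiao Li suggests above, pulled out of the paragraph and formatted; the sample df4 is a made-up stand-in for the one built in the attached TestCase2.scala (only the "surname" column matters), and sqlContext is the spark-shell SQLContext.
{code}
import org.apache.spark.sql.functions.count

// Hypothetical stand-in for the df4 built in TestCase2.scala.
val df4 = sqlContext.createDataFrame(Seq(
  (1, "Jane", "Doe", "1990-01-01"),
  (2, "John", "Doe", "1985-05-05")
)).toDF("id", "name", "surname", "birthDate")

val cardinalityDF2 = df4
  .groupBy("surname")
  .agg(count("surname").as("cardinality_surname"))
  // rename before re-joining so the joined schema has no duplicate "surname" column
  .withColumnRenamed("surname", "surname_new")

cardinalityDF2.show()
{code}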
[jira] [Assigned] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2533: --- Assignee: (was: Apache Spark) > Show summary of locality level of completed tasks in the each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of web UI. It would be better to show the summary of task locality > level in web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2533: --- Assignee: Apache Spark > Show summary of locality level of completed tasks in the each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Assignee: Apache Spark >Priority: Minor > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of web UI. It would be better to show the summary of task locality > level in web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957131#comment-14957131 ] Apache Spark commented on SPARK-2533: - User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9117 > Show summary of locality level of completed tasks in the each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of web UI. It would be better to show the summary of task locality > level in web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11102) Unreadable exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11102: --- Summary: Unreadable exception when specifing non-exist input for JSON data source (was: Not readable exception when specifing non-exist input for JSON data source) > Unreadable exception when specifing non-exist input for JSON data source > > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
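A minimal spark-shell reproduction sketch for the report above; the path is made up, and any location that does not exist triggers the same error.
{code}
// Reading JSON from a non-existent path surfaces the raw Hadoop error:
// java.io.IOException: No input paths specified in job
val df = sqlContext.read.json("/tmp/path/that/does/not/exist")
{code}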
[jira] [Updated] (SPARK-10876) display total application time in spark history UI
[ https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-10876: -- Assignee: Jean-Baptiste Onofré > display total application time in spark history UI > -- > > Key: SPARK-10876 > URL: https://issues.apache.org/jira/browse/SPARK-10876 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Jean-Baptiste Onofré > > The history file has an application start and application end events. It > would be nice if we could use these to display the total run time for the > application in the history UI. > Could be displayed similar to "Total Uptime" for a running application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed
[ https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957232#comment-14957232 ] Marcelo Vanzin commented on SPARK-11098: I'm not explicitly working on this a.t.m.. > RPC message ordering is not guaranteed > -- > > Key: SPARK-11098 > URL: https://issues.apache.org/jira/browse/SPARK-11098 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > NettyRpcEnv doesn't guarantee message delivery order since there are multiple > threads sending messages in clientConnectionExecutor thread pool. We should > fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11108) OneHotEncoder should support other numeric input types
[ https://issues.apache.org/jira/browse/SPARK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11108: -- Description: See parent JIRA for more info. Also see [SPARK-10513] for motivation behind issue. was:See parent JIRA for more info. > OneHotEncoder should support other numeric input types > -- > > Key: SPARK-11108 > URL: https://issues.apache.org/jira/browse/SPARK-11108 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > See parent JIRA for more info. > Also see [SPARK-10513] for motivation behind issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler
[ https://issues.apache.org/jira/browse/SPARK-11040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11040. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 1.6.0 > SaslRpcHandler does not delegate all methods to underlying handler > -- > > Key: SPARK-11040 > URL: https://issues.apache.org/jira/browse/SPARK-11040 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.6.0 > > > {{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so > when SASL is enabled, other events will be missed by apps. > This affects other version too, but I think these events aren't actually used > there. They'll be used by the new rpc backend in 1.6, though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11108) OneHotEncoder should support other numeric input types
Joseph K. Bradley created SPARK-11108: - Summary: OneHotEncoder should support other numeric input types Key: SPARK-11108 URL: https://issues.apache.org/jira/browse/SPARK-11108 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Minor See parent JIRA for more info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason C Lee updated SPARK-10943: Comment: was deleted (was: I'd like to work on this. Thanx) > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > //FAIL - Try writing a NullType column (where all the values are NULL) > data02.write.parquet("/tmp/celtra-test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94) > at >
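A commonly used workaround, not taken from this ticket: cast the all-NULL column to a concrete type before writing, so Parquet sees a supported data type. A spark-shell sketch reusing the data02 from the description:
{code}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null as comments")

// Replace the NullType column with an explicitly typed (still all-null) column.
val data02Typed = data02.withColumn("comments", lit(null).cast(StringType))
data02Typed.write.parquet("/tmp/celtra-test/dataset2")
{code}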
[jira] [Updated] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-11099: --- Affects Version/s: (was: 1.5.1) (Removing affected version. This code does not exist in branch-1.5.) > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > Map<String, String> getEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11105) Dsitribute the log4j.properties files from the client to the executors
Srinivasa Reddy Vundela created SPARK-11105: --- Summary: Dsitribute the log4j.properties files from the client to the executors Key: SPARK-11105 URL: https://issues.apache.org/jira/browse/SPARK-11105 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.1 Reporter: Srinivasa Reddy Vundela Priority: Minor The log4j.properties file from the client is not distributed to the executors. This means that the client settings are not applied to the executors and they run with the default settings. This affects troubleshooting and data gathering. The workaround is to use the --files option for spark-submit to propagate the log4j.properties file -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957301#comment-14957301 ] Zhan Zhang commented on SPARK-11087: I will take a look at this one. > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database: default > Owner:patcharee > CreateTime: Thu Jul 09 16:46:54 CEST 2015 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D > > Table Type: EXTERNAL_TABLE > Table Parameters: > EXTERNALTRUE > comment this table is imported from rwf_data/*/wrf/* > last_modified_bypatcharee > last_modified_time 1439806692 > orc.compressZLIB > transient_lastDdlTime 1439806692 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.388 seconds, Fetched: 58 row(s) > > Data was inserted into this table by another spark job> >
[jira] [Assigned] (SPARK-11078) Ensure spilling tests are actually spilling
[ https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-11078: - Assignee: Andrew Or > Ensure spilling tests are actually spilling > --- > > Key: SPARK-11078 > URL: https://issues.apache.org/jira/browse/SPARK-11078 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Andrew Or >Assignee: Andrew Or > > The new unified memory management model in SPARK-10983 uncovered many brittle > tests that rely on arbitrary thresholds to detect spilling. Some tests don't > even assert that spilling did occur. > We should go through all the places where we test spilling behavior and > correct the tests, a subset of which are definitely incorrect. Potential > suspects: > - UnsafeShuffleSuite > - ExternalAppendOnlyMapSuite > - ExternalSorterSuite > - SQLQuerySuite > - DistributedSuite -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957194#comment-14957194 ] Cheolsoo Park commented on SPARK-6910: -- You're right that the 2nd query is faster because the table/partition metadata is cached. In particular, if {{spark.sql.hive.metastorePartitionPruning}} is set to false (the default), Spark will cache metadata for all the partitions, and any query against the same table will run faster even with a different predicate. See the relevant code [here|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L830-L839]. > Support for pushing predicates down to metastore for partition pruning > -- > > Key: SPARK-6910 > URL: https://issues.apache.org/jira/browse/SPARK-6910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheolsoo Park >Priority: Critical > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
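For reference, the flag discussed above is an ordinary SQL conf; a small sketch of opting in to metastore-side pruning (the table and column names are hypothetical, and sqlContext is a spark-shell HiveContext):
{code}
// With the default (false), Spark fetches and caches metadata for every partition;
// setting it to true asks the Hive metastore to return only the matching partitions.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.sql("select * from some_partitioned_table where part_col = 42").show()
{code}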
[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive
[ https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957230#comment-14957230 ] Marcelo Vanzin commented on SPARK-11097: Hi [~rxin], can you explain what would be the use case for this? Is it just to simplify the code? I'm working on SPARK-10997 and I have changed code around that area a lot. I was able to simplify the code without the need for a connection established callback. > Add connection established callback to lower level RPC layer so we don't need > to check for new connections in NettyRpcHandler.receive > - > > Key: SPARK-11097 > URL: https://issues.apache.org/jira/browse/SPARK-11097 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > I think we can remove the check for new connections in > NettyRpcHandler.receive if we just add a channel registered callback to the > lower level network module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors
[ https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivasa Reddy Vundela updated SPARK-11105: Summary: Disitribute the log4j.properties files from the client to the executors (was: Dsitribute the log4j.properties files from the client to the executors) > Disitribute the log4j.properties files from the client to the executors > --- > > Key: SPARK-11105 > URL: https://issues.apache.org/jira/browse/SPARK-11105 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Priority: Minor > > The log4j.properties file from the client is not distributed to the > executors. This means that the client settings are not applied to the > executors and they run with the default settings. > This affects troubleshooting and data gathering. > The workaround is to use the --files option for spark-submit to propagate the > log4j.properties file -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10619) Can't sort columns on Executor Page
[ https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10619. - Resolution: Fixed Fix Version/s: 1.6.0 1.5.2 > Can't sort columns on Executor Page > --- > > Key: SPARK-10619 > URL: https://issues.apache.org/jira/browse/SPARK-10619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Fix For: 1.5.2, 1.6.0 > > > I am using spark 1.5 running on yarn and go to the executors page. It won't > allow sorting of the columns. This used to work in Spark 1.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11106) Should ML Models contains single models or Pipelines?
Joseph K. Bradley created SPARK-11106: - Summary: Should ML Models contains single models or Pipelines? Key: SPARK-11106 URL: https://issues.apache.org/jira/browse/SPARK-11106 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Critical This JIRA is for discussing whether an ML Estimators should do feature processing. h2. Issue Currently, almost all ML Estimators require strict input types. E.g., DecisionTreeClassifier requires that the label column be Double type and have metadata indicating the number of classes. This requires users to know how to preprocess data. h2. Ideal workflow A user should be able to pass any reasonable data to a Transformer or Estimator and have it "do the right thing." E.g.: * If DecisionTreeClassifier is given a String column for labels, it should know to index the Strings. * See [SPARK-10513] for a similar issue with OneHotEncoder. h2. Possible solutions There are a few solutions I have thought of. Please comment with feedback or alternative ideas! h3. Leave as is Pro: The current setup is good in that it forces the user to be very aware of what they are doing. Feature transformations will not happen silently. Con: The user has to write boilerplate code for transformations. The API is not what some users would expect; e.g., coming from R, a user might expect some automatic transformations. h3. All Transformers can contain PipelineModels We could allow all Transformers and Models to contain arbitrary PipelineModels. E.g., if a DecisionTreeClassifier were given a String label column, it might return a Model which contains a simple fitted PipelineModel containing StringIndexer + DecisionTreeClassificationModel. The API could present this to the user, or it could be hidden from the user. Ideally, it would be hidden from the beginner user, but accessible for experts. The main problem is that we might have to break APIs. E.g., OneHotEncoder may need to do indexing if given a String input column. This means it should no longer be a Transformer; it should be an Estimator. h3. All Estimators should use RFormula The best option I have thought of is to make RFormula be the primary method for automatic feature transformation. We could start adding an RFormula Param to all Estimators, and it could handle most of these feature transformation issues. We could maintain old APIs: * If a user sets the input column names, then those can be used in the traditional (no automatic transformation) way. * If a user sets the RFormula Param, then it can be used instead. (This should probably take precedence over the old API.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
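To make the last option concrete, a hedged sketch of what an RFormula-based workflow could look like with the existing spark.ml RFormula transformer; df, its string "label" column, and the output column names are assumptions, not part of the proposal above.
{code}
import org.apache.spark.ml.feature.RFormula

// RFormula indexes string labels and assembles the remaining columns into a
// features vector, which is the kind of automatic preprocessing discussed above.
val formula = new RFormula()
  .setFormula("label ~ .")
  .setFeaturesCol("features")
  .setLabelCol("indexedLabel")

val prepared = formula.fit(df).transform(df)
prepared.select("features", "indexedLabel").show()
{code}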
[jira] [Commented] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957375#comment-14957375 ] Marcelo Vanzin commented on SPARK-10873: What I mean is that while replacing the sorting library is sort of easy, by itself it doesn't really solve the problem. Pagination is currently done in the backend, meaning the backend will generate hardcoded HTML with the current page, instead of something that can be easily consumed by a client-side library to do pagination and sorting on the client. > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.1 >Reporter: Thomas Graves > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10513) Springleaf Marketing Response
[ https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957345#comment-14957345 ] Joseph K. Bradley commented on SPARK-10513: --- [~yanboliang] This is really helpful feedback. Thanks very much for taking the time! I'll try to list plans for addressing the various issues you found: 1. Here's the closest issue I could find for spark-csv: [https://github.com/databricks/spark-csv/issues/48] Would you mind commenting there to try to escalate the issue? 2. What would be your ideal way to write this in the DataFrame API? Something like {{train.withColumn(train("label").cast(DoubleType).as("label")).na.drop()}}? (I think that almost works now, but I'm not actually sure if the cast works or fails when it encounters empty Strings.) 3. Just made a JIRA: [SPARK-11108] 4. Do you mean a completely missing value? Or do you mean that StringIndexer should handle an empty String differently? 5. Multi-value support for transformers: [SPARK-8418] 6. Here's some more detailed discussion which I just wrote down: [SPARK-11106] I haven't yet looked at your example code, but will try to soon. Thanks again for working on this! > Springleaf Marketing Response > - > > Key: SPARK-10513 > URL: https://issues.apache.org/jira/browse/SPARK-10513 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Apply ML pipeline API to Springleaf Marketing Response > (https://www.kaggle.com/c/springleaf-marketing-response) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
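A rough expansion of the one-liner floated in point 2 above, using the two-argument form of withColumn; train and its "label" column are assumptions carried over from the comment.
{code}
import org.apache.spark.sql.types.DoubleType

// Casting a string label to double yields null for empty/non-numeric values,
// which na.drop() then removes.
val cleaned = train
  .withColumn("label", train("label").cast(DoubleType))
  .na.drop()
{code}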
[jira] [Created] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException
Steve Loughran created SPARK-11109: -- Summary: move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException Key: SPARK-11109 URL: https://issues.apache.org/jira/browse/SPARK-11109 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Steve Loughran Priority: Minor {{FsHistoryProvider}} imports and uses {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been superseded by its subclass {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to that subclass would remove a deprecation warning and ensure that, were the Hadoop team to remove that old class (as HADOOP-11356 has currently done to trunk), everything still compiles and links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957210#comment-14957210 ] Marcelo Vanzin commented on SPARK-10873: As part of trying to fix SPARK-10172 I played with jquery datatables, and it works fine even with rowspan. But I thought it would be too big a change for the 1.5 branch. Also, I still believe that sorting with the current pagination code is confusing and not very helpful. To enable proper sorting / searching, the backend would need to be changed to support something more dynamic, so that the client can make the decision about what to show and how. > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.1 >Reporter: Thomas Graves > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11099: --- Description: spark.driver.extraClassPath doesn't take effect in the latest code, and find the root cause is due to the default conf property file is not loaded The bug is caused by this code snippet in AbstractCommandBuilder {code} Map<String, String> getEffectiveConfig() throws IOException { if (effectiveConfig == null) { if (propertiesFile == null) { effectiveConfig = conf; // return from here if no propertyFile is provided } else { effectiveConfig = new HashMap<>(conf); Properties p = loadPropertiesFile();// default propertyFile will load here for (String key : p.stringPropertyNames()) { if (!effectiveConfig.containsKey(key)) { effectiveConfig.put(key, p.getProperty(key)); } } } } return effectiveConfig; } {code} was: spark.driver.extraClassPath doesn't take effect in the latest code, and find the root cause is due to the default conf property file is not loaded The bug is caused by this code snippet in AbstractCommandBuilder {code} Map<String, String> getEffectiveConfig() throws IOException { if (effectiveConfig == null) { if (propertiesFile == null) { effectiveConfig = conf; } else { effectiveConfig = new HashMap<>(conf); Properties p = loadPropertiesFile(); for (String key : p.stringPropertyNames()) { if (!effectiveConfig.containsKey(key)) { effectiveConfig.put(key, p.getProperty(key)); } } } } return effectiveConfig; } {code} > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > Map<String, String> getEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956341#comment-14956341 ] Sandy Ryza commented on SPARK-9999: --- Thanks for the explanation [~rxin] and [~marmbrus]. I understand the problem and don't have any great ideas for an alternative workable solution. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2
Reynold Xin created SPARK-11096: --- Summary: Post-hoc review Netty based RPC implementation - round 2 Key: SPARK-11096 URL: https://issues.apache.org/jira/browse/SPARK-11096 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2
[ https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956390#comment-14956390 ] Apache Spark commented on SPARK-11096: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/9112 > Post-hoc review Netty based RPC implementation - round 2 > > > Key: SPARK-11096 > URL: https://issues.apache.org/jira/browse/SPARK-11096 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2
[ https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11096: Assignee: Apache Spark (was: Reynold Xin) > Post-hoc review Netty based RPC implementation - round 2 > > > Key: SPARK-11096 > URL: https://issues.apache.org/jira/browse/SPARK-11096 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive
Reynold Xin created SPARK-11097: --- Summary: Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive Key: SPARK-11097 URL: https://issues.apache.org/jira/browse/SPARK-11097 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin I think we can remove the check for new connections in NettyRpcHandler.receive if we just add a channel registered callback to the lower level network module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive
[ https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956394#comment-14956394 ] Reynold Xin commented on SPARK-11097: - cc [~vanzin] do you have time to do this? > Add connection established callback to lower level RPC layer so we don't need > to check for new connections in NettyRpcHandler.receive > - > > Key: SPARK-11097 > URL: https://issues.apache.org/jira/browse/SPARK-11097 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > I think we can remove the check for new connections in > NettyRpcHandler.receive if we just add a channel registered callback to the > lower level network module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956367#comment-14956367 ] Xiao Li edited comment on SPARK-10925 at 10/14/15 7:16 AM: --- Also hit the same problem, but this is not related to UDF. Trying to narrow down the root cause of the analyzer internal. was (Author: smilegator): Also hit the same problem. Trying to narrow down the root cause of the analyzer internal. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Created] (SPARK-11099) Default conf property file is not loaded
Jeff Zhang created SPARK-11099: -- Summary: Default conf property file is not loaded Key: SPARK-11099 URL: https://issues.apache.org/jira/browse/SPARK-11099 Project: Spark Issue Type: Bug Reporter: Jeff Zhang Priority: Critical

spark.driver.extraClassPath doesn't take effect in the latest code; the root cause is that the default conf property file is not loaded. The bug is caused by this code snippet in AbstractCommandBuilder:

{code}
Map<String, String> getEffectiveConfig() throws IOException {
  if (effectiveConfig == null) {
    if (propertiesFile == null) {
      effectiveConfig = conf;
    } else {
      effectiveConfig = new HashMap<>(conf);
      Properties p = loadPropertiesFile();
      for (String key : p.stringPropertyNames()) {
        if (!effectiveConfig.containsKey(key)) {
          effectiveConfig.put(key, p.getProperty(key));
        }
      }
    }
  }
  return effectiveConfig;
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
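The snippet above skips loadPropertiesFile() entirely when no explicit properties file is passed, so the default spark-defaults.conf is never merged in. Below is a minimal Scala sketch of the merge behaviour the report argues for; the real AbstractCommandBuilder is Java and the names here are illustrative, not the actual patch.

{code}
// Minimal sketch (assumed names, not the real launcher code): always overlay the
// properties file -- the one supplied explicitly, or the default spark-defaults.conf --
// underneath the values that were already set, instead of skipping the load entirely.
import java.io.{File, FileInputStream}
import java.util.Properties
import scala.collection.JavaConverters._

def effectiveConfig(conf: Map[String, String],
                    propertiesFile: Option[File],
                    defaultConfFile: File): Map[String, String] = {
  val fileToLoad = propertiesFile.getOrElse(defaultConfFile) // fall back to the default file
  val props = new Properties()
  if (fileToLoad.isFile) {
    val in = new FileInputStream(fileToLoad)
    try props.load(in) finally in.close()
  }
  // File values only fill in keys that were not already set explicitly.
  val fromFile = props.stringPropertyNames().asScala
    .map(key => key -> props.getProperty(key)).toMap
  fromFile ++ conf
}
{code}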
[jira] [Created] (SPARK-11098) RPC message ordering is not guaranteed
Reynold Xin created SPARK-11098: --- Summary: RPC message ordering is not guaranteed Key: SPARK-11098 URL: https://issues.apache.org/jira/browse/SPARK-11098 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin NettyRpcEnv doesn't guarantee message delivery order since there are multiple threads sending messages in the clientConnectionExecutor thread pool. We should fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
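One common way to get per-peer ordering back, sketched below purely as an illustration (the class and method names are made up, not NettyRpcEnv's real ones), is to funnel every send to a given remote address through a single-threaded queue, so messages to that peer leave in the order they were enqueued no matter which caller thread produced them.

{code}
// Illustration only -- not NettyRpcEnv's actual design. All sends to one remote
// address share a single-threaded executor, giving FIFO delivery per peer while
// different peers can still be served concurrently.
import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}

class OrderedSender(transport: (String, Array[Byte]) => Unit) {
  private val senders = new ConcurrentHashMap[String, ExecutorService]()

  def send(remoteAddress: String, message: Array[Byte]): Unit = {
    val executor = senders.computeIfAbsent(remoteAddress,
      new java.util.function.Function[String, ExecutorService] {
        override def apply(key: String): ExecutorService =
          Executors.newSingleThreadExecutor()
      })
    // Same peer => same thread => messages go out in enqueue order.
    executor.execute(new Runnable {
      override def run(): Unit = transport(remoteAddress, message)
    })
  }
}
{code}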
[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed
[ https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956396#comment-14956396 ] Reynold Xin commented on SPARK-11098: - [~vanzin] zsxwing told me you were working on this. Let me know if it is not the case. > RPC message ordering is not guaranteed > -- > > Key: SPARK-11098 > URL: https://issues.apache.org/jira/browse/SPARK-11098 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > NettyRpcEnv doesn't guarantee message delivery order since there are multiple > threads sending messages in clientConnectionExecutor thread pool. We should > fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11099: --- Component/s: Spark Submit > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile(); > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956367#comment-14956367 ] Xiao Li commented on SPARK-10925: - Also hit the same problem. Trying to narrow down the root cause of the analyzer internal. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at
[jira] [Assigned] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2
[ https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11096: Assignee: Reynold Xin (was: Apache Spark) > Post-hoc review Netty based RPC implementation - round 2 > > > Key: SPARK-11096 > URL: https://issues.apache.org/jira/browse/SPARK-11096 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956399#comment-14956399 ] Jeff Zhang commented on SPARK-11099: Will create a pull request soon > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile(); > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11099: --- Component/s: Spark Shell > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile(); > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect
[ https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956453#comment-14956453 ] Weizhong commented on SPARK-11083: -- I have retested on the latest master branch (ending at commit ce3f9a80657751ee0bc0ed6a9b6558acbb40af4f, [SPARK-11091] [SQL] Change spark.sql.canonicalizeView to spark.sql.nativeView.) and this issue has been fixed. But I am not yet sure which PR fixed it. > insert overwrite table failed when beeline reconnect > > > Key: SPARK-11083 > URL: https://issues.apache.org/jira/browse/SPARK-11083 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Spark: master branch > Hadoop: 2.7.1 > JDK: 1.8.0_60 >Reporter: Weizhong > > 1. Start Thriftserver > 2. Use beeline to connect to the thriftserver, then execute an "insert overwrite > table_name ..." clause -- success > 3. Exit beeline > 4. Reconnect to the thriftserver, and then execute an "insert overwrite table_name > ..." clause. -- failed > {noformat} > 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) > at > org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:144) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:129) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224) > at >
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > source >
[jira] [Resolved] (SPARK-11101) pipe() operation OOM
[ https://issues.apache.org/jira/browse/SPARK-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11101. --- Resolution: Invalid If it's a question, you should ask at u...@spark.apache.org, not make a JIRA. It may have nothing to do with your process, though you do need to verify how much it uses. There is little margin in the YARN allocation for off-heap memory, so you probably have to increase this value, yes. > pipe() operation OOM > > > Key: SPARK-11101 > URL: https://issues.apache.org/jira/browse/SPARK-11101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: spark on yarn >Reporter: hotdog > Original Estimate: 72h > Remaining Estimate: 72h > > when using the pipe() operation with large data (10TB), the pipe() operation > always OOMs. > I use pipe() to call an external C++ process. I'm sure the C++ program only > uses a little memory (about 1MB). > my parameters: > executor-memory 16g > executor-cores 4 > num-executors 400 > "spark.yarn.executor.memoryOverhead", "8192" > partition number: 6 > does the pipe() operation use a lot of off-heap memory? > the log is: > killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory > used. Consider boosting spark.yarn.executor.memoryOverhead. > should I continue boosting spark.yarn.executor.memoryOverhead? Or are there > some bugs in the pipe() operation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
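For reference, the overhead setting the resolution points at can be raised programmatically before the context is created (or with --conf at submit time). The sketch below is only an illustration: the 12288 MB value is an arbitrary next step to try, and the input path, output path and external binary are placeholders. YARN's physical-memory check covers the whole container process tree, so the external process started by pipe() typically counts toward the same limit.

{code}
// Illustration only: larger off-heap headroom for a pipe() job; all paths and the
// app name below are hypothetical placeholders.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("pipe-job")
  .set("spark.yarn.executor.memoryOverhead", "12288") // MB per executor; was 8192 in the report

val sc = new SparkContext(conf)
sc.textFile("hdfs:///path/to/input")        // placeholder input
  .pipe("/path/to/external-binary")         // placeholder external C++ program
  .saveAsTextFile("hdfs:///path/to/output") // placeholder output
{code}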
[jira] [Assigned] (SPARK-11100) HiveThriftServer not registering with Zookeeper
[ https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11100: Assignee: Apache Spark > HiveThriftServer not registering with Zookeeper > --- > > Key: SPARK-11100 > URL: https://issues.apache.org/jira/browse/SPARK-11100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Hive-1.2.1 > Hadoop-2.6.0 >Reporter: Xiaoyu Wang >Assignee: Apache Spark > > hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start the thrift server:
> {code}
> start-thriftserver.sh --master yarn
> {code}
> The znode "sparkhiveserver2" is not found in ZooKeeper. HiveServer2 itself works with this config! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11099: Assignee: (was: Apache Spark) > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956470#comment-14956470 ] Apache Spark commented on SPARK-11099: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/9114 > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11099: Assignee: Apache Spark > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11102: --- Issue Type: Improvement (was: Bug) > Not readable exception when specifing non-exist input for JSON data source > -- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at 
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
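A more readable failure here mostly means checking the input path before schema inference starts. The helper below is only a sketch of that kind of fail-fast guard (its name and the exception type are made up, not the actual patch).

{code}
// Illustrative fail-fast guard (not the actual fix): resolve the path against the
// Hadoop FileSystem and raise a message naming the missing location, instead of
// letting FileInputFormat fail later with "No input paths specified in job".
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def requireExistingPath(pathString: String, hadoopConf: Configuration): Unit = {
  val path = new Path(pathString)
  val fs = path.getFileSystem(hadoopConf)
  if (!fs.exists(path)) {
    throw new IllegalArgumentException(
      s"Input path does not exist: $pathString (passed to sqlContext.read.json)")
  }
}
{code}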
[jira] [Created] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source
Jeff Zhang created SPARK-11102: -- Summary: Not readable exception when specifing non-exist input for JSON data source Key: SPARK-11102 URL: https://issues.apache.org/jira/browse/SPARK-11102 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Jeff Zhang If I specify a non-exist input path for json data source, the following exception will be thrown, it is not readable. {code} 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.9 KB, free 251.4 KB) 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at :19 java.io.IOException: No input paths specified in job at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) at org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) at org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) at $iwC$$iwC$$iwC$$iwC.(:30) at $iwC$$iwC$$iwC.(:32) at $iwC$$iwC.(:34) at $iwC.(:36) {code} -- This message 
was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11102: --- Priority: Minor (was: Major) > Not readable exception when specifing non-exist input for JSON data source > -- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at 
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956491#comment-14956491 ] Jeff Zhang commented on SPARK-11102: Will create a pull request soon > Not readable exception when specifing non-exist input for JSON data source > -- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at 
org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956508#comment-14956508 ] qian, chen edited comment on SPARK-6910 at 10/14/15 9:03 AM: - I'm using spark-sql (spark version 1.5.1 && hadoop 2.4.0) and found a very interesting thing: in spark-sql shell: at first I ran this, it took about 3 minutes select * from table1 where date='20151010' and hour='12' and name='x' limit 5; Time taken: 164.502 seconds then I ran this, it only took 10s. date, hour and name are partition columns in this hive table. this table has >4000 partitions select * from table1 where date='20151010' and hour='13' limit 5; Time taken: 10.881 seconds is it because that the first time I need to download all partition information from hive metastore? the second query is faster because all partitions are cached in memory now? any suggestions about speeding up the first query? was (Author: nedqian): I'm using spark-sql (spark version 1.5.1 && hadoop 2.4.0) and found a very interesting thing: in spark-sql shell: at first I ran this, it took about 3 minutes select * from table1 where date='20151010' and hour='12' and name='x' limit 5; Time taken: 164.502 seconds then I ran this, it only took 10s. date, hour and name are partition columns in this hive table. this table has >4000 partitions select * from table1 where date='20151010' and hour='13' limit 5; Time taken: 10.881 seconds is it because that the first time I need to download all partition information from hive metastore? the second query is faster because all partitions are cached in memory now? > Support for pushing predicates down to metastore for partition pruning > -- > > Key: SPARK-6910 > URL: https://issues.apache.org/jira/browse/SPARK-6910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheolsoo Park >Priority: Critical > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7099) Floating point literals cannot be specified using exponent
[ https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7099. -- Resolution: Not A Problem > Floating point literals cannot be specified using exponent > -- > > Key: SPARK-7099 > URL: https://issues.apache.org/jira/browse/SPARK-7099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 > Environment: Windows, Linux, Mac OS X >Reporter: Peter Hagelund >Priority: Minor > > Floating point literals cannot be expressed in scientific notation using an > exponent, like e.g. 1.23E4. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11101) pipe() operation OOM
[ https://issues.apache.org/jira/browse/SPARK-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hotdog updated SPARK-11101: --- Description: when using pipe() operation with large data(10TB), the pipe() operation always OOM. I use pipe() to calling a external c++ process. I'm sure the c++ program only use little memory(about 1MB). my parameters: executor-memory 16g executor-cores 4 num-executors 400 "spark.yarn.executor.memoryOverhead", "8192" partition number: 6 does pipe() operation use many off-heap memory? the log is : killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. should I continue boosting spark.yarn.executor.memoryOverhead? Or there are some bugs in the pipe() operation? was: when using pipe() operation with large data(10TB), the pipe() operation always OOM. my parameters: executor-memory 16g executor-cores 4 num-executors 400 "spark.yarn.executor.memoryOverhead", "8192" partition number: 6 does pipe() operation use many off-heap memory? the log is : killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. should I continue boosting spark.yarn.executor.memoryOverhead? Or there are some bugs in the pipe() operation? > pipe() operation OOM > > > Key: SPARK-11101 > URL: https://issues.apache.org/jira/browse/SPARK-11101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: spark on yarn >Reporter: hotdog > Original Estimate: 72h > Remaining Estimate: 72h > > when using pipe() operation with large data(10TB), the pipe() operation > always OOM. > I use pipe() to calling a external c++ process. I'm sure the c++ program only > use little memory(about 1MB). > my parameters: > executor-memory 16g > executor-cores 4 > num-executors 400 > "spark.yarn.executor.memoryOverhead", "8192" > partition number: 6 > does pipe() operation use many off-heap memory? > the log is : > killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory > used. Consider boosting spark.yarn.executor.memoryOverhead. > should I continue boosting spark.yarn.executor.memoryOverhead? Or there are > some bugs in the pipe() operation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11100) HiveThriftServer not registering with Zookeeper
[ https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11100: Assignee: (was: Apache Spark) > HiveThriftServer not registering with Zookeeper > --- > > Key: SPARK-11100 > URL: https://issues.apache.org/jira/browse/SPARK-11100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Hive-1.2.1 > Hadoop-2.6.0 >Reporter: Xiaoyu Wang > > hive-site.xml config: > {code} > > hive.server2.support.dynamic.service.discovery > true > > > hive.server2.zookeeper.namespace > sparkhiveserver2 > > > hive.zookeeper.quorum > zk1,zk2,zk3 > > {code} > then start thrift server > {code} > start-thriftserver.sh --master yarn > {code} > In zookeeper znode "sparkhiveserver2" not found. > hiveserver2 is working on this config! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11100) HiveThriftServer not registering with Zookeeper
[ https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956467#comment-14956467 ] Apache Spark commented on SPARK-11100: -- User 'xiaowangyu' has created a pull request for this issue: https://github.com/apache/spark/pull/9113 > HiveThriftServer not registering with Zookeeper > --- > > Key: SPARK-11100 > URL: https://issues.apache.org/jira/browse/SPARK-11100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Hive-1.2.1 > Hadoop-2.6.0 >Reporter: Xiaoyu Wang > > hive-site.xml config: > {code} > > hive.server2.support.dynamic.service.discovery > true > > > hive.server2.zookeeper.namespace > sparkhiveserver2 > > > hive.zookeeper.quorum > zk1,zk2,zk3 > > {code} > then start thrift server > {code} > start-thriftserver.sh --master yarn > {code} > In zookeeper znode "sparkhiveserver2" not found. > hiveserver2 is working on this config! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956508#comment-14956508 ] qian, chen commented on SPARK-6910: --- I'm using spark-sql (spark version 1.5.1 && hadoop 2.4.0) and found a very interesting thing: in spark-sql shell: at first I ran this, it took about 3 minutes select * from table1 where date='20151010' and hour='12' and name='x' limit 5; Time taken: 164.502 seconds then I ran this, it only took 10s. date, hour and name are partition columns in this hive table. this table has >4000 partitions select * from table1 where date='20151010' and hour='13' limit 5; Time taken: 10.881 seconds is it because that the first time I need to download all partition information from hive metastore? the second query is faster because all partitions are cached in memory now? > Support for pushing predicates down to metastore for partition pruning > -- > > Key: SPARK-6910 > URL: https://issues.apache.org/jira/browse/SPARK-6910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheolsoo Park >Priority: Critical > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11099: --- Affects Version/s: 1.5.1 > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Critical > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > MapgetEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"
[ https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10845. - Resolution: Fixed I backported it. > SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v" > - > > Key: SPARK-10845 > URL: https://issues.apache.org/jira/browse/SPARK-10845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: backport-needed > Fix For: 1.5.2, 1.6.0 > > > When refactoring SQL options from plain strings to the strongly typed > {{SQLConfEntry}}, {{spark.sql.hive.version}} wasn't migrated, and doesn't > show up in the result of {{SET -v}}, as {{SET -v}} only shows public > {{SQLConfEntry}} instances. > This affects compatibility with Simba ODBC driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"
[ https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10845: Fix Version/s: 1.5.2 > SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v" > - > > Key: SPARK-10845 > URL: https://issues.apache.org/jira/browse/SPARK-10845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: backport-needed > Fix For: 1.5.2, 1.6.0 > > > When refactoring SQL options from plain strings to the strongly typed > {{SQLConfEntry}}, {{spark.sql.hive.version}} wasn't migrated, and doesn't > show up in the result of {{SET -v}}, as {{SET -v}} only shows public > {{SQLConfEntry}} instances. > This affects compatibility with Simba ODBC driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11105) Distribute the log4j.properties files from the client to the executors
[ https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11105: Target Version/s: (was: 1.5.1) > Distribute the log4j.properties files from the client to the executors > --- > > Key: SPARK-11105 > URL: https://issues.apache.org/jira/browse/SPARK-11105 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Priority: Minor > > The log4j.properties file from the client is not distributed to the > executors. This means that the client settings are not applied to the > executors and they run with the default settings. > This affects troubleshooting and data gathering. > The workaround is to use the --files option for spark-submit to propagate the > log4j.properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10973) __gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry.
[ https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957641#comment-14957641 ] Joseph K. Bradley commented on SPARK-10973: --- Yes, thanks! > __gettitem__ method throws IndexError exception when we try to access index > after the last non-zero entry. > -- > > Key: SPARK-10973 > URL: https://issues.apache.org/jira/browse/SPARK-10973 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Labels: backport-needed > Fix For: 1.3.2, 1.4.2, 1.5.2, 1.6.0 > > > \_\_gettitem\_\_ method throws IndexError exception when we try to access > index after the last non-zero entry. > {code} > from pyspark.mllib.linalg import Vectors > sv = Vectors.sparse(5, {1: 3}) > sv[0] > ## 0.0 > sv[1] > ## 3.0 > sv[2] > ## Traceback (most recent call last): > ## File "", line 1, in > ## File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__ > ## row_ind = inds[insert_index] > ## IndexError: index out of bounds > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11111) Fast null-safe join
Davies Liu created SPARK-11111: -- Summary: Fast null-safe join Key: SPARK-11111 URL: https://issues.apache.org/jira/browse/SPARK-11111 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu Today, null safe joins are executed with a Cartesian product.
{code}
scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
== Physical Plan ==
TungstenProject [i#2,j#3,i#7,j#8]
 Filter (i#2 <=> i#7)
  CartesianProduct
   LocalTableScan [i#2,j#3], [[1,1]]
   LocalTableScan [i#7,j#8], [[1,1]]
{code}
One option is to add this rewrite to the optimizer:
{code}
select * from t a join t b
on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
{code}
Acceptance criteria: joins with only null safe equality should not result in a Cartesian product. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
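Until such a rewrite lands in the optimizer, the same trick can be applied by hand. The sentinel below is only a placeholder -- the ticket leaves the exact value open, and it has to be a value that can never occur in column i.

{code}
// Hand-applied version of the rewrite described above, runnable in spark-shell once
// table t is registered; -999999 is a placeholder sentinel chosen for this example.
val rewritten = sqlContext.sql(
  """select *
    |from t a join t b
    |  on coalesce(a.i, -999999) = coalesce(b.i, -999999)
    | and (a.i <=> b.i)""".stripMargin)

// The coalesce equality gives the planner an equi-join key, so the plan should no
// longer contain a CartesianProduct; the <=> predicate keeps the results correct.
rewritten.explain()
{code}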
[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors
[ https://issues.apache.org/jira/browse/SPARK-0?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-0: Assignee: Jakob Odersky > Scala 2.11 build fails due to compiler errors > - > > Key: SPARK-0 > URL: https://issues.apache.org/jira/browse/SPARK-0 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Patrick Wendell >Assignee: Jakob Odersky > > Right now the 2.11 build is failing due to compiler errors in SBT (though not > in Maven). I have updated our 2.11 compile test harness to catch this. > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull > {code} > [error] > /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308: > no valid targets for annotation on value conf - it is discarded unused. You > may specify targets with meta-annotations, e.g. @(transient @param) > [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf) > [error] > {code} > This is one error, but there may be others past this point (the compile fails > fast). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957546#comment-14957546 ] Reynold Xin commented on SPARK-6235: Is your data skewed? i.e. maybe there is a single key that's enormous? > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10577) [PySpark] DataFrame hint for broadcast join
[ https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10577: Fix Version/s: 1.5.2 > [PySpark] DataFrame hint for broadcast join > --- > > Key: SPARK-10577 > URL: https://issues.apache.org/jira/browse/SPARK-10577 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Assignee: Jian Feng Zhang > Labels: starter > Fix For: 1.5.2, 1.6.0 > > > As in https://issues.apache.org/jira/browse/SPARK-8300 > there should be a possibility to add a hint for broadcast join in: > - PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
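A hedged sketch of the requested hint as it would look from Python, mirroring the Scala {{broadcast}} function introduced by SPARK-8300 (the DataFrame names are illustrative):
{code}
from pyspark.sql.functions import broadcast

# Hint that the small dimension table should be broadcast to the executors,
# so the planner can choose a broadcast join instead of a shuffle join.
result = large_df.join(broadcast(small_df), "id")
result.explain()
{code}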
[jira] [Commented] (SPARK-11110) Scala 2.11 build fails due to compiler errors
[ https://issues.apache.org/jira/browse/SPARK-0?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957721#comment-14957721 ] Jakob Odersky commented on SPARK-0: --- exactly what I got, I'll take a look at it > Scala 2.11 build fails due to compiler errors > - > > Key: SPARK-0 > URL: https://issues.apache.org/jira/browse/SPARK-0 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Patrick Wendell > > Right now the 2.11 build is failing due to compiler errors in SBT (though not > in Maven). I have updated our 2.11 compile test harness to catch this. > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull > {code} > [error] > /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308: > no valid targets for annotation on value conf - it is discarded unused. You > may specify targets with meta-annotations, e.g. @(transient @param) > [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf) > [error] > {code} > This is one error, but there may be others past this point (the compile fails > fast). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957554#comment-14957554 ] Glenn Strycker commented on SPARK-6235: --- I don't think so, but I can check. My RDD came from an RDD of type (K,V) that was partitioned by key and worked just fine... my new RDD that is failing is attempting to map the value V to the K, so that (V, K) is now going to be partitioned by the value (now the key) instead. So I can try running some checks of multiplicity to see if my values have some kind of skew... unfortunately most of those checks are going to involve reduceByKey-like operations that will probably result in 2GB failures themselves... I was hoping to get the mapping and partitioning of (K,V) -> (V,K) accomplished first before running such checks. Thanks for the suggestion, though! > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
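One way to probe for the skew [~rxin] asks about without shuffling the full data set is to sample before counting; a hedged PySpark sketch (the RDD name and sampling fraction are illustrative):
{code}
# Hypothetical skew probe over the original (K, V) RDD, here called kv_rdd.
# Sampling first keeps the reduce small, side-stepping the 2G limits that a
# full reduceByKey over the skewed values could hit.
sampled = kv_rdd.map(lambda kv: (kv[1], 1)).sample(False, 0.01)
heaviest = sampled.reduceByKey(lambda a, b: a + b) \
                  .takeOrdered(10, key=lambda vc: -vc[1])
print(heaviest)
{code}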
[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join
[ https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10538: Target Version/s: 1.6.0 (was: 1.5.2) > java.lang.NegativeArraySizeException during join > > > Key: SPARK-10538 > URL: https://issues.apache.org/jira/browse/SPARK-10538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Assignee: Davies Liu > Attachments: screenshot-1.png > > > Hi, > I've got a problem during joining tables in PySpark. (in my example 20 of > them) > I can observe that during calculation of first partition (on one of > consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) > vs on others partitions (approx. 272.5 KB / 113 record) > I can also observe that just before the crash python process going up to few > gb of RAM. > After some time there is an exception: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I'm running this on 2 nodes cluster (12 cores, 64 GB RAM) > Config: > {code} > spark.driver.memory 10g > spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC > -Dfile.encoding=UTF8 > spark.executor.memory 60g > spark.storage.memoryFraction0.05 > spark.shuffle.memoryFraction0.75 > spark.driver.maxResultSize 10g > spark.cores.max 24 > spark.kryoserializer.buffer.max 1g > spark.default.parallelism 200 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join
[ https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957569#comment-14957569 ] Reynold Xin commented on SPARK-10577: - I also backported this into branch-1.5 so this can be included in 1.5.2. > [PySpark] DataFrame hint for broadcast join > --- > > Key: SPARK-10577 > URL: https://issues.apache.org/jira/browse/SPARK-10577 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Assignee: Jian Feng Zhang > Labels: starter > Fix For: 1.5.2, 1.6.0 > > > As in https://issues.apache.org/jira/browse/SPARK-8300 > there should be a possibility to add a hint for broadcast join in: > - PySpark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10973) __gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry.
[ https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957587#comment-14957587 ] Reynold Xin commented on SPARK-10973: - [~josephkb] this should be closed now right? > __gettitem__ method throws IndexError exception when we try to access index > after the last non-zero entry. > -- > > Key: SPARK-10973 > URL: https://issues.apache.org/jira/browse/SPARK-10973 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Labels: backport-needed > Fix For: 1.6.0 > > > \_\_gettitem\_\_ method throws IndexError exception when we try to access > index after the last non-zero entry. > {code} > from pyspark.mllib.linalg import Vectors > sv = Vectors.sparse(5, {1: 3}) > sv[0] > ## 0.0 > sv[1] > ## 3.0 > sv[2] > ## Traceback (most recent call last): > ## File "", line 1, in > ## File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__ > ## row_ind = inds[insert_index] > ## IndexError: index out of bounds > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2
[ https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11096. - Resolution: Fixed Fix Version/s: 1.6.0 > Post-hoc review Netty based RPC implementation - round 2 > > > Key: SPARK-11096 > URL: https://issues.apache.org/jira/browse/SPARK-11096 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957661#comment-14957661 ] Balaji Krish commented on SPARK-10528: -- Following steps solved my problem 1. Open Command Prompt in Admin Mode 2. winutils.exe chmod 777 /tmp/hive 3. Open Spark-Shell --master local[2] > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
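The same steps as commands, for anyone hitting this on Windows (the winutils.exe location depends on the local Hadoop binaries):
{noformat}
:: from a command prompt opened as Administrator
winutils.exe chmod 777 /tmp/hive
spark-shell --master local[2]
{noformat}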
[jira] [Assigned] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors
[ https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11105: Assignee: (was: Apache Spark) > Disitribute the log4j.properties files from the client to the executors > --- > > Key: SPARK-11105 > URL: https://issues.apache.org/jira/browse/SPARK-11105 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Priority: Minor > > The log4j.properties file from the client is not distributed to the > executors. This means that the client settings are not applied to the > executors and they run with the default settings. > This affects troubleshooting and data gathering. > The workaround is to use the --files option for spark-submit to propagate the > log4j.properties file -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10973) __gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry.
[ https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10973. - Resolution: Fixed Fix Version/s: 1.5.2 1.4.2 1.3.2 > __gettitem__ method throws IndexError exception when we try to access index > after the last non-zero entry. > -- > > Key: SPARK-10973 > URL: https://issues.apache.org/jira/browse/SPARK-10973 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Labels: backport-needed > Fix For: 1.3.2, 1.4.2, 1.5.2, 1.6.0 > > > \_\_gettitem\_\_ method throws IndexError exception when we try to access > index after the last non-zero entry. > {code} > from pyspark.mllib.linalg import Vectors > sv = Vectors.sparse(5, {1: 3}) > sv[0] > ## 0.0 > sv[1] > ## 3.0 > sv[2] > ## Traceback (most recent call last): > ## File "", line 1, in > ## File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__ > ## row_ind = inds[insert_index] > ## IndexError: index out of bounds > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11111) Fast null-safe join
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1: Assignee: Davies Liu (was: Apache Spark) > Fast null-safe join > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > Today, null safe joins are executed with a Cartesian product. > {code} > scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain > == Physical Plan == > TungstenProject [i#2,j#3,i#7,j#8] > Filter (i#2 <=> i#7) > CartesianProduct >LocalTableScan [i#2,j#3], [[1,1]] >LocalTableScan [i#7,j#8], [[1,1]] > {code} > One option is to add this rewrite to the optimizer: > {code} > select * > from t a > join t b > on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i) > {code} > Acceptance criteria: joins with only null safe equality should not result in > a Cartesian product. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11111) Fast null-safe join
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1: Assignee: Apache Spark (was: Davies Liu) > Fast null-safe join > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Apache Spark > > Today, null safe joins are executed with a Cartesian product. > {code} > scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain > == Physical Plan == > TungstenProject [i#2,j#3,i#7,j#8] > Filter (i#2 <=> i#7) > CartesianProduct >LocalTableScan [i#2,j#3], [[1,1]] >LocalTableScan [i#7,j#8], [[1,1]] > {code} > One option is to add this rewrite to the optimizer: > {code} > select * > from t a > join t b > on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i) > {code} > Acceptance criteria: joins with only null safe equality should not result in > a Cartesian product. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11111) Fast null-safe join
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957669#comment-14957669 ] Apache Spark commented on SPARK-1: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/9120 > Fast null-safe join > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > Today, null safe joins are executed with a Cartesian product. > {code} > scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain > == Physical Plan == > TungstenProject [i#2,j#3,i#7,j#8] > Filter (i#2 <=> i#7) > CartesianProduct >LocalTableScan [i#2,j#3], [[1,1]] >LocalTableScan [i#7,j#8], [[1,1]] > {code} > One option is to add this rewrite to the optimizer: > {code} > select * > from t a > join t b > on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i) > {code} > Acceptance criteria: joins with only null safe equality should not result in > a Cartesian product. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11112) DAG visualization: display RDD callsite
Andrew Or created SPARK-11112: - Summary: DAG visualization: display RDD callsite Key: SPARK-11112 URL: https://issues.apache.org/jira/browse/SPARK-11112 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11112) DAG visualization: display RDD callsite
[ https://issues.apache.org/jira/browse/SPARK-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11112: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-7463 > DAG visualization: display RDD callsite > --- > > Key: SPARK-11112 > URL: https://issues.apache.org/jira/browse/SPARK-11112 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors
[ https://issues.apache.org/jira/browse/SPARK-0?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-0: Priority: Critical (was: Major) > Scala 2.11 build fails due to compiler errors > - > > Key: SPARK-0 > URL: https://issues.apache.org/jira/browse/SPARK-0 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Patrick Wendell >Assignee: Jakob Odersky >Priority: Critical > > Right now the 2.11 build is failing due to compiler errors in SBT (though not > in Maven). I have updated our 2.11 compile test harness to catch this. > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull > {code} > [error] > /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308: > no valid targets for annotation on value conf - it is discarded unused. You > may specify targets with meta-annotations, e.g. @(transient @param) > [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf) > [error] > {code} > This is one error, but there may be others past this point (the compile fails > fast). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8386) DataFrame and JDBC regression
[ https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8386. Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 1.6.0 1.5.2 1.4.2 > DataFrame and JDBC regression > - > > Key: SPARK-8386 > URL: https://issues.apache.org/jira/browse/SPARK-8386 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: RHEL 7.1 >Reporter: Peter Haumer >Assignee: Huaxin Gao >Priority: Critical > Fix For: 1.4.2, 1.5.2, 1.6.0 > > > I have an ETL app that appends to a JDBC table new results found at each run. > In 1.3.1 I did this: > testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false); > When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already > exists. I get this even if I switch the overwrite to true. I also tried this > now: > testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, > connectionProperties); > getting the same error. It works running the first time creating the new > table and adding data successfully. But, running it a second time it (the > jdbc driver) will tell me that the table already exists. Even > SaveMode.Overwrite will give me the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
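With the fix applied, the append path described in the report should work again; a hedged PySpark equivalent (connection URL, table name, and credentials are placeholders, not values from the report):
{code}
# Hypothetical append to an existing JDBC table; all connection details are
# placeholders rather than values taken from the issue.
props = {"user": "user", "password": "secret"}
test_results_df.write.jdbc("jdbc:postgresql://db-host:5432/testdb",
                           "TEST_RESULTS", mode="append", properties=props)
{code}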
[jira] [Created] (SPARK-11110) Scala 2.11 build fails due to compiler errors
Patrick Wendell created SPARK-11110: --- Summary: Scala 2.11 build fails due to compiler errors Key: SPARK-11110 URL: https://issues.apache.org/jira/browse/SPARK-11110 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Right now the 2.11 build is failing due to compiler errors in SBT (though not in Maven). I have updated our 2.11 compile test harness to catch this. https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull {code} [error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308: no valid targets for annotation on value conf - it is discarded unused. You may specify targets with meta-annotations, e.g. @(transient @param) [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf) [error] {code} This is one error, but there may be others past this point (the compile fails fast). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957926#comment-14957926 ] Michael Armbrust commented on SPARK-: - [~sandyr] did you look at the test cases [in scala|https://github.com/marmbrus/spark/blob/d0277f5013fd9e5e758c607b5c833cf5aa7bb93c/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala] and [java|https://github.com/marmbrus/spark/blob/d0277f5013fd9e5e758c607b5c833cf5aa7bb93c/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java] linked from the attached design doc? In Scala, users should never have to think about Encoders as long as their data can be represented as primitives, case classes, tuples, or collections. Implicits (provided by {{sqlContext.implicits._}}) automatically pass the required information to the function. In Java, the compiler is not helping us out as much, so the user must do as you suggest. The prototype shows {{ProductEncoder.tuple(Long.class, Long.class)}}, but we will have a similar interface that works for class objects for POJOs / JavaBeans. The problem with doing this using a registry (like kryo in RDDs today) is that then you aren't finding out the object type until you have an example object from realizing the computation. That is often too late to do the kinds of optimizations that we are trying to enable. Instead we'd like to statically realize the schema at Dataset construction time. Encoders are just an encapsulation of the required information and provide an interface if we ever want to allow someone to specify a custom encoder. Regarding the performance concerns with reflection, the implementation that is already present in Spark master ([SPARK-10993] and [SPARK-11090]) is based on catalyst expressions. Reflection is done once on the driver, and the existing code generation caching framework is taking care of caching generated encoder bytecode on the executors. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. 
Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Commented] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement
[ https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957935#comment-14957935 ] Apache Spark commented on SPARK-10534: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/9123 > ORDER BY clause allows only columns that are present in SELECT statement > > > Key: SPARK-10534 > URL: https://issues.apache.org/jira/browse/SPARK-10534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michal Cwienczek > > When invoking query SELECT EmployeeID from Employees order by YEAR(HireDate) > Spark 1.5 throws exception: > {code} > cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given > input columns EmployeeID; line 2 pos 14 StackTrace: > org.apache.spark.sql.AnalysisException: cannot resolve > 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns > EmployeeID; line 2 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at
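Until the analyzer resolves ORDER BY expressions that are not part of the projection, one hedged workaround is to select the expression explicitly and drop it afterwards (shown in PySpark; table and column names follow the report):
{code}
# Hypothetical workaround: expose YEAR(HireDate) as a column so it resolves,
# then drop the helper column from the final result.
df = sqlContext.sql("SELECT EmployeeID, YEAR(HireDate) AS hireYear "
                    "FROM Employees ORDER BY hireYear").drop("hireYear")
{code}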
[jira] [Commented] (SPARK-11078) Ensure spilling tests are actually spilling
[ https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957940#comment-14957940 ] Apache Spark commented on SPARK-11078: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/9124 > Ensure spilling tests are actually spilling > --- > > Key: SPARK-11078 > URL: https://issues.apache.org/jira/browse/SPARK-11078 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Andrew Or >Assignee: Andrew Or > > The new unified memory management model in SPARK-10983 uncovered many brittle > tests that rely on arbitrary thresholds to detect spilling. Some tests don't > even assert that spilling did occur. > We should go through all the places where we test spilling behavior and > correct the tests, a subset of which are definitely incorrect. Potential > suspects: > - UnsafeShuffleSuite > - ExternalAppendOnlyMapSuite > - ExternalSorterSuite > - SQLQuerySuite > - DistributedSuite -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957965#comment-14957965 ] Hyukjin Kwon commented on SPARK-11103: -- In this case, this should be fine because filter is not pushed down to Parquet and data is filtered by Spark filter. If you set off spark.sql.parquet.filterPushdown which is true by default, the original case also should work okay. > Filter applied on Merged Parquet shema with new column fail with > (java.lang.IllegalArgumentException: Column [column_name] was not found in > schema!) > > > Key: SPARK-11103 > URL: https://issues.apache.org/jira/browse/SPARK-11103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Dominic Ricard > > When evolving a schema in parquet files, spark properly expose all columns > found in the different parquet files but when trying to query the data, it is > not possible to apply a filter on a column that is not present in all files. > To reproduce: > *SQL:* > {noformat} > create table `table1` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`; > create table `table2` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as > `col2`; > create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path > "hdfs://:/path/to/table"); > select col1 from `table3` where col2 = 2; > {noformat} > The last select will output the following Stack Trace: > {noformat} > An error occurred when executing the SQL command: > select col1 from `table3` where col2 = 2 > [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: > 0, SQL state: TStatus(statusCode:ERROR_STATUS, > infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, > 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not > found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155) > at > org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
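A hedged sketch of the configuration workaround described in the comment above, shown in PySpark (the table name follows the reproduction steps):
{code}
# Hypothetical workaround: with Parquet filter pushdown disabled, the filter
# on the column missing from older files is evaluated by Spark instead of
# being validated against each file's Parquet schema.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select col1 from `table3` where col2 = 2").show()
{code}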
[jira] [Assigned] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python
[ https://issues.apache.org/jira/browse/SPARK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11114: Assignee: Davies Liu (was: Apache Spark) > Add getOrCreate for SparkContext/SQLContext for Python > -- > > Key: SPARK-11114 > URL: https://issues.apache.org/jira/browse/SPARK-11114 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > > Also SQLContext.newSession() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11116) Initial API Draft
Michael Armbrust created SPARK-11116: Summary: Initial API Draft Key: SPARK-11116 URL: https://issues.apache.org/jira/browse/SPARK-11116 Project: Spark Issue Type: Sub-task Reporter: Michael Armbrust Assignee: Michael Armbrust The goal here is to spec out the main functions to give people an idea of what using the API would be like. Optimization and whatnot can be done in a follow up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957731#comment-14957731 ] Alexis Seigneurin commented on SPARK-10925: --- Well, technically, it's not a duplicate column. An inner join between two Dataframes on a column that carries the same name on both sides is supposed to work and to only retain one column. I had noticed that renaming one of the columns was a workaround and that's what I'm doing before this issue gets fixed. One thing to note, though, is that this code used to work with Spark 1.4 (I have only adjusted the call to the UDFs to use the new API). This means there must be a regression in the query analyzer. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at
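A hedged sketch of the rename workaround mentioned in the comment, shown in PySpark for brevity (the DataFrame and column names are illustrative rather than taken from the attached TestCase2.scala):
{code}
# Hypothetical workaround: rename the grouping column on the aggregated side
# before re-joining, so the analyzer never sees two ambiguous 'name' columns.
agg_df = df.groupBy("name").count().withColumnRenamed("name", "name_key")
joined = df.join(agg_df, df["name"] == agg_df["name_key"]).drop("name_key")
{code}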
[jira] [Updated] (SPARK-11113) Remove DeveloperApi annotation from private classes
[ https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11113: Description: For a variety of reasons, we tagged a bunch of internal classes in the execution package in SQL as DeveloperApi. was: For a variety of reasons, we tagged a bunch of internal classes in SQL as DeveloperApi. > Remove DeveloperApi annotation from private classes > --- > > Key: SPARK-11113 > URL: https://issues.apache.org/jira/browse/SPARK-11113 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > For a variety of reasons, we tagged a bunch of internal classes in the > execution package in SQL as DeveloperApi. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11113) Remove DeveloperApi annotation from private classes
[ https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11113: Assignee: Apache Spark (was: Reynold Xin) > Remove DeveloperApi annotation from private classes > --- > > Key: SPARK-11113 > URL: https://issues.apache.org/jira/browse/SPARK-11113 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > For a variety of reasons, we tagged a bunch of internal classes in the > execution package in SQL as DeveloperApi. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11113) Remove DeveloperApi annotation from private classes
[ https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11113: Assignee: Reynold Xin (was: Apache Spark) > Remove DeveloperApi annotation from private classes > --- > > Key: SPARK-11113 > URL: https://issues.apache.org/jira/browse/SPARK-11113 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > For a variety of reasons, we tagged a bunch of internal classes in the > execution package in SQL as DeveloperApi. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11113) Remove DeveloperApi annotation from private classes
[ https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957776#comment-14957776 ] Apache Spark commented on SPARK-11113: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/9121 > Remove DeveloperApi annotation from private classes > --- > > Key: SPARK-11113 > URL: https://issues.apache.org/jira/browse/SPARK-11113 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > For a variety of reasons, we tagged a bunch of internal classes in the > execution package in SQL as DeveloperApi. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11113) Remove DeveloperApi annotation from private classes
Reynold Xin created SPARK-11113: --- Summary: Remove DeveloperApi annotation from private classes Key: SPARK-11113 URL: https://issues.apache.org/jira/browse/SPARK-11113 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin For a variety of reasons, we tagged a bunch of internal classes in SQL as DeveloperApi. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python
[ https://issues.apache.org/jira/browse/SPARK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11114: Assignee: Apache Spark (was: Davies Liu) > Add getOrCreate for SparkContext/SQLContext for Python > -- > > Key: SPARK-11114 > URL: https://issues.apache.org/jira/browse/SPARK-11114 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Apache Spark > > Also SQLContext.newSession() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python
Davies Liu created SPARK-11114: -- Summary: Add getOrCreate for SparkContext/SQLContext for Python Key: SPARK-11114 URL: https://issues.apache.org/jira/browse/SPARK-11114 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Also SQLContext.newSession() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
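A hedged sketch of how the proposed Python API might be used; the method names follow the summary above, but the exact signatures may differ in the final change:
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Proposed: reuse an existing context rather than constructing a new one.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext.getOrCreate(sc)

# Proposed: an isolated session that shares the underlying SparkContext.
session = sqlContext.newSession()
{code}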
[jira] [Commented] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python
[ https://issues.apache.org/jira/browse/SPARK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957779#comment-14957779 ] Apache Spark commented on SPARK-11114: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/9122 > Add getOrCreate for SparkContext/SQLContext for Python > -- > > Key: SPARK-11114 > URL: https://issues.apache.org/jira/browse/SPARK-11114 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > > Also SQLContext.newSession() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement
[ https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10534: Assignee: Apache Spark > ORDER BY clause allows only columns that are present in SELECT statement > > > Key: SPARK-10534 > URL: https://issues.apache.org/jira/browse/SPARK-10534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michal Cwienczek >Assignee: Apache Spark > > When invoking query SELECT EmployeeID from Employees order by YEAR(HireDate) > Spark 1.5 throws exception: > {code} > cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given > input columns EmployeeID; line 2 pos 14 StackTrace: > org.apache.spark.sql.AnalysisException: cannot resolve > 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns > EmployeeID; line 2 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at >