[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957080#comment-14957080
 ] 

Thomas Graves commented on SPARK-10873:
---

Has anyone discussed just using something like jQuery DataTables, or a similar 
library, which automatically gives us search 
(https://issues.apache.org/jira/browse/SPARK-10874), sorting, pagination, etc.?
I'm not sure how well row spanning works with DataTables, but it seems possible: 
http://www.datatables.net/examples/advanced_init/row_grouping.html

I know there are other options like jqGrid, but I'm by no means a UI expert and 
have used DataTables some in Hadoop.

[~rxin]  [~zsxwing]  any thoughts on using something like jQuery DataTables?

What about using it only on certain pages, like the history page, first? The 
downside is that pages might look different.  As more and more people use Spark, 
being able to use the history page to debug is becoming a bigger and bigger issue.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column






[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957144#comment-14957144
 ] 

Sandy Ryza commented on SPARK-9999:
---

Maybe you all have thought through this as well, but I had some more thoughts 
on the proposed API.

Fundamentally, it seems weird to me that the user is responsible for having a 
matching Encoder around every time they want to map to a class of a particular 
type.  In 99% of cases, the Encoder used to encode any given type will be the 
same, and it seems more intuitive to me to specify this up front.

To be more concrete, suppose I want to use case classes in my app and have a 
function that can auto-generate an Encoder from a class object (though this 
might be a little bit time consuming because it needs to use reflection).  With 
the current proposal, any time I want to map my Dataset to a Dataset of some 
case class, I need to either have a line of code that generates an Encoder for 
that case class, or have an Encoder already lying around.  If I perform this 
operation within a method, I need to pass the Encoder down to the method and 
include it in the signature.

Ideally I would be able to register an EncoderSystem up front that caches 
Encoders and generates new Encoders whenever it sees a new class used.  This 
still of course requires the user to pass in type information when they call 
map, but it's easier for them to get this information than an actual encoder.  
If there's not some principled way to get this working implicitly with 
ClassTags, the user could just pass in classOf[MyCaseClass] as the second 
argument to map.
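
For illustration only, here is a small, self-contained Scala sketch (plain Scala, no Spark; 
the Encoder/Dataset names below are toy stand-ins, not the proposed API) contrasting an 
explicit encoder argument at every call site with an implicit encoder resolved from the 
target type, along the lines of the "EncoderSystem" idea above.

{code}
import scala.reflect.ClassTag

// Toy stand-ins for the real types, used only to illustrate call-site ergonomics.
trait Encoder[T] { def name: String }

object Encoder {
  // An "EncoderSystem"-style generator: derive (and, in a real system, cache) an encoder per type.
  implicit def derive[T](implicit ct: ClassTag[T]): Encoder[T] =
    new Encoder[T] { val name = ct.runtimeClass.getSimpleName }
}

class Dataset[T](val data: Seq[T]) {
  // Explicit style: every call site must supply (or thread through) a matching encoder.
  def mapExplicit[U](f: T => U, enc: Encoder[U]): Dataset[U] = new Dataset(data.map(f))
  // Implicit style: only the target type matters at the call site; the encoder is resolved for us.
  def map[U](f: T => U)(implicit enc: Encoder[U]): Dataset[U] = new Dataset(data.map(f))
}

case class Person(name: String)

object EncoderSketch extends App {
  val ds = new Dataset(Seq(1, 2, 3))
  val explicitStyle = ds.mapExplicit(i => Person(s"p$i"), Encoder.derive[Person]) // boilerplate per call
  val implicitStyle = ds.map(i => Person(s"p$i"))                                 // encoder found from the type
  println(implicitStyle.data)
}
{code}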

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
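
A hedged sketch of the DataFrame/Dataset round trip that the "Interoperates with DataFrames" 
requirement describes, written against the API shape in the linked proposal (df.as[T], ds.toDF(), 
implicit encoders); exact names and imports may differ from what finally ships, so treat it as 
illustrative only.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object DatasetInteropSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val df = Seq(Person("Ann", 34), Person("Bob", 17)).toDF()   // untyped DataFrame (name, age)
  val people = df.as[Person]                                  // typed view; should fail fast on a schema mismatch
  val adults = people.filter(_.age >= 18)                     // compile-time checked lambda over domain objects
  adults.toDF().show()                                        // back to a DataFrame for DataFrame-only libraries
  sc.stop()
}
{code}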






[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957170#comment-14957170
 ] 

Xiao Li commented on SPARK-10925:
-

Hi Alexis,

The schema of your query result has duplicate column names.

In your test case, you just need to fix one line:

val cardinalityDF2 = df4.groupBy("surname")
  .agg(count("surname").as("cardinality_surname"))
-->
val cardinalityDF2 = df4.groupBy("surname")
  .agg(count("surname").as("cardinality_surname"))
  .withColumnRenamed("surname", "surname_new")
cardinalityDF2.show()

I think Spark SQL should detect the problem at an earlier stage. I will try to 
fix it and output a clearer error message.

Let me know if you have more questions. Thanks!

Xiao Li
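
For completeness, a hedged, self-contained sketch of the rename-before-join workaround 
(the data, app name, and join condition below are illustrative, not taken from the attached 
test case):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.count

object DuplicateColumnSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("dup-col-sketch").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val df4 = Seq(("Ann", "Smith"), ("Bob", "Smith"), ("Cid", "Jones")).toDF("name", "surname")

  // Renaming the grouping column avoids ending up with two "surname" columns after the join,
  // which is what triggers the unresolved-attribute error in the report.
  val cardinalityDF2 = df4.groupBy("surname")
    .agg(count("surname").as("cardinality_surname"))
    .withColumnRenamed("surname", "surname_new")

  df4.join(cardinalityDF2, df4("surname") === cardinalityDF2("surname_new")).show()
  sc.stop()
}
{code}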


> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply a UDF on column "name"
> # apply a UDF on column "surname"
> # apply a UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at 

[jira] [Assigned] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2533:
---

Assignee: (was: Apache Spark)

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to tell from the 
> stage page of the web UI how many tasks were executed at each locality level 
> (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of 
> task locality levels in the web UI.






[jira] [Assigned] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2533:
---

Assignee: Apache Spark

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Assignee: Apache Spark
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to tell from the 
> stage page of the web UI how many tasks were executed at each locality level 
> (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of 
> task locality levels in the web UI.






[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957131#comment-14957131
 ] 

Apache Spark commented on SPARK-2533:
-

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/9117

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to tell from the 
> stage page of the web UI how many tasks were executed at each locality level 
> (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of 
> task locality levels in the web UI.






[jira] [Updated] (SPARK-11102) Unreadable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Summary: Unreadable exception when specifing non-exist input for JSON data 
source  (was: Not readable exception when specifing non-exist input for JSON 
data source)

> Unreadable exception when specifing non-exist input for JSON data source
> 
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable.
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}
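
A proper fix would make the JSON data source itself report a clear error; until then, here is a 
hedged sketch of a caller-side guard that turns the deep IOException above into a readable 
message (the input path below is hypothetical):

{code}
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonPathCheckSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("json-path-check").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)

  val inputPath = "hdfs:///data/events.json"   // hypothetical path
  val fs = new Path(inputPath).getFileSystem(sc.hadoopConfiguration)
  require(fs.exists(new Path(inputPath)), s"Input path does not exist: $inputPath")

  sqlContext.read.json(inputPath).show()
  sc.stop()
}
{code}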






[jira] [Updated] (SPARK-10876) display total application time in spark history UI

2015-10-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-10876:
--
Assignee: Jean-Baptiste Onofré

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Jean-Baptiste Onofré
>
> The history file has application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> This could be displayed similarly to "Total Uptime" for a running application.






[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957232#comment-14957232
 ] 

Marcelo Vanzin commented on SPARK-11098:


I'm not explicitly working on this at the moment.

> RPC message ordering is not guaranteed
> --
>
> Key: SPARK-11098
> URL: https://issues.apache.org/jira/browse/SPARK-11098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order, since there are multiple 
> threads sending messages in the clientConnectionExecutor thread pool. We should 
> fix that.






[jira] [Updated] (SPARK-11108) OneHotEncoder should support other numeric input types

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11108:
--
Description: 
See parent JIRA for more info.

Also see [SPARK-10513] for the motivation behind this issue.

  was:See parent JIRA for more info.


> OneHotEncoder should support other numeric input types
> --
>
> Key: SPARK-11108
> URL: https://issues.apache.org/jira/browse/SPARK-11108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See parent JIRA for more info.
> Also see [SPARK-10513] for the motivation behind this issue.






[jira] [Resolved] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler

2015-10-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11040.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> SaslRpcHandler does not delegate all methods to underlying handler
> --
>
> Key: SPARK-11040
> URL: https://issues.apache.org/jira/browse/SPARK-11040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> {{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so 
> when SASL is enabled, other events will be missed by apps.
> This affects other versions too, but I think these events aren't actually used 
> there. They'll be used by the new RPC backend in 1.6, though.






[jira] [Created] (SPARK-11108) OneHotEncoder should support other numeric input types

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11108:
-

 Summary: OneHotEncoder should support other numeric input types
 Key: SPARK-11108
 URL: https://issues.apache.org/jira/browse/SPARK-11108
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


See parent JIRA for more info.






[jira] [Issue Comment Deleted] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-14 Thread Jason C Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason C Lee updated SPARK-10943:

Comment: was deleted

(was: I'd like to work on this. Thanx)

> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> //FAIL - Try writing a NullType column (where all the values are NULL)
> data02.write.parquet("/tmp/celtra-test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
>   at 
> 
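
A commonly suggested workaround (hedged sketch; it assumes the same {{sqlContext}} as the snippet 
above) is to give the all-null column a concrete type before writing, since Parquet has no 
representation for NullType:

{code}
import org.apache.spark.sql.types.StringType

val data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null as comments")

// Cast the NullType column to a supported type (here StringType) so the Parquet writer accepts it.
val writable = data02.withColumn("comments", data02("comments").cast(StringType))
writable.write.parquet("/tmp/celtra-test/dataset2")
{code}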

[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-11099:
---
Affects Version/s: (was: 1.5.1)

(Removing affected version. This code does not exist in branch-1.5.)

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
>     if (effectiveConfig == null) {
>       if (propertiesFile == null) {
>         effectiveConfig = conf;   // return from here if no propertyFile is provided
>       } else {
>         effectiveConfig = new HashMap<>(conf);
>         Properties p = loadPropertiesFile();   // default propertyFile will load here
>         for (String key : p.stringPropertyNames()) {
>           if (!effectiveConfig.containsKey(key)) {
>             effectiveConfig.put(key, p.getProperty(key));
>           }
>         }
>       }
>     }
>     return effectiveConfig;
>   }
> {code}






[jira] [Created] (SPARK-11105) Dsitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Srinivasa Reddy Vundela (JIRA)
Srinivasa Reddy Vundela created SPARK-11105:
---

 Summary: Dsitribute the log4j.properties files from the client to 
the executors
 Key: SPARK-11105
 URL: https://issues.apache.org/jira/browse/SPARK-11105
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.5.1
Reporter: Srinivasa Reddy Vundela
Priority: Minor


The log4j.properties file from the client is not distributed to the executors. 
This means that the client settings are not applied to the executors and they 
run with the default settings.
This affects troubleshooting and data gathering.
The workaround is to use the --files option for spark-submit to propagate the 
log4j.properties file






[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-14 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957301#comment-14957301
 ] 

Zhan Zhang commented on SPARK-11087:


I will take a look at this one.

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
> But from the log file, with debug logging enabled, no ORC pushdown predicate 
> was generated.
> Unfortunately my table was not sorted when I inserted the data, but I still 
> expected the ORC pushdown predicate to be generated (because of the where 
> clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name            data_type   comment
>
> date                  int
> hh                    int
> x                     int
> y                     int
> height                float
> u                     float
> v                     float
> w                     float
> ph                    float
> phb                   float
> t                     float
> p                     float
> pb                    float
> qvapor                float
> qgraup                float
> qnice                 float
> qnrain                float
> tke_pbl               float
> el_pbl                float
> qcloud                float
>
> # Partition Information
> # col_name            data_type   comment
>
> zone                  int
> z                     int
> year                  int
> month                 int
>
> # Detailed Table Information
> Database:             default
> Owner:                patcharee
> CreateTime:           Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:       UNKNOWN
> Protect Mode:         None
> Retention:            0
> Location:             hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:           EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL              TRUE
>   comment               this table is imported from rwf_data/*/wrf/*
>   last_modified_by      patcharee
>   last_modified_time    1439806692
>   orc.compress          ZLIB
>   transient_lastDdlTime 1439806692
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:          org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:           No
> Num Buckets:          -1
> Bucket Columns:       []
> Sort Columns:         []
> Storage Desc Params:
>   serialization.format  1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 
> Data was inserted into this table by another Spark job.
> 
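
A hedged diagnostic sketch for the report above: enable the flag, run the query, and inspect the 
physical plan and ORC debug logs for a pushed-down predicate. Whether one appears is exactly what 
this issue is about, so this only helps reproduce and diagnose, it does not fix anything 
({{hiveContext}} is assumed, as in the report):

{code}
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val q = hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
q.explain(true)   // look for a pushed-down ORC predicate in the physical plan / debug logs
q.show()
{code}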

[jira] [Assigned] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-11078:
-

Assignee: Andrew Or

> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite
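
For reference, a hedged sketch (not the project's actual test code; it assumes an existing 
SparkContext {{sc}}) of one way a test can assert that spilling really happened, by accumulating 
the per-task spill metrics through a SparkListener:

{code}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spilled = new AtomicLong(0L)
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) spilled.addAndGet(m.memoryBytesSpilled + m.diskBytesSpilled)
  }
})

// ... run the workload that is expected to spill ...

assert(spilled.get() > 0, "expected the job to spill, but no spill metrics were recorded")
{code}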






[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-10-14 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957194#comment-14957194
 ] 

Cheolsoo Park commented on SPARK-6910:
--

You're right that the 2nd query is faster because the table/partition metadata is 
cached. In particular, if you leave {{spark.sql.hive.metastorePartitionPruning}} 
set to false (the default), Spark will cache metadata for all the partitions, so 
any query against the same table will run faster even with a different 
predicate. See the relevant code 
[here|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L830-L839].
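
For anyone following along, flipping the setting referenced above pushes the predicate to the 
metastore instead of caching and filtering all partitions on the Spark side (a minimal sketch; 
{{sqlContext}} is assumed):

{code}
// With this enabled, Spark asks the metastore only for the partitions matching the predicate,
// rather than fetching and caching metadata for every partition of the table.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
{code}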

> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>







[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957230#comment-14957230
 ] 

Marcelo Vanzin commented on SPARK-11097:


Hi [~rxin], can you explain what the use case for this would be? Is it just to 
simplify the code?

I'm working on SPARK-10997 and have changed the code around that area a lot. I 
was able to simplify it without needing a connection-established callback.

> Add connection established callback to lower level RPC layer so we don't need 
> to check for new connections in NettyRpcHandler.receive
> -
>
> Key: SPARK-11097
> URL: https://issues.apache.org/jira/browse/SPARK-11097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> I think we can remove the check for new connections in 
> NettyRpcHandler.receive if we just add a channel registered callback to the 
> lower level network module.






[jira] [Updated] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Srinivasa Reddy Vundela (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivasa Reddy Vundela updated SPARK-11105:

Summary: Disitribute the log4j.properties files from the client to the 
executors  (was: Dsitribute the log4j.properties files from the client to the 
executors)

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file






[jira] [Resolved] (SPARK-10619) Can't sort columns on Executor Page

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10619.
-
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.5.2, 1.6.0
>
>
> I am using Spark 1.5 running on YARN and go to the executors page.  It won't 
> allow sorting of the columns. This used to work in Spark 1.4.






[jira] [Created] (SPARK-11106) Should ML Models contains single models or Pipelines?

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11106:
-

 Summary: Should ML Models contains single models or Pipelines?
 Key: SPARK-11106
 URL: https://issues.apache.org/jira/browse/SPARK-11106
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Critical


This JIRA is for discussing whether ML Estimators should do feature processing.

h2. Issue

Currently, almost all ML Estimators require strict input types.  E.g., 
DecisionTreeClassifier requires that the label column be Double type and have 
metadata indicating the number of classes.

This requires users to know how to preprocess data.

h2. Ideal workflow

A user should be able to pass any reasonable data to a Transformer or Estimator 
and have it "do the right thing."

E.g.:
* If DecisionTreeClassifier is given a String column for labels, it should know 
to index the Strings.
* See [SPARK-10513] for a similar issue with OneHotEncoder.

h2. Possible solutions

There are a few solutions I have thought of.  Please comment with feedback or 
alternative ideas!

h3. Leave as is

Pro: The current setup is good in that it forces the user to be very aware of 
what they are doing.  Feature transformations will not happen silently.

Con: The user has to write boilerplate code for transformations.  The API is 
not what some users would expect; e.g., coming from R, a user might expect some 
automatic transformations.

h3. All Transformers can contain PipelineModels

We could allow all Transformers and Models to contain arbitrary PipelineModels. 
 E.g., if a DecisionTreeClassifier were given a String label column, it might 
return a Model which contains a simple fitted PipelineModel containing 
StringIndexer + DecisionTreeClassificationModel.

The API could present this to the user, or it could be hidden from the user.  
Ideally, it would be hidden from the beginner user, but accessible for experts.

The main problem is that we might have to break APIs.  E.g., OneHotEncoder may 
need to do indexing if given a String input column.  This means it should no 
longer be a Transformer; it should be an Estimator.

h3. All Estimators should use RFormula

The best option I have thought of is to make RFormula the primary method for 
automatic feature transformation. We could start adding an RFormula Param to 
all Estimators, and it could handle most of these feature transformation issues.

We could maintain old APIs:
* If a user sets the input column names, then those can be used in the 
traditional (no automatic transformation) way.
* If a user sets the RFormula Param, then it can be used instead.  (This should 
probably take precedence over the old API.)
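
As a point of reference for the RFormula option, a hedged sketch of what an RFormula-driven 
preparation step looks like with the existing spark.ml transformer (data and formula are 
illustrative; {{sqlContext}} is assumed):

{code}
import org.apache.spark.ml.feature.RFormula

val training = sqlContext.createDataFrame(Seq(
  (1.0, 0.5, "US"),
  (0.0, 1.2, "EU")
)).toDF("clicked", "bid", "country")

val formula = new RFormula()
  .setFormula("clicked ~ bid + country")   // string features are indexed / one-hot encoded internally
  .setFeaturesCol("features")
  .setLabelCol("label")

val prepared = formula.fit(training).transform(training)
prepared.select("features", "label").show()
{code}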






[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957375#comment-14957375
 ] 

Marcelo Vanzin commented on SPARK-10873:


What I mean is that while replacing the sorting library is sort of easy, by 
itself it doesn't really solve the problem.

Pagination is currently done in the backend, meaning the backend will generate 
hardcoded HTML with the current page, instead of something that can be easily 
consumed by a client-side library to do pagination and sorting on the client.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column






[jira] [Commented] (SPARK-10513) Springleaf Marketing Response

2015-10-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957345#comment-14957345
 ] 

Joseph K. Bradley commented on SPARK-10513:
---

[~yanboliang]  This is really helpful feedback.  Thanks very much for taking 
the time!  I'll try to list plans for addressing the various issues you found:

1. Here's the closest issue I could find for spark-csv: 
[https://github.com/databricks/spark-csv/issues/48]  Would you mind commenting 
there to try to escalate the issue?

2. What would be your ideal way to write this in the DataFrame API?  Something 
like 
{{train.withColumn(train("label").cast(DoubleType).as("label")).na.drop()}}?  
(I think that almost works now, but I'm not actually sure if the cast works or 
fails when it encounters empty Strings.)

3. Just made a JIRA: [SPARK-11108]

4. Do you mean a completely missing value?  Or do you mean that StringIndexer 
should handle an empty String differently?

5. Multi-value support for transformers: [SPARK-8418]

6. Here's some more detailed discussion which I just wrote down: [SPARK-11106]

I haven't yet looked at your example code, but will try to soon.  Thanks again 
for working on this!
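
Regarding point 2, a hedged sketch of the cast-and-drop step in the current DataFrame API 
(it assumes a DataFrame {{train}} whose "label" column is a string; empty or non-numeric 
strings become null after the cast and are then dropped):

{code}
import org.apache.spark.sql.types.DoubleType

val cleaned = train
  .withColumn("label", train("label").cast(DoubleType))   // non-numeric / empty strings become null
  .na.drop("any", Seq("label"))                            // then drop rows whose label is null
{code}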

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)






[jira] [Created] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-14 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11109:
--

 Summary: move FsHistoryProvider off import 
org.apache.hadoop.fs.permission.AccessControlException
 Key: SPARK-11109
 URL: https://issues.apache.org/jira/browse/SPARK-11109
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Steve Loughran
Priority: Minor


{{FsHistoryProvider}} imports and uses 
{{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
superseded by its subclass 
{{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
that subclass would remove a deprecation warning and ensure that, were the 
Hadoop team to remove the old class (as HADOOP-11356 has already done on 
trunk), everything would still compile and link.
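
The change itself would amount to an import swap plus whatever the catch sites need; a hedged 
sketch (the catch block shown is illustrative, not the actual FsHistoryProvider code):

{code}
// Before (deprecated):
// import org.apache.hadoop.fs.permission.AccessControlException
// After (its non-deprecated subclass):
import org.apache.hadoop.security.AccessControlException

// Illustrative catch site; the real handler in FsHistoryProvider may differ.
try {
  // ... open / replay an event log ...
} catch {
  case _: AccessControlException =>
    () // skip files the history server is not allowed to read
}
{code}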






[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957210#comment-14957210
 ] 

Marcelo Vanzin commented on SPARK-10873:


As part of trying to fix SPARK-10172 I played with jQuery DataTables, and it 
works fine even with rowspan. But I thought it would be too big a change for 
the 1.5 branch.

Also, I still believe that sorting with the current pagination code is 
confusing and not very helpful. To enable proper sorting / searching, the 
backend would need to be changed to support something more dynamic, so that the 
client can make the decision about what to show and how.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column






[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Description: 
spark.driver.extraClassPath doesn't take effect in the latest code; the root 
cause is that the default conf property file is not loaded.

The bug is caused by this code snippet in AbstractCommandBuilder:
{code}
  Map<String, String> getEffectiveConfig() throws IOException {
    if (effectiveConfig == null) {
      if (propertiesFile == null) {
        effectiveConfig = conf;   // return from here if no propertyFile is provided
      } else {
        effectiveConfig = new HashMap<>(conf);
        Properties p = loadPropertiesFile();   // default propertyFile will load here
        for (String key : p.stringPropertyNames()) {
          if (!effectiveConfig.containsKey(key)) {
            effectiveConfig.put(key, p.getProperty(key));
          }
        }
      }
    }
    return effectiveConfig;
  }
{code}

  was:
spark.driver.extraClassPath doesn't take effect in the latest code; the root 
cause is that the default conf property file is not loaded.

The bug is caused by this code snippet in AbstractCommandBuilder:
{code}
  Map<String, String> getEffectiveConfig() throws IOException {
    if (effectiveConfig == null) {
      if (propertiesFile == null) {
        effectiveConfig = conf;
      } else {
        effectiveConfig = new HashMap<>(conf);
        Properties p = loadPropertiesFile();
        for (String key : p.stringPropertyNames()) {
          if (!effectiveConfig.containsKey(key)) {
            effectiveConfig.put(key, p.getProperty(key));
          }
        }
      }
    }
    return effectiveConfig;
  }
{code}


> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
>     if (effectiveConfig == null) {
>       if (propertiesFile == null) {
>         effectiveConfig = conf;   // return from here if no propertyFile is provided
>       } else {
>         effectiveConfig = new HashMap<>(conf);
>         Properties p = loadPropertiesFile();   // default propertyFile will load here
>         for (String key : p.stringPropertyNames()) {
>           if (!effectiveConfig.containsKey(key)) {
>             effectiveConfig.put(key, p.getProperty(key));
>           }
>         }
>       }
>     }
>     return effectiveConfig;
>   }
> {code}






[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956341#comment-14956341
 ] 

Sandy Ryza commented on SPARK-9999:
---

Thanks for the explanation [~rxin] and [~marmbrus].  I understand the problem 
and don't have any great ideas for an alternative workable solution.

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]






[jira] [Created] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11096:
---

 Summary: Post-hoc review Netty based RPC implementation - round 2
 Key: SPARK-11096
 URL: https://issues.apache.org/jira/browse/SPARK-11096
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Commented] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956390#comment-14956390
 ] 

Apache Spark commented on SPARK-11096:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9112

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11096:


Assignee: Apache Spark  (was: Reynold Xin)

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11097:
---

 Summary: Add connection established callback to lower level RPC 
layer so we don't need to check for new connections in NettyRpcHandler.receive
 Key: SPARK-11097
 URL: https://issues.apache.org/jira/browse/SPARK-11097
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin


I think we can remove the check for new connections in NettyRpcHandler.receive 
if we just add a channel registered callback to the lower level network module.
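
A rough sketch of what such a callback could look like is below; all names are
hypothetical illustrations and are not Spark's actual internal interfaces.
{code}
import java.net.SocketAddress

// Hypothetical names only -- this is not Spark's real RPC interface.
// If the transport layer invoked channelRegistered() once per new connection,
// receive() would no longer need to detect first-time connections itself.
trait ConnectionAwareRpcHandler {
  /** Invoked exactly once by the lower-level network module when a channel is registered. */
  def channelRegistered(remoteAddress: SocketAddress): Unit

  /** Invoked for every inbound message; no first-connection bookkeeping needed here. */
  def receive(remoteAddress: SocketAddress, message: Array[Byte]): Unit
}
{code}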




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956394#comment-14956394
 ] 

Reynold Xin commented on SPARK-11097:
-

cc [~vanzin] do you have time to do this?


> Add connection established callback to lower level RPC layer so we don't need 
> to check for new connections in NettyRpcHandler.receive
> -
>
> Key: SPARK-11097
> URL: https://issues.apache.org/jira/browse/SPARK-11097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> I think we can remove the check for new connections in 
> NettyRpcHandler.receive if we just add a channel registered callback to the 
> lower level network module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956367#comment-14956367
 ] 

Xiao Li edited comment on SPARK-10925 at 10/14/15 7:16 AM:
---

I also hit the same problem, but it is not related to UDFs. I am trying to 
narrow down the root cause inside the analyzer. 


was (Author: smilegator):
I also hit the same problem. I am trying to narrow down the root cause inside 
the analyzer. 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply a UDF on column "name"
> # apply a UDF on column "surname"
> # apply a UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 
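
For readers trying to reproduce the SPARK-10925 report above, here is a minimal
Scala sketch of the described workflow (UDF-cleaned columns, then two
aggregate-and-re-join steps). The column names, UDF body and input file are
illustrative assumptions; this is not the attached TestCase2.scala.
{code}
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Illustrative UDF: the report cleans the name, surname and birthDate columns.
val clean = udf((s: String) => if (s == null) s else s.trim)

val df = sqlContext.read.json("people.json")
  .withColumn("name", clean($"name"))
  .withColumn("surname", clean($"surname"))
  .withColumn("birthDate_cleaned", clean($"birthDate"))

val byName    = df.groupBy("name").count().withColumnRenamed("count", "name_count")
val bySurname = df.groupBy("surname").count().withColumnRenamed("count", "surname_count")

// The reporter sees the AnalysisException on the second aggregate-and-re-join step.
val joined = df.join(byName, "name").join(bySurname, "surname")
{code}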

[jira] [Created] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11099:
--

 Summary: Default conf property file is not loaded 
 Key: SPARK-11099
 URL: https://issues.apache.org/jira/browse/SPARK-11099
 Project: Spark
  Issue Type: Bug
Reporter: Jeff Zhang
Priority: Critical


spark.driver.extraClassPath doesn't take effect in the latest code; the root 
cause is that the default conf property file is not loaded.

The bug is caused by this code snippet in AbstractCommandBuilder
{code}
  Map<String, String> getEffectiveConfig() throws IOException {
if (effectiveConfig == null) {
  if (propertiesFile == null) {
effectiveConfig = conf;   
  } else {
effectiveConfig = new HashMap<>(conf);
Properties p = loadPropertiesFile();
for (String key : p.stringPropertyNames()) {
  if (!effectiveConfig.containsKey(key)) {
effectiveConfig.put(key, p.getProperty(key));
  }
}
  }
}
return effectiveConfig;
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11098:
---

 Summary: RPC message ordering is not guaranteed
 Key: SPARK-11098
 URL: https://issues.apache.org/jira/browse/SPARK-11098
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin


NettyRpcEnv doesn't guarantee message delivery order, since multiple threads send 
messages in the clientConnectionExecutor thread pool. We should fix that.
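
One way to picture the per-peer ordering idea is sketched below. This is an
illustration only, not Spark's actual implementation, and all names are
hypothetical.
{code}
import java.util.concurrent.{ExecutorService, Executors}
import scala.collection.mutable

// Hypothetical illustration: funnel every send to a given remote address through a
// single-threaded outbox, so messages leave in exactly the order they were queued.
class OrderedSender(send: (String, Array[Byte]) => Unit) {
  private val outboxes = mutable.HashMap.empty[String, ExecutorService]

  def enqueue(remoteAddress: String, message: Array[Byte]): Unit = {
    val outbox = synchronized {
      outboxes.getOrElseUpdate(remoteAddress, Executors.newSingleThreadExecutor())
    }
    outbox.submit(new Runnable {
      override def run(): Unit = send(remoteAddress, message)
    })
  }
}
{code}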




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956396#comment-14956396
 ] 

Reynold Xin commented on SPARK-11098:
-

[~vanzin]  zsxwing told me you were working on this. Let me know if it is not 
the case.


> RPC message ordering is not guaranteed
> --
>
> Key: SPARK-11098
> URL: https://issues.apache.org/jira/browse/SPARK-11098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order since there are multiple 
> threads sending messages in clientConnectionExecutor thread pool. We should 
> fix that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Component/s: Spark Submit

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956367#comment-14956367
 ] 

Xiao Li commented on SPARK-10925:
-

I also hit the same problem. I am trying to narrow down the root cause inside 
the analyzer. 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply a UDF on column "name"
> # apply a UDF on column "surname"
> # apply a UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 

[jira] [Assigned] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11096:


Assignee: Reynold Xin  (was: Apache Spark)

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956399#comment-14956399
 ] 

Jeff Zhang commented on SPARK-11099:


Will create a pull request soon

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Component/s: Spark Shell

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-14 Thread Weizhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956453#comment-14956453
 ] 

Weizhong commented on SPARK-11083:
--

I have retested on the latest master branch (ending at commit 
ce3f9a80657751ee0bc0ed6a9b6558acbb40af4f, [SPARK-11091] [SQL] Change 
spark.sql.canonicalizeView to spark.sql.nativeView.) and this issue has been 
fixed. But I am not yet sure which PR fixed it.

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
>
> 1. Start Thriftserver
> 2. Use beeline to connect to the thriftserver, then execute an "insert overwrite 
> table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to the thriftserver, and then execute an "insert overwrite table_name 
> ..." clause. -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> 

[jira] [Resolved] (SPARK-11101) pipe() operation OOM

2015-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11101.
---
Resolution: Invalid

If it's a question, you should ask at u...@spark.apache.org rather than filing a JIRA. 
It may have nothing to do with your process, though you do need to verify how 
much it uses. There is little margin in the YARN allocation for off-heap 
memory, so you probably have to increase this value, yes.

> pipe() operation OOM
> 
>
> Key: SPARK-11101
> URL: https://issues.apache.org/jira/browse/SPARK-11101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: spark on yarn
>Reporter: hotdog
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using the pipe() operation with large data (10 TB), the pipe() operation 
> always OOMs. 
> I use pipe() to call an external C++ process. I'm sure the C++ program only 
> uses a little memory (about 1 MB).
> My parameters:
> executor-memory 16g
> executor-cores 4
> num-executors 400
> "spark.yarn.executor.memoryOverhead", "8192"
> partition number: 6
> Does the pipe() operation use a lot of off-heap memory? 
> The log is:
> killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> Should I continue boosting spark.yarn.executor.memoryOverhead? Or are there 
> bugs in the pipe() operation?
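
For context on how pipe() is usually invoked, a small Scala sketch follows. The
input path, binary name and partition count are illustrative assumptions, and
repartitioning is only a commonly suggested mitigation; the key point is that the
child process launched by pipe() lives inside the YARN container's process tree,
so its memory is what spark.yarn.executor.memoryOverhead has to cover.
{code}
// Sketch only: typical pipe() usage with illustrative paths and numbers.
val piped = sc.textFile("hdfs:///input/huge-dataset")
  .repartition(10000)        // much smaller tasks than 10 TB spread over only 6 partitions
  .pipe("./my_cpp_filter")   // external program reads records on stdin, writes stdout

piped.saveAsTextFile("hdfs:///output/filtered")
{code}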



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11100:


Assignee: Apache Spark

> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>Assignee: Apache Spark
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11099:


Assignee: (was: Apache Spark)

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956470#comment-14956470
 ] 

Apache Spark commented on SPARK-11099:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9114

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11099:


Assignee: Apache Spark

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Updated] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Issue Type: Improvement  (was: Bug)

> Not readable exception when specifing non-exist input for JSON data source
> --
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11102:
--

 Summary: Not readable exception when specifing non-exist input for 
JSON data source
 Key: SPARK-11102
 URL: https://issues.apache.org/jira/browse/SPARK-11102
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Jeff Zhang


If I specify a non-existent input path for the JSON data source, the following 
exception is thrown, and it is not readable. 

{code}
15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 19.9 KB, free 251.4 KB)
15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
:19
java.io.IOException: No input paths specified in job
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $iwC$$iwC$$iwC$$iwC.(:30)
at $iwC$$iwC$$iwC.(:32)
at $iwC$$iwC.(:34)
at $iwC.(:36)
{code}
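
A sketch of the friendlier behaviour being asked for, expressed as a user-side
guard; the helper name is hypothetical and this is not the eventual fix.
{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: fail fast with a clear message before schema inference runs.
def readJsonOrFail(sqlContext: SQLContext, path: String): DataFrame = {
  val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
  require(fs.exists(new Path(path)), s"Input path does not exist: $path")
  sqlContext.read.json(path)
}
{code}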



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Priority: Minor  (was: Major)

> Not readable exception when specifing non-exist input for JSON data source
> --
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956491#comment-14956491
 ] 

Jeff Zhang commented on SPARK-11102:


Will create a pull request soon

> Not readable exception when specifing non-exist input for JSON data source
> --
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-10-14 Thread qian, chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956508#comment-14956508
 ] 

qian, chen edited comment on SPARK-6910 at 10/14/15 9:03 AM:
-

I'm using spark-sql (Spark 1.5.1 with Hadoop 2.4.0) and found something very 
interesting in the spark-sql shell.
At first I ran this query, and it took about 3 minutes:
select * from table1 where date='20151010' and hour='12' and name='x' limit 5;
Time taken: 164.502 seconds

Then I ran this one, and it only took about 10 seconds. date, hour and name are 
partition columns in this Hive table, and the table has >4000 partitions:
select * from table1 where date='20151010' and hour='13' limit 5;
Time taken: 10.881 seconds
Is it because the first query has to download all partition information from the 
Hive metastore, while the second one is faster because all partitions are now 
cached in memory?
Any suggestions for speeding up the first query?


was (Author: nedqian):
I'm using spark-sql (Spark 1.5.1 with Hadoop 2.4.0) and found something very 
interesting in the spark-sql shell.
At first I ran this query, and it took about 3 minutes:
select * from table1 where date='20151010' and hour='12' and name='x' limit 5;
Time taken: 164.502 seconds

Then I ran this one, and it only took about 10 seconds. date, hour and name are 
partition columns in this Hive table, and the table has >4000 partitions:
select * from table1 where date='20151010' and hour='13' limit 5;
Time taken: 10.881 seconds
Is it because the first query has to download all partition information from the 
Hive metastore, while the second one is faster because all partitions are now 
cached in memory?
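
One setting that may be worth trying for the slow first query (an assumption on
the editor's part, not something confirmed in this thread): let the metastore
filter partitions instead of shipping metadata for all of them to the driver on
first access.
{code}
// Assumption: spark.sql.hive.metastorePartitionPruning (disabled by default in 1.5)
// pushes the partition predicates down to the Hive metastore, so the first query
// does not have to fetch metadata for all >4000 partitions up front.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.sql(
  "select * from table1 where date='20151010' and hour='12' and name='x' limit 5")
{code}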

> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7099.
--
Resolution: Not A Problem

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11101) pipe() operation OOM

2015-10-14 Thread hotdog (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hotdog updated SPARK-11101:
---
Description: 
When using the pipe() operation with large data (10 TB), the pipe() operation 
always OOMs. 
I use pipe() to call an external C++ process. I'm sure the C++ program only 
uses a little memory (about 1 MB).
My parameters:
executor-memory 16g
executor-cores 4
num-executors 400
"spark.yarn.executor.memoryOverhead", "8192"
partition number: 6

Does the pipe() operation use a lot of off-heap memory? 
The log is:
killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

Should I continue boosting spark.yarn.executor.memoryOverhead? Or are there 
bugs in the pipe() operation?


  was:
When using the pipe() operation with large data (10 TB), the pipe() operation 
always OOMs. 
My parameters:
executor-memory 16g
executor-cores 4
num-executors 400
"spark.yarn.executor.memoryOverhead", "8192"
partition number: 6

Does the pipe() operation use a lot of off-heap memory? 
The log is:
killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

Should I continue boosting spark.yarn.executor.memoryOverhead? Or are there 
bugs in the pipe() operation?



> pipe() operation OOM
> 
>
> Key: SPARK-11101
> URL: https://issues.apache.org/jira/browse/SPARK-11101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: spark on yarn
>Reporter: hotdog
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using the pipe() operation with large data (10 TB), the pipe() operation 
> always OOMs. 
> I use pipe() to call an external C++ process. I'm sure the C++ program only 
> uses a little memory (about 1 MB).
> My parameters:
> executor-memory 16g
> executor-cores 4
> num-executors 400
> "spark.yarn.executor.memoryOverhead", "8192"
> partition number: 6
> Does the pipe() operation use a lot of off-heap memory? 
> The log is:
> killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> Should I continue boosting spark.yarn.executor.memoryOverhead? Or are there 
> bugs in the pipe() operation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11100:


Assignee: (was: Apache Spark)

> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956467#comment-14956467
 ] 

Apache Spark commented on SPARK-11100:
--

User 'xiaowangyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9113

> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-10-14 Thread qian, chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956508#comment-14956508
 ] 

qian, chen commented on SPARK-6910:
---

I'm using spark-sql (Spark 1.5.1 with Hadoop 2.4.0) and noticed something very 
interesting in the spark-sql shell.
The first query I ran took about 3 minutes:
select * from table1 where date='20151010' and hour='12' and name='x' limit 5;
Time taken: 164.502 seconds

The next query took only about 10 seconds. date, hour and name are partition 
columns in this Hive table, which has more than 4000 partitions:
select * from table1 where date='20151010' and hour='13' limit 5;
Time taken: 10.881 seconds
Is the first query slow because it has to download all the partition information 
from the Hive metastore, and the second one faster because the partitions are 
now cached in memory?
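A minimal spark-shell sketch of the knob this ticket introduced (an illustrative note; it assumes Spark 1.5+, where metastore-side pruning sits behind the spark.sql.hive.metastorePartitionPruning flag and is off by default), so the first query no longer has to pull all >4000 partitions from the metastore:
{code}
// Hedged sketch, Spark 1.5.x assumed: push partition predicates down to the
// Hive metastore so only the matching partitions are fetched.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.sql(
  "select * from table1 where date='20151010' and hour='12' and name='x' limit 5").show()
{code}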

> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Affects Version/s: 1.5.1

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root 
> cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
>     if (effectiveConfig == null) {
>       if (propertiesFile == null) {
>         effectiveConfig = conf;   // returns from here if no propertyFile is provided
>       } else {
>         effectiveConfig = new HashMap<>(conf);
>         Properties p = loadPropertiesFile();  // default propertyFile will load here
>         for (String key : p.stringPropertyNames()) {
>           if (!effectiveConfig.containsKey(key)) {
>             effectiveConfig.put(key, p.getProperty(key));
>           }
>         }
>       }
>     }
>     return effectiveConfig;
>   }
> {code}
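A quick spark-shell check, added here as an illustrative sketch rather than part of the report: with the early return above, classpath entries coming from spark.driver.extraClassPath in the default spark-defaults.conf never reach the driver JVM, so they should be missing from this output.
{code}
// Hedged sketch: if the entries configured via spark.driver.extraClassPath in
// spark-defaults.conf are absent here, the launcher never merged the default
// property file.
println(System.getProperty("java.class.path"))
{code}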



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10845.
-
Resolution: Fixed

I backported it.


> SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"
> -
>
> Key: SPARK-10845
> URL: https://issues.apache.org/jira/browse/SPARK-10845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: backport-needed
> Fix For: 1.5.2, 1.6.0
>
>
> When refactoring SQL options from plain strings to the strongly typed 
> {{SQLConfEntry}}, {{spark.sql.hive.version}} wasn't migrated, and doesn't 
> show up in the result of {{SET -v}}, as {{SET -v}} only shows public 
> {{SQLConfEntry}} instances.
> This affects compatibility with Simba ODBC driver.
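A minimal way to see the symptom from a spark-shell session (an illustrative sketch assuming the 1.5.0 HiveContext): list the public options and note that spark.sql.hive.version is missing.
{code}
// Hedged sketch: on 1.5.0 this listing omits spark.sql.hive.version, since
// SET -v only reports public SQLConfEntry instances.
sqlContext.sql("SET -v").show(1000, truncate = false)
{code}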



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10845:

Fix Version/s: 1.5.2

> SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"
> -
>
> Key: SPARK-10845
> URL: https://issues.apache.org/jira/browse/SPARK-10845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: backport-needed
> Fix For: 1.5.2, 1.6.0
>
>
> When refactoring SQL options from plain strings to the strongly typed 
> {{SQLConfEntry}}, {{spark.sql.hive.version}} wasn't migrated, and doesn't 
> show up in the result of {{SET -v}}, as {{SET -v}} only shows public 
> {{SQLConfEntry}} instances.
> This affects compatibility with Simba ODBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11105) Distribute the log4j.properties files from the client to the executors

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11105:

Target Version/s:   (was: 1.5.1)

> Distribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10973) __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957641#comment-14957641
 ] 

Joseph K. Bradley commented on SPARK-10973:
---

Yes, thanks!

> __getitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.3.2, 1.4.2, 1.5.2, 1.6.0
>
>
> \_\_getitem\_\_ method throws IndexError exception when we try to access 
> index after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-1:
--

 Summary: Fast null-safe join
 Key: SPARK-1
 URL: https://issues.apache.org/jira/browse/SPARK-1
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu


Today, null safe joins are executed with a Cartesian product.
{code}
scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
== Physical Plan ==
TungstenProject [i#2,j#3,i#7,j#8]
 Filter (i#2 <=> i#7)
  CartesianProduct
   LocalTableScan [i#2,j#3], [[1,1]]
   LocalTableScan [i#7,j#8], [[1,1]]
{code}
One option is to add this rewrite to the optimizer:
{code}
select * 
from t a 
join t b 
  on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
{code}
Acceptance criteria: joins with only null safe equality should not result in a 
Cartesian product.
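Until such an optimizer rule lands, the same rewrite can be applied by hand through the DataFrame API. A hedged sketch (the dfA/dfB names and the -1 sentinel are assumptions for illustration; the sentinel must be a value that never occurs in column i):
{code}
import org.apache.spark.sql.functions.{coalesce, lit}

// Hedged sketch of the manual rewrite: an ordinary equi-join key first, then the
// exact null-safe predicate, so the planner can pick a hash join instead of a
// Cartesian product.
val joined = dfA.join(dfB,
  coalesce(dfA("i"), lit(-1)) === coalesce(dfB("i"), lit(-1)) &&
    (dfA("i") <=> dfB("i")))
{code}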




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-0?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-0:

Assignee: Jakob Odersky

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-0
> URL: https://issues.apache.org/jira/browse/SPARK-0
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957546#comment-14957546
 ] 

Reynold Xin commented on SPARK-6235:


Is your data skewed? i.e. maybe there is a single key that's enormous?


> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10577:

Fix Version/s: 1.5.2

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Jian Feng Zhang
>  Labels: starter
> Fix For: 1.5.2, 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for broadcast join in:
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-0?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957721#comment-14957721
 ] 

Jakob Odersky commented on SPARK-0:
---

exactly what I got, I'll take a look at it

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-0
> URL: https://issues.apache.org/jira/browse/SPARK-0
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957554#comment-14957554
 ] 

Glenn Strycker commented on SPARK-6235:
---

I don't think so, but I can check. My RDD came from an RDD of type (K, V) that 
was partitioned by key and worked just fine. The new RDD that is failing maps 
each (K, V) pair to (V, K), so it is now partitioned by the value (the new key) 
instead. I can try running some multiplicity checks to see whether my values 
are skewed, but unfortunately most of those checks involve reduceByKey-like 
operations that will probably hit the 2GB failures themselves. I was hoping to 
get the (K, V) -> (V, K) mapping and repartitioning done first before running 
such checks. Thanks for the suggestion, though!
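A hedged spark-shell sketch of a cheap skew check that avoids a full shuffle over the failing RDD (the toy vk value and the 0.1% sample fraction are assumptions for illustration):
{code}
// Hedged sketch: estimate key multiplicity on a small sample of the (V, K) RDD
// so the check itself stays far away from the 2GB limits discussed here.
val vk = sc.parallelize(Seq(("v1", 1), ("v1", 2), ("v2", 3)))  // toy stand-in
val topKeys = vk
  .sample(false, 0.001)                    // look at roughly 0.1% of the records
  .map { case (k, _) => (k, 1L) }          // one count per key occurrence
  .reduceByKey(_ + _)
  .top(10)(Ordering.by(_._2))              // ten most frequent keys in the sample
topKeys.foreach(println)
{code}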

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10538:

Target Version/s: 1.6.0  (was: 1.5.2)

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Davies Liu
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (in my example, 20 of them).
> I can observe that during calculation of the first partition (on one of the 
> consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) 
> vs the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a 
> few GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957569#comment-14957569
 ] 

Reynold Xin commented on SPARK-10577:
-

I also backported this into branch-1.5 so this can be included in 1.5.2.


> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Jian Feng Zhang
>  Labels: starter
> Fix For: 1.5.2, 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for broadcast join in:
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10973) __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957587#comment-14957587
 ] 

Reynold Xin commented on SPARK-10973:
-

[~josephkb] this should be closed now right?

> __getitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> \_\_getitem\_\_ method throws IndexError exception when we try to access 
> index after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Resolved] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11096.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2015-10-14 Thread Balaji Krish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957661#comment-14957661
 ] 

Balaji Krish commented on SPARK-10528:
--

The following steps solved my problem:

1. Open a Command Prompt in admin mode
2. Run: winutils.exe chmod 777 /tmp/hive
3. Start the shell: spark-shell --master local[2]

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11105) Distribute the log4j.properties files from the client to the executors

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11105:


Assignee: (was: Apache Spark)

> Distribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10973) __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10973.
-
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.4.2
   1.3.2

> __getitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.3.2, 1.4.2, 1.5.2, 1.6.0
>
>
> \_\_getitem\_\_ method throws IndexError exception when we try to access 
> index after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1:


Assignee: Davies Liu  (was: Apache Spark)

> Fast null-safe join
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Today, null safe joins are executed with a Cartesian product.
> {code}
> scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
> == Physical Plan ==
> TungstenProject [i#2,j#3,i#7,j#8]
>  Filter (i#2 <=> i#7)
>   CartesianProduct
>LocalTableScan [i#2,j#3], [[1,1]]
>LocalTableScan [i#7,j#8], [[1,1]]
> {code}
> One option is to add this rewrite to the optimizer:
> {code}
> select * 
> from t a 
> join t b 
>   on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
> {code}
> Acceptance criteria: joins with only null safe equality should not result in 
> a Cartesian product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1:


Assignee: Apache Spark  (was: Davies Liu)

> Fast null-safe join
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Today, null safe joins are executed with a Cartesian product.
> {code}
> scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
> == Physical Plan ==
> TungstenProject [i#2,j#3,i#7,j#8]
>  Filter (i#2 <=> i#7)
>   CartesianProduct
>LocalTableScan [i#2,j#3], [[1,1]]
>LocalTableScan [i#7,j#8], [[1,1]]
> {code}
> One option is to add this rewrite to the optimizer:
> {code}
> select * 
> from t a 
> join t b 
>   on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
> {code}
> Acceptance criteria: joins with only null safe equality should not result in 
> a Cartesian product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957669#comment-14957669
 ] 

Apache Spark commented on SPARK-1:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9120

> Fast null-safe join
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Today, null safe joins are executed with a Cartesian product.
> {code}
> scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
> == Physical Plan ==
> TungstenProject [i#2,j#3,i#7,j#8]
>  Filter (i#2 <=> i#7)
>   CartesianProduct
>LocalTableScan [i#2,j#3], [[1,1]]
>LocalTableScan [i#7,j#8], [[1,1]]
> {code}
> One option is to add this rewrite to the optimizer:
> {code}
> select * 
> from t a 
> join t b 
>   on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
> {code}
> Acceptance criteria: joins with only null safe equality should not result in 
> a Cartesian product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11112) DAG visualization: display RDD callsite

2015-10-14 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2:
-

 Summary: DAG visualization: display RDD callsite
 Key: SPARK-2
 URL: https://issues.apache.org/jira/browse/SPARK-2
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11112) DAG visualization: display RDD callsite

2015-10-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7463

> DAG visualization: display RDD callsite
> ---
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-0?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-0:

Priority: Critical  (was: Major)

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-0
> URL: https://issues.apache.org/jira/browse/SPARK-0
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>Priority: Critical
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8386) DataFrame and JDBC regression

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8386.

   Resolution: Fixed
 Assignee: Huaxin Gao
Fix Version/s: 1.6.0
   1.5.2
   1.4.2

> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Assignee: Huaxin Gao
>Priority: Critical
> Fix For: 1.4.2, 1.5.2, 1.6.0
>
>
> I have an ETL app that appends to a JDBC table new results found at each run. 
>  In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite flag to true. I also tried 
> this:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> and I get the same error. It works the first time, creating the new table and 
> adding data successfully, but on a second run the JDBC driver tells me that 
> the table already exists. Even SaveMode.Overwrite gives me the same error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-0:
---

 Summary: Scala 2.11 build fails due to compiler errors
 Key: SPARK-0
 URL: https://issues.apache.org/jira/browse/SPARK-0
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell


Right now the 2.11 build is failing due to compiler errors in SBT (though not 
in Maven). I have updated our 2.11 compile test harness to catch this.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull

{code}
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
 no valid targets for annotation on value conf - it is discarded unused. You 
may specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error] 
{code}

This is one error, but there may be others past this point (the compile fails 
fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957926#comment-14957926
 ] 

Michael Armbrust commented on SPARK-:
-

[~sandyr] did you look at the test cases [in 
scala|https://github.com/marmbrus/spark/blob/d0277f5013fd9e5e758c607b5c833cf5aa7bb93c/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala]
 and 
[java|https://github.com/marmbrus/spark/blob/d0277f5013fd9e5e758c607b5c833cf5aa7bb93c/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java]
 linked from the attached design doc?

In Scala, users should never have to think about Encoders as long as their data 
can be represented as primitives, case classes, tuples, or collections.  
Implicits (provided by {{sqlContext.implicits._}}) automatically pass the 
required information to the function.  
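A hedged Scala sketch of what that looks like against the proposed API (the toDS helper name is taken from the linked prototype and is an assumption that may change before this is merged):
{code}
case class Person(name: String, age: Long)

import sqlContext.implicits._   // brings the implicit Encoders into scope

// Hedged sketch: no Encoder is constructed or passed explicitly; the implicits
// derive one for the case class and for the String result of the map.
val ds = Seq(Person("a", 30L), Person("b", 40L)).toDS()
val names = ds.map(_.name)
{code}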

In Java, the compiler is not helping us out as much, so the user must do as you 
suggest.  The prototype shows {{ProductEncoder.tuple(Long.class, Long.class)}}, 
but we will have a similar interface that works for class objects for POJOs / 
JavaBeans.  The problem with doing this using a registry (like kryo in RDDs 
today) is that then you aren't finding out the object type until you have an 
example object from realizing the computation.  That is often too late to do 
the kinds of optimizations that we are trying to enable.  Instead we'd like to 
statically realize the schema at Dataset construction time.

Encoders are just an encapsulation of the required information and provide an 
interface if we ever want to allow someone to specify a custom encoder.

Regarding the performance concerns with reflection, the implementation that is 
already present in Spark master ([SPARK-10993] and [SPARK-11090]) is based on 
catalyst expressions.  Reflection is done once on the driver, and the existing 
code generation caching framework is taking care of caching generated encoder 
bytecode on the executors.

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Commented] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957935#comment-14957935
 ] 

Apache Spark commented on SPARK-10534:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/9123

> ORDER BY clause allows only columns that are present in SELECT statement
> 
>
> Key: SPARK-10534
> URL: https://issues.apache.org/jira/browse/SPARK-10534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michal Cwienczek
>
> When invoking the query SELECT EmployeeID from Employees order by YEAR(HireDate), 
> Spark 1.5 throws an exception:
> {code}
> cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given 
> input columns EmployeeID; line 2 pos 14 StackTrace: 
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns 
> EmployeeID; line 2 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 

[jira] [Commented] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957940#comment-14957940
 ] 

Apache Spark commented on SPARK-11078:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/9124

> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957965#comment-14957965
 ] 

Hyukjin Kwon commented on SPARK-11103:
--

In this case it should be fine, because the filter is not pushed down to Parquet; 
the data is filtered by Spark's own filter.

If you turn off spark.sql.parquet.filterPushdown (which is true by default), the 
original case should also work okay. 
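A hedged sketch of that workaround for a spark-shell session:
{code}
// Hedged sketch: disable Parquet filter pushdown (true by default in 1.5.x) so
// the filter on the merged-schema column is applied by Spark rather than Parquet.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select col1 from `table3` where col2 = 2").show()
{code}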

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found across the different Parquet files, but when trying to query the data it 
> is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Assigned] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4:


Assignee: Davies Liu  (was: Apache Spark)

> Add getOrCreate for SparkContext/SQLContext for Python
> --
>
> Key: SPARK-4
> URL: https://issues.apache.org/jira/browse/SPARK-4
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Also SQLContext.newSession()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11116) Initial API Draft

2015-10-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6:


 Summary: Initial API Draft
 Key: SPARK-6
 URL: https://issues.apache.org/jira/browse/SPARK-6
 Project: Spark
  Issue Type: Sub-task
Reporter: Michael Armbrust
Assignee: Michael Armbrust


The goal here is to spec out the main functions to give people an idea of what 
using the API would be like.  Optimization and whatnot can be done in a follow 
up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Alexis Seigneurin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957731#comment-14957731
 ] 

Alexis Seigneurin commented on SPARK-10925:
---

Well, technically, it's not a duplicate column. An inner join between two 
DataFrames on a column that carries the same name on both sides is supposed to 
work and to retain only one column.

I had noticed that renaming one of the columns is a workaround, and that's what 
I'm doing until this issue gets fixed.

One thing to note, though, is that this code used to work with Spark 1.4 (I 
have only adjusted the calls to the UDFs to use the new API). This means there 
must be a regression in the query analyzer.
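A hedged sketch of that rename workaround (the DataFrame and column names are assumed for illustration, not taken from the attached test case):
{code}
// Hedged sketch: rename the aggregated column before the join so the analyzer
// never sees two attributes with the same name, then drop the helper column.
val aggRenamed = aggDf.withColumnRenamed("name", "name_agg")
val joined = df.join(aggRenamed, df("name") === aggRenamed("name_agg"))
  .drop("name_agg")
{code}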

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply a UDF on column "name"
> # apply a UDF on column "surname"
> # apply a UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at 

[jira] [Updated] (SPARK-11113) Remove DeveloperApi annotation from private classes

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3:

Description: 
For a variety of reasons, we tagged a bunch of internal classes in the 
execution package in SQL as DeveloperApi.


  was:
For a variety of reasons, we tagged a bunch of internal classes in SQL as 
DeveloperApi.



> Remove DeveloperApi annotation from private classes
> ---
>
> Key: SPARK-11113
> URL: https://issues.apache.org/jira/browse/SPARK-11113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> For a variety of reasons, we tagged a bunch of internal classes in the 
> execution package in SQL as DeveloperApi.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11113) Remove DeveloperApi annotation from private classes

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11113:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove DeveloperApi annotation from private classes
> ---
>
> Key: SPARK-11113
> URL: https://issues.apache.org/jira/browse/SPARK-11113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> For a variety of reasons, we tagged a bunch of internal classes in the 
> execution package in SQL as DeveloperApi.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11113) Remove DeveloperApi annotation from private classes

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11113:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove DeveloperApi annotation from private classes
> ---
>
> Key: SPARK-11113
> URL: https://issues.apache.org/jira/browse/SPARK-11113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> For a variety of reasons, we tagged a bunch of internal classes in the 
> execution package in SQL as DeveloperApi.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11113) Remove DeveloperApi annotation from private classes

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957776#comment-14957776
 ] 

Apache Spark commented on SPARK-11113:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9121

> Remove DeveloperApi annotation from private classes
> ---
>
> Key: SPARK-11113
> URL: https://issues.apache.org/jira/browse/SPARK-11113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> For a variety of reasons, we tagged a bunch of internal classes in the 
> execution package in SQL as DeveloperApi.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11113) Remove DeveloperApi annotation from private classes

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11113:
---

 Summary: Remove DeveloperApi annotation from private classes
 Key: SPARK-11113
 URL: https://issues.apache.org/jira/browse/SPARK-11113
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


For a variety of reasons, we tagged a bunch of internal classes in SQL as 
DeveloperApi.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11114:


Assignee: Apache Spark  (was: Davies Liu)

> Add getOrCreate for SparkContext/SQLContext for Python
> --
>
> Key: SPARK-11114
> URL: https://issues.apache.org/jira/browse/SPARK-11114
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Also SQLContext.newSession()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python

2015-10-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11114:
--

 Summary: Add getOrCreate for SparkContext/SQLContext for Python
 Key: SPARK-11114
 URL: https://issues.apache.org/jira/browse/SPARK-11114
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu


Also SQLContext.newSession()
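
A sketch of how the proposed Python API might be used (method names follow the 
issue title and the existing Scala counterparts; the exact signatures are an 
assumption until the change lands):
{code}
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Reuse the already-running SparkContext if there is one, otherwise create it.
sc = SparkContext.getOrCreate(SparkConf().setAppName("example"))

# Same pattern for SQLContext; newSession() would give an isolated session
# (its own temp tables and UDF registrations) on top of the same SparkContext.
sqlContext = SQLContext.getOrCreate(sc)
other = sqlContext.newSession()
{code}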



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11114) Add getOrCreate for SparkContext/SQLContext for Python

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957779#comment-14957779
 ] 

Apache Spark commented on SPARK-11114:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9122

> Add getOrCreate for SparkContext/SQLContext for Python
> --
>
> Key: SPARK-11114
> URL: https://issues.apache.org/jira/browse/SPARK-11114
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Also SQLContext.newSession()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10534:


Assignee: Apache Spark

> ORDER BY clause allows only columns that are present in SELECT statement
> 
>
> Key: SPARK-10534
> URL: https://issues.apache.org/jira/browse/SPARK-10534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michal Cwienczek
>Assignee: Apache Spark
>
> When invoking the query SELECT EmployeeID from Employees order by YEAR(HireDate), 
> Spark 1.5 throws an exception (a minimal repro sketch is shown first, then the 
> stack trace):
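> A minimal way to exercise the reported query shape (table and column names 
> come from the report; the toy data and temp-table registration are assumed 
> for illustration):
> {code}
> import datetime
> from pyspark import SparkContext
> from pyspark.sql import Row, SQLContext
> 
> sc = SparkContext(appName="order-by-repro")
> sqlContext = SQLContext(sc)
> 
> employees = sqlContext.createDataFrame(
>     [Row(EmployeeID=1, HireDate=datetime.date(2012, 5, 1))])
> employees.registerTempTable("Employees")
> 
> # ORDER BY refers to a column that is not in the SELECT list
> sqlContext.sql(
>     "SELECT EmployeeID FROM Employees ORDER BY YEAR(HireDate)").show()
> {code}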
> {code}
> cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given 
> input columns EmployeeID; line 2 pos 14 StackTrace: 
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns 
> EmployeeID; line 2 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> 
