[jira] [Updated] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists
[ https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17303: --- Assignee: Frederick Reiss > dev/run-tests fails if spark-warehouse directory exists > --- > > Key: SPARK-17303 > URL: https://issues.apache.org/jira/browse/SPARK-17303 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Frederick Reiss >Assignee: Frederick Reiss >Priority: Minor > Fix For: 2.1.0 > > > The script dev/run-tests, which is intended for verifying the correctness of > pull requests, runs Apache RAT to check for missing Apache license headers. > Later, the script does a full compile/package/test sequence. > The script as currently written works fine the first time. But the second > time it runs, the Apache RAT checks fail due to the presence of the directory > spark-warehouse, which the script indirectly creates during its regression > test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists
[ https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17303. Resolution: Fixed Fix Version/s: 2.1.0 > dev/run-tests fails if spark-warehouse directory exists > --- > > Key: SPARK-17303 > URL: https://issues.apache.org/jira/browse/SPARK-17303 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Frederick Reiss >Priority: Minor > Fix For: 2.1.0 > > > The script dev/run-tests, which is intended for verifying the correctness of > pull requests, runs Apache RAT to check for missing Apache license headers. > Later, the script does a full compile/package/test sequence. > The script as currently written works fine the first time. But the second > time it runs, the Apache RAT checks fail due to the presence of the directory > spark-warehouse, which the script indirectly creates during its regression > test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
Aseem Bansal created SPARK-17307: Summary: Document what all access is needed on S3 bucket when trying to save a model Key: SPARK-17307 URL: https://issues.apache.org/jira/browse/SPARK-17307 Project: Spark Issue Type: Documentation Reporter: Aseem Bansal I faced this lack of documentation when I was trying to save a model to S3. Initially I thought it should be only write. Then I found it also needs delete to delete temporary files. Now I requested access for delete, tried again, and I am getting the error Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/dev-qa_%24folder%24' XML Error Message To reproduce this error, the code below can be used {code} SparkSession sparkSession = SparkSession .builder() .appName("my app") .master("local") .getOrCreate(); JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext()); jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ); jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", ); //Create a PipelineModel pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest"); {code} This back and forth could be avoided if it was clearly documented what access Spark needs to write to S3. It would also be great to explain why each kind of access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
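For illustration, here is a hedged Scala rendering of the Java snippet above, with obviously fake placeholder credentials and bucket name standing in for the values the reporter redacted; it is a sketch of the reported scenario, not a recommended configuration.
{code}
// Scala sketch of the scenario in SPARK-17307; <ACCESS_KEY>, <SECRET_KEY> and
// <BUCKET> are placeholders for the values redacted in the report.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.PipelineModel

val spark = SparkSession.builder()
  .appName("my app")
  .master("local")
  .getOrCreate()

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY>")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "<SECRET_KEY>")

// Stand-in for the redacted pipeline-fitting code in the original snippet.
val pipelineModel: PipelineModel = ???

// Per the reporter's observation, this call needs write *and* delete access
// on the bucket, because temporary files are created and then cleaned up.
pipelineModel.write.overwrite().save("s3n://<BUCKET>/dev-qa/modelTest")
{code}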
[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings
[ https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447793#comment-15447793 ] Teng Yutong commented on SPARK-17290: - So this is a known issue. Sorry for the duplication. Should I just close this one? > Spark CSVInferSchema does not always respect nullValue settings > --- > > Key: SPARK-17290 > URL: https://issues.apache.org/jira/browse/SPARK-17290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Teng Yutong > > When loading a csv-formatted data file into a table which has a boolean type > column, if the boolean value is not given and the nullValue has been set, > CSVInferSchema will fail to parse the data. > e.g.: > table schema: create table test(id varchar(10), flag boolean) USING > com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue > '') > csv data example: > aa, > bb,true > cc,false > After some investigation, I found that CSVInferSchema will not check whether > the current string matches the nullValue or not if the target data type is > Boolean, Timestamp, or Date. > I am wondering whether this logic is intentional or not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
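As context for reproducing the report, here is a hedged sketch of the same scenario through the built-in CSV reader in a spark-shell session; the options mirror the ones quoted above, and the exact failure mode may differ by Spark version.
{code}
// test.csv is assumed to contain the three rows from the report:
//   aa,
//   bb,true
//   cc,false
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("flag", BooleanType)))

val df = spark.read
  .schema(schema)
  .option("header", "false")
  .option("nullValue", "")   // empty strings should be read back as NULL...
  .csv("test.csv")

// ...but per the report, the empty boolean in the first row is not compared
// against nullValue, so parsing the Boolean (or Timestamp/Date) column fails.
df.show()
{code}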
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447791#comment-15447791 ] Sun Rui commented on SPARK-13525: - Yes, if Spark fails to launch the R worker as a process, it should throw an IOException. So throwing SocketTimeoutException implies that the R worker process has been successfully launched. But your experiment shows that daemon.R never seems to get a chance to be executed by Rscript. So I guess something must be wrong after Rscript is launched but before it begins to interpret the R script in daemon.R. Could you write a shell wrapper for Rscript and set the configuration option to point to it? The wrapper would write some debug information to a temp file and in turn launch Rscript. Remember to redirect the stdout and stderr of Rscript for debugging. > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head and summary and filter methods are not overridden by spark. Hence > I need to call them using `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) 
: > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPa
[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings
[ https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447775#comment-15447775 ] Liwei Lin commented on SPARK-17290: --- Oh [~hyukjin.kwon] you are so devoted :-D > Spark CSVInferSchema does not always respect nullValue settings > --- > > Key: SPARK-17290 > URL: https://issues.apache.org/jira/browse/SPARK-17290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Teng Yutong > > When loading a csv-formatted data file into a table which has a boolean type > column, if the boolean value is not given and the nullValue has been set, > CSVInferSchema will fail to parse the data. > e.g.: > table schema: create table test(id varchar(10), flag boolean) USING > com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue > '') > csv data example: > aa, > bb,true > cc,false > After some investigation, I found that CSVInferSchema will not check whether > the current string matches the nullValue or not if the target data type is > Boolean, Timestamp, or Date. > I am wondering whether this logic is intentional or not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings
[ https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447758#comment-15447758 ] Hyukjin Kwon commented on SPARK-17290: -- BTW, there is a related PR here, https://github.com/apache/spark/pull/14118 > Spark CSVInferSchema does not always respect nullValue settings > --- > > Key: SPARK-17290 > URL: https://issues.apache.org/jira/browse/SPARK-17290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Teng Yutong > > When loading a csv-formatted data file into a table which has a boolean type > column, if the boolean value is not given and the nullValue has been set, > CSVInferSchema will fail to parse the data. > e.g.: > table schema: create table test(id varchar(10), flag boolean) USING > com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue > '') > csv data example: > aa, > bb,true > cc,false > After some investigation, I found that CSVInferSchema will not check whether > the current string matches the nullValue or not if the target data type is > Boolean, Timestamp, or Date. > I am wondering whether this logic is intentional or not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings
[ https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447755#comment-15447755 ] Hyukjin Kwon commented on SPARK-17290: -- This should be a duplicate of SPARK-16462, SPARK-16460, SPARK-15144 and SPARK-16903 > Spark CSVInferSchema does not always respect nullValue settings > --- > > Key: SPARK-17290 > URL: https://issues.apache.org/jira/browse/SPARK-17290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Teng Yutong > > When loading a csv-formatted data file into a table which has a boolean type > column, if the boolean value is not given and the nullValue has been set, > CSVInferSchema will fail to parse the data. > e.g.: > table schema: create table test(id varchar(10), flag boolean) USING > com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue > '') > csv data example: > aa, > bb,true > cc,false > After some investigation, I found that CSVInferSchema will not check whether > the current string matches the nullValue or not if the target data type is > Boolean, Timestamp, or Date. > I am wondering whether this logic is intentional or not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage
[ https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447710#comment-15447710 ] Sun Rui commented on SPARK-13573: - [~chipsenkbeil], we have made public the methods for creating Java objects and invoking object methods: sparkR.callJMethod(), sparkR.callJStatic(), sparkR.newJObject(). Please refer to SPARK-16581 and https://github.com/apache/spark/blob/master/R/pkg/R/jvm.R > Open SparkR APIs (R package) to allow better 3rd party usage > > > Key: SPARK-13573 > URL: https://issues.apache.org/jira/browse/SPARK-13573 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Chip Senkbeil > > Currently, SparkR's R package does not expose enough of its APIs to be used > flexibly. As far as I am aware, SparkR still requires you to create a new > SparkContext by invoking the sparkR.init method (so you cannot connect to a > running one) and there is no way to invoke custom Java methods using the > exposed SparkR API (unlike PySpark). > We currently maintain a fork of SparkR that is used to power the R > implementation of Apache Toree, which is a gateway to use Apache Spark. This > fork provides a connect method (to use an existing Spark Context), exposes > needed methods like invokeJava (to be able to communicate with our JVM to > retrieve code to run, etc), and uses reflection to access > org.apache.spark.api.r.RBackend. > Here is the documentation I recorded regarding changes we need to enable > SparkR as an option for Apache Toree: > https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17301) Remove unused classTag field from AtomicType base class
[ https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17301. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Remove unused classTag field from AtomicType base class > --- > > Key: SPARK-17301 > URL: https://issues.apache.org/jira/browse/SPARK-17301 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which > is causing unnecessary slowness in deserialization because it needs to grab > ScalaReflectionLock and create a new runtime reflection mirror. Removing this > unused code gives a small but measurable performance boost in SQL task > deserialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3162: --- Assignee: Apache Spark > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3162: --- Assignee: (was: Apache Spark) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447662#comment-15447662 ] Apache Spark commented on SPARK-3162: - User 'smurching' has created a pull request for this issue: https://github.com/apache/spark/pull/14872 > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447642#comment-15447642 ] Srinath commented on SPARK-17298: - I've updated the description. Hopefully it is clearer. Note that before this change, even with spark.sql.crossJoin.enabled = false, case 1.a may sometimes NOT throw an error (i.e. execute successfully) depending on the physical plan chosen. With the proposed change, it would always throw an error > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default configuration (spark.sql.crossJoin.enabled = false). > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
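To make the proposal concrete, here is a hedged spark-shell sketch of the behaviour described above; the crossJoin method and the error on implicit cartesian products are what the ticket proposes, not necessarily what any released version does.
{code}
// Assuming a spark-shell session, so `spark` is already in scope.
val left  = spark.range(3).toDF("a")
val right = spark.range(3).toDF("b")

// No join condition involves columns from both sides, i.e. a cartesian
// product. Under the proposal this fails unless crossJoin is used explicitly:
// left.join(right).count()

// Explicit cross join via the proposed DataFrame API is always allowed:
left.crossJoin(right).count()

// Or disable the check globally and keep the old implicit behaviour:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
left.join(right).count()
{code}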
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Description: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration (spark.sql.crossJoin.enabled = false). By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. was: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration with spark.sql.crossJoin.enabled = false. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default configuration (spark.sql.crossJoin.enabled = false). > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Description: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default cross_join_. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. was: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default cross_join_. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Description: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration with spark.sql.crossJoin.enabled = false. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. was: Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default cross_join_. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations under the > default configuration with spark.sql.crossJoin.enabled = false. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. > Turning on the spark.sql.crossJoin.enabled configuration flag will disable > this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinath updated SPARK-17298: Summary: Require explicit CROSS join for cartesian products by default (was: Require explicit CROSS join for cartesian products) > Require explicit CROSS join for cartesian products by default > - > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17306) Memory leak in QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17306: --- Component/s: SQL > Memory leak in QuantileSummaries > > > Key: SPARK-17306 > URL: https://issues.apache.org/jira/browse/SPARK-17306 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > compressThreshold was not referenced anywhere > {code} > class QuantileSummaries( > val compressThreshold: Int, > val relativeError: Double, > val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty, > private[stat] var count: Long = 0L, > val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends > Serializable > {code} > And it causes a memory leak: QuantileSummaries takes unbounded memory > {code} > val summary = new QuantileSummaries(1, relativeError = 0.001) > // Results in creating an array of size 1 !!! > (1 to 1).foreach(summary.insert(_)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17306) Memory leak in QuantileSummaries
Sean Zhong created SPARK-17306: -- Summary: Memory leak in QuantileSummaries Key: SPARK-17306 URL: https://issues.apache.org/jira/browse/SPARK-17306 Project: Spark Issue Type: Bug Reporter: Sean Zhong compressThreshold was not referenced anywhere {code} class QuantileSummaries( val compressThreshold: Int, val relativeError: Double, val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty, private[stat] var count: Long = 0L, val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends Serializable {code} And it causes a memory leak: QuantileSummaries takes unbounded memory {code} val summary = new QuantileSummaries(1, relativeError = 0.001) // Results in creating an array of size 1 !!! (1 to 1).foreach(summary.insert(_)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
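For readers unfamiliar with the class, a self-contained toy sketch of the behaviour a compressThreshold is normally meant to enforce follows; this is purely illustrative (it is not Spark's QuantileSummaries and not the eventual fix), and the toy compression step is a crude stand-in for the real Greenwald-Khanna merge.
{code}
import scala.collection.mutable.ArrayBuffer

// Toy summary whose insert() consults compressThreshold; the report above is
// that the real QuantileSummaries never does this, so its buffers grow forever.
class ToySummary(val compressThreshold: Int) {
  private val sampled = ArrayBuffer.empty[Double]

  def insert(x: Double): Unit = {
    sampled += x
    // Without a check like this, `sampled` grows without bound.
    if (sampled.length >= compressThreshold) compress()
  }

  // Crude stand-in for the real compression: drop every other sample.
  private def compress(): Unit = {
    val kept = sampled.zipWithIndex.collect { case (v, i) if i % 2 == 0 => v }
    sampled.clear()
    sampled ++= kept
  }

  def size: Int = sampled.length
}

// Memory stays bounded by roughly compressThreshold instead of the input size.
val s = new ToySummary(compressThreshold = 1000)
(1 to 1000000).foreach(i => s.insert(i.toDouble))
assert(s.size <= 1000)
{code}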
[jira] [Commented] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447518#comment-15447518 ] Apache Spark commented on SPARK-17304: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/14871 > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. > This method was introduced in SPARK-15865, so this is a performance > regression in 2.1.0-SNAPSHOT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17304: Assignee: Apache Spark (was: Josh Rosen) > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. > This method was introduced in SPARK-15865, so this is a performance > regression in 2.1.0-SNAPSHOT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17304: Assignee: Josh Rosen (was: Apache Spark) > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. > This method was introduced in SPARK-15865, so this is a performance > regression in 2.1.0-SNAPSHOT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17305) Cannot save ML PipelineModel in pyspark, PipelineModel.params still return null values
Hechao Sun created SPARK-17305: -- Summary: Cannot save ML PipelineModel in pyspark, PipelineModel.params still return null values Key: SPARK-17305 URL: https://issues.apache.org/jira/browse/SPARK-17305 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Environment: Python 2.7 Anaconda2 (64-bit) IDE Spark standalone mode Reporter: Hechao Sun I used pyspark.ml module to run standalone ML tasks, but when I tried to save the PipelineModel, it gave me the following error messages: Py4JJavaError: An error occurred while calling o8753.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2275.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2275.0 (TID 7942, localhost): java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012) at org.apache.hadoop.util.Shell.runCommand(Shell.java:483) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.util.Shell.execCommand(Shell.java:815) at org.apache.hadoop.util.Shell.execCommand(Shell.java:798) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:731) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:225) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:209) at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:305) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:294) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:326) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:393) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:802) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123) at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1199) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1904) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAs
[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17304: --- Description: If you run {code} sc.parallelize(1 to 10, 10).map(identity).count() {code} then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one performance hotspot in the scheduler, accounting for over half of the time. This method was introduced in SPARK-15865, so this is a performance regression in 2.1.0-SNAPSHOT. was: If you run {code} sc.parallelize(1 to 10, 10).map(identity).count() {code} then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one performance hotspot in the scheduler, accounting for over half of the time. > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. > This method was introduced in SPARK-15865, so this is a performance > regression in 2.1.0-SNAPSHOT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17304: --- Target Version/s: 2.1.0 > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17304: --- Affects Version/s: 2.1.0 > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
[ https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17304: --- Assignee: Josh Rosen Issue Type: Bug (was: Improvement) > TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler > benchmark > - > > Key: SPARK-17304 > URL: https://issues.apache.org/jira/browse/SPARK-17304 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > If you run > {code} > sc.parallelize(1 to 10, 10).map(identity).count() > {code} > then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one > performance hotspot in the scheduler, accounting for over half of the time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark
Josh Rosen created SPARK-17304: -- Summary: TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark Key: SPARK-17304 URL: https://issues.apache.org/jira/browse/SPARK-17304 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Josh Rosen Priority: Minor If you run {code} sc.parallelize(1 to 10, 10).map(identity).count() {code} then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one performance hotspot in the scheduler, accounting for over half of the time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists
[ https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447344#comment-15447344 ] Apache Spark commented on SPARK-17303: -- User 'frreiss' has created a pull request for this issue: https://github.com/apache/spark/pull/14870 > dev/run-tests fails if spark-warehouse directory exists > --- > > Key: SPARK-17303 > URL: https://issues.apache.org/jira/browse/SPARK-17303 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Frederick Reiss >Priority: Minor > > The script dev/run-tests, which is intended for verifying the correctness of > pull requests, runs Apache RAT to check for missing Apache license headers. > Later, the script does a full compile/package/test sequence. > The script as currently written works fine the first time. But the second > time it runs, the Apache RAT checks fail due to the presence of the directory > spark-warehouse, which the script indirectly creates during its regression > test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists
[ https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17303: Assignee: (was: Apache Spark) > dev/run-tests fails if spark-warehouse directory exists > --- > > Key: SPARK-17303 > URL: https://issues.apache.org/jira/browse/SPARK-17303 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Frederick Reiss >Priority: Minor > > The script dev/run-tests, which is intended for verifying the correctness of > pull requests, runs Apache RAT to check for missing Apache license headers. > Later, the script does a full compile/package/test sequence. > The script as currently written works fine the first time. But the second > time it runs, the Apache RAT checks fail due to the presence of the directory > spark-warehouse, which the script indirectly creates during its regression > test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists
[ https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17303: Assignee: Apache Spark > dev/run-tests fails if spark-warehouse directory exists > --- > > Key: SPARK-17303 > URL: https://issues.apache.org/jira/browse/SPARK-17303 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Frederick Reiss >Assignee: Apache Spark >Priority: Minor > > The script dev/run-tests, which is intended for verifying the correctness of > pull requests, runs Apache RAT to check for missing Apache license headers. > Later, the script does a full compile/package/test sequence. > The script as currently written works fine the first time. But the second > time it runs, the Apache RAT checks fail due to the presence of the directory > spark-warehouse, which the script indirectly creates during its regression > test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists
Frederick Reiss created SPARK-17303: --- Summary: dev/run-tests fails if spark-warehouse directory exists Key: SPARK-17303 URL: https://issues.apache.org/jira/browse/SPARK-17303 Project: Spark Issue Type: Bug Components: Build Reporter: Frederick Reiss Priority: Minor The script dev/run-tests, which is intended for verifying the correctness of pull requests, runs Apache RAT to check for missing Apache license headers. Later, the script does a full compile/package/test sequence. The script as currently written works fine the first time. But the second time it runs, the Apache RAT checks fail due to the presence of the directory spark-warehouse, which the script indirectly creates during its regression test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447285#comment-15447285 ] Gang Wu commented on SPARK-17243: - Yup you're right. I finally got some app_ids that were not in the summary page but their URLs can be accessed. Our cluster has 100K+ app_ids so it took me a long time to figure it out. Thanks for your help! > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of the Spark 2.0 history server web UI keeps displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation: the "historypage.js" file sends a REST request to the > /api/v1/applications endpoint of the history server and gets back a JSON > response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application histories it runs fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447266#comment-15447266 ] Alex Bozarth commented on SPARK-17243: -- That's odd. How long did you wait before accessing the app URL? The history server still needs to propagate after starting, and that can take a long time. I was testing with a limit of 50 and an app in the thousands, and it took about 5 minutes to propagate before I could see it. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of the Spark 2.0 history server web UI keeps displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation: the "historypage.js" file sends a REST request to the > /api/v1/applications endpoint of the history server and gets back a JSON > response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application histories it runs fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf
Ryan Blue created SPARK-17302: - Summary: Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf Key: SPARK-17302 URL: https://issues.apache.org/jira/browse/SPARK-17302 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Ryan Blue When configuration changed for 2.0 to the new SparkSession structure, Spark stopped using Hive's internal HiveConf for session state and now uses HiveSessionState and an associated SQLConf. Now, session options like hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled from this SQLConf. This doesn't include session properties from hive-site.xml (including hive.exec.compress.output), and no longer contains Spark-specific overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern. Also, setting these variables on the command-line no longer works because settings must start with "spark.". Is there a recommended way to set Hive session properties? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
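As a possible interim approach (hedged; this ticket is precisely about whether there is a sanctioned way), per-session Hive properties can usually be set through the session's runtime SQL configuration rather than through hive-site.xml:
{code}
// Sketch assuming a SparkSession built with Hive support; both calls write
// into the session's SQLConf, which is where HiveSessionState now reads from.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

spark.conf.set("hive.exec.compress.output", "true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
{code}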
[jira] [Assigned] (SPARK-17301) Remove unused classTag field from AtomicType base class
[ https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17301: Assignee: Apache Spark (was: Josh Rosen) > Remove unused classTag field from AtomicType base class > --- > > Key: SPARK-17301 > URL: https://issues.apache.org/jira/browse/SPARK-17301 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Minor > > There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which > is causing unnecessary slowness in deserialization because it needs to grab > ScalaReflectionLock and create a new runtime reflection mirror. Removing this > unused code gives a small but measurable performance boost in SQL task > deserialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17301) Remove unused classTag field from AtomicType base class
[ https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447191#comment-15447191 ] Apache Spark commented on SPARK-17301: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/14869 > Remove unused classTag field from AtomicType base class > --- > > Key: SPARK-17301 > URL: https://issues.apache.org/jira/browse/SPARK-17301 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which > is causing unnecessary slowness in deserialization because it needs to grab > ScalaReflectionLock and create a new runtime reflection mirror. Removing this > unused code gives a small but measurable performance boost in SQL task > deserialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17301) Remove unused classTag field from AtomicType base class
[ https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17301: Assignee: Josh Rosen (was: Apache Spark) > Remove unused classTag field from AtomicType base class > --- > > Key: SPARK-17301 > URL: https://issues.apache.org/jira/browse/SPARK-17301 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which > is causing unnecessary slowness in deserialization because it needs to grab > ScalaReflectionLock and create a new runtime reflection mirror. Removing this > unused code gives a small but measurable performance boost in SQL task > deserialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17301) Remove unused classTag field from AtomicType base class
Josh Rosen created SPARK-17301: -- Summary: Remove unused classTag field from AtomicType base class Key: SPARK-17301 URL: https://issues.apache.org/jira/browse/SPARK-17301 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which is causing unnecessary slowness in deserialization because it needs to grab ScalaReflectionLock and create a new runtime reflection mirror. Removing this unused code gives a small but measurable performance boost in SQL task deserialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447172#comment-15447172 ] Gang Wu commented on SPARK-17243: - I pulled in the latest change. I can get the full application list from the REST endpoint /api/v1/applications (without the limit parameter). However, the web UI reports that the app_id is not found when I specify that app_id. I can access it using the Spark 1.5 history server. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of Spark 2.0 history server web UI keep displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation, "historypage.js" file sends a REST request to > /api/v1/applications endpoint of history server REST endpoint and gets back > json response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application history it is running fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
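A quick way to see what the summary page is loading is to hit the REST endpoint directly, as in the sketch below. The host, port, and the limit query parameter are assumptions taken from the discussion above (the limit parameter belongs to the pull request under review and may not exist on a stock 2.0.0 history server); only the /api/v1/applications path is documented behavior.

{code}
import scala.io.Source

object HistoryServerProbe {
  def main(args: Array[String]): Unit = {
    // Hypothetical history server address.
    val base = "http://localhost:18080/api/v1/applications"

    // Full listing, i.e. what historypage.js requests today; with 10K+ apps
    // this payload is what makes the summary page hang while rendering.
    val all = Source.fromURL(base).mkString

    // Restricted listing, assuming the limit parameter from the PR under
    // discussion is available on this build.
    val limited = Source.fromURL(base + "?limit=50").mkString

    println(s"full listing: ${all.length} bytes, limited listing: ${limited.length} bytes")
  }
}
{code}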
[jira] [Created] (SPARK-17300) ClosedChannelException caused by missing block manager when speculative tasks are killed
Ryan Blue created SPARK-17300: - Summary: ClosedChannelException caused by missing block manager when speculative tasks are killed Key: SPARK-17300 URL: https://issues.apache.org/jira/browse/SPARK-17300 Project: Spark Issue Type: Bug Reporter: Ryan Blue We recently backported SPARK-10530 to our Spark build, which kills unnecessary duplicate/speculative tasks when one completes (either a speculative task or the original). In large jobs with 500+ executors, this caused some executors to die and resulted in the same error that was fixed by SPARK-15262: ClosedChannelException when trying to connect to the block manager on affected hosts. {code} java.nio.channels.ClosedChannelException at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239) at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567) at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801) at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699) at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633) at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908) at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.channels.ClosedChannelException {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
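For context, the duplicate attempts described above only exist when speculative execution is enabled. The sketch below is a minimal, illustrative setup under that assumption: the configuration values, the local master, and the skewed workload are made up and are not the reporter's environment, and the kill-on-completion behavior only appears on builds carrying the SPARK-10530 patch.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object SpeculationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("speculation-sketch")
      .setIfMissing("spark.master", "local[4]")
      .set("spark.speculation", "true")            // launch speculative copies of slow tasks
      .set("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be
    val sc = new SparkContext(conf)

    // A deliberately skewed stage: one partition sleeps, so its task becomes a
    // candidate for a speculative (duplicate) attempt that is later killed.
    sc.parallelize(1 to 8, 8).map { i =>
      if (i == 8) Thread.sleep(5000)
      i
    }.count()

    sc.stop()
  }
}
{code}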
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447112#comment-15447112 ] Alex Bozarth commented on SPARK-17243: -- [~wgtmac] I'm not sure which version of the pr you tested, in my initial commit the issue you saw still existed but I updated it EOD Friday to switch to a version that only restricts the summary display, leaving all the applications available via their direct url as you would expect. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of Spark 2.0 history server web UI keep displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation, "historypage.js" file sends a REST request to > /api/v1/applications endpoint of history server REST endpoint and gets back > json response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application history it is running fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14662) LinearRegressionModel uses only default parameters if yStd is 0
[ https://issues.apache.org/jira/browse/SPARK-14662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14662. --- Resolution: Fixed Assignee: Yanbo Liang Fix Version/s: 2.0.0 This has now been solved in [SPARK-15339] > LinearRegressionModel uses only default parameters if yStd is 0 > --- > > Key: SPARK-14662 > URL: https://issues.apache.org/jira/browse/SPARK-14662 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Louis Traynard >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 5m > Remaining Estimate: 5m > > In the (rare) case when yStd is 0 in LinearRegression, parameters are not > copied immediately to the created LinearRegressionModel instance. But they > should (as in all other cases), because currently only the default column > names are used in and returned by model.findSummaryModelAndPredictionCol(), > which leads to schema validation errors. > The fix is two lines and should not be hard to check & apply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447099#comment-15447099 ] Gang Wu commented on SPARK-17243: - I've tested this PR. It does reduce the number of applications in the metadata list. I think it is intended to restrict only the summary page; jobs that are dropped from the summary web UI should still be available via their URLs, like http://x.x.x.x:18080/history/application_id/jobs. However, those dropped ones cannot be accessed. This could significantly reduce the usability of the history server. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of Spark 2.0 history server web UI keep displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation, "historypage.js" file sends a REST request to > /api/v1/applications endpoint of history server REST endpoint and gets back > json response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application history it is running fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447015#comment-15447015 ] Sean Owen commented on SPARK-17298: --- Agree, though spark.sql.crossJoin.enabled=false by default, so the queries result in an error right now. This disagrees with https://issues.apache.org/jira/browse/SPARK-17298?focusedCommentId=15446920&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15446920 This change allows it to work where it didn't before. I mean it's inaccurate to say that the change is to require this new syntax, because as before it can be allowed by changing the flag too. > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446988#comment-15446988 ] Sameer Agarwal commented on SPARK-17298: Sean, if I understand correctly, here are the new semantics Srinath is proposing: 1. Case 1: spark.sql.crossJoin.enabled = false (a) select * from A inner join B *throws an error* (b) select * from A cross join B *doesn't throw an error* 2. Case 2: spark.sql.crossJoin.enabled = true (a) select * from A inner join B *doesn't throw an error* (b) select * from A cross join B *doesn't throw an error* 1(a) and 2(a) conform to the existing semantics in Spark. This PR proposes 1(b) and 2(b). > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
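The four cases above, written out as a runnable sketch. Only the flag name, the CROSS JOIN syntax, and the error-versus-no-error outcomes come from the comment; the view names and rows are invented, and case 1(b) still fails on a stock 2.0.0 build because the proposed change has not landed there, so this illustrates the intended end state rather than current behavior.

{code}
import org.apache.spark.sql.SparkSession

object CrossJoinSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cross-join-sketch").getOrCreate()
    import spark.implicits._

    Seq(("one", 1), ("two", 2)).toDF("k", "v1").createOrReplaceTempView("A")
    Seq(("one", 1), ("two", 22)).toDF("k", "v2").createOrReplaceTempView("B")

    // Case 1: spark.sql.crossJoin.enabled = false (the default).
    spark.conf.set("spark.sql.crossJoin.enabled", "false")
    // 1(a): an implicit cartesian product is rejected.
    try {
      spark.sql("SELECT * FROM A INNER JOIN B").count()
      println("1(a) allowed on this build")
    } catch {
      case e: Exception => println("1(a) rejected: " + e.getClass.getSimpleName)
    }
    // 1(b): an explicit CROSS JOIN is allowed under the proposal
    // (still rejected on a stock 2.0.0 build).
    try {
      println("1(b) rows: " + spark.sql("SELECT * FROM A CROSS JOIN B").count())
    } catch {
      case e: Exception => println("1(b) rejected on this build")
    }

    // Case 2: with the flag on, both forms run (the previous behavior).
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    println("2(a) rows: " + spark.sql("SELECT * FROM A INNER JOIN B").count())

    spark.stop()
  }
}
{code}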
[jira] [Assigned] (SPARK-17296) Spark SQL: cross join + two joins = BUG
[ https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17296: Assignee: Apache Spark > Spark SQL: cross join + two joins = BUG > --- > > Key: SPARK-17296 > URL: https://issues.apache.org/jira/browse/SPARK-17296 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Assignee: Apache Spark > > In spark shell : > {code} > CREATE TABLE test (col INT) ; > INSERT OVERWRITE TABLE test VALUES (1), (2) ; > SELECT > COUNT(1) > FROM test T1 > CROSS JOIN test T2 > JOIN test T3 > ON T3.col = T1.col > JOIN test T4 > ON T4.col = T1.col > ; > {code} > returns : > {code} > Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; > line 6 pos 12 > {code} > Apparently, this example is minimal (removing the CROSS or one of the JOIN > causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17296) Spark SQL: cross join + two joins = BUG
[ https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17296: Assignee: (was: Apache Spark) > Spark SQL: cross join + two joins = BUG > --- > > Key: SPARK-17296 > URL: https://issues.apache.org/jira/browse/SPARK-17296 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > In spark shell : > {code} > CREATE TABLE test (col INT) ; > INSERT OVERWRITE TABLE test VALUES (1), (2) ; > SELECT > COUNT(1) > FROM test T1 > CROSS JOIN test T2 > JOIN test T3 > ON T3.col = T1.col > JOIN test T4 > ON T4.col = T1.col > ; > {code} > returns : > {code} > Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; > line 6 pos 12 > {code} > Apparently, this example is minimal (removing the CROSS or one of the JOIN > causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG
[ https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446979#comment-15446979 ] Apache Spark commented on SPARK-17296: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14867 > Spark SQL: cross join + two joins = BUG > --- > > Key: SPARK-17296 > URL: https://issues.apache.org/jira/browse/SPARK-17296 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > In spark shell : > {code} > CREATE TABLE test (col INT) ; > INSERT OVERWRITE TABLE test VALUES (1), (2) ; > SELECT > COUNT(1) > FROM test T1 > CROSS JOIN test T2 > JOIN test T3 > ON T3.col = T1.col > JOIN test T4 > ON T4.col = T1.col > ; > {code} > returns : > {code} > Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; > line 6 pos 12 > {code} > Apparently, this example is minimal (removing the CROSS or one of the JOIN > causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446946#comment-15446946 ] Sean Owen commented on SPARK-17298: --- This should be an error unless you set the property to allow cartesian joins. At least, it was for me when I tried it this week on 1.6. It's possible something changed in which case disregard this, if you've tested it. > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces
[ https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446930#comment-15446930 ] Sean Owen commented on SPARK-17299: --- That's probably the intent yeah. If that's how the other engines treat TRIM then this is a bug. I can see it is indeed implemented internally with String.trim(). CC [~chenghao] for https://github.com/apache/spark/commit/0b0b9ceaf73de472198c9804fb7ae61fa2a2e097 > TRIM/LTRIM/RTRIM strips characters other than spaces > > > Key: SPARK-17299 > URL: https://issues.apache.org/jira/browse/SPARK-17299 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Jeremy Beard >Priority: Minor > > TRIM/LTRIM/RTRIM docs state that they only strip spaces: > http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column) > But the implementation strips all characters of ASCII value 20 or less: > https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446920#comment-15446920 ] Srinath commented on SPARK-17298: - So if I do the following: create temporary view nt1 as select * from values ("one", 1), ("two", 2), ("three", 3) as nt1(k, v1); create temporary view nt2 as select * from values ("one", 1), ("two", 22), ("one", 5) as nt2(k, v2); SELECT * FROM nt1, nt2; -- or select * FROM nt1 inner join nt2; The SELECT queries do not in fact result in an error. The proposed change would have them return an error > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces
[ https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446902#comment-15446902 ] Jeremy Beard commented on SPARK-17299: -- What is the priority for compatibility with other SQL dialects? TRIM in (at least) Hive, Impala, Oracle, and Teradata is just for spaces. > TRIM/LTRIM/RTRIM strips characters other than spaces > > > Key: SPARK-17299 > URL: https://issues.apache.org/jira/browse/SPARK-17299 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Jeremy Beard >Priority: Minor > > TRIM/LTRIM/RTRIM docs state that they only strip spaces: > http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column) > But the implementation strips all characters of ASCII value 20 or less: > https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-16581. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14775 [https://github.com/apache/spark/pull/14775] > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > Fix For: 2.0.1, 2.1.0 > > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-16581: - Assignee: Shivaram Venkataraman > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 2.0.1, 2.1.0 > > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces
[ https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17299: -- Priority: Minor (was: Major) Component/s: Documentation I'm almost certain its intent is to match the behavior of String.trim() in Java. Unless there's a reason to believe it should behave otherwise I think the docs should be fixed to resolve this. > TRIM/LTRIM/RTRIM strips characters other than spaces > > > Key: SPARK-17299 > URL: https://issues.apache.org/jira/browse/SPARK-17299 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Jeremy Beard >Priority: Minor > > TRIM/LTRIM/RTRIM docs state that they only strip spaces: > http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column) > But the implementation strips all characters of ASCII value 20 or less: > https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces
Jeremy Beard created SPARK-17299: Summary: TRIM/LTRIM/RTRIM strips characters other than spaces Key: SPARK-17299 URL: https://issues.apache.org/jira/browse/SPARK-17299 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Jeremy Beard TRIM/LTRIM/RTRIM docs state that they only strip spaces: http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column) But the implementation strips all characters with a code point of 0x20 (decimal 32) or less: https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
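A small sketch of the discrepancy described above, using plain String.trim(), which a later comment in this thread notes is what the implementation mirrors. The input string and the hand-rolled spaces-only trim are made-up examples for comparison, not Spark code.

{code}
object TrimBehaviorSketch {
  def main(args: Array[String]): Unit = {
    // Characters with code points <= 0x20 include tab, newline, and other
    // control characters, not just the space character.
    val input = "\u0001\t  hello world \n\u0013"

    // String.trim() strips every leading/trailing character <= '\u0020' ...
    println("[" + input.trim + "]")   // prints [hello world]

    // ... whereas a spaces-only trim, which is what the docs describe,
    // leaves the tabs, newlines, and control characters in place.
    val spacesOnly = input.dropWhile(_ == ' ').reverse.dropWhile(_ == ' ').reverse
    println("[" + spacesOnly + "]")
  }
}
{code}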
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446740#comment-15446740 ] Sean Owen commented on SPARK-17298: --- This already results in an error. You mean that it will not result in an error if you specify it explicitly in the query right? > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17298: Assignee: Apache Spark > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Assignee: Apache Spark >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17298: Assignee: (was: Apache Spark) > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446719#comment-15446719 ] Apache Spark commented on SPARK-17298: -- User 'srinathshankar' has created a pull request for this issue: https://github.com/apache/spark/pull/14866 > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446682#comment-15446682 ] Srinath commented on SPARK-17298: - You are correct that with this change, queries of the form {noformat} select * from A inner join B {noformat} will now throw an error where previously they would not. The reason for this suggestion is that users may often forget to specify join conditions altogether, leading to incorrect, long-running queries. Requiring explicit cross joins helps clarify intent. Turning on the spark.sql.crossJoin.enabled flag will revert to previous behavior. > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
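On the DataFrame side, the comparison looks like the sketch below. The flag-based workaround runs on 2.0.0 as described above; the crossJoin method is only the proposed API from the issue description and does not exist in 2.0.0, so it is left as a comment. The DataFrames themselves are invented for illustration.

{code}
import org.apache.spark.sql.SparkSession

object ExplicitCrossJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("explicit-cross-join").getOrCreate()
    import spark.implicits._

    val left = Seq(1, 2, 3).toDF("a")
    val right = Seq("x", "y").toDF("b")

    // Today (2.0.0): an unconditioned join is a cartesian product and only
    // runs when the session-wide flag is turned on.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    println(left.join(right).count())   // 6 rows

    // Under the proposal, the intent is stated in the operator instead:
    // left.crossJoin(right)            // proposed DataFrame method, not in 2.0.0

    spark.stop()
  }
}
{code}

Stating the intent in the operator rather than in a session-wide flag is the point of the proposal: an accidentally missing join condition still fails fast, while a deliberate cartesian product reads as such.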
[jira] [Updated] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-16578: -- Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-15799) > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17063. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14607 [https://github.com/apache/spark/pull/14607] > MSCK REPAIR TABLE is super slow with Hive metastore > --- > > Key: SPARK-17063 > URL: https://issues.apache.org/jira/browse/SPARK-17063 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > When repairing a table with thousands of partitions, it could take hundreds of > seconds; the Hive metastore can only add a few partitions per second, because > it will list all the files for each partition to gather the fast stats > (number of files, total size of files). > We could improve this by listing the files in Spark in parallel, then sending > the fast stats to the Hive metastore to avoid this sequential listing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
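For readers unfamiliar with the command, a sketch of the repair flow is below, assuming a build that carries this fix (2.0.1 or 2.1.0) and Hive support enabled. The table name, schema, and storage path are hypothetical; the speed-up described above happens entirely inside the single MSCK REPAIR TABLE statement, with no API change on the caller's side.

{code}
import org.apache.spark.sql.SparkSession

object RepairTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("msck-repair-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical partitioned table whose partition directories were written
    // directly to storage and are not yet registered in the Hive metastore.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS logs (line STRING)
        |PARTITIONED BY (dt STRING)
        |LOCATION '/data/logs'""".stripMargin)

    // Discover the partition directories and register them; with thousands of
    // partitions this is the call that the fix speeds up by listing files in
    // parallel in Spark before handing the fast stats to the metastore.
    spark.sql("MSCK REPAIR TABLE logs")

    println(spark.sql("SHOW PARTITIONS logs").count())
    spark.stop()
  }
}
{code}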
[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1544#comment-1544 ] Tomer Kaftan commented on SPARK-17110: -- Hi Miao, all that is needed using the fully default configurations with 2 slaves is to just set the number of worker cores per slave to 1. That is, putting the following in /spark/conf/spark-env.sh {code} SPARK_WORKER_CORES=1 {code} More concretely (if you’re looking for exact steps), the way I start up the cluster & reproduce this example is as follows. I use the Spark EC2 scripts from the PR here: https://github.com/amplab/spark-ec2/pull/46 I launch the cluster on 2 r3.xlarge machines (but any machine should work, though you may need to change a later sed command): {code} ./spark-ec2 -k ec2_ssh_key -i path_to_key_here -s 2 -t r3.xlarge launch temp-cluster --spot-price=1.00 --spark-version=2.0.0 --region=us-west-2 --hadoop-major-version=yarn {code} I update the number of worker cores and launch the pyspark shell: {code} sed -i'f' 's/SPARK_WORKER_CORES=4/SPARK_WORKER_CORES=1/g' /root/spark/conf/spark-env.sh ~/spark-ec2/copy-dir ~/spark/conf/ ~/spark/sbin/stop-all.sh ~/spark/sbin/start-all.sh ~/spark/bin/pyspark {code} And then I run the example I included at the start: {code} x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() x.count() import time def waitMap(x): time.sleep(x) return x x.map(waitMap).count() {code} > Pyspark with locality ANY throw java.io.StreamCorruptedException > > > Key: SPARK-17110 > URL: https://issues.apache.org/jira/browse/SPARK-17110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Cluster of 2 AWS r3.xlarge slaves launched via ec2 > scripts, Spark 2.0.0, hadoop: yarn, pyspark shell >Reporter: Tomer Kaftan >Priority: Critical > > In Pyspark 2.0.0, any task that accesses cached data non-locally throws a > StreamCorruptedException like the stacktrace below: > {noformat} > WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): > java.io.StreamCorruptedException: invalid stream header: 12010A80 > at > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807) > at java.io.ObjectInputStream.(ObjectInputStream.java:302) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The simplest way I have found to reproduce this is by running the following > code in the pyspark shell, on a cluster of 2 slaves set to use only one > worker core each: > {code} > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} > Or by running the following via spark-submit: > {code} > from pyspark import SparkContext > sc = SparkContext() > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code}
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446615#comment-15446615 ] Apache Spark commented on SPARK-17289: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/14865 > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17289: Assignee: (was: Apache Spark) > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17289: Assignee: Apache Spark > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Assignee: Apache Spark >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products
[ https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17298: -- Priority: Minor (was: Major) Hm, aren't you suggesting that cartesian joins be _allowed_ when explicitly requested, regardless of the global flag? that's not the same as requiring this syntax, which probably isn't feasible as it would break things. > Require explicit CROSS join for cartesian products > -- > > Key: SPARK-17298 > URL: https://issues.apache.org/jira/browse/SPARK-17298 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Srinath >Priority: Minor > > Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame > API) to specify explicit cartesian products between relations. > By cartesian product we mean a join between relations R and S where there is > no join condition involving columns from both R and S. > If a cartesian product is detected in the absence of an explicit CROSS join, > an error must be thrown. Turning on the spark.sql.crossJoin.enabled > configuration flag will disable this check and allow cartesian products > without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17298) Require explicit CROSS join for cartesian products
Srinath created SPARK-17298: --- Summary: Require explicit CROSS join for cartesian products Key: SPARK-17298 URL: https://issues.apache.org/jira/browse/SPARK-17298 Project: Spark Issue Type: Story Components: SQL Reporter: Srinath Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC
[ https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pete Baker updated SPARK-17297: --- Description: In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String slideDuration, String startTime)}} function {{startTime}} parameter behaves as follows: {quote} startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. {quote} Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd expect to see events from each day fall into the correct day's bucket. This doesn't happen as expected in every case, however, due to the way that this feature and timestamp / timezone support interact. Using a fixed UTC reference, there is an assumption that all days are the same length ({{1 day === 24 h}}}). This is not the case for most timezones where the offset from UTC changes by 1h for 6 months out of the year. In this case, on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long. The result of this is that, for daylight savings time, some rows within 1h of midnight are aggregated to the wrong day. Options for a fix are: # This is the expected behavior, and the window() function should not be used for this type of aggregation with a long window length and the {{window}} function documentation should be updated as such, or # The window function should respect timezones and work on the assumption that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the local timezone, rather than UTC. # Support for both absolute and relative window lengths may be added was: In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String slideDuration, String startTime)}} function {{startTime}} parameter behaves as follows: {quote} startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. {quote} Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd expect to see events from each day fall into the correct day's bucket. This doesn't happen as expected in every case, however, due to the way that this feature and timestamp / timezone support interact. Using a fixed UTC reference, there is an assumption that all days are the same length ({{1 day === 24 h}}}). This is not the case for most timezones where the offset from UTC changes by 1h for 6 months out of the year. In this case, on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long. The result of this is that, for daylight savings time, some rows within 1h of midnight are aggregated to the wrong day. Options for a fix are: * This is the expected behavior, and the window() function should not be used for this type of aggregation with a long window length and the {{window}} function documentation should be updated as such, or * The window function should respect timezones and work on the assumption that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the local timezone, rather than UTC. 
* Support for both absolute and relative window lengths may be added > window function generates unexpected results due to startTime being relative > to UTC > --- > > Key: SPARK-17297 > URL: https://issues.apache.org/jira/browse/SPARK-17297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Pete Baker > > In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String > slideDuration, String startTime)}} function {{startTime}} parameter behaves > as follows: > {quote} > startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to > start window intervals. For example, in order to have hourly tumbling windows > that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide > startTime as 15 minutes. > {quote} > Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd > expect to see events from each day fall into the correct day's bucket. This > doesn't happen as expected in every case, however, due to the way that this > feature and timestamp / timezone support interact. > Using a fixed UTC reference, there is an assumption that all days are the > same length ({{1 day === 24 h}}}). This is not the case for most timezones > where the offset from UTC changes by 1h for 6 months out of the year. In > this case, on the days that clocks
[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16240: -- Shepherd: Joseph K. Bradley > model loading backward compatibility for ml.clustering.LDA > -- > > Key: SPARK-16240 > URL: https://issues.apache.org/jira/browse/SPARK-16240 > Project: Spark > Issue Type: Bug >Reporter: yuhao yang >Assignee: Gayathri Murali > > After resolving the matrix conversion issue, the LDA model still cannot load 1.6 > models because one of the parameter names has changed. > https://github.com/apache/spark/pull/12065 > We can perhaps add some special logic in the loading code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC
[ https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446391#comment-15446391 ] Sean Owen commented on SPARK-17297: --- I don't think there's an assumption about what a day is in here, because underneath this is all done in terms of absolute time in microseconds (since the epoch). You would not be able to do aggregations whose window varied a bit to start and end on day boundaries with respect to some calendar with a straight window() call this way. I think you could propose a documentation update that notes that, for example, "1 day" just means "24*60*60*1000*1000" microseconds, not a day according to a calendar. There are other ways to get the effect you need but will entail watching a somewhat larger window of time so you will always eventually find one batch interval where you see the whole day of data. > window function generates unexpected results due to startTime being relative > to UTC > --- > > Key: SPARK-17297 > URL: https://issues.apache.org/jira/browse/SPARK-17297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Pete Baker > > In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String > slideDuration, String startTime)}} function {{startTime}} parameter behaves > as follows: > {quote} > startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to > start window intervals. For example, in order to have hourly tumbling windows > that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide > startTime as 15 minutes. > {quote} > Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd > expect to see events from each day fall into the correct day's bucket. This > doesn't happen as expected in every case, however, due to the way that this > feature and timestamp / timezone support interact. > Using a fixed UTC reference, there is an assumption that all days are the > same length ({{1 day === 24 h}}}). This is not the case for most timezones > where the offset from UTC changes by 1h for 6 months out of the year. In > this case, on the days that clocks go forward/back, one day is 23h long, 1 > day is 25h long. > The result of this is that, for daylight savings time, some rows within 1h of > midnight are aggregated to the wrong day. > Options for a fix are: > * This is the expected behavior, and the window() function should not be used > for this type of aggregation with a long window length and the {{window}} > function documentation should be updated as such, or > * The window function should respect timezones and work on the assumption > that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the > local timezone, rather than UTC. > * Support for both absolute and relative window lengths may be added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
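A minimal sketch of the fixed-length behaviour described above, assuming spark-shell and an illustrative event-time column named {{ts}} (the data and column name are not from the report):
{code}
// Runs in spark-shell (spark.implicits._ is already in scope there).
import java.sql.Timestamp
import org.apache.spark.sql.functions.window

// Illustrative events near local midnight. "1 day" below is a fixed
// 24 * 60 * 60 * 1000 * 1000 microseconds anchored at the UTC epoch,
// not a calendar day in the session's local timezone.
val events = Seq(
  Timestamp.valueOf("2016-03-27 00:30:00"),
  Timestamp.valueOf("2016-03-27 23:30:00")
).toDF("ts")

// Daily tumbling windows; startTime defaults to 0, i.e. aligned to UTC midnight.
events.groupBy(window($"ts", "1 day")).count().show(false)
{code}
With a non-zero UTC offset, and especially when that offset changes for daylight saving, these fixed 24-hour buckets do not line up with local calendar days, which is the mismatch discussed in this issue.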
[jira] [Created] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC
Pete Baker created SPARK-17297: -- Summary: window function generates unexpected results due to startTime being relative to UTC Key: SPARK-17297 URL: https://issues.apache.org/jira/browse/SPARK-17297 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Pete Baker In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String slideDuration, String startTime)}} function {{startTime}} parameter behaves as follows: {quote} startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. {quote} Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd expect to see events from each day fall into the correct day's bucket. This doesn't happen as expected in every case, however, due to the way that this feature and timestamp / timezone support interact. Using a fixed UTC reference, there is an assumption that all days are the same length ({{1 day === 24 h}}}). This is not the case for most timezones where the offset from UTC changes by 1h for 6 months out of the year. In this case, on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long. The result of this is that, for daylight savings time, some rows within 1h of midnight are aggregated to the wrong day. Either: * This is the expected behavior, and the window() function should not be used for this type of aggregation with a long window length and the {{window}} function documentation should be updated as such, or * The window function should respect timezones and work on the assumption that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the local timezone, rather than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG
[ https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446371#comment-15446371 ] Herman van Hovell commented on SPARK-17296: --- I think you have found a bug in the parser. Your query produces the following (unresolved) LogicalPlan: {noformat} 'Project [unresolvedalias('COUNT(1), None)] +- 'Join Inner, ('T4.col = 'T1.col) :- 'Join Inner : :- 'UnresolvedRelation `test`, T1 : +- 'Join Inner, ('T3.col = 'T1.col) : :- 'UnresolvedRelation `test`, T2 : +- 'UnresolvedRelation `test`, T3 +- 'UnresolvedRelation `test`, T4 {noformat} Notice how the most nested Inner Join references T2 and T3 using a join condition on T1 (which is an unknown relation for that join). > Spark SQL: cross join + two joins = BUG > --- > > Key: SPARK-17296 > URL: https://issues.apache.org/jira/browse/SPARK-17296 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > In spark shell : > {code} > CREATE TABLE test (col INT) ; > INSERT OVERWRITE TABLE test VALUES (1), (2) ; > SELECT > COUNT(1) > FROM test T1 > CROSS JOIN test T2 > JOIN test T3 > ON T3.col = T1.col > JOIN test T4 > ON T4.col = T1.col > ; > {code} > returns : > {code} > Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; > line 6 pos 12 > {code} > Apparently, this example is minimal (removing the CROSS or one of the JOIN > causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG
[ https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446382#comment-15446382 ] Furcy Pin commented on SPARK-17296: --- Yes, this is not critical though, a workaround is to invert the order between T1 and T2. > Spark SQL: cross join + two joins = BUG > --- > > Key: SPARK-17296 > URL: https://issues.apache.org/jira/browse/SPARK-17296 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > In spark shell : > {code} > CREATE TABLE test (col INT) ; > INSERT OVERWRITE TABLE test VALUES (1), (2) ; > SELECT > COUNT(1) > FROM test T1 > CROSS JOIN test T2 > JOIN test T3 > ON T3.col = T1.col > JOIN test T4 > ON T4.col = T1.col > ; > {code} > returns : > {code} > Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; > line 6 pos 12 > {code} > Apparently, this example is minimal (removing the CROSS or one of the JOIN > causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
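For reference, a sketch of the workaround mentioned above, using the table and aliases from the reproduction (swap T1 and T2 so that every ON condition only references relations that end up inside the same nested join):
{code}
// Workaround sketch: with T1 placed next to the inner joins, the
// parser-produced nested joins no longer reference an out-of-scope relation.
spark.sql("""
  SELECT COUNT(1)
  FROM test T2
  CROSS JOIN test T1
  JOIN test T3
    ON T3.col = T1.col
  JOIN test T4
    ON T4.col = T1.col
""").show()
{code}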
[jira] [Updated] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC
[ https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pete Baker updated SPARK-17297: --- Description: In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String slideDuration, String startTime)}} function {{startTime}} parameter behaves as follows: {quote} startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. {quote} Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd expect to see events from each day fall into the correct day's bucket. This doesn't happen as expected in every case, however, due to the way that this feature and timestamp / timezone support interact. Using a fixed UTC reference, there is an assumption that all days are the same length ({{1 day === 24 h}}}). This is not the case for most timezones where the offset from UTC changes by 1h for 6 months out of the year. In this case, on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long. The result of this is that, for daylight savings time, some rows within 1h of midnight are aggregated to the wrong day. Options for a fix are: * This is the expected behavior, and the window() function should not be used for this type of aggregation with a long window length and the {{window}} function documentation should be updated as such, or * The window function should respect timezones and work on the assumption that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the local timezone, rather than UTC. * Support for both absolute and relative window lengths may be added was: In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String slideDuration, String startTime)}} function {{startTime}} parameter behaves as follows: {quote} startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. {quote} Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd expect to see events from each day fall into the correct day's bucket. This doesn't happen as expected in every case, however, due to the way that this feature and timestamp / timezone support interact. Using a fixed UTC reference, there is an assumption that all days are the same length ({{1 day === 24 h}}}). This is not the case for most timezones where the offset from UTC changes by 1h for 6 months out of the year. In this case, on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long. The result of this is that, for daylight savings time, some rows within 1h of midnight are aggregated to the wrong day. Either: * This is the expected behavior, and the window() function should not be used for this type of aggregation with a long window length and the {{window}} function documentation should be updated as such, or * The window function should respect timezones and work on the assumption that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the local timezone, rather than UTC. 
> window function generates unexpected results due to startTime being relative > to UTC > --- > > Key: SPARK-17297 > URL: https://issues.apache.org/jira/browse/SPARK-17297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Pete Baker > > In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String > slideDuration, String startTime)}} function {{startTime}} parameter behaves > as follows: > {quote} > startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to > start window intervals. For example, in order to have hourly tumbling windows > that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide > startTime as 15 minutes. > {quote} > Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd > expect to see events from each day fall into the correct day's bucket. This > doesn't happen as expected in every case, however, due to the way that this > feature and timestamp / timezone support interact. > Using a fixed UTC reference, there is an assumption that all days are the > same length ({{1 day === 24 h}}}). This is not the case for most timezones > where the offset from UTC changes by 1h for 6 months out of the year. In > this case, on the days that clocks go forward/back, one day is 23h long, 1 > day is 25h long. > The result of this is
[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG
[ https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446367#comment-15446367 ] Sean Owen commented on SPARK-17296: --- Pardon if I'm missing something, but you are not joining T3 with T1, so I don't think you can use T1.col in the join condition right? > Spark SQL: cross join + two joins = BUG > --- > > Key: SPARK-17296 > URL: https://issues.apache.org/jira/browse/SPARK-17296 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > In spark shell : > {code} > CREATE TABLE test (col INT) ; > INSERT OVERWRITE TABLE test VALUES (1), (2) ; > SELECT > COUNT(1) > FROM test T1 > CROSS JOIN test T2 > JOIN test T3 > ON T3.col = T1.col > JOIN test T4 > ON T4.col = T1.col > ; > {code} > returns : > {code} > Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; > line 6 pos 12 > {code} > Apparently, this example is minimal (removing the CROSS or one of the JOIN > causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446344#comment-15446344 ] Takeshi Yamamuro commented on SPARK-17289: -- okay. I'll add tests and open pr. > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446341#comment-15446341 ] Herman van Hovell commented on SPARK-17289: --- Looks good. Can you open a PR? > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446323#comment-15446323 ] Takeshi Yamamuro commented on SPARK-17289: -- This is probably because EnsureRequirements does not check if partial aggregation satisfies sort requirements. We can fix this like; https://github.com/apache/spark/compare/master...maropu:SPARK-17289#diff-cdb577e36041e4a27a605b6b3063fd54L167 cc: [~hvanhovell] > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
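A simplified, hypothetical illustration of the kind of check being discussed here: compare a node's required child ordering with the ordering its child actually provides, and insert a local sort only when the requirement is not satisfied. This is a sketch, not the actual patch:
{code}
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.execution.{SortExec, SparkPlan}

// Hypothetical helper: keep the child as-is if its output ordering already
// satisfies the required ordering; otherwise wrap it in a (non-global) sort.
def ensureOrdering(required: Seq[SortOrder], child: SparkPlan): SparkPlan = {
  val actual = child.outputOrdering
  val satisfied = required.length <= actual.length &&
    required.zip(actual).forall { case (r, a) => r.semanticEquals(a) }
  if (required.isEmpty || satisfied) child
  else SortExec(required, global = false, child = child)
}
{code}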
[jira] [Commented] (SPARK-15453) Improve join planning for bucketed / sorted tables
[ https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446283#comment-15446283 ] Apache Spark commented on SPARK-15453: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/14864 > Improve join planning for bucketed / sorted tables > -- > > Key: SPARK-15453 > URL: https://issues.apache.org/jira/browse/SPARK-15453 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Tejas Patil >Priority: Minor > > Datasource allows creation of bucketed and sorted tables but performing joins > on such tables still does not utilize this metadata to produce optimal query > plan. > As below, the `Exchange` and `Sort` can be avoided if the tables are known to > be hashed + sorted on relevant columns. > {noformat} > == Physical Plan == > WholeStageCodegen > : +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None > : :- INPUT > : +- INPUT > :- WholeStageCodegen > : : +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0 > : : +- INPUT > : +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None > : +- WholeStageCodegen > :: +- Project [j#20,k#21,i#22] > :: +- Filter (isnotnull(k#21) && isnotnull(j#20)) > ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, > InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), > IsNotNull(j)], ReadSchema: struct > +- WholeStageCodegen >: +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0 >: +- INPUT >+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None > +- WholeStageCodegen > : +- Project [j#23,k#24,i#25] > : +- Filter (isnotnull(k#24) && isnotnull(j#23)) > :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, > InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), > IsNotNull(j)], ReadSchema: struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
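A sketch of how bucketed and sorted tables can be declared through the datasource API today, assuming spark-shell and illustrative table/column names; whether the subsequent join can then drop the Exchange and Sort is exactly what this issue targets:
{code}
// Runs in spark-shell; table and column names are illustrative.
val df = spark.range(0, 100000).selectExpr("id AS j", "id % 10 AS k", "id % 100 AS i")

// Persist both sides bucketed and sorted on the join keys.
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table7")
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table8")

val joined = spark.table("table7").join(spark.table("table8"), Seq("j", "k"))
joined.explain()  // ideally no Exchange/Sort on either side once this metadata is used
{code}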
[jira] [Commented] (SPARK-16992) Pep8 code style
[ https://issues.apache.org/jira/browse/SPARK-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446255#comment-15446255 ] Apache Spark commented on SPARK-16992: -- User 'Stibbons' has created a pull request for this issue: https://github.com/apache/spark/pull/14863 > Pep8 code style > --- > > Key: SPARK-16992 > URL: https://issues.apache.org/jira/browse/SPARK-16992 > Project: Spark > Issue Type: Improvement >Reporter: Semet > > Add code style checks and auto formating into the Python code. > Features: > - add a {{.editconfig}} file (Spark's Scala files use 2-spaces indentation, > while Python files uses 4) for compatible editors (almost every editors has a > plugin to support {{.editconfig}} file) > - use autopep8 to fix basic pep8 mistakes > - use isort to automatically sort and organise {{import}} statements and > organise them into logically linked order (see doc here. The most important > thing is that it splits import statements that loads more than one object > into several lines. It send keep the imports sorted. Said otherwise, for a > given module import, the line where it should be added will be fixed. This > will increase the number of line of the file, but this facilitates a lot file > maintainance and file merges if needed. > add a 'validate.sh' script in order to automatise the correction (need isort > and autopep8 installed) > You can see similar script in prod in the > [Buildbot|https://github.com/buildbot/buildbot/blob/master/common/validate.sh] > project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17296) Spark SQL: cross join + two joins = BUG
Furcy Pin created SPARK-17296: - Summary: Spark SQL: cross join + two joins = BUG Key: SPARK-17296 URL: https://issues.apache.org/jira/browse/SPARK-17296 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Furcy Pin In spark shell : {code} CREATE TABLE test (col INT) ; INSERT OVERWRITE TABLE test VALUES (1), (2) ; SELECT COUNT(1) FROM test T1 CROSS JOIN test T2 JOIN test T3 ON T3.col = T1.col JOIN test T4 ON T4.col = T1.col ; {code} returns : {code} Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; line 6 pos 12 {code} Apparently, this example is minimal (removing the CROSS or one of the JOIN causes no issue). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION
[ https://issues.apache.org/jira/browse/SPARK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17295: Assignee: Apache Spark > Create TestHiveSessionState use reflect logic based on the setting of > CATALOG_IMPLEMENTATION > > > Key: SPARK-17295 > URL: https://issues.apache.org/jira/browse/SPARK-17295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Assignee: Apache Spark > > Currently we create a new `TestHiveSessionState` in `TestHive`, but in > `SparkSession` we create `SessionState`/`HiveSessionState` use reflect logic > based on the setting of CATALOG_IMPLEMENTATION, we should make the both > consist, then we can test the reflect logic of `SparkSession` in `TestHive`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION
[ https://issues.apache.org/jira/browse/SPARK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446152#comment-15446152 ] Apache Spark commented on SPARK-17295: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/14862 > Create TestHiveSessionState use reflect logic based on the setting of > CATALOG_IMPLEMENTATION > > > Key: SPARK-17295 > URL: https://issues.apache.org/jira/browse/SPARK-17295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo > > Currently we create a new `TestHiveSessionState` in `TestHive`, but in > `SparkSession` we create `SessionState`/`HiveSessionState` use reflect logic > based on the setting of CATALOG_IMPLEMENTATION, we should make the both > consist, then we can test the reflect logic of `SparkSession` in `TestHive`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION
[ https://issues.apache.org/jira/browse/SPARK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17295: Assignee: (was: Apache Spark) > Create TestHiveSessionState use reflect logic based on the setting of > CATALOG_IMPLEMENTATION > > > Key: SPARK-17295 > URL: https://issues.apache.org/jira/browse/SPARK-17295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo > > Currently we create a new `TestHiveSessionState` in `TestHive`, but in > `SparkSession` we create `SessionState`/`HiveSessionState` use reflect logic > based on the setting of CATALOG_IMPLEMENTATION, we should make the both > consist, then we can test the reflect logic of `SparkSession` in `TestHive`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17291) The shuffle data fetched based on netty were directly stored in off-memoryr?
[ https://issues.apache.org/jira/browse/SPARK-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446142#comment-15446142 ] song fengfei commented on SPARK-17291: -- Thanks very much, this is the first time I have created an issue, and I didn't understand that “user@ is the place for questions” before! Thanks again. > The shuffle data fetched based on netty were directly stored in off-memoryr? > > > Key: SPARK-17291 > URL: https://issues.apache.org/jira/browse/SPARK-17291 > Project: Spark > Issue Type: IT Help > Components: Shuffle >Reporter: song fengfei > > If the shuffle data were stored in off-memory,and isn't it easily lead to OOM > when a map output was too big?Additionly,it was also out of spark memory > management? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION
Jiang Xingbo created SPARK-17295: Summary: Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION Key: SPARK-17295 URL: https://issues.apache.org/jira/browse/SPARK-17295 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jiang Xingbo Currently we create a new `TestHiveSessionState` in `TestHive`, but in `SparkSession` we create `SessionState`/`HiveSessionState` use reflect logic based on the setting of CATALOG_IMPLEMENTATION, we should make the both consist, then we can test the reflect logic of `SparkSession` in `TestHive`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
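For context, a minimal sketch of how the catalog implementation is selected when building a regular {{SparkSession}}; this is illustrative, not part of the proposed change, and assumes Hive classes are on the classpath:
{code}
import org.apache.spark.sql.SparkSession

// "spark.sql.catalogImplementation" is the setting behind CATALOG_IMPLEMENTATION.
// enableHiveSupport() switches it from the default "in-memory" to "hive", and the
// matching SessionState/HiveSessionState is then instantiated reflectively.
val session = SparkSession.builder()
  .master("local")
  .appName("catalog-implementation-demo")
  .enableHiveSupport()
  .getOrCreate()

// Prints "hive" here; the fallback default is "in-memory".
println(session.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))
{code}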
[jira] [Commented] (SPARK-17291) The shuffle data fetched based on netty were directly stored in off-memoryr?
[ https://issues.apache.org/jira/browse/SPARK-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446119#comment-15446119 ] Sean Owen commented on SPARK-17291: --- I replied on the other JIRA. Questions go to u...@spark.apache.org, and not to JIRA. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > The shuffle data fetched based on netty were directly stored in off-memoryr? > > > Key: SPARK-17291 > URL: https://issues.apache.org/jira/browse/SPARK-17291 > Project: Spark > Issue Type: IT Help > Components: Shuffle >Reporter: song fengfei > > If the shuffle data were stored in off-memory,and isn't it easily lead to OOM > when a map output was too big?Additionly,it was also out of spark memory > management? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17291) The shuffle data fetched based on netty were directly stored in off-memoryr?
[ https://issues.apache.org/jira/browse/SPARK-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446110#comment-15446110 ] song fengfei commented on SPARK-17291: -- Yeah, they are the same, but neither was actually resolved; instead, they were both just changed to resolved? Is it not allowed to ask questions here? > The shuffle data fetched based on netty were directly stored in off-memoryr? > > > Key: SPARK-17291 > URL: https://issues.apache.org/jira/browse/SPARK-17291 > Project: Spark > Issue Type: IT Help > Components: Shuffle >Reporter: song fengfei > > If the shuffle data were stored in off-memory,and isn't it easily lead to OOM > when a map output was too big?Additionly,it was also out of spark memory > management? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17294) Caching invalidates data on mildly wide dataframes
[ https://issues.apache.org/jira/browse/SPARK-17294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17294. --- Resolution: Duplicate Duplicate #5, popular issue > Caching invalidates data on mildly wide dataframes > -- > > Key: SPARK-17294 > URL: https://issues.apache.org/jira/browse/SPARK-17294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Kalle Jepsen > > Caching a dataframe with > 200 columns causes the data within to simply > vanish under certain circumstances. > Consider the following code, where we create a one-row dataframe containing > the numbers from 0 to 200. > {code} > n_cols = 201 > rng = range(n_cols) > df = spark.createDataFrame( > data=[rng] > ) > last = df.columns[-1] > print(df.select(last).collect()) > df.select(F.greatest(*df.columns).alias('greatest')).show() > {code} > Returns: > {noformat} > [Row(_201=200)] > ++ > |greatest| > ++ > | 200| > ++ > {noformat} > As expected column {{_201}} contains the number 200 and as expected the > greatest value within that single row is 200. > Now if we introduce a {{.cache}} on {{df}}: > {code} > n_cols = 201 > rng = range(n_cols) > df = spark.createDataFrame( > data=[rng] > ).cache() > last = df.columns[-1] > print(df.select(last).collect()) > df.select(F.greatest(*df.columns).alias('greatest')).show() > {code} > Returns: > {noformat} > [Row(_201=200)] > ++ > |greatest| > ++ > | 0| > ++ > {noformat} > the last column {{_201}} still seems to contain the correct value, but when I > try to select the greatest value within the row, 0 is returned. When I issue > {{.show()}} on the dataframe, all values will be zero. As soon as I limit the > columns on a number < 200, everything looks fine again. > When the number of columns is < 200 from the beginning, even the cache will > not break things and everything works as expected. > It doesn't matter whether the data is loaded from disk or created on the fly > and this happens in Spark 1.6.2 and 2.0.0 (haven't tested anything else). > Can anyone confirm this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17294) Caching invalidates data on mildly wide dataframes
Kalle Jepsen created SPARK-17294: Summary: Caching invalidates data on mildly wide dataframes Key: SPARK-17294 URL: https://issues.apache.org/jira/browse/SPARK-17294 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0, 1.6.2 Reporter: Kalle Jepsen Caching a dataframe with > 200 columns causes the data within to simply vanish under certain circumstances. Consider the following code, where we create a one-row dataframe containing the numbers from 0 to 200. {code} n_cols = 201 rng = range(n_cols) df = spark.createDataFrame( data=[rng] ) last = df.columns[-1] print(df.select(last).collect()) df.select(F.greatest(*df.columns).alias('greatest')).show() {code} Returns: {noformat} [Row(_201=200)] ++ |greatest| ++ | 200| ++ {noformat} As expected column {{_201}} contains the number 200 and as expected the greatest value within that single row is 200. Now if we introduce a {{.cache}} on {{df}}: {code} n_cols = 201 rng = range(n_cols) df = spark.createDataFrame( data=[rng] ).cache() last = df.columns[-1] print(df.select(last).collect()) df.select(F.greatest(*df.columns).alias('greatest')).show() {code} Returns: {noformat} [Row(_201=200)] ++ |greatest| ++ | 0| ++ {noformat} the last column {{_201}} still seems to contain the correct value, but when I try to select the greatest value within the row, 0 is returned. When I issue {{.show()}} on the dataframe, all values will be zero. As soon as I limit the columns on a number < 200, everything looks fine again. When the number of columns is < 200 from the beginning, even the cache will not break things and everything works as expected. It doesn't matter whether the data is loaded from disk or created on the fly and this happens in Spark 1.6.2 and 2.0.0 (haven't tested anything else). Can anyone confirm this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446059#comment-15446059 ] Takeshi Yamamuro commented on SPARK-17289: -- yea, I'll check this. > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446013#comment-15446013 ] Herman van Hovell commented on SPARK-17289: --- cc [~maropu] > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446015#comment-15446015 ] Herman van Hovell commented on SPARK-17289: --- [~clockfly] Are you working on this one? > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17289: -- Priority: Blocker (was: Major) > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Blocker > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17293) separate view handling from CreateTableCommand
[ https://issues.apache.org/jira/browse/SPARK-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan closed SPARK-17293. --- Resolution: Invalid Sorry, my mistake. > separate view handling from CreateTableCommand > -- > > Key: SPARK-17293 > URL: https://issues.apache.org/jira/browse/SPARK-17293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17293) separate view handling from CreateTableCommand
Wenchen Fan created SPARK-17293: --- Summary: separate view handling from CreateTableCommand Key: SPARK-17293 URL: https://issues.apache.org/jira/browse/SPARK-17293 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org