[jira] [Comment Edited] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

    [ https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760487#comment-15760487 ]

Navya Krishnappa edited comment on SPARK-18877 at 12/19/16 7:56 AM:
--------------------------------------------------------------------

Thank you [~dongjoon] and i will create an issue in Apace parquet.

was (Author: navya krishnappa):
Thank you [~dongjoon]

> Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18877
>                 URL: https://issues.apache.org/jira/browse/SPARK-18877
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Navya Krishnappa
>
> When reading the below-mentioned CSV data, even though the maximum decimal precision is 38, the following exception is thrown:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253
[jira] [Comment Edited] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

    [ https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760487#comment-15760487 ]

Navya Krishnappa edited comment on SPARK-18877 at 12/19/16 7:56 AM:
--------------------------------------------------------------------

Thank you [~dongjoon] and i will create an issue in Apache Parquet JIRA.

was (Author: navya krishnappa):
Thank you [~dongjoon] and i will create an issue in Apace parquet.

> Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18877
>                 URL: https://issues.apache.org/jira/browse/SPARK-18877
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Navya Krishnappa
[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

    [ https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760487#comment-15760487 ]

Navya Krishnappa commented on SPARK-18877:
------------------------------------------

Thank you [~dongjoon]

> Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18877
>                 URL: https://issues.apache.org/jira/browse/SPARK-18877
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Navya Krishnappa
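A minimal sketch of one possible workaround for the report above, assuming the sample values live in a hypothetical decimal.csv with the shown header. Supplying an explicit schema with DecimalType(38, 0), Spark SQL's maximum precision, bypasses schema inference, which appears to be what picks the too-narrow precision:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

object DecimalCsvRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decimal-csv").master("local[*]").getOrCreate()

    // DecimalType(38, 0) is the widest decimal Spark SQL supports, so the
    // 28-digit sample values fit without tripping the precision check.
    val schema = StructType(Seq(StructField("Decimal", DecimalType(38, 0))))

    val df = spark.read
      .option("header", "true")
      .schema(schema)       // explicit schema: no inference, no narrow precision
      .csv("decimal.csv")   // hypothetical path holding the values from the report

    df.show(truncate = false)
    spark.stop()
  }
}
{code}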
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
    [ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760439#comment-15760439 ]

Xiangrui Meng commented on SPARK-18924:
---------------------------------------

cc: [~shivaram] [~felixcheung] [~falaki] [~yanboliang] for discussion.

> Improve collect/createDataFrame performance in SparkR
> ------------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Priority: Critical
[jira] [Created] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
Xiangrui Meng created SPARK-18924:
----------------------------------

             Summary: Improve collect/createDataFrame performance in SparkR
                 Key: SPARK-18924
                 URL: https://issues.apache.org/jira/browse/SPARK-18924
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
            Reporter: Xiangrui Meng
            Priority: Critical


SparkR has its own SerDe for data serialization between JVM and R.

The SerDe on the JVM side is implemented in:
* [SerDe|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
* [SQLUtils|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]

The SerDe on the R side is implemented in:
* [deserialize|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
* [serialize|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]

The serialization between JVM and R suffers from huge storage and computation overhead. For example, a short round-trip of 1 million doubles surprisingly took 3 minutes on my laptop:

{code}
> system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
   user  system elapsed
 14.224   0.582 189.135
{code}

Collecting a medium-sized DataFrame to local and continuing with a local R workflow is a use case we should pay attention to. SparkR will never be able to cover all existing features from CRAN packages, and it is also unnecessary for Spark to do so because not all features need scalability.

Several factors contribute to the serialization overhead:
1. The SerDe on the R side is implemented using high-level R methods.
2. DataFrame columns are not efficiently serialized, primitive type columns in particular.
3. Some overhead in the serialization protocol/impl.

1) might have been discussed before because R packages like rJava existed before SparkR. I'm not sure whether we have a license issue in depending on those libraries. Another option is to switch to R's low-level C interface or Rcpp, which again might have license issues. I'm not an expert here. If we have to implement our own, there is still much room for improvement, discussed below.

2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, which collects rows to the driver and then constructs columns. However,
* it ignores column types and incurs boxing/unboxing overhead
* it collects all objects to the driver and causes high GC pressure

A relatively simple change is to implement a specialized column builder based on column types, primitive types in particular. We need to handle null values properly. A simple data structure we can use is

{code}
val size: Int
val nullIndexes: Array[Int]
val notNullValues: Array[T] // specialized for primitive types
{code}

On the R side, we can use `readBin` and `writeBin` to read the entire vector in a single method call. The speed seems reasonable (on the order of GB/s):

{code}
> x <- runif(1e7) # 1e7, not 1e6
> system.time(r <- writeBin(x, raw(0)))
   user  system elapsed
  0.036   0.021   0.059
>
> system.time(y <- readBin(r, double(), 1e7))
   user  system elapsed
  0.015   0.007   0.024
{code}

This is just a proposal that needs to be discussed and formalized, but in general it should be feasible to obtain a 20x or greater performance gain.
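A minimal Scala sketch of the (size, nullIndexes, notNullValues) layout proposed above, specialized for a Double column; the class name and API are hypothetical illustrations, not Spark code:

{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical builder for a nullable Double column using the layout from the
// proposal: nulls tracked by index, non-null values stored unboxed.
class DoubleColumnBuilder {
  private var size = 0
  private val nullIndexes = ArrayBuffer.empty[Int]
  private val notNullValues = ArrayBuffer.empty[Double] // stays primitive

  def append(value: java.lang.Double): Unit = {
    if (value == null) nullIndexes += size
    else notNullValues += value.doubleValue()
    size += 1
  }

  // Rebuild the column with nulls restored at their recorded indexes.
  def result(): Array[java.lang.Double] = {
    val out = new Array[java.lang.Double](size)
    val nulls = nullIndexes.toSet
    var src = 0
    for (i <- 0 until size if !nulls.contains(i)) {
      out(i) = notNullValues(src)
      src += 1
    }
    out
  }
}
{code}

The point of the layout is that the dense notNullValues array can be shipped to R in one `readBin`-style bulk transfer, with only the (usually short) null-index list sent separately.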
[jira] [Assigned] (SPARK-18871) New test cases for IN subquery
    [ https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18871:
------------------------------------

    Assignee:     (was: Apache Spark)

> New test cases for IN subquery
> ------------------------------
>
>                 Key: SPARK-18871
>                 URL: https://issues.apache.org/jira/browse/SPARK-18871
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>            Reporter: Nattavut Sutyanyong
>
> This JIRA is open for submitting a PR with new test cases for IN/NOT IN subqueries. We plan to put approximately 100+ test cases under `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with a simple SELECT in both parent and subquery to subqueries with more complex constructs on both sides (joins, aggregates, etc.). Test data include null values and duplicate values.
[jira] [Assigned] (SPARK-18871) New test cases for IN subquery
    [ https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18871:
------------------------------------

    Assignee: Apache Spark

> New test cases for IN subquery
> ------------------------------
>
>                 Key: SPARK-18871
>                 URL: https://issues.apache.org/jira/browse/SPARK-18871
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>            Reporter: Nattavut Sutyanyong
>            Assignee: Apache Spark
[jira] [Commented] (SPARK-18871) New test cases for IN subquery
    [ https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760370#comment-15760370 ]

Apache Spark commented on SPARK-18871:
--------------------------------------

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/16337

> New test cases for IN subquery
> ------------------------------
>
>                 Key: SPARK-18871
>                 URL: https://issues.apache.org/jira/browse/SPARK-18871
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>            Reporter: Nattavut Sutyanyong
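A small sketch of the null-value corner case such tests must cover; the table names are made up for illustration. NOT IN returns zero rows whenever the subquery result contains a NULL, which is exactly the kind of trap the new suite targets:

{code}
import org.apache.spark.sql.SparkSession

object NotInNullDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("not-in-null").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(1, 2, 3).toDF("a").createOrReplaceTempView("t1")
    Seq(Some(1), None).toDF("b").createOrReplaceTempView("t2")

    // With a NULL in the subquery, `a NOT IN (1, NULL)` is false for a = 1 and
    // NULL (not true) for every other row, so the query returns zero rows.
    spark.sql("SELECT a FROM t1 WHERE a NOT IN (SELECT b FROM t2)").show()

    spark.stop()
  }
}
{code}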
[jira] [Assigned] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation
    [ https://issues.apache.org/jira/browse/SPARK-18923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18923:
------------------------------------

    Assignee: Apache Spark

> Support SKIP_PYTHONDOC/RDOC in doc generation
> ---------------------------------------------
>
>                 Key: SPARK-18923
>                 URL: https://issues.apache.org/jira/browse/SPARK-18923
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Documentation
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Assigned] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation
    [ https://issues.apache.org/jira/browse/SPARK-18923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18923:
------------------------------------

    Assignee:     (was: Apache Spark)

> Support SKIP_PYTHONDOC/RDOC in doc generation
> ---------------------------------------------
>
>                 Key: SPARK-18923
>                 URL: https://issues.apache.org/jira/browse/SPARK-18923
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Documentation
>            Reporter: Dongjoon Hyun
>            Priority: Minor
[jira] [Commented] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation
    [ https://issues.apache.org/jira/browse/SPARK-18923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760332#comment-15760332 ]

Apache Spark commented on SPARK-18923:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16336

> Support SKIP_PYTHONDOC/RDOC in doc generation
> ---------------------------------------------
>
>                 Key: SPARK-18923
>                 URL: https://issues.apache.org/jira/browse/SPARK-18923
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Documentation
>            Reporter: Dongjoon Hyun
>            Priority: Minor
[jira] [Created] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation
Dongjoon Hyun created SPARK-18923:
----------------------------------

             Summary: Support SKIP_PYTHONDOC/RDOC in doc generation
                 Key: SPARK-18923
                 URL: https://issues.apache.org/jira/browse/SPARK-18923
             Project: Spark
          Issue Type: Improvement
          Components: Build, Documentation
            Reporter: Dongjoon Hyun
            Priority: Minor


This issue aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for documentation generation. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`.

The reason is that the Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Python and R. In particular:
- Python API docs require `sphinx`.
- R API docs require an `R` installation and `knitr` (and several other libraries).

In other words, we cannot generate Python API docs without an R installation, and we cannot generate R API docs without a Python `sphinx` installation. If Spark provided `SKIP_PYTHONDOC` and `SKIP_RDOC` like `SKIP_SCALADOC`, it would be more convenient.
[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode
    [ https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760203#comment-15760203 ]

vishal agrawal commented on SPARK-18857:
----------------------------------------

We are unable to use incremental collect in Spark versions before 2.0.2 due to the bug SPARK-18009. We will have to take 2.0.2, change this particular class, and build from source.

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18857
>                 URL: https://issues.apache.org/jira/browse/SPARK-18857
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: vishal agrawal
>         Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
> We are trying to run a SQL query on our Spark cluster, extracting around 200 million records through the SparkSQL ThriftServer interface. This query works fine on Spark 1.6.3; however, on Spark 2.0.2 the Thrift server hangs after fetching data from a few partitions (we are using incremental collect mode with 400 partitions). As per the documentation, the maximum memory taken up by the Thrift server should be what the biggest data partition requires. But we observed that the Thrift server does not release the old partitions' memory when GC occurs, even though it has moved on to fetching the next partition's data; this is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults bug" and the result set iterator was duplicated to keep a reference to the first set:
> {code}
> + val (itra, itrb) = iter.duplicate
> + iterHeader = itra
> + iter = itrb
> {code}
> We suspect that this is what prevents the memory from being cleared on GC. To confirm this, we created an iterator in our test class and fetched the data once without duplicating it and a second time with a duplicate. In the first case it ran fine and fetched the entire data set, while in the second case the driver hung after fetching data from a few partitions.
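A plain-Scala sketch (no Spark involved) of the retention behavior suspected above: `Iterator.duplicate` keeps a shared buffer of every element one side has consumed but the other has not, so a mostly-unread duplicate pins all fetched data in memory:

{code}
object DuplicateRetentionDemo {
  def main(args: Array[String]): Unit = {
    val data = Iterator.tabulate(5)(i => s"partition-$i")

    // duplicate returns two iterators backed by a shared queue; elements
    // consumed by only one side stay queued until the other side reads them.
    val (header, body) = data.duplicate

    header.take(1).foreach(println) // peek at the first element only

    // Draining `body` while `header` lags behind means every element passes
    // through the shared buffer and stays reachable until `header` catches up.
    body.foreach(println)
  }
}
{code}

In the ThriftServer case the header iterator is kept for the lifetime of the statement, so under this behavior nothing fetched through the duplicate can be garbage collected.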
[jira] [Assigned] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows
    [ https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18922:
------------------------------------

    Assignee:     (was: Apache Spark)

> Fix more resource-closing-related and path-related test failures in identified ones on Windows
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18922
>                 URL: https://issues.apache.org/jira/browse/SPARK-18922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>            Reporter: Hyukjin Kwon
>            Priority: Minor
[jira] [Commented] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows
    [ https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760116#comment-15760116 ]

Apache Spark commented on SPARK-18922:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16335

> Fix more resource-closing-related and path-related test failures in identified ones on Windows
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18922
>                 URL: https://issues.apache.org/jira/browse/SPARK-18922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>            Reporter: Hyukjin Kwon
>            Priority: Minor
[jira] [Assigned] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows
    [ https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18922:
------------------------------------

    Assignee: Apache Spark

> Fix more resource-closing-related and path-related test failures in identified ones on Windows
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18922
>                 URL: https://issues.apache.org/jira/browse/SPARK-18922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Created] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows
Hyukjin Kwon created SPARK-18922:
---------------------------------

             Summary: Fix more resource-closing-related and path-related test failures in identified ones on Windows
                 Key: SPARK-18922
                 URL: https://issues.apache.org/jira/browse/SPARK-18922
             Project: Spark
          Issue Type: Sub-task
          Components: Tests
            Reporter: Hyukjin Kwon
            Priority: Minor


There are more instances that fail on Windows, as below:

- {{LauncherBackendSuite}}:

{code}
- local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
  The code passed to eventually never returned normally. Attempted 283 times over 30.0960053 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
  org.scalatest.exceptions.TestFailedDueToTimeoutException:
  at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
  at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)

- standalone/client: launcher handle *** FAILED *** (30 seconds, 47 milliseconds)
  The code passed to eventually never returned normally. Attempted 282 times over 30.03798710002 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
  org.scalatest.exceptions.TestFailedDueToTimeoutException:
  at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
  at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
{code}

- {{SQLQuerySuite}}:

{code}
- specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{code}

- {{JsonSuite}}:

{code}
- Loading a JSON dataset from a text file with SQL *** FAILED *** (94 milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{code}

- {{StateStoreSuite}}:

{code}
- SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
  java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.<init>(Path.java:116)
  at org.apache.hadoop.fs.Path.<init>(Path.java:89)
  ...
  Cause: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:203)
{code}

- {{HDFSMetadataLogSuite}}:

{code}
- FileManager: FileContextManager *** FAILED *** (94 milliseconds)
  java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
  at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)

- FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
  java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
  at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
{code}

Please refer to https://ci.appveyor.com/project/spark-test/spark/build/283-tmp-test-base for the full logs.
[jira] [Updated] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
    [ https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-18703:
--------------------------------
    Fix Version/s: 2.1.1

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18703
>                 URL: https://issues.apache.org/jira/browse/SPARK-18703
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>            Priority: Critical
>             Fix For: 2.1.1, 2.2.0
>
> Below are the files/directories generated for three inserts against a Hive table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary; we do not drop them until JVM termination. If the JVM does not terminate normally, these temporary files/directories will never be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> Ideally, we should drop these staging files/directories as soon as each insertion/CTAS finishes, rather than waiting for the JVM to exit.
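A plain-Scala sketch contrasting the two cleanup strategies implied above; the helper and paths are hypothetical illustrations, not Spark's actual code. Cleanup deferred to JVM exit (e.g. a shutdown hook) never runs if the process is killed, while deleting the staging directory right after the insertion commits does not depend on a clean shutdown:

{code}
import java.io.File
import java.nio.file.Files

object StagingCleanupSketch {
  // Hypothetical helper: recursively delete a staging directory.
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
    f.delete()
  }

  def main(args: Array[String]): Unit = {
    val staging = Files.createTempDirectory(".hive-staging_demo").toFile

    // Strategy A (what the ticket describes): cleanup deferred to JVM exit.
    // If the JVM is killed, the hook never runs and the directory leaks.
    sys.addShutdownHook(deleteRecursively(staging))

    // ... write part files, then commit the insert ...

    // Strategy B (what the fix aims for): drop the staging directory as soon
    // as this insertion has committed, independent of how the JVM ends.
    deleteRecursively(staging)
  }
}
{code}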
[jira] [Updated] (SPARK-18675) CTAS for hive serde table should work for all hive versions
    [ https://issues.apache.org/jira/browse/SPARK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-18675:
--------------------------------
    Fix Version/s: 2.1.1

> CTAS for hive serde table should work for all hive versions
> ------------------------------------------------------------
>
>                 Key: SPARK-18675
>                 URL: https://issues.apache.org/jira/browse/SPARK-18675
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.1.1, 2.2.0
[jira] [Assigned] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase
    [ https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18921:
------------------------------------

    Assignee: Wenchen Fan  (was: Apache Spark)

> check database existence with Hive.databaseExists instead of getDatabase
> -------------------------------------------------------------------------
>
>                 Key: SPARK-18921
>                 URL: https://issues.apache.org/jira/browse/SPARK-18921
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Minor
[jira] [Commented] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase
    [ https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760023#comment-15760023 ]

Apache Spark commented on SPARK-18921:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16332

> check database existence with Hive.databaseExists instead of getDatabase
> -------------------------------------------------------------------------
>
>                 Key: SPARK-18921
>                 URL: https://issues.apache.org/jira/browse/SPARK-18921
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Minor
[jira] [Assigned] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase
    [ https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18921:
------------------------------------

    Assignee: Apache Spark  (was: Wenchen Fan)

> check database existence with Hive.databaseExists instead of getDatabase
> -------------------------------------------------------------------------
>
>                 Key: SPARK-18921
>                 URL: https://issues.apache.org/jira/browse/SPARK-18921
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Created] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase
Wenchen Fan created SPARK-18921:
--------------------------------

             Summary: check database existence with Hive.databaseExists instead of getDatabase
                 Key: SPARK-18921
                 URL: https://issues.apache.org/jira/browse/SPARK-18921
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
            Priority: Minor
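A sketch of the rationale behind the title, using a hypothetical client trait rather than Spark's actual Hive shim: getDatabase signals a missing database by throwing, which forces exception-driven control flow, whereas a dedicated databaseExists returns a Boolean directly:

{code}
// Hypothetical stand-ins for the Hive client calls named in the title.
trait HiveClientLike {
  def getDatabase(name: String): Database   // throws if the database is absent
  def databaseExists(name: String): Boolean // simple membership test
}
case class Database(name: String)

object ExistenceCheckSketch {
  // Before: existence check via exception handling around getDatabase.
  def existsViaGet(client: HiveClientLike, db: String): Boolean =
    try { client.getDatabase(db); true } catch { case _: Exception => false }

  // After: delegate to the purpose-built predicate; no exception plumbing.
  def existsDirect(client: HiveClientLike, db: String): Boolean =
    client.databaseExists(db)
}
{code}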
[jira] [Closed] (SPARK-18767) Unify Models' toString methods
    [ https://issues.apache.org/jira/browse/SPARK-18767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng closed SPARK-18767.
--------------------------------
    Resolution: Won't Fix

> Unify Models' toString methods
> ------------------------------
>
>                 Key: SPARK-18767
>                 URL: https://issues.apache.org/jira/browse/SPARK-18767
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: zhengruifeng
>            Priority: Minor
>
> A model's toString should output some information, not just the uid of its trainer:
> {code}
> scala> val nb = new NaiveBayes
> nb: org.apache.spark.ml.classification.NaiveBayes = nb_18e8984091a8
>
> scala> val nbm = nb.fit(data)
> nbm: org.apache.spark.ml.classification.NaiveBayesModel = NaiveBayesModel (uid=nb_18e8984091a8) with 2 classes
>
> scala> val dt = new DecisionTreeClassifier
> dt: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_627dac64995e
>
> scala> val dtm = dt.fit(data)
> 16/12/07 15:08:14 WARN Executor: 1 block locks were not released by TID = 94: [rdd_8_0]
> dtm: org.apache.spark.ml.classification.DecisionTreeClassificationModel = DecisionTreeClassificationModel (uid=dtc_627dac64995e) of depth 2 with 5 nodes
>
> scala> val lr = new LogisticRegression
> lr: org.apache.spark.ml.classification.LogisticRegression = logreg_251625c948a0
>
> scala> val lrm = lr.fit(data)
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_251625c948a0
> {code}
> I would override toString in each model to make them all look like this: {{ModelClassName (uid=...) with key params}}
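A minimal sketch of the pattern proposed above, using a hypothetical model class rather than the actual MLlib source:

{code}
// Hypothetical model illustrating the unified toString proposed in the ticket.
class DemoClassificationModel(val uid: String, val numClasses: Int, val depth: Int) {
  // Pattern from the ticket: "ModelClassName (uid=...) with key params".
  override def toString: String =
    s"DemoClassificationModel (uid=$uid) with $numClasses classes, depth $depth"
}

object DemoToString {
  def main(args: Array[String]): Unit =
    println(new DemoClassificationModel("demo_1a2b3c", numClasses = 2, depth = 3))
}
{code}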
[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
    [ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759931#comment-15759931 ]

Dongjoon Hyun commented on SPARK-18917:
---------------------------------------

Hi, [~alunarbeach]. Sure, you can make a PR. BTW, please don't set target versions and fix versions. Usually, target versions are used by committers, and fix versions are recorded only when your patch is merged.

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18917
>                 URL: https://issues.apache.org/jira/browse/SPARK-18917
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, SQL, YARN
>    Affects Versions: 2.0.2
>            Reporter: Anbu Cheeralan
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using DataFrame write in append mode on object stores (S3 / Google Storage), the writes take a long time or hit read timeouts. This is because dataframe.write lists all leaf folders in the target directory. If there are many subfolders due to partitioning, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code causes a huge number of RPC calls when the file system is an object store (S3, GS):
> {code}
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> {code}
> There should be a flag to skip the partition match check in append mode. I can work on the patch.
[jira] [Updated] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
    [ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-18917:
----------------------------------
    Fix Version/s:     (was: 2.1.1)
                       (was: 2.1.0)

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18917
>                 URL: https://issues.apache.org/jira/browse/SPARK-18917
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, SQL, YARN
>    Affects Versions: 2.0.2
>            Reporter: Anbu Cheeralan
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
[jira] [Updated] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
    [ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-18917:
----------------------------------
    Target Version/s:     (was: 2.1.0, 2.1.1)

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18917
>                 URL: https://issues.apache.org/jira/browse/SPARK-18917
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, SQL, YARN
>    Affects Versions: 2.0.2
>            Reporter: Anbu Cheeralan
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
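A sketch of what the proposed flag could look like at the call site quoted in the description; the option name skipPartitionCheck and the simplified types are hypothetical, not Spark's actual API:

{code}
import scala.util.Try

object AppendPartitionCheckSketch {
  // Simplified stand-ins for the types touched by DataSource.write().
  case class PartitionSpec(partitionColumns: Seq[String])
  def resolvePartitionSpec(): PartitionSpec =
    PartitionSpec(Seq("year", "month")) // placeholder; real code lists the store

  def existingPartitionColumns(skipPartitionCheck: Boolean): Seq[String] = {
    if (skipPartitionCheck) {
      // Proposed fast path: trust the caller and issue no listing RPCs at all.
      Seq.empty[String]
    } else {
      // Current behaviour: resolving the relation lists every leaf folder,
      // which is what makes appends crawl on S3/GS.
      Try(resolvePartitionSpec().partitionColumns).getOrElse(Seq.empty[String])
    }
  }

  def main(args: Array[String]): Unit =
    println(existingPartitionColumns(skipPartitionCheck = true))
}
{code}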
[jira] [Commented] (SPARK-18920) Update outdated date formatting
    [ https://issues.apache.org/jira/browse/SPARK-18920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759917#comment-15759917 ]

Apache Spark commented on SPARK-18920:
--------------------------------------

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/16331

> Update outdated date formatting
> -------------------------------
>
>                 Key: SPARK-18920
>                 URL: https://issues.apache.org/jira/browse/SPARK-18920
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2
>            Reporter: Tao Wang
>            Priority: Minor
[jira] [Assigned] (SPARK-18920) Update outdated date formatting
[ https://issues.apache.org/jira/browse/SPARK-18920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18920: Assignee: (was: Apache Spark) > Update outdated date formatting > --- > > Key: SPARK-18920 > URL: https://issues.apache.org/jira/browse/SPARK-18920 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Tao Wang >Priority: Minor > > We show "-" when the timestamp is less than 0; we should update this, as the > date string is now presented in the format "yyyy-MM-dd HH:mm:ss". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18920) Update outdated date formatting
[ https://issues.apache.org/jira/browse/SPARK-18920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18920: Assignee: Apache Spark > Update outdated date formatting > --- > > Key: SPARK-18920 > URL: https://issues.apache.org/jira/browse/SPARK-18920 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Tao Wang >Assignee: Apache Spark >Priority: Minor > > We show "-" when the timestamp is less than 0; we should update this, as the > date string is now presented in the format "yyyy-MM-dd HH:mm:ss". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18920) Update outdated date formatting
Tao Wang created SPARK-18920: Summary: Update outdated date formatting Key: SPARK-18920 URL: https://issues.apache.org/jira/browse/SPARK-18920 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.2, 2.0.1, 2.0.0 Reporter: Tao Wang Priority: Minor We show "-" when the timestamp is less than 0; we should update this, as the date string is now presented in the format "yyyy-MM-dd HH:mm:ss". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
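The gist of the change, as a small sketch: the fallback to "-" for negative timestamps comes from the description above, while the exact formatter pattern is an assumption here, not something verified against the pull request:

{code}
// Sketch: render "-" for invalid (negative) timestamps, otherwise format the date.
// The pattern "yyyy-MM-dd HH:mm:ss" is assumed, not taken from the patch.
import java.text.SimpleDateFormat
import java.util.Date

def formatTimestamp(ts: Long): String =
  if (ts < 0) "-"
  else new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts))
{code}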
[jira] [Commented] (SPARK-18916) Possible bug in Pregel / mergeMsg with hashmaps
[ https://issues.apache.org/jira/browse/SPARK-18916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759870#comment-15759870 ] Seth Bromberger commented on SPARK-18916: - Added to update: https://issues.scala-lang.org/browse/SI-9895 appears to be what hit me. > Possible bug in Pregel / mergeMsg with hashmaps > --- > > Key: SPARK-18916 > URL: https://issues.apache.org/jira/browse/SPARK-18916 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.2 > Environment: OSX / IntelliJ IDEA 2016.3 CE EAP, Scala 2.11.8, Spark > 2.0.2 >Reporter: Seth Bromberger > Labels: error, graphx, pregel > > Consider the following (rough) code that attempts to calculate all-pairs > shortest paths via pregel: > {code:java} > def allPairsShortestPaths: RDD[(VertexId, HashMap[VertexId, ParentDist])] > = { > val initialMsg = HashMap(-1L -> ParentDist(-1L, -1L)) > val pregelg = g.mapVertices((vid, vd) => (vd, HashMap[VertexId, > ParentDist](vid -> ParentDist(vid, 0L)))).reverse > def vprog(v: VertexId, value: (VD, HashMap[VertexId, ParentDist]), > message: HashMap[VertexId, ParentDist]): (VD, HashMap[VertexId, ParentDist]) > = { > val updatedValues = mm2(value._2, message).filter(v => v._2.dist >= 0) > (value._1, updatedValues) > } > def sendMsg(triplet: EdgeTriplet[(VD, HashMap[VertexId, ParentDist]), > ED]): Iterator[(VertexId, HashMap[VertexId, ParentDist])] = { > val dstVertexId = triplet.dstId > val srcMap = triplet.srcAttr._2 > val dstMap = triplet.dstAttr._2 // guaranteed to have dstVertexId as > a key > val updatesToSend : HashMap[VertexId, ParentDist] = srcMap.filter { > case (vid, srcPD) => dstMap.get(vid) match { > case Some(dstPD) => dstPD.dist > srcPD.dist + 1 // if it > exists, is it cheaper? > case _ => true // not found - new update > } > }.map(u => u._1 -> ParentDist(triplet.srcId, u._2.dist +1)) > if (updatesToSend.nonEmpty) > Iterator[(VertexId, HashMap[VertexId, ParentDist])]((dstVertexId, > updatesToSend)) > else > Iterator.empty > } > def mergeMsg(m1: HashMap[VertexId, ParentDist], m2: HashMap[VertexId, > ParentDist]): HashMap[VertexId, ParentDist] = { > // when the following two lines are commented out, the program fails > with > // 16/12/17 19:53:50 INFO DAGScheduler: Job 24 failed: reduce at > VertexRDDImpl.scala:88, took 0.244042 s > // Exception in thread "main" org.apache.spark.SparkException: Job > aborted due to stage failure: Task 0 in stage 1099.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 1099.0 (TID 129, localhost): > scala.MatchError: (null,null) (of class scala.Tuple2) > m1.foreach(_ => ()) > m2.foreach(_ => ()) > m1.merged(m2) { > case ((k1, v1), (_, v2)) => (k1, v1.min(v2)) > } > } > // mm2 is here just to provide a separate function for vprog. Ideally > we'd just re-use mergeMsg. > def mm2(m1: HashMap[VertexId, ParentDist], m2: HashMap[VertexId, > ParentDist]): HashMap[VertexId, ParentDist] = { > m1.merged(m2) { > case ((k1, v1), (_, v2)) => (k1, v1.min(v2)) > case n => throw new Exception("we've got a problem: " + n) > } > } > val pregelRun = pregelg.pregel(initialMsg)(vprog, sendMsg, mergeMsg) > val sps = pregelRun.vertices.map(v => v._1 -> v._2._2) > sps > } > } > {code} > Note the comment in the mergeMsg function: when the messages are explicitly > accessed prior to the .merged statement, the code works. If these side-effect > statements are removed / commented out, the error message in the comments is > generated. > This fails consistently on a 50-node undirected cycle graph.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
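Since SI-9895 is a bug in scala.collection.immutable.HashMap.merged itself, one way to sidestep it is to avoid merged entirely and fold one map into the other. A hedged workaround sketch, with pick standing in for the min-by-distance choice used in the code above:

{code}
// Workaround sketch: merge two immutable HashMaps without HashMap.merged,
// which SI-9895 can break. `pick` chooses between two values for the same key,
// e.g. mergeMaps(m1, m2)((a, b) => a.min(b)) for the ParentDist case above.
import scala.collection.immutable.HashMap

def mergeMaps[K, V](m1: HashMap[K, V], m2: HashMap[K, V])(pick: (V, V) => V): HashMap[K, V] =
  m2.foldLeft(m1) { case (acc, (k, v2)) =>
    acc.updated(k, acc.get(k).map(v1 => pick(v1, v2)).getOrElse(v2))
  }
{code}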
[jira] [Commented] (SPARK-17073) generate basic stats for column
[ https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759787#comment-15759787 ] Zhenhua Wang commented on SPARK-17073: -- [~ioana-delaney] Thanks for sharing the information! > generate basic stats for column > --- > > Key: SPARK-17073 > URL: https://issues.apache.org/jira/browse/SPARK-17073 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu >Assignee: Zhenhua Wang > Fix For: 2.1.0 > > > For a specified column, we need to generate basic stats including max, min, > number of nulls, number of distinct values, max column length, average column > length. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
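For context, the user-facing DDL this sub-task feeds into looks roughly like the following, assuming a SparkSession named spark; the table and column names are placeholders, and the FOR COLUMNS form is the column-level statistics collection this issue tracks:

{code}
// Collect table-level, then per-column, statistics for the optimizer.
// "employees", "name" and "salary" are placeholder identifiers.
spark.sql("ANALYZE TABLE employees COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS name, salary")
{code}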
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759635#comment-15759635 ] Apache Spark commented on SPARK-18817: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16330 > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18919) PrimitiveKeyOpenHashMap is boxing values
[ https://issues.apache.org/jira/browse/SPARK-18919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakub Liska closed SPARK-18919. --- Resolution: Not A Problem Ahh, my fault, the OpenHashSet is rehashing at 734004 and doubles the size of the array ... > PrimitiveKeyOpenHashMap is boxing values > > > Key: SPARK-18919 > URL: https://issues.apache.org/jira/browse/SPARK-18919 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: ubuntu 16.04, scala 2.11.7, 2.12.1, java 1.8.0_111 >Reporter: Jakub Liska >Priority: Critical > > Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory > footprint and I noticed that the footprint is higher than it should be: if I > add 1M [Long,Long] entries to it, it has: > ~ 34 MB in total > ~ 17 MB at OpenHashSet as keys > ~ 17 MB at Array as values > The Array size is strange though, because its initial size is 1 048 576 > (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, because a Long value > has 8 bytes. Therefore I think that the values are getting boxed in this > collection. > The consequence of this problem is that if you put more than 100M Long > entries into this map, the GC gets choked to death even with an unlimited heap size ... > The strange thing is that I get the same results using @miniboxed instead of > @specialized > This is the scalameter code I used : > {code} > class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime { > override def measurer = new Executor.Measurer.MemoryFootprint > val sizes = Gen.single("size")(1*1000*1000) > performance of "MemoryFootprint" in { > performance of "PrimitiveKeyOpenHashMap" in { > using(sizes) config ( > exec.benchRuns -> 1, > exec.maxWarmupRuns -> 0, > exec.independentSamples -> 1, > exec.requireGC -> true, > exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", > "-XX:+UseG1GC") > ) in { size => > val map = new PrimitiveKeyOpenHashMap[Long, Long](size) > var index = 0L > while (index < size) { > map(index) = 0L > index+=1 > } > println("Size " + SizeEstimator.estimate(map)) > while (index != 0) { > index-=1 > assert(map.contains(index)) > } > map > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18919) PrimitiveKeyOpenHashMap is boxing values
[ https://issues.apache.org/jira/browse/SPARK-18919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakub Liska updated SPARK-18919: Description: Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory footprint and I noticed that the footprint is higher than it should be: if I add 1M [Long,Long] entries to it, it has: ~ 34 MB in total ~ 17 MB at OpenHashSet as keys ~ 17 MB at Array as values The Array size is strange though, because its initial size is 1 048 576 (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, because a Long value has 8 bytes. Therefore I think that the values are getting boxed in this collection. The consequence of this problem is that if you put more than 100M Long entries into this map, the GC gets choked to death even with an unlimited heap size ... The strange thing is that I get the same results using @miniboxed instead of @specialized This is the scalameter code I used : {code} class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime { override def measurer = new Executor.Measurer.MemoryFootprint val sizes = Gen.single("size")(1*1000*1000) performance of "MemoryFootprint" in { performance of "PrimitiveKeyOpenHashMap" in { using(sizes) config ( exec.benchRuns -> 1, exec.maxWarmupRuns -> 0, exec.independentSamples -> 1, exec.requireGC -> true, exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", "-XX:+UseG1GC") ) in { size => val map = new PrimitiveKeyOpenHashMap[Long, Long](size) var index = 0L while (index < size) { map(index) = 0L index+=1 } println("Size " + SizeEstimator.estimate(map)) while (index != 0) { index-=1 assert(map.contains(index)) } map } } } } {code} was: Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory footprint and I noticed that the footprint is higher than it should be: if I add 1M [Long,Long] entries to it, it has: ~ 34 MB in total ~ 17 MB at OpenHashSet as keys ~ 17 MB at Array as values The Array size is strange though, because its initial size is 1 048 576 (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, because a Long value has 8 bytes. Therefore I think that the values are getting boxed in this collection. The consequence of this problem is that if you put more than 100M Long entries into this map, the GC gets choked to death...
This is the scalameter code I used : {code} class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime { override def measurer = new Executor.Measurer.MemoryFootprint val sizes = Gen.single("size")(1*1000*1000) performance of "MemoryFootprint" in { performance of "PrimitiveKeyOpenHashMap" in { using(sizes) config ( exec.benchRuns -> 1, exec.maxWarmupRuns -> 0, exec.independentSamples -> 1, exec.requireGC -> true, exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", "-XX:+UseG1GC") ) in { size => val map = new PrimitiveKeyOpenHashMap[Long, Long](size) var index = 0L while (index < size) { map(index) = 0L index+=1 } println("Size " + SizeEstimator.estimate(map)) while (index != 0) { index-=1 assert(map.contains(index)) } map } } } } {code} > PrimitiveKeyOpenHashMap is boxing values > > > Key: SPARK-18919 > URL: https://issues.apache.org/jira/browse/SPARK-18919 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: ubuntu 16.04, scala 2.11.7, 2.12.1, java 1.8.0_111 >Reporter: Jakub Liska >Priority: Critical > > Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory > footprint and I noticed that the footprint is higher than it should be: if I > add 1M [Long,Long] entries to it, it has: > ~ 34 MB in total > ~ 17 MB at OpenHashSet as keys > ~ 17 MB at Array as values > The Array size is strange though, because its initial size is 1 048 576 > (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, because a Long value > has 8 bytes. Therefore I think that the values are getting boxed in this > collection. > The consequence of this problem is that if you put more than 100M Long > entries into this map, the GC gets choked to death even with an unlimited heap size ... > The strange thing is that I get the same results using @miniboxed instead of > @specialized > This is the scalameter code I used : > {code} > class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime { > override def measurer = new Executor.Measurer.MemoryFootprint > val sizes = Gen.single("size")(1*1000*1000) > performance of "MemoryFootprint" in { > performance of "PrimitiveKeyOpenHashMap" in { > using(sizes) config ( > exec.benchRuns -> 1, > exec.maxWarmupRuns -> 0, > exec.independentSamples -> 1, > exec.requireGC -> true, > exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", > "-XX:+UseG1GC") > ) in { size => > val map = new PrimitiveKeyOpenHashMap[Long, Long](size) > var index = 0L > while (index < size) { > map(index) = 0L > index+=1 > } > println("Size " + SizeEstimator.estimate(map)) > while (index != 0) { > index-=1 > assert(map.contains(index)) > } > map > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18919) PrimitiveKeyOpenHashMap is boxing values
Jakub Liska created SPARK-18919: --- Summary: PrimitiveKeyOpenHashMap is boxing values Key: SPARK-18919 URL: https://issues.apache.org/jira/browse/SPARK-18919 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Environment: ubuntu 16.04, scala 2.11.7, 2.12.1, java 1.8.0_111 Reporter: Jakub Liska Priority: Critical Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory footprint and I noticed that the footprint is higher than it should be: if I add 1M [Long,Long] entries to it, it has: ~ 34 MB in total ~ 17 MB at OpenHashSet as keys ~ 17 MB at Array as values The Array size is strange though, because its initial size is 1 048 576 (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, because a Long value has 8 bytes. Therefore I think that the values are getting boxed in this collection. The consequence of this problem is that if you put more than 100M Long entries into this map, the GC gets choked to death... This is the scalameter code I used : {code} class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime { override def measurer = new Executor.Measurer.MemoryFootprint val sizes = Gen.single("size")(1*1000*1000) performance of "MemoryFootprint" in { performance of "PrimitiveKeyOpenHashMap" in { using(sizes) config ( exec.benchRuns -> 1, exec.maxWarmupRuns -> 0, exec.independentSamples -> 1, exec.requireGC -> true, exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", "-XX:+UseG1GC") ) in { size => val map = new PrimitiveKeyOpenHashMap[Long, Long](size) var index = 0L while (index < size) { map(index) = 0L index+=1 } println("Size " + SizeEstimator.estimate(map)) while (index != 0) { index-=1 assert(map.contains(index)) } map } } } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
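The arithmetic behind the "Not A Problem" resolution, written out as a sketch; the 0.7 growth threshold is OpenHashSet's default load factor, which is consistent with the 734004 figure quoted in the close comment above:

{code}
// Why ~17 MB instead of ~8 MB for 1M Long values: the backing arrays double
// once the load factor is crossed, leaving the values array at 2^21 slots.
val capacity = 1 << 20                         // 1,048,576 initial slots
val growThreshold = (0.7 * capacity).toInt     // 734,003 -> insert #734,004 triggers a rehash
val valueBytesAfterGrowth = 2L * capacity * 8  // 2^21 Longs * 8 bytes = 16,777,216 (~16.8 MB)
{code}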
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759368#comment-15759368 ] Felix Cheung commented on SPARK-18817: -- testing fix, will open a PR shortly. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16046) Add Spark SQL Dataset Tutorial
[ https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758691#comment-15758691 ] Apache Spark commented on SPARK-16046: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/16329 > Add Spark SQL Dataset Tutorial > -- > > Key: SPARK-16046 > URL: https://issues.apache.org/jira/browse/SPARK-16046 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Pedro Rodriguez > > Issue to update the Spark SQL guide to provide more content around using > Datasets. This would expand the Creating Datasets section of the Spark SQL > documentation. > Goals > 1. Add more examples of column access via $ and ` > 2. Add examples of aggregates > 3. Add examples of using Spark SQL functions > What else would be useful to have? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16046) Add Spark SQL Dataset Tutorial
[ https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16046: Assignee: (was: Apache Spark) > Add Spark SQL Dataset Tutorial > -- > > Key: SPARK-16046 > URL: https://issues.apache.org/jira/browse/SPARK-16046 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Pedro Rodriguez > > Issue to update the Spark SQL guide to provide more content around using > Datasets. This would expand the Creating Datasets section of the Spark SQL > documentation. > Goals > 1. Add more examples of column access via $ and ` > 2. Add examples of aggregates > 3. Add examples of using Spark SQL functions > What else would be useful to have? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16046) Add Spark SQL Dataset Tutorial
[ https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16046: Assignee: Apache Spark > Add Spark SQL Dataset Tutorial > -- > > Key: SPARK-16046 > URL: https://issues.apache.org/jira/browse/SPARK-16046 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Pedro Rodriguez >Assignee: Apache Spark > > Issue to update the Spark SQL guide to provide more content around using > Datasets. This would expand the Creating Datasets section of the Spark SQL > documentation. > Goals > 1. Add more examples of column access via $ and ` > 2. Add examples of aggregates > 3. Add examples of using Spark SQL functions > What else would be useful to have? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
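A few illustrations of the kinds of examples being requested, sketched under the assumption of a SparkSession named spark with its implicits imported; the data here is made up:

{code}
// Sketches for the requested documentation sections.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("eng", 100L), ("eng", 120L), ("ops", 90L)).toDF("dept", "salary")

df.select($"dept", $"salary").show()             // column access via $
df.groupBy($"dept").agg(avg($"salary")).show()   // an aggregate
df.select(upper($"dept"), $"salary" + 10).show() // Spark SQL functions
{code}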
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758646#comment-15758646 ] certman commented on SPARK-12216: - I can confirm this issue exists on Windows 7 running Spark 2.x. Although this is a minor issue, it is still a bug and should be fixed. That something doesn't work on Windows doesn't mean it is a Windows bug; it's not due to permissions. Proposing "don't use Windows" as a workaround is no fix. > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.5.2 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
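The failure happens because Windows can still hold locks on files under the temp dir when the shutdown hook runs. A retry-based deletion is one commonly used mitigation; the sketch below is not Spark's actual shutdown-hook code, and the System.gc() call is only a heuristic to get memory-mapped file handles released:

{code}
// Sketch: retry recursive deletion, since Windows may briefly hold file locks.
import java.io.File

def deleteRecursively(f: File): Boolean = {
  if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete() // returns false while something below is still locked
}

def deleteWithRetry(dir: File, attempts: Int = 3): Boolean =
  (1 to attempts).exists { _ =>
    val deleted = deleteRecursively(dir)
    if (!deleted) { System.gc(); Thread.sleep(100) } // nudge mapped-file handles to close
    deleted
  }
{code}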
[jira] [Assigned] (SPARK-18808) ml.KMeansModel.transform is very inefficient
[ https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18808: Assignee: (was: Apache Spark) > ml.KMeansModel.transform is very inefficient > > > Key: SPARK-18808 > URL: https://issues.apache.org/jira/browse/SPARK-18808 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Michel Lemay > > The function ml.KMeansModel.transform will call the > parentModel.predict(features) method on each row, which in turn will > normalize all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm > every time! > This is a serious waste of resources! In my profiling, > clusterCentersWithNorm represents 99% of the sampling! > This should have been implemented with a broadcast variable, as is done in > other functions like computeCost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18808) ml.KMeansModel.transform is very inefficient
[ https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758577#comment-15758577 ] Apache Spark commented on SPARK-18808: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/16328 > ml.KMeansModel.transform is very inefficient > > > Key: SPARK-18808 > URL: https://issues.apache.org/jira/browse/SPARK-18808 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Michel Lemay > > The function ml.KMeansModel.transform will call the > parentModel.predict(features) method on each row, which in turn will > normalize all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm > every time! > This is a serious waste of resources! In my profiling, > clusterCentersWithNorm represents 99% of the sampling! > This should have been implemented with a broadcast variable, as is done in > other functions like computeCost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18808) ml.KMeansModel.transform is very inefficient
[ https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18808: Assignee: Apache Spark > ml.KMeansModel.transform is very inefficient > > > Key: SPARK-18808 > URL: https://issues.apache.org/jira/browse/SPARK-18808 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Michel Lemay >Assignee: Apache Spark > > The function ml.KMeansModel.transform will call the > parentModel.predict(features) method on each row, which in turn will > normalize all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm > every time! > This is a serious waste of resources! In my profiling, > clusterCentersWithNorm represents 99% of the sampling! > This should have been implemented with a broadcast variable, as is done in > other functions like computeCost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
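A sketch of the suggested shape of the fix. Here model, sc and data are assumed to be in scope (a KMeansModel, SparkContext and RDD[Vector] respectively), and findClosest is a hypothetical helper for the nearest-center search, not an actual public API:

{code}
// Sketch: normalize cluster centers once and broadcast them, instead of
// recomputing clusterCentersWithNorm for every row inside predict().
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val centersWithNorm: Array[(Vector, Double)] =
  model.clusterCenters.map(c => (c, Vectors.norm(c, 2.0))) // computed one time

val bcCenters = sc.broadcast(centersWithNorm)

// findClosest: hypothetical helper returning the index of the nearest center.
val predictions = data.map((point: Vector) => findClosest(bcCenters.value, point))
{code}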
[jira] [Comment Edited] (SPARK-18829) Printing to logger
[ https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758520#comment-15758520 ] David Hodeffi edited comment on SPARK-18829 at 12/18/16 9:10 AM: - If so, I think you should add it to the documentation and examples, since no one would ever guess it. For now I have already helped the Spark project by questioning and answering this question on stackoverflow.com with the help of you guys. was (Author: davidho): If so, I think you should add it to the documentation and examples, since no one would ever guess it. For now I have already helped the Spark project by questioning and answering this question on stackoverflow.com > Printing to logger > -- > > Key: SPARK-18829 > URL: https://issues.apache.org/jira/browse/SPARK-18829 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2 > Environment: ALL >Reporter: David Hodeffi >Priority: Trivial > Labels: easyfix, patch > Original Estimate: 1h > Remaining Estimate: 1h > > I would like to print dataframe.show or df.explain(true) into a log file. > Right now the code prints to standard output without a way to redirect it. > It also cannot be configured via log4j.properties. > My suggestion is to write to the logger and standard output. > i.e. > class DataFrame {.. > override def explain(extended: Boolean): Unit = { > val explain = ExplainCommand(queryExecution.logical, extended = extended) > sqlContext.executePlan(explain).executedPlan.executeCollect().foreach { > // scalastyle:off println > r => { > println(r.getString(0)) > logger.debug(r.getString(0)) > } > } > // scalastyle:on println > } > } > def show(numRows: Int, truncate: Boolean): Unit = { > val str = showString(numRows, truncate) > println(str) > logger.debug(str) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18827) Can't read broadcast if broadcast blocks are stored on-disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18827: -- Assignee: Yuming Wang > Can't read broadcast if broadcast blocks are stored on-disk > > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang > Fix For: 2.0.3, 2.1.1 > > Attachments: NoSuchElementException4722.gif > > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > A {{NoSuchElementException}} has been thrown since SPARK-17503 if a broadcast > cannot be cached in memory. The reason is that that change does not cover the > {{!unrolled.hasNext}} case in the {{next()}} function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18829) Printing to logger
[ https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758520#comment-15758520 ] David Hodeffi commented on SPARK-18829: --- If so, I think you should add it to the documentation and examples, since no one would ever guess it. For now I have already helped the Spark project by questioning and answering this question on stackoverflow.com > Printing to logger > -- > > Key: SPARK-18829 > URL: https://issues.apache.org/jira/browse/SPARK-18829 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2 > Environment: ALL >Reporter: David Hodeffi >Priority: Trivial > Labels: easyfix, patch > Original Estimate: 1h > Remaining Estimate: 1h > > I would like to print dataframe.show or df.explain(true) into a log file. > Right now the code prints to standard output without a way to redirect it. > It also cannot be configured via log4j.properties. > My suggestion is to write to the logger and standard output. > i.e. > class DataFrame {.. > override def explain(extended: Boolean): Unit = { > val explain = ExplainCommand(queryExecution.logical, extended = extended) > sqlContext.executePlan(explain).executedPlan.executeCollect().foreach { > // scalastyle:off println > r => { > println(r.getString(0)) > logger.debug(r.getString(0)) > } > } > // scalastyle:on println > } > } > def show(numRows: Int, truncate: Boolean): Unit = { > val str = showString(numRows, truncate) > println(str) > logger.debug(str) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
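In the meantime the output can be captured without any Spark changes, since show() and explain() write through standard output. A minimal sketch using only standard facilities, with df and logger assumed to be in scope:

{code}
// Workaround sketch: capture what show() prints to stdout and route it to a logger.
import java.io.ByteArrayOutputStream

val buffer = new ByteArrayOutputStream()
Console.withOut(buffer) { df.show() } // df: any DataFrame in scope
logger.debug(buffer.toString)         // logger: your slf4j/log4j logger
{code}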
[jira] [Resolved] (SPARK-18827) Can't read broadcast if broadcast blocks are stored on-disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18827. --- Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 16252 [https://github.com/apache/spark/pull/16252] > Can't read broadcast if broadcast blocks are stored on-disk > > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > Fix For: 2.1.1, 2.0.3 > > Attachments: NoSuchElementException4722.gif > > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > A {{NoSuchElementException}} has been thrown since SPARK-17503 if a broadcast > cannot be cached in memory. The reason is that that change does not cover the > {{!unrolled.hasNext}} case in the {{next()}} function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18882) Spark UI, storage tab is always empty.
[ https://issues.apache.org/jira/browse/SPARK-18882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758513#comment-15758513 ] David Hodeffi commented on SPARK-18882: --- I generate 4 random tables using the range() function on DataFrame, Cartesian-join all of them, and write them to a Hive table using HiveContext (Spark 1.6.2). > Spark UI, storage tab is always empty. > --- > > Key: SPARK-18882 > URL: https://issues.apache.org/jira/browse/SPARK-18882 > Project: Spark > Issue Type: Question > Components: Spark Core > Environment: Spark 1.6.2 >Reporter: David Hodeffi >Priority: Minor > > I have HDP 2.5 installed with Spark 1.6.2 on YARN, deploy mode cluster. > In my code I create a DataFrame, cache it, and then run an action on it. I > have tried to look for details about my DataFrame on the Storage tab of the > Spark UI, but I cannot see anything on this tab. > My question is: what could be the reasons why my Storage tab is still empty? > Is my code wrong? Is the configuration of HDP 2.5 with Spark on YARN wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
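One thing worth ruling out first: cache() is lazy, so nothing appears on the Storage tab until an action actually materializes the data, and the entries disappear once the application ends. A minimal check against the 1.6 API, with sqlContext assumed in scope:

{code}
// cache() only marks the DataFrame; an action must run before blocks show up.
val df = sqlContext.range(0, 1000000).cache()
df.count() // after this action, the Storage tab should list the cached blocks
{code}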
[jira] [Resolved] (SPARK-18918) Missing in Configuration page
[ https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18918. --- Resolution: Fixed Fix Version/s: 2.1.1 Issue resolved by pull request 16327 [https://github.com/apache/spark/pull/16327] > Missing in Configuration page > --- > > Key: SPARK-18918 > URL: https://issues.apache.org/jira/browse/SPARK-18918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > Fix For: 2.1.1 > > > The configuration page looks messy now, as shown in the nightly build: > https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18918) Missing in Configuration page
[ https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18918: -- Target Version/s: (was: 2.1.0) Priority: Minor (was: Blocker) > Missing in Configuration page > --- > > Key: SPARK-18918 > URL: https://issues.apache.org/jira/browse/SPARK-18918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > > The configuration page looks messy now, as shown in the nightly build: > https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18918) Missing in Configuration page
[ https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18918: Assignee: Apache Spark (was: Xiao Li) > Missing in Configuration page > --- > > Key: SPARK-18918 > URL: https://issues.apache.org/jira/browse/SPARK-18918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Blocker > > The configuration page looks messy now, as shown in the nightly build: > https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18918) Missing in Configuration page
[ https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18918: Assignee: Xiao Li (was: Apache Spark) > Missing in Configuration page > --- > > Key: SPARK-18918 > URL: https://issues.apache.org/jira/browse/SPARK-18918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > The configuration page looks messy now, as shown in the nightly build: > https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18918) Missing in Configuration page
[ https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758459#comment-15758459 ] Apache Spark commented on SPARK-18918: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16327 > Missing in Configuration page > --- > > Key: SPARK-18918 > URL: https://issues.apache.org/jira/browse/SPARK-18918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > The configuration page looks messy now, as shown in the nightly build: > https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18918) Missing in Configuration page
Xiao Li created SPARK-18918: --- Summary: Missing in Configuration page Key: SPARK-18918 URL: https://issues.apache.org/jira/browse/SPARK-18918 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li Priority: Blocker The configuration page looks messy now, as shown in the nightly build: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18915) Return Nothing when Querying a Partitioned Data Source Table without Repairing it
[ https://issues.apache.org/jira/browse/SPARK-18915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18915: Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > Return Nothing when Querying a Partitioned Data Source Table without > Repairing it > - > > Key: SPARK-18915 > URL: https://issues.apache.org/jira/browse/SPARK-18915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Priority: Critical > > In Spark 2.1, if we create a partitioned data source table given a specified > path, it returns nothing when we try to query it. To get the data, we have to > manually issue a DDL to repair the table. > In Spark 2.0, it can return the data stored in the specified path, without > repairing the table. > Below is the output of Spark 2.1. > {noformat} > scala> spark.range(5).selectExpr("id as fieldOne", "id as > partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test") > [Stage 0:==>(3 + 5) / > 8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > scala> spark.sql("select * from test").show() > ++---+ > |fieldOne|partCol| > ++---+ > | 0| 0| > | 1| 1| > | 2| 2| > | 3| 3| > | 4| 4| > ++---+ > scala> spark.sql("desc formatted test").show(50, false) > ++--+---+ > |col_name|data_type |comment| > ++--+---+ > |fieldOne|bigint |null | > |partCol |bigint |null | > |# Partition Information | | | > |# col_name |data_type |comment| > |partCol |bigint |null | > || | | > |# Detailed Table Information| | | > |Database: |default | | > |Owner: |xiaoli | | > |Create Time:|Sat Dec 17 17:46:24 PST 2016 | | > |Last Access Time: |Wed Dec 31 16:00:00 PST 1969 | | > |Location: |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test| | > |Table Type: |MANAGED | | > |Table Parameters: | | | > | transient_lastDdlTime |1482025584 | | > || | | > |# Storage Information | | | > |SerDe Library: |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | | > |InputFormat: |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | | > |OutputFormat: |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| | > |Compressed: |No | | > |Storage Desc Parameters:| | | > | serialization.format |1 | | > |Partition Provider: |Catalog |
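The repair DDL referred to in the description, for reference; "test" is the table from the repro above, a SparkSession named spark is assumed, and both spellings below name the same command in Spark 2.1:

{code}
// Make partitions already present under the table's path visible to the catalog.
spark.sql("MSCK REPAIR TABLE test")
// equivalently:
spark.sql("ALTER TABLE test RECOVER PARTITIONS")
{code}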