[jira] [Comment Edited] (SPARK-18877) Unable to read given CSV data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-18 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760487#comment-15760487
 ] 

Navya Krishnappa edited comment on SPARK-18877 at 12/19/16 7:56 AM:


Thank you [~dongjoon], and I will create an issue in Apache Parquet.


was (Author: navya krishnappa):
Thank you [~dongjoon]

> Unable to read given CSV data. Exception: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading the CSV data below, even though the maximum decimal precision 
> is 38, the following exception is thrown: 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253






[jira] [Comment Edited] (SPARK-18877) Unable to read given CSV data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-18 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760487#comment-15760487
 ] 

Navya Krishnappa edited comment on SPARK-18877 at 12/19/16 7:56 AM:


Thank you [~dongjoon], and I will create an issue in the Apache Parquet JIRA.


was (Author: navya krishnappa):
Thank you [~dongjoon], and I will create an issue in Apache Parquet.

> Unable to read given CSV data. Exception: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading the CSV data below, even though the maximum decimal precision 
> is 38, the following exception is thrown: 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253






[jira] [Commented] (SPARK-18877) Unable to read given CSV data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-18 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760487#comment-15760487
 ] 

Navya Krishnappa commented on SPARK-18877:
--

Thank you [~dongjoon]

> Unable to read given CSV data. Exception: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading the CSV data below, even though the maximum decimal precision 
> is 38, the following exception is thrown: 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253
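
For reference, a minimal sketch of reading such a CSV column with an 
explicitly supplied DecimalType(38, 0) schema instead of relying on schema 
inference; the SparkSession setup and the file path are assumptions for 
illustration, not part of the original report.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val spark = SparkSession.builder().appName("decimal-csv-read").getOrCreate()

// Supply the column type up front with the maximum supported precision (38)
// rather than letting CSV schema inference pick a narrower decimal type.
val schema = StructType(Seq(StructField("Decimal", DecimalType(38, 0))))

val df = spark.read
  .option("header", "true")      // the sample data starts with a "Decimal" header row
  .schema(schema)
  .csv("/path/to/decimal.csv")   // hypothetical path to the data above

df.show(truncate = false)
{code}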






[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760439#comment-15760439
 ] 

Xiangrui Meng commented on SPARK-18924:
---

cc: [~shivaram] [~felixcheung] [~falaki] [~yanboliang] for discussion.

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * [serialize|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between the JVM and R suffers from huge storage and 
> computation overhead. For example, a short round trip of 1 million doubles 
> surprisingly took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>    user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame locally and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive-type columns 
> in particular.
> 3. There is some overhead in the serialization protocol/implementation.
> 1) might have been discussed before, because R packages like rJava existed 
> before SparkR. I'm not sure whether we would have a license issue in 
> depending on those libraries. Another option is to switch to R's low-level C 
> interface or Rcpp, which again might have license issues. I'm not an expert 
> here. If we have to implement our own, there is still much room for 
> improvement, as discussed below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows locally and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement specialized column builders based 
> on column type, primitive types in particular. We need to handle null values 
> properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read or write the 
> entire vector in a single method call. The speed seems reasonable (on the 
> order of GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>    user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>    user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.






[jira] [Created] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-18924:
-

 Summary: Improve collect/createDataFrame performance in SparkR
 Key: SPARK-18924
 URL: https://issues.apache.org/jira/browse/SPARK-18924
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Xiangrui Meng
Priority: Critical


SparkR has its own SerDe for data serialization between JVM and R.

The SerDe on the JVM side is implemented in:
* 
[SerDe|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
* 
[SQLUtils|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]

The SerDe on the R side is implemented in:
* 
[deserialize|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
* [serialize|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]

The serialization between the JVM and R suffers from huge storage and 
computation overhead. For example, a short round trip of 1 million doubles 
surprisingly took 3 minutes on my laptop:

{code}
> system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
   user  system elapsed
 14.224   0.582 189.135
{code}

Collecting a medium-sized DataFrame locally and continuing with a local R 
workflow is a use case we should pay attention to. SparkR will never be able to 
cover all existing features from CRAN packages. It is also unnecessary for 
Spark to do so because not all features need scalability. 


Several factors contribute to the serialization overhead:
1. The SerDe on the R side is implemented using high-level R methods.
2. DataFrame columns are not efficiently serialized, primitive-type columns in 
particular.
3. There is some overhead in the serialization protocol/implementation.

1) might have been discussed before, because R packages like rJava existed 
before SparkR. I'm not sure whether we would have a license issue in depending 
on those libraries. Another option is to switch to R's low-level C interface or 
Rcpp, which again might have license issues. I'm not an expert here. If we have 
to implement our own, there is still much room for improvement, as discussed 
below.

2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
which collects rows locally and then constructs columns. However,
* it ignores column types and results in boxing/unboxing overhead
* it collects all objects to the driver and results in high GC pressure

A relatively simple change is to implement specialized column builders based on 
column type, primitive types in particular. We need to handle null values 
properly. A simple data structure we can use is

{code}
val size: Int
val nullIndexes: Array[Int]
val notNullValues: Array[T] // specialized for primitive types
{code}
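
As a concrete illustration of that layout, a minimal, self-contained Scala 
sketch of a specialized builder for a double column; the class and method 
names are invented for illustration and are not proposed Spark APIs.

{code}
import scala.collection.mutable.ArrayBuffer

// Sketch of the layout above: null positions tracked by index, non-null
// values kept separately and materialized as a primitive Array[Double]
// (a production version would grow a raw primitive array to avoid boxing
// inside the buffer).
final class DoubleColumnBuilder {
  private val nullIndexes = ArrayBuffer.empty[Int]
  private val notNullValues = ArrayBuffer.empty[Double]
  private var size = 0

  def append(v: java.lang.Double): Unit = {
    if (v == null) nullIndexes += size else notNullValues += v.doubleValue()
    size += 1
  }

  // Mirrors the (size, nullIndexes, notNullValues) triple sketched above.
  def result(): (Int, Array[Int], Array[Double]) =
    (size, nullIndexes.toArray, notNullValues.toArray)
}

val builder = new DoubleColumnBuilder
Seq[java.lang.Double](1.0, null, 3.5).foreach(builder.append)
val (n, nulls, values) = builder.result()  // (3, Array(1), Array(1.0, 3.5))
{code}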

On the R side, we can use `readBin` and `writeBin` to read or write the entire 
vector in a single method call. The speed seems reasonable (on the order of 
GB/s):

{code}
> x <- runif(1e7) # 1e7, not 1e6
> system.time(r <- writeBin(x, raw(0)))
   user  system elapsed
  0.036   0.021   0.059
> system.time(y <- readBin(r, double(), 1e7))
   user  system elapsed
  0.015   0.007   0.024
{code}

This is just a proposal that needs to be discussed and formalized. But in 
general, it should be feasible to obtain 20x or more performance gain.
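
As a rough, hypothetical sketch of the JVM side of such a binary path (not the 
existing SerDe code), a primitive double column can be packed into one 
contiguous byte buffer that R's `readBin` could consume in a single call; the 
helper name is invented for illustration.

{code}
import java.nio.{ByteBuffer, ByteOrder}

// Pack an Array[Double] into one contiguous byte buffer in a single pass,
// instead of writing each value through the generic object SerDe.
def packDoubles(xs: Array[Double]): Array[Byte] = {
  val buf = ByteBuffer.allocate(xs.length * java.lang.Double.BYTES)
  buf.order(ByteOrder.LITTLE_ENDIAN)  // match readBin(..., endian = "little") on the R side
  xs.foreach(buf.putDouble)
  buf.array()
}

val bytes = packDoubles(Array(1.0, 2.5, 3.0))
// The R side could then read the whole column at once, e.g.:
//   readBin(bytes, double(), n = 3, size = 8, endian = "little")
{code}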






[jira] [Assigned] (SPARK-18871) New test cases for IN subquery

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18871:


Assignee: (was: Apache Spark)

> New test cases for IN subquery
> --
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>
> This JIRA is open for submitting a PR with new test cases for IN/NOT IN 
> subqueries. We plan to put more than 100 test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with a 
> simple SELECT in both the parent query and the subquery, to subqueries with 
> more complex constructs (joins, aggregates, etc.) on both sides. The test 
> data include null values and duplicate values. 






[jira] [Assigned] (SPARK-18871) New test cases for IN subquery

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18871:


Assignee: Apache Spark

> New test cases for IN subquery
> --
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: Apache Spark
>
> This JIRA is open for submitting a PR with new test cases for IN/NOT IN 
> subqueries. We plan to put more than 100 test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with a 
> simple SELECT in both the parent query and the subquery, to subqueries with 
> more complex constructs (joins, aggregates, etc.) on both sides. The test 
> data include null values and duplicate values. 






[jira] [Commented] (SPARK-18871) New test cases for IN subquery

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760370#comment-15760370
 ] 

Apache Spark commented on SPARK-18871:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/16337

> New test cases for IN subquery
> --
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>
> This JIRA is open for submitting a PR with new test cases for IN/NOT IN 
> subqueries. We plan to put more than 100 test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with a 
> simple SELECT in both the parent query and the subquery, to subqueries with 
> more complex constructs (joins, aggregates, etc.) on both sides. The test 
> data include null values and duplicate values. 
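
For flavor, a small hypothetical example of the query shapes such test cases 
exercise, including the null handling mentioned above; the table and column 
names are made up.

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().appName("in-subquery-example").getOrCreate()

// Tiny test tables containing null and duplicate values, as the description mentions.
val schemaA = StructType(Seq(StructField("a", IntegerType)))
val schemaB = StructType(Seq(StructField("b", IntegerType)))
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(2), Row(2), Row(null))), schemaA)
  .createOrReplaceTempView("t1")
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(2), Row(null))), schemaB)
  .createOrReplaceTempView("t2")

// IN keeps rows whose value matches something in the subquery.
spark.sql("SELECT a FROM t1 WHERE a IN (SELECT b FROM t2)").show()

// NOT IN returns no rows here: the NULL in the subquery makes the predicate
// unknown for every row (three-valued logic).
spark.sql("SELECT a FROM t1 WHERE a NOT IN (SELECT b FROM t2)").show()
{code}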






[jira] [Assigned] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18923:


Assignee: Apache Spark

> Support SKIP_PYTHONDOC/RDOC in doc generation
> -
>
> Key: SPARK-18923
> URL: https://issues.apache.org/jira/browse/SPARK-18923
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for documentation 
> generation. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`.
> The reason is that the Spark documentation build uses a number of tools to 
> build HTML docs and API docs in Scala, Python and R. In particular,
> - Python API docs require `sphinx`.
> - R API docs require an `R` installation and `knitr` (and other libraries).
> In other words, we cannot generate Python API docs without an R installation, 
> and we cannot generate R API docs without a Python `sphinx` installation.
> If Spark provided `SKIP_PYTHONDOC` and `SKIP_RDOC`, like `SKIP_SCALADOC`, it 
> would be more convenient.






[jira] [Assigned] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18923:


Assignee: (was: Apache Spark)

> Support SKIP_PYTHONDOC/RDOC in doc generation
> -
>
> Key: SPARK-18923
> URL: https://issues.apache.org/jira/browse/SPARK-18923
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for documentation 
> generation. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`.
> The reason is that the Spark documentation build uses a number of tools to 
> build HTML docs and API docs in Scala, Python and R. In particular,
> - Python API docs require `sphinx`.
> - R API docs require an `R` installation and `knitr` (and other libraries).
> In other words, we cannot generate Python API docs without an R installation, 
> and we cannot generate R API docs without a Python `sphinx` installation.
> If Spark provided `SKIP_PYTHONDOC` and `SKIP_RDOC`, like `SKIP_SCALADOC`, it 
> would be more convenient.






[jira] [Commented] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760332#comment-15760332
 ] 

Apache Spark commented on SPARK-18923:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16336

> Support SKIP_PYTHONDOC/RDOC in doc generation
> -
>
> Key: SPARK-18923
> URL: https://issues.apache.org/jira/browse/SPARK-18923
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for documentation 
> generation. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`.
> The reason is that the Spark documentation build uses a number of tools to 
> build HTML docs and API docs in Scala, Python and R. In particular,
> - Python API docs require `sphinx`.
> - R API docs require an `R` installation and `knitr` (and other libraries).
> In other words, we cannot generate Python API docs without an R installation, 
> and we cannot generate R API docs without a Python `sphinx` installation.
> If Spark provided `SKIP_PYTHONDOC` and `SKIP_RDOC`, like `SKIP_SCALADOC`, it 
> would be more convenient.






[jira] [Created] (SPARK-18923) Support SKIP_PYTHONDOC/RDOC in doc generation

2016-12-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-18923:
-

 Summary: Support SKIP_PYTHONDOC/RDOC in doc generation
 Key: SPARK-18923
 URL: https://issues.apache.org/jira/browse/SPARK-18923
 Project: Spark
  Issue Type: Improvement
  Components: Build, Documentation
Reporter: Dongjoon Hyun
Priority: Minor


This issue aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for documentation 
generation. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`.

The reason is that the Spark documentation build uses a number of tools to 
build HTML docs and API docs in Scala, Python and R. In particular,

- Python API docs require `sphinx`.
- R API docs require an `R` installation and `knitr` (and other libraries).

In other words, we cannot generate Python API docs without an R installation, 
and we cannot generate R API docs without a Python `sphinx` installation.

If Spark provided `SKIP_PYTHONDOC` and `SKIP_RDOC`, like `SKIP_SCALADOC`, it 
would be more convenient.






[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-18 Thread vishal agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760203#comment-15760203
 ] 

vishal agrawal commented on SPARK-18857:


We are unable to use incremental collect in a Spark version before 2.0.2 due 
to the bug SPARK-18009.

We will have to take 2.0.2, change this particular class, and build from the 
source code.

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 
> 200 million records through the SparkSQL ThriftServer interface. This query 
> works fine with Spark 1.6.3; however, with Spark 2.0.2 the Thrift server 
> hangs after fetching data from a few partitions (we are using incremental 
> collect mode with 400 partitions). Per the documentation, the maximum memory 
> taken up by the Thrift server should be what is required by the biggest data 
> partition. But we observed that the Thrift server does not release the old 
> partitions' memory when GC occurs, even though it has moved on to fetching 
> the next partition's data, which is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", and the result set iterator was duplicated to keep a reference to the 
> first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect that this is what keeps the memory from being cleared on GC. To 
> confirm this, we created an iterator in our test class and fetched the data 
> once without duplicating it and a second time after creating a duplicate. We 
> could see that in the first instance it ran fine and fetched the entire data 
> set, while in the second instance the driver hung after fetching data from a 
> few partitions.
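
The Iterator.duplicate behavior described above can be reproduced in isolation 
with plain Scala; a minimal standalone sketch (not the ThriftServer code 
itself):

{code}
// scala.collection.Iterator.duplicate returns two iterators over one source.
// Elements consumed through one branch but not yet through the other are kept
// in an internal gap buffer, so retaining an unread duplicate (itra below)
// keeps every row served through itrb reachable.
val iter = Iterator.tabulate(5)(i => s"row-$i")
val (itra, itrb) = iter.duplicate

val served = itrb.toList   // simulate serving all rows to the client via itrb
println(served.size)       // 5
// All five elements are still buffered on behalf of itra, which has read
// nothing; only consuming or dropping itra lets them become collectable.
println(itra.next())       // row-0, replayed from the buffer
{code}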






[jira] [Assigned] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18922:


Assignee: (was: Apache Spark)

> Fix more resource-closing-related and path-related test failures in 
> identified ones on Windows
> --
>
> Key: SPARK-18922
> URL: https://issues.apache.org/jira/browse/SPARK-18922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> There are more test cases that fail on Windows, as below:
> - {{LauncherBackendSuite}}:
> {code}
> - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
>   The code passed to eventually never returned normally. Attempted 283 times 
> over 30.0960053 seconds. Last failure message: The reference was null. 
> (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
> milliseconds)
>   The code passed to eventually never returned normally. Attempted 282 times 
> over 30.03798710002 seconds. Last failure message: The reference was 
> null. (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> {code}
> - {{SQLQuerySuite}}:
> {code}
> - specifying database name for a temporary table is not allowed *** FAILED 
> *** (125 milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{JsonSuite}}:
> {code}
> - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{StateStoreSuite}}:
> {code}
> - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:116)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   ...
>   Cause: java.net.URISyntaxException: Relative path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> {code}
> - {{HDFSMetadataLogSuite}}:
> {code}
> - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> {code}
> For the full logs, please refer to 
> https://ci.appveyor.com/project/spark-test/spark/build/283-tmp-test-base




[jira] [Commented] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760116#comment-15760116
 ] 

Apache Spark commented on SPARK-18922:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16335

> Fix more resource-closing-related and path-related test failures in 
> identified ones on Windows
> --
>
> Key: SPARK-18922
> URL: https://issues.apache.org/jira/browse/SPARK-18922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> There are more test cases that fail on Windows, as below:
> - {{LauncherBackendSuite}}:
> {code}
> - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
>   The code passed to eventually never returned normally. Attempted 283 times 
> over 30.0960053 seconds. Last failure message: The reference was null. 
> (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
> milliseconds)
>   The code passed to eventually never returned normally. Attempted 282 times 
> over 30.03798710002 seconds. Last failure message: The reference was 
> null. (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> {code}
> - {{SQLQuerySuite}}:
> {code}
> - specifying database name for a temporary table is not allowed *** FAILED 
> *** (125 milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{JsonSuite}}:
> {code}
> - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{StateStoreSuite}}:
> {code}
> - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:116)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   ...
>   Cause: java.net.URISyntaxException: Relative path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> {code}
> - {{HDFSMetadataLogSuite}}:
> {code}
> - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> {code}
> For the full logs, please refer to 
> https://ci.appveyor.com/project/spark-test/spark/build/283-tmp-test-base




[jira] [Assigned] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18922:


Assignee: Apache Spark

> Fix more resource-closing-related and path-related test failures in 
> identified ones on Windows
> --
>
> Key: SPARK-18922
> URL: https://issues.apache.org/jira/browse/SPARK-18922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> There are more test cases that fail on Windows, as below:
> - {{LauncherBackendSuite}}:
> {code}
> - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
>   The code passed to eventually never returned normally. Attempted 283 times 
> over 30.0960053 seconds. Last failure message: The reference was null. 
> (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
> milliseconds)
>   The code passed to eventually never returned normally. Attempted 282 times 
> over 30.03798710002 seconds. Last failure message: The reference was 
> null. (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> {code}
> - {{SQLQuerySuite}}:
> {code}
> - specifying database name for a temporary table is not allowed *** FAILED 
> *** (125 milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{JsonSuite}}:
> {code}
> - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{StateStoreSuite}}:
> {code}
> - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:116)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   ...
>   Cause: java.net.URISyntaxException: Relative path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> {code}
> - {{HDFSMetadataLogSuite}}:
> {code}
> - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> {code}
> For the full logs, please refer to 
> https://ci.appveyor.com/project/spark-test/spark/build/283-tmp-test-base




[jira] [Created] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2016-12-18 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18922:


 Summary: Fix more resource-closing-related and path-related test 
failures in identified ones on Windows
 Key: SPARK-18922
 URL: https://issues.apache.org/jira/browse/SPARK-18922
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Reporter: Hyukjin Kwon
Priority: Minor


There are more test cases that fail on Windows, as below:

- {{LauncherBackendSuite}}:

{code}
- local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
  The code passed to eventually never returned normally. Attempted 283 times 
over 30.0960053 seconds. Last failure message: The reference was null. 
(LauncherBackendSuite.scala:56)
  org.scalatest.exceptions.TestFailedDueToTimeoutException:
  at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
  at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)

- standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
milliseconds)
  The code passed to eventually never returned normally. Attempted 282 times 
over 30.03798710002 seconds. Last failure message: The reference was null. 
(LauncherBackendSuite.scala:56)
  org.scalatest.exceptions.TestFailedDueToTimeoutException:
  at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
  at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
{code}

- {{SQLQuerySuite}}:

{code}
- specifying database name for a temporary table is not allowed *** FAILED *** 
(125 milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{code}

- {{JsonSuite}}:

{code}
- Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
milliseconds)
  org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{code}

- {{StateStoreSuite}}:

{code}
- SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
  java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
path in absolute URI: 
StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.(Path.java:116)
  at org.apache.hadoop.fs.Path.(Path.java:89)
  ...
  Cause: java.net.URISyntaxException: Relative path in absolute URI: 
StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:203)
{code}

- {{HDFSMetadataLogSuite}}:

{code}
- FileManager: FileContextManager *** FAILED *** (94 milliseconds)
  java.io.IOException: Failed to delete: 
C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at 
org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
  at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
- FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
  java.io.IOException: Failed to delete: 
C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at 
org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
  at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
{code}

For the full logs, please refer to 
https://ci.appveyor.com/project/spark-test/spark/build/283-tmp-test-base






[jira] [Updated] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18703:

Fix Version/s: 2.1.1

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not 
> Dropped Until Normal Termination of JVM
> --
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
>
> Below are the files/directories generated for three inserts against a Hive 
> table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary. We do not drop them until the JVM 
> terminates. If the JVM does not terminate properly, these temporary 
> files/directories will not be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> Ideally, we shou

[jira] [Updated] (SPARK-18675) CTAS for hive serde table should work for all hive versions

2016-12-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18675:

Fix Version/s: 2.1.1

> CTAS for hive serde table should work for all hive versions
> ---
>
> Key: SPARK-18675
> URL: https://issues.apache.org/jira/browse/SPARK-18675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Assigned] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18921:


Assignee: Wenchen Fan  (was: Apache Spark)

> check database existence with Hive.databaseExists instead of getDatabase
> 
>
> Key: SPARK-18921
> URL: https://issues.apache.org/jira/browse/SPARK-18921
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>







[jira] [Commented] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760023#comment-15760023
 ] 

Apache Spark commented on SPARK-18921:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16332

> check database existence with Hive.databaseExists instead of getDatabase
> 
>
> Key: SPARK-18921
> URL: https://issues.apache.org/jira/browse/SPARK-18921
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>







[jira] [Assigned] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18921:


Assignee: Apache Spark  (was: Wenchen Fan)

> check database existence with Hive.databaseExists instead of getDatabase
> 
>
> Key: SPARK-18921
> URL: https://issues.apache.org/jira/browse/SPARK-18921
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase

2016-12-18 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18921:
---

 Summary: check database existence with Hive.databaseExists instead 
of getDatabase
 Key: SPARK-18921
 URL: https://issues.apache.org/jira/browse/SPARK-18921
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor









[jira] [Closed] (SPARK-18767) Unify Models' toString methods

2016-12-18 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-18767.

Resolution: Won't Fix

> Unify Models' toString methods
> --
>
> Key: SPARK-18767
> URL: https://issues.apache.org/jira/browse/SPARK-18767
> Project: Spark
>  Issue Type: Improvement
>Reporter: zhengruifeng
>Priority: Minor
>
> Models' toString should output some info, not just the uid of its trainer. 
> {code}
> scala> val nb = new NaiveBayes
> nb: org.apache.spark.ml.classification.NaiveBayes = nb_18e8984091a8
> scala> val nbm = nb.fit(data)
> nbm: org.apache.spark.ml.classification.NaiveBayesModel = NaiveBayesModel 
> (uid=nb_18e8984091a8) with 2 classes
> scala> val dt = new DecisionTreeClassifier
> dt: org.apache.spark.ml.classification.DecisionTreeClassifier = 
> dtc_627dac64995e
> scala> val dtm = dt.fit(data)
> 16/12/07 15:08:14 WARN Executor: 1 block locks were not released by TID = 94:
> [rdd_8_0]
> dtm: org.apache.spark.ml.classification.DecisionTreeClassificationModel = 
> DecisionTreeClassificationModel (uid=dtc_627dac64995e) of depth 2 with 5 nodes
> scala> val lr = new LogisticRegression
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_251625c948a0
> scala> val lrm = lr.fit(data)
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
> logreg_251625c948a0
> {code}
> I would override toString in the models to make them all look like this:
> {{ModelClassName (uid=...) with key params}}
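
A toy, non-Spark sketch of the proposed convention (the class names here are 
invented for illustration):

{code}
// Each model's toString surfaces its key parameters in a uniform
// "ClassName (uid=...) with key params" shape.
abstract class ToyModel(val uid: String)

class ToyNaiveBayesModel(uid: String, val numClasses: Int) extends ToyModel(uid) {
  override def toString: String =
    s"ToyNaiveBayesModel (uid=$uid) with $numClasses classes"
}

class ToyTreeModel(uid: String, val depth: Int, val numNodes: Int) extends ToyModel(uid) {
  override def toString: String =
    s"ToyTreeModel (uid=$uid) of depth $depth with $numNodes nodes"
}

println(new ToyNaiveBayesModel("nb_18e8984091a8", 2))
// ToyNaiveBayesModel (uid=nb_18e8984091a8) with 2 classes
{code}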






[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759931#comment-15759931
 ] 

Dongjoon Hyun commented on SPARK-18917:
---

Hi, [~alunarbeach].
Sure, you can make a PR.
BTW, please don't set the target version or fix version fields.
Usually, target versions are set by committers, and fix versions are recorded 
only when your patch is merged.

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2
>Reporter: Anbu Cheeralan
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using DataFrame write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitioning, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition match check in append mode. I 
> can work on the patch.






[jira] [Updated] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-18 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18917:
--
Fix Version/s: (was: 2.1.1)
   (was: 2.1.0)

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2
>Reporter: Anbu Cheeralan
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using DataFrame write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitioning, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition match check in append mode. I 
> can work on the patch.






[jira] [Updated] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-18 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18917:
--
Target Version/s:   (was: 2.1.0, 2.1.1)

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2
>Reporter: Anbu Cheeralan
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using DataFrame write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitions, this takes forever.
> The following code in 
> org.apache.spark.sql.execution.datasources.DataSource.write() causes a huge 
> number of RPC calls when the file system is an object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition match check in append mode. I can 
> work on the patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18920) Update outdated date formatting

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759917#comment-15759917
 ] 

Apache Spark commented on SPARK-18920:
--

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/16331

> Update outdated date formatting
> ---
>
> Key: SPARK-18920
> URL: https://issues.apache.org/jira/browse/SPARK-18920
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Tao Wang
>Priority: Minor
>
> Before, we showed "-" while the timestamp was less than 0; this should be 
> updated now that the date string is presented in the format "-MM-dd ."
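
A minimal sketch of the behaviour described above. The "yyyy-MM-dd HH:mm:ss" pattern is an assumption for illustration; the exact pattern used by the Web UI is not shown in this report.

{code}
import java.text.SimpleDateFormat
import java.util.Date

// Negative timestamps are rendered as "-", everything else as a formatted date.
def formatTimestamp(ts: Long): String =
  if (ts < 0) "-" else new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts))
{code}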



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18920) Update outdated date formatting

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18920:


Assignee: (was: Apache Spark)

> Update outdated date formatting
> ---
>
> Key: SPARK-18920
> URL: https://issues.apache.org/jira/browse/SPARK-18920
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Tao Wang
>Priority: Minor
>
> Before, we showed "-" while the timestamp was less than 0; this should be 
> updated now that the date string is presented in the format "-MM-dd ."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18920) Update outdated date formatting

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18920:


Assignee: Apache Spark

> Update outdated date formatting
> ---
>
> Key: SPARK-18920
> URL: https://issues.apache.org/jira/browse/SPARK-18920
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Tao Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Before, we showed "-" while the timestamp was less than 0; this should be 
> updated now that the date string is presented in the format "-MM-dd ."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18920) Update outdated date formatting

2016-12-18 Thread Tao Wang (JIRA)
Tao Wang created SPARK-18920:


 Summary: Update outdated date formatting
 Key: SPARK-18920
 URL: https://issues.apache.org/jira/browse/SPARK-18920
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Tao Wang
Priority: Minor


Before, we showed "-" while the timestamp was less than 0; this should be 
updated now that the date string is presented in the format "-MM-dd ."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18916) Possible bug in Pregel / mergeMsg with hashmaps

2016-12-18 Thread Seth Bromberger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759870#comment-15759870
 ] 

Seth Bromberger commented on SPARK-18916:
-

Added to update: https://issues.scala-lang.org/browse/SI-9895 appears to be 
what hit me.

> Possible bug in Pregel / mergeMsg with hashmaps
> ---
>
> Key: SPARK-18916
> URL: https://issues.apache.org/jira/browse/SPARK-18916
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2
> Environment: OSX / IntelliJ IDEA 2016.3 CE EAP, Scala 2.11.8, Spark 
> 2.0.2
>Reporter: Seth Bromberger
>  Labels: error, graphx, pregel
>
> Consider the following (rough) code that attempts to calculate all-pairs 
> shortest paths via pregel:
> {code:java}
> def allPairsShortestPaths: RDD[(VertexId, HashMap[VertexId, ParentDist])] 
> = {
>   val initialMsg = HashMap(-1L -> ParentDist(-1L, -1L))
>   val pregelg = g.mapVertices((vid, vd) => (vd, HashMap[VertexId, 
> ParentDist](vid -> ParentDist(vid, 0L.reverse
>   def vprog(v: VertexId, value: (VD, HashMap[VertexId, ParentDist]), 
> message: HashMap[VertexId, ParentDist]): (VD, HashMap[VertexId, ParentDist]) 
> = {
> val updatedValues = mm2(value._2, message).filter(v => v._2.dist >= 0)
> (value._1, updatedValues)
>   }
>   def sendMsg(triplet: EdgeTriplet[(VD, HashMap[VertexId, ParentDist]), 
> ED]): Iterator[(VertexId, HashMap[VertexId, ParentDist])] = {
> val dstVertexId = triplet.dstId
> val srcMap = triplet.srcAttr._2
> val dstMap = triplet.dstAttr._2  // guaranteed to have dstVertexId as 
> a key
> val updatesToSend : HashMap[VertexId, ParentDist] = srcMap.filter {
>   case (vid, srcPD) => dstMap.get(vid) match {
> case Some(dstPD) => dstPD.dist > srcPD.dist + 1   // if it 
> exists, is it cheaper?
> case _ => true // not found - new update
>   }
> }.map(u => u._1 -> ParentDist(triplet.srcId, u._2.dist +1))
> if (updatesToSend.nonEmpty)
>   Iterator[(VertexId, HashMap[VertexId, ParentDist])]((dstVertexId, 
> updatesToSend))
> else
>   Iterator.empty
>   }
>   def mergeMsg(m1: HashMap[VertexId, ParentDist], m2: HashMap[VertexId, 
> ParentDist]): HashMap[VertexId, ParentDist] = {
> // when the following two lines are commented out, the program fails 
> with
> // 16/12/17 19:53:50 INFO DAGScheduler: Job 24 failed: reduce at 
> VertexRDDImpl.scala:88, took 0.244042 s
> // Exception in thread "main" org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 0 in stage 1099.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 1099.0 (TID 129, localhost): 
> scala.MatchError: (null,null) (of class scala.Tuple2)
> m1.foreach(_ => ())
> m2.foreach(_ => ())
> m1.merged(m2) {
>   case ((k1, v1), (_, v2)) => (k1, v1.min(v2))
> }
>   }
>   // mm2 is here just to provide a separate function for vprog. Ideally 
> we'd just re-use mergeMsg.
>   def mm2(m1: HashMap[VertexId, ParentDist], m2: HashMap[VertexId, 
> ParentDist]): HashMap[VertexId, ParentDist] = {
> m1.merged(m2) {
>   case ((k1, v1), (_, v2)) => (k1, v1.min(v2))
>   case n => throw new Exception("we've got a problem: " + n)
> }
>   }
>   val pregelRun = pregelg.pregel(initialMsg)(vprog, sendMsg, mergeMsg)
>   val sps = pregelRun.vertices.map(v => v._1 -> v._2._2)
>   sps
> }
>   }
> {code}
> Note the comment in the mergeMsg function: when the messages are explicitly 
> accessed prior to the .merged statement, the code works. If these side-effect 
> statements are removed / commented out, the error message in the comments is 
> generated.
> This fails consistently on a 50-node undirected cycle graph.
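
A minimal sketch of the workaround described in the comments above: traverse both maps before calling merged. This is illustrative only; it assumes plain immutable HashMaps and a value type combined with a min-like function.

{code}
import scala.collection.immutable.HashMap

def mergeMaps[K, V](m1: HashMap[K, V], m2: HashMap[K, V])(min: (V, V) => V): HashMap[K, V] = {
  // Side-effecting traversals that avoid the MatchError (see SI-9895 linked above).
  m1.foreach(_ => ())
  m2.foreach(_ => ())
  m1.merged(m2) { case ((k, v1), (_, v2)) => (k, min(v1, v2)) }
}
{code}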



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17073) generate basic stats for column

2016-12-18 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759787#comment-15759787
 ] 

Zhenhua Wang commented on SPARK-17073:
--

[~ioana-delaney] Thanks for sharing the information!

> generate basic stats for column
> ---
>
> Key: SPARK-17073
> URL: https://issues.apache.org/jira/browse/SPARK-17073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Zhenhua Wang
> Fix For: 2.1.0
>
>
> For a specified column, we need to generate basic stats including max, min, 
> number of nulls, number of distinct values, max column length, average column 
> length.
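
A usage sketch of the column-level statistics this task adds, assuming the ANALYZE TABLE ... FOR COLUMNS syntax available in Spark 2.1. The table and column names are hypothetical, and spark is an existing SparkSession.

{code}
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, quantity")
{code}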



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759635#comment-15759635
 ] 

Apache Spark commented on SPARK-18817:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16330

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.
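
A minimal sketch of the underlying idea, shown with the Scala API for brevity: point spark.sql.warehouse.dir at a temporary directory so that no "spark-warehouse" folder is created in the working directory. SparkR would need the equivalent default based on tempdir().

{code}
import org.apache.spark.sql.SparkSession

val tmpWarehouse = java.nio.file.Files.createTempDirectory("spark-warehouse").toUri.toString
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", tmpWarehouse)
  .getOrCreate()
{code}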



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18919) PrimitiveKeyOpenHashMap is boxing values

2016-12-18 Thread Jakub Liska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Liska closed SPARK-18919.
---
Resolution: Not A Problem

Ahh, my fault, the OpenHashSet is rehashing at 734004 entries and doubling the 
size of the array ... 
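
A hypothetical adjustment to the benchmark quoted below that illustrates this explanation: presize the map so the backing OpenHashSet stays under its ~0.7 load factor for 1M entries and never doubles its arrays. The factor of 2 is an arbitrary choice above size / 0.7.

{code}
// PrimitiveKeyOpenHashMap is private[spark]; like the benchmark itself, this
// only compiles from code placed in the org.apache.spark package.
import org.apache.spark.util.collection.PrimitiveKeyOpenHashMap

val size = 1 * 1000 * 1000
val map = new PrimitiveKeyOpenHashMap[Long, Long](2 * size)
{code}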

> PrimitiveKeyOpenHashMap is boxing values
> 
>
> Key: SPARK-18919
> URL: https://issues.apache.org/jira/browse/SPARK-18919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu 16.04, scala 2.11.7, 2.12.1, java 1.8.0_111
>Reporter: Jakub Liska
>Priority: Critical
>
> Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory 
> footprint and I noticed that the footprint is higher than it should be. If I 
> add 1M [Long,Long] entries to it, it takes:
> ~ 34 MB in total
> ~ 17 MB at OpenHashSet as keys
> ~ 17 MB at Array as values
> The Array size is strange though, because its initial size is 1 048 576 
> (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, since a Long value 
> has 8 bytes. Therefore I think that the values are getting boxed in this 
> collection.
> The consequence of this problem is that if you put more than 100M Long 
> entries into this map, the GC gets choked to death even with an unlimited 
> heap size ...
> The strange thing is that I get the same results when using @miniboxed 
> instead of @specialized.
> This is the scalameter code I used:
> {code}
> class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime {
>   override def measurer = new Executor.Measurer.MemoryFootprint
>   val sizes = Gen.single("size")(1*1000*1000)
>   performance of "MemoryFootprint" in {
> performance of "PrimitiveKeyOpenHashMap" in {
>   using(sizes) config (
> exec.benchRuns -> 1,
> exec.maxWarmupRuns -> 0,
> exec.independentSamples -> 1,
> exec.requireGC -> true,
> exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", 
> "-XX:+UseG1GC")
>   ) in { size =>
>   val map = new PrimitiveKeyOpenHashMap[Long, Long](size)
>   var index = 0L
>   while (index < size) {
> map(index) = 0L
> index+=1
>   }
>   println("Size " + SizeEstimator.estimate(map))
>   while (index != 0) {
> index-=1
> assert(map.contains(index))
>   }
> map
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18919) PrimitiveKeyOpenHashMap is boxing values

2016-12-18 Thread Jakub Liska (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Liska updated SPARK-18919:

Description: 
Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory footprint 
and I noticed that the footprint is higher than it should be. If I add 1M 
[Long,Long] entries to it, it takes:
~ 34 MB in total
~ 17 MB at OpenHashSet as keys
~ 17 MB at Array as values

The Array size is strange though, because its initial size is 1 048 576 
(_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, since a Long value has 
8 bytes. Therefore I think that the values are getting boxed in this collection.

The consequence of this problem is that if you put more than 100M Long entries 
into this map, the GC gets choked to death even with an unlimited heap size ...

The strange thing is that I get the same results when using @miniboxed instead 
of @specialized.

This is the scalameter code I used:
{code}
class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime {

  override def measurer = new Executor.Measurer.MemoryFootprint

  val sizes = Gen.single("size")(1*1000*1000)

  performance of "MemoryFootprint" in {
performance of "PrimitiveKeyOpenHashMap" in {
  using(sizes) config (
exec.benchRuns -> 1,
exec.maxWarmupRuns -> 0,
exec.independentSamples -> 1,
exec.requireGC -> true,
exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", 
"-XX:+UseG1GC")
  ) in { size =>
  val map = new PrimitiveKeyOpenHashMap[Long, Long](size)
  var index = 0L
  while (index < size) {
map(index) = 0L
index+=1
  }
  println("Size " + SizeEstimator.estimate(map))
  while (index != 0) {
index-=1
assert(map.contains(index))
  }
map
  }
}
  }
}
{code}

  was:
Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory footprint 
and I noticed, that the footprint is higher then it should, if I add 1M 
[Long,Long] entries to it, it has : 
~ 34 MB in total
~ 17 MB  at OpenHashSet as keys
~ 17 MB at Array as values

The Array size is strange though, because its initial size is 1 048 576 
(_keySet.capacity) so it should have ~ 8MB, not ~ 17MB because a Long value has 
8 bytes. Therefore I think that the values are getting boxed in this collection.

The consequence of this problem is that if you put more than 100M Long entries 
to this map, the GC gets choked to death...

This is the scalameter code I used :
{code}
class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime {

  override def measurer = new Executor.Measurer.MemoryFootprint

  val sizes = Gen.single("size")(1*1000*1000)

  performance of "MemoryFootprint" in {
performance of "PrimitiveKeyOpenHashMap" in {
  using(sizes) config (
exec.benchRuns -> 1,
exec.maxWarmupRuns -> 0,
exec.independentSamples -> 1,
exec.requireGC -> true,
exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", 
"-XX:+UseG1GC")
  ) in { size =>
  val map = new PrimitiveKeyOpenHashMap[Long, Long](size)
  var index = 0L
  while (index < size) {
map(index) = 0L
index+=1
  }
  println("Size " + SizeEstimator.estimate(map))
  while (index != 0) {
index-=1
assert(map.contains(index))
  }
map
  }
}
  }
}
{code}


> PrimitiveKeyOpenHashMap is boxing values
> 
>
> Key: SPARK-18919
> URL: https://issues.apache.org/jira/browse/SPARK-18919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu 16.04, scala 2.11.7, 2.12.1, java 1.8.0_111
>Reporter: Jakub Liska
>Priority: Critical
>
> Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory 
> footprint and I noticed that the footprint is higher than it should be. If I 
> add 1M [Long,Long] entries to it, it takes:
> ~ 34 MB in total
> ~ 17 MB at OpenHashSet as keys
> ~ 17 MB at Array as values
> The Array size is strange though, because its initial size is 1 048 576 
> (_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, since a Long value 
> has 8 bytes. Therefore I think that the values are getting boxed in this 
> collection.
> The consequence of this problem is that if you put more than 100M Long 
> entries into this map, the GC gets choked to death even with an unlimited 
> heap size ...
> The strange thing is that I get the same results when using @miniboxed 
> instead of @specialized.
> This is the scalameter code I used:
> {code}
> class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime {
>   override def measurer = new Executor.Measurer.MemoryFootprint
>   val sizes = Gen.single("size")(1*1000*1000)
>   performance of "Memor

[jira] [Created] (SPARK-18919) PrimitiveKeyOpenHashMap is boxing values

2016-12-18 Thread Jakub Liska (JIRA)
Jakub Liska created SPARK-18919:
---

 Summary: PrimitiveKeyOpenHashMap is boxing values
 Key: SPARK-18919
 URL: https://issues.apache.org/jira/browse/SPARK-18919
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
 Environment: ubuntu 16.04, scala 2.11.7, 2.12.1, java 1.8.0_111
Reporter: Jakub Liska
Priority: Critical


Hey, I was benchmarking PrimitiveKeyOpenHashMap for speed and memory footprint 
and I noticed that the footprint is higher than it should be. If I add 1M 
[Long,Long] entries to it, it takes:
~ 34 MB in total
~ 17 MB at OpenHashSet as keys
~ 17 MB at Array as values

The Array size is strange though, because its initial size is 1 048 576 
(_keySet.capacity), so it should take ~ 8MB, not ~ 17MB, since a Long value has 
8 bytes. Therefore I think that the values are getting boxed in this collection.

The consequence of this problem is that if you put more than 100M Long entries 
into this map, the GC gets choked to death...

This is the scalameter code I used:
{code}
class PrimitiveKeyOpenHashMapBench extends Bench.ForkedTime {

  override def measurer = new Executor.Measurer.MemoryFootprint

  val sizes = Gen.single("size")(1*1000*1000)

  performance of "MemoryFootprint" in {
performance of "PrimitiveKeyOpenHashMap" in {
  using(sizes) config (
exec.benchRuns -> 1,
exec.maxWarmupRuns -> 0,
exec.independentSamples -> 1,
exec.requireGC -> true,
exec.jvmflags -> List("-server", "-Xms1024m", "-Xmx6548m", 
"-XX:+UseG1GC")
  ) in { size =>
  val map = new PrimitiveKeyOpenHashMap[Long, Long](size)
  var index = 0L
  while (index < size) {
map(index) = 0L
index+=1
  }
  println("Size " + SizeEstimator.estimate(map))
  while (index != 0) {
index-=1
assert(map.contains(index))
  }
map
  }
}
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759368#comment-15759368
 ] 

Felix Cheung commented on SPARK-18817:
--

testing fix, will open a PR shortly.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16046) Add Spark SQL Dataset Tutorial

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758691#comment-15758691
 ] 

Apache Spark commented on SPARK-16046:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/16329

> Add Spark SQL Dataset Tutorial
> --
>
> Key: SPARK-16046
> URL: https://issues.apache.org/jira/browse/SPARK-16046
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Pedro Rodriguez
>
> Issue to update the Spark SQL guide to provide more content around using 
> Datasets. This would expand the Creating Datasets section of the Spark SQL 
> documentation.
> Goals
> 1. Add more examples of column access via $ and `
> 2. Add examples of aggregates
> 3. Add examples of using Spark SQL functions
> What else would be useful to have?
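
A sketch of the kind of examples the tutorial could include: column access via $ and a simple aggregate on a Dataset. The case class and column names are made up for illustration.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("ds-tutorial").getOrCreate()
import spark.implicits._

case class Sale(item: String, amount: Double)
val ds = Seq(Sale("a", 1.0), Sale("b", 2.5), Sale("a", 4.0)).toDS()

ds.filter($"amount" > 1.0).show()                            // column access via $
ds.groupBy($"item").agg(sum($"amount").as("total")).show()   // aggregate with a SQL function
{code}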



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16046) Add Spark SQL Dataset Tutorial

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16046:


Assignee: (was: Apache Spark)

> Add Spark SQL Dataset Tutorial
> --
>
> Key: SPARK-16046
> URL: https://issues.apache.org/jira/browse/SPARK-16046
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Pedro Rodriguez
>
> Issue to update the Spark SQL guide to provide more content around using 
> Datasets. This would expand the Creating Datasets section of the Spark SQL 
> documentation.
> Goals
> 1. Add more examples of column access via $ and `
> 2. Add examples of aggregates
> 3. Add examples of using Spark SQL functions
> What else would be useful to have?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16046) Add Spark SQL Dataset Tutorial

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16046:


Assignee: Apache Spark

> Add Spark SQL Dataset Tutorial
> --
>
> Key: SPARK-16046
> URL: https://issues.apache.org/jira/browse/SPARK-16046
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Pedro Rodriguez
>Assignee: Apache Spark
>
> Issue to update the Spark SQL guide to provide more content around using 
> Datasets. This would expand the Creating Datasets section of the Spark SQL 
> documentation.
> Goals
> 1. Add more examples of column access via $ and `
> 2. Add examples of aggregates
> 3. Add examples of using Spark SQL functions
> What else would be useful to have?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2016-12-18 Thread certman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758646#comment-15758646
 ] 

certman commented on SPARK-12216:
-

I can confirm this issue exists on Windows 7 running Spark 2.x.
Although this is a minor issue, it is still a bug and should be fixed. 
The fact that something doesn't work on Windows doesn't mean it is a Windows 
bug; it's not due to permissions. Proposing "don't use Windows" as a workaround 
is not helpful.

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18808:


Assignee: (was: Apache Spark)

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>
> The function ml.KMeansModel.transform calls the 
> parentModel.predict(features) method on each row, which in turn normalizes 
> all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm every time!
> This is a serious waste of resources! In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.
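
A rough sketch of the broadcast idea (not the actual patch). It assumes spark is a SparkSession, model is a fitted org.apache.spark.ml.clustering.KMeansModel, and dataset has a Vector column named "features".

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Ship the cluster centers once instead of re-deriving clusterCentersWithNorm per row.
val bcCenters = spark.sparkContext.broadcast(model.clusterCenters)
val nearestCenter = udf { v: Vector =>
  bcCenters.value.zipWithIndex.minBy { case (c, _) => Vectors.sqdist(c, v) }._2
}
val predicted = dataset.withColumn("prediction", nearestCenter(col("features")))
{code}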



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758577#comment-15758577
 ] 

Apache Spark commented on SPARK-18808:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16328

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>
> The function ml.KMeansModel.transform calls the 
> parentModel.predict(features) method on each row, which in turn normalizes 
> all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm every time!
> This is a serious waste of resources! In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18808:


Assignee: Apache Spark

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>Assignee: Apache Spark
>
> The function ml.KMeansModel.transform calls the 
> parentModel.predict(features) method on each row, which in turn normalizes 
> all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm every time!
> This is a serious waste of resources! In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18829) Printing to logger

2016-12-18 Thread David Hodeffi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758520#comment-15758520
 ] 

David Hodeffi edited comment on SPARK-18829 at 12/18/16 9:10 AM:
-

If so, I think you should add it to the documentation and examples, since no one 
would ever guess it. 
For now I have already helped the Spark project by asking and answering this 
question on stackoverflow.com with the help of you guys.


was (Author: davidho):
If so, I think you should add it to documentation and exapmpls, since no one 
ever guess it. 
For now I have already helped spark project by questioning and answering this 
question on stackoverflow.com

> Printing to logger
> --
>
> Key: SPARK-18829
> URL: https://issues.apache.org/jira/browse/SPARK-18829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: ALL
>Reporter: David Hodeffi
>Priority: Trivial
>  Labels: easyfix, patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I would like to print dataframe.show or df.explain(true) output to a log file.
> Right now the code prints to standard output without a way to redirect it, and
> this cannot be configured in log4j.properties.
> My suggestion is to write to both the logger and standard output, i.e.:
> class DataFrame { ...
>   override def explain(extended: Boolean): Unit = {
>     val explain = ExplainCommand(queryExecution.logical, extended = extended)
>     sqlContext.executePlan(explain).executedPlan.executeCollect().foreach {
>       // scalastyle:off println
>       r => {
>         println(r.getString(0))
>         logger.debug(r.getString(0))
>       }
>       // scalastyle:on println
>     }
>   }
>   def show(numRows: Int, truncate: Boolean): Unit = {
>     val str = showString(numRows, truncate)
>     println(str)
>     logger.debug(str)
>   }
> }
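
A workaround sketch that requires no Spark changes: capture what explain() prints to standard output and forward it to a logger. Here {{df}} and {{logger}} are assumed to already exist (a DataFrame and a log4j/slf4j logger respectively).

{code}
// Capture the console output of df.explain(true) and send it to a logger.
// explain() prints via Scala's println, which Console.withOut redirects.
import java.io.ByteArrayOutputStream

val buffer = new ByteArrayOutputStream()
Console.withOut(buffer) {
  df.explain(true)
}
logger.debug(buffer.toString("UTF-8"))
{code}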



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18827) Cann't read broadcast if broadcast blocks are stored on-disk

2016-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18827:
--
Assignee: Yuming Wang

> Cann't read broadcast if broadcast blocks are stored on-disk
> 
>
> Key: SPARK-18827
> URL: https://issues.apache.org/jira/browse/SPARK-18827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.0.3, 2.1.1
>
> Attachments: NoSuchElementException4722.gif
>
>
> How to reproduce it:
> {code:java}
>   test("Cache broadcast to disk") {
> val conf = new SparkConf()
>   .setAppName("Cache broadcast to disk")
>   .setMaster("local")
>   .set("spark.memory.useLegacyMode", "true")
>   .set("spark.storage.memoryFraction", "0.0")
> sc = new SparkContext(conf)
> val list = List[Int](1, 2, 3, 4)
> val broadcast = sc.broadcast(list)
> assert(broadcast.value.sum === 10)
>   }
> {code}
> A {{NoSuchElementException}} has been thrown since SPARK-17503 if a broadcast 
> cannot be cached in memory. The reason is that that change does not cover the 
> {{!unrolled.hasNext}} case in the {{next()}} function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18829) Printing to logger

2016-12-18 Thread David Hodeffi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758520#comment-15758520
 ] 

David Hodeffi commented on SPARK-18829:
---

If so, I think you should add it to the documentation and examples, since no one 
would ever guess it. 
For now I have already helped the Spark project by asking and answering this 
question on stackoverflow.com

> Printing to logger
> --
>
> Key: SPARK-18829
> URL: https://issues.apache.org/jira/browse/SPARK-18829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: ALL
>Reporter: David Hodeffi
>Priority: Trivial
>  Labels: easyfix, patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I would like to print dataframe.show or df.explain(true) output to a log file.
> Right now the code prints to standard output without a way to redirect it, and
> this cannot be configured in log4j.properties.
> My suggestion is to write to both the logger and standard output, i.e.:
> class DataFrame { ...
>   override def explain(extended: Boolean): Unit = {
>     val explain = ExplainCommand(queryExecution.logical, extended = extended)
>     sqlContext.executePlan(explain).executedPlan.executeCollect().foreach {
>       // scalastyle:off println
>       r => {
>         println(r.getString(0))
>         logger.debug(r.getString(0))
>       }
>       // scalastyle:on println
>     }
>   }
>   def show(numRows: Int, truncate: Boolean): Unit = {
>     val str = showString(numRows, truncate)
>     println(str)
>     logger.debug(str)
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18827) Cann't read broadcast if broadcast blocks are stored on-disk

2016-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18827.
---
   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1

Issue resolved by pull request 16252
[https://github.com/apache/spark/pull/16252]

> Cann't read broadcast if broadcast blocks are stored on-disk
> 
>
> Key: SPARK-18827
> URL: https://issues.apache.org/jira/browse/SPARK-18827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Yuming Wang
> Fix For: 2.1.1, 2.0.3
>
> Attachments: NoSuchElementException4722.gif
>
>
> How to reproduce it:
> {code:java}
>   test("Cache broadcast to disk") {
> val conf = new SparkConf()
>   .setAppName("Cache broadcast to disk")
>   .setMaster("local")
>   .set("spark.memory.useLegacyMode", "true")
>   .set("spark.storage.memoryFraction", "0.0")
> sc = new SparkContext(conf)
> val list = List[Int](1, 2, 3, 4)
> val broadcast = sc.broadcast(list)
> assert(broadcast.value.sum === 10)
>   }
> {code}
> A {{NoSuchElementException}} has been thrown since SPARK-17503 if a broadcast 
> cannot be cached in memory. The reason is that that change does not cover the 
> {{!unrolled.hasNext}} case in the {{next()}} function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18882) Spark UI , storage tab is always empty.

2016-12-18 Thread David Hodeffi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758513#comment-15758513
 ] 

David Hodeffi commented on SPARK-18882:
---

I generate 4 random tables using the range() function on DataFrame, Cartesian 
join all of them, and write the result to a Hive table via HiveContext (Spark 1.6.2). 
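
A rough reconstruction of that scenario (Spark 1.6 API). The table and column names are made up, and sqlContext is assumed to be a HiveContext so that saveAsTable works.

{code}
val tables = (1 to 4).map(i => sqlContext.range(0, 100).toDF(s"id$i"))
val joined = tables.reduce(_ join _)   // join with no condition => Cartesian product
joined.cache()
joined.count()                         // action that should populate the Storage tab
joined.write.saveAsTable("cartesian_test")
{code}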

> Spark UI , storage tab is always empty.
> ---
>
> Key: SPARK-18882
> URL: https://issues.apache.org/jira/browse/SPARK-18882
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
> Environment: Spark 1.6.2
>Reporter: David Hodeffi
>Priority: Minor
>
> I have HDP 2.5 installed with Spark 1.6.2 on YARN, deploy mode cluster.
> In my code I create a DataFrame, cache it, and then run an action on it. I 
> have tried to look for details about my DataFrame on the Storage tab of the 
> Spark UI, but I cannot see anything on this tab.
> My question is: what could be the reasons why my Storage tab is still empty? 
> Is my code wrong? Is the configuration of HDP 2.5 with Spark on YARN wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18918) Missing in Configuration page

2016-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18918.
---
   Resolution: Fixed
Fix Version/s: 2.1.1

Issue resolved by pull request 16327
[https://github.com/apache/spark/pull/16327]

> Missing  in Configuration page
> ---
>
> Key: SPARK-18918
> URL: https://issues.apache.org/jira/browse/SPARK-18918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.1.1
>
>
> The configuration page looks messy now, as shown in the nightly build:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18918) Missing in Configuration page

2016-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18918:
--
Target Version/s:   (was: 2.1.0)
Priority: Minor  (was: Blocker)

> Missing  in Configuration page
> ---
>
> Key: SPARK-18918
> URL: https://issues.apache.org/jira/browse/SPARK-18918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> The configuration page looks messy now, as shown in the nightly build:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18918) Missing in Configuration page

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18918:


Assignee: Apache Spark  (was: Xiao Li)

> Missing  in Configuration page
> ---
>
> Key: SPARK-18918
> URL: https://issues.apache.org/jira/browse/SPARK-18918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Blocker
>
> The configuration page looks messy now, as shown in the nightly build:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18918) Missing in Configuration page

2016-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18918:


Assignee: Xiao Li  (was: Apache Spark)

> Missing  in Configuration page
> ---
>
> Key: SPARK-18918
> URL: https://issues.apache.org/jira/browse/SPARK-18918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> The configuration page looks messy now, as shown in the nightly build:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18918) Missing in Configuration page

2016-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758459#comment-15758459
 ] 

Apache Spark commented on SPARK-18918:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16327

> Missing  in Configuration page
> ---
>
> Key: SPARK-18918
> URL: https://issues.apache.org/jira/browse/SPARK-18918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> The configuration page looks messy now, as shown in the nightly build:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18918) Missing in Configuration page

2016-12-18 Thread Xiao Li (JIRA)
Xiao Li created SPARK-18918:
---

 Summary: Missing  in Configuration page
 Key: SPARK-18918
 URL: https://issues.apache.org/jira/browse/SPARK-18918
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li
Priority: Blocker


The configuration page looks messy now, as shown in the nightly build:

https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18915) Return Nothing when Querying a Partitioned Data Source Table without Repairing it

2016-12-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18915:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Return Nothing when Querying a Partitioned Data Source Table without 
> Repairing it
> -
>
> Key: SPARK-18915
> URL: https://issues.apache.org/jira/browse/SPARK-18915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Priority: Critical
>
> In Spark 2.1, if we create a partitioned data source table given a specified 
> path, it returns nothing when we try to query it. To get the data, we have to 
> manually issue a DDL to repair the table. 
> In Spark 2.0, it can return the data stored in the specified path, without 
> repairing the table.
> Below is the output of Spark 2.1. 
> {noformat}
> scala> spark.range(5).selectExpr("id as fieldOne", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
> [Stage 0:==>(3 + 5) / 
> 8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
>   
>   
> scala> spark.sql("select * from test").show()
> ++---+
> |fieldOne|partCol|
> ++---+
> |   0|  0|
> |   1|  1|
> |   2|  2|
> |   3|  3|
> |   4|  4|
> ++---+
> scala> spark.sql("desc formatted test").show(50, false)
> ++--+---+
> |col_name|data_type   
>   |comment|
> ++--+---+
> |fieldOne|bigint  
>   |null   |
> |partCol |bigint  
>   |null   |
> |# Partition Information |
>   |   |
> |# col_name  |data_type   
>   |comment|
> |partCol |bigint  
>   |null   |
> ||
>   |   |
> |# Detailed Table Information|
>   |   |
> |Database:   |default 
>   |   |
> |Owner:  |xiaoli  
>   |   |
> |Create Time:|Sat Dec 17 17:46:24 PST 2016
>   |   |
> |Last Access Time:   |Wed Dec 31 16:00:00 PST 1969
>   |   |
> |Location:   
> |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test|  
>  |
> |Table Type: |MANAGED 
>   |   |
> |Table Parameters:   |
>   |   |
> |  transient_lastDdlTime |1482025584  
>   |   |
> ||
>   |   |
> |# Storage Information   |
>   |   |
> |SerDe Library:  
> |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |  
>  |
> |InputFormat:
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |  
>  |
> |OutputFormat:   
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|  
>  |
> |Compressed: |No  
>   |   |
> |Storage Desc Parameters:|
>   |   |
> |  serialization.format  |1   
>   |   |
> |Partition Provider: |Catalog 
>   |