[jira] [Updated] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25425: -- Affects Version/s: 2.3.0 > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
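For illustration, a minimal Scala sketch of the precedence the ticket asks for (the map names are made up and this is not the actual DataFrameReader/DataFrameWriter code): with Scala's Map ++, the right-hand operand wins on duplicate keys, so the extra options must be merged last.

{code:scala}
object OptionPrecedenceSketch {
  def main(args: Array[String]): Unit = {
    val sessionOptions = Map("path" -> "/tmp/session-default", "compression" -> "snappy")
    val extraOptions   = Map("path" -> "/tmp/explicit") // set via .option(...)

    // Wrong order: the session defaults overwrite the more specific extra options.
    val wrong = extraOptions ++ sessionOptions
    // Desired order: extra options take precedence over the session defaults.
    val right = sessionOptions ++ extraOptions

    println(wrong("path")) // /tmp/session-default
    println(right("path")) // /tmp/explicit
  }
}
{code}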
[jira] [Assigned] (SPARK-25427) Add BloomFilter creation test cases
[ https://issues.apache.org/jira/browse/SPARK-25427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25427: - Assignee: Dongjoon Hyun > Add BloomFilter creation test cases > --- > > Key: SPARK-25427 > URL: https://issues.apache.org/jira/browse/SPARK-25427 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Spark supports BloomFilter creation for ORC files. This issue aims to add > test coverages to prevent regressions like SPARK-12417 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
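As a rough illustration of the behavior the new tests need to cover (the paths and column names below are made up), Spark forwards ORC writer options such as orc.bloom.filter.columns to the ORC library when writing, which is the propagation that regressions like SPARK-12417 broke:

{code:scala}
import org.apache.spark.sql.SparkSession

object OrcBloomFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("orc-bloom").getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).map(i => (i, s"name_$i")).toDF("id", "name")

    df.write
      .option("orc.bloom.filter.columns", "id,name") // columns to build bloom filters for
      .option("orc.bloom.filter.fpp", "0.05")        // false-positive probability
      .mode("overwrite")
      .orc("/tmp/orc_bloom_example")                 // hypothetical output path

    spark.stop()
  }
}
{code}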
[jira] [Created] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
Dongjoon Hyun created SPARK-25438: - Summary: Fix FilterPushdownBenchmark to use the same memory assumption Key: SPARK-25438 URL: https://issues.apache.org/jira/browse/SPARK-25438 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Dongjoon Hyun This issue aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. *Memory buffer for writing* - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) *Compression chunk size* - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But it failed to match `orc.compression.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This issue recovers the ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`. 3. Generate the test result on AWS r3.xlarge. SPARK-24206 generated the result on AWS in order to make it easy to reproduce and compare. This issue also aims to update the result on the same machine again for the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
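For item 1 above, a minimal Scala sketch (not the benchmark's actual code; the helper name and directory layout are made up) of aligning the writer-side settings so ORC and Parquet run under the same memory assumption, using 1MB as `parquet.page.size` does by default:

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

object AlignedWriteSettings {
  private val oneMB = (1024 * 1024).toString

  def writeAligned(df: DataFrame, dir: String): Unit = {
    df.write
      .mode(SaveMode.Overwrite)
      .option("parquet.block.size", oneMB) // memory buffer for writing
      .parquet(s"$dir/parquet")

    df.write
      .mode(SaveMode.Overwrite)
      .option("orc.stripe.size", oneMB)    // memory buffer for writing
      .option("orc.compress.size", oneMB)  // compression chunk size
      .orc(s"$dir/orc")
  }
}
{code}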
[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features
[ https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616243#comment-16616243 ] Manu Zhang commented on SPARK-15041: Is there a plan to add such strategies as min/max ? > adding mode strategy for ml.feature.Imputer for categorical features > > > Key: SPARK-15041 > URL: https://issues.apache.org/jira/browse/SPARK-15041 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > Adding mode strategy for ml.feature.Imputer for categorical features. This > need to wait until PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 > From comments of jkbradley and Nick Pentreath in the PR > {quote} > Investigate efficiency of approaches using DataFrame/Dataset and/or approx > approaches such as frequentItems or Count-Min Sketch (will require an update > to CMS to return "heavy-hitters"). > investigate if we can use metadata to only allow mode for categorical > features (or perhaps as an easier alternative, allow mode for only Int/Long > columns) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
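A mode strategy would essentially need a per-column most-frequent-value computation; below is a straightforward, unoptimized Scala sketch using plain DataFrame operations, ignoring the approximate approaches mentioned in the quote and breaking ties arbitrarily:

{code:scala}
import org.apache.spark.sql.{DataFrame, functions => F}

object ModeSketch {
  // Returns the most frequent non-null value of `colName`; ties are broken arbitrarily.
  def columnMode(df: DataFrame, colName: String): Any = {
    df.na.drop(Seq(colName))
      .groupBy(colName)
      .agg(F.count(F.lit(1)).as("cnt"))
      .orderBy(F.desc("cnt"))
      .select(colName)
      .head()
      .get(0)
  }
}
{code}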
[jira] [Updated] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25438: -- Description: This issue aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. *Memory buffer for writing* - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) *Compression chunk size* - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compress.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This issue recovers the ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`. 3. Generate test result on AWS r3.xlarge. We do not SPARK-24206 generates the result on AWS in order to reproduce and compare easily. This issue also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. was: This issue aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. *Memory buffer for writing* - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) *Compression chunk size* - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compression.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This issue recovers the ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`. 3. Generate test result on AWS r3.xlarge. We do not SPARK-24206 generates the result on AWS in order to reproduce and compare easily. This issue also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. > Fix FilterPushdownBenchmark to use the same memory assumption > - > > Key: SPARK-25438 > URL: https://issues.apache.org/jira/browse/SPARK-25438 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to fix three things in `FilterPushdownBenchmark`. > 1. Use the same memory assumption. > The following configurations are used in ORC and Parquet. > *Memory buffer for writing* > - parquet.block.size (default: 128MB) > - orc.stripe.size (default: 64MB) > *Compression chunk size* > - parquet.page.size (default: 1MB) > - orc.compress.size (default: 256KB) > SPARK-24692 used 1MB, the default value of `parquet.page.size`, for > `parquet.block.size` and `orc.stripe.size`. But, it missed to match > `orc.compress.size`. So, the current benchmark shows the result from ORC with > 256KB memory for compression and Parquet with 1MB. To compare correctly, we > need to be consistent. > 2. 
Dictionary encoding should not be enforced for all cases. > SPARK-24206 enforced dictionary encoding for all test cases. This issue > recovers the ORC behavior in general and enforces dictionary encoding only > for `prepareStringDictTable`. > 3. Generate test result on AWS r3.xlarge. > We do not > SPARK-24206 generates the result on AWS in order to reproduce and compare > easily. This issue also aims to update the result on the same machine again > in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25438: - Assignee: Dongjoon Hyun > Fix FilterPushdownBenchmark to use the same memory assumption > - > > Key: SPARK-25438 > URL: https://issues.apache.org/jira/browse/SPARK-25438 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This issue aims to fix three things in `FilterPushdownBenchmark`. > 1. Use the same memory assumption. > The following configurations are used in ORC and Parquet. > *Memory buffer for writing* > - parquet.block.size (default: 128MB) > - orc.stripe.size (default: 64MB) > *Compression chunk size* > - parquet.page.size (default: 1MB) > - orc.compress.size (default: 256KB) > SPARK-24692 used 1MB, the default value of `parquet.page.size`, for > `parquet.block.size` and `orc.stripe.size`. But, it missed to match > `orc.compress.size`. So, the current benchmark shows the result from ORC with > 256KB memory for compression and Parquet with 1MB. To compare correctly, we > need to be consistent. > 2. Dictionary encoding should not be enforced for all cases. > SPARK-24206 enforced dictionary encoding for all test cases. This issue > recovers the ORC behavior in general and enforces dictionary encoding only > for `prepareStringDictTable`. > 3. Generate test result on AWS r3.xlarge. > We do not > SPARK-24206 generates the result on AWS in order to reproduce and compare > easily. This issue also aims to update the result on the same machine again > in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
Nicolas Poggi created SPARK-25439: - Summary: [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string Key: SPARK-25439 URL: https://issues.apache.org/jira/browse/SPARK-25439 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.3.1 Reporter: Nicolas Poggi The [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] currently has {{string}} for the {{customer.c_nationkey}} column, while it should be bigint according to [the spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] (identifier type). Note: this update would make previous TPCH results not comparable for queries using the {{customer}} table -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
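For reference, a hypothetical before/after of the affected field expressed with Spark SQL types (the suite may declare its schema differently; only the type change is the point):

{code:scala}
import org.apache.spark.sql.types.{LongType, StringType, StructField}

val before = StructField("c_nationkey", StringType) // current suite definition, incorrect per the spec
val after  = StructField("c_nationkey", LongType)   // bigint, consistent with the nation table's key
{code}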
[jira] [Updated] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Poggi updated SPARK-25439: -- Description: The [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] currently has {{string}} for the {{customer.c_nationkey}} column, while it should be bigint according to [the spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] (identifier type) and matching the {{nation}} table. Note: this update would make previousTPCH results not comparable for queries using the {{customer}} table was: The [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] currently has {{string}} for the {{customer.c_nationkey}} column, while it should be bigint according to [the spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] (identifier type). Note: this update would make previousTPCH results not comparable for queries using the {{customer}} table > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.1 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25440) Dump query execution info to a file
Maxim Gekk created SPARK-25440: -- Summary: Dump query execution info to a file Key: SPARK-25440 URL: https://issues.apache.org/jira/browse/SPARK-25440 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk The output of explain() doesn't contain full information and in some cases can be truncated. Besides that, it builds the info as a string in memory, which can cause OOM. The ticket aims to solve the problem by dumping info about query execution to a file. A new method needs to be added to queryExecution.debug which accepts a path to a file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
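A sketch of the kind of helper being proposed, built only on the existing public surface (Dataset.queryExecution.toString); note it still materializes the whole plan string in memory, unlike the streaming write the ticket aims for, and the eventual method on queryExecution.debug may look different:

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.DataFrame

object DumpPlan {
  def dumpToFile(df: DataFrame, path: String): Unit = {
    // Parsed, analyzed, optimized and physical plans in one string.
    val info = df.queryExecution.toString
    Files.write(Paths.get(path), info.getBytes(StandardCharsets.UTF_8))
  }
}
{code}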
[jira] [Commented] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616370#comment-16616370 ] Nicolas Poggi commented on SPARK-25439: --- Created the[ PR with the patch|[https://github.com/apache/spark/pull/22430].] > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.1 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616370#comment-16616370 ] Nicolas Poggi edited comment on SPARK-25439 at 9/15/18 4:10 PM: Created the [PR with the patch|https://github.com/apache/spark/pull/22430]. was (Author: npoggi): Created the[ PR with the patch|[https://github.com/apache/spark/pull/22430].] > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.1 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs
[ https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616378#comment-16616378 ] Nikunj Bansal commented on SPARK-25302: --- Patch available at PR [#22423|https://github.com/apache/spark/pull/22423] > ReducedWindowedDStream not using checkpoints for reduced RDDs > - > > Key: SPARK-25302 > URL: https://issues.apache.org/jira/browse/SPARK-25302 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > When using reduceByKeyAndWindow() using inverse reduce function, it > eventually creates a ReducedWindowedDStream. This class creates a > reducedDStream but only persists it and does not checkpoint it. The result is > that it ends up using cached RDDs and does not cut lineage to the input > DStream resulting in eventually caching the input RDDs for much longer than > they are needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
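For context, a usage sketch (host, port and durations are arbitrary) of the code path being described: reduceByKeyAndWindow with an inverse reduce function, which internally builds a ReducedWindowedDStream and requires a checkpoint directory:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowed-counts")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/streaming-checkpoint") // required for the inverse-reduce path

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, (a: Long, b: Long) => a - b, Seconds(30), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}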
[jira] [Commented] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted
[ https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616381#comment-16616381 ] Nikunj Bansal commented on SPARK-25303: --- Patch available at PR [#22424|https://github.com/apache/spark/pull/22424] > A DStream that is checkpointed should allow its parent(s) to be removed and > not persisted > - > > Key: SPARK-25303 > URL: https://issues.apache.org/jira/browse/SPARK-25303 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > A checkpointed DStream is supposed to cut the lineage to its parent(s) such > that any persisted RDDs for the parent(s) are removed. However, combined with > the issue in SPARK-25302, they result in the Input Stream RDDs being > persisted a lot longer than they are actually required. > See also related bug SPARK-25302. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25441) calculate term frequency in CountVectorizer()
Xinyong Tian created SPARK-25441: Summary: calculate term frequency in CountVectorizer() Key: SPARK-25441 URL: https://issues.apache.org/jira/browse/SPARK-25441 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.3.1 Reporter: Xinyong Tian Currently CountVectorizer() cannot output TF (term frequency). I hope there will be such an option. TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf Example: >>> df = spark.createDataFrame([(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", "raw"]) >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors") >>> model = cv.fit(df) >>> model.transform(df).limit(1).show(truncate=False) label raw vectors 0 [a, b, c] (3,[0,1,2],[1.0,1.0,1.0]) instead I want 0 [a, b, c] (3,[0,1,2],[0.33,0.33,0.33]) # i.e., each vector divided by its sum (here 3), so the sum of the new vector will be 1 for every row (document) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
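For comparison, a workaround sketch in Scala (not a built-in option): post-process the CountVectorizer counts into relative term frequencies by dividing each vector by its sum with a UDF.

{code:scala}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TermFrequencyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tf").getOrCreate()
    import spark.implicits._

    val df = Seq((0, Seq("a", "b", "c")), (1, Seq("a", "b", "b", "c", "a"))).toDF("label", "raw")
    val counted = new CountVectorizer().setInputCol("raw").setOutputCol("vectors").fit(df).transform(df)

    // Divide every count by the vector's sum so each row's TF values add up to 1.
    val toTf = udf { v: Vector =>
      val sv = v.toSparse
      val total = sv.values.sum
      if (total == 0.0) v else Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / total))
    }
    counted.withColumn("tf", toTf($"vectors")).show(truncate = false)
  }
}
{code}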
[jira] [Commented] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616489#comment-16616489 ] Veenit Shah commented on SPARK-25434: - Are you on Windows? I faced the same issue. This link helped me resolve it. [https://changhsinlee.com/install-pyspark-windows-jupyter/] > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). 
> Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. > >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25425: -- Affects Version/s: 2.4.0 > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Affects Version/s: 2.4.0 2.3.0 > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Summary: TPCHQuerySuite customer.c_nationkey should be bigint instead of string (was: [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string) > TPCHQuerySuite customer.c_nationkey should be bigint instead of string > -- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Issue Type: Bug (was: Improvement) > TPCHQuerySuite customer.c_nationkey should be bigint instead of string > -- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Component/s: SQL > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25426) Remove the duplicate fallback logic in UnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25426. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.5.0 > Remove the duplicate fallback logic in UnsafeProjection > --- > > Key: SPARK-25426 > URL: https://issues.apache.org/jira/browse/SPARK-25426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25436) Bump master branch version to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25436. - Resolution: Fixed Fix Version/s: 2.5.0 > Bump master branch version to 2.5.0-SNAPSHOT > > > Key: SPARK-25436 > URL: https://issues.apache.org/jira/browse/SPARK-25436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.5.0 > > > This patch bumps the master branch version to `2.5.0-SNAPSHOT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616545#comment-16616545 ] WEI PENG commented on SPARK-25434: -- Thank you, [~VeenitShah] , it works!! > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. 
> >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25425: - Assignee: Maxim Gekk > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.5.0 > > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25425. --- Resolution: Fixed Fix Version/s: 2.5.0 > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.5.0 > > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-25431: --- > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25431: -- Fix Version/s: (was: 2.4.0) > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616552#comment-16616552 ] Dongjoon Hyun commented on SPARK-25431: --- I reopened this since it's reverted now. We can resolve this back with the new commit. > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25438. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22427 [https://github.com/apache/spark/pull/22427] > Fix FilterPushdownBenchmark to use the same memory assumption > - > > Key: SPARK-25438 > URL: https://issues.apache.org/jira/browse/SPARK-25438 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > This issue aims to fix three things in `FilterPushdownBenchmark`. > 1. Use the same memory assumption. > The following configurations are used in ORC and Parquet. > *Memory buffer for writing* > - parquet.block.size (default: 128MB) > - orc.stripe.size (default: 64MB) > *Compression chunk size* > - parquet.page.size (default: 1MB) > - orc.compress.size (default: 256KB) > SPARK-24692 used 1MB, the default value of `parquet.page.size`, for > `parquet.block.size` and `orc.stripe.size`. But, it missed to match > `orc.compress.size`. So, the current benchmark shows the result from ORC with > 256KB memory for compression and Parquet with 1MB. To compare correctly, we > need to be consistent. > 2. Dictionary encoding should not be enforced for all cases. > SPARK-24206 enforced dictionary encoding for all test cases. This issue > recovers the ORC behavior in general and enforces dictionary encoding only > for `prepareStringDictTable`. > 3. Generate test result on AWS r3.xlarge. > We do not > SPARK-24206 generates the result on AWS in order to reproduce and compare > easily. This issue also aims to update the result on the same machine again > in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616554#comment-16616554 ] Dongjoon Hyun commented on SPARK-25425: --- This is resolved via https://github.com/apache/spark/pull/22413 at master branch. And, we are waiting for two PRs against branch-2.4 and 2.3. > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.5.0 > > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22017: Fix Version/s: (was: 2.4.0) 2.3.2 > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Priority: Major > Fix For: 2.3.0 > > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
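To make the proposed rule concrete, a toy Scala sketch (not the StreamExecution code, and the values are illustrative): with several watermark operators, e.g. after a union, the only safe query-wide watermark is the minimum of the per-operator watermarks, otherwise data that is still on time for the slower input could be treated as late.

{code:scala}
object GlobalWatermark {
  // Per-operator event-time watermarks in epoch milliseconds.
  def queryWideWatermark(operatorWatermarksMs: Seq[Long]): Long =
    if (operatorWatermarksMs.isEmpty) 0L else operatorWatermarksMs.min

  def main(args: Array[String]): Unit = {
    // One input has advanced further than the other: the smaller value is the safe global watermark.
    println(queryWideWatermark(Seq(1537005900000L, 1537005660000L))) // prints 1537005660000
  }
}
{code}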
[jira] [Updated] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22017: Fix Version/s: (was: 2.3.2) 2.3.0 > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Priority: Major > Fix For: 2.3.0 > > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
[ https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22018: Fix Version/s: (was: 2.4.0) 2.3.0 > Catalyst Optimizer does not preserve top-level metadata while collapsing > projects > - > > Key: SPARK-22018 > URL: https://issues.apache.org/jira/browse/SPARK-22018 > Project: Spark > Issue Type: Bug > Components: Optimizer, Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 2.3.0 > > > If there are two projects like as follows. > {code} > Project [a_with_metadata#27 AS b#26] > +- Project [a#0 AS a_with_metadata#27] >+- LocalRelation , [a#0, b#1] > {code} > Child Project has an output column with a metadata in it, and the parent > Project has an alias that implicitly forwards the metadata. So this metadata > is visible for higher operators. Upon applying CollapseProject optimizer > rule, the metadata is not preserved. > {code} > Project [a#0 AS b#26] > +- LocalRelation , [a#0, b#1] > {code} > This is incorrect, as downstream operators that expect certain metadata (e.g. > watermark in structured streaming) to identify certain fields will fail to do > so. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
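Roughly the user-level pattern behind the plans above, as a hedged Scala sketch (the column names and metadata key are made up): column a gets metadata under one alias and is then re-aliased; after the fix, the metadata should still be visible on the final column.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object MetadataThroughProjects {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("metadata").getOrCreate()
    import spark.implicits._

    val meta = new MetadataBuilder().putString("purpose", "watermark-column").build()
    val df = Seq((1, "x"), (2, "y")).toDF("a", "b")
      .select(col("a").as("a_with_metadata", meta))
      .select(col("a_with_metadata").as("b")) // CollapseProject merges these two projects

    println(df.schema("b").metadata) // should still contain the "purpose" entry
    spark.stop()
  }
}
{code}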
[jira] [Updated] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22956: Fix Version/s: (was: 2.4.0) > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.0 > > > When we union 2 streams from kafka or other sources, while one of them have > no continues data coming and in the same time task restart, this will cause > an `IllegalStateException`. This mainly cause because the code in > [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190] > , while one stream has no continues data, its comittedOffset same with > availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` > not properly handled in KafkaSource. Also, maybe we should also consider this > scenario in other Source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22238) EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed
[ https://issues.apache.org/jira/browse/SPARK-22238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22238: Fix Version/s: (was: 2.3.2) 2.3.0 > EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning > is completed > - > > Key: SPARK-22238 > URL: https://issues.apache.org/jira/browse/SPARK-22238 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan > has the expected partitioning for Streaming Stateful Operators. The problem > is that we are not allowed to access this information during planning. > The reason we added that check was because CoalesceExec could actually create > RDDs with 0 partitions. We should fix it such that when CoalesceExec says > that there is a SinglePartition, there is in fact an inputRDD of 1 partition > instead of 0 partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22238) EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed
[ https://issues.apache.org/jira/browse/SPARK-22238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22238: Fix Version/s: (was: 2.4.0) 2.3.2 > EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning > is completed > - > > Key: SPARK-22238 > URL: https://issues.apache.org/jira/browse/SPARK-22238 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan > has the expected partitioning for Streaming Stateful Operators. The problem > is that we are not allowed to access this information during planning. > The reason we added that check was because CoalesceExec could actually create > RDDs with 0 partitions. We should fix it such that when CoalesceExec says > that there is a SinglePartition, there is in fact an inputRDD of 1 partition > instead of 0 partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23503: --- Assignee: Efim Poberezkin > continuous execution should sequence committed epochs > - > > Key: SPARK-23503 > URL: https://issues.apache.org/jira/browse/SPARK-23503 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Efim Poberezkin >Priority: Major > Fix For: 2.4.0 > > > Currently, the EpochCoordinator doesn't enforce a commit order. If a message > for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for > commit earlier, epoch n + 1 will be committed. > > This is either incorrect or needlessly confusing, because it's not safe to > start from the end offset of epoch n + 1 until epoch n is committed. > EpochCoordinator should enforce this sequencing. > > Note that this is not actually a problem right now, because the commit > messages go through the same RPC channel from the same place. But we > shouldn't implicitly bake this assumption in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
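Not the actual EpochCoordinator code, but a small Scala sketch of the sequencing rule being asked for: an epoch is committed only after all earlier epochs have been committed, so out-of-order "ready" signals are buffered.

{code:scala}
import scala.collection.mutable

class EpochSequencer(startEpoch: Long, commit: Long => Unit) {
  private var nextToCommit = startEpoch
  private val ready = mutable.Set.empty[Long]

  // Called when an epoch is ready to commit, possibly out of order.
  def epochReady(epoch: Long): Unit = {
    ready += epoch
    // Drain in order; stop at the first epoch that is not ready yet.
    while (ready.contains(nextToCommit)) {
      commit(nextToCommit)
      ready -= nextToCommit
      nextToCommit += 1
    }
  }
}
{code}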
[jira] [Assigned] (SPARK-23748) Support select from temp tables
[ https://issues.apache.org/jira/browse/SPARK-23748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23748: --- Assignee: Saisai Shao > Support select from temp tables > --- > > Key: SPARK-23748 > URL: https://issues.apache.org/jira/browse/SPARK-23748 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > As reported in the dev list, the following currently fails: > > val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", > "localhost:9092").option("subscribe", "join_test").option("startingOffsets", > "earliest").load(); > jdf.createOrReplaceTempView("table") > > val resultdf = spark.sql("select * from table") > resultdf.writeStream.outputMode("append").format("console").option("truncate", > false).trigger(Trigger.Continuous("1 second")).start() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
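For readability, here is the reporter's reproduction from the description above, reassembled into a runnable form (same options as quoted; the Kafka broker address and topic name come from the report, so they are assumptions about the local setup):
{code:scala}
// Reassembled reproduction: create a streaming DataFrame, register it as a temp view,
// and select from that view with a continuous trigger.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TempTableContinuousRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("temp-table-repro").getOrCreate()

    val jdf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .load()

    jdf.createOrReplaceTempView("table")

    val resultdf = spark.sql("select * from table")
    resultdf.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .trigger(Trigger.Continuous("1 second"))
      .start()
      .awaitTermination()
  }
}
{code}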
[jira] [Resolved] (SPARK-25439) TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25439. - Resolution: Fixed Assignee: Nicolas Poggi Fix Version/s: 2.4.0 > TPCHQuerySuite customer.c_nationkey should be bigint instead of string > -- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Assignee: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Fix For: 2.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previous TPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
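A hedged sketch of the schema being discussed (this is not the TPCHQuerySuite code; the column list follows the TPC-H customer table definition and the storage format is illustrative). The point is simply that c_nationkey is an identifier and should be bigint, matching nation.n_nationkey:
{code:scala}
// Illustrative DDL for the TPC-H customer table with c_nationkey as bigint.
import org.apache.spark.sql.SparkSession

object CustomerSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tpch-customer-schema").getOrCreate()
    spark.sql(
      """
        |CREATE TABLE IF NOT EXISTS customer (
        |  c_custkey    bigint,
        |  c_name       string,
        |  c_address    string,
        |  c_nationkey  bigint,   -- previously string in the suite; bigint per the TPC-H spec
        |  c_phone      string,
        |  c_acctbal    double,
        |  c_mktsegment string,
        |  c_comment    string
        |) USING parquet
      """.stripMargin)
  }
}
{code}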
[jira] [Commented] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616578#comment-16616578 ] Dongjoon Hyun commented on SPARK-25434: --- Welcome to the Apache Spark community, [~LandSurveyorK]. BTW, JIRA is not for Q&A. Could you read http://spark.apache.org/community.html for that resource? We use JIRA only when it's a really bug. > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). 
> Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. > >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
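This is the usual missing-winutils situation on Windows rather than a Spark bug. As a hedged sketch of the common workaround (the path is illustrative, and for the PySpark shell the same effect is normally achieved by setting the HADOOP_HOME environment variable before launching): place winutils.exe under a Hadoop home directory and point hadoop.home.dir at that directory before any Hadoop classes are loaded.
{code:scala}
// Sketch of the common workaround (illustrative path; not an official fix).
import org.apache.spark.sql.SparkSession

object WinutilsWorkaround {
  def main(args: Array[String]): Unit = {
    // Assumes winutils.exe has been placed at C:\hadoop\bin\winutils.exe (path is illustrative).
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("winutils-workaround")
      .getOrCreate()

    spark.range(5).show()
    spark.stop()
  }
}
{code}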
[jira] [Resolved] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25434. --- Resolution: Not A Problem > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. 
> >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616578#comment-16616578 ] Dongjoon Hyun edited comment on SPARK-25434 at 9/16/18 3:44 AM: Welcome to the Apache Spark community, [~LandSurveyorK]. BTW, JIRA is not for Q&A. Could you read http://spark.apache.org/community.html for that resource? We use JIRA only when it's a really bug. I closed this issue since I assume that you got what you wanted here. was (Author: dongjoon): Welcome to the Apache Spark community, [~LandSurveyorK]. BTW, JIRA is not for Q&A. Could you read http://spark.apache.org/community.html for that resource? We use JIRA only when it's a really bug. > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. 
> scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. > >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24479: Labels: (was: feature) > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mingjie Tang >Assignee: Arun Mahadevan >Priority: Major > Fix For: 2.4.0 > > > Users currently have to register their own StreamingQueryListener with the > StreamingQueryManager programmatically; similar functionality is already > provided via EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS. > We propose to provide a STREAMING_QUERY_LISTENER conf so that users can > register their own listeners through configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
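For context, this is what users have to do today without such a conf: register the listener programmatically on the StreamingQueryManager. The sketch below uses the existing spark.streams.addListener API; the listener body is illustrative.
{code:scala}
// Registering a StreamingQueryListener programmatically (the manual route the conf would replace).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object RegisterListenerManually {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listener-demo").getOrCreate()

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"progress: ${event.progress.numInputRows} input rows")
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}")
    })
    // With the proposed conf, the listener class name would instead be supplied in the
    // session configuration and picked up automatically at session creation time.
  }
}
{code}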
[jira] [Created] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
Suryanarayana Garlapati created SPARK-25442: --- Summary: Support STS to run in K8S deployment with spark deployment mode as cluster Key: SPARK-25442 URL: https://issues.apache.org/jira/browse/SPARK-25442 Project: Spark Issue Type: Bug Components: Kubernetes, SQL Affects Versions: 2.4.0, 2.5.0 Reporter: Suryanarayana Garlapati STS (Spark Thrift Server) fails to start in Kubernetes deployments when the Spark deploy mode is cluster. Support should be added so that it can run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616601#comment-16616601 ] Suryanarayana Garlapati commented on SPARK-25442: - Following is the PR for the same: https://github.com/apache/spark/pull/22433 > Support STS to run in K8S deployment with spark deployment mode as cluster > -- > > Key: SPARK-25442 > URL: https://issues.apache.org/jira/browse/SPARK-25442 > Project: Spark > Issue Type: Bug > Components: Kubernetes, SQL >Affects Versions: 2.4.0, 2.5.0 >Reporter: Suryanarayana Garlapati >Priority: Major > > STS fails to start in kubernetes deployments with spark deploy mode as > cluster. Support should be added to make it run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao resolved SPARK-25391. -- Resolution: Won't Do > Make behaviors consistent when converting parquet hive table to parquet data > source > --- > > Key: SPARK-25391 > URL: https://issues.apache.org/jira/browse/SPARK-25391 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > parquet data source tables and hive parquet tables have different behaviors > about parquet field resolution. So, when > {{spark.sql.hive.convertMetastoreParquet}} is true, users might face > inconsistent behaviors. The differences are: > * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both > data source tables and hive tables do NOT respect > {{spark.sql.caseSensitive}}. However data source tables always do > case-sensitive parquet field resolution, while hive tables always do > case-insensitive parquet field resolution no matter whether > {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data > source tables respect {{spark.sql.caseSensitive}} while hive serde table > behavior is not changed. > * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, > data source tables do case-sensitive resolution and return columns with the > corresponding letter cases, while hive tables always return the first matched > column ignoring cases. SPARK-25132 let data source tables throw exception > when there is ambiguity while hive table behavior is not changed. > This ticket aims to make behaviors consistent when converting hive table to > data source table. > * The behavior must be consistent to do the conversion, so we skip the > conversion in case-sensitive mode because hive parquet table always do > case-insensitive field resolution. > * In case-insensitive mode, when converting hive parquet table to parquet > data source, we switch the duplicated fields resolution mode to ask parquet > data source to pick the first matched field - the same behavior as hive > parquet table - to keep behaviors consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
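A small, hedged sketch of the two settings whose interaction is described above (the table name is hypothetical; only the configuration keys are Spark's):
{code:scala}
// Illustrative only: the two configs that determine whether a parquet Hive table is
// converted to the built-in parquet data source and how field names are resolved.
import org.apache.spark.sql.SparkSession

object ConversionSettingsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convert-metastore-parquet")
      .enableHiveSupport()
      .getOrCreate()

    // false => case-insensitive field resolution; true => case-sensitive resolution.
    spark.conf.set("spark.sql.caseSensitive", "false")

    // true (the default) => parquet Hive serde tables are read through the built-in
    // parquet data source instead of the Hive serde path.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

    // Reading a parquet Hive table here goes through the data source path, which is
    // where the field-resolution differences described above show up.
    spark.sql("SELECT * FROM some_parquet_hive_table").show()  // hypothetical table name
  }
}
{code}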