[jira] [Updated] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25425: -- Affects Version/s: 2.3.0 > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
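For illustration, a minimal Scala sketch of the precedence the ticket asks for (the map names are made up and this is not the actual DataFrameReader/DataFrameWriter code): with Scala's Map ++, the right-hand operand wins on duplicate keys, so the extra options must be merged last.

{code:scala}
object OptionPrecedenceSketch {
  def main(args: Array[String]): Unit = {
    val sessionOptions = Map("path" -> "/tmp/session-default", "compression" -> "snappy")
    val extraOptions   = Map("path" -> "/tmp/explicit") // set via .option(...)

    // Wrong order: the session defaults overwrite the more specific extra options.
    val wrong = extraOptions ++ sessionOptions
    // Desired order: extra options take precedence over the session defaults.
    val right = sessionOptions ++ extraOptions

    println(wrong("path")) // /tmp/session-default
    println(right("path")) // /tmp/explicit
  }
}
{code}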
[jira] [Assigned] (SPARK-25427) Add BloomFilter creation test cases
[ https://issues.apache.org/jira/browse/SPARK-25427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25427: - Assignee: Dongjoon Hyun > Add BloomFilter creation test cases > --- > > Key: SPARK-25427 > URL: https://issues.apache.org/jira/browse/SPARK-25427 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Spark supports BloomFilter creation for ORC files. This issue aims to add > test coverages to prevent regressions like SPARK-12417 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
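As a rough illustration of the behavior the new tests need to cover (the paths and column names below are made up), Spark forwards ORC writer options such as orc.bloom.filter.columns to the ORC library when writing, which is the propagation that regressions like SPARK-12417 broke:

{code:scala}
import org.apache.spark.sql.SparkSession

object OrcBloomFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("orc-bloom").getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).map(i => (i, s"name_$i")).toDF("id", "name")

    df.write
      .option("orc.bloom.filter.columns", "id,name") // columns to build bloom filters for
      .option("orc.bloom.filter.fpp", "0.05")        // false-positive probability
      .mode("overwrite")
      .orc("/tmp/orc_bloom_example")                 // hypothetical output path

    spark.stop()
  }
}
{code}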
[jira] [Created] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
Dongjoon Hyun created SPARK-25438: - Summary: Fix FilterPushdownBenchmark to use the same memory assumption Key: SPARK-25438 URL: https://issues.apache.org/jira/browse/SPARK-25438 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Dongjoon Hyun This issue aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. *Memory buffer for writing* - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) *Compression chunk size* - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But it failed to match `orc.compression.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This issue recovers the ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`. 3. Generate the test result on AWS r3.xlarge. SPARK-24206 generated the result on AWS in order to make it easy to reproduce and compare. This issue also aims to update the result on the same machine again for the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
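For item 1 above, a minimal Scala sketch (not the benchmark's actual code; the helper name and directory layout are made up) of aligning the writer-side settings so ORC and Parquet run under the same memory assumption, using 1MB as `parquet.page.size` does by default:

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

object AlignedWriteSettings {
  private val oneMB = (1024 * 1024).toString

  def writeAligned(df: DataFrame, dir: String): Unit = {
    df.write
      .mode(SaveMode.Overwrite)
      .option("parquet.block.size", oneMB) // memory buffer for writing
      .parquet(s"$dir/parquet")

    df.write
      .mode(SaveMode.Overwrite)
      .option("orc.stripe.size", oneMB)    // memory buffer for writing
      .option("orc.compress.size", oneMB)  // compression chunk size
      .orc(s"$dir/orc")
  }
}
{code}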
[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features
[ https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616243#comment-16616243 ] Manu Zhang commented on SPARK-15041: Is there a plan to add such strategies as min/max ? > adding mode strategy for ml.feature.Imputer for categorical features > > > Key: SPARK-15041 > URL: https://issues.apache.org/jira/browse/SPARK-15041 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > Adding mode strategy for ml.feature.Imputer for categorical features. This > need to wait until PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 > From comments of jkbradley and Nick Pentreath in the PR > {quote} > Investigate efficiency of approaches using DataFrame/Dataset and/or approx > approaches such as frequentItems or Count-Min Sketch (will require an update > to CMS to return "heavy-hitters"). > investigate if we can use metadata to only allow mode for categorical > features (or perhaps as an easier alternative, allow mode for only Int/Long > columns) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
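A mode strategy would essentially need a per-column most-frequent-value computation; below is a straightforward, unoptimized Scala sketch using plain DataFrame operations, ignoring the approximate approaches mentioned in the quote and breaking ties arbitrarily:

{code:scala}
import org.apache.spark.sql.{DataFrame, functions => F}

object ModeSketch {
  // Returns the most frequent non-null value of `colName`; ties are broken arbitrarily.
  def columnMode(df: DataFrame, colName: String): Any = {
    df.na.drop(Seq(colName))
      .groupBy(colName)
      .agg(F.count(F.lit(1)).as("cnt"))
      .orderBy(F.desc("cnt"))
      .select(colName)
      .head()
      .get(0)
  }
}
{code}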
[jira] [Updated] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25438: -- Description: This issue aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. *Memory buffer for writing* - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) *Compression chunk size* - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compress.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This issue recovers the ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`. 3. Generate test result on AWS r3.xlarge. We do not SPARK-24206 generates the result on AWS in order to reproduce and compare easily. This issue also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. was: This issue aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. *Memory buffer for writing* - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) *Compression chunk size* - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compression.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This issue recovers the ORC behavior in general and enforces dictionary encoding only for `prepareStringDictTable`. 3. Generate test result on AWS r3.xlarge. We do not SPARK-24206 generates the result on AWS in order to reproduce and compare easily. This issue also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. > Fix FilterPushdownBenchmark to use the same memory assumption > - > > Key: SPARK-25438 > URL: https://issues.apache.org/jira/browse/SPARK-25438 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to fix three things in `FilterPushdownBenchmark`. > 1. Use the same memory assumption. > The following configurations are used in ORC and Parquet. > *Memory buffer for writing* > - parquet.block.size (default: 128MB) > - orc.stripe.size (default: 64MB) > *Compression chunk size* > - parquet.page.size (default: 1MB) > - orc.compress.size (default: 256KB) > SPARK-24692 used 1MB, the default value of `parquet.page.size`, for > `parquet.block.size` and `orc.stripe.size`. But, it missed to match > `orc.compress.size`. So, the current benchmark shows the result from ORC with > 256KB memory for compression and Parquet with 1MB. To compare correctly, we > need to be consistent. > 2. 
Dictionary encoding should not be enforced for all cases. > SPARK-24206 enforced dictionary encoding for all test cases. This issue > recovers the ORC behavior in general and enforces dictionary encoding only > for `prepareStringDictTable`. > 3. Generate test result on AWS r3.xlarge. > We do not > SPARK-24206 generates the result on AWS in order to reproduce and compare > easily. This issue also aims to update the result on the same machine again > in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25438: - Assignee: Dongjoon Hyun > Fix FilterPushdownBenchmark to use the same memory assumption > - > > Key: SPARK-25438 > URL: https://issues.apache.org/jira/browse/SPARK-25438 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This issue aims to fix three things in `FilterPushdownBenchmark`. > 1. Use the same memory assumption. > The following configurations are used in ORC and Parquet. > *Memory buffer for writing* > - parquet.block.size (default: 128MB) > - orc.stripe.size (default: 64MB) > *Compression chunk size* > - parquet.page.size (default: 1MB) > - orc.compress.size (default: 256KB) > SPARK-24692 used 1MB, the default value of `parquet.page.size`, for > `parquet.block.size` and `orc.stripe.size`. But, it missed to match > `orc.compress.size`. So, the current benchmark shows the result from ORC with > 256KB memory for compression and Parquet with 1MB. To compare correctly, we > need to be consistent. > 2. Dictionary encoding should not be enforced for all cases. > SPARK-24206 enforced dictionary encoding for all test cases. This issue > recovers the ORC behavior in general and enforces dictionary encoding only > for `prepareStringDictTable`. > 3. Generate test result on AWS r3.xlarge. > We do not > SPARK-24206 generates the result on AWS in order to reproduce and compare > easily. This issue also aims to update the result on the same machine again > in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
Nicolas Poggi created SPARK-25439: - Summary: [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string Key: SPARK-25439 URL: https://issues.apache.org/jira/browse/SPARK-25439 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.3.1 Reporter: Nicolas Poggi The [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] currently has {{string}} for the {{customer.c_nationkey}} column, while it should be bigint according to [the spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] (identifier type). Note: this update would make previous TPCH results not comparable for queries using the {{customer}} table -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
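For reference, a hypothetical before/after of the affected field expressed with Spark SQL types (the suite may declare its schema differently; only the type change is the point):

{code:scala}
import org.apache.spark.sql.types.{LongType, StringType, StructField}

val before = StructField("c_nationkey", StringType) // current suite definition, incorrect per the spec
val after  = StructField("c_nationkey", LongType)   // bigint, consistent with the nation table's key
{code}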
[jira] [Updated] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Poggi updated SPARK-25439: -- Description: The [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] currently has {{string}} for the {{customer.c_nationkey}} column, while it should be bigint according to [the spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] (identifier type) and matching the {{nation}} table. Note: this update would make previousTPCH results not comparable for queries using the {{customer}} table was: The [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] currently has {{string}} for the {{customer.c_nationkey}} column, while it should be bigint according to [the spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] (identifier type). Note: this update would make previousTPCH results not comparable for queries using the {{customer}} table > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.1 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25440) Dump query execution info to a file
Maxim Gekk created SPARK-25440: -- Summary: Dump query execution info to a file Key: SPARK-25440 URL: https://issues.apache.org/jira/browse/SPARK-25440 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk The output of explain() doesn't contain full information and in some cases can be truncated. Besides that, it builds the info as a string in memory, which can cause OOM. The ticket aims to solve the problem by dumping info about query execution to a file. A new method needs to be added to queryExecution.debug which accepts a path to a file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
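A sketch of the kind of helper being proposed, built only on the existing public surface (Dataset.queryExecution.toString); note it still materializes the whole plan string in memory, unlike the streaming write the ticket aims for, and the eventual method on queryExecution.debug may look different:

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.DataFrame

object DumpPlan {
  def dumpToFile(df: DataFrame, path: String): Unit = {
    // Parsed, analyzed, optimized and physical plans in one string.
    val info = df.queryExecution.toString
    Files.write(Paths.get(path), info.getBytes(StandardCharsets.UTF_8))
  }
}
{code}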
[jira] [Commented] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616370#comment-16616370 ] Nicolas Poggi commented on SPARK-25439: --- Created the[ PR with the patch|[https://github.com/apache/spark/pull/22430].] > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.1 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616370#comment-16616370 ] Nicolas Poggi edited comment on SPARK-25439 at 9/15/18 4:10 PM: Created the [PR with the patch|https://github.com/apache/spark/pull/22430]. was (Author: npoggi): Created the[ PR with the patch|[https://github.com/apache/spark/pull/22430].] > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.1 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs
[ https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616378#comment-16616378 ] Nikunj Bansal commented on SPARK-25302: --- Patch available at PR [#22423|https://github.com/apache/spark/pull/22423] > ReducedWindowedDStream not using checkpoints for reduced RDDs > - > > Key: SPARK-25302 > URL: https://issues.apache.org/jira/browse/SPARK-25302 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > When using reduceByKeyAndWindow() using inverse reduce function, it > eventually creates a ReducedWindowedDStream. This class creates a > reducedDStream but only persists it and does not checkpoint it. The result is > that it ends up using cached RDDs and does not cut lineage to the input > DStream resulting in eventually caching the input RDDs for much longer than > they are needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
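For context, a usage sketch (host, port and durations are arbitrary) of the code path being described: reduceByKeyAndWindow with an inverse reduce function, which internally builds a ReducedWindowedDStream and requires a checkpoint directory:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowed-counts")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/streaming-checkpoint") // required for the inverse-reduce path

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, (a: Long, b: Long) => a - b, Seconds(30), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}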
[jira] [Commented] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted
[ https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616381#comment-16616381 ] Nikunj Bansal commented on SPARK-25303: --- Patch available at PR [#22424|https://github.com/apache/spark/pull/22424] > A DStream that is checkpointed should allow its parent(s) to be removed and > not persisted > - > > Key: SPARK-25303 > URL: https://issues.apache.org/jira/browse/SPARK-25303 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > A checkpointed DStream is supposed to cut the lineage to its parent(s) such > that any persisted RDDs for the parent(s) are removed. However, combined with > the issue in SPARK-25302, they result in the Input Stream RDDs being > persisted a lot longer than they are actually required. > See also related bug SPARK-25302. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25441) calculate term frequency in CountVectorizer()
Xinyong Tian created SPARK-25441: Summary: calculate term frequency in CountVectorizer() Key: SPARK-25441 URL: https://issues.apache.org/jira/browse/SPARK-25441 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.3.1 Reporter: Xinyong Tian Currently CountVectorizer() cannot output TF (term frequency). I hope there will be such an option. TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf Example: >>> df = spark.createDataFrame([(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", "raw"]) >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors") >>> model = cv.fit(df) >>> model.transform(df).limit(1).show(truncate=False) label raw vectors 0 [a, b, c] (3,[0,1,2],[1.0,1.0,1.0]) instead I want 0 [a, b, c] (3,[0,1,2],[0.33,0.33,0.33]) # i.e., each vector divided by its sum (here 3), so the sum of the new vector will be 1 for every row (document) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
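For comparison, a workaround sketch in Scala (not a built-in option): post-process the CountVectorizer counts into relative term frequencies by dividing each vector by its sum with a UDF.

{code:scala}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TermFrequencyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tf").getOrCreate()
    import spark.implicits._

    val df = Seq((0, Seq("a", "b", "c")), (1, Seq("a", "b", "b", "c", "a"))).toDF("label", "raw")
    val counted = new CountVectorizer().setInputCol("raw").setOutputCol("vectors").fit(df).transform(df)

    // Divide every count by the vector's sum so each row's TF values add up to 1.
    val toTf = udf { v: Vector =>
      val sv = v.toSparse
      val total = sv.values.sum
      if (total == 0.0) v else Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / total))
    }
    counted.withColumn("tf", toTf($"vectors")).show(truncate = false)
  }
}
{code}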
[jira] [Commented] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616489#comment-16616489 ] Veenit Shah commented on SPARK-25434: - Are you on Windows? I faced the same issue. This link helped me resolve it. [https://changhsinlee.com/install-pyspark-windows-jupyter/] > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). 
> Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. > >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25425: -- Affects Version/s: 2.4.0 > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Affects Version/s: 2.4.0 2.3.0 > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Summary: TPCHQuerySuite customer.c_nationkey should be bigint instead of string (was: [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string) > TPCHQuerySuite customer.c_nationkey should be bigint instead of string > -- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Issue Type: Bug (was: Improvement) > TPCHQuerySuite customer.c_nationkey should be bigint instead of string > -- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25439) [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25439: -- Component/s: SQL > [TESTS][SQL] TPCHQuerySuite customer.c_nationkey should be bigint instead of > string > --- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previousTPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25426) Remove the duplicate fallback logic in UnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25426. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.5.0 > Remove the duplicate fallback logic in UnsafeProjection > --- > > Key: SPARK-25426 > URL: https://issues.apache.org/jira/browse/SPARK-25426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25436) Bump master branch version to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25436. - Resolution: Fixed Fix Version/s: 2.5.0 > Bump master branch version to 2.5.0-SNAPSHOT > > > Key: SPARK-25436 > URL: https://issues.apache.org/jira/browse/SPARK-25436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.5.0 > > > This patch bumps the master branch version to `2.5.0-SNAPSHOT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616545#comment-16616545 ] WEI PENG commented on SPARK-25434: -- Thank you, [~VeenitShah] , it works!! > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. 
> >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25425: - Assignee: Maxim Gekk > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.5.0 > > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25425. --- Resolution: Fixed Fix Version/s: 2.5.0 > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.5.0 > > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-25431: --- > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25431: -- Fix Version/s: (was: 2.4.0) > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616552#comment-16616552 ] Dongjoon Hyun commented on SPARK-25431: --- I reopened this since it's reverted now. We can resolve this back with the new commit. > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25438) Fix FilterPushdownBenchmark to use the same memory assumption
[ https://issues.apache.org/jira/browse/SPARK-25438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25438. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22427 [https://github.com/apache/spark/pull/22427] > Fix FilterPushdownBenchmark to use the same memory assumption > - > > Key: SPARK-25438 > URL: https://issues.apache.org/jira/browse/SPARK-25438 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > This issue aims to fix three things in `FilterPushdownBenchmark`. > 1. Use the same memory assumption. > The following configurations are used in ORC and Parquet. > *Memory buffer for writing* > - parquet.block.size (default: 128MB) > - orc.stripe.size (default: 64MB) > *Compression chunk size* > - parquet.page.size (default: 1MB) > - orc.compress.size (default: 256KB) > SPARK-24692 used 1MB, the default value of `parquet.page.size`, for > `parquet.block.size` and `orc.stripe.size`. But, it missed to match > `orc.compress.size`. So, the current benchmark shows the result from ORC with > 256KB memory for compression and Parquet with 1MB. To compare correctly, we > need to be consistent. > 2. Dictionary encoding should not be enforced for all cases. > SPARK-24206 enforced dictionary encoding for all test cases. This issue > recovers the ORC behavior in general and enforces dictionary encoding only > for `prepareStringDictTable`. > 3. Generate test result on AWS r3.xlarge. > We do not > SPARK-24206 generates the result on AWS in order to reproduce and compare > easily. This issue also aims to update the result on the same machine again > in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25425) Extra options must overwrite sessions options
[ https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616554#comment-16616554 ] Dongjoon Hyun commented on SPARK-25425: --- This is resolved via https://github.com/apache/spark/pull/22413 at master branch. And, we are waiting for two PRs against branch-2.4 and 2.3. > Extra options must overwrite sessions options > - > > Key: SPARK-25425 > URL: https://issues.apache.org/jira/browse/SPARK-25425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.5.0 > > > In load() and save() methods of DataSource V2, extra options are overwritten > by session options: > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 > * > https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 > but implementation must be opposite - more specific extra options set via > *.option(...)* must overwrite more common session options -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22017: Fix Version/s: (was: 2.4.0) 2.3.2 > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Priority: Major > Fix For: 2.3.0 > > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
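To make the proposed rule concrete, a toy Scala sketch (not the StreamExecution code, and the values are illustrative): with several watermark operators, e.g. after a union, the only safe query-wide watermark is the minimum of the per-operator watermarks, otherwise data that is still on time for the slower input could be treated as late.

{code:scala}
object GlobalWatermark {
  // Per-operator event-time watermarks in epoch milliseconds.
  def queryWideWatermark(operatorWatermarksMs: Seq[Long]): Long =
    if (operatorWatermarksMs.isEmpty) 0L else operatorWatermarksMs.min

  def main(args: Array[String]): Unit = {
    // One input has advanced further than the other: the smaller value is the safe global watermark.
    println(queryWideWatermark(Seq(1537005900000L, 1537005660000L))) // prints 1537005660000
  }
}
{code}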
[jira] [Updated] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22017: Fix Version/s: (was: 2.3.2) 2.3.0 > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres >Priority: Major > Fix For: 2.3.0 > > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects
[ https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22018: Fix Version/s: (was: 2.4.0) 2.3.0 > Catalyst Optimizer does not preserve top-level metadata while collapsing > projects > - > > Key: SPARK-22018 > URL: https://issues.apache.org/jira/browse/SPARK-22018 > Project: Spark > Issue Type: Bug > Components: Optimizer, Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 2.3.0 > > > If there are two projects like as follows. > {code} > Project [a_with_metadata#27 AS b#26] > +- Project [a#0 AS a_with_metadata#27] >+- LocalRelation , [a#0, b#1] > {code} > Child Project has an output column with a metadata in it, and the parent > Project has an alias that implicitly forwards the metadata. So this metadata > is visible for higher operators. Upon applying CollapseProject optimizer > rule, the metadata is not preserved. > {code} > Project [a#0 AS b#26] > +- LocalRelation , [a#0, b#1] > {code} > This is incorrect, as downstream operators that expect certain metadata (e.g. > watermark in structured streaming) to identify certain fields will fail to do > so. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
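Roughly the user-level pattern behind the plans above, as a hedged Scala sketch (the column names and metadata key are made up): column a gets metadata under one alias and is then re-aliased; after the fix, the metadata should still be visible on the final column.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object MetadataThroughProjects {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("metadata").getOrCreate()
    import spark.implicits._

    val meta = new MetadataBuilder().putString("purpose", "watermark-column").build()
    val df = Seq((1, "x"), (2, "y")).toDF("a", "b")
      .select(col("a").as("a_with_metadata", meta))
      .select(col("a_with_metadata").as("b")) // CollapseProject merges these two projects

    println(df.schema("b").metadata) // should still contain the "purpose" entry
    spark.stop()
  }
}
{code}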
[jira] [Updated] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22956: Fix Version/s: (was: 2.4.0) > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.0 > > > When we union 2 streams from kafka or other sources, while one of them have > no continues data coming and in the same time task restart, this will cause > an `IllegalStateException`. This mainly cause because the code in > [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190] > , while one stream has no continues data, its comittedOffset same with > availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` > not properly handled in KafkaSource. Also, maybe we should also consider this > scenario in other Source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22238) EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed
[ https://issues.apache.org/jira/browse/SPARK-22238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22238: Fix Version/s: (was: 2.3.2) 2.3.0 > EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning > is completed > - > > Key: SPARK-22238 > URL: https://issues.apache.org/jira/browse/SPARK-22238 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan > has the expected partitioning for Streaming Stateful Operators. The problem > is that we are not allowed to access this information during planning. > The reason we added that check was because CoalesceExec could actually create > RDDs with 0 partitions. We should fix it such that when CoalesceExec says > that there is a SinglePartition, there is in fact an inputRDD of 1 partition > instead of 0 partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22238) EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed
[ https://issues.apache.org/jira/browse/SPARK-22238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22238: Fix Version/s: (was: 2.4.0) 2.3.2 > EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning > is completed > - > > Key: SPARK-22238 > URL: https://issues.apache.org/jira/browse/SPARK-22238 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan > has the expected partitioning for Streaming Stateful Operators. The problem > is that we are not allowed to access this information during planning. > The reason we added that check was because CoalesceExec could actually create > RDDs with 0 partitions. We should fix it such that when CoalesceExec says > that there is a SinglePartition, there is in fact an inputRDD of 1 partition > instead of 0 partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23503: --- Assignee: Efim Poberezkin > continuous execution should sequence committed epochs > - > > Key: SPARK-23503 > URL: https://issues.apache.org/jira/browse/SPARK-23503 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Efim Poberezkin >Priority: Major > Fix For: 2.4.0 > > > Currently, the EpochCoordinator doesn't enforce a commit order. If a message > for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for > commit earlier, epoch n + 1 will be committed. > > This is either incorrect or needlessly confusing, because it's not safe to > start from the end offset of epoch n + 1 until epoch n is committed. > EpochCoordinator should enforce this sequencing. > > Note that this is not actually a problem right now, because the commit > messages go through the same RPC channel from the same place. But we > shouldn't implicitly bake this assumption in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
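Not the actual EpochCoordinator code, but a small Scala sketch of the sequencing rule being asked for: an epoch is committed only after all earlier epochs have been committed, so out-of-order "ready" signals are buffered.

{code:scala}
import scala.collection.mutable

class EpochSequencer(startEpoch: Long, commit: Long => Unit) {
  private var nextToCommit = startEpoch
  private val ready = mutable.Set.empty[Long]

  // Called when an epoch is ready to commit, possibly out of order.
  def epochReady(epoch: Long): Unit = {
    ready += epoch
    // Drain in order; stop at the first epoch that is not ready yet.
    while (ready.contains(nextToCommit)) {
      commit(nextToCommit)
      ready -= nextToCommit
      nextToCommit += 1
    }
  }
}
{code}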
[jira] [Assigned] (SPARK-23748) Support select from temp tables
[ https://issues.apache.org/jira/browse/SPARK-23748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23748: --- Assignee: Saisai Shao > Support select from temp tables > --- > > Key: SPARK-23748 > URL: https://issues.apache.org/jira/browse/SPARK-23748 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > As reported in the dev list, the following currently fails: > > val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", > "localhost:9092").option("subscribe", "join_test").option("startingOffsets", > "earliest").load(); > jdf.createOrReplaceTempView("table") > > val resultdf = spark.sql("select * from table") > resultdf.writeStream.outputMode("append").format("console").option("truncate", > false).trigger(Trigger.Continuous("1 second")).start() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
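For readability, here is the reporter's reproduction from the description above, reassembled into a runnable form (same options as quoted; the Kafka broker address and topic name come from the report, so they are assumptions about the local setup):
{code:scala}
// Reassembled reproduction: create a streaming DataFrame, register it as a temp view,
// and select from that view with a continuous trigger.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TempTableContinuousRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("temp-table-repro").getOrCreate()

    val jdf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .load()

    jdf.createOrReplaceTempView("table")

    val resultdf = spark.sql("select * from table")
    resultdf.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .trigger(Trigger.Continuous("1 second"))
      .start()
      .awaitTermination()
  }
}
{code}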
[jira] [Resolved] (SPARK-25439) TPCHQuerySuite customer.c_nationkey should be bigint instead of string
[ https://issues.apache.org/jira/browse/SPARK-25439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25439. - Resolution: Fixed Assignee: Nicolas Poggi Fix Version/s: 2.4.0 > TPCHQuerySuite customer.c_nationkey should be bigint instead of string > -- > > Key: SPARK-25439 > URL: https://issues.apache.org/jira/browse/SPARK-25439 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Nicolas Poggi >Assignee: Nicolas Poggi >Priority: Minor > Labels: benchmark, easy-fix, test > Fix For: 2.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > > The > [TPCHQuerySuite|https://github.com/apache/spark/blob/be454a7cef1cb5c76fb22589fc3a55c1bf519cf4/sql/core/src/test/scala/org/apache/spark/sql/TPCHQuerySuite.scala#L72] > currently has {{string}} for the {{customer.c_nationkey}} column, while it > should be bigint according to [the > spec|http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpc-h_v2.17.3.pdf] > (identifier type) and matching the {{nation}} table. > Note: this update would make previous TPCH results not comparable for queries > using the {{customer}} table > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
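A hedged sketch of the schema being discussed (this is not the TPCHQuerySuite code; the column list follows the TPC-H customer table definition and the storage format is illustrative). The point is simply that c_nationkey is an identifier and should be bigint, matching nation.n_nationkey:
{code:scala}
// Illustrative DDL for the TPC-H customer table with c_nationkey as bigint.
import org.apache.spark.sql.SparkSession

object CustomerSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tpch-customer-schema").getOrCreate()
    spark.sql(
      """
        |CREATE TABLE IF NOT EXISTS customer (
        |  c_custkey    bigint,
        |  c_name       string,
        |  c_address    string,
        |  c_nationkey  bigint,   -- previously string in the suite; bigint per the TPC-H spec
        |  c_phone      string,
        |  c_acctbal    double,
        |  c_mktsegment string,
        |  c_comment    string
        |) USING parquet
      """.stripMargin)
  }
}
{code}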
[jira] [Commented] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616578#comment-16616578 ] Dongjoon Hyun commented on SPARK-25434: --- Welcome to the Apache Spark community, [~LandSurveyorK]. BTW, JIRA is not for Q&A. Could you read http://spark.apache.org/community.html for that resource? We use JIRA only when it's a really bug. > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). 
> Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. > >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
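This is the usual missing-winutils situation on Windows rather than a Spark bug. As a hedged sketch of the common workaround (the path is illustrative, and for the PySpark shell the same effect is normally achieved by setting the HADOOP_HOME environment variable before launching): place winutils.exe under a Hadoop home directory and point hadoop.home.dir at that directory before any Hadoop classes are loaded.
{code:scala}
// Sketch of the common workaround (illustrative path; not an official fix).
import org.apache.spark.sql.SparkSession

object WinutilsWorkaround {
  def main(args: Array[String]): Unit = {
    // Assumes winutils.exe has been placed at C:\hadoop\bin\winutils.exe (path is illustrative).
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("winutils-workaround")
      .getOrCreate()

    spark.range(5).show()
    spark.stop()
  }
}
{code}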
[jira] [Resolved] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25434. --- Resolution: Not A Problem > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. > scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. 
> >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616578#comment-16616578 ] Dongjoon Hyun edited comment on SPARK-25434 at 9/16/18 3:44 AM: Welcome to the Apache Spark community, [~LandSurveyorK]. BTW, JIRA is not for Q&A. Could you read http://spark.apache.org/community.html for that resource? We use JIRA only when it's a really bug. I closed this issue since I assume that you got what you wanted here. was (Author: dongjoon): Welcome to the Apache Spark community, [~LandSurveyorK]. BTW, JIRA is not for Q&A. Could you read http://spark.apache.org/community.html for that resource? We use JIRA only when it's a really bug. > failed to locate the winutils binary in the hadoop binary path > -- > > Key: SPARK-25434 > URL: https://issues.apache.org/jira/browse/SPARK-25434 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 2.3.1 >Reporter: WEI PENG >Priority: Major > > C:\Users\WEI>pyspark > Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC > v. > 1900 64 bit (AMD64)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in > th > e hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Ha > doop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur > ityUtil.java:611) > at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI > nformation.java:273) > at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use > rGroupInformation.java:261) > at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject( > UserGroupInformation.java:791) > at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou > pInformation.java:761) > at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr > oupInformation.java:634) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils > .scala:2467) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467) > at org.apache.spark.SecurityManager.(SecurityManager.scala:220) > at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit. 
> scala:408) > at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub > mit$$secMgr$1(SparkSubmit.scala:408) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme > nt$7.apply(SparkSubmit.scala:416) > at scala.Option.map(Option.scala:146) > at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark > Submit.scala:415) > at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu > bmit.scala:250) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop > lib > rary for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLeve > l(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 > /_/ > Using Python version 3.5.6 (default, Aug 26 2018 16:05:27) > SparkSession available as 'spark'. > >>> > > > > > > > > > > > > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24479: Labels: (was: feature) > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mingjie Tang >Assignee: Arun Mahadevan >Priority: Major > Fix For: 2.4.0 > > > Users currently have to register their own StreamingQueryListener with the > StreamingQueryManager programmatically; similar functionality is already > provided via EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS. > We propose to provide a STREAMING_QUERY_LISTENER conf so that users can > register their own listeners through configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
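For context, this is what users have to do today without such a conf: register the listener programmatically on the StreamingQueryManager. The sketch below uses the existing spark.streams.addListener API; the listener body is illustrative.
{code:scala}
// Registering a StreamingQueryListener programmatically (the manual route the conf would replace).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object RegisterListenerManually {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listener-demo").getOrCreate()

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"progress: ${event.progress.numInputRows} input rows")
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}")
    })
    // With the proposed conf, the listener class name would instead be supplied in the
    // session configuration and picked up automatically at session creation time.
  }
}
{code}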
[jira] [Created] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
Suryanarayana Garlapati created SPARK-25442: --- Summary: Support STS to run in K8S deployment with spark deployment mode as cluster Key: SPARK-25442 URL: https://issues.apache.org/jira/browse/SPARK-25442 Project: Spark Issue Type: Bug Components: Kubernetes, SQL Affects Versions: 2.4.0, 2.5.0 Reporter: Suryanarayana Garlapati STS (Spark Thrift Server) fails to start in Kubernetes deployments when the Spark deploy mode is cluster. Support should be added so that it can run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616601#comment-16616601 ] Suryanarayana Garlapati commented on SPARK-25442: - Following is the PR for the same: https://github.com/apache/spark/pull/22433 > Support STS to run in K8S deployment with spark deployment mode as cluster > -- > > Key: SPARK-25442 > URL: https://issues.apache.org/jira/browse/SPARK-25442 > Project: Spark > Issue Type: Bug > Components: Kubernetes, SQL >Affects Versions: 2.4.0, 2.5.0 >Reporter: Suryanarayana Garlapati >Priority: Major > > STS fails to start in kubernetes deployments with spark deploy mode as > cluster. Support should be added to make it run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao resolved SPARK-25391. -- Resolution: Won't Do > Make behaviors consistent when converting parquet hive table to parquet data > source > --- > > Key: SPARK-25391 > URL: https://issues.apache.org/jira/browse/SPARK-25391 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > parquet data source tables and hive parquet tables have different behaviors > about parquet field resolution. So, when > {{spark.sql.hive.convertMetastoreParquet}} is true, users might face > inconsistent behaviors. The differences are: > * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both > data source tables and hive tables do NOT respect > {{spark.sql.caseSensitive}}. However data source tables always do > case-sensitive parquet field resolution, while hive tables always do > case-insensitive parquet field resolution no matter whether > {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data > source tables respect {{spark.sql.caseSensitive}} while hive serde table > behavior is not changed. > * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, > data source tables do case-sensitive resolution and return columns with the > corresponding letter cases, while hive tables always return the first matched > column ignoring cases. SPARK-25132 let data source tables throw exception > when there is ambiguity while hive table behavior is not changed. > This ticket aims to make behaviors consistent when converting hive table to > data source table. > * The behavior must be consistent to do the conversion, so we skip the > conversion in case-sensitive mode because hive parquet table always do > case-insensitive field resolution. > * In case-insensitive mode, when converting hive parquet table to parquet > data source, we switch the duplicated fields resolution mode to ask parquet > data source to pick the first matched field - the same behavior as hive > parquet table - to keep behaviors consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
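A small, hedged sketch of the two settings whose interaction is described above (the table name is hypothetical; only the configuration keys are Spark's):
{code:scala}
// Illustrative only: the two configs that determine whether a parquet Hive table is
// converted to the built-in parquet data source and how field names are resolved.
import org.apache.spark.sql.SparkSession

object ConversionSettingsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convert-metastore-parquet")
      .enableHiveSupport()
      .getOrCreate()

    // false => case-insensitive field resolution; true => case-sensitive resolution.
    spark.conf.set("spark.sql.caseSensitive", "false")

    // true (the default) => parquet Hive serde tables are read through the built-in
    // parquet data source instead of the Hive serde path.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

    // Reading a parquet Hive table here goes through the data source path, which is
    // where the field-resolution differences described above show up.
    spark.sql("SELECT * FROM some_parquet_hive_table").show()  // hypothetical table name
  }
}
{code}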