[jira] [Comment Edited] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13

2020-08-06 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172091#comment-17172091
 ] 

Yang Jie edited comment on SPARK-32526 at 8/7/20, 5:44 AM:
---

The detailed error messages are in the attachment.

If [https://github.com/apache/spark/pull/29370] can be merged, the remaining failed
cases are as follows:
 * LiteralExpressionSuite (1 FAILED)
 * CollectionExpressionsSuite (6 FAILED)
 * StarJoinCostBasedReorderSuite (1 FAILED)
 * ObjectExpressionsSuite (2 FAILED)
 * ExpressionParserSuite (1 FAILED)
 * StringExpressionsSuite (1 FAILED)
 * JacksonParserSuite (1 FAILED)
 * HigherOrderFunctionsSuite (1 FAILED)
 * InferFiltersFromConstraintsSuite (3 FAILED)
 * ScalaReflectionSuite (2 FAILED)
 * RowEncoderSuite (10 FAILED)
 * SchemaUtilsSuite (2 FAILED)
 * MetadataSuite (1 FAILED)
 * ArrayDataIndexedSeqSuite (ABORTED)
 * ExpressionEncoderSuite (ABORTED)
 * ExpressionSetSuite (ABORTED)


was (Author: luciferyang):
Detailed error message is in attachment.

If [https://github.com/apache/spark/pull/29370] can be merged, remaining failed 
cases as follows:
 * LiteralExpressionSuite (1 FAILED)
 * CollectionExpressionsSuite( 6 FAILED)
 * StarJoinCostBasedReorderSuite ( 1 FAILED)
 * ObjectExpressionsSuite( 2 FAILED)
 * ExpressionParserSuite (1 FAILED)
 * StringExpressionsSuite ( 1 FAILED)
 * JacksonParserSuite ( 1 FAILED)
 * HigherOrderFunctionsSuite (1 FAILED)
 * InferFiltersFromConstraintsSuite( 3 FAILED)
 * ScalaReflectionSuite (2 FAILED)
 * RowEncoderSuite (10 FAILED)
 * SchemaUtilsSuite (2 FAILED)
 * ArrayDataIndexedSeqSuite (ABORTED)
 * ExpressionEncoderSuite  (ABORTED)
 * ExpressionSetSuite (ABORTED)

> Let sql/catalyst module tests pass for Scala 2.13
> -
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-and-aborted-20200806
>
>
> sql/catalyst module has following compile errors with scala-2.13 profile:
> {code:java}
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required
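For context, the type-mismatch errors above are the usual Scala 2.13 symptom of scala.Seq now aliasing scala.collection.immutable.Seq, so a mutable ArrayBuffer no longer satisfies a Seq parameter. A minimal sketch of the typical fix, with a hypothetical helper standing in for the failing call sites (not necessarily what the linked PR does):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Hypothetical callee that expects a Seq, mirroring the signatures in the errors above.
def combine(pairs: Seq[(String, String)]): Seq[(String, String)] = pairs

val buffer = ArrayBuffer(("a", "b"))
// Compiles under Scala 2.12 as-is; under 2.13 an explicit conversion is required.
combine(buffer.toSeq)
{code}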

[jira] [Resolved] (SPARK-32560) improve exception message

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32560.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29376
[https://github.com/apache/spark/pull/29376]

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
> Attachments: exception.png
>
>
> Exception messages lack single quotes; we can improve them to keep the messages
> consistent.






[jira] [Assigned] (SPARK-32560) improve exception message

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32560:


Assignee: philipse

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
> Attachments: exception.png
>
>
> Exception messages lack single quotes; we can improve them to keep the messages
> consistent.






[jira] [Commented] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-06 Thread yx91490 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172889#comment-17172889
 ] 

yx91490 commented on SPARK-32536:
-

The method org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace() is in
standalone-metastore-1.21.2.3.1.4.0-315-hive3.jar, but I cannot find the source
code.

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when execute insert overwrite table statement to dynamic partition :
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> It seems that Spark does not check whether the partition's HDFS location exists
> before deleting it, while Hive can successfully execute the same SQL.
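For illustration only, the kind of existence check being asked for looks roughly like the following sketch against the Hadoop FileSystem API (names and paths are illustrative; this is not the actual Hive/Spark code path):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Only attempt cleanup when the old partition location actually exists, so that a
// missing directory does not fail the whole INSERT OVERWRITE.
def deleteIfExists(fs: FileSystem, location: Path): Boolean =
  if (fs.exists(location)) fs.delete(location, true) else false

// Example usage:
// val fs = FileSystem.get(new Configuration())
// deleteIfExists(fs, new Path("hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001"))
{code}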






[jira] [Updated] (SPARK-32562) Pyspark drop duplicate columns

2020-08-06 Thread abhijeet dada mote (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abhijeet dada mote updated SPARK-32562:
---
Description: 

Hi All,

This is a suggestion: can we have a feature in PySpark to remove duplicate
columns?

I have come up with some small code for that:

{code:python}
def drop_duplicate_columns(_rdd_df):
    # Collect column names that appear more than once, then drop every column with those names.
    column_names = _rdd_df.columns
    duplicate_columns = set([x for x in column_names if column_names.count(x) > 1])
    _rdd_df = _rdd_df.drop(*duplicate_columns)
    return _rdd_df
{code}

Your suggestions are appreciated, and I can work on this PR; it would be my first
contribution (PR) to PySpark if you agree with it.

  was:
Hi All,

This is one suggestion can we have a feature in pyspark to remove duplicate 
columns? 

I have come up with small code for that 



def drop_duplicate_columns(_rdd_df):
column_names = _rdd_df.columns
duplicate_columns = set([x for x in column_names if column_names.count(x) > 
1])
_rdd_df = _rdd_df.drop(*duplicate_columns)
return _rdd_df



Your suggestions are appreciatd and can work on this PR, this would be my first 
contribution(PR) to Pyspark if you guys agree with it


> Pyspark drop duplicate columns
> --
>
> Key: SPARK-32562
> URL: https://issues.apache.org/jira/browse/SPARK-32562
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: abhijeet dada mote
>Priority: Major
>  Labels: newbie, starter
> Fix For: 3.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi All,
> This is a suggestion: can we have a feature in PySpark to remove duplicate
> columns?
> I have come up with some small code for that:
> {code:python}
> def drop_duplicate_columns(_rdd_df):
>     column_names = _rdd_df.columns
>     duplicate_columns = set([x for x in column_names if column_names.count(x) > 1])
>     _rdd_df = _rdd_df.drop(*duplicate_columns)
>     return _rdd_df
> {code}
> Your suggestions are appreciated, and I can work on this PR; it would be my
> first contribution (PR) to PySpark if you agree with it.






[jira] [Created] (SPARK-32562) Pyspark drop duplicate columns

2020-08-06 Thread abhijeet dada mote (Jira)
abhijeet dada mote created SPARK-32562:
--

 Summary: Pyspark drop duplicate columns
 Key: SPARK-32562
 URL: https://issues.apache.org/jira/browse/SPARK-32562
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: abhijeet dada mote
 Fix For: 3.0.0


Hi All,

This is a suggestion: can we have a feature in PySpark to remove duplicate
columns?

I have come up with some small code for that:

def drop_duplicate_columns(_rdd_df):
    column_names = _rdd_df.columns
    duplicate_columns = set([x for x in column_names if column_names.count(x) > 1])
    _rdd_df = _rdd_df.drop(*duplicate_columns)
    return _rdd_df

Your suggestions are appreciated, and I can work on this PR; it would be my first
contribution (PR) to PySpark if you agree with it.






[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172805#comment-17172805
 ] 

Ramakrishna Prasad K S commented on SPARK-32558:


[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the
priority and target version. Sorry about that.

Can someone help me with this issue? It is critical, and the workaround mentioned
in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is also not
working.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> 
> 

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172805#comment-17172805
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/7/20, 4:07 AM:
-

[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the
priority and target version. Sorry about that.

Can someone help me with this issue? It is critical, and the workaround mentioned
in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is also not
working.


was (Author: ramks):
[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the 
Priority and target version. Sorry for the same. 

Can someone help me with this issue? This is critical and the workaround 
mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is 
also not working.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE 

[jira] [Commented] (SPARK-32560) improve exception message

2020-08-06 Thread philipse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172791#comment-17172791
 ] 

philipse commented on SPARK-32560:
--

 

Thanks [~maropu] for your notice. I will improve it in the future. ;)

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
> Attachments: exception.png
>
>
> Exception messages lack single quotes; we can improve them to keep the messages
> consistent.






[jira] [Commented] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172786#comment-17172786
 ] 

Apache Spark commented on SPARK-31703:
--

User 'tinhto-000' has created a pull request for this issue:
https://github.com/apache/spark/pull/29383

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Critical
>  Labels: BigEndian
> Attachments: Data_problem_Spark.gif
>
>
> While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC)
> so as to be able to read data stored in parquet format, we noticed that values
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the parquet documentation, parquet always stores the values
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}
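To make the quoted encoding rules concrete, here is a small platform-independent sketch of decoding a PLAIN-encoded DOUBLE with plain java.nio (not Spark's actual reader): the byte order has to be forced to little-endian rather than inherited from the machine's native order, which is exactly what goes wrong on big-endian systems.

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// Decode an 8-byte PLAIN-encoded parquet DOUBLE starting at `offset`.
// Forcing LITTLE_ENDIAN yields the same value on x86, AIX and LinuxPPC64;
// relying on the platform's native byte order does not.
def readPlainDouble(bytes: Array[Byte], offset: Int): Double =
  ByteBuffer.wrap(bytes, offset, 8).order(ByteOrder.LITTLE_ENDIAN).getDouble
{code}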






[jira] [Assigned] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31703:


Assignee: (was: Apache Spark)

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Critical
>  Labels: BigEndian
> Attachments: Data_problem_Spark.gif
>
>
> While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC)
> so as to be able to read data stored in parquet format, we noticed that values
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the parquet documentation, parquet always stores the values
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}






[jira] [Commented] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172785#comment-17172785
 ] 

Apache Spark commented on SPARK-31703:
--

User 'tinhto-000' has created a pull request for this issue:
https://github.com/apache/spark/pull/29383

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Critical
>  Labels: BigEndian
> Attachments: Data_problem_Spark.gif
>
>
> While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC)
> so as to be able to read data stored in parquet format, we noticed that values
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the parquet documentation, parquet always stores the values
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}






[jira] [Assigned] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31703:


Assignee: Apache Spark

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Assignee: Apache Spark
>Priority: Critical
>  Labels: BigEndian
> Attachments: Data_problem_Spark.gif
>
>
> While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC)
> so as to be able to read data stored in parquet format, we noticed that values
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the parquet documentation, parquet always stores the values
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}






[jira] [Resolved] (SPARK-32549) Add column name in _infer_schema error message

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32549.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29365
[https://github.com/apache/spark/pull/29365]

> Add column name in _infer_schema error message
> --
>
> Key: SPARK-32549
> URL: https://issues.apache.org/jira/browse/SPARK-32549
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> The current error message from _infer_type in _infer_schema only includes the
> unsupported column type but not the column name. This ticket suggests adding
> the column name to the error message to make it easier for users to identify
> which column they should drop or convert.






[jira] [Assigned] (SPARK-32549) Add column name in _infer_schema error message

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32549:


Assignee: Liang Zhang

> Add column name in _infer_schema error message
> --
>
> Key: SPARK-32549
> URL: https://issues.apache.org/jira/browse/SPARK-32549
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>
> The current error message from _infer_type in _infer_schema only includes the
> unsupported column type but not the column name. This ticket suggests adding
> the column name to the error message to make it easier for users to identify
> which column they should drop or convert.






[jira] [Commented] (SPARK-32560) improve exception message

2020-08-06 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172780#comment-17172780
 ] 

Takeshi Yamamuro commented on SPARK-32560:
--

Hi, [~小郭飞飞刀], thanks for the report, but I think this is a minor/trivial issue,
so I don't think we need to file a ticket for it.

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
> Attachments: exception.png
>
>
> Exception messages lack single quotes; we can improve them to keep the messages
> consistent.






[jira] [Commented] (SPARK-32540) Eliminate filter clause in aggregate

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172779#comment-17172779
 ] 

Apache Spark commented on SPARK-32540:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29382

> Eliminate filter clause in aggregate
> 
>
> Key: SPARK-32540
> URL: https://issues.apache.org/jira/browse/SPARK-32540
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> SQL like:
> {code:java}
> SELECT COUNT(DISTINCT 1) FILTER (WHERE true) FROM testData;
> {code}
> could be converted to
> {code:java}
> SELECT COUNT(DISTINCT 1) FROM testData;
> {code}
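A minimal sketch of what such a rewrite could look like as a Catalyst rule, assuming Spark 3.x internals (AggregateExpression.filter, Literal.TrueLiteral); this is an illustration only, not the implementation in the linked PR:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// FILTER (WHERE true) keeps every row, so the clause can simply be dropped.
object EliminateTrueAggregateFilter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case agg: AggregateExpression if agg.filter.contains(Literal.TrueLiteral) =>
      agg.copy(filter = None)
  }
}
{code}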






[jira] [Resolved] (SPARK-32538) Use local time zone for the timestamp logged in unit-tests.log

2020-08-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32538.
--
Fix Version/s: 3.0.1
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/29356]

> Use local time zone for the timestamp logged in unit-tests.log
> --
>
> Key: SPARK-32538
> URL: https://issues.apache.org/jira/browse/SPARK-32538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.1
>
>
> SparkFunSuite fixes the default time zone to America/Los_Angeles so the 
> timestamp logged in unit-tests.log is also based on the fixed time zone.
> This is confusing for developers whose time zone is not America/Los_Angeles.
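As a small illustration of the confusion (plain java.time, independent of SparkFunSuite), the same instant rendered in the fixed test zone and in the developer's local zone can differ by many hours:

{code:scala}
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

val now = Instant.now()
val pattern = DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss")

// What unit-tests.log shows today (time zone fixed to America/Los_Angeles)...
println("fixed zone: " + pattern.withZone(ZoneId.of("America/Los_Angeles")).format(now))
// ...versus what the developer's wall clock shows.
println("local zone: " + pattern.withZone(ZoneId.systemDefault()).format(now))
{code}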






[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32558:
-
Description: 
Steps to reproduce the issue:

--- 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
shell .

{code}
[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+
 
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
 

Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

{code}
adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
{code}


Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

{code}
scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []
scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
{code} 

Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

{code}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

[jira] [Commented] (SPARK-30577) StorageLevel.DISK_ONLY_2 causes the data loss

2020-08-06 Thread zero222 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172771#comment-17172771
 ] 

zero222 commented on SPARK-30577:
-

OK, Thank you very much.

> StorageLevel.DISK_ONLY_2 causes the data loss
> -
>
> Key: SPARK-30577
> URL: https://issues.apache.org/jira/browse/SPARK-30577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: zero222
>Priority: Major
> Attachments: DISK_ONLY_2.png
>
>
> As shown in the attachment, after I load the data of a Hive table (which is
> immutable) and cache it at the DISK_ONLY_2 storage level, the count of the data
> is different every time.
>  






[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32558:
-
Fix Version/s: (was: 3.0.0)

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
>  
>  Step 4) Copy the ORC files created in 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32558:
-
Target Version/s:   (was: 3.0.0)

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
>  
>  Step 4) Copy the ORC files created in 

[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172770#comment-17172770
 ] 

Hyukjin Kwon commented on SPARK-32558:
--

Thanks [~rohitmishr1484]. [~ramks], please also don't set the target versions,
which are reserved for committers, or the fix version, which is set only after the
fix is actually merged and landed.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-------+
> |               key|  value|
> +------------------+-------+
> |spark.sql.orc.impl| *hive*|
> +------------------+-------+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> 

[jira] [Reopened] (SPARK-32515) Distinct Function Weird Bug

2020-08-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-32515:
--

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png, 
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error: when I load my CSV file into Spark 
> and try to check all the distinct values from a column inside a dataframe, 
> everything I try in Spark results in a wrong answer. But if I convert my Spark 
> dataframe into a pandas dataframe, it works. Please help. This bug only happens 
> with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32515) Distinct Function Weird Bug

2020-08-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32515.
--
Resolution: Not A Problem

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png, 
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error: when I load my CSV file into Spark 
> and try to check all the distinct values from a column inside a dataframe, 
> everything I try in Spark results in a wrong answer. But if I convert my Spark 
> dataframe into a pandas dataframe, it works. Please help. This bug only happens 
> with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32515:
-
Target Version/s:   (was: 2.4.6)

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png, 
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error: when I load my CSV file into Spark 
> and try to check all the distinct values from a column inside a dataframe, 
> everything I try in Spark results in a wrong answer. But if I convert my Spark 
> dataframe into a pandas dataframe, it works. Please help. This bug only happens 
> with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32515:
-
Fix Version/s: (was: 2.4.6)

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png, 
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error: when I load my CSV file into Spark 
> and try to check all the distinct values from a column inside a dataframe, 
> everything I try in Spark results in a wrong answer. But if I convert my Spark 
> dataframe into a pandas dataframe, it works. Please help. This bug only happens 
> with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31851) Redesign PySpark documentation

2020-08-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172758#comment-17172758
 ] 

Hyukjin Kwon commented on SPARK-31851:
--

[~Shan_Chandra] Please go ahead. It would be best to leave a comment in one of the 
subtasks saying you're working on it, and then open a pull request on GitHub as 
guided in http://spark.apache.org/contributing.html.

> Redesign PySpark documentation
> --
>
> Key: SPARK-31851
> URL: https://issues.apache.org/jira/browse/SPARK-31851
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> Currently, the PySpark documentation 
> (https://spark.apache.org/docs/latest/api/python/index.html) is poorly written 
> compared to other projects.
> See, for example, Koalas: https://koalas.readthedocs.io/en/latest/.
> PySpark is becoming more and more important in Spark, and we should improve this 
> documentation so people can follow it easily.
> Reference: 
> - https://koalas.readthedocs.io/en/latest/
> - https://pandas.pydata.org/docs/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32560) improve exception message

2020-08-06 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-32560:
-
Description: Exception messages lack single quotes; we can improve them to keep 
consistent  (was: Exception messages lack single quotes; we can improve them to 
keep consistent

!image-2020-08-07-08-32-35-808.png!)

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
> Attachments: exception.png
>
>
> Exception messages lack single quotes; we can improve them to keep 
> consistent



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32560) improve exception message

2020-08-06 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-32560:
-
Attachment: exception.png

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
> Attachments: exception.png
>
>
> Exception messages lack single quotes; we can improve them to keep 
> consistent
> !image-2020-08-07-08-32-35-808.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32560) improve exception message

2020-08-06 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-32560:
-
Description: 
Exception messages lack single quotes; we can improve them to keep 
consistent

!image-2020-08-07-08-32-35-808.png!

  was: Exception messages have extra single quotes; we can improve them.


> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Exception messages lack single quotes; we can improve them to keep 
> consistent
> !image-2020-08-07-08-32-35-808.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30577) StorageLevel.DISK_ONLY_2 causes the data loss

2020-08-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172715#comment-17172715
 ] 

Dongjoon Hyun commented on SPARK-30577:
---

Thanks for the confirmation.

BTW, `get_all_functions` is irrelevant to this. If you are tightly-coupled with 
Hive 1.2, could you try the following? We have a dedicated release for Hive 1.2 
users. :)
- 
https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz

> StorageLevel.DISK_ONLY_2 causes the data loss
> -
>
> Key: SPARK-30577
> URL: https://issues.apache.org/jira/browse/SPARK-30577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: zero222
>Priority: Major
> Attachments: DISK_ONLY_2.png
>
>
> As shown in the attachment, after I load the data of a Hive table (which is 
> immutable) and cache it at the DISK_ONLY_2 storage level, the count of the data 
> is different every time.
>  
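For reference, a minimal spark-shell sketch of the reported pattern (the table name is 
illustrative, not taken from the report; assumes a session with Hive support):

{code}
import org.apache.spark.storage.StorageLevel

// Cache the table disk-only with two replicas, as described above.
val df = spark.table("some_immutable_hive_table").persist(StorageLevel.DISK_ONLY_2)

df.count()   // the report says repeated count() calls return different values
{code}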



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32522) Using pyspark with a MultiLayerPerceptron model given inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a P

2020-08-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172714#comment-17172714
 ] 

Dongjoon Hyun commented on SPARK-32522:
---

Thank you for the explanation, [~Ben Smith].

> Using pyspark with a MultiLayerPerceptron model given inconsistent outputs if 
> a large amount of data is fed into it and at least one of the model outputs 
> is fed to a Python UDF.
> -
>
> Key: SPARK-32522
> URL: https://issues.apache.org/jira/browse/SPARK-32522
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 3.0.0
> Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3
> or
> CentOS 7.6 with Python 3.6.3 and Spark built from master
>Reporter: Ben Smith
>Priority: Major
>  Labels: correctness
> Attachments: model.zip, pyspark-script.py
>
>
> Using pyspark with a MultiLayerPerceptron model given inconsistent outputs if 
> a large amount of data is fed into it and at least one of the model outputs 
> is fed to a Python UDF.
> This data correctness issue impacts both the Spark 2.4 releases and the 
> latest Master branch.
> I do not understand the root cause and cannot recreate 100% of the time. But 
> I have a simplified code sample (attached) that triggers the bug regularly. I 
> raised an inquiry on the mailing list as a Spark 2.4 issue but nobody had a 
> suggested root cause and I have since recreated the problem on master so I am 
> now raising a bug here.
> During debugging I have narrowed the problem down somewhat and some 
> observations I have made while doing this are:
>  * I can recreate the problem with a very simple MultilayerPerceptron with no 
> hidden layers (just 14 features and 2 outputs), I also see it with a more 
> complex MultilayerPerceptron model so I don't think the model details are 
> important.
>  * I cannot recreate the problem unless the model output is fed to a python 
> UDF, removing this leads to good outputs for the model and having it means 
> that model outputs are inconsistent (note that not just the Python UDF 
> outputs are inconsistent)
>  * I cannot recreate the problem on minuscule amounts of data or when my data 
> is partitioned heavily. 100,000 rows of input with 2 partitions sees the 
> issue happen most of the time.
>  * Some of the bad outputs I get could be explained if certain features were 
> zero when they came into the model (when they are not in my actual feature 
> data)
>  * I can recreate the problem on several different servers
> My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3, I can also 
> recreate the issue from the code on the Spark master branch but strangely I 
> cannot recreate the issue with Spark 2.4.3 and Python 2.7. I'm not sure why 
> the version of python would matter.
> The attached code sample triggers the problem for me the vast majority of the 
> time when pasted into a pyspark shell. This code generates a dataframe 
> containing 100,000 identical rows, transforms it with a MultiLayerPerceptron 
> model and feeds one of the model output columns to a simple Python UDF to 
> generate an additional column. The resulting dataframe has the distinct rows 
> selected and since all the inputs are identical I would expect to get 1 row 
> back, instead I get many unique rows with the number returned varying each 
> time I run the code. To run the code you will need the model files locally. I 
> have attached the model as a zip archive and unzipping this to /tmp should be 
> all you need to do to get the code to run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32264) More resources in Github Actions

2020-08-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172703#comment-17172703
 ] 

Dongjoon Hyun commented on SPARK-32264:
---

That's too bad. However, thank you for letting us know.

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Holden Karau
>Priority: Major
>
> We are currently using the free version of GitHub Actions, which only allows 20 
> concurrent jobs. This is not enough for the heavy development in Apache Spark.
> We should have a way to allocate more resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13

2020-08-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172701#comment-17172701
 ] 

Dongjoon Hyun commented on SPARK-32526:
---

Thank you all for these efforts!

> Let sql/catalyst module tests pass for Scala 2.13
> -
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-and-aborted-20200806
>
>
> sql/catalyst module has following compile errors with scala-2.13 profile:
> {code:java}
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq 
> on these to ensure they still work on 2.12.
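As an illustration of the 2.13 change behind this (a minimal, self-contained sketch; 
needsSeq and the values are made up, this is not the Analyzer code):

{code}
// In Scala 2.13, scala.Seq is an alias for immutable.Seq, so a mutable ArrayBuffer no
// longer satisfies a Seq[...] parameter; an explicit .toSeq compiles on both versions.
import scala.collection.mutable.ArrayBuffer

def needsSeq(pairs: Seq[(String, String)]): Int = pairs.size   // hypothetical signature

val buf = ArrayBuffer(("a", "x"), ("b", "y"))
needsSeq(buf.toSeq)   // OK on 2.12 and 2.13; passing buf directly fails to compile on 2.13
{code}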



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-08-06 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172689#comment-17172689
 ] 

Sunitha Kambhampati commented on SPARK-32018:
-

I have added a summary of my comments from the 
[https://github.com/apache/spark/pull/29125] discussion in the comment above. 
Thanks. 

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it out throws ArithmeticException. The exception is thrown when calling 
> {{getDecimal}} on an UnsafeRow whose stored decimal has a precision greater than 
> the requested precision. Setting the overflowed decimal to null when writing 
> into UnsafeRow should fix this issue.
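As a rough illustration of the precision check involved (a hedged sketch using the 
public Decimal helper rather than UnsafeRow internals; the value is made up):

{code}
import org.apache.spark.sql.types.Decimal

// 10^20 has 21 integral digits, so it cannot be held as DECIMAL(38, 18);
// changePrecision reports the overflow, and the fix stores null in that case
// instead of letting a later getDecimal call throw ArithmeticException.
val overflowed = Decimal(BigDecimal("1" + "0" * 20))
overflowed.changePrecision(38, 18)   // returns false
{code}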



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-08-06 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172688#comment-17172688
 ] 

Sunitha Kambhampati commented on SPARK-32018:
-

The important issue is that we should not return incorrect results. In general, it 
is not good practice to backport a change to a stable branch and cause more 
queries to return incorrect results.

Just to reiterate:
 # The current PR that backported the UnsafeRow fix causes queries to return 
incorrect results. This applies to the v2.4.x and v3.0.x lines. The change by 
itself has unsafe side effects and results in incorrect results being returned.
 # It does not matter whether you have whole-stage codegen on or off, or ANSI mode 
on or off; you will get more queries returning incorrect results.
 # Incorrect results are very serious, and it is not good for Spark users to run 
into them for common operations like sum.

{code:java}
 
scala> val decStr = "1" + "0" * 19
decStr: String = 1000
scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]
 
scala>  val d5 = d3.select(expr(s"cast('$decStr' as decimal (38, 18)) as 
d"),lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]
scala> d5.show(false)   <- INCORRECT RESULTS
+---+
|sumd                                   |
+---+
|2000.00|
+---+
{code}
 

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it out throws ArithmeticException. The exception is thrown when calling 
> {{getDecimal}} on an UnsafeRow whose stored decimal has a precision greater than 
> the requested precision. Setting the overflowed decimal to null when writing 
> into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-08-06 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-32506:
--

Assignee: Huaxin Gao

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Huaxin Gao
>Priority: Major
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}
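For context, the eventually helper in the traceback simply re-evaluates the condition 
until it holds or the timeout elapses; a purely illustrative Scala sketch of that retry 
loop (not the pyspark.testing.utils helper itself):

{code}
def eventually(timeoutMs: Long, intervalMs: Long = 100)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  var ok = condition
  while (!ok && System.currentTimeMillis() < deadline) {
    Thread.sleep(intervalMs)   // wait before re-checking the condition
    ok = condition
  }
  ok
}

eventually(timeoutMs = 1000) { scala.util.Random.nextDouble() > 0.5 }   // usage example
{code}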



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-08-06 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-32506.

Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29380
[https://github.com/apache/spark/pull/29380]

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172647#comment-17172647
 ] 

Apache Spark commented on SPARK-32561:
--

User 'msamirkhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29381

> Allow DataSourceReadBenchmark to run for select formats
> ---
>
> Key: SPARK-32561
> URL: https://issues.apache.org/jira/browse/SPARK-32561
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Currently DataSourceReadBenchmark runs benchmarks for Parquet, ORC, CSV, and 
> Json file formats and there is no way to specify at runtime a single format 
> or a subset of formats, like there is for BuiltInDataSourceWriteBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32561:


Assignee: (was: Apache Spark)

> Allow DataSourceReadBenchmark to run for select formats
> ---
>
> Key: SPARK-32561
> URL: https://issues.apache.org/jira/browse/SPARK-32561
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Currently DataSourceReadBenchmark runs benchmarks for Parquet, ORC, CSV, and 
> Json file formats and there is no way to specify at runtime a single format 
> or a subset of formats, like there is for BuiltInDataSourceWriteBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172646#comment-17172646
 ] 

Apache Spark commented on SPARK-32561:
--

User 'msamirkhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29381

> Allow DataSourceReadBenchmark to run for select formats
> ---
>
> Key: SPARK-32561
> URL: https://issues.apache.org/jira/browse/SPARK-32561
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Currently DataSourceReadBenchmark runs benchmarks for Parquet, ORC, CSV, and 
> Json file formats and there is no way to specify at runtime a single format 
> or a subset of formats, like there is for BuiltInDataSourceWriteBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32561:


Assignee: Apache Spark

> Allow DataSourceReadBenchmark to run for select formats
> ---
>
> Key: SPARK-32561
> URL: https://issues.apache.org/jira/browse/SPARK-32561
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Assignee: Apache Spark
>Priority: Major
>
> Currently DataSourceReadBenchmark runs benchmarks for Parquet, ORC, CSV, and 
> Json file formats and there is no way to specify at runtime a single format 
> or a subset of formats, like there is for BuiltInDataSourceWriteBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats

2020-08-06 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32561:
---

 Summary: Allow DataSourceReadBenchmark to run for select formats
 Key: SPARK-32561
 URL: https://issues.apache.org/jira/browse/SPARK-32561
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


Currently DataSourceReadBenchmark runs benchmarks for Parquet, ORC, CSV, and 
Json file formats and there is no way to specify at runtime a single format or 
a subset of formats, like there is for BuiltInDataSourceWriteBenchmark.
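A purely hypothetical sketch of the requested behaviour (the object and output are 
illustrative, not the actual benchmark code): pick the formats to run from the 
arguments passed at runtime, defaulting to all of them.

{code}
object ReadBenchmarkFormatFilter {
  private val allFormats = Seq("Parquet", "ORC", "JSON", "CSV")

  def main(args: Array[String]): Unit = {
    // Case-insensitive selection; no arguments means "run everything".
    val wanted = args.map(_.toLowerCase).toSet
    val formats =
      if (wanted.isEmpty) allFormats
      else allFormats.filter(f => wanted.contains(f.toLowerCase))
    formats.foreach(f => println(s"would run read benchmarks for $f"))
  }
}
{code}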



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats

2020-08-06 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32531:

Component/s: Tests

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> We found that Spark performance was slow compared to Pig on some schemas in our 
> pipelines. On investigation, it was found that Spark was slow for nested structs 
> and arrays of structs, and these cases were not being profiled by the current 
> benchmarks. I have some improvements for the ORC (SPARK-32532) and Avro 
> (SPARK-32533) file formats which improve performance in these cases and will be 
> putting up the PRs soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172635#comment-17172635
 ] 

Rohit Mishra commented on SPARK-32558:
--

[~ramks], thanks for raising this bug, but please refrain from marking any issue 
with priority "Blocker". All issues should be marked "Major" or below, apart from 
the ones raised by committers. Thanks.

 

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+--------+
> |               key|   value|
> +------------------+--------+
> |spark.sql.orc.impl|*native*|
> +------------------+--------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-------+
> |               key|  value|
> +------------------+-------+
> |spark.sql.orc.impl| *hive*|
> +------------------+-------+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Rohit Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Mishra updated SPARK-32558:
-
Priority: Major  (was: Blocker)

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+--------+
> |               key|   value|
> +------------------+--------+
> |spark.sql.orc.impl|*native*|
> +------------------+--------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-------+
> |               key|  value|
> +------------------+-------+
> |spark.sql.orc.impl| *hive*|
> +------------------+-------+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
>  
>  Step 4) Copy 

[jira] [Commented] (SPARK-31851) Redesign PySpark documentation

2020-08-06 Thread Shanmugavel Kuttiyandi Chandrakasu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172627#comment-17172627
 ] 

Shanmugavel Kuttiyandi Chandrakasu commented on SPARK-31851:


Hi, 
Can you please let me know if I can contribute to the documentation? Provided the 
opportunity, this will be my first piece of work towards becoming an open source 
committer. 

> Redesign PySpark documentation
> --
>
> Key: SPARK-31851
> URL: https://issues.apache.org/jira/browse/SPARK-31851
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> Currently, the PySpark documentation 
> (https://spark.apache.org/docs/latest/api/python/index.html) is poorly written 
> compared to other projects.
> See, for example, Koalas: https://koalas.readthedocs.io/en/latest/.
> PySpark is becoming more and more important in Spark, and we should improve this 
> documentation so people can follow it easily.
> Reference: 
> - https://koalas.readthedocs.io/en/latest/
> - https://pandas.pydata.org/docs/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-32551) Ambiguous self join error in non self join with window

2020-08-06 Thread kanika dhuria (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria closed SPARK-32551.
-

Closing as duplicate.

> Ambiguous self join error in non self join with window
> --
>
> Key: SPARK-32551
> URL: https://issues.apache.org/jira/browse/SPARK-32551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: kanika dhuria
>Priority: Major
> Fix For: 3.0.1
>
>
> The following code fails the ambiguous self join analysis, even though it 
> doesn't have a self join:
> val v1 = spark.range(3).toDF("m")
>  val v2 = spark.range(3).toDF("d")
>  val v3 = v1.join(v2, v1("m").===(v2("d")))
>  val v4 = v3("d");
>  val w1 = Window.partitionBy(v4)
>  val out = v3.select(v4.as("a"), sum(v4).over(w1).as("b"))
> org.apache.spark.sql.AnalysisException: Column a#45L are ambiguous. It's 
> probably because you joined several Datasets together, and some of these 
> Datasets are the same. This column points to one of the Datasets but Spark is 
> unable to figure out which one. Please alias the Datasets with different 
> names via `Dataset.as` before joining them, and specify the column using 
> qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You 
> can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable 
> this check.;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32551) Ambiguous self join error in non self join with window

2020-08-06 Thread kanika dhuria (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria resolved SPARK-32551.
---
Fix Version/s: 3.0.1
   Resolution: Fixed

> Ambiguous self join error in non self join with window
> --
>
> Key: SPARK-32551
> URL: https://issues.apache.org/jira/browse/SPARK-32551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: kanika dhuria
>Priority: Major
> Fix For: 3.0.1
>
>
> The following code fails the ambiguous self join analysis, even though it 
> doesn't have a self join:
> val v1 = spark.range(3).toDF("m")
>  val v2 = spark.range(3).toDF("d")
>  val v3 = v1.join(v2, v1("m").===(v2("d")))
>  val v4 = v3("d");
>  val w1 = Window.partitionBy(v4)
>  val out = v3.select(v4.as("a"), sum(v4).over(w1).as("b"))
> org.apache.spark.sql.AnalysisException: Column a#45L are ambiguous. It's 
> probably because you joined several Datasets together, and some of these 
> Datasets are the same. This column points to one of the Datasets but Spark is 
> unable to figure out which one. Please alias the Datasets with different 
> names via `Dataset.as` before joining them, and specify the column using 
> qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You 
> can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable 
> this check.;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32551) Ambiguous self join error in non self join with window

2020-08-06 Thread kanika dhuria (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172586#comment-17172586
 ] 

kanika dhuria commented on SPARK-32551:
---

Thanks [~cloud_fan], it is fixed in the latest 3.0 branch. 

Fixed as part of https://issues.apache.org/jira/browse/SPARK-31956.
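For anyone still on 3.0.0, a hedged sketch of the workaround quoted in the error 
message below (the root cause itself is what SPARK-31956 fixes):

{code}
// Disable the ambiguous-self-join check at runtime, as the error message suggests;
// alternatively, alias the Datasets before joining and use qualified column names.
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
{code}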

> Ambiguous self join error in non self join with window
> --
>
> Key: SPARK-32551
> URL: https://issues.apache.org/jira/browse/SPARK-32551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: kanika dhuria
>Priority: Major
>
> The following code fails the ambiguous self join analysis, even though it 
> doesn't have a self join:
> val v1 = spark.range(3).toDF("m")
>  val v2 = spark.range(3).toDF("d")
>  val v3 = v1.join(v2, v1("m").===(v2("d")))
>  val v4 = v3("d");
>  val w1 = Window.partitionBy(v4)
>  val out = v3.select(v4.as("a"), sum(v4).over(w1).as("b"))
> org.apache.spark.sql.AnalysisException: Column a#45L are ambiguous. It's 
> probably because you joined several Datasets together, and some of these 
> Datasets are the same. This column points to one of the Datasets but Spark is 
> unable to figure out which one. Please alias the Datasets with different 
> names via `Dataset.as` before joining them, and specify the column using 
> qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You 
> can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable 
> this check.;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172581#comment-17172581
 ] 

Apache Spark commented on SPARK-32506:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29380

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32515) Distinct Function Weird Bug

2020-08-06 Thread Jayce Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayce Jiang resolved SPARK-32515.
-
   Fix Version/s: 2.4.6
Target Version/s: 2.4.6
  Resolution: Fixed

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png, 
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error: when I load my CSV file into Spark 
> and try to check all the distinct values from a column inside a dataframe, 
> everything I try in Spark results in a wrong answer. But if I convert my Spark 
> dataframe into a pandas dataframe, it works. Please help. This bug only happens 
> with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32515) Distinct Function Weird Bug

2020-08-06 Thread Jayce Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172579#comment-17172579
 ] 

Jayce Jiang commented on SPARK-32515:
-

Closing the issue; it has to do with the spark.load function and the weird format 
of the tweet column.
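If the root cause is the tweet column containing embedded newlines and quotes (an 
assumption, not confirmed in this report), a hedged sketch of reader options that 
usually make the Spark and pandas distinct counts agree:

{code}
// Path and column name are placeholders; multiLine keeps quoted fields that span
// lines in one row, and quote/escape handle embedded double quotes.
val tweets = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/tweets.csv")

tweets.select("some_column").distinct().count()
{code}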

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png, 
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error: when I load my CSV file into Spark 
> and try to check all the distinct values from a column inside a dataframe, 
> everything I try in Spark results in a wrong answer. But if I convert my Spark 
> dataframe into a pandas dataframe, it works. Please help. This bug only happens 
> with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32506:


Assignee: (was: Apache Spark)

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32506:


Assignee: Apache Spark

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32506) flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172578#comment-17172578
 ] 

Apache Spark commented on SPARK-32506:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29380

> flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests
> 
>
> Key: SPARK-32506
> URL: https://issues.apache.org/jira/browse/SPARK-32506
> Project: Spark
>  Issue Type: Test
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 466, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 
> 81, in eventually
> lastValue = condition()
>   File 
> "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 461, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}
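The eventually helper seen in the traceback polls a condition until it holds or the timeout elapses; a simplified, self-contained sketch of that pattern (not the actual pyspark.testing.utils implementation) is:
{code:python}
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Poll condition() until it returns True or until timeout seconds pass."""
    deadline = time.time() + timeout
    last_value = None
    while time.time() < deadline:
        last_value = condition()
        if last_value is True:
            return
        time.sleep(interval)
    raise AssertionError(
        "Condition not met within %s seconds; last value: %r" % (timeout, last_value))

# Mirroring the failing assertion above (errors is assumed to be the list of
# per-batch test errors collected by the streaming test):
# eventually(lambda: errors[1] - errors[-1] > 2, timeout=180.0)
{code}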



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32551) Ambiguous self join error in non self join with window

2020-08-06 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172407#comment-17172407
 ] 

Wenchen Fan commented on SPARK-32551:
-

Can you try the latest 3.0 branch? There are some bug fixes for this self-join 
check.

> Ambiguous self join error in non self join with window
> --
>
> Key: SPARK-32551
> URL: https://issues.apache.org/jira/browse/SPARK-32551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: kanika dhuria
>Priority: Major
>
> Following code fails ambiguous self join analysis, even when it doesn't have 
> self join 
> val v1 = spark.range(3).toDF("m")
>  val v2 = spark.range(3).toDF("d")
>  val v3 = v1.join(v2, v1("m").===(v2("d")))
>  val v4 = v3("d");
>  val w1 = Window.partitionBy(v4)
>  val out = v3.select(v4.as("a"), sum(v4).over(w1).as("b"))
> org.apache.spark.sql.AnalysisException: Column a#45L are ambiguous. It's 
> probably because you joined several Datasets together, and some of these 
> Datasets are the same. This column points to one of the Datasets but Spark is 
> unable to figure out which one. Please alias the Datasets with different 
> names via `Dataset.as` before joining them, and specify the column using 
> qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You 
> can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable 
> this check.;
>  
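Until a fix is confirmed, the quoted error message itself points at two workarounds; a hedged PySpark sketch of both (whether the second one avoids the check on a given 3.0.0 build is untested here):
{code:python}
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

v1 = spark.range(3).toDF("m")
v2 = spark.range(3).toDF("d")
v3 = v1.join(v2, v1["m"] == v2["d"])

# Workaround 1 (from the error message): disable the ambiguity check.
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")

# Workaround 2: refer to the joined column by name on the join result
# instead of reusing the pre-join reference v3("d").
w = Window.partitionBy("d")
out = v3.select(F.col("d").alias("a"), F.sum("d").over(w).alias("b"))
out.show()
{code}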



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32546) SHOW VIEWS fails with MetaException ... ClassNotFoundException

2020-08-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32546:

Fix Version/s: 3.0.1

> SHOW VIEWS fails with MetaException ... ClassNotFoundException
> --
>
> Key: SPARK-32546
> URL: https://issues.apache.org/jira/browse/SPARK-32546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> SHOW VIEWS can fail with the error:
> {code}
> java.lang.RuntimeException: 
> MetaException(message:java.lang.ClassNotFoundException Class 
> com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:290)
> at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:486)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.convertHiveTableToCatalogTable(HiveClientImpl.scala:485)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$2(HiveClientImpl.scala:472)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$1(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTablesByName(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:873)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.$anonfun$listTablesByType$1(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:112)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.listTablesByType(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:940)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32544) Bucketing and Partitioning information are not passed on to non FileFormat datasource writes

2020-08-06 Thread Rahij Ramsharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172313#comment-17172313
 ] 

Rahij Ramsharan commented on SPARK-32544:
-

[~hyukjin.kwon] do you happen to know when that is planned to happen?

> Bucketing and Partitioning information are not passed on to non FileFormat 
> datasource writes
> 
>
> Key: SPARK-32544
> URL: https://issues.apache.org/jira/browse/SPARK-32544
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Rahij Ramsharan
>Priority: Major
>
> When writing to a FileFormat datasource, bucket spec and partition columns 
> are passed into InsertIntoHadoopFsRelationCommand: 
> [https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L474-L475].
>  
> However, from what I can tell, the RelationProvider API does not have a way 
> to pass in these: 
> [https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L511-L513].
>  
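Concretely, the bucket spec and partition columns in question are the ones supplied on the DataFrameWriter; a small illustrative sketch (table and column names are assumptions, and only FileFormat sources receive this metadata today):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "2020-08-06")], ["user_id", "dt"])

# For a FileFormat source these options reach InsertIntoHadoopFsRelationCommand;
# the RelationProvider API has no way to accept them.
(df.write
   .format("parquet")
   .partitionBy("dt")
   .bucketBy(8, "user_id")
   .sortBy("user_id")
   .saveAsTable("events_bucketed"))
{code}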



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32546) SHOW VIEWS fails with MetaException ... ClassNotFoundException

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172275#comment-17172275
 ] 

Apache Spark commented on SPARK-32546:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29379

> SHOW VIEWS fails with MetaException ... ClassNotFoundException
> --
>
> Key: SPARK-32546
> URL: https://issues.apache.org/jira/browse/SPARK-32546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> SHOW VIEWS can fail with the error:
> {code}
> java.lang.RuntimeException: 
> MetaException(message:java.lang.ClassNotFoundException Class 
> com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:290)
> at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:486)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.convertHiveTableToCatalogTable(HiveClientImpl.scala:485)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$2(HiveClientImpl.scala:472)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$1(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTablesByName(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:873)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.$anonfun$listTablesByType$1(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:112)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.listTablesByType(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:940)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30069) Clean up non-shuffle disk block manager files following executor exists on YARN

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172267#comment-17172267
 ] 

Apache Spark commented on SPARK-30069:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29378

> Clean up non-shuffle disk block manager files following executor exists on 
> YARN
> ---
>
> Key: SPARK-30069
> URL: https://issues.apache.org/jira/browse/SPARK-30069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Currently we only clean up the local directories on application removed. 
> However, when executors die and restart repeatedly, many temp files are left 
> untouched in the local directories, which is undesired behavior and could 
> cause disk space used up gradually.
> SPARK-24340 had fixed this problem in standalone mode. But in YARN mode, this 
> issue still exists. Especially, in long running service like Spark 
> thrift-server with dynamic resource allocation disabled, it's very easy 
> causes local disk full.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30069) Clean up non-shuffle disk block manager files following executor exists on YARN

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172264#comment-17172264
 ] 

Apache Spark commented on SPARK-30069:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29378

> Clean up non-shuffle disk block manager files following executor exists on 
> YARN
> ---
>
> Key: SPARK-30069
> URL: https://issues.apache.org/jira/browse/SPARK-30069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Currently we only clean up the local directories on application removed. 
> However, when executors die and restart repeatedly, many temp files are left 
> untouched in the local directories, which is undesired behavior and could 
> cause disk space used up gradually.
> SPARK-24340 had fixed this problem in standalone mode. But in YARN mode, this 
> issue still exists. Especially, in long running service like Spark 
> thrift-server with dynamic resource allocation disabled, it's very easy 
> causes local disk full.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32546) SHOW VIEWS fails with MetaException ... ClassNotFoundException

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172231#comment-17172231
 ] 

Apache Spark commented on SPARK-32546:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29377

> SHOW VIEWS fails with MetaException ... ClassNotFoundException
> --
>
> Key: SPARK-32546
> URL: https://issues.apache.org/jira/browse/SPARK-32546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> SHOW VIEWS can fail with the error:
> {code}
> java.lang.RuntimeException: 
> MetaException(message:java.lang.ClassNotFoundException Class 
> com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:290)
> at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:486)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.convertHiveTableToCatalogTable(HiveClientImpl.scala:485)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$2(HiveClientImpl.scala:472)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$1(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTablesByName(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:873)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.$anonfun$listTablesByType$1(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:112)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.listTablesByType(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:940)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.

2020-08-06 Thread Yu Gan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172226#comment-17172226
 ] 

Yu Gan edited comment on SPARK-12741 at 8/6/20, 10:38 AM:
--

Aha, I came across a similar issue. My SQL is:

select
  p_brand,
  p_size,
  count(ps_suppkey) as supplier_cnt
from
  tpch.partsupp
  inner join tpch.part
    on p_partkey = ps_partkey
group by
  p_brand,
  p_size

The total row counts are different:

dataSet.count()=1179, dataSet.rdd().count()=1178

Finally I found the root cause:

org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws 
BadRecordException; in PermissiveMode (the default mode), when a corrupted record 
exists, the resulting row is a None record. In this case, the None record gets 
filtered out.

BTW, Spark version 2.4.
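A minimal way to surface such corrupted records instead of losing them silently is the corrupt-record column of the DataFrameReader; a short sketch (the sample JSON lines are illustrative, not the reporter's data):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One malformed JSON line among valid ones.
rdd = spark.sparkContext.parallelize(
    ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}', '{"id": 3, "name"'])

# PERMISSIVE (the default) keeps the bad line and routes it to the corrupt-record column.
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(rdd))

df.cache()  # caching avoids the restriction on querying only the corrupt-record column
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)

# Alternatively, mode=FAILFAST makes the read fail loudly instead of dropping rows:
# spark.read.option("mode", "FAILFAST").json(rdd)
{code}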


was (Author: gyustorm):
Aha, I came across the similar issue. My sql is 

select
 p_brand,
 p_size,
 count(ps_suppkey) as supplier_cnt 
 from
 tpch.partsupp 
 inner join
 tpch.part 
 on p_partkey = ps_partkey 
 group by
 P_BRAND,
 p_size

the total row count are different:

dataSet.count()=1179, dataSet.rdd().count()=1178

 

Finally i found the root cause:

In org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws 
BadRecordException, when  in PermissiveMode and corrupted record exists the 
result row would be None record. In this case, the none record will be 
filtered. 

BTW, spark version 2.4 

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.

2020-08-06 Thread Yu Gan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172226#comment-17172226
 ] 

Yu Gan edited comment on SPARK-12741 at 8/6/20, 10:38 AM:
--

Aha, I came across the similar issue. My sql is 

select
 p_brand,
 p_size,
 count(ps_suppkey) as supplier_cnt 
 from
 tpch.partsupp 
 inner join
 tpch.part 
 on p_partkey = ps_partkey 
 group by
 P_BRAND,
 p_size

the total row count are different:

dataSet.count()=1179, dataSet.rdd().count()=1178

 

Finally i found the root cause:

In org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws 
BadRecordException, when  in PermissiveMode and corrupted record exists the 
result row would be None record. In this case, the none record will be 
filtered. 

BTW, spark version 2.4 


was (Author: gyustorm):
Aha, I came across the similar issue. My sql is 

select
p_brand,
p_size,
count(ps_suppkey) as supplier_cnt 
from
tpch.partsupp 
inner join
tpch.part 
on p_partkey = ps_partkey 
group by
P_BRAND,
p_size


the total row count are different:

dataSet.count()=1179, dataSet.rdd().count()=1178

 

Finally i found the root cause:

In org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws 
BadRecordException, when  in PermissiveMode and corrupted record exists the 
result row would be None record. In this case, the none record will be 
filtered. 

 

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2020-08-06 Thread Yu Gan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172226#comment-17172226
 ] 

Yu Gan commented on SPARK-12741:


Aha, I came across the similar issue. My sql is 

select
p_brand,
p_size,
count(ps_suppkey) as supplier_cnt 
from
tpch.partsupp 
inner join
tpch.part 
on p_partkey = ps_partkey 
group by
P_BRAND,
p_size


the total row count are different:

dataSet.count()=1179, dataSet.rdd().count()=1178

 

Finally i found the root cause:

In org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws 
BadRecordException, when  in PermissiveMode and corrupted record exists the 
result row would be None record. In this case, the none record will be 
filtered. 

 

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32560) improve exception message

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172219#comment-17172219
 ] 

Apache Spark commented on SPARK-32560:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/29376

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Exception message have extra single quotes, we can improve it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32560) improve exception message

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32560:


Assignee: (was: Apache Spark)

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Exception message have extra single quotes, we can improve it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32560) improve exception message

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172217#comment-17172217
 ] 

Apache Spark commented on SPARK-32560:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/29376

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Exception message have extra single quotes, we can improve it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32560) improve exception message

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32560:


Assignee: Apache Spark

> improve exception message
> -
>
> Key: SPARK-32560
> URL: https://issues.apache.org/jira/browse/SPARK-32560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: philipse
>Assignee: Apache Spark
>Priority: Minor
>
> Exception message have extra single quotes, we can improve it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-06 Thread yx91490 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yx91490 updated SPARK-32536:

Comment: was deleted

(was: it seems that the hive code in there(I use hdp-3.1.4.0-315):
{code:java}
org.spark-project.hive
hive-metastore
1.21.2.3.1.4.0-315
{code})

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when execute insert overwrite table statement to dynamic partition :
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location 
> exists before deleting it,
> and Hive can successfully execute the same SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32560) improve exception message

2020-08-06 Thread philipse (Jira)
philipse created SPARK-32560:


 Summary: improve exception message
 Key: SPARK-32560
 URL: https://issues.apache.org/jira/browse/SPARK-32560
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: philipse


The exception message has extra single quotes; we can improve it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32547) Cant able to process Timestamp 0001-01-01T00:00:00.000+0000 with TimestampType

2020-08-06 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172214#comment-17172214
 ] 

Kent Yao commented on SPARK-32547:
--

Thanks [~ManjunathHatti]. I also tested other zones and got the same results as you did, 
so I guess there is something wrong here. BUT if the results of new_df are 
intentional, maybe the API we choose to call here 
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L195 
should be datetime.datetime.*utc*fromtimestamp.
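For reference, the two calls differ only in which time zone is used to interpret the epoch value; a minimal illustration in plain Python, independent of Spark:
{code:python}
import datetime

ts = 0  # epoch seconds

# Interprets the timestamp in the machine's local time zone.
print(datetime.datetime.fromtimestamp(ts))

# Always interprets the timestamp as UTC: 1970-01-01 00:00:00.
print(datetime.datetime.utcfromtimestamp(ts))

# For timestamps far before the epoch (such as year 1), the local-time
# conversion can fall outside datetime's supported range, which would be
# consistent with the "year 0 is out of range" error above.
{code}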



> Cant able to process Timestamp 0001-01-01T00:00:00.000+ with TimestampType
> --
>
> Key: SPARK-32547
> URL: https://issues.apache.org/jira/browse/SPARK-32547
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Manjunath H
>Priority: Major
>
> Spark Version : 3.0.0
> Below is the sample code to reproduce the problem with TimestampType.
> {code:java}
> from pyspark.sql.functions import lit
> from pyspark.sql.types import TimestampType
> df=spark.createDataFrame([(1, 'foo'),(2, 'bar'),],['id', 'txt'])
> new_df=df.withColumn("test_timestamp",lit("0001-01-01T00:00:00.000+0000").cast(TimestampType()))
> new_df.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- txt: string (nullable = true)
>  |-- test_timestamp: timestamp (nullable = true)
> new_df.show()
> +---+---+-------------------+
> | id|txt|     test_timestamp|
> +---+---+-------------------+
> |  1|foo|0001-01-01 00:00:00|
> |  2|bar|0001-01-01 00:00:00|
> +---+---+-------------------+
> {code}
>  
> new_df.rdd.isEmpty() operation is failing with *year 0 is out of range*
>  
> {code:java}
> new_df.rdd.isEmpty()
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure:  
> Traceback (most recent call last):
> File "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length
>  return self.loads(obj)
>  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>  return pickle.loads(obj, encoding=encoding)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 1415, in 
>  return lambda *a: dataType.fromInternal(a)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
> fromInternal
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 447, in 
> fromInternal
>  return self.dataType.fromInternal(obj)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 201, in 
> fromInternal
>  return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts 
> % 1000000)
> ValueError: year 0 is out of range{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-06 Thread yx91490 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172207#comment-17172207
 ] 

yx91490 commented on SPARK-32536:
-

It seems this is the Hive dependency used there (I use hdp-3.1.4.0-315):
{code:java}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>1.21.2.3.1.4.0-315</version>
</dependency>
{code}

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when execute insert overwrite table statement to dynamic partition :
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location 
> exists before deleting it,
> and Hive can successfully execute the same SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32547) Cant able to process Timestamp 0001-01-01T00:00:00.000+0000 with TimestampType

2020-08-06 Thread Manjunath H (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172199#comment-17172199
 ] 

Manjunath H commented on SPARK-32547:
-

[~Qin Yao]  timezone is UTC

spark.conf.get('spark.sql.session.timeZone')
Out[2]: 'Etc/UTC'

import datetime
LOCAL_TIMEZONE = 
datetime.datetime.now(datetime.timezone(datetime.timedelta(0))).astimezone().tzinfo
print(LOCAL_TIMEZONE)
UTC

> Cant able to process Timestamp 0001-01-01T00:00:00.000+ with TimestampType
> --
>
> Key: SPARK-32547
> URL: https://issues.apache.org/jira/browse/SPARK-32547
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Manjunath H
>Priority: Major
>
> Spark Version : 3.0.0
> Below is the sample code to reproduce the problem with TimestampType.
> {code:java}
> from pyspark.sql.functions import lit
> from pyspark.sql.types import TimestampType
> df=spark.createDataFrame([(1, 'foo'),(2, 'bar'),],['id', 'txt'])
> new_df=df.withColumn("test_timestamp",lit("0001-01-01T00:00:00.000+0000").cast(TimestampType()))
> new_df.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- txt: string (nullable = true)
>  |-- test_timestamp: timestamp (nullable = true)
> new_df.show()
> +---+---+-------------------+
> | id|txt|     test_timestamp|
> +---+---+-------------------+
> |  1|foo|0001-01-01 00:00:00|
> |  2|bar|0001-01-01 00:00:00|
> +---+---+-------------------+
> {code}
>  
> new_df.rdd.isEmpty() operation is failing with *year 0 is out of range*
>  
> {code:java}
> new_df.rdd.isEmpty()
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure:  
> Traceback (most recent call last):
> File "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length
>  return self.loads(obj)
>  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>  return pickle.loads(obj, encoding=encoding)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 1415, in 
>  return lambda *a: dataType.fromInternal(a)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
> fromInternal
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 447, in 
> fromInternal
>  return self.dataType.fromInternal(obj)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 201, in 
> fromInternal
>  return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts 
> % 1000000)
> ValueError: year 0 is out of range{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30577) StorageLevel.DISK_ONLY_2 causes the data loss

2020-08-06 Thread zero222 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172177#comment-17172177
 ] 

zero222 edited comment on SPARK-30577 at 8/6/20, 9:50 AM:
--

spark-2.4.6 works normally with DISK_ONLY_2. But when spark-3.0 accesses a table 
in Hive 1.2, it gets "TApplicationException: Invalid method name: 
'get_all_functions'". Is there a good solution under Hive 1.2? Thanks!
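One direction to try (an untested sketch, not a confirmed fix for this report) is pointing Spark 3.0 at the older metastore client through its documented configs; the jar path below is a placeholder:
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.hive.metastore.version", "1.2.1")
         # Classpath holding the Hive 1.2 client jars; "/opt/hive-1.2/lib/*" is a placeholder.
         .config("spark.sql.hive.metastore.jars", "/opt/hive-1.2/lib/*")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()
{code}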


was (Author: zero222):
spark-2.4.6 can work normally with DISK_ONLY_2. But when  spark-3.0 sql  access 
the table of the hive with 1.2 verison, it get "TApplicationException: Invalid 
method name: 'get_all_functions'".  Is there a good solution Under the 
condition of hive1.2 ? Thanks !

> StorageLevel.DISK_ONLY_2 causes the data loss
> -
>
> Key: SPARK-30577
> URL: https://issues.apache.org/jira/browse/SPARK-30577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: zero222
>Priority: Major
> Attachments: DISK_ONLY_2.png
>
>
> As shown in the attachment,after I load the data of the hive table which is 
> immutable and cache the data in the level of the DISK_ONLY_2,the count value 
> of the data  is different every time.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30577) StorageLevel.DISK_ONLY_2 causes the data loss

2020-08-06 Thread zero222 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172177#comment-17172177
 ] 

zero222 commented on SPARK-30577:
-

spark-2.4.6 can work normally with DISK_ONLY_2. But when  spark-3.0 sql  access 
the table of the hive with 1.2 verison, it get "TApplicationException: Invalid 
method name: 'get_all_functions'".  Is there a good solution Under the 
condition of hive1.2 ? Thanks !

> StorageLevel.DISK_ONLY_2 causes the data loss
> -
>
> Key: SPARK-30577
> URL: https://issues.apache.org/jira/browse/SPARK-30577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: zero222
>Priority: Major
> Attachments: DISK_ONLY_2.png
>
>
> As shown in the attachment,after I load the data of the hive table which is 
> immutable and cache the data in the level of the DISK_ONLY_2,the count value 
> of the data  is different every time.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters correctly

2020-08-06 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-32559:
---
Description: 
The trim logic in the Cast expression introduced in 
[https://github.com/apache/spark/pull/26622] trims Chinese characters 
unexpectedly.

For example, the SQL select cast("1中文" as float) gives 1 instead of null.
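The report can be checked directly from PySpark; a minimal sketch (on an affected 3.0.0 build the quoted behaviour would show 1.0, while a fixed build should return null):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The cast input contains trailing Chinese characters after the digit.
spark.sql('select cast("1中文" as float) as casted').show()
{code}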

 

  was:
The trim logic in Cast expression introduced in 
[https://github.com/apache/spark/pull/26622] will trim chinese characters 
unexpectly.

For example, 

!image-2020-08-06-17-01-48-646.png!

 


> Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters 
> correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> The trim logic in Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] will trim chinese characters 
> unexpectly.
> For example,  sql  select cast("1中文" as float) gives 1 instead of null
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters correctly

2020-08-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172153#comment-17172153
 ] 

Apache Spark commented on SPARK-32559:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29375

> Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters 
> correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> The trim logic in Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] will trim chinese characters 
> unexpectly.
> For example, 
> !image-2020-08-06-17-01-48-646.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters correctly

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32559:


Assignee: (was: Apache Spark)

> Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters 
> correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> The trim logic in Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] will trim chinese characters 
> unexpectly.
> For example, 
> !image-2020-08-06-17-01-48-646.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters correctly

2020-08-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32559:


Assignee: Apache Spark

> Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters 
> correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> The trim logic in Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] will trim chinese characters 
> unexpectly.
> For example, 
> !image-2020-08-06-17-01-48-646.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters correctly

2020-08-06 Thread EdisonWang (Jira)
EdisonWang created SPARK-32559:
--

 Summary: Fix the trim logic in UTF8String.toInt/toLong didn't 
handle Chinese characters correctly
 Key: SPARK-32559
 URL: https://issues.apache.org/jira/browse/SPARK-32559
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


The trim logic in Cast expression introduced in 
[https://github.com/apache/spark/pull/26622] trims Chinese characters 
unexpectedly.

For example, 

!image-2020-08-06-17-01-48-646.png!
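For reference, the intended behaviour can be checked from a Spark shell (a sketch of expected results; exact output depends on the build):

{code:scala}
// ASCII whitespace should still be trimmed before parsing, but non-ASCII
// characters should make the cast yield NULL instead of being silently dropped.
spark.sql("SELECT cast(' 123 ' AS int)").show()   // 123
spark.sql("SELECT cast('1中文' AS int)").show()    // NULL once fixed (1 before the fix)
{code}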

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
spark shell .

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
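(As a quick sanity check, the file just written can be read back with Spark 3.0's own ORC reader; the path is the same one used above.)

{code:scala}
// The data itself is intact when read with Spark's reader, which suggests the
// failure below is specific to older ORC readers such as Hive 2.1.1's orcfiledump.
val verify = spark.read.orc("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
verify.printSchema()   // col1: string, col2: string
verify.show()          // one row: col1val1, col2val1
{code}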

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
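The top frame points at the reader's writer-version lookup. A minimal sketch of why an old reader fails this way (assumed behaviour for illustration, not the actual ORC source): the reader resolves the writer-version id recorded in the file footer by indexing into the list of versions it knows, so an id written by a newer library falls off the end of that list.

{code:scala}
// Illustrative only: a reader that knows writer versions 0..6 and indexes by id
// raises java.lang.ArrayIndexOutOfBoundsException: 7 for a file written with id 7.
object WriterVersionSketch {
  private val known = Array("ORIGINAL", "HIVE_8732", "HIVE_4243",
    "HIVE_12055", "HIVE_13083", "ORC_101", "ORC_135")
  def from(id: Int): String = known(id)
  def main(args: Array[String]): Unit = {
    println(from(6))   // ORC_135 -- resolves fine
    println(from(7))   // throws ArrayIndexOutOfBoundsException: 7
  }
}
{code}

A newer ORC reader (for example a recent orc-tools `meta` run, or Spark itself) can resolve such footers; the Hive 2.1.1 orcfiledump used here ships an older reader, which is why it fails.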

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
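(For completeness, the same option can also be supplied when the shell is launched instead of via SET; equivalent usage, shown as a sketch.)

{code:scala}
// Equivalent to running SET spark.sql.orc.impl=hive inside the session:
//   ./bin/spark-shell --conf spark.sql.orc.impl=hive
spark.conf.get("spark.sql.orc.impl")   // hive
{code}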

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
shell .

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.  (was: 
Spark 3.0 on Linux and Hadoop cluster having Hive_2.1.1 version.)

> ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.
>Reporter: Ramakrishna Prasad K S
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
>  
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
> spark shell .
>  
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> 
> 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. (Linux 
Redhat)  (was: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.)

> ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
>  
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
> spark shell .
>  
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from 
spark_3.0 spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
spark shell .

 

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from 
spark_3.0 spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from 
spark_3.0 spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at 

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from 
spark_3.0 spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at 

[jira] [Resolved] (SPARK-32546) SHOW VIEWS fails with MetaException ... ClassNotFoundException

2020-08-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32546.
-
Fix Version/s: 3.1.0
 Assignee: Maxim Gekk
   Resolution: Fixed

> SHOW VIEWS fails with MetaException ... ClassNotFoundException
> --
>
> Key: SPARK-32546
> URL: https://issues.apache.org/jira/browse/SPARK-32546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> SHOW VIEWS can fail with the error:
> {code}
> java.lang.RuntimeException: 
> MetaException(message:java.lang.ClassNotFoundException Class 
> com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:290)
> at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:486)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.convertHiveTableToCatalogTable(HiveClientImpl.scala:485)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$2(HiveClientImpl.scala:472)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTablesByName$1(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTablesByName(HiveClientImpl.scala:472)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:873)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:349)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:252)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:288)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:331)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.$anonfun$listTablesByType$1(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:112)
> at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.listTablesByType(PoolingHiveClient.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:940)
> {code}
>  
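A common workaround sketch (not the change that resolved this ticket, and the jar path below is hypothetical) is to make the missing SerDe class visible to the session before listing views, which may avoid the ClassNotFoundException:

{code:scala}
// Either launch with the SerDe jar on the classpath:
//   ./bin/spark-shell --jars /path/to/hivexmlserde.jar
// or register it from a running session, then retry:
spark.sql("ADD JAR /path/to/hivexmlserde.jar")
spark.sql("SHOW VIEWS").show(truncate = false)
{code}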



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from 
spark_3.0 spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now create the ORC file using the Hive ORC implementation, as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html], by setting spark.sql.orc.impl to hive.
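A minimal sketch of applying that workaround before any ORC file is written (not part of the original report; the only configuration key assumed is spark.sql.orc.impl, which the migration guide documents):

{code:java}
// Either select the Hive ORC implementation at shell startup:
//   ./spark-shell --conf spark.sql.orc.impl=hive
// or switch inside an already running session, before writing any ORC file:
spark.conf.set("spark.sql.orc.impl", "hive")

// Sanity check: should print "hive".
println(spark.conf.get("spark.sql.orc.impl"))
{code}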

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2 specified for non-external table:df_table2
res5: org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example, CDH_6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails with the same exception, even after applying the workaround suggested by Spark of setting spark.sql.orc.impl to hive.
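For contrast with the hive --orcfiledump failure shown below, a hedged sanity check that is not part of the original report: Spark itself should be able to read the same output back, which would indicate the file is intact and the incompatibility sits in the older ORC reader bundled with Hive_2.1.1. The path is the illustrative one used in Step 3.

{code:java}
// Read the ORC output from Step 3 back with Spark's own reader.
val checked = spark.read.orc("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
checked.printSchema()
checked.show()   // expected, given the steps above: a single row col1val1 / col2val1
{code}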

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at 



[jira] [Commented] (SPARK-32547) Cant able to process Timestamp 0001-01-01T00:00:00.000+0000 with TimestampType

2020-08-06 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172127#comment-17172127
 ] 

Kent Yao commented on SPARK-32547:
--


{code:java}
>>> spark.sql("select timestamp '0001-01-01 00:00:00+'")
DataFrame[TIMESTAMP '0001-01-01 00:00:00': timestamp]
>>> spark.sql("select timestamp '0001-01-01 00:00:00'")
DataFrame[TIMESTAMP '0001-01-01 00:00:00': timestamp]
>>>
{code}

With `pyspark` shell, the result seemed to ignore the zone I specified.
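A hedged illustration of that point, not taken from the comment above: how a timestamp with an explicit zone is rendered depends on spark.sql.session.timeZone, so pinning the session time zone makes the displayed value reproducible. Only documented configuration keys and plain Spark SQL are assumed, and this does not by itself address the Python-side conversion error quoted below.

{code:java}
// The literal fixes the instant (UTC); show() renders it in the session time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("select timestamp '0001-01-01 00:00:00Z' as ts").show(false)

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("select timestamp '0001-01-01 00:00:00Z' as ts").show(false)
// Same stored instant, different rendered wall-clock value.
{code}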


> Cant able to process Timestamp 0001-01-01T00:00:00.000+0000 with TimestampType
> --
>
> Key: SPARK-32547
> URL: https://issues.apache.org/jira/browse/SPARK-32547
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Manjunath H
>Priority: Major
>
> Spark Version : 3.0.0
> Below is the sample code to reproduce the problem with TimestampType.
> {code:java}
> from pyspark.sql.functions import lit
> from pyspark.sql.types import TimestampType
> df=spark.createDataFrame([(1, 'foo'),(2, 'bar'),],['id', 'txt'])
> new_df=df.withColumn("test_timestamp",lit("0001-01-01T00:00:00.000+0000").cast(TimestampType()))
> new_df.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- txt: string (nullable = true)
>  |-- test_timestamp: timestamp (nullable = true)
> new_df.show()
> +---+---+-------------------+
> | id|txt|     test_timestamp|
> +---+---+-------------------+
> |  1|foo|0001-01-01 00:00:00|
> |  2|bar|0001-01-01 00:00:00|
> +---+---+-------------------+
> {code}
>  
> new_df.rdd.isEmpty() operation is failing with *year 0 is out of range*
>  
> {code:java}
> new_df.rdd.isEmpty()
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure:  
> Traceback (most recent call last):
> File "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length
>  return self.loads(obj)
>  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>  return pickle.loads(obj, encoding=encoding)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 1415, in 
>  return lambda *a: dataType.fromInternal(a)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
> fromInternal
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 447, in 
> fromInternal
>  return self.dataType.fromInternal(obj)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 201, in 
> fromInternal
>  return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts 
> % 100)
> ValueError: year 0 is out of range{code}
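To make the quoted ValueError easier to follow, a hedged illustration using only java.time, with no Spark APIs; the epoch constant below is an assumed value for 0001-01-01T00:00:00Z and is used purely for illustration. The instant itself is representable on the JVM, so the failure is specific to converting the internal microsecond value back into a Python datetime, whose minimum year is 1.

{code:java}
import java.time.{Instant, ZoneOffset}

// Assumed: epoch seconds for 0001-01-01T00:00:00Z (illustration only).
val epochSeconds0001 = -62135596800L
val instant = Instant.ofEpochSecond(epochSeconds0001)

println(instant.atOffset(ZoneOffset.UTC))          // 0001-01-01T00:00Z
// Shifting west of UTC lands before year 1; java.time can still represent
// that (year 0), but Python's datetime cannot -- hence "year 0 is out of range".
println(instant.atOffset(ZoneOffset.ofHours(-8)))  // 0000-12-31T16:00-08:00
{code}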



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

 

 

Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html]

Step 1) Create an ORC file using the default Spark_3.0 native ORC implementation from the spark_3.0 spark-shell.
 Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')") 
20/08/04 22:40:18 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException res2: org.apache.spark.sql.DataFrame = []
 scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show() -+ |    col1|    col2| + 
|col1val1|col2val1| -+

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example, CDH_6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails to fetch the metadata.


[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now create the ORC file using the Hive ORC implementation, as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html], by setting spark.sql.orc.impl to hive.


 scala> spark.sql("set spark.sql.orc.impl=hive") res6: 
org.apache.spark.sql.DataFrame = [key: string, value: string]
 scala> spark.sql("set spark.sql.orc.impl").show() +---++ | 
              key|value| ++---+ |spark.sql.orc.impl| hive| 
++---+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 
22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []
 scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")


Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example, CDH_6.x) and run the following command to read the metadata from the ORC file. As shown below, it fails with the same exception, even after applying the workaround suggested by Spark of setting spark.sql.orc.impl to hive.

[jira] [Comment Edited] (SPARK-32547) Cant able to process Timestamp 0001-01-01T00:00:00.000+0000 with TimestampType

2020-08-06 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172121#comment-17172121
 ] 

Kent Yao edited comment on SPARK-32547 at 8/6/20, 8:25 AM:
---


{code:java}
new_df.show()
+---+---+-------------------+
| id|txt|     test_timestamp|
+---+---+-------------------+
|  1|foo|0001-01-01 00:00:00|
|  2|bar|0001-01-01 00:00:00|
+---+---+-------------------+
{code}



The result here seems to be wrong already if you are not in the time zone '+0000'.


was (Author: qin yao):
 new_df.show()
+---+---+-------------------+
| id|txt|     test_timestamp|
+---+---+-------------------+
|  1|foo|0001-01-01 00:00:00|
|  2|bar|0001-01-01 00:00:00|
+---+---+-------------------+


The result here seems to be wrong already if you are not in the time zone '+0000'.
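A small hedged check, not from the comment, for anyone comparing outputs across machines: confirm which time zone Spark is actually rendering in, since the displayed wall-clock value follows it.

{code:java}
// Both the SQL session time zone and the JVM default matter when comparing show() output.
println(spark.conf.get("spark.sql.session.timeZone"))  // defaults to the JVM zone unless overridden
println(java.util.TimeZone.getDefault.getID)
{code}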

> Cant able to process Timestamp 0001-01-01T00:00:00.000+0000 with TimestampType
> --
>
> Key: SPARK-32547
> URL: https://issues.apache.org/jira/browse/SPARK-32547
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Manjunath H
>Priority: Major
>
> Spark Version : 3.0.0
> Below is the sample code to reproduce the problem with TimestampType.
> {code:java}
> from pyspark.sql.functions import lit
> from pyspark.sql.types import TimestampType
> df=spark.createDataFrame([(1, 'foo'),(2, 'bar'),],['id', 'txt'])
> new_df=df.withColumn("test_timestamp",lit("0001-01-01T00:00:00.000+0000").cast(TimestampType()))
> new_df.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- txt: string (nullable = true)
>  |-- test_timestamp: timestamp (nullable = true)
> new_df.show()
> +---+---+-------------------+
> | id|txt|     test_timestamp|
> +---+---+-------------------+
> |  1|foo|0001-01-01 00:00:00|
> |  2|bar|0001-01-01 00:00:00|
> +---+---+-------------------+
> {code}
>  
> new_df.rdd.isEmpty() operation is failing with *year 0 is out of range*
>  
> {code:java}
> new_df.rdd.isEmpty()
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure:  
> Traceback (most recent call last):
> File "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length
>  return self.loads(obj)
>  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>  return pickle.loads(obj, encoding=encoding)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 1415, in 
>  return lambda *a: dataType.fromInternal(a)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
> fromInternal
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 635, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)]
>  File "/databricks/spark/python/pyspark/sql/types.py", line 447, in 
> fromInternal
>  return self.dataType.fromInternal(obj)
>  File "/databricks/spark/python/pyspark/sql/types.py", line 201, in 
> fromInternal
>  return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts 
> % 100)
> ValueError: year 0 is out of range{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


