[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707125#comment-17707125 ] Ranga Reddy commented on SPARK-42760:

From Spark version 3.2.0 onwards, AQE is enabled by default; in versions earlier than 3.2.0 it is disabled.

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard spark 3.0.3/3.3.2, used in a Jupyter notebook, local mode
>            Reporter: binyang
>            Priority: Major
>
> I am using PySpark. The number of partitions of the result DataFrame of a join is always 1.
> Here is my code, from https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
>
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
>
> example_shuffle_partitions()
>
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2:
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
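Since AQE is on by default from 3.2.0, its partition coalescing merges the small post-shuffle partitions, which is why the join result ends up with 1 partition here instead of spark.sql.shuffle.partitions. A sketch of settings that should restore the pre-3.2 behaviour (both properties exist in Spark 3.2+; verify against your version's configuration docs):

{code}
# spark-defaults.conf (or --conf): keep AQE but stop it from coalescing
# post-shuffle partitions, so spark.sql.shuffle.partitions is honoured again
spark.sql.adaptive.coalescePartitions.enabled=false

# or disable AQE entirely (enabled by default from 3.2.0 onwards)
spark.sql.adaptive.enabled=false
{code}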
[jira] [Commented] (SPARK-32893) Structured Streaming and Dynamic Allocation on StandaloneCluster
[ https://issues.apache.org/jira/browse/SPARK-32893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703182#comment-17703182 ] Ranga Reddy commented on SPARK-32893:

A similar issue was created for [Kubernetes|https://issues.apache.org/jira/issues/?jql=project+%3D+SPARK+AND+component+%3D+Kubernetes] - https://issues.apache.org/jira/browse/SPARK-35625

> Structured Streaming and Dynamic Allocation on StandaloneCluster
> ----------------------------------------------------------------
>
>                 Key: SPARK-32893
>                 URL: https://issues.apache.org/jira/browse/SPARK-32893
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.1
>            Reporter: Duarte Ferreira
>            Priority: Major
>
> We are currently using a Spark 3.0.1 Standalone cluster to run our Structured Streaming applications.
> We set the following configurations when running the application in cluster mode:
> * spark.dynamicAllocation.enabled = true
> * spark.shuffle.service.enabled = true
> * spark.cores.max = 5
> * spark.executor.memory = 1G
> * spark.executor.cores = 1
> We also have the configuration set to enable spark.shuffle.service.enabled on each worker, and the cluster is composed of 1 master and 2 slaves.
> The application reads data from a Kafka topic (readTopic) following [this documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html], applies some transformations on the DataSet using Spark SQL, and writes data to another Kafka topic (writeTopic).
> When we start the application it behaves correctly: it starts with 0 executors and, as we start feeding data to readTopic, it increases the number of executors until it reaches the 5-executor limit, and all messages are transformed and written to writeTopic in Kafka.
> If we stop feeding messages to readTopic, the application works as expected and starts killing executors that are no longer needed, until we stop sending data completely and it reaches 0 running executors.
> If we start sending data again right away, it behaves just as expected: it starts increasing the number of executors again. But if we leave the application idle at 0 executors for around 10 minutes, we start getting errors like this:
> {noformat}
> 20/09/15 10:41:22 ERROR TransportClient: Failed to send RPC RPC 7570256331800450365 to sparkmaster/10.0.12.231:7077: java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connection reset by peer
> 	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> 	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> 	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> 	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> 	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
> 	at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:148)
> 	at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:362)
> 	at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:235)
> 	at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:209)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:400)
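For reference, the dynamic-allocation settings described in the report, collected as a single config fragment (values exactly as reported; the commented-out idle-timeout setting is not from the report, but it is the standard property that controls how quickly idle executors are removed):

{code}
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.cores.max=5
spark.executor.memory=1G
spark.executor.cores=1
# Not set in the report; defaults to 60s and governs executor scale-down:
# spark.dynamicAllocation.executorIdleTimeout=60s
{code}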
[jira] [Commented] (SPARK-32893) Structured Streaming and Dynamic Allocation on StandaloneCluster
[ https://issues.apache.org/jira/browse/SPARK-32893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703177#comment-17703177 ] Ranga Reddy commented on SPARK-32893:

I can see similar behaviour with Spark Structured Streaming on YARN.

{code:java}
2023-03-14 18:17:29 ERROR TransportClient:337 - Failed to send RPC RPC 7955407071046657873 to /127.0.0.1:50040: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
{code}

> Structured Streaming and Dynamic Allocation on StandaloneCluster > > > Key: SPARK-32893 > URL: https://issues.apache.org/jira/browse/SPARK-32893 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Duarte Ferreira >Priority: Major > > We are currently using Spark 3.0.1 Standalone cluster to run 
our Structured > streaming applications. > We set the following configurations when running the application in cluster > mode: > * spark.dynamicAllocation.enabled = true > * spark.shuffle.service.enabled = true > * spark.cores.max =5 > * spark.executor.memory = 1G > * spark.executor.cores = 1 > We also have the configurations set to enable spark.shuffle.service.enabled > on each worker and have a cluster composed of 1 master and 2 slaves. > The application reads data from a kafka Topic (readTopic) using [This > documentation, > |https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]applies > some transformations on the DataSet using spark SQL and writes data to > another Kafka Topic (writeTopic). > When we start the application it behaves correctly, it starts with 0 > executors and. as we start feeding data to the readTopic, it starts > increasing the number of executors until it reaches the 5 executors limit and > all messages are transformed and written to the writeTopic in Kafka. > If we stop feeding messages to the readTopic the application will work as > expected and starts killing executors that are not needed anymore until we > stop sending data completely and it reach 0 executors running. > If we start sending data again right away, it behaves just as expected it > starts increasing the numbers of executors again. 
But if we leave the > application in idle at 0 executors for around 10 minutes we start getting > errors like this: > {noformat} > *no* > 20/09/15 10:41:22 ERROR TransportClient: Failed to send RPC RPC > 7570256331800450365 to sparkmaster/10.0.12.231:7077: > java.nio.channels.ClosedChannelException > java.nio.channels.ClosedChannelException > at > io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) > at > io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop
[jira] [Commented] (SPARK-28869) Roll over event log files
[ https://issues.apache.org/jira/browse/SPARK-28869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643804#comment-17643804 ] Ranga Reddy commented on SPARK-28869:

Hi [~kabhwan], I have enabled event log rolling for the Spark Streaming network word count example, but the event log files are not compacted.

*Configuration Parameters:*
{code:java}
spark.eventLog.rolling.enabled=true
spark.eventLog.rolling.maxFileSize=10m
spark.history.fs.eventLog.rolling.maxFilesToRetain=2
spark.history.fs.cleaner.interval=1800{code}

*Event log file list:* [^application_1670216197043_0012.log]

Could you please check the issue?

> Roll over event log files
> -------------------------
>
>                 Key: SPARK-28869
>                 URL: https://issues.apache.org/jira/browse/SPARK-28869
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: application_1670216197043_0012.log
>
> This issue tracks the effort on rolling over event log files in the driver and letting SHS replay the multiple event log files correctly.
> This issue doesn't deal with the overall size of the event log, and makes no guarantee about when old event log files are deleted.
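For context, a sketch of where each of the settings above is read, assuming a default deployment: the spark.eventLog.rolling.* properties are application-side (the driver writes the rolled files), while the spark.history.fs.* properties are read by the Spark History Server, which is the component that performs compaction. If the SHS never replays the application, no compaction is expected.

{code}
# Application (driver) side: roll event logs at ~10 MB per file
spark.eventLog.rolling.enabled=true
spark.eventLog.rolling.maxFileSize=10m

# Spark History Server side: compact once more than 2 rolled files accumulate
spark.history.fs.eventLog.rolling.maxFilesToRetain=2
spark.history.fs.cleaner.interval=1800
{code}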
[jira] [Updated] (SPARK-28869) Roll over event log files
[ https://issues.apache.org/jira/browse/SPARK-28869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranga Reddy updated SPARK-28869: Attachment: application_1670216197043_0012.log > Roll over event log files > - > > Key: SPARK-28869 > URL: https://issues.apache.org/jira/browse/SPARK-28869 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > Attachments: application_1670216197043_0012.log > > > This issue tracks the effort on rolling over event log files in driver and > let SHS replay the multiple event log files correctly. > This issue doesn't deal with overall size of event log, as well as no > guarantee when deleting old event log files.
[jira] [Commented] (SPARK-40958) Support data masking built-in function 'null'
[ https://issues.apache.org/jira/browse/SPARK-40958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634142#comment-17634142 ] Ranga Reddy commented on SPARK-40958: - I will work on this Jira. > Support data masking built-in function 'null' > - > > Key: SPARK-40958 > URL: https://issues.apache.org/jira/browse/SPARK-40958 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > This can be a simple function that returns a NULL value of the given input > type.
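A rough illustration of the intended semantics, not the eventual API (the function does not exist yet, so the snippet below only shows the equivalent typed-NULL behaviour): masking a column should yield a NULL that keeps the column's original data type, like a CAST of NULL to that type.

{code:sql}
-- Illustration only; the masking built-in is not implemented yet.
-- Masking an INT column should behave like a NULL that is still an INT:
SELECT CAST(NULL AS INT)    AS masked_int_column;
SELECT CAST(NULL AS STRING) AS masked_string_column;
{code}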
[jira] [Commented] (SPARK-40686) Support data masking and redacting built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633668#comment-17633668 ] Ranga Reddy commented on SPARK-40686: - Hi [~dtenedor], I am not able to find further references/examples for the *phi-related* functions except the Okera website. Could you please share some references so that I can start working on those functions? > Support data masking and redacting built-in functions > - > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking and redacting functions
[jira] [Updated] (SPARK-40988) Test case for insert partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranga Reddy updated SPARK-40988: Priority: Minor (was: Major)

> Test case for insert partition should verify value
> --------------------------------------------------
>
>                 Key: SPARK-40988
>                 URL: https://issues.apache.org/jira/browse/SPARK-40988
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>            Reporter: Ranga Reddy
>            Priority: Minor
>
> Spark 3 does not validate the partition column type while inserting data, but on the Hive side an exception is thrown when inserting a value of a different type.
> *Spark Code:*
> {code:java}
> scala> val tableName = "test_partition_table"
> tableName: String = test_partition_table
>
> scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("SHOW tables").show(truncate=false)
> +---------+--------------------+-----------+
> |namespace|tableName           |isTemporary|
> +---------+--------------------+-----------+
> |default  |test_partition_table|false      |
> +---------+--------------------+-----------+
>
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, false)
> +------------------------------------------+-----+
> |key                                       |value|
> +------------------------------------------+-----+
> |spark.sql.sources.validatePartitionColumns|true |
> +------------------------------------------+-----+
>
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +---------+
> |partition|
> +---------+
> |age=25   |
> +---------+
>
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-----+---+
> |id |name |age|
> +---+-----+---+
> |1  |Ranga|25 |
> +---+-----+---+
>
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql(s"show partitions $tableName").show(50, false)
> +------------+
> |partition   |
> +------------+
> |age=25      |
> |age=test_age|
> +------------+
>
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+--------+----+
> |id |name    |age |
> +---+--------+----+
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+--------+----+ {code}
> *Hive Code:*
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 10248]: Cannot add partition column age of type string as it cannot be converted to type int (state=42000,code=10248){code}
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the partition value needs to be validated, and an exception should be thrown if a value of the wrong data type is provided.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]
[jira] [Updated] (SPARK-40988) Test case for insert partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranga Reddy updated SPARK-40988: Summary: Test case for insert partition should verify value (was: Spark3 partition column value is not validated with user provided schema.)
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633488#comment-17633488 ] Ranga Reddy commented on SPARK-40798:

Hi [~ulysses], I will go ahead, add a test case for the *insert* operation, and update the Jira description.

> Alter partition should verify value
> -----------------------------------
>
>                 Key: SPARK-40798
>                 URL: https://issues.apache.org/jira/browse/SPARK-40798
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: XiDuo You
>            Assignee: XiDuo You
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}
[jira] [Comment Edited] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633209#comment-17633209 ] Ranga Reddy edited comment on SPARK-40798 at 11/14/22 3:11 AM:

Hi [~ulysses]

The current Jira will solve both the *Insert* and *Alter* partition values.

{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/test_partition_value_tbl';

-- This DDL should fail but worked: - SPARK-40798
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa');

-- This DML should fail but worked: - SPARK-40988
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 'ABC');
{code}

Can we rename the *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update its description?

*Note:* There is an existing *spark.sql.sources.validatePartitionColumns* parameter, used when reading partition values. Can we just reuse that variable?

After that, I will commit my test case.

was (Author: rangareddy.av...@gmail.com):

Hi [~ulysses]

The current Jira will solve both the *Insert* and *Alter* partition values.

{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/test_partition_value_tbl';

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa');

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 'ABC');
{code}

Can we rename *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update the description of this variable?

*Note:* We have a *spark.sql.sources.validatePartitionColumns* parameter and used while reading the partition value. This variable only we can reuse?

After that, I will commit my test case.
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633471#comment-17633471 ] Ranga Reddy commented on SPARK-40798:

Hi [~ulysses]

*SPARK-40988* was created for validating the value type while inserting a partition, and *SPARK-40798* was created for validating the value while altering a partition. *SPARK-40798* will fix *partition value validation* for both *insert* and *alter* partitions. So, can we use the *same configuration* parameter for both *Insert* and *Alter* partitions?

> Alter partition should verify value
> -----------------------------------
>
>                 Key: SPARK-40798
>                 URL: https://issues.apache.org/jira/browse/SPARK-40798
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: XiDuo You
>            Assignee: XiDuo You
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}
[jira] [Comment Edited] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633209#comment-17633209 ] Ranga Reddy edited comment on SPARK-40798 at 11/13/22 3:59 AM:

Hi [~ulysses]

The current Jira will solve both the *Insert* and *Alter* partition values.

{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/test_partition_value_tbl';

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa');

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 'ABC');
{code}

Can we rename the *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update its description?

*Note:* There is an existing *spark.sql.sources.validatePartitionColumns* parameter, used when reading partition values. Can we just reuse that variable?

After that, I will commit my test case.

was (Author: rangareddy.av...@gmail.com):

Hi [~ulysses]

The current Jira will solve both the *Insert* and *Alter* partition values.

{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/test_partition_value_tbl'

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa');

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 'ABC')
{code}

Can we rename *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update the description of this variable?

After that, I will commit my test case.
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633209#comment-17633209 ] Ranga Reddy commented on SPARK-40798:

Hi [~ulysses]

The current Jira will solve both the *Insert* and *Alter* partition values.

{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/test_partition_value_tbl'

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa');

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 'ABC')
{code}

Can we rename the *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update its description? After that, I will commit my test case.

> Alter partition should verify value
> -----------------------------------
>
>                 Key: SPARK-40798
>                 URL: https://issues.apache.org/jira/browse/SPARK-40798
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: XiDuo You
>            Assignee: XiDuo You
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632106#comment-17632106 ] Ranga Reddy commented on SPARK-40798:

The issue below is addressed by the current Jira. Can I add a test case for it in InsertSuite.scala? https://issues.apache.org/jira/browse/SPARK-40988

> Alter partition should verify value
> -----------------------------------
>
>                 Key: SPARK-40798
>                 URL: https://issues.apache.org/jira/browse/SPARK-40798
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: XiDuo You
>            Assignee: XiDuo You
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}
[jira] [Commented] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632102#comment-17632102 ] Ranga Reddy commented on SPARK-40988: - The following Jira will resolve the issue by throwing the *CAST_INVALID_INPUT* error. https://issues.apache.org/jira/browse/SPARK-40798 > Spark3 partition column value is not validated with user provided schema. > - > > Key: SPARK-40988 > URL: https://issues.apache.org/jira/browse/SPARK-40988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Ranga Reddy >Priority: Major > > Spark3 has not validated the Partition Column type while inserting the data > but on the Hive side exception is thrown while inserting different type > values. > *Spark Code:* > > {code:java} > scala> val tableName="test_partition_table" > tableName: String = test_partition_table > scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) > PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("SHOW tables").show(truncate=false) > +-+-+---+ > |namespace|tableName |isTemporary| > +-+-+---+ > |default |test_partition_table |false | > +-+-+---+ > scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, > false) > +--+-+ > |key |value| > +--+-+ > |spark.sql.sources.validatePartitionColumns|true | > +--+-+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, > 'Ranga')""") > res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > +-+ > |partition| > +-+ > |age=25 | > +-+ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+-+---+ > |id |name |age| > +---+-+---+ > |1 |Ranga|25 | > +---+-+---+ > scala> spark.sql(s"""INSERT 
INTO $tableName partition (age=\"test_age\") > VALUES (2, 'Nishanth')""") > res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > ++ > |partition | > ++ > |age=25 | > |age=test_age| > ++ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+++ > |id |name |age | > +---+++ > |1 |Ranga |25 | > |2 |Nishanth|null| > +---+++ {code} > *Hive Code:* > > > {code:java} > > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, > > 'Nishanth'); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10248]: Cannot add partition column age of type string as it cannot be > converted to type int (state=42000,code=10248){code} > > *Expected Result:* > When *spark.sql.sources.validatePartitionColumns=true* it needs to be > validated the datatype value and exception needs to be thrown if we provide > wrong data type value. > *Reference:* > [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632073#comment-17632073 ] Ranga Reddy commented on SPARK-40988: - In Spark 3.4, if we run the following code we can see the *CAST_INVALID_INPUT* exception. {code:java} spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") VALUES (2, 'Nishanth')"""){code} *Exception:* {code:java} [CAST_INVALID_INPUT] The value 'AGE_34' of the type "STRING" cannot be cast to "INT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 1) == INSERT INTO TABLE partition_table PARTITION(age="AGE_34") VALUES (1, 'ABC') ^org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value 'AGE_34' of the type "STRING" cannot be cast to "INT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 
== SQL(line 1, position 1) == INSERT INTO TABLE partition_table PARTITION(age="AGE_34") VALUES (1, 'ABC') ^ at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidInputInCastToNumberError(QueryExecutionErrors.scala:161) at org.apache.spark.sql.catalyst.util.UTF8StringUtils$.withException(UTF8StringUtils.scala:51) at org.apache.spark.sql.catalyst.util.UTF8StringUtils$.toIntExact(UTF8StringUtils.scala:34) at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$2(Cast.scala:927) at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$2$adapted(Cast.scala:927) at org.apache.spark.sql.catalyst.expressions.Cast.buildCast(Cast.scala:588) at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$1(Cast.scala:927) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:1285) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:526) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:522) at org.apache.spark.sql.util.PartitioningUtils$.normalizePartitionStringValue(PartitioningUtils.scala:56) at org.apache.spark.sql.util.PartitioningUtils$.$anonfun$normalizePartitionSpec$1(PartitioningUtils.scala:100) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.util.PartitioningUtils$.normalizePartitionSpec(PartitioningUtils.scala:76) at 
org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:382) at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:426) at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:420) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96) a
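The two behaviours named in the error message can be illustrated with a plain-Python stand-in (this is not Spark code; both helpers are hypothetical): an ANSI-style cast raises on malformed input, while `try_cast` tolerates it and returns NULL (`None`) instead.

```python
# Plain-Python stand-in for the two cast behaviours in the error message.

def ansi_cast_to_int(value):
    # Strict ANSI mode: malformed input is an error, like CAST_INVALID_INPUT.
    try:
        return int(value)
    except ValueError:
        raise ValueError(
            f"The value '{value}' of the type STRING cannot be cast to INT "
            "because it is malformed."
        ) from None

def try_cast_to_int(value):
    # try_cast semantics: malformed input becomes NULL (None), row is kept.
    try:
        return int(value)
    except ValueError:
        return None

print(try_cast_to_int("AGE_34"))  # None
print(ansi_cast_to_int("25"))     # 25
```

This is why, before the fix, `age="test_age"` produced a NULL `age` in the table: the lenient path was taken where the strict one was expected.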
[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627704#comment-17627704 ] Ranga Reddy commented on SPARK-32380: - The pull request below should solve the issue, but we need to check whether there are any other issues. [https://github.com/apache/spark/pull/29178] > sparksql cannot access hive table while data in hbase > - > > Key: SPARK-32380 > URL: https://issues.apache.org/jira/browse/SPARK-32380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: ||component||version|| > |hadoop|2.8.5| > |hive|2.3.7| > |spark|3.0.0| > |hbase|1.4.9| >Reporter: deyzhong >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > * step1: create hbase table > {code:java} > hbase(main):001:0>create 'hbase_test1', 'cf1' > hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123' > {code} > * step2: create hive table related to hbase table > > {code:java} > hive> > CREATE EXTERNAL TABLE `hivetest.hbase_test`( > `key` string COMMENT '', > `value` string COMMENT '') > ROW FORMAT SERDE > 'org.apache.hadoop.hive.hbase.HBaseSerDe' > STORED BY > 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ( > 'hbase.columns.mapping'=':key,cf1:v1', > 'serialization.format'='1') > TBLPROPERTIES ( > 'hbase.table.name'='hbase_test') > {code} > * step3: sparksql query hive table while data in hbase > {code:java} > spark-sql --master yarn -e "select * from hivetest.hbase_test" > {code} > > The error log as follow: > java.io.IOException: Cannot create a record reader because of a previous > error. Please look at the previous logs lines from the task's full log for > more details. 
> at > org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1003) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterato
[jira] [Commented] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627197#comment-17627197 ] Ranga Reddy commented on SPARK-40988: - I will work on this issue. > Spark3 partition column value is not validated with user provided schema. > - > > Key: SPARK-40988 > URL: https://issues.apache.org/jira/browse/SPARK-40988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Ranga Reddy >Priority: Major > Fix For: 3.4.0 > > > Spark3 has not validated the Partition Column type while inserting the data > but on the Hive side exception is thrown while inserting different type > values. > *Spark Code:* > > {code:java} > scala> val tableName="test_partition_table" > tableName: String = test_partition_table > scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) > PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("SHOW tables").show(truncate=false) > +-+-+---+ > |namespace|tableName |isTemporary| > +-+-+---+ > |default |test_partition_table |false | > +-+-+---+ > scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, > false) > +--+-+ > |key |value| > +--+-+ > |spark.sql.sources.validatePartitionColumns|true | > +--+-+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, > 'Ranga')""") > res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > +-+ > |partition| > +-+ > |age=25 | > +-+ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+-+---+ > |id |name |age| > +---+-+---+ > |1 |Ranga|25 | > +---+-+---+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") > VALUES (2, 'Nishanth')""") > res7: 
org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > ++ > |partition | > ++ > |age=25 | > |age=test_age| > ++ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+++ > |id |name |age | > +---+++ > |1 |Ranga |25 | > |2 |Nishanth|null| > +---+++ {code} > *Hive Code:* > > > {code:java} > > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, > > 'Nishanth'); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10248]: Cannot add partition column age of type string as it cannot be > converted to type int (state=42000,code=10248){code} > > *Expected Result:* > When *spark.sql.sources.validatePartitionColumns=true* it needs to be > validated the datatype value and exception needs to be thrown if we provide > wrong data type value. > *Reference:* > [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.
Ranga Reddy created SPARK-40988: --- Summary: Spark3 partition column value is not validated with user provided schema. Key: SPARK-40988 URL: https://issues.apache.org/jira/browse/SPARK-40988 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0 Reporter: Ranga Reddy Fix For: 3.4.0 Spark3 has not validated the Partition Column type while inserting the data but on the Hive side exception is thrown while inserting different type values. *Spark Code:* {code:java} scala> val tableName="test_partition_table" tableName: String = test_partition_table scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") res0: org.apache.spark.sql.DataFrame = [] scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("SHOW tables").show(truncate=false) +-+-+---+ |namespace|tableName |isTemporary| +-+-+---+ |default |test_partition_table |false | +-+-+---+ scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, false) +--+-+ |key |value| +--+-+ |spark.sql.sources.validatePartitionColumns|true | +--+-+ scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 'Ranga')""") res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions $tableName").show(50, false) +-+ |partition| +-+ |age=25 | +-+ scala> spark.sql(s"select * from $tableName").show(50, false) +---+-+---+ |id |name |age| +---+-+---+ |1 |Ranga|25 | +---+-+---+ scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") VALUES (2, 'Nishanth')""") res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions $tableName").show(50, false) ++ |partition | ++ |age=25 | |age=test_age| ++ scala> spark.sql(s"select * from $tableName").show(50, false) +---+++ |id |name |age | +---+++ |1 |Ranga |25 | |2 |Nishanth|null| +---+++ {code} *Hive Code:* {code:java} > INSERT INTO 
test_partition_table partition (age="test_age2") VALUES (3, > 'Nishanth'); Error: Error while compiling statement: FAILED: SemanticException [Error 10248]: Cannot add partition column age of type string as it cannot be converted to type int (state=42000,code=10248){code} *Expected Result:* When *spark.sql.sources.validatePartitionColumns=true* it needs to be validated the datatype value and exception needs to be thrown if we provide wrong data type value. *Reference:* [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
[ https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranga Reddy updated SPARK-36901: Description: While running a Spark application, if there are no further resources to launch executors, the application fails after 5 minutes with the exception below. {code:java} 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources ... 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] ... Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146) ... 71 more 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook {code} *Expectation:* Spark should either throw a proper exception saying *"there are no further resources to run the application"* or *wait until it gets resources*. To reproduce the issue we used the following sample code. *PySpark Code (test_broadcast_timeout.py):* {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate() t1 = spark.range(5) t2 = spark.range(5) q = t1.join(t2,t1.id == t2.id) q.explain q.show(){code} *Spark Submit Command:* {code:java} spark-submit --executor-memory 512M test_broadcast_timeout.py{code} Note: We have tested the same code in Spark 3.1 and can reproduce the issue in Spark 3 as well. 
was: While running Spark application, if there are no further resources to launch executors, Spark application is failed after 5 mins with below exception. {code:java} 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources ... 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] ... Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146) ... 71 more 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook {code} *Expectation* should be either needs to be throw proper exception saying *"there are no further to resources to run the application"* or it needs to be *"wait till it get resources"*. To reproduce the issue we have used following sample code. 
*PySpark Code (test_broadcast_timeout.py):* {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate() t1 = spark.range(5) t2 = spark.range(5) q = t1.join(t2,t1.id == t2.id) q.explain q.show(){code} *Spark Submit Command:* {code:java} spark-submit --executor-memory 512M test_broadcast_timeout.py{code} > ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs > - > > Key: SPARK-36901 > URL: https://issues.apache.org/jira/browse/SPARK-36901 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Ranga Reddy >Priority: Major > > While running Spark application, if there are no further resources to launch > executors, Spark application is failed after 5 mins with below exception. > {code:java} > 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources > ... > 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute > broadcast in 300 secs. > java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] > ... > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [300 seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) > at > scala.concurrent.impl.Promise$DefaultPromise.resul
[jira] [Updated] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
[ https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranga Reddy updated SPARK-36901: Description: While running Spark application, if there are no further resources to launch executors, Spark application is failed after 5 mins with below exception. {code:java} 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources ... 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] ... Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146) ... 71 more 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook {code} *Expectation* should be either needs to be throw proper exception saying *"there are no further to resources to run the application"* or it needs to be *"wait till it get resources"*. To reproduce the issue we have used following sample code. *PySpark Code (test_broadcast_timeout.py):* {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate() t1 = spark.range(5) t2 = spark.range(5) q = t1.join(t2,t1.id == t2.id) q.explain q.show(){code} *Spark Submit Command:* {code:java} spark-submit --executor-memory 512M test_broadcast_timeout.py{code} was: While running Spark application, if there are no further resources to launch executors, Spark application is failed after 5 mins with below exception. 
{code:java} 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources ... 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] ... Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146) ... 71 more 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook {code} Expectation will be either needs to be throw proper exception saying there are no further to resources to run the application or it needs to be wait till it get resources. To reproduce the issue we have used following sample code. *PySpark Code:* # cat test_broadcast_timeout.py from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate() t1 = spark.range(5) t2 = spark.range(5) q = t1.join(t2,t1.id == t2.id) q.explain q.show() *Spark Submit Command:* {code:java} spark-submit --executor-memory 512M test_broadcast_timeout.py{code} > ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs > - > > Key: SPARK-36901 > URL: https://issues.apache.org/jira/browse/SPARK-36901 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Ranga Reddy >Priority: Major > > While running Spark application, if there are no further resources to launch > executors, Spark application is failed after 5 mins with below exception. 
> {code:java} > 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources > ... > 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute > broadcast in 300 secs. > java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] > ... > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [300 seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) > at > org.apa
[jira] [Created] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
Ranga Reddy created SPARK-36901: --- Summary: ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs Key: SPARK-36901 URL: https://issues.apache.org/jira/browse/SPARK-36901 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.4.0 Reporter: Ranga Reddy While running a Spark application, if there are no further resources to launch executors, the application fails after 5 minutes with the exception below. {code:java} 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources ... 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] ... Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146) ... 71 more 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook {code} The expectation is that Spark should either throw a proper exception saying there are no further resources to run the application, or wait until it gets resources. To reproduce the issue we used the following sample code. 
*PySpark Code (test_broadcast_timeout.py):* {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate() t1 = spark.range(5) t2 = spark.range(5) q = t1.join(t2,t1.id == t2.id) q.explain q.show(){code} *Spark Submit Command:* {code:java} spark-submit --executor-memory 512M test_broadcast_timeout.py{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
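The failure mode described above can be modelled in plain Python with `concurrent.futures` (a stand-in, not Spark code): the driver awaits the broadcast result with a bounded wait — in Spark this bound is `spark.sql.broadcastTimeout`, 300 seconds by default — so a cluster that never grants resources produces a `TimeoutError` instead of hanging forever. The timeout here is shrunk to 0.1 s purely for illustration.

```python
# Plain-Python model of the bounded wait on a broadcast future.
import concurrent.futures
import time

def build_broadcast_table():
    time.sleep(1.0)  # stands in for executors that are never scheduled
    return "broadcast relation"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(build_broadcast_table)
try:
    # Analogous to the driver's awaitResult(...) with a 300 s bound.
    future.result(timeout=0.1)
    outcome = "broadcast completed"
except concurrent.futures.TimeoutError:
    outcome = "Could not execute broadcast in 0.1 secs"
pool.shutdown(wait=False)

print(outcome)  # Could not execute broadcast in 0.1 secs
```

Raising `spark.sql.broadcastTimeout` (or disabling broadcast joins via `spark.sql.autoBroadcastJoinThreshold=-1`) only moves or avoids the bound; it does not address the underlying lack of resources the reporter describes.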
[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409427#comment-17409427 ] Ranga Reddy commented on SPARK-26208: - cc [~hyukjin.kwon] > Empty dataframe does not roundtrip for csv with header > -- > > Key: SPARK-26208 > URL: https://issues.apache.org/jira/browse/SPARK-26208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: master branch, > commit 034ae305c33b1990b3c1a284044002874c343b4d, > date: Sun Nov 18 16:02:15 2018 +0800 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > Fix For: 3.0.0 > > > when we write empty part file for csv and header=true we fail to write > header. the result cannot be read back in. > when header=true a part file with zero rows should still have header -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396076#comment-17396076 ] Ranga Reddy edited comment on SPARK-26208 at 8/10/21, 9:03 AM: --- Hi [~koertkuipers] The above code works only when the dataframe is created manually. The issue still persists when we create a dataframe by reading a Hive table. *Hive Table:* {code:java} CREATE EXTERNAL TABLE `test_empty_csv_table`( `col1` bigint, `col2` bigint) STORED AS ORC LOCATION '/tmp/test_empty_csv_table';{code} *spark-shell* {code:java} val tableName = "test_empty_csv_table" val emptyCSVFilePath = "/tmp/empty_csv_file" val df = spark.sql("select * from "+tableName) df.printSchema() df.write.format("csv").option("header", true).mode("overwrite").save(emptyCSVFilePath) val df2 = spark.read.option("header", true).csv(emptyCSVFilePath) {code} {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) ... 49 elided{code} was (Author: rangareddy.av...@gmail.com): The above code will work only when dataframe created manually. Issue still persists when when we create dataframe while reading hive table. 
*Hive Table:* {code:java} CREATE EXTERNAL TABLE `test_empty_csv_table`( `col1` bigint, `col2` bigint) STORED AS ORC LOCATION '/tmp/test_empty_csv_table';{code} *spark-shell* {code:java} val tableName = "test_empty_csv_table" val emptyCSVFilePath = "/tmp/empty_csv_file" val df = spark.sql("select * from "+tableName) df.printSchema() df.write.format("csv").option("header", true).mode("overwrite").save(emptyCSVFilePath) val df2 = spark.read.option("header", true).csv(emptyCSVFilePath) {code} {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) ... 49 elided{code} > Empty dataframe does not roundtrip for csv with header > -- > > Key: SPARK-26208 > URL: https://issues.apache.org/jira/browse/SPARK-26208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: master branch, > commit 034ae305c33b1990b3c1a284044002874c343b4d, > date: Sun Nov 18 16:02:15 2018 +0800 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > Fix For: 3.0.0 > > > when we write empty part file for csv and header=true we fail to write > header. the result cannot be read back in. 
> when header=true a part file with zero rows should still have header

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
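The read-back failure above can be illustrated outside Spark: a header-only CSV file still carries enough information to recover the column names, while a zero-byte part file (what Spark wrote for an empty dataframe before the fix) carries none, so schema inference has nothing to work with. A minimal Python sketch of that difference (the helper names `write_csv` and `infer_columns` are illustrative, not Spark internals):

```python
import csv
import io


def write_csv(rows, header=("col1", "col2"), with_header=True):
    # Mimic a CSV part file: optionally write the header even when rows is empty.
    buf = io.StringIO()
    writer = csv.writer(buf)
    if with_header:
        writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def infer_columns(text):
    # Crude "schema inference": column names come from the first line.
    # A zero-byte file has no first line, so nothing can be inferred.
    reader = csv.reader(io.StringIO(text))
    return next(reader, None)


print(infer_columns(write_csv([], with_header=True)))   # header survives the roundtrip
print(infer_columns(write_csv([], with_header=False)))  # None: no schema to infer
```

In practice, the AnalysisException on read-back can be sidestepped by supplying the schema explicitly instead of inferring it, e.g. `spark.read.schema(df.schema).option("header", true).csv(path)`; this is a general workaround, not something stated in the ticket.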
[jira] [Comment Edited] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346581#comment-17346581 ] Ranga Reddy edited comment on SPARK-12139 at 5/18/21, 6:07 AM:
---
Hi [~janewangfb]

Creating an alias for a backquoted column does not work in Spark, while the same query works in Hive. Could you please check and let me know? If needed, I will create a new jira.

*Hive:*
{code:java}
hive> select `col1` as `col` from regex_test.regex_test_tbl;
OK
col
1
2
3
{code}

*Spark:*
{code:java}
scala> spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'alias';
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:132)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1659)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1630)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1630)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1615)
  at scala.collection.immutable.List.flatMap(List.scala:366)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList(Analyzer.scala:1610)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$11.applyOrElse(Analyzer.scala:1381)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$11.applyOrElse(Analyzer.scala:1376)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:207)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:1376)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:1214)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
  at scala.collection.immutable.List.foreach(List.scala:431)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:182)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:176)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:132)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:160)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:159)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  ...
{code}
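A likely explanation for the "Invalid usage of '*' in expression 'alias'" error, given that this ticket is about regex column specification: when `spark.sql.parser.quotedRegexColumnNames` is enabled, a backquoted name in a SELECT list is treated as a regex and expanded to every matching column, much like `*`, and aliasing such a star-like expansion is rejected by the analyzer. A rough Python sketch of that expansion step (the function name and behavior are illustrative, not Spark's actual implementation):

```python
import re


def expand_quoted_regex(pattern, columns):
    # With regex column specification enabled, a backquoted identifier in
    # SELECT is interpreted as a regex over the table's column names and
    # expands to the full list of matches, like a star expression.
    return [c for c in columns if re.fullmatch(pattern, c)]


columns = ["col1", "col2"]
print(expand_quoted_regex("col1", columns))   # expands to a single column
print(expand_quoted_regex("col.*", columns))  # expands to several columns, so
                                              # a single alias would be ambiguous
```

Under this reading, Hive accepts `` `col1` as `col` `` because it resolves the backquoted name as a plain identifier in the alias position, while Spark's analyzer still routes it through star expansion; a possible workaround is to disable `spark.sql.parser.quotedRegexColumnNames` for such queries, though that is an assumption worth verifying against the affected Spark version.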