[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-30 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17707125#comment-17707125
 ] 

Ranga Reddy commented on SPARK-42760:
-

From Spark version 3.2.0 onwards, AQE (Adaptive Query Execution) is enabled by 
default; in Spark versions earlier than 3.2.0 it is disabled by default.
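For anyone hitting this, below is a minimal sketch (not an official recommendation) 
of how the pre-3.2.0 partition count can be recovered with the reproducer quoted 
below, by disabling AQE or just its shuffle-partition coalescing step; the two 
config keys are the standard AQE settings.
{code:java}
# A minimal sketch, assuming the example_shuffle_partitions() reproducer from the
# description is defined in the same session.
spark.conf.set("spark.sql.adaptive.enabled", "false")  # turn off AQE entirely
# or keep AQE but stop it from merging small shuffle partitions:
# spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

example_shuffle_partitions()  # should now report 4 partitions after the join
{code}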

 

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is 
> {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: 
> {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32893) Structured Streaming and Dynamic Allocation on StandaloneCluster

2023-03-21 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703182#comment-17703182
 ] 

Ranga Reddy commented on SPARK-32893:
-

A similar issue has been created for 
[Kubernetes|https://issues.apache.org/jira/issues/?jql=project+%3D+SPARK+AND+component+%3D+Kubernetes]: 
https://issues.apache.org/jira/browse/SPARK-35625

> Structured Streaming and Dynamic Allocation on StandaloneCluster
> 
>
> Key: SPARK-32893
> URL: https://issues.apache.org/jira/browse/SPARK-32893
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Duarte Ferreira
>Priority: Major
>
> We are currently using Spark 3.0.1 Standalone cluster to run our Structured 
> streaming applications.
> We set the following configurations when running the application in cluster 
> mode:
>  * spark.dynamicAllocation.enabled = true
>  * spark.shuffle.service.enabled = true
>  * spark.cores.max =5
>  * spark.executor.memory = 1G
>  * spark.executor.cores = 1
> We also have the configurations set to enable spark.shuffle.service.enabled 
> on each worker and have a cluster composed of 1 master and 2 slaves.
> The application reads data from a Kafka topic (readTopic) using [this 
> documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html], 
> applies some transformations on the Dataset using Spark SQL, and writes data 
> to another Kafka topic (writeTopic).
> When we start the application it behaves correctly: it starts with 0 
> executors and, as we start feeding data to the readTopic, it increases the 
> number of executors until it reaches the 5-executor limit and all messages 
> are transformed and written to the writeTopic in Kafka.
> If we stop feeding messages to the readTopic, the application works as 
> expected and starts killing executors that are no longer needed until we 
> stop sending data completely and it reaches 0 running executors.
> If we start sending data again right away, it behaves just as expected and 
> starts increasing the number of executors again. But if we leave the 
> application idle at 0 executors for around 10 minutes, we start getting 
> errors like this:
> {noformat}
> *no*
> 20/09/15 10:41:22 ERROR TransportClient: Failed to send RPC RPC 
> 7570256331800450365 to sparkmaster/10.0.12.231:7077: 
> java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
>   at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>   at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>   at 
> io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
>   at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connection reset by peer
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
>   at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:148)
>   at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:362)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:235)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:209)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:400)
>   at 
> 

[jira] [Commented] (SPARK-32893) Structured Streaming and Dynamic Allocation on StandaloneCluster

2023-03-21 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703177#comment-17703177
 ] 

Ranga Reddy commented on SPARK-32893:
-

I can see similar behaviour with Spark Structured Streaming on YARN.
{code:java}
2023-03-14 18:17:29 ERROR TransportClient:337 - Failed to send RPC RPC 
7955407071046657873 to /127.0.0.1:50040: 
java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
        at 
io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748) {code}
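For context, a minimal sketch of how dynamic allocation is typically enabled for a 
Structured Streaming job on YARN in Spark 3.x; the keys below are standard Spark 
settings and only an assumed setup, not taken from the report above. 
spark.dynamicAllocation.shuffleTracking.enabled lets executors be released without 
an external shuffle service.
{code:java}
# A minimal sketch (assumed setup, not from the report above) of enabling dynamic
# allocation for a Structured Streaming job on YARN in Spark 3.x, using shuffle
# tracking instead of the external shuffle service.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("structured-streaming-dynamic-allocation")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "0")
         .config("spark.dynamicAllocation.maxExecutors", "5")
         .getOrCreate())
{code}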

> Structured Streaming and Dynamic Allocation on StandaloneCluster
> 
>
> Key: SPARK-32893
> URL: https://issues.apache.org/jira/browse/SPARK-32893
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Duarte Ferreira
>Priority: Major
>
> We are currently using Spark 3.0.1 Standalone cluster to run our Structured 
> streaming applications.
> We set the following configurations when running the application in cluster 
> mode:
>  * spark.dynamicAllocation.enabled = true
>  * spark.shuffle.service.enabled = true
>  * spark.cores.max =5
>  * spark.executor.memory = 1G
>  * spark.executor.cores = 1
> We also have the configurations set to enable spark.shuffle.service.enabled 
> on each worker and have a cluster composed of 1 master and 2 slaves.
> The application reads data from a Kafka topic (readTopic) using [this 
> documentation|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html], 
> applies some transformations on the Dataset using Spark SQL, and writes data 
> to another Kafka topic (writeTopic).
> When we start the application it behaves correctly: it starts with 0 
> executors and, as we start feeding data to the readTopic, it increases the 
> number of executors until it reaches the 5-executor limit and all messages 
> are transformed and written to the writeTopic in Kafka.
> If we stop feeding messages to the readTopic, the application works as 
> expected and starts killing executors that are no longer needed until we 
> stop sending data completely and it reaches 0 running executors.
> If we start sending data again right away, it behaves just as expected and 
> starts increasing the number of executors again. But if we leave the 
> application idle at 0 executors for around 10 minutes, we start getting 
> errors like this:
> {noformat}
> *no*
> 20/09/15 10:41:22 ERROR TransportClient: Failed to send RPC RPC 
> 7570256331800450365 to sparkmaster/10.0.12.231:7077: 
> java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
>   at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>   at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>   at 
> io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
>   at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 

[jira] [Commented] (SPARK-28869) Roll over event log files

2022-12-06 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643804#comment-17643804
 ] 

Ranga Reddy commented on SPARK-28869:
-

Hi [~kabhwan] 

I have enabled event log rolling for the Spark Streaming network word count 
example, but the event log files are not compacted.

*Configuration Parameters:*
{code:java}
spark.eventLog.rolling.enabled=true
spark.eventLog.rolling.maxFileSize=10m
spark.history.fs.eventLog.rolling.maxFilesToRetain=2
spark.history.fs.cleaner.interval=1800{code}
*Event log file list:*

[^application_1670216197043_0012.log]

Could you please check the issue?
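For reference, a minimal sketch of how the rolling settings above were applied on 
the session side (the application itself is the standard network word count 
example, so only the configuration is shown); spark.eventLog.enabled is added 
because it must be on for any event logs to be written, and compaction itself is 
performed by the History Server based on 
spark.history.fs.eventLog.rolling.maxFilesToRetain.
{code:java}
# A minimal sketch of the session-side configuration, assuming the standard
# network word count example; compaction is done by the History Server using
# spark.history.fs.eventLog.rolling.maxFilesToRetain.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("network-wordcount-eventlog-rolling")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.rolling.enabled", "true")
         .config("spark.eventLog.rolling.maxFileSize", "10m")
         .getOrCreate())
{code}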

> Roll over event log files
> -
>
> Key: SPARK-28869
> URL: https://issues.apache.org/jira/browse/SPARK-28869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: application_1670216197043_0012.log
>
>
> This issue tracks the effort on rolling over event log files in driver and 
> let SHS replay the multiple event log files correctly.
> This issue doesn't deal with overall size of event log, as well as no 
> guarantee when deleting old event log files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28869) Roll over event log files

2022-12-06 Thread Ranga Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranga Reddy updated SPARK-28869:

Attachment: application_1670216197043_0012.log

> Roll over event log files
> -
>
> Key: SPARK-28869
> URL: https://issues.apache.org/jira/browse/SPARK-28869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: application_1670216197043_0012.log
>
>
> This issue tracks the effort on rolling over event log files in driver and 
> let SHS replay the multiple event log files correctly.
> This issue doesn't deal with overall size of event log, as well as no 
> guarantee when deleting old event log files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40958) Support data masking built-in function 'null'

2022-11-14 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634142#comment-17634142
 ] 

Ranga Reddy commented on SPARK-40958:
-

I will work on this Jira.
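For clarity on the expected behaviour, a minimal sketch: today a typed NULL can 
already be produced with CAST, and the proposed built-in would behave the same way 
for the type of its input; the `null(...)` call shown below is hypothetical, not 
the final API.
{code:java}
# A minimal sketch of the expected behaviour, not the final API. CAST already
# returns a NULL of the requested type; the proposed built-in would do the same
# for the type of its input.
spark.sql("SELECT CAST(NULL AS INT) AS masked_int").show()
spark.sql("SELECT CAST(NULL AS STRING) AS masked_str").show()
# Hypothetical usage of the proposed function: SELECT null(salary) FROM employees
{code}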

> Support data masking built-in function 'null'
> -
>
> Key: SPARK-40958
> URL: https://issues.apache.org/jira/browse/SPARK-40958
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>
> This can be a simple function that returns a NULL value of the given input 
> type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40686) Support data masking and redacting built-in functions

2022-11-14 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633668#comment-17633668
 ] 

Ranga Reddy commented on SPARK-40686:
-

Hi [~dtenedor] 

I am not able to find more references/examples for *phi-related* functions 
except the Okera website. Could you please share some references so that I can 
start working on those functions?

> Support data masking and redacting built-in functions
> -
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking and redacting functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40988) Test case for insert partition should verify value

2022-11-13 Thread Ranga Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranga Reddy updated SPARK-40988:

Priority: Minor  (was: Major)

> Test case for insert partition should verify value 
> ---
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Minor
>
> Spark 3 does not validate the partition column type while inserting data, but 
> on the Hive side an exception is thrown when inserting values of a different 
> type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +-+-+---+
> |namespace|tableName            |isTemporary|
> +-+-+---+
> |default  |test_partition_table |false      |
> +-+-+---+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +--+-+
> |key                                       |value|
> +--+-+
> |spark.sql.sources.validatePartitionColumns|true |
> +--+-+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> +-+
> |partition|
> +-+
> |age=25   |
> +-+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-+---+
> |id |name |age|
> +---+-+---+
> |1  |Ranga|25 |
> +---+-+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> ++
> |partition   |
> ++
> |age=25      |
> |age=test_age|
> ++
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+++
> |id |name    |age |
> +---+++
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+++ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated and an exception needs to be thrown if we 
> provide a value of the wrong data type.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40988) Test case for insert partition should verify value

2022-11-13 Thread Ranga Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranga Reddy updated SPARK-40988:

Summary: Test case for insert partition should verify value   (was: Spark3 
partition column value is not validated with user provided schema.)

> Test case for insert partition should verify value 
> ---
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Major
>
> Spark 3 does not validate the partition column type while inserting data, but 
> on the Hive side an exception is thrown when inserting values of a different 
> type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +-+-+---+
> |namespace|tableName            |isTemporary|
> +-+-+---+
> |default  |test_partition_table |false      |
> +-+-+---+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +--+-+
> |key                                       |value|
> +--+-+
> |spark.sql.sources.validatePartitionColumns|true |
> +--+-+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> +-+
> |partition|
> +-+
> |age=25   |
> +-+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-+---+
> |id |name |age|
> +---+-+---+
> |1  |Ranga|25 |
> +---+-+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> ++
> |partition   |
> ++
> |age=25      |
> |age=test_age|
> ++
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+++
> |id |name    |age |
> +---+++
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+++ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated and an exception needs to be thrown if we 
> provide a value of the wrong data type.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-13 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633488#comment-17633488
 ] 

Ranga Reddy commented on SPARK-40798:
-

Hi [~ulysses] 

I will go ahead and add a test case for the *insert* operation and update the 
Jira description. 

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40798) Alter partition should verify value

2022-11-13 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633209#comment-17633209
 ] 

Ranga Reddy edited comment on SPARK-40798 at 11/14/22 3:11 AM:
---

Hi [~ulysses] 

The current Jira will address validation of partition values for both *Insert* and 
*Alter* operations.
{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) 
PARTITIONED BY (age INT) LOCATION 
'file:/tmp/spark-warehouse/test_partition_value_tbl';

-- This DDL should fail but worked: - SPARK-40798
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa'); 

-- This DML should fail but worked: - SPARK-40988
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 'ABC'); 
{code}
Can we rename the *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update 
its description?

*Note:* We already have a *spark.sql.sources.validatePartitionColumns* parameter 
that is used while reading partition values. Can we reuse this variable instead?

After that, I will commit my test case.


was (Author: rangareddy.av...@gmail.com):
Hi [~ulysses] 

The current Jira will solve both the *Insert* and *Alter* partition values. 
{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) 
PARTITIONED BY (age INT) LOCATION 
'file:/tmp/spark-warehouse/test_partition_value_tbl';

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa'); 

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 
'ABC');{code}
Can we rename *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update the 
description of this variable?

*Note:* We have a *spark.sql.sources.validatePartitionColumns* 
{color:#172b4d}parameter and used while reading the partition value.  This 
variable only we can reuse?{color}

After that, I will commit my test case.

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-13 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633471#comment-17633471
 ] 

Ranga Reddy commented on SPARK-40798:
-

Hi [~ulysses] 

*SPARK-40988* was created for validating the value type while inserting a 
partition, and *SPARK-40798* was created for validating the value while altering 
a partition. The *SPARK-40798* Jira will fix *partition value validation* for 
both *insert* and *alter* partitions. So, can we use the *same configuration* 
parameter for both *Insert* and *Alter* partitions?
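For reference, a minimal sketch showing how the existing configuration mentioned 
above is read and toggled from PySpark; it does not reflect whatever behaviour the 
final fix ties to this (or a new) key.
{code:java}
# A minimal sketch: inspect and toggle the existing flag discussed above.
print(spark.conf.get("spark.sql.sources.validatePartitionColumns"))  # "true" by default
spark.conf.set("spark.sql.sources.validatePartitionColumns", "true")
{code}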

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40798) Alter partition should verify value

2022-11-12 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633209#comment-17633209
 ] 

Ranga Reddy edited comment on SPARK-40798 at 11/13/22 3:59 AM:
---

Hi [~ulysses] 

The current Jira will address validation of partition values for both *Insert* and 
*Alter* operations.
{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) 
PARTITIONED BY (age INT) LOCATION 
'file:/tmp/spark-warehouse/test_partition_value_tbl';

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa'); 

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 
'ABC');{code}
Can we rename the *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update 
its description?

*Note:* We already have a *spark.sql.sources.validatePartitionColumns* parameter 
that is used while reading partition values. Can we reuse this variable instead?

After that, I will commit my test case.


was (Author: rangareddy.av...@gmail.com):
Hi [~ulysses] 

The current Jira will solve both the *Insert* and *Alter* partition values. 
{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) 
PARTITIONED BY (age INT) LOCATION 
'file:/tmp/spark-warehouse/test_partition_value_tbl'

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa'); 

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 
'ABC'){code}
Can we rename *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update the 
description of this variable?

After that, I will commit my test case.

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-12 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633209#comment-17633209
 ] 

Ranga Reddy commented on SPARK-40798:
-

Hi [~ulysses] 

The current Jira will address validation of partition values for both *Insert* and 
*Alter* operations.
{code:java}
CREATE EXTERNAL TABLE test_partition_value_tbl ( id INT, name STRING ) 
PARTITIONED BY (age INT) LOCATION 
'file:/tmp/spark-warehouse/test_partition_value_tbl'

-- This DDL should fail but worked:
ALTER TABLE test_partition_value_tbl ADD PARTITION(age='aaa'); 

-- This DML should fail but worked:
INSERT INTO test_partition_value_tbl PARTITION(age="aaa") VALUES (1, 
'ABC'){code}
Can we rename the *SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION* variable and update 
its description?

After that, I will commit my test case.

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-10 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17632106#comment-17632106
 ] 

Ranga Reddy commented on SPARK-40798:
-

The below issue is addressed by the current Jira. Can I add a test case for it in 
InsertSuite.scala?

https://issues.apache.org/jira/browse/SPARK-40988

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.

2022-11-10 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17632102#comment-17632102
 ] 

Ranga Reddy commented on SPARK-40988:
-

The following Jira will resolve the issue by throwing the *CAST_INVALID_INPUT* 
error.

https://issues.apache.org/jira/browse/SPARK-40798
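For reference, a minimal sketch of the statement from the description that is 
expected to fail once SPARK-40798 lands (a later comment shows it raising 
CAST_INVALID_INPUT on a Spark 3.4 build), assuming the test_partition_table from 
the description already exists:
{code:java}
# A minimal sketch using the table from the description; with the SPARK-40798
# change this INSERT raises CAST_INVALID_INPUT instead of silently creating the
# partition age=test_age.
table_name = "test_partition_table"
spark.sql(f"""INSERT INTO {table_name} PARTITION (age="test_age") VALUES (2, 'Nishanth')""")
{code}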

> Spark3 partition column value is not validated with user provided schema.
> -
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Major
>
> Spark 3 does not validate the partition column type while inserting data, but 
> on the Hive side an exception is thrown when inserting values of a different 
> type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +-+-+---+
> |namespace|tableName            |isTemporary|
> +-+-+---+
> |default  |test_partition_table |false      |
> +-+-+---+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +--+-+
> |key                                       |value|
> +--+-+
> |spark.sql.sources.validatePartitionColumns|true |
> +--+-+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> +-+
> |partition|
> +-+
> |age=25   |
> +-+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-+---+
> |id |name |age|
> +---+-+---+
> |1  |Ranga|25 |
> +---+-+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> ++
> |partition   |
> ++
> |age=25      |
> |age=test_age|
> ++
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+++
> |id |name    |age |
> +---+++
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+++ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated and an exception needs to be thrown if we 
> provide a value of the wrong data type.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.

2022-11-10 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17632073#comment-17632073
 ] 

Ranga Reddy commented on SPARK-40988:
-

In Spark 3.4, if we run the following code, we can see the *CAST_INVALID_INPUT* 
exception.
{code:java}
spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") VALUES (2, 
'Nishanth')"""){code}
*Exception:*
{code:java}
[CAST_INVALID_INPUT] The value 'AGE_34' of the type "STRING" cannot be cast to 
"INT" because it is malformed. Correct the value as per the syntax, or change 
its target type. Use `try_cast` to tolerate malformed input and return NULL 
instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this 
error.
== SQL(line 1, position 1) ==
INSERT INTO TABLE partition_table PARTITION(age="AGE_34") VALUES (1, 'ABC')
^org.apache.spark.SparkNumberFormatException:
 [CAST_INVALID_INPUT] The value 'AGE_34' of the type "STRING" cannot be cast to 
"INT" because it is malformed. Correct the value as per the syntax, or change 
its target type. Use `try_cast` to tolerate malformed input and return NULL 
instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this 
error.
== SQL(line 1, position 1) ==
INSERT INTO TABLE partition_table PARTITION(age="AGE_34") VALUES (1, 'ABC')
^    at 
org.apache.spark.sql.errors.QueryExecutionErrors$.invalidInputInCastToNumberError(QueryExecutionErrors.scala:161)
    at 
org.apache.spark.sql.catalyst.util.UTF8StringUtils$.withException(UTF8StringUtils.scala:51)
    at 
org.apache.spark.sql.catalyst.util.UTF8StringUtils$.toIntExact(UTF8StringUtils.scala:34)
    at 
org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$2(Cast.scala:927)
    at 
org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$2$adapted(Cast.scala:927)
    at org.apache.spark.sql.catalyst.expressions.Cast.buildCast(Cast.scala:588)
    at 
org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToInt$1(Cast.scala:927)
    at 
org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:1285)
    at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:526)
    at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:522)
    at 
org.apache.spark.sql.util.PartitioningUtils$.normalizePartitionStringValue(PartitioningUtils.scala:56)
    at 
org.apache.spark.sql.util.PartitioningUtils$.$anonfun$normalizePartitionSpec$1(PartitioningUtils.scala:100)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at 
org.apache.spark.sql.util.PartitioningUtils$.normalizePartitionSpec(PartitioningUtils.scala:76)
    at 
org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:382)
    at 
org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:426)
    at 
org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:420)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
    at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)
    at 

[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2022-11-02 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627704#comment-17627704
 ] 

Ranga Reddy commented on SPARK-32380:
-

The below pull request will solve the issue, but we need to check whether there 
are any other issues.

[https://github.com/apache/spark/pull/29178]

 

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  

[jira] [Commented] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.

2022-11-01 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627197#comment-17627197
 ] 

Ranga Reddy commented on SPARK-40988:
-

I will work on this issue.

> Spark3 partition column value is not validated with user provided schema.
> -
>
> Key: SPARK-40988
> URL: https://issues.apache.org/jira/browse/SPARK-40988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Ranga Reddy
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark 3 does not validate the partition column type while inserting data, but 
> on the Hive side an exception is thrown when inserting values of a different 
> type.
> *Spark Code:*
>  
> {code:java}
> scala> val tableName="test_partition_table"
> tableName: String = test_partition_table
> scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
> PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("SHOW tables").show(truncate=false)
> +-+-+---+
> |namespace|tableName            |isTemporary|
> +-+-+---+
> |default  |test_partition_table |false      |
> +-+-+---+
> scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
> false)
> +--+-+
> |key                                       |value|
> +--+-+
> |spark.sql.sources.validatePartitionColumns|true |
> +--+-+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
> 'Ranga')""")
> res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> +-+
> |partition|
> +-+
> |age=25   |
> +-+
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+-+---+
> |id |name |age|
> +---+-+---+
> |1  |Ranga|25 |
> +---+-+---+
> scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") 
> VALUES (2, 'Nishanth')""")
> res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions 
> $tableName").show(50, false)
> ++
> |partition   |
> ++
> |age=25      |
> |age=test_age|
> ++
> scala> spark.sql(s"select * from $tableName").show(50, false)
> +---+++
> |id |name    |age |
> +---+++
> |1  |Ranga   |25  |
> |2  |Nishanth|null|
> +---+++ {code}
> *Hive Code:*
>  
>  
> {code:java}
> > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> > 'Nishanth');
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10248]: Cannot add partition column age of type string as it cannot be 
> converted to type int (state=42000,code=10248){code}
>  
> *Expected Result:*
> When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
> partition value needs to be validated and an exception needs to be thrown if we 
> provide a value of the wrong data type.
> *Reference:*
> [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40988) Spark3 partition column value is not validated with user provided schema.

2022-11-01 Thread Ranga Reddy (Jira)
Ranga Reddy created SPARK-40988:
---

 Summary: Spark3 partition column value is not validated with user 
provided schema.
 Key: SPARK-40988
 URL: https://issues.apache.org/jira/browse/SPARK-40988
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: Ranga Reddy
 Fix For: 3.4.0


Spark 3 does not validate the partition column type while inserting data, but on 
the Hive side an exception is thrown when inserting values of a different type.

*Spark Code:*

 
{code:java}
scala> val tableName="test_partition_table"
tableName: String = test_partition_table

scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) 
PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SHOW tables").show(truncate=false)
+-+-+---+
|namespace|tableName            |isTemporary|
+-+-+---+
|default  |test_partition_table |false      |
+-+-+---+

scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, 
false)
+--+-+
|key                                       |value|
+--+-+
|spark.sql.sources.validatePartitionColumns|true |
+--+-+

scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, 
'Ranga')""")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql(s"show partitions $tableName").show(50, false)
+-+
|partition|
+-+
|age=25   |
+-+

scala> spark.sql(s"select * from $tableName").show(50, false)
+---+-+---+
|id |name |age|
+---+-+---+
|1  |Ranga|25 |
+---+-+---+

scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") VALUES 
(2, 'Nishanth')""")
res7: org.apache.spark.sql.DataFrame = []

scala> spark.sql(s"show partitions $tableName").show(50, false)
++
|partition   |
++
|age=25      |
|age=test_age|
++

scala> spark.sql(s"select * from $tableName").show(50, false)
+---+++
|id |name    |age |
+---+++
|1  |Ranga   |25  |
|2  |Nishanth|null|
+---+++ {code}
*Hive Code:*

 

 
{code:java}
> INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, 
> 'Nishanth');
Error: Error while compiling statement: FAILED: SemanticException [Error 
10248]: Cannot add partition column age of type string as it cannot be 
converted to type int (state=42000,code=10248){code}
 

*Expected Result:*

When *spark.sql.sources.validatePartitionColumns=true*, the data type of the 
partition value needs to be validated and an exception needs to be thrown if we 
provide a value of the wrong data type.

*Reference:*

[https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs

2021-10-05 Thread Ranga Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranga Reddy updated SPARK-36901:

Description: 
While running a Spark application, if there are no further resources to launch 
executors, the Spark application fails after 5 minutes with the below exception.
{code:java}
21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
...
21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
broadcast in 300 secs.
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
... 71 more
21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
{code}
*Expectation:* either a proper exception should be thrown saying *"there are no 
further resources to run the application"*, or the application should *"wait till 
it gets resources"*.

To reproduce the issue, we have used the following sample code.

*PySpark Code (test_broadcast_timeout.py):*
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
t1 = spark.range(5)
t2 = spark.range(5)

q = t1.join(t2,t1.id == t2.id)
q.explain
q.show(){code}
*Spark Submit Command:*
{code:java}
spark-submit --executor-memory 512M test_broadcast_timeout.py{code}
 Note: We have tested the same code in Spark 3.1 and are able to reproduce the 
issue in Spark 3 as well.
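As an aside, a minimal sketch of the configuration knobs involved (a workaround 
only, not the fix requested above): spark.sql.broadcastTimeout controls the 
300-second limit seen in the error, and spark.sql.autoBroadcastJoinThreshold=-1 
avoids the broadcast join entirely.
{code:java}
# A minimal sketch of the related configuration (workaround only, not the fix
# requested above): raise the broadcast timeout or disable auto broadcast joins.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Test Broadcast Timeout")
         .config("spark.sql.broadcastTimeout", "1200")          # default is 300 seconds
         .config("spark.sql.autoBroadcastJoinThreshold", "-1")  # or skip broadcast joins
         .getOrCreate())
{code}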

  was:
While running Spark application, if there are no further resources to launch 
executors, Spark application is failed after 5 mins with below exception.
{code:java}
21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
...
21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
broadcast in 300 secs.
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
... 71 more
21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
{code}
*Expectation* should be either needs to be throw proper exception saying 
*"there are no further to resources to run the application"* or it needs to be 
*"wait till it get resources"*.

To reproduce the issue we have used following sample code.

*PySpark Code (test_broadcast_timeout.py):*
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
t1 = spark.range(5)
t2 = spark.range(5)

q = t1.join(t2,t1.id == t2.id)
q.explain
q.show(){code}
*Spark Submit Command:*
{code:java}
spark-submit --executor-memory 512M test_broadcast_timeout.py{code}
 


> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> -
>
> Key: SPARK-36901
> URL: https://issues.apache.org/jira/browse/SPARK-36901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Ranga Reddy
>Priority: Major
>
> While running Spark application, if there are no further resources to launch 
> executors, Spark application is failed after 5 mins with below exception.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
> broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> 

[jira] [Updated] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs

2021-09-30 Thread Ranga Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranga Reddy updated SPARK-36901:

Description: 
While running Spark application, if there are no further resources to launch 
executors, Spark application is failed after 5 mins with below exception.
{code:java}
21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
...
21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
broadcast in 300 secs.
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
... 71 more
21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
{code}
*Expectation* should be either needs to be throw proper exception saying 
*"there are no further to resources to run the application"* or it needs to be 
*"wait till it get resources"*.

To reproduce the issue we have used following sample code.

*PySpark Code (test_broadcast_timeout.py):*
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
t1 = spark.range(5)
t2 = spark.range(5)

q = t1.join(t2,t1.id == t2.id)
q.explain
q.show(){code}
*Spark Submit Command:*
{code:java}
spark-submit --executor-memory 512M test_broadcast_timeout.py{code}
 

  was:
While running Spark application, if there are no further resources to launch 
executors, Spark application is failed after 5 mins with below exception.
{code:java}
21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
...
21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
broadcast in 300 secs.
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
... 71 more
21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
{code}
Expectation will be either needs to be throw proper exception saying there are 
no further to resources to run the application or it needs to be wait till it 
get resources.

To reproduce the issue we have used following sample code.

*PySpark Code:*
# cat test_broadcast_timeout.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()

t1 = spark.range(5)
t2 = spark.range(5)

q = t1.join(t2,t1.id == t2.id)
q.explain
q.show()
*Spark Submit Command:*
{code:java}
spark-submit --executor-memory 512M test_broadcast_timeout.py{code}


> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> -
>
> Key: SPARK-36901
> URL: https://issues.apache.org/jira/browse/SPARK-36901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Ranga Reddy
>Priority: Major
>
> While running Spark application, if there are no further resources to launch 
> executors, Spark application is failed after 5 mins with below exception.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
> broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> 

[jira] [Created] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs

2021-09-30 Thread Ranga Reddy (Jira)
Ranga Reddy created SPARK-36901:
---

 Summary: ERROR exchange.BroadcastExchangeExec: Could not execute 
broadcast in 300 secs
 Key: SPARK-36901
 URL: https://issues.apache.org/jira/browse/SPARK-36901
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: Ranga Reddy


While running Spark application, if there are no further resources to launch 
executors, Spark application is failed after 5 mins with below exception.
{code:java}
21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
...
21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
broadcast in 300 secs.
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
... 71 more
21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
{code}
Expectation will be either needs to be throw proper exception saying there are 
no further to resources to run the application or it needs to be wait till it 
get resources.

To reproduce the issue we have used following sample code.

*PySpark Code:*
# cat test_broadcast_timeout.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()

t1 = spark.range(5)
t2 = spark.range(5)

q = t1.join(t2,t1.id == t2.id)
q.explain
q.show()
*Spark Submit Command:*
{code:java}
spark-submit --executor-memory 512M test_broadcast_timeout.py{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2021-09-03 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409427#comment-17409427
 ] 

Ranga Reddy commented on SPARK-26208:
-

cc [~hyukjin.kwon]

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
> Fix For: 3.0.0
>
>
> when we write empty part file for csv and header=true we fail to write 
> header. the result cannot be read back in.
> when header=true a part file with zero rows should still have header



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2021-08-10 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396076#comment-17396076
 ] 

Ranga Reddy edited comment on SPARK-26208 at 8/10/21, 9:03 AM:
---

Hi [~koertkuipers]

The above code works only when the dataframe is created manually.

The issue still persists when we create the dataframe by reading a Hive table.

*Hive Table:*
{code:java}
CREATE EXTERNAL TABLE `test_empty_csv_table`( 
 `col1` bigint, 
 `col2` bigint) 
STORED AS ORC 
LOCATION '/tmp/test_empty_csv_table';{code}
*spark-shell*

 
{code:java}
val tableName = "test_empty_csv_table"
val emptyCSVFilePath = "/tmp/empty_csv_file"
val df = spark.sql("select * from "+tableName)
df.printSchema()
df.write.format("csv").option("header", 
true).mode("overwrite").save(emptyCSVFilePath)
val df2 = spark.read.option("header", true).csv(emptyCSVFilePath)
{code}
 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must 
be specified manually.;
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
 ... 49 elided{code}
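
A workaround that appears to help here, offered only as an assumption on my side 
and not verified in this ticket: skip schema inference entirely and pass the 
schema of the source dataframe when reading the empty CSV back.
{code:java}
// Workaround sketch (assumption): reuse the known schema instead of asking
// Spark to infer it from empty, header-less part files.
val df2 = spark.read.schema(df.schema).option("header", true).csv(emptyCSVFilePath)
df2.printSchema()
{code}
This avoids the AnalysisException because no inference is attempted, but it does 
not address the missing header in the written files.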


was (Author: rangareddy.av...@gmail.com):
The above code will work only when dataframe created manually.

Issue still persists when when we create dataframe while reading hive table.

*Hive Table:*
{code:java}
CREATE EXTERNAL TABLE `test_empty_csv_table`( 
 `col1` bigint, 
 `col2` bigint) 
STORED AS ORC 
LOCATION '/tmp/test_empty_csv_table';{code}
*spark-shell*

 
{code:java}
val tableName = "test_empty_csv_table"
val emptyCSVFilePath = "/tmp/empty_csv_file"
val df = spark.sql("select * from "+tableName)
df.printSchema()
df.write.format("csv").option("header", 
true).mode("overwrite").save(emptyCSVFilePath)
val df2 = spark.read.option("header", true).csv(emptyCSVFilePath)
{code}
 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must 
be specified manually.;
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
 ... 49 elided{code}

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
> Fix For: 3.0.0
>
>
> when we write empty part file for csv and header=true we fail to write 
> header. the result cannot be read back in.
> when header=true a part file with zero rows should still have header



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2021-08-09 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396076#comment-17396076
 ] 

Ranga Reddy commented on SPARK-26208:
-

The above code works only when the dataframe is created manually.

The issue still persists when we create the dataframe by reading a Hive table.

*Hive Table:*
{code:java}
CREATE EXTERNAL TABLE `test_empty_csv_table`( 
 `col1` bigint, 
 `col2` bigint) 
STORED AS ORC 
LOCATION '/tmp/test_empty_csv_table';{code}
*spark-shell*

 
{code:java}
val tableName = "test_empty_csv_table"
val emptyCSVFilePath = "/tmp/empty_csv_file"
val df = spark.sql("select * from "+tableName)
df.printSchema()
df.write.format("csv").option("header", 
true).mode("overwrite").save(emptyCSVFilePath)
val df2 = spark.read.option("header", true).csv(emptyCSVFilePath)
{code}
 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must 
be specified manually.;
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
 ... 49 elided{code}

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
> Fix For: 3.0.0
>
>
> when we write empty part file for csv and header=true we fail to write 
> header. the result cannot be read back in.
> when header=true a part file with zero rows should still have header



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12139) REGEX Column Specification for Hive Queries

2021-05-18 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346581#comment-17346581
 ] 

Ranga Reddy edited comment on SPARK-12139 at 5/18/21, 6:07 AM:
---

Hi [~janewangfb]

While creating an alias for a column, the query does not work in Spark, although 
the same query works in Hive. Could you please check and let me know? If you 
want, I will create a new JIRA.
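
The failure looks like the quoted-regex column name feature treating the 
backticked alias as a pattern. A possible session-level workaround, offered only 
as an assumption and not verified against this ticket, is to resolve backticked 
identifiers literally again (or simply drop the backticks around the alias); the 
Hive and Spark outputs showing the behaviour follow below.
{code:java}
// Workaround sketch (assumption): turn off quoted-regex column resolution so
// the backticked alias is treated as a plain identifier, then retry the query.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "false")
spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
{code}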

*Hive:* 
{code:java}
hive> select `col1` as `col` from regex_test.regex_test_tbl;
OK
col
1
2
3
{code}
*Spark:*
{code:java}
scala> spark.sql("select `col1` as `col` from 
regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
'alias';
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:132)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1659)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1630)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1630)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1615)
  at scala.collection.immutable.List.flatMap(List.scala:366)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList(Analyzer.scala:1610)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$11.applyOrElse(Analyzer.scala:1381)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$11.applyOrElse(Analyzer.scala:1376)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:207)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:1376)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:1214)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
  at scala.collection.immutable.List.foreach(List.scala:431)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:182)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:176)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:132)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:160)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:159)
  at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
  at 

[jira] [Commented] (SPARK-12139) REGEX Column Specification for Hive Queries

2021-05-17 Thread Ranga Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346581#comment-17346581
 ] 

Ranga Reddy commented on SPARK-12139:
-

Hi [~janewangfb]

While creating an alias for a column, the query does not work in Spark, although 
the same query works in Hive. Could you please check and let me know? If you 
want, I will create a new JIRA.
{code:java}
scala> spark.sql("select `col1` as `col` from 
regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
'alias';
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:132)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1659)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1630)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1630)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1615)
  at scala.collection.immutable.List.flatMap(List.scala:366)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList(Analyzer.scala:1610)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$11.applyOrElse(Analyzer.scala:1381)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$11.applyOrElse(Analyzer.scala:1376)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:207)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:1376)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:1214)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
  at scala.collection.immutable.List.foreach(List.scala:431)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:182)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:176)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:132)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:160)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:159)
  at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at