Re: Unsubscribe

2023-02-08 Thread LinuxGuy
Please send an empty email to user-unsubscr...@spark.apache.org to
unsubscribe yourself from the list.


On Thu, Feb 9, 2023 at 12:38 PM fuwei...@163.com wrote:

> Unsubscribe
>


[Spark SQL]: Spark 3.2 generates different results for a query when column names have mixed casing vs the same casing

2023-02-08 Thread Amit Singh Rathore
Hi Team,

I am running a query in Spark 3.2.

val df1 = sc.parallelize(List((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)))
  .toDF("id", "col2", "col3", "col4", "col5")
val op_cols_same_case = List("id", "col2", "col3", "col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show()

This query runs fine. But when I change the casing of the op_cols to a mix
of upper and lower case ("id" and "ID"), it throws an ambiguous column
reference error.

val df1 = sc.parallelize(List((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)))
  .toDF("id", "col2", "col3", "col4", "col5")
val op_cols_mixed_case = List("id", "col2", "col3", "col4", "col5", "ID")
val df2 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df2.select("id").show()

My question is: why is the behavior different when I have duplicate columns
with the same name ("id", "id") vs the same name in different cases ("id",
"ID")? Either both should fail or neither should fail, considering
spark.sql.caseSensitive is false by default in 3.2.

Note: I checked, and this issue exists in Spark 2.4 as well. Both cases
(mixed and single casing) work in Spark 2.3.
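
For reference, a minimal workaround sketch (it sidesteps the ambiguity
rather than explaining it): deduplicate the column list case-insensitively
before the select. The names op_cols, deduped, and df3 are mine.

val op_cols = List("id", "col2", "col3", "col4", "col5", "ID")
// keep the first occurrence of each name, comparing case-insensitively
val deduped = op_cols.foldLeft(List.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}
val df3 = df1.select(deduped.head, deduped.tail: _*)
df3.select("id").show()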


Thanks
Spark user


Unsubscribe

2023-02-08 Thread fuwei901


Is sparkSession.sql now an action in Spark 3 and later?

2023-02-08 Thread Sayeh Roshan
Hi,
I remember that previously spark.sql() wasn't a final action, and you would
have needed to run something like show() for the query to actually be
executed. Today I noticed that when I run just spark.sql(), without show()
or anything else, lots of executors are fired up, and reading their logs
shows they are actually opening files and reading them. Was there a change
in Spark 3 or later that altered this behavior?
I am using Spark 3.1.2. This happens even if I disable AQE.
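
For reference, a minimal sketch of what I am running (the table name is
illustrative):

val df = spark.sql("SELECT * FROM my_table WHERE id > 10")  // executors already fire here
df.explain()  // planning only; I expected no file reads before this point
df.show()     // the action where I expected files to actually be read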
Thanks,
S.


Re: Graceful shutdown SPARK Structured Streaming

2023-02-08 Thread Brian Wylie
It's been a few years (so this approach might be out of date), but here's
what I used for PySpark as part of this SO answer (
https://stackoverflow.com/questions/45717433/stop-structured-streaming-query-gracefully/65708677
)

```
import time

# Helper method to stop a streaming query
def stop_stream_query(query, wait_time):
    """Stop a running streaming query"""
    while query.isActive:
        msg = query.status['message']
        data_avail = query.status['isDataAvailable']
        trigger_active = query.status['isTriggerActive']
        if not data_avail and not trigger_active and msg != "Initializing sources":
            print('Stopping query...')
            query.stop()
        time.sleep(0.5)

    # Okay, wait for the stop to happen
    print('Awaiting termination...')
    query.awaitTermination(wait_time)
```
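
For anyone on the Scala side, a rough equivalent sketch of the same
"stop when idle" idea (the method name is mine; the status fields come
from StreamingQueryStatus):

```
import org.apache.spark.sql.streaming.StreamingQuery

// Sketch: stop a running streaming query once it goes idle
def stopStreamQuery(query: StreamingQuery, waitTimeMs: Long): Unit = {
  while (query.isActive) {
    val status = query.status
    val idle = !status.isDataAvailable && !status.isTriggerActive &&
      status.message != "Initializing sources"
    if (idle) {
      println("Stopping query...")
      query.stop()
    }
    Thread.sleep(500)
  }
  // Wait for the stop to complete
  println("Awaiting termination...")
  query.awaitTermination(waitTimeMs)
}
```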


I'd also be interested if there is a newer/better way to do this, so
please cc me on updates :)


On Thu, May 6, 2021 at 1:08 PM Mich Talebzadeh wrote:

> That is a valid question and I am not aware of any new addition to Spark
> Structured Streaming (SSS) in newer releases for this graceful shutdown.
>
> Going back to my earlier explanation, there are occasions when you may
> want to stop the Spark program gracefully. Gracefully here means that the
> Spark application handles the last streaming message completely and then
> terminates the application. This is different from invoking interrupts such
> as CTRL-C. Of course, one can terminate the process based on the following:
>
>
>    1. query.awaitTermination(): waits for the termination of this query,
>       either by stop() or with an error.
>    2. query.awaitTermination(timeoutMs): returns true if this query is
>       terminated within the timeout in milliseconds.
>
> So the first one above waits until an interrupt signal is received. The
> second one counts down the timeout and exits when the timeout in
> milliseconds is reached.
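>
> A sketch of the second approach, polling with a bounded timeout and an
> external stop flag (stopRequested() here is a hypothetical check, e.g. a
> marker file or a control message):
>
>    // awaitTermination(timeoutMs) returns true once the query has terminated
>    while (!query.awaitTermination(10000)) {
>      if (stopRequested()) query.stop()
>    }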
>
> The issue is that one needs to predict how long the streaming job needs to
> run. Clearly, any interrupt at the terminal or OS level (killing the
> process) may leave the processing terminated without proper completion of
> the streaming process.
> So I gather that if we agree on what constitutes a graceful shutdown, we
> can consider both the tooling Spark itself offers and the solutions we can
> come up with.
>
> HTH
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 6 May 2021 at 13:28, ayan guha wrote:
>
>> What are some other "newer" methodologies?
>>
>> Really interested to understand what is possible here, as this is a topic
>> that has come up in this forum time and again.
>>
>> On Thu, 6 May 2021 at 5:13 pm, Gourav Sengupta <
>> gourav.sengupta.develo...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> thanks a ton for your kind response; it looks like we are still using
>>> the earlier methodologies for stopping a Spark streaming program
>>> gracefully.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Wed, May 5, 2021 at 6:04 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Hi,


 I believe I discussed this in this forum. I sent the following to the
 spark-dev forum as a proposed add-on to Spark functionality. This is the
 gist of it.


 Spark Structured Streaming, AKA SSS, is a very useful tool for dealing
 with Event Driven Architecture. In an Event Driven Architecture, there is
 generally a main loop that listens for events and then triggers a call-back
 function when one of those events is detected. In a streaming application,
 the application waits to receive the source messages at a set interval, or
 whenever they happen, and reacts accordingly.

 There are occasions when you may want to stop the Spark program
 gracefully. Gracefully here means that the Spark application handles the
 last streaming message completely and then terminates the application. This
 is different from invoking interrupts such as CTRL-C. Of course, one can
 terminate the process based on the following:


    1. query.awaitTermination(): waits for the termination of this query,
       either by stop() or with an error.
    2. query.awaitTermination(timeoutMs): returns true if this query is
       terminated within the timeout in milliseconds.

 So the first one above waits until an interrupt signal is received. The
 second one counts down the timeout and exits when the timeout in
 milliseconds is reached.

 The issue is that one needs to predict how long the streaming job needs
 to run. Clearly any interrupt at the terminal or OS level 

Unsubscribe

2023-02-08 Thread fuwei901