Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it
arrives, regardless of whether certain columns, such as the timestamp
column, contain NULL values. This behavior is useful when you want to
handle incomplete or missing data gracefully within your stateful
processing logic. Because rows with NULL timestamps still trigger calls to
the stateful function, you can implement custom handling strategies there,
such as skipping incomplete records.
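
To make that concrete, here is a minimal sketch (untested; the column names,
state layout and output schema are illustrative assumptions, not taken from
your job) of a stateful function that simply skips rows whose timestamp is
NULL:

```python
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_with_null_guard(
    key: Tuple, pdfs: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    # Recover the running count for this key, or start at zero.
    (count,) = state.get if state.exists else (0,)
    for pdf in pdfs:
        # Rows with a NULL timestamp still arrive here; drop them before
        # updating state so they cannot skew the aggregation.
        count += int(pdf["timestamp"].notna().sum())
    state.update((count,))
    yield pd.DataFrame({"key": [key[0]], "count": [count]})

# Wiring it up (grouping column and schemas are placeholders):
# result = df.groupBy("key").applyInPandasWithState(
#     count_with_null_guard,
#     outputStructType="key string, count long",
#     stateStructType="count long",
#     outputMode="update",
#     timeoutConf=GroupStateTimeout.NoTimeout,
# )
```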


However, it is important to understand that *the watermark is not advanced
for NULL timestamps*. The watermark is used for event-time processing in
Spark Structured Streaming to track the progress of event time in your data
stream, and it is typically derived from the timestamp column. Since NULL
timestamps do not contribute to watermark advancement, such records are
never treated as late and dropped, but they also do not move event time
forward, so any state timeout or window finalization that depends on the
watermark will not be triggered by them.
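
For context, a minimal sketch (untested; `events_df` and the `timestamp`
column are placeholders for the actual stream) of where the watermark is
declared relative to the stateful call:

```python
# The watermark is derived only from non-NULL values of the event-time
# column, roughly max(timestamp seen so far) minus the delay. Rows with a
# NULL timestamp still reach applyInPandasWithState but never move it forward.
with_wm = events_df.withWatermark("timestamp", "10 seconds")
```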

Regarding whether you can rely on this behavior for your production code,
it largely depends on your requirements and use case. If your application
logic is designed to handle NULL timestamps appropriately and you have
tested it to ensure it behaves as expected, then you can generally rely on
this behavior. FYI, I have not tested it myself, so I cannot provide a
definitive answer.

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD, Imperial College London
London, United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 27 May 2024 at 22:04, Juan Casse  wrote:

> I am using applyInPandasWithState in PySpark 3.5.0.
>
> I noticed that records with timestamp==NULL are processed (i.e., trigger a
> call to the stateful function). And, as you would expect, does not advance
> the watermark.
>
> I am taking advantage of this in my application.
>
> My question: Is this a supported feature of Spark? Can I rely on this
> behavior for my production code?
>
> Thanks,
> Juan
>


Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich,

Thanks for the reply.

I did come across that file, but it does not appear to contain the
definition of `PartitionedFile`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala

In fact, the code snippet you shared also references the type
`PartitionedFile`.

There's actually this javadoc.io page for a `PartitionedFile`
at org.apache.spark.sql.execution.datasources for spark-sql_2.12:3.0.2:
https://javadoc.io/doc/org.apache.spark/spark-sql_2.12/3.0.2/org/apache/spark/sql/execution/datasources/PartitionedFile.html.
I double-checked the source code for version 3.0.2 and it doesn't seem to
exist there either:
https://github.com/apache/spark/tree/v3.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources

Ashley


On Mon, 8 Apr 2024 at 22:41, Mich Talebzadeh 
wrote:

> Hi,
>
> I believe this is the package
>
>
> https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
>
> And the code
>
> case class FilePartition(index: Int, files: Array[PartitionedFile])
>   extends Partition with InputPartition {
>   override def preferredLocations(): Array[String] = {
> // Computes total number of bytes that can be retrieved from each host.
> val hostToNumBytes = mutable.HashMap.empty[String, Long]
> files.foreach { file =>
>   file.locations.filter(_ != "localhost").foreach { host =>
> hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) +
> file.length
>   }
> }
>
> // Selects the first 3 hosts with the most data to be retrieved.
> hostToNumBytes.toSeq.sortBy {
>   case (host, numBytes) => numBytes
> }.reverse.take(3).map {
>   case (host, numBytes) => host
> }.toArray
>   }
> }
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
> ashley.mcmana...@quantcast.com> wrote:
>
>> Hi All,
>>
>> I've been diving into the source code to get a better understanding of
>> how file splitting works from a user perspective. I've hit a deadend at
>> `PartitionedFile`, for which I cannot seem to find a definition? It appears
>> though it should be found at
>> org.apache.spark.sql.execution.datasources but I find no definition in
>> the entire source code. Am I missing something?
>>
>> I appreciate there may be an obvious answer here, apologies if I'm being
>> naive.
>>
>> Thanks,
>> Ashley McManamon
>>
>>


Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
Interesting. Below is the corrected code, following the suggestion in
[SPARK-47718] ".sql() does not recognize watermark defined upstream"
(https://issues.apache.org/jira/browse/SPARK-47718).

# Define schema for parsing Kafka messages
schema = StructType([
StructField('createTime', TimestampType(), True),
StructField('orderId', LongType(), True),
StructField('payAmount', DoubleType(), True),
StructField('payPlatform', IntegerType(), True),
StructField('provinceId', IntegerType(), True),
])

# Read streaming data from Kafka source
streaming_df = session.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "payment_msg") \
.option("startingOffsets", "earliest") \
.load() \
.select(from_json(col("value").cast("string"),
schema).alias("parsed_value")) \
.select("parsed_value.*") \
.withWatermark("createTime", "10 seconds")

# Create temporary view for SQL queries
streaming_df.createOrReplaceTempView("streaming_df")

# Define SQL query with correct window function usage: window() must be
# applied to the watermarked event-time column (createTime) directly, not
# to a quoted string literal
query = """
SELECT
    window(createTime, '1 hour', '30 minutes') as window,
    provinceId,
    sum(payAmount) as totalPayAmount
FROM streaming_df
GROUP BY provinceId, window(createTime, '1 hour', '30 minutes')
ORDER BY window.start
"""

# Write the aggregated results to Kafka sink
stream = session.sql(query) \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Tue, 9 Apr 2024 at 21:45, 刘唯  wrote:

> Sorry this is not a bug but essentially a user error. Spark throws a
> really confusing error and I'm also confused. Please see the reply in the
> ticket for how to make things correct.
> https://issues.apache.org/jira/browse/SPARK-47718
>
> 刘唯 wrote on Sat, 6 Apr 2024 at 11:41:
>
>> This indeed looks like a bug. I will take some time to look into it.
>>
>> Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55:
>>
>>>
>>> hm. you are getting below
>>>
>>> AnalysisException: Append output mode not supported when there are
>>> streaming aggregations on streaming DataFrames/DataSets without watermark;
>>>
>>> The problem seems to be that you are using the append output mode when
>>> writing the streaming query results to Kafka. This mode is designed for
>>> scenarios where you want to append new data to an existing dataset at the
>>> sink (in this case, the "sink" topic in Kafka). However, your query
>>> involves a streaming aggregation: group by provinceId, window('createTime',
>>> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
>>> requires a watermark to ensure exactly-once processing when using
>>> aggregations with append mode. Your code already defines a watermark on the
>>> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
>>> "10 seconds")). However, the error message indicates it is missing on the
>>> start column. Try adding watermark to "start" Column: Modify your code as
>>> below  to include a watermark on the "start" column generated by the
>>> window function:
>>>
>>> from pyspark.sql.functions import col, from_json, explode, window, sum,
>>> watermark
>>>
>>> streaming_df = session.readStream \
>>>   .format("kafka") \
>>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>>   .option("subscribe", "payment_msg") \
>>>   .option("startingOffsets", "earliest") \
>>>   .load() \
>>>   .select(from_json(col("value").cast("string"),
>>> schema).alias("parsed_value")) \
>>>   .select("parsed_value.*") \
>>>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
>>> createTime
>>>
>>> *# Modified section with watermark on 'start' column*
>>> streaming_df = streaming_df.groupBy(
>>>   col("provinceId"),
>>>   window(col("createTime"), "1 hour", "30 minutes")
>>> ).agg(
>>>   sum(col("payAmount")).alias("totalPayAmount")
>>> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
>>> window-generated 'start'
>>>
>>> # Rest of the code remains the same
>>> streaming_df.createOrReplaceTempView("streaming_df")
>>>
>>> spark.sql("""
>>> SELECT
>>>   window.start, window.end, provinceId, totalPayAmount
>>> FROM streaming_df
>>> ORDER BY window.start
>>> """) \
>>> .writeStream \
>>> .format("kafka") \
>>> .option("checkpointLocation", "chec

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry this is not a bug but essentially a user error. Spark throws a really
confusing error and I'm also confused. Please see the reply in the ticket
for how to make things correct.
https://issues.apache.org/jira/browse/SPARK-47718

刘唯 wrote on Sat, 6 Apr 2024 at 11:41:

> This indeed looks like a bug. I will take some time to look into it.
>
> Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55:
>
>>
>> hm. you are getting below
>>
>> AnalysisException: Append output mode not supported when there are
>> streaming aggregations on streaming DataFrames/DataSets without watermark;
>>
>> The problem seems to be that you are using the append output mode when
>> writing the streaming query results to Kafka. This mode is designed for
>> scenarios where you want to append new data to an existing dataset at the
>> sink (in this case, the "sink" topic in Kafka). However, your query
>> involves a streaming aggregation: group by provinceId, window('createTime',
>> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
>> requires a watermark to ensure exactly-once processing when using
>> aggregations with append mode. Your code already defines a watermark on the
>> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
>> "10 seconds")). However, the error message indicates it is missing on the
>> start column. Try adding watermark to "start" Column: Modify your code as
>> below  to include a watermark on the "start" column generated by the
>> window function:
>>
>> from pyspark.sql.functions import col, from_json, explode, window, sum,
>> watermark
>>
>> streaming_df = session.readStream \
>>   .format("kafka") \
>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>   .option("subscribe", "payment_msg") \
>>   .option("startingOffsets", "earliest") \
>>   .load() \
>>   .select(from_json(col("value").cast("string"),
>> schema).alias("parsed_value")) \
>>   .select("parsed_value.*") \
>>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
>> createTime
>>
>> *# Modified section with watermark on 'start' column*
>> streaming_df = streaming_df.groupBy(
>>   col("provinceId"),
>>   window(col("createTime"), "1 hour", "30 minutes")
>> ).agg(
>>   sum(col("payAmount")).alias("totalPayAmount")
>> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
>> window-generated 'start'
>>
>> # Rest of the code remains the same
>> streaming_df.createOrReplaceTempView("streaming_df")
>>
>> spark.sql("""
>> SELECT
>>   window.start, window.end, provinceId, totalPayAmount
>> FROM streaming_df
>> ORDER BY window.start
>> """) \
>> .writeStream \
>> .format("kafka") \
>> .option("checkpointLocation", "checkpoint") \
>> .option("kafka.bootstrap.servers", "localhost:9092") \
>> .option("topic", "sink") \
>> .start()
>>
>> Try and see how it goes
>>
>> HTH
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Tue, 2 Apr 2024 at 22:43, Chloe He 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Thank you so much for your response. I really appreciate your help!
>>>
>>> You mentioned "defining the watermark using the withWatermark function
>>> on the streaming_df before creating the temporary view” - I believe this is
>>> what I’m doing and it’s not working for me. Here is the exact code snippet
>>> that I’m running:
>>>
>>> ```
>>> >>> streaming_df = session.readStream\
>>> .format("kafka")\
>>> .option("kafka.bootstrap.servers", "localhost:9092")\
>>> .option("subscribe", "payment_msg")\
>>> .option("startingOffsets","earliest")\
>>> .load()\
>>> .select(from_json(col("value").cast("string"),
>>> schema).alias("parsed_value"))\
>>> .select("parsed_value.*")\
>>> .withWatermark("createTime", "10 seconds")
>>>
>>> >>> streaming_df.createOrReplaceTempView("streaming_df”)
>>>
>>> >>> spark.sql("""
>>> SELECT
>>> window.start, window.end, provinceId, sum(payAmount) as
>>> totalPayAmount
>>> FROM streaming

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi,

I believe this is the package

https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala

And the code

case class FilePartition(index: Int, files: Array[PartitionedFile])
  extends Partition with InputPartition {
  override def preferredLocations(): Array[String] = {
// Computes total number of bytes that can be retrieved from each host.
val hostToNumBytes = mutable.HashMap.empty[String, Long]
files.foreach { file =>
  file.locations.filter(_ != "localhost").foreach { host =>
hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) +
file.length
  }
}

// Selects the first 3 hosts with the most data to be retrieved.
hostToNumBytes.toSeq.sortBy {
  case (host, numBytes) => numBytes
}.reverse.take(3).map {
  case (host, numBytes) => host
}.toArray
  }
}

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
ashley.mcmana...@quantcast.com> wrote:

> Hi All,
>
> I've been diving into the source code to get a better understanding of how
> file splitting works from a user perspective. I've hit a deadend at
> `PartitionedFile`, for which I cannot seem to find a definition? It appears
> though it should be found at
> org.apache.spark.sql.execution.datasources but I find no definition in the
> entire source code. Am I missing something?
>
> I appreciate there may be an obvious answer here, apologies if I'm being
> naive.
>
> Thanks,
> Ashley McManamon
>
>


Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it.

Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55:

>
> hm. you are getting below
>
> AnalysisException: Append output mode not supported when there are
> streaming aggregations on streaming DataFrames/DataSets without watermark;
>
> The problem seems to be that you are using the append output mode when
> writing the streaming query results to Kafka. This mode is designed for
> scenarios where you want to append new data to an existing dataset at the
> sink (in this case, the "sink" topic in Kafka). However, your query
> involves a streaming aggregation: group by provinceId, window('createTime',
> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
> requires a watermark to ensure exactly-once processing when using
> aggregations with append mode. Your code already defines a watermark on the
> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
> "10 seconds")). However, the error message indicates it is missing on the
> start column. Try adding watermark to "start" Column: Modify your code as
> below  to include a watermark on the "start" column generated by the
> window function:
>
> from pyspark.sql.functions import col, from_json, explode, window, sum,
> watermark
>
> streaming_df = session.readStream \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("subscribe", "payment_msg") \
>   .option("startingOffsets", "earliest") \
>   .load() \
>   .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value")) \
>   .select("parsed_value.*") \
>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
> createTime
>
> *# Modified section with watermark on 'start' column*
> streaming_df = streaming_df.groupBy(
>   col("provinceId"),
>   window(col("createTime"), "1 hour", "30 minutes")
> ).agg(
>   sum(col("payAmount")).alias("totalPayAmount")
> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
> window-generated 'start'
>
> # Rest of the code remains the same
> streaming_df.createOrReplaceTempView("streaming_df")
>
> spark.sql("""
> SELECT
>   window.start, window.end, provinceId, totalPayAmount
> FROM streaming_df
> ORDER BY window.start
> """) \
> .writeStream \
> .format("kafka") \
> .option("checkpointLocation", "checkpoint") \
> .option("kafka.bootstrap.servers", "localhost:9092") \
> .option("topic", "sink") \
> .start()
>
> Try and see how it goes
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Tue, 2 Apr 2024 at 22:43, Chloe He 
> wrote:
>
>> Hi Mich,
>>
>> Thank you so much for your response. I really appreciate your help!
>>
>> You mentioned "defining the watermark using the withWatermark function on
>> the streaming_df before creating the temporary view” - I believe this is
>> what I’m doing and it’s not working for me. Here is the exact code snippet
>> that I’m running:
>>
>> ```
>> >>> streaming_df = session.readStream\
>> .format("kafka")\
>> .option("kafka.bootstrap.servers", "localhost:9092")\
>> .option("subscribe", "payment_msg")\
>> .option("startingOffsets","earliest")\
>> .load()\
>> .select(from_json(col("value").cast("string"),
>> schema).alias("parsed_value"))\
>> .select("parsed_value.*")\
>> .withWatermark("createTime", "10 seconds")
>>
>> >>> streaming_df.createOrReplaceTempView("streaming_df”)
>>
>> >>> spark.sql("""
>> SELECT
>> window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
>> FROM streaming_df
>> GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
>> ORDER BY window.start
>> """)\
>>   .withWatermark("start", "10 seconds")\
>>   .writeStream\
>>   .format("kafka") \
>>   .option("checkpointLocation", "checkpoint") \
>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>   .option("topic", "sink") \
>>   .start()
>>
>> AnalysisException: Append output mode not 

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
hm. you are getting below

AnalysisException: Append output mode not supported when there are
streaming aggregations on streaming DataFrames/DataSets without watermark;

The problem seems to be that you are using the append output mode when
writing the streaming query results to Kafka. This mode is designed for
scenarios where you want to append new data to an existing dataset at the
sink (in this case, the "sink" topic in Kafka). However, your query
involves a streaming aggregation: group by provinceId, window('createTime',
'1 hour', '30 minutes'). The problem is that Spark Structured Streaming
requires a watermark to ensure exactly-once processing when using
aggregations with append mode. Your code already defines a watermark on the
"createTime" column with a delay of 10 seconds (withWatermark("createTime",
"10 seconds")). However, the error message indicates it is missing on the
start column. Try adding watermark to "start" Column: Modify your code as
below  to include a watermark on the "start" column generated by the window
function:

from pyspark.sql.functions import col, from_json, explode, window, sum

streaming_df = session.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "payment_msg") \
  .option("startingOffsets", "earliest") \
  .load() \
  .select(from_json(col("value").cast("string"),
schema).alias("parsed_value")) \
  .select("parsed_value.*") \
  .withWatermark("createTime", "10 seconds")  # Existing watermark on
createTime

*# Modified section with watermark on 'start' column*
streaming_df = streaming_df.groupBy(
  col("provinceId"),
  window(col("createTime"), "1 hour", "30 minutes")
).agg(
  sum(col("payAmount")).alias("totalPayAmount")
).withWatermark("start", "10 seconds")  # Watermark on window-generated 'start'

# Rest of the code remains the same
streaming_df.createOrReplaceTempView("streaming_df")

spark.sql("""
SELECT
  window.start, window.end, provinceId, totalPayAmount
FROM streaming_df
ORDER BY window.start
""") \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()

Try and see how it goes

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Tue, 2 Apr 2024 at 22:43, Chloe He  wrote:

> Hi Mich,
>
> Thank you so much for your response. I really appreciate your help!
>
> You mentioned "defining the watermark using the withWatermark function on
> the streaming_df before creating the temporary view” - I believe this is
> what I’m doing and it’s not working for me. Here is the exact code snippet
> that I’m running:
>
> ```
> >>> streaming_df = session.readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "payment_msg")\
> .option("startingOffsets","earliest")\
> .load()\
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))\
> .select("parsed_value.*")\
> .withWatermark("createTime", "10 seconds")
>
> >>> streaming_df.createOrReplaceTempView("streaming_df”)
>
> >>> spark.sql("""
> SELECT
> window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
> FROM streaming_df
> GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
> ORDER BY window.start
> """)\
>   .withWatermark("start", "10 seconds")\
>   .writeStream\
>   .format("kafka") \
>   .option("checkpointLocation", "checkpoint") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("topic", "sink") \
>   .start()
>
> AnalysisException: Append output mode not supported when there are
> streaming aggregations on streaming DataFrames/DataSets without watermark;
> EventTimeWatermark start#37: timestamp, 10 seconds
> ```
>
> I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks
> again!
>
> Best,
> Chloe
>
>
> On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> > ok let us tak

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich,

Thank you so much for your response. I really appreciate your help!

You mentioned "defining the watermark using the withWatermark function on the 
streaming_df before creating the temporary view” - I believe this is what I’m 
doing and it’s not working for me. Here is the exact code snippet that I’m 
running:

```
>>> streaming_df = session.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "payment_msg")\
.option("startingOffsets","earliest")\
.load()\
.select(from_json(col("value").cast("string"), 
schema).alias("parsed_value"))\
.select("parsed_value.*")\
.withWatermark("createTime", "10 seconds")

>>> streaming_df.createOrReplaceTempView("streaming_df”)

>>> spark.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM streaming_df
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")\
  .withWatermark("start", "10 seconds")\
  .writeStream\
  .format("kafka") \
  .option("checkpointLocation", "checkpoint") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("topic", "sink") \
  .start()

AnalysisException: Append output mode not supported when there are streaming 
aggregations on streaming DataFrames/DataSets without watermark;
EventTimeWatermark start#37: timestamp, 10 seconds
```

I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks again!

Best,
Chloe


On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> ok let us take it for a test.
> 
> The original code of mine
> 
> def fetch_data(self):
> self.sc.setLogLevel("ERROR")
> schema = StructType() \
>  .add("rowkey", StringType()) \
>  .add("timestamp", TimestampType()) \
>  .add("temperature", IntegerType())
> checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
> try:
> 
> # construct a streaming dataframe 'streamingDataFrame' that
> subscribes to topic temperature
> streamingDataFrame = self.spark \
> .readStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers",
> config['MDVariables']['bootstrapServers'],) \
> .option("schema.registry.url",
> config['MDVariables']['schemaRegistryURL']) \
> .option("group.id", config['common']['appName']) \
> .option("zookeeper.connection.timeout.ms",
> config['MDVariables']['zookeeperConnectionTimeoutMs']) \
> .option("rebalance.backoff.ms",
> config['MDVariables']['rebalanceBackoffMS']) \
> .option("zookeeper.session.timeout.ms",
> config['MDVariables']['zookeeperSessionTimeOutMs']) \
> .option("auto.commit.interval.ms",
> config['MDVariables']['autoCommitIntervalMS']) \
> .option("subscribe", "temperature") \
> .option("failOnDataLoss", "false") \
> .option("includeHeaders", "true") \
> .option("startingOffsets", "earliest") \
> .load() \
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))
> 
> 
> resultC = streamingDataFrame.select( \
>  col("parsed_value.rowkey").alias("rowkey") \
>, col("parsed_value.timestamp").alias("timestamp") \
>, col("parsed_value.temperature").alias("temperature"))
> 
> """
> We work out the window and the AVG(temperature) in the window's
> timeframe below
> This should return back the following Dataframe as struct
> 
>  root
>  |-- window: struct (nullable = false)
>  ||-- start: timestamp (nullable = true)
>  ||-- end: timestamp (nullable = true)
>  |-- avg(temperature): double (nullable = true)
> 
> """
> resultM = resultC. \
>  withWatermark("timestamp", "5 minutes"). \
>  groupBy(window(resultC.timestamp, "5 minutes", "5
> minutes")). \
>  avg('temperature')
> 
> # We take the above DataFrame and flatten it to get the columns
> aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature"
> resultMF = resultM. \
>select( \
> F.col("window.start").alias("startOfWindow") \
>   , F.col("window.end").alias("endOfWindow") \
>   ,
> F.col("avg(temperature)").alias("AVGTemperature"))
> 
> # Kafka producer requires a key, value pair. We generate UUID
> key as the unique identifier of Kafka record
> uuidUdf= F.udf(lambda : str(uuid.uuid4()),StringType())
> 
> """
> We take DataFrame resultMF containing temperature info and
> write it to Kafka. The uuid is serialized as a str

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ok let us take it for a test.

The original code of mine

def fetch_data(self):
self.sc.setLogLevel("ERROR")
schema = StructType() \
 .add("rowkey", StringType()) \
 .add("timestamp", TimestampType()) \
 .add("temperature", IntegerType())
checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
try:

# construct a streaming dataframe 'streamingDataFrame' that
# subscribes to topic temperature
streamingDataFrame = self.spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
.option("schema.registry.url",
config['MDVariables']['schemaRegistryURL']) \
.option("group.id", config['common']['appName']) \
.option("zookeeper.connection.timeout.ms",
config['MDVariables']['zookeeperConnectionTimeoutMs']) \
.option("rebalance.backoff.ms",
config['MDVariables']['rebalanceBackoffMS']) \
.option("zookeeper.session.timeout.ms",
config['MDVariables']['zookeeperSessionTimeOutMs']) \
.option("auto.commit.interval.ms",
config['MDVariables']['autoCommitIntervalMS']) \
.option("subscribe", "temperature") \
.option("failOnDataLoss", "false") \
.option("includeHeaders", "true") \
.option("startingOffsets", "earliest") \
.load() \
.select(from_json(col("value").cast("string"),
schema).alias("parsed_value"))


resultC = streamingDataFrame.select( \
 col("parsed_value.rowkey").alias("rowkey") \
   , col("parsed_value.timestamp").alias("timestamp") \
   , col("parsed_value.temperature").alias("temperature"))

"""
We work out the window and the AVG(temperature) in the window's
timeframe below
This should return back the following Dataframe as struct

 root
 |-- window: struct (nullable = false)
 ||-- start: timestamp (nullable = true)
 ||-- end: timestamp (nullable = true)
 |-- avg(temperature): double (nullable = true)

"""
resultM = resultC. \
 withWatermark("timestamp", "5 minutes"). \
 groupBy(window(resultC.timestamp, "5 minutes", "5 minutes")). \
 avg('temperature')

# We take the above DataFrame and flatten it to get the columns
# aliased as "startOfWindow", "endOfWindow" and "AVGTemperature"
resultMF = resultM. \
   select( \
F.col("window.start").alias("startOfWindow") \
  , F.col("window.end").alias("endOfWindow") \
  , F.col("avg(temperature)").alias("AVGTemperature"))

# Kafka producer requires a key, value pair. We generate a UUID
# key as the unique identifier of the Kafka record
uuidUdf= F.udf(lambda : str(uuid.uuid4()),StringType())

"""
We take DataFrame resultMF containing temperature info and
write it to Kafka. The uuid is serialized as a string and used as the key.
We take all the columns of the DataFrame and serialize them as
a JSON string, putting the results in the "value" of the record.
"""
result = resultMF.withColumn("uuid",uuidUdf()) \
 .selectExpr("CAST(uuid AS STRING) AS key",
"to_json(struct(startOfWindow, endOfWindow, AVGTemperature)) AS value") \
 .writeStream \
 .outputMode('complete') \
 .format("kafka") \
 .option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
 .option("topic", "avgtemperature") \
 .option('checkpointLocation', checkpoint_path) \
 .queryName("avgtemperature") \
 .start()

except Exception as e:
print(f"""{e}, quitting""")
sys.exit(1)

#print(result.status)
#print(result.recentProgress)
#print(result.lastProgress)

result.awaitTermination()

Now let us try to use SQL for the entire transformation and aggregation:

#import this and anything else needed
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType,IntegerType,
FloatType, TimestampType


# Define the schema for the JSON data
schema = ... # Replace with your schema definition

# construct a streaming dataframe 'streamingDataFrame' that
# subscribes to topic temperature
streamingDataFrame = self.spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this?


On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, 
wrote:

> This issue is related to CharVarcharCodegenUtils readSidePadding method .
>
> Appending white spaces while reading ENUM data from mysql
>
> Causing issue in querying , writing the same data to Cassandra.
>
> On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, 
> wrote:
>
>> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
>> querying to Mysql Database and applying
>>
>> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working
>> as expected in spark 3.3.1 , but not working with 3.5.0.
>>
>> Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
>> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>>
>> The *st *column is ENUM in the database and it is causing the issue.
>>
>> Below is the Physical Plan of *FILTER* phase :
>>
>> For 3.3.1 :
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
>> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>>
>> For 3.5.0 :
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>>
>> -
>>
>> I have debug it and found that Spark added a property in version 3.4.0 ,
>> i.e. **spark.sql.readSideCharPadding** which has default value **true**.
>>
>> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697
>>
>> Added a new method in Class **CharVarcharCodegenUtils**
>>
>> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
>> int numChars = inputStr.numChars();
>> if (numChars == limit) {
>>   return inputStr;
>> } else if (numChars < limit) {
>>   return inputStr.rpad(limit, SPACE);
>> } else {
>>   return inputStr;
>> }
>>   }
>>
>>
>> **This method is appending some whitespace padding to the ENUM values
>> while reading and causing the Issue.**
>>
>> ---
>>
>> When I am removing the UPPER function from the where condition the
>> **FILTER** Phase looks like this :
>>
>>  +- Filter (((staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
>>  StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
>> ) OR (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
>> (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = CLOSED   ))
>>
>>
>> **You can see it has added some white space after the value and the query
>> runs fine giving the correct result.**
>>
>> But with the UPPER function I am not getting the data.
>>
>> --
>>
>> I have also tried to disable this Property *spark.sql.readSideCharPadding
>> = false* with following cases :
>>
>> 1. With Upper function in where clause :
>>It is not pushing the filters to Database and the *query works fine*.
>>
>>
>>   +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
>> (upper(st#42) = CLOSED))
>>
>> 2. But when I am removing the upper function
>>
>>  *It is pushing the filter to Mysql with the white spaces and I am not
>> getting the data. (THIS IS A CAUSING VERY BIG ISSUE)*
>>
>>   PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
>> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
>> )),EqualTo(st,CLOSED   ))]
>>
>> I cannot move this filter to JDBC read query , also I can't remove this
>> UPPER function in the where clause.
>>
>>
>> 
>>
>> Also I found same data getting written to CASSANDRA with *PADDING .*
>>
>


Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to the CharVarcharCodegenUtils readSidePadding method.

It appends white spaces while reading ENUM data from MySQL, which causes
issues both when querying and when writing the same data to Cassandra.
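
For illustration, a plain-Python sketch (not the actual Spark code path) of
why the read-side padding defeats the comparison, assuming a CHAR(13)
column as in the plans above:

```python
# readSidePadding right-pads the ENUM value to the declared CHAR length,
# so the value reaching the filter is "OPEN" followed by spaces.
padded = "OPEN".ljust(13)                  # "OPEN         "
print(padded.upper() == "OPEN")            # False -> upper(st) = 'OPEN' misses
print(padded.upper().rstrip() == "OPEN")   # True  -> trimming restores the match
```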

On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, 
wrote:

> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
> querying to Mysql Database and applying
>
> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working
> as expected in spark 3.3.1 , but not working with 3.5.0.
>
> Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>
> The *st *column is ENUM in the database and it is causing the issue.
>
> Below is the Physical Plan of *FILTER* phase :
>
> For 3.3.1 :
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>
> For 3.5.0 :
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>
> -
>
> I have debug it and found that Spark added a property in version 3.4.0 ,
> i.e. **spark.sql.readSideCharPadding** which has default value **true**.
>
> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697
>
> Added a new method in Class **CharVarcharCodegenUtils**
>
> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
> int numChars = inputStr.numChars();
> if (numChars == limit) {
>   return inputStr;
> } else if (numChars < limit) {
>   return inputStr.rpad(limit, SPACE);
> } else {
>   return inputStr;
> }
>   }
>
>
> **This method is appending some whitespace padding to the ENUM values
> while reading and causing the Issue.**
>
> ---
>
> When I am removing the UPPER function from the where condition the
> **FILTER** Phase looks like this :
>
>  +- Filter (((staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
>  StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
> ) OR (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
> (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = CLOSED   ))
>
>
> **You can see it has added some white space after the value and the query
> runs fine giving the correct result.**
>
> But with the UPPER function I am not getting the data.
>
> --
>
> I have also tried to disable this Property *spark.sql.readSideCharPadding
> = false* with following cases :
>
> 1. With Upper function in where clause :
>It is not pushing the filters to Database and the *query works fine*.
>
>   +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
> (upper(st#42) = CLOSED))
>
> 2. But when I am removing the upper function
>
>  *It is pushing the filter to Mysql with the white spaces and I am not
> getting the data. (THIS IS A CAUSING VERY BIG ISSUE)*
>
>   PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
> )),EqualTo(st,CLOSED   ))]
>
> I cannot move this filter to JDBC read query , also I can't remove this
> UPPER function in the where clause.
>
>
> 
>
> Also I found same data getting written to CASSANDRA with *PADDING .*
>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. The broadcast DataFrame size seems to be between 1
and 4 GB, so I suggest that you leave it as it is.
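
As an aside, should the broadcast ever strain the driver again, the
automatic broadcast threshold can be inspected and tuned; a small PySpark
sketch (the -1 value is only an example, and disables auto-broadcast):

```python
# Check the current limit below which Spark broadcasts a join side
# automatically, then lower it (or set -1 to disable auto-broadcast) so a
# large build side falls back to a shuffle join instead of a broadcast.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```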

I have not used the standalone mode since spark-2.4.3 so I may be missing a
fair bit of context here.  I am sure there are others like you that are
still using it!

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 17 Aug 2023 at 23:33, Patrick Tucci  wrote:

> No, the driver memory was not set explicitly. So it was likely the default
> value, which appears to be 1GB.
>
> On Thu, Aug 17, 2023, 16:49 Mich Talebzadeh 
> wrote:
>
>> One question, what was the driver memory before setting it to 4G? Did you
>> have it set at all before?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 17 Aug 2023 at 21:01, Patrick Tucci 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Here are my config values from spark-defaults.conf:
>>>
>>> spark.eventLog.enabled true
>>> spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs
>>> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
>>> spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
>>> spark.history.fs.update.interval 10s
>>> spark.history.ui.port 18080
>>> spark.sql.warehouse.dir hdfs://10.0.50.1:8020/user/spark/warehouse
>>> spark.executor.cores 4
>>> spark.executor.memory 16000M
>>> spark.sql.legacy.createHiveTableByDefault false
>>> spark.driver.host 10.0.50.1
>>> spark.scheduler.mode FAIR
>>> spark.driver.memory 4g #added 2023-08-17
>>>
>>> The only application that runs on the cluster is the Spark Thrift
>>> server, which I launch like so:
>>>
>>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077
>>>
>>> The cluster runs in standalone mode and does not use Yarn for resource
>>> management. As a result, the Spark Thrift server acquires all available
>>> cluster resources when it starts. This is okay; as of right now, I am the
>>> only user of the cluster. If I add more users, they will also be SQL users,
>>> submitting queries through the Thrift server.
>>>
>>> Let me know if you have any other questions or thoughts.
>>>
>>> Thanks,
>>>
>>> Patrick
>>>
>>> On Thu, Aug 17, 2023 at 3:09 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hello Paatrick,

 As a matter of interest what parameters and their respective values do
 you use in spark-submit. I assume it is running in YARN mode.

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Thu, 17 Aug 2023 at 19:36, Patrick Tucci 
 wrote:

> Hi Mich,
>
> Yes, that's the sequence of events. I think the big breakthrough is
> that (for now at least) Spark is throwing errors instead of the queries
> hanging. Which is a big step forward. I can at least troubleshoot issues 
> if
> I know what they are.
>
> When I reflect on the issues I faced and the solutions, my issue may
> have been driver memory all along. I just couldn't determine that was the
> issue because I never saw any errors. In one case, converting a LEFT JOIN
> to an inner JOIN caused the query to run. In another case, replacing a 
> text
> field with an int ID and JOINing on the ID column worked. Per your advice,
> changing file formats from ORC to Parquet solved one issue. These
> interventions could have changed the wa

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
No, the driver memory was not set explicitly. So it was likely the default
value, which appears to be 1GB.

On Thu, Aug 17, 2023, 16:49 Mich Talebzadeh 
wrote:

> One question, what was the driver memory before setting it to 4G? Did you
> have it set at all before?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 17 Aug 2023 at 21:01, Patrick Tucci 
> wrote:
>
>> Hi Mich,
>>
>> Here are my config values from spark-defaults.conf:
>>
>> spark.eventLog.enabled true
>> spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs
>> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
>> spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
>> spark.history.fs.update.interval 10s
>> spark.history.ui.port 18080
>> spark.sql.warehouse.dir hdfs://10.0.50.1:8020/user/spark/warehouse
>> spark.executor.cores 4
>> spark.executor.memory 16000M
>> spark.sql.legacy.createHiveTableByDefault false
>> spark.driver.host 10.0.50.1
>> spark.scheduler.mode FAIR
>> spark.driver.memory 4g #added 2023-08-17
>>
>> The only application that runs on the cluster is the Spark Thrift server,
>> which I launch like so:
>>
>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077
>>
>> The cluster runs in standalone mode and does not use Yarn for resource
>> management. As a result, the Spark Thrift server acquires all available
>> cluster resources when it starts. This is okay; as of right now, I am the
>> only user of the cluster. If I add more users, they will also be SQL users,
>> submitting queries through the Thrift server.
>>
>> Let me know if you have any other questions or thoughts.
>>
>> Thanks,
>>
>> Patrick
>>
>> On Thu, Aug 17, 2023 at 3:09 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hello Paatrick,
>>>
>>> As a matter of interest what parameters and their respective values do
>>> you use in spark-submit. I assume it is running in YARN mode.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 17 Aug 2023 at 19:36, Patrick Tucci 
>>> wrote:
>>>
 Hi Mich,

 Yes, that's the sequence of events. I think the big breakthrough is
 that (for now at least) Spark is throwing errors instead of the queries
 hanging. Which is a big step forward. I can at least troubleshoot issues if
 I know what they are.

 When I reflect on the issues I faced and the solutions, my issue may
 have been driver memory all along. I just couldn't determine that was the
 issue because I never saw any errors. In one case, converting a LEFT JOIN
 to an inner JOIN caused the query to run. In another case, replacing a text
 field with an int ID and JOINing on the ID column worked. Per your advice,
 changing file formats from ORC to Parquet solved one issue. These
 interventions could have changed the way Spark needed to broadcast data to
 execute the query, thereby reducing demand on the memory-constrained 
 driver.

 Fingers crossed this is the solution. I will reply to this thread if
 the issue comes up again (hopefully it doesn't!).

 Thanks again,

 Patrick

 On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Patrik,
>
> glad that you have managed to sort this problem out. Hopefully it will
> go away for good.
>
> Still we are in the dark about how this problem is going away and
> coming back :( As I recall the chronology of events were as follows:
>
>
>1. The Issue with hanging Spark job reported
>2. concurrency on Hive metastore (single threaded Derby DB) was
>identified as a possible cause
>3. You changed the underlying Hive table formats from ORC to
>Parquet and somehow it worked
>4. The issue was reported again
>5. You upgraded 

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
One question, what was the driver memory before setting it to 4G? Did you
have it set at all before?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 17 Aug 2023 at 21:01, Patrick Tucci  wrote:

> Hi Mich,
>
> Here are my config values from spark-defaults.conf:
>
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
> spark.history.fs.update.interval 10s
> spark.history.ui.port 18080
> spark.sql.warehouse.dir hdfs://10.0.50.1:8020/user/spark/warehouse
> spark.executor.cores 4
> spark.executor.memory 16000M
> spark.sql.legacy.createHiveTableByDefault false
> spark.driver.host 10.0.50.1
> spark.scheduler.mode FAIR
> spark.driver.memory 4g #added 2023-08-17
>
> The only application that runs on the cluster is the Spark Thrift server,
> which I launch like so:
>
> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077
>
> The cluster runs in standalone mode and does not use Yarn for resource
> management. As a result, the Spark Thrift server acquires all available
> cluster resources when it starts. This is okay; as of right now, I am the
> only user of the cluster. If I add more users, they will also be SQL users,
> submitting queries through the Thrift server.
>
> Let me know if you have any other questions or thoughts.
>
> Thanks,
>
> Patrick
>
> On Thu, Aug 17, 2023 at 3:09 PM Mich Talebzadeh 
> wrote:
>
>> Hello Paatrick,
>>
>> As a matter of interest what parameters and their respective values do
>> you use in spark-submit. I assume it is running in YARN mode.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 17 Aug 2023 at 19:36, Patrick Tucci 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Yes, that's the sequence of events. I think the big breakthrough is that
>>> (for now at least) Spark is throwing errors instead of the queries hanging.
>>> Which is a big step forward. I can at least troubleshoot issues if I know
>>> what they are.
>>>
>>> When I reflect on the issues I faced and the solutions, my issue may
>>> have been driver memory all along. I just couldn't determine that was the
>>> issue because I never saw any errors. In one case, converting a LEFT JOIN
>>> to an inner JOIN caused the query to run. In another case, replacing a text
>>> field with an int ID and JOINing on the ID column worked. Per your advice,
>>> changing file formats from ORC to Parquet solved one issue. These
>>> interventions could have changed the way Spark needed to broadcast data to
>>> execute the query, thereby reducing demand on the memory-constrained driver.
>>>
>>> Fingers crossed this is the solution. I will reply to this thread if the
>>> issue comes up again (hopefully it doesn't!).
>>>
>>> Thanks again,
>>>
>>> Patrick
>>>
>>> On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Patrik,

 glad that you have managed to sort this problem out. Hopefully it will
 go away for good.

 Still we are in the dark about how this problem is going away and
 coming back :( As I recall the chronology of events were as follows:


1. The Issue with hanging Spark job reported
2. concurrency on Hive metastore (single threaded Derby DB) was
identified as a possible cause
3. You changed the underlying Hive table formats from ORC to
Parquet and somehow it worked
4. The issue was reported again
5. You upgraded the spark version from 3.4.0 to 3.4.1 (as a
possible underlying issue) and encountered driver memory limitation.
6. you allocated more memory to the driver and it is running ok for
now,
7. It appears that you are doing some join between a large dataset
and a smaller dataset. Spark decides to do broadcast join by ta

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich,

Here are my config values from spark-defaults.conf:

spark.eventLog.enabled true
spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.sql.warehouse.dir hdfs://10.0.50.1:8020/user/spark/warehouse
spark.executor.cores 4
spark.executor.memory 16000M
spark.sql.legacy.createHiveTableByDefault false
spark.driver.host 10.0.50.1
spark.scheduler.mode FAIR
spark.driver.memory 4g #added 2023-08-17
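
With event logging and the history server configured as above, the Spark UI is
the quickest way to see where a query is stuck. A minimal sketch, assuming the
host, log directory and port from the config above:

# start the history server; it reads spark.history.* from spark-defaults.conf
~/spark/sbin/start-history-server.sh

# completed applications: http://10.0.50.1:18080
# the live UI of the running Thrift server (typically http://10.0.50.1:4040)
# shows active jobs/stages and the SQL tab with running query plans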

The only application that runs on the cluster is the Spark Thrift server,
which I launch like so:

~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077

The cluster runs in standalone mode and does not use Yarn for resource
management. As a result, the Spark Thrift server acquires all available
cluster resources when it starts. This is okay; as of right now, I am the
only user of the cluster. If I add more users, they will also be SQL users,
submitting queries through the Thrift server.
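
If more users or applications are added later, the Thrift server can also be
started with explicit caps so it does not claim every core and executor in the
standalone cluster. A rough sketch (the numbers are illustrative only):

~/spark/sbin/start-thriftserver.sh \
  --master spark://10.0.50.1:7077 \
  --total-executor-cores 32 \
  --executor-memory 16g \
  --driver-memory 8g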

Let me know if you have any other questions or thoughts.

Thanks,

Patrick

On Thu, Aug 17, 2023 at 3:09 PM Mich Talebzadeh 
wrote:

> Hello Patrick,
>
> As a matter of interest what parameters and their respective values do
> you use in spark-submit. I assume it is running in YARN mode.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 17 Aug 2023 at 19:36, Patrick Tucci 
> wrote:
>
>> Hi Mich,
>>
>> Yes, that's the sequence of events. I think the big breakthrough is that
>> (for now at least) Spark is throwing errors instead of the queries hanging.
>> Which is a big step forward. I can at least troubleshoot issues if I know
>> what they are.
>>
>> When I reflect on the issues I faced and the solutions, my issue may have
>> been driver memory all along. I just couldn't determine that was the issue
>> because I never saw any errors. In one case, converting a LEFT JOIN to an
>> inner JOIN caused the query to run. In another case, replacing a text field
>> with an int ID and JOINing on the ID column worked. Per your advice,
>> changing file formats from ORC to Parquet solved one issue. These
>> interventions could have changed the way Spark needed to broadcast data to
>> execute the query, thereby reducing demand on the memory-constrained driver.
>>
>> Fingers crossed this is the solution. I will reply to this thread if the
>> issue comes up again (hopefully it doesn't!).
>>
>> Thanks again,
>>
>> Patrick
>>
>> On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> glad that you have managed to sort this problem out. Hopefully it will
>>> go away for good.
>>>
>>> Still we are in the dark about how this problem is going away and coming
>>> back :( As I recall the chronology of events were as follows:
>>>
>>>
>>>1. The Issue with hanging Spark job reported
>>>2. concurrency on Hive metastore (single threaded Derby DB) was
>>>identified as a possible cause
>>>3. You changed the underlying Hive table formats from ORC to Parquet
>>>and somehow it worked
>>>4. The issue was reported again
>>>5. You upgraded the spark version from 3.4.0 to 3.4.1 (as a possible
>>>underlying issue) and encountered driver memory limitation.
>>>6. you allocated more memory to the driver and it is running ok for
>>>now,
>>>7. It appears that you are doing some join between a large dataset
>>>and a smaller dataset. Spark decides to do broadcast join by taking the
>>>smaller dataset, fit it into the driver memory and broadcasting it to all
>>>executors.  That is where you had this issue with the memory limit on the
>>>driver. In the absence of Broadcast join, spark needs to perform a 
>>> shuffle
>>>which is an expensive process.
>>>   1. you can increase the broadcast join memory setting the conf.
>>>   parameter "spark.sql.autoBroadcastJoinThreshold" in bytes (check the 
>>> manual)
>>>   2. You can also disable the broadcast join by setting
>>>   "spark.sql.autoBroadcastJoinThreshold", -1 to see what is happening.
>>>
>>>
>>> So you still need to find a resolution to this issue. Maybe 3.4.1 has
>>> managed to fix some underlying issues.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>vie

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hello Patrick,

As a matter of interest, what parameters and their respective values do
you use in spark-submit? I assume it is running in YARN mode.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 17 Aug 2023 at 19:36, Patrick Tucci  wrote:

> Hi Mich,
>
> Yes, that's the sequence of events. I think the big breakthrough is that
> (for now at least) Spark is throwing errors instead of the queries hanging.
> Which is a big step forward. I can at least troubleshoot issues if I know
> what they are.
>
> When I reflect on the issues I faced and the solutions, my issue may have
> been driver memory all along. I just couldn't determine that was the issue
> because I never saw any errors. In one case, converting a LEFT JOIN to an
> inner JOIN caused the query to run. In another case, replacing a text field
> with an int ID and JOINing on the ID column worked. Per your advice,
> changing file formats from ORC to Parquet solved one issue. These
> interventions could have changed the way Spark needed to broadcast data to
> execute the query, thereby reducing demand on the memory-constrained driver.
>
> Fingers crossed this is the solution. I will reply to this thread if the
> issue comes up again (hopefully it doesn't!).
>
> Thanks again,
>
> Patrick
>
> On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh 
> wrote:
>
>> Hi Patrick,
>>
>> glad that you have managed to sort this problem out. Hopefully it will go
>> away for good.
>>
>> Still we are in the dark about how this problem is going away and coming
>> back :( As I recall the chronology of events were as follows:
>>
>>
>>1. The Issue with hanging Spark job reported
>>2. concurrency on Hive metastore (single threaded Derby DB) was
>>identified as a possible cause
>>3. You changed the underlying Hive table formats from ORC to Parquet
>>and somehow it worked
>>4. The issue was reported again
>>5. You upgraded the spark version from 3.4.0 to 3.4.1 (as a possible
>>underlying issue) and encountered driver memory limitation.
>>6. you allocated more memory to the driver and it is running ok for
>>now,
>>7. It appears that you are doing some join between a large dataset
>>and a smaller dataset. Spark decides to do broadcast join by taking the
>>smaller dataset, fit it into the driver memory and broadcasting it to all
>>executors.  That is where you had this issue with the memory limit on the
>>driver. In the absence of Broadcast join, spark needs to perform a shuffle
>>which is an expensive process.
>>   1. you can increase the broadcast join memory setting the conf.
>>   parameter "spark.sql.autoBroadcastJoinThreshold" in bytes (check the 
>> manual)
>>   2. You can also disable the broadcast join by setting
>>   "spark.sql.autoBroadcastJoinThreshold", -1 to see what is happening.
>>
>>
>> So you still need to find a resolution to this issue. Maybe 3.4.1 has
>> managed to fix some underlying issues.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 17 Aug 2023 at 17:17, Patrick Tucci 
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> I just wanted to follow up on this issue. This issue has continued since
>>> our last correspondence. Today I had a query hang and couldn't resolve the
>>> issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After
>>> doing so, instead of the query hanging, I got an error message that the
>>> driver didn't have enough memory to broadcast objects. After increasing the
>>> driver memory, the query runs without issue.
>>>
>>> I hope this can be helpful to someone else in the future. Thanks again
>>> for the support,
>>>
>>> Patrick
>>>
>>> On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 OK I use Hive 3.1.1

 My suggestion is to put your hive issues to u...@hive.apache.org and
 for JAVA version compatibility

 They will g

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich,

Yes, that's the sequence of events. I think the big breakthrough is that
(for now at least) Spark is throwing errors instead of the queries hanging.
Which is a big step forward. I can at least troubleshoot issues if I know
what they are.

When I reflect on the issues I faced and the solutions, my issue may have
been driver memory all along. I just couldn't determine that was the issue
because I never saw any errors. In one case, converting a LEFT JOIN to an
inner JOIN caused the query to run. In another case, replacing a text field
with an int ID and JOINing on the ID column worked. Per your advice,
changing file formats from ORC to Parquet solved one issue. These
interventions could have changed the way Spark needed to broadcast data to
execute the query, thereby reducing demand on the memory-constrained driver.

Fingers crossed this is the solution. I will reply to this thread if the
issue comes up again (hopefully it doesn't!).

Thanks again,

Patrick

On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh 
wrote:

> Hi Patrick,
>
> glad that you have managed to sort this problem out. Hopefully it will go
> away for good.
>
> Still we are in the dark about how this problem is going away and coming
> back :( As I recall the chronology of events were as follows:
>
>
>1. The Issue with hanging Spark job reported
>2. concurrency on Hive metastore (single threaded Derby DB) was
>identified as a possible cause
>3. You changed the underlying Hive table formats from ORC to Parquet
>and somehow it worked
>4. The issue was reported again
>5. You upgraded the spark version from 3.4.0 to 3.4.1 (as a possible
>underlying issue) and encountered driver memory limitation.
>6. you allocated more memory to the driver and it is running ok for
>now,
>7. It appears that you are doing some join between a large dataset and
>a smaller dataset. Spark decides to do broadcast join by taking the smaller
>dataset, fit it into the driver memory and broadcasting it to all
>executors.  That is where you had this issue with the memory limit on the
>driver. In the absence of Broadcast join, spark needs to perform a shuffle
>which is an expensive process.
>   1. you can increase the broadcast join memory setting the conf.
>   parameter "spark.sql.autoBroadcastJoinThreshold" in bytes (check the 
> manual)
>   2. You can also disable the broadcast join by setting
>   "spark.sql.autoBroadcastJoinThreshold", -1 to see what is happening.
>
>
> So you still need to find a resolution to this issue. Maybe 3.4.1 has
> managed to fix some underlying issues.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 17 Aug 2023 at 17:17, Patrick Tucci 
> wrote:
>
>> Hi Everyone,
>>
>> I just wanted to follow up on this issue. This issue has continued since
>> our last correspondence. Today I had a query hang and couldn't resolve the
>> issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After
>> doing so, instead of the query hanging, I got an error message that the
>> driver didn't have enough memory to broadcast objects. After increasing the
>> driver memory, the query runs without issue.
>>
>> I hope this can be helpful to someone else in the future. Thanks again
>> for the support,
>>
>> Patrick
>>
>> On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK I use Hive 3.1.1
>>>
>>> My suggestion is to put your hive issues to u...@hive.apache.org and
>>> for JAVA version compatibility
>>>
>>> They will give you better info.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 13 Aug 2023 at 11:48, Patrick Tucci 
>>> wrote:
>>>
 I attempted to install Hive yesterday. The experience was similar to
 other attempts at installing Hive: it took a few hours and at the end of
 t

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hi Patrick,

glad that you have managed to sort this problem out. Hopefully it will go
away for good.

Still we are in the dark about how this problem is going away and coming
back :( As I recall, the chronology of events was as follows:


   1. The issue with the hanging Spark job was reported.
   2. Concurrency on the Hive metastore (single-threaded Derby DB) was
   identified as a possible cause.
   3. You changed the underlying Hive table format from ORC to Parquet and
   somehow it worked.
   4. The issue was reported again.
   5. You upgraded the Spark version from 3.4.0 to 3.4.1 (as a possible
   underlying issue) and encountered a driver memory limitation.
   6. You allocated more memory to the driver and it is running OK for now.
   7. It appears that you are doing a join between a large dataset and a
   smaller dataset. Spark decides to do a broadcast join by taking the smaller
   dataset, fitting it into the driver memory and broadcasting it to all
   executors. That is where you hit the memory limit on the driver. In the
   absence of a broadcast join, Spark needs to perform a shuffle, which is an
   expensive process.
      1. You can increase the broadcast join threshold by setting the conf
      parameter "spark.sql.autoBroadcastJoinThreshold" in bytes (check the
      manual); see the sketch below.
      2. You can also disable broadcast joins by setting
      "spark.sql.autoBroadcastJoinThreshold" to -1 to see what is happening.


So you still need to find a resolution to this issue. Maybe 3.4.1 has
managed to fix some underlying issues.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 17 Aug 2023 at 17:17, Patrick Tucci  wrote:

> Hi Everyone,
>
> I just wanted to follow up on this issue. This issue has continued since
> our last correspondence. Today I had a query hang and couldn't resolve the
> issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After
> doing so, instead of the query hanging, I got an error message that the
> driver didn't have enough memory to broadcast objects. After increasing the
> driver memory, the query runs without issue.
>
> I hope this can be helpful to someone else in the future. Thanks again for
> the support,
>
> Patrick
>
> On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh 
> wrote:
>
>> OK I use Hive 3.1.1
>>
>> My suggestion is to put your hive issues to u...@hive.apache.org and for
>> JAVA version compatibility
>>
>> They will give you better info.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sun, 13 Aug 2023 at 11:48, Patrick Tucci 
>> wrote:
>>
>>> I attempted to install Hive yesterday. The experience was similar to
>>> other attempts at installing Hive: it took a few hours and at the end of
>>> the process, I didn't have a working setup. The latest stable release would
>>> not run. I never discovered the cause, but similar StackOverflow questions
>>> suggest it might be a Java incompatibility issue. Since I didn't want to
>>> downgrade or install an additional Java version, I attempted to use the
>>> latest alpha as well. This appears to have worked, although I couldn't
>>> figure out how to get it to use the metastore_db from Spark.
>>>
>>> After turning my attention back to Spark, I determined the issue. After
>>> much troubleshooting, I discovered that if I performed a COUNT(*) using
>>> the same JOINs, the problem query worked. I removed all the columns from
>>> the SELECT statement and added them one by one until I found the culprit.
>>> It's a text field on one of the tables. When the query SELECTs this column,
>>> or attempts to filter on it, the query hangs and never completes. If I
>>> remove all explicit references to this column, the query works fine. Since
>>> I need this column in the results, I went back to the ETL and extracted the
>>> values to a dimension table. I replaced the text column in the source table
>>> with an integer ID column and the query worked without issue.
>>>
>>> On t

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Everyone,

I just wanted to follow up on this issue. This issue has continued since
our last correspondence. Today I had a query hang and couldn't resolve the
issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After
doing so, instead of the query hanging, I got an error message that the
driver didn't have enough memory to broadcast objects. After increasing the
driver memory, the query runs without issue.

I hope this can be helpful to someone else in the future. Thanks again for
the support,

Patrick

On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh 
wrote:

> OK I use Hive 3.1.1
>
> My suggestion is to put your hive issues to u...@hive.apache.org and for
> JAVA version compatibility
>
> They will give you better info.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 13 Aug 2023 at 11:48, Patrick Tucci 
> wrote:
>
>> I attempted to install Hive yesterday. The experience was similar to
>> other attempts at installing Hive: it took a few hours and at the end of
>> the process, I didn't have a working setup. The latest stable release would
>> not run. I never discovered the cause, but similar StackOverflow questions
>> suggest it might be a Java incompatibility issue. Since I didn't want to
>> downgrade or install an additional Java version, I attempted to use the
>> latest alpha as well. This appears to have worked, although I couldn't
>> figure out how to get it to use the metastore_db from Spark.
>>
>> After turning my attention back to Spark, I determined the issue. After
>> much troubleshooting, I discovered that if I performed a COUNT(*) using
>> the same JOINs, the problem query worked. I removed all the columns from
>> the SELECT statement and added them one by one until I found the culprit.
>> It's a text field on one of the tables. When the query SELECTs this column,
>> or attempts to filter on it, the query hangs and never completes. If I
>> remove all explicit references to this column, the query works fine. Since
>> I need this column in the results, I went back to the ETL and extracted the
>> values to a dimension table. I replaced the text column in the source table
>> with an integer ID column and the query worked without issue.
>>
>> On the topic of Hive, does anyone have any detailed resources for how to
>> set up Hive from scratch? Aside from the official site, since those
>> instructions didn't work for me. I'm starting to feel uneasy about building
>> my process around Spark. There really shouldn't be any instances where I
>> ask Spark to run legal ANSI SQL code and it just does nothing. In the past
>> 4 days I've run into 2 of these instances, and the solution was more voodoo
>> and magic than examining errors/logs and fixing code. I feel that I should
>> have a contingency plan in place for when I run into an issue with Spark
>> that can't be resolved.
>>
>> Thanks everyone.
>>
>>
>> On Sat, Aug 12, 2023 at 2:18 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK you would not have known unless you went through the process so to
>>> speak.
>>>
>>> Let us do something revolutionary here 😁
>>>
>>> Install hive and its metastore. You already have hadoop anyway
>>>
>>> https://cwiki.apache.org/confluence/display/hive/adminmanual+installation
>>>
>>> hive metastore
>>>
>>>
>>> https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
>>> .
>>>
>>> choose one of these
>>>
>>> derby  hive  mssql  mysql  oracle  postgres
>>>
>>> Mine is an oracle. postgres is good as well.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci 
>>> wrote:
>>>
 Yes, on premise.

 Unfortunately after installing Delta Lake and re-writing all tables as
 Delta tables, the issue persists.

 On Sat, Aug 12, 2023

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
OK I use Hive 3.1.1

My suggestion is to put your Hive issues, including the Java version
compatibility question, to u...@hive.apache.org.

They will give you better info.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 13 Aug 2023 at 11:48, Patrick Tucci  wrote:

> I attempted to install Hive yesterday. The experience was similar to other
> attempts at installing Hive: it took a few hours and at the end of the
> process, I didn't have a working setup. The latest stable release would not
> run. I never discovered the cause, but similar StackOverflow questions
> suggest it might be a Java incompatibility issue. Since I didn't want to
> downgrade or install an additional Java version, I attempted to use the
> latest alpha as well. This appears to have worked, although I couldn't
> figure out how to get it to use the metastore_db from Spark.
>
> After turning my attention back to Spark, I determined the issue. After
> much troubleshooting, I discovered that if I performed a COUNT(*) using
> the same JOINs, the problem query worked. I removed all the columns from
> the SELECT statement and added them one by one until I found the culprit.
> It's a text field on one of the tables. When the query SELECTs this column,
> or attempts to filter on it, the query hangs and never completes. If I
> remove all explicit references to this column, the query works fine. Since
> I need this column in the results, I went back to the ETL and extracted the
> values to a dimension table. I replaced the text column in the source table
> with an integer ID column and the query worked without issue.
>
> On the topic of Hive, does anyone have any detailed resources for how to
> set up Hive from scratch? Aside from the official site, since those
> instructions didn't work for me. I'm starting to feel uneasy about building
> my process around Spark. There really shouldn't be any instances where I
> ask Spark to run legal ANSI SQL code and it just does nothing. In the past
> 4 days I've run into 2 of these instances, and the solution was more voodoo
> and magic than examining errors/logs and fixing code. I feel that I should
> have a contingency plan in place for when I run into an issue with Spark
> that can't be resolved.
>
> Thanks everyone.
>
>
> On Sat, Aug 12, 2023 at 2:18 PM Mich Talebzadeh 
> wrote:
>
>> OK you would not have known unless you went through the process so to
>> speak.
>>
>> Let us do something revolutionary here 😁
>>
>> Install hive and its metastore. You already have hadoop anyway
>>
>> https://cwiki.apache.org/confluence/display/hive/adminmanual+installation
>>
>> hive metastore
>>
>>
>> https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
>> .
>>
>> choose one of these
>>
>> derby  hive  mssql  mysql  oracle  postgres
>>
>> Mine is an oracle. postgres is good as well.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci 
>> wrote:
>>
>>> Yes, on premise.
>>>
>>> Unfortunately after installing Delta Lake and re-writing all tables as
>>> Delta tables, the issue persists.
>>>
>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 ok sure.

 Is this Delta Lake going to be on-premise?

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, 

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
I attempted to install Hive yesterday. The experience was similar to other
attempts at installing Hive: it took a few hours and at the end of the
process, I didn't have a working setup. The latest stable release would not
run. I never discovered the cause, but similar StackOverflow questions
suggest it might be a Java incompatibility issue. Since I didn't want to
downgrade or install an additional Java version, I attempted to use the
latest alpha as well. This appears to have worked, although I couldn't
figure out how to get it to use the metastore_db from Spark.

After turning my attention back to Spark, I determined the issue. After
much troubleshooting, I discovered that if I performed a COUNT(*) using the
same JOINs, the problem query worked. I removed all the columns from the
SELECT statement and added them one by one until I found the culprit. It's
a text field on one of the tables. When the query SELECTs this column, or
attempts to filter on it, the query hangs and never completes. If I remove
all explicit references to this column, the query works fine. Since I need
this column in the results, I went back to the ETL and extracted the values
to a dimension table. I replaced the text column in the source table with
an integer ID column and the query worked without issue.
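
For reference, the extraction described above can be sketched entirely in SQL;
the table and column names here are hypothetical stand-ins for the real ETL:

-- hypothetical source: claims_raw(claim_id, note_text, amount)
-- 1. build a dimension table with one integer ID per distinct text value
CREATE TABLE note_dim STORED AS PARQUET AS
SELECT ROW_NUMBER() OVER (ORDER BY note_text) AS note_id, note_text
FROM (SELECT DISTINCT note_text FROM claims_raw) t;

-- 2. rebuild the fact table with the integer ID in place of the text column
--    (inner join assumes note_text is never NULL)
CREATE TABLE claims STORED AS PARQUET AS
SELECT r.claim_id, r.amount, d.note_id
FROM claims_raw r
JOIN note_dim d ON r.note_text = d.note_text;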

On the topic of Hive, does anyone have any detailed resources for how to
set up Hive from scratch, aside from the official site? Those instructions
didn't work for me. I'm starting to feel uneasy about building
my process around Spark. There really shouldn't be any instances where I
ask Spark to run legal ANSI SQL code and it just does nothing. In the past
4 days I've run into 2 of these instances, and the solution was more voodoo
and magic than examining errors/logs and fixing code. I feel that I should
have a contingency plan in place for when I run into an issue with Spark
that can't be resolved.

Thanks everyone.


On Sat, Aug 12, 2023 at 2:18 PM Mich Talebzadeh 
wrote:

> OK you would not have known unless you went through the process so to
> speak.
>
> Let us do something revolutionary here 😁
>
> Install hive and its metastore. You already have hadoop anyway
>
> https://cwiki.apache.org/confluence/display/hive/adminmanual+installation
>
> hive metastore
>
>
> https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
> .
>
> choose one of these
>
> derby  hive  mssql  mysql  oracle  postgres
>
> Mine is an oracle. postgres is good as well.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci 
> wrote:
>
>> Yes, on premise.
>>
>> Unfortunately after installing Delta Lake and re-writing all tables as
>> Delta tables, the issue persists.
>>
>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> ok sure.
>>>
>>> Is this Delta Lake going to be on-premise?
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
>>> wrote:
>>>
 Hi Mich,

 Thanks for the feedback. My original intention after reading your
 response was to stick to Hive for managing tables. Unfortunately, I'm
 running into another case of SQL scripts hanging. Since all tables are
 already Parquet, I'm out of troubleshooting options. I'm going to migrate
 to Delta Lake and see if that solves the issue.

 Thanks again for your feedback.

 Patrick

 On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Patrick,
>
> There is not anything wrong with Hive On-premise it is the best data
> warehouse there is
>
> Hive handles both ORC and Parquet formal well. They are both columnar
> implementations of relational model. What you are seeing

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
OK you would not have known unless you went through the process so to speak.

Let us do something revolutionary here 😁

Install Hive and its metastore. You already have Hadoop anyway.

https://cwiki.apache.org/confluence/display/hive/adminmanual+installation

hive metastore

https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
.

choose one of these

derby  hive  mssql  mysql  oracle  postgres

Mine is an Oracle one; Postgres is good as well.
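
As a concrete illustration of the Postgres option, the metastore connection
normally goes in hive-site.xml and the schema is created once with schematool.
The host, database name and credentials below are placeholders, not a tested
setup:

<!-- hive-site.xml -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://metastore-host:5432/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>changeme</value>
</property>

# then, from the Hive install directory, create the metastore schema once:
bin/schematool -dbType postgres -initSchema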

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Aug 2023 at 18:31, Patrick Tucci  wrote:

> Yes, on premise.
>
> Unfortunately after installing Delta Lake and re-writing all tables as
> Delta tables, the issue persists.
>
> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> ok sure.
>>
>> Is this Delta Lake going to be on-premise?
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Thanks for the feedback. My original intention after reading your
>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>> running into another case of SQL scripts hanging. Since all tables are
>>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>>> to Delta Lake and see if that solves the issue.
>>>
>>> Thanks again for your feedback.
>>>
>>> Patrick
>>>
>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Patrick,

 There is not anything wrong with Hive On-premise it is the best data
 warehouse there is

 Hive handles both ORC and Parquet formal well. They are both columnar
 implementations of relational model. What you are seeing is the Spark API
 to Hive which prefers Parquet. I found out a few years ago.

 From your point of view I suggest you stick to parquet format with Hive
 specific to Spark. As far as I know you don't have a fully independent Hive
 DB as yet.

 Anyway stick to Hive for now as you never know what issues you may be
 facing using moving to Delta Lake.

 You can also use compression

 STORED AS PARQUET
 TBLPROPERTIES ("parquet.compression"="SNAPPY")

 ALSO

 ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
 wrote:

> Thanks for the reply Stephen and Mich.
>
> Stephen, you're right, it feels like Spark is waiting for something,
> but I'm not sure what. I'm the only user on the cluster and there are
> plenty of resources (+60 cores, +250GB RAM). I even tried restarting
> Hadoop, Spark and the host servers to make sure nothing was lingering in
> the background.
>
> Mich, thank you so much, your suggestion worked. Storing the tables as
> Parquet solves the issue.
>
> Interestingly, I found that only the MemberEnrollment table needs to
> be Parquet. The ID field in MemberEnrollment is an int calculated during
> load by a ROW_NUMBER() function. Further testing found that if I hard code
> a 0 as MemberEnrollment.ID instead of using the ROW_NUMBER() function, th

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise.

Unfortunately after installing Delta Lake and re-writing all tables as
Delta tables, the issue persists.

On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh 
wrote:

> ok sure.
>
> Is this Delta Lake going to be on-premise?
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for the feedback. My original intention after reading your
>> response was to stick to Hive for managing tables. Unfortunately, I'm
>> running into another case of SQL scripts hanging. Since all tables are
>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>> to Delta Lake and see if that solves the issue.
>>
>> Thanks again for your feedback.
>>
>> Patrick
>>
>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> There is not anything wrong with Hive On-premise it is the best data
>>> warehouse there is
>>>
>>> Hive handles both ORC and Parquet formal well. They are both columnar
>>> implementations of relational model. What you are seeing is the Spark API
>>> to Hive which prefers Parquet. I found out a few years ago.
>>>
>>> From your point of view I suggest you stick to parquet format with Hive
>>> specific to Spark. As far as I know you don't have a fully independent Hive
>>> DB as yet.
>>>
>>> Anyway stick to Hive for now as you never know what issues you may be
>>> facing using moving to Delta Lake.
>>>
>>> You can also use compression
>>>
>>> STORED AS PARQUET
>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>
>>> ALSO
>>>
>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>>> wrote:
>>>
 Thanks for the reply Stephen and Mich.

 Stephen, you're right, it feels like Spark is waiting for something,
 but I'm not sure what. I'm the only user on the cluster and there are
 plenty of resources (+60 cores, +250GB RAM). I even tried restarting
 Hadoop, Spark and the host servers to make sure nothing was lingering in
 the background.

 Mich, thank you so much, your suggestion worked. Storing the tables as
 Parquet solves the issue.

 Interestingly, I found that only the MemberEnrollment table needs to be
 Parquet. The ID field in MemberEnrollment is an int calculated during load
 by a ROW_NUMBER() function. Further testing found that if I hard code a 0
 as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
 query works without issue even if both tables are ORC.

 Should I infer from this issue that the Hive components prefer Parquet
 over ORC? Furthermore, should I consider using a different table storage
 framework, like Delta Lake, instead of the Hive components? Given this
 issue and other issues I've had with Hive, I'm starting to think a
 different solution might be more robust and stable. The main condition is
 that my application operates solely through Thrift server, so I need to be
 able to connect to Spark through Thrift server and have it write tables
 using Delta Lake instead of Hive. From this StackOverflow question, it
 looks like this is possible:
 https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc

 Thanks again to everyone who replied for their help.

 Patrick


 On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Steve may have a valid point. You raised an issue with concurrent
> writes before, if I recall correctly. Since this limitation may be due to
> Hive metastore. By default Spark uses Apache Derby for its database
> persistence. *Howeve

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ok sure.

Is this Delta Lake going to be on-premise?

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Aug 2023 at 12:03, Patrick Tucci  wrote:

> Hi Mich,
>
> Thanks for the feedback. My original intention after reading your response
> was to stick to Hive for managing tables. Unfortunately, I'm running into
> another case of SQL scripts hanging. Since all tables are already Parquet,
> I'm out of troubleshooting options. I'm going to migrate to Delta Lake and
> see if that solves the issue.
>
> Thanks again for your feedback.
>
> Patrick
>
> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Patrick,
>>
>> There is not anything wrong with Hive On-premise it is the best data
>> warehouse there is
>>
>> Hive handles both ORC and Parquet formal well. They are both columnar
>> implementations of relational model. What you are seeing is the Spark API
>> to Hive which prefers Parquet. I found out a few years ago.
>>
>> From your point of view I suggest you stick to parquet format with Hive
>> specific to Spark. As far as I know you don't have a fully independent Hive
>> DB as yet.
>>
>> Anyway stick to Hive for now as you never know what issues you may be
>> facing using moving to Delta Lake.
>>
>> You can also use compression
>>
>> STORED AS PARQUET
>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>
>> ALSO
>>
>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>> wrote:
>>
>>> Thanks for the reply Stephen and Mich.
>>>
>>> Stephen, you're right, it feels like Spark is waiting for something, but
>>> I'm not sure what. I'm the only user on the cluster and there are plenty of
>>> resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
>>> and the host servers to make sure nothing was lingering in the background.
>>>
>>> Mich, thank you so much, your suggestion worked. Storing the tables as
>>> Parquet solves the issue.
>>>
>>> Interestingly, I found that only the MemberEnrollment table needs to be
>>> Parquet. The ID field in MemberEnrollment is an int calculated during load
>>> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
>>> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>>> query works without issue even if both tables are ORC.
>>>
>>> Should I infer from this issue that the Hive components prefer Parquet
>>> over ORC? Furthermore, should I consider using a different table storage
>>> framework, like Delta Lake, instead of the Hive components? Given this
>>> issue and other issues I've had with Hive, I'm starting to think a
>>> different solution might be more robust and stable. The main condition is
>>> that my application operates solely through Thrift server, so I need to be
>>> able to connect to Spark through Thrift server and have it write tables
>>> using Delta Lake instead of Hive. From this StackOverflow question, it
>>> looks like this is possible:
>>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>>
>>> Thanks again to everyone who replied for their help.
>>>
>>> Patrick
>>>
>>>
>>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Steve may have a valid point. You raised an issue with concurrent
 writes before, if I recall correctly. Since this limitation may be due to
 Hive metastore. By default Spark uses Apache Derby for its database
 persistence. *However it is limited to only one Spark session at any
 time for the purposes of metadata storage.*  That may be the cause
 here as well. Does this happen if the underlying tables are created as
 PARQUET as opposed to ORC?

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United King

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich,

Thanks for the feedback. My original intention after reading your response
was to stick to Hive for managing tables. Unfortunately, I'm running into
another case of SQL scripts hanging. Since all tables are already Parquet,
I'm out of troubleshooting options. I'm going to migrate to Delta Lake and
see if that solves the issue.

Thanks again for your feedback.

Patrick

On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh 
wrote:

> Hi Patrick,
>
> There is not anything wrong with Hive On-premise it is the best data
> warehouse there is
>
> Hive handles both ORC and Parquet formal well. They are both columnar
> implementations of relational model. What you are seeing is the Spark API
> to Hive which prefers Parquet. I found out a few years ago.
>
> From your point of view I suggest you stick to parquet format with Hive
> specific to Spark. As far as I know you don't have a fully independent Hive
> DB as yet.
>
> Anyway stick to Hive for now as you never know what issues you may be
> facing using moving to Delta Lake.
>
> You can also use compression
>
> STORED AS PARQUET
> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>
> ALSO
>
> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
> wrote:
>
>> Thanks for the reply Stephen and Mich.
>>
>> Stephen, you're right, it feels like Spark is waiting for something, but
>> I'm not sure what. I'm the only user on the cluster and there are plenty of
>> resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
>> and the host servers to make sure nothing was lingering in the background.
>>
>> Mich, thank you so much, your suggestion worked. Storing the tables as
>> Parquet solves the issue.
>>
>> Interestingly, I found that only the MemberEnrollment table needs to be
>> Parquet. The ID field in MemberEnrollment is an int calculated during load
>> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
>> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>> query works without issue even if both tables are ORC.
>>
>> Should I infer from this issue that the Hive components prefer Parquet
>> over ORC? Furthermore, should I consider using a different table storage
>> framework, like Delta Lake, instead of the Hive components? Given this
>> issue and other issues I've had with Hive, I'm starting to think a
>> different solution might be more robust and stable. The main condition is
>> that my application operates solely through Thrift server, so I need to be
>> able to connect to Spark through Thrift server and have it write tables
>> using Delta Lake instead of Hive. From this StackOverflow question, it
>> looks like this is possible:
>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>
>> Thanks again to everyone who replied for their help.
>>
>> Patrick
>>
>>
>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Steve may have a valid point. You raised an issue with concurrent writes
>>> before, if I recall correctly. Since this limitation may be due to Hive
>>> metastore. By default Spark uses Apache Derby for its database
>>> persistence. *However it is limited to only one Spark session at any
>>> time for the purposes of metadata storage.*  That may be the cause here
>>> as well. Does this happen if the underlying tables are created as PARQUET
>>> as opposed to ORC?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
>>> wrote:
>>>
 Hi Patrick,

 When this has happened to me in the past (admittedly via spark-submit)
 it has been because another job was still running and had already claimed
 some of the resources (co

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick,

There is nothing wrong with Hive on-premise; it is the best data
warehouse there is.

Hive handles both ORC and Parquet formats well. They are both columnar
implementations of the relational model. What you are seeing is the Spark API
to Hive, which prefers Parquet. I found this out a few years ago.

From your point of view I suggest you stick to the Parquet format with Hive
specific to Spark. As far as I know you don't have a fully independent Hive
DB as yet.

Anyway, stick to Hive for now, as you never know what issues you may
face moving to Delta Lake.

You can also use compression

STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")

ALSO

ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
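
Spelled out against a hypothetical table (the name and columns are
placeholders; the ANALYZE variants shown are the Spark SQL forms):

CREATE TABLE member_enrollment_pq (
  id INT,
  member_id INT,
  start_date DATE
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

-- after loading data, gather statistics so the optimizer can size joins
ANALYZE TABLE member_enrollment_pq COMPUTE STATISTICS;
ANALYZE TABLE member_enrollment_pq COMPUTE STATISTICS FOR ALL COLUMNS;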

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 11 Aug 2023 at 11:26, Patrick Tucci  wrote:

> Thanks for the reply Stephen and Mich.
>
> Stephen, you're right, it feels like Spark is waiting for something, but
> I'm not sure what. I'm the only user on the cluster and there are plenty of
> resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
> and the host servers to make sure nothing was lingering in the background.
>
> Mich, thank you so much, your suggestion worked. Storing the tables as
> Parquet solves the issue.
>
> Interestingly, I found that only the MemberEnrollment table needs to be
> Parquet. The ID field in MemberEnrollment is an int calculated during load
> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
> query works without issue even if both tables are ORC.
>
> Should I infer from this issue that the Hive components prefer Parquet
> over ORC? Furthermore, should I consider using a different table storage
> framework, like Delta Lake, instead of the Hive components? Given this
> issue and other issues I've had with Hive, I'm starting to think a
> different solution might be more robust and stable. The main condition is
> that my application operates solely through Thrift server, so I need to be
> able to connect to Spark through Thrift server and have it write tables
> using Delta Lake instead of Hive. From this StackOverflow question, it
> looks like this is possible:
> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>
> Thanks again to everyone who replied for their help.
>
> Patrick
>
>
> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh 
> wrote:
>
>> Steve may have a valid point. You raised an issue with concurrent writes
>> before, if I recall correctly. Since this limitation may be due to Hive
>> metastore. By default Spark uses Apache Derby for its database
>> persistence. *However it is limited to only one Spark session at any
>> time for the purposes of metadata storage.*  That may be the cause here
>> as well. Does this happen if the underlying tables are created as PARQUET
>> as opposed to ORC?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
>> wrote:
>>
>>> Hi Patrick,
>>>
>>> When this has happened to me in the past (admittedly via spark-submit)
>>> it has been because another job was still running and had already claimed
>>> some of the resources (cores and memory).
>>>
>>> I think this can also happen if your configuration tries to claim
>>> resources that will never be available.
>>>
>>> Cheers,
>>>
>>> SteveC
>>>
>>>
>>> On 11 Aug 2023, at 3:36 am, Patrick Tucci 
>>> wrote:
>>>
>>> Hello,
>>>
>>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>>> standalone mode using HDFS for storage.
>>>
>>> The query is as follows:
>>>
>>> SELECT ME.*, MB.BenefitID
>>> FROM MemberEnrollment ME
>>> JOIN MemberBenefits MB
>>> ON ME.ID  = MB.EnrollmentID
>>> WHERE MB.BenefitID = 5
>>> LIMIT 10

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich.

Stephen, you're right, it feels like Spark is waiting for something, but
I'm not sure what. I'm the only user on the cluster and there are plenty of
resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
and the host servers to make sure nothing was lingering in the background.

Mich, thank you so much, your suggestion worked. Storing the tables as
Parquet solves the issue.

Interestingly, I found that only the MemberEnrollment table needs to be
Parquet. The ID field in MemberEnrollment is an int calculated during load
by a ROW_NUMBER() function. Further testing found that if I hard code a 0
as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
query works without issue even if both tables are ORC.
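
A hypothetical sketch of the load pattern described above, with ROW_NUMBER()
assigning the ID at write time; the staging source name and the ordering columns
are assumptions for illustration only:

-- Sketch only: MemberEnrollment_staging and the ORDER BY columns are made up;
-- the point is that ID is produced by a window function during the load.
CREATE TABLE MemberEnrollment STORED AS PARQUET
AS
SELECT ROW_NUMBER() OVER (ORDER BY MemberID, StartDate) AS ID
     , MemberID
     , StartDate
     , EndDate
FROM MemberEnrollment_staging;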

Should I infer from this issue that the Hive components prefer Parquet over
ORC? Furthermore, should I consider using a different table storage
framework, like Delta Lake, instead of the Hive components? Given this
issue and other issues I've had with Hive, I'm starting to think a
different solution might be more robust and stable. The main condition is
that my application operates solely through Thrift server, so I need to be
able to connect to Spark through Thrift server and have it write tables
using Delta Lake instead of Hive. From this StackOverflow question, it
looks like this is possible:
https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
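
For reference, a hedged sketch of what writing a Delta table through the Thrift
Server could look like from beeline, assuming the server is launched with the
Delta Lake package and its SQL extensions configured as the linked question
discusses; the table name is made up:

-- Assumption: the Thrift Server JVM has the Delta Lake jars on its classpath
-- and spark.sql.extensions / spark.sql.catalog.spark_catalog set up for Delta.
CREATE TABLE MemberEnrollment_delta
USING DELTA
AS SELECT * FROM MemberEnrollment;

SELECT COUNT(*) FROM MemberEnrollment_delta;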

Thanks again to everyone who replied for their help.

Patrick


On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh 
wrote:

> Steve may have a valid point. You raised an issue with concurrent writes
> before, if I recall correctly. Since this limitation may be due to Hive
> metastore. By default Spark uses Apache Derby for its database
> persistence. *However it is limited to only one Spark session at any time
> for the purposes of metadata storage.*  That may be the cause here as
> well. Does this happen if the underlying tables are created as PARQUET as
> opposed to ORC?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
> wrote:
>
>> Hi Patrick,
>>
>> When this has happened to me in the past (admittedly via spark-submit) it
>> has been because another job was still running and had already claimed some
>> of the resources (cores and memory).
>>
>> I think this can also happen if your configuration tries to claim
>> resources that will never be available.
>>
>> Cheers,
>>
>> SteveC
>>
>>
>> On 11 Aug 2023, at 3:36 am, Patrick Tucci 
>> wrote:
>>
>> Hello,
>>
>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>> standalone mode using HDFS for storage.
>>
>> The query is as follows:
>>
>> SELECT ME.*, MB.BenefitID
>> FROM MemberEnrollment ME
>> JOIN MemberBenefits MB
>> ON ME.ID  = MB.EnrollmentID
>> WHERE MB.BenefitID = 5
>> LIMIT 10
>>
>> The tables are defined as follows:
>>
>> -- Contains about 3M rows
>> CREATE TABLE MemberEnrollment
>> (
>> ID INT
>> , MemberID VARCHAR(50)
>> , StartDate DATE
>> , EndDate DATE
>> -- Other columns, but these are the most important
>> ) STORED AS ORC;
>>
>> -- Contains about 25m rows
>> CREATE TABLE MemberBenefits
>> (
>> EnrollmentID INT
>> , BenefitID INT
>> ) STORED AS ORC;
>>
>> When I execute the query, it runs a single broadcast exchange stage,
>> which completes after a few seconds. Then everything just hangs. The
>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
>> tasks are executing or pending:
>>
>> 
>>
>> I've let the query run for as long as 30 minutes with no additional
>> stages, progress, or errors. I'm not sure where to start troubleshooting.
>>
>> Thanks for your help,
>>
>> Patrick
>>
>>
>> This email contains confidential information of and is the copyright of
>> Infomedia. It must not be forwarded, amended or disclosed without consent
>> of the sender. If you received this message by mistake, please advise the
>> sender and delete all copies. Security of transmission on the internet
>> cannot be guaranteed, could be infected, intercepted, or corrupted and you
>> should ensure you have suitable antivirus protection in place. By sending
>> us your or any third party personal details, you consent to (or confirm you
>> have obtained consent

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes
before, if I recall correctly. This limitation may be due to the Hive
metastore: by default Spark uses Apache Derby for its database
persistence. *However, it is limited to only one Spark session at any time
for the purposes of metadata storage.* That may be the cause here as well.
Does this happen if the underlying tables are created as PARQUET as opposed
to ORC?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
wrote:

> Hi Patrick,
>
> When this has happened to me in the past (admittedly via spark-submit) it
> has been because another job was still running and had already claimed some
> of the resources (cores and memory).
>
> I think this can also happen if your configuration tries to claim
> resources that will never be available.
>
> Cheers,
>
> SteveC
>
>
> On 11 Aug 2023, at 3:36 am, Patrick Tucci  wrote:
>
> Hello,
>
> I'm attempting to run a query on Spark 3.4.0 through the Spark
> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
> standalone mode using HDFS for storage.
>
> The query is as follows:
>
> SELECT ME.*, MB.BenefitID
> FROM MemberEnrollment ME
> JOIN MemberBenefits MB
> ON ME.ID  = MB.EnrollmentID
> WHERE MB.BenefitID = 5
> LIMIT 10
>
> The tables are defined as follows:
>
> -- Contains about 3M rows
> CREATE TABLE MemberEnrollment
> (
> ID INT
> , MemberID VARCHAR(50)
> , StartDate DATE
> , EndDate DATE
> -- Other columns, but these are the most important
> ) STORED AS ORC;
>
> -- Contains about 25m rows
> CREATE TABLE MemberBenefits
> (
> EnrollmentID INT
> , BenefitID INT
> ) STORED AS ORC;
>
> When I execute the query, it runs a single broadcast exchange stage, which
> completes after a few seconds. Then everything just hangs. The JDBC/ODBC
> tab in the UI shows the query state as COMPILED, but no stages or tasks are
> executing or pending:
>
> 
>
> I've let the query run for as long as 30 minutes with no additional
> stages, progress, or errors. I'm not sure where to start troubleshooting.
>
> Thanks for your help,
>
> Patrick
>
>
> This email contains confidential information of and is the copyright of
> Infomedia. It must not be forwarded, amended or disclosed without consent
> of the sender. If you received this message by mistake, please advise the
> sender and delete all copies. Security of transmission on the internet
> cannot be guaranteed, could be infected, intercepted, or corrupted and you
> should ensure you have suitable antivirus protection in place. By sending
> us your or any third party personal details, you consent to (or confirm you
> have obtained consent from such third parties) to Infomedia’s privacy
> policy. http://www.infomedia.com.au/privacy-policy/
>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick,

When this has happened to me in the past (admittedly via spark-submit) it has 
been because another job was still running and had already claimed some of the 
resources (cores and memory).

I think this can also happen if your configuration tries to claim resources 
that will never be available.

Cheers,

SteveC


On 11 Aug 2023, at 3:36 am, Patrick Tucci  wrote:

Hello,

I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. 
The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS 
for storage.

The query is as follows:

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10

The tables are defined as follows:

-- Contains about 3M rows
CREATE TABLE MemberEnrollment
(
ID INT
, MemberID VARCHAR(50)
, StartDate DATE
, EndDate DATE
-- Other columns, but these are the most important
) STORED AS ORC;

-- Contains about 25m rows
CREATE TABLE MemberBenefits
(
EnrollmentID INT
, BenefitID INT
) STORED AS ORC;

When I execute the query, it runs a single broadcast exchange stage, which 
completes after a few seconds. Then everything just hangs. The JDBC/ODBC tab in 
the UI shows the query state as COMPILED, but no stages or tasks are executing 
or pending:



I've let the query run for as long as 30 minutes with no additional stages, 
progress, or errors. I'm not sure where to start troubleshooting.

Thanks for your help,

Patrick

This email contains confidential information of and is the copyright of 
Infomedia. It must not be forwarded, amended or disclosed without consent of 
the sender. If you received this message by mistake, please advise the sender 
and delete all copies. Security of transmission on the internet cannot be 
guaranteed, could be infected, intercepted, or corrupted and you should ensure 
you have suitable antivirus protection in place. By sending us your or any 
third party personal details, you consent to (or confirm you have obtained 
consent from such third parties) to Infomedia's privacy policy. 
http://www.infomedia.com.au/privacy-policy/


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich,

I don't believe Hive is installed. I set up this cluster from scratch. I
installed Hadoop and Spark by downloading them from their project websites.
If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm
running the Thrift server distributed with Spark, like so:

~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077

I can look into installing Hive, but it might take some time. I tried to
set up Hive when I first started evaluating distributed data processing
solutions, but I encountered many issues. Spark was much simpler, which was
part of the reason why I chose it.

Thanks again for the reply, I truly appreciate your help.

Patrick

On Thu, Aug 10, 2023 at 3:43 PM Mich Talebzadeh 
wrote:

> sorry host is 10.0.50.1
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 10 Aug 2023 at 20:41, Mich Talebzadeh 
> wrote:
>
>> Hi Patrick
>>
>> That beeline on port 10000 is a hive thrift server running on your hive
>> on host 10.0.50.1:10000.
>>
>> if you can access that host, you should be able to log into hive by
>> typing hive. The os user is hadoop in your case and sounds like there is no
>> password!
>>
>> Once inside that host, hive logs are kept in your case
>> /tmp/hadoop/hive.log or go to /tmp and do
>>
>> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log
>>
>> Try running the sql inside hive and see what it says
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 10 Aug 2023 at 20:02, Patrick Tucci 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Thanks for the reply. Unfortunately I don't have Hive set up on my
>>> cluster. I can explore this if there are no other ways to troubleshoot.
>>>
>>> I'm using beeline to run commands against the Thrift server. Here's the
>>> command I use:
>>>
>>> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:10000 -n hadoop -f
>>> command.sql
>>>
>>> Thanks again for your help.
>>>
>>> Patrick
>>>
>>>
>>> On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Can you run this sql query through hive itself?

 Are you using this command or similar for your thrift server?

 beeline -u jdbc:hive2://<host>:10000/default
 org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
 wrote:

> Hello,
>
> I'm attempting to run a query on Spark 3.4.0 through the Spark
> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
> standalone mode using HDFS for storage.
>
> The query is as follows:
>
> SELECT ME.*, MB.BenefitID
> FROM MemberEnrollment ME
> JOIN MemberBenefits MB
> ON ME.ID = MB.EnrollmentID
> WHERE MB.BenefitID = 5
> LIMIT 10
>
> The tables are defined as follows:
>
> -- Contains about 3M rows
> CREATE TABLE MemberEnrollment
> (
> ID INT
> , MemberID VARCHAR(50)
> , StartDate DATE
> , EndDate DATE
> -- Other columns, but these are the most important
> ) STORED AS ORC;
>
> -- Contains about 25m rows
> CREATE TABLE MemberBenefits
> (
> EnrollmentID INT
> , BenefitID INT
> ) STORED AS ORC;
>
> W

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
sorry host is 10.0.50.1

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Aug 2023 at 20:41, Mich Talebzadeh 
wrote:

> Hi Patrick
>
> That beeline on port 10000 is a hive thrift server running on your hive on
> host 10.0.50.1:10000.
>
> if you can access that host, you should be able to log into hive by typing
> hive. The os user is hadoop in your case and sounds like there is no
> password!
>
> Once inside that host, hive logs are kept in your case
> /tmp/hadoop/hive.log or go to /tmp and do
>
> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log
>
> Try running the sql inside hive and see what it says
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 10 Aug 2023 at 20:02, Patrick Tucci 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for the reply. Unfortunately I don't have Hive set up on my
>> cluster. I can explore this if there are no other ways to troubleshoot.
>>
>> I'm using beeline to run commands against the Thrift server. Here's the
>> command I use:
>>
>> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:10000 -n hadoop -f
>> command.sql
>>
>> Thanks again for your help.
>>
>> Patrick
>>
>>
>> On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Can you run this sql query through hive itself?
>>>
>>> Are you using this command or similar for your thrift server?
>>>
>>> beeline -u jdbc:hive2://<host>:10000/default
>>> org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
>>> wrote:
>>>
 Hello,

 I'm attempting to run a query on Spark 3.4.0 through the Spark
 ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
 standalone mode using HDFS for storage.

 The query is as follows:

 SELECT ME.*, MB.BenefitID
 FROM MemberEnrollment ME
 JOIN MemberBenefits MB
 ON ME.ID = MB.EnrollmentID
 WHERE MB.BenefitID = 5
 LIMIT 10

 The tables are defined as follows:

 -- Contains about 3M rows
 CREATE TABLE MemberEnrollment
 (
 ID INT
 , MemberID VARCHAR(50)
 , StartDate DATE
 , EndDate DATE
 -- Other columns, but these are the most important
 ) STORED AS ORC;

 -- Contains about 25m rows
 CREATE TABLE MemberBenefits
 (
 EnrollmentID INT
 , BenefitID INT
 ) STORED AS ORC;

 When I execute the query, it runs a single broadcast exchange stage,
 which completes after a few seconds. Then everything just hangs. The
 JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
 tasks are executing or pending:

 [image: image.png]

 I've let the query run for as long as 30 minutes with no additional
 stages, progress, or errors. I'm not sure where to start troubleshooting.

 Thanks for your help,

 Patrick

>>>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Hi Patrick

That beeline on port 10000 is a hive thrift server running on your hive on
host 10.0.50.1:10000.

If you can access that host, you should be able to log into hive by typing
hive. The OS user is hadoop in your case, and it sounds like there is no
password!

Once inside that host, hive logs are kept in your case /tmp/hadoop/hive.log
or go to /tmp and do

/tmp> find ./ -name hive.log. It should be under /tmp/hive.log

Try running the sql inside hive and see what it says

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Aug 2023 at 20:02, Patrick Tucci  wrote:

> Hi Mich,
>
> Thanks for the reply. Unfortunately I don't have Hive set up on my
> cluster. I can explore this if there are no other ways to troubleshoot.
>
> I'm using beeline to run commands against the Thrift server. Here's the
> command I use:
>
> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:10000 -n hadoop -f
> command.sql
>
> Thanks again for your help.
>
> Patrick
>
>
> On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh 
> wrote:
>
>> Can you run this sql query through hive itself?
>>
>> Are you using this command or similar for your thrift server?
>>
>> beeline -u jdbc:hive2://<host>:10000/default
>> org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>>> standalone mode using HDFS for storage.
>>>
>>> The query is as follows:
>>>
>>> SELECT ME.*, MB.BenefitID
>>> FROM MemberEnrollment ME
>>> JOIN MemberBenefits MB
>>> ON ME.ID = MB.EnrollmentID
>>> WHERE MB.BenefitID = 5
>>> LIMIT 10
>>>
>>> The tables are defined as follows:
>>>
>>> -- Contains about 3M rows
>>> CREATE TABLE MemberEnrollment
>>> (
>>> ID INT
>>> , MemberID VARCHAR(50)
>>> , StartDate DATE
>>> , EndDate DATE
>>> -- Other columns, but these are the most important
>>> ) STORED AS ORC;
>>>
>>> -- Contains about 25m rows
>>> CREATE TABLE MemberBenefits
>>> (
>>> EnrollmentID INT
>>> , BenefitID INT
>>> ) STORED AS ORC;
>>>
>>> When I execute the query, it runs a single broadcast exchange stage,
>>> which completes after a few seconds. Then everything just hangs. The
>>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
>>> tasks are executing or pending:
>>>
>>> [image: image.png]
>>>
>>> I've let the query run for as long as 30 minutes with no additional
>>> stages, progress, or errors. I'm not sure where to start troubleshooting.
>>>
>>> Thanks for your help,
>>>
>>> Patrick
>>>
>>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich,

Thanks for the reply. Unfortunately I don't have Hive set up on my cluster.
I can explore this if there are no other ways to troubleshoot.

I'm using beeline to run commands against the Thrift server. Here's the
command I use:

~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:10000 -n hadoop -f command.sql

Thanks again for your help.

Patrick


On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh 
wrote:

> Can you run this sql query through hive itself?
>
> Are you using this command or similar for your thrift server?
>
> beeline -u jdbc:hive2://<host>:10000/default
> org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>> standalone mode using HDFS for storage.
>>
>> The query is as follows:
>>
>> SELECT ME.*, MB.BenefitID
>> FROM MemberEnrollment ME
>> JOIN MemberBenefits MB
>> ON ME.ID = MB.EnrollmentID
>> WHERE MB.BenefitID = 5
>> LIMIT 10
>>
>> The tables are defined as follows:
>>
>> -- Contains about 3M rows
>> CREATE TABLE MemberEnrollment
>> (
>> ID INT
>> , MemberID VARCHAR(50)
>> , StartDate DATE
>> , EndDate DATE
>> -- Other columns, but these are the most important
>> ) STORED AS ORC;
>>
>> -- Contains about 25m rows
>> CREATE TABLE MemberBenefits
>> (
>> EnrollmentID INT
>> , BenefitID INT
>> ) STORED AS ORC;
>>
>> When I execute the query, it runs a single broadcast exchange stage,
>> which completes after a few seconds. Then everything just hangs. The
>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
>> tasks are executing or pending:
>>
>> [image: image.png]
>>
>> I've let the query run for as long as 30 minutes with no additional
>> stages, progress, or errors. I'm not sure where to start troubleshooting.
>>
>> Thanks for your help,
>>
>> Patrick
>>
>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself?

Are you using this command or similar for your thrift server?

beeline -u jdbc:hive2://<host>:10000/default
org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Aug 2023 at 18:39, Patrick Tucci  wrote:

> Hello,
>
> I'm attempting to run a query on Spark 3.4.0 through the Spark
> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
> standalone mode using HDFS for storage.
>
> The query is as follows:
>
> SELECT ME.*, MB.BenefitID
> FROM MemberEnrollment ME
> JOIN MemberBenefits MB
> ON ME.ID = MB.EnrollmentID
> WHERE MB.BenefitID = 5
> LIMIT 10
>
> The tables are defined as follows:
>
> -- Contains about 3M rows
> CREATE TABLE MemberEnrollment
> (
> ID INT
> , MemberID VARCHAR(50)
> , StartDate DATE
> , EndDate DATE
> -- Other columns, but these are the most important
> ) STORED AS ORC;
>
> -- Contains about 25m rows
> CREATE TABLE MemberBenefits
> (
> EnrollmentID INT
> , BenefitID INT
> ) STORED AS ORC;
>
> When I execute the query, it runs a single broadcast exchange stage, which
> completes after a few seconds. Then everything just hangs. The JDBC/ODBC
> tab in the UI shows the query state as COMPILED, but no stages or tasks are
> executing or pending:
>
> [image: image.png]
>
> I've let the query run for as long as 30 minutes with no additional
> stages, progress, or errors. I'm not sure where to start troubleshooting.
>
> Thanks for your help,
>
> Patrick
>


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
OK, so as expected the underlying database is Hive. Hive uses HDFS storage.

You said you encountered limitations on concurrent writes. The ordering and
the limitations are introduced by the Hive metastore, so to speak. Since this
is all happening through Spark, the default implementation of the Hive
metastore in Apache Spark uses Apache Derby for its database persistence. This
is available with no configuration required, as in your case. *However, it is
limited to only one Spark session at any time for the purposes of metadata
storage.* This obviously makes it unsuitable for use in multi-concurrency
situations, as you observed. For an industrial-strength backend Hive metastore
database, you should consider a multi-user, ACID-compliant relational database
product for hosting the metastore. Any current RDBMS should do; I use Oracle
12g myself and others use MySQL or PostgreSQL for this purpose etc.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 30 Jul 2023 at 13:46, Patrick Tucci  wrote:

> Hi Mich and Pol,
>
> Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster
> restarted so I lost the stack trace in the application UI. In the snippets
> I saved, it looks like the exception being thrown was from Hive. Given the
> feedback you've provided, I suspect the issue is with how the Hive
> components are handling concurrent writes.
>
> While using a different format would likely help with this issue, I think
> I have found an easier solution for now. Currently I have many individual
> scripts that perform logic and insert the results separately. Instead of
> each script performing an insert, each script can instead create a view.
> After the views are created, one single script can perform one single
> INSERT, combining the views with UNION ALL statements.
>
> -- Old logic --
> -- Script 1
> INSERT INTO EventClaims
> /*Long, complicated query 1*/
>
> -- Script N
> INSERT INTO EventClaims
> /*Long, complicated query N*/
>
> -- New logic --
> -- Script 1
> CREATE VIEW Q1 AS
> /*Long, complicated query 1*/
>
> -- Script N
> CREATE VIEW QN AS
> /*Long, complicated query N*/
>
> -- Final script --
> INSERT INTO EventClaims
> SELECT * FROM Q1 UNION ALL
> SELECT * FROM QN
>
> The old approach had almost two dozen stages with relatively fewer tasks.
> The new approach requires only 3 stages. With fewer stages and more tasks,
> cluster utilization is much higher.
>
> Thanks again for your feedback. I suspect better concurrent writes will be
> valuable for my project in the future, so this is good information to have
> ready.
>
> Thanks,
>
> Patrick
>
> On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria  wrote:
>
>> Hi Patrick,
>>
>> You can have multiple writers simultaneously writing to the same table in
>> HDFS by utilizing an open table format with concurrency control. Several
>> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
>> Format, offer this capability. All of them provide advanced features that
>> will work better in different use cases according to the writing pattern,
>> type of queries, data characteristics, etc.
>>
>> *Pol Santamaria*
>>
>>
>> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It is not Spark SQL that throws the error. It is the underlying Database
>>> or layer that throws the error.
>>>
>>> Spark acts as an ETL tool.  What is the underlying DB  where the table
>>> resides? Is concurrency supported. Please send the error to this list
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci 
>>> wrote:
>>>
 Hello,

 I'm building an application on Spark SQL. The cluster is set up in
 standalone mode with HDFS as storage.

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol,

Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster
restarted so I lost the stack trace in the application UI. In the snippets
I saved, it looks like the exception being thrown was from Hive. Given the
feedback you've provided, I suspect the issue is with how the Hive
components are handling concurrent writes.

While using a different format would likely help with this issue, I think I
have found an easier solution for now. Currently I have many individual
scripts that perform logic and insert the results separately. Instead of
each script performing an insert, each script can instead create a view.
After the views are created, one single script can perform one single
INSERT, combining the views with UNION ALL statements.

-- Old logic --
-- Script 1
INSERT INTO EventClaims
/*Long, complicated query 1*/

-- Script N
INSERT INTO EventClaims
/*Long, complicated query N*/

-- New logic --
-- Script 1
CREATE VIEW Q1 AS
/*Long, complicated query 1*/

-- Script N
CREATE VIEW QN AS
/*Long, complicated query N*/

-- Final script --
INSERT INTO EventClaims
SELECT * FROM Q1 UNION ALL
SELECT * FROM QN

The old approach had almost two dozen stages, each with relatively few tasks.
The new approach requires only 3 stages. With fewer stages and more tasks per
stage, cluster utilization is much higher.
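
A slightly more concrete, hedged version of that pattern; the view bodies and
column names below are placeholders, since the real queries and the EventClaims
schema are not shown in the thread:

-- Placeholder bodies: Q1..QN stand in for the long per-script queries.
-- Every branch of the UNION ALL must return the same columns in the same order.
CREATE OR REPLACE VIEW Q1 AS
SELECT ClaimID, EventDate, Amount FROM Source1;

CREATE OR REPLACE VIEW QN AS
SELECT ClaimID, EventDate, Amount FROM SourceN;

-- One writer, one job: only this statement touches EventClaims.
INSERT INTO EventClaims
SELECT * FROM Q1
UNION ALL
SELECT * FROM QN;

Permanent views like these live in the metastore and are visible across beeline
sessions; temporary views would only work if every script ran in the same session.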

Thanks again for your feedback. I suspect better concurrent writes will be
valuable for my project in the future, so this is good information to have
ready.

Thanks,

Patrick

On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria  wrote:

> Hi Patrick,
>
> You can have multiple writers simultaneously writing to the same table in
> HDFS by utilizing an open table format with concurrency control. Several
> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
> Format, offer this capability. All of them provide advanced features that
> will work better in different use cases according to the writing pattern,
> type of queries, data characteristics, etc.
>
> *Pol Santamaria*
>
>
> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh 
> wrote:
>
>> It is not Spark SQL that throws the error. It is the underlying Database
>> or layer that throws the error.
>>
>> Spark acts as an ETL tool.  What is the underlying DB  where the table
>> resides? Is concurrency supported. Please send the error to this list
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm building an application on Spark SQL. The cluster is set up in
>>> standalone mode with HDFS as storage. The only Spark application running is
>>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>>> to Thrift Server using beeline.
>>>
>>> I have multiple queries that insert rows into the same table
>>> (EventClaims). These queries work fine when run sequentially, however, some
>>> individual queries don't fully utilize the resources available on the
>>> cluster. I would like to run all of these queries concurrently to more
>>> fully utilize available resources. When I attempt to do this, tasks
>>> eventually begin to fail. The stack trace is pretty long, but here's what
>>> looks like the most relevant parts:
>>>
>>>
>>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>>
>>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>>> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
>>> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>>
>>> Is it possible to have multiple concurrent writers to the same table
>>> with Spark SQL? Is there any way to make this work?
>>>
>>> Thanks for the help.
>>>
>>> Patrick
>>>
>>


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick,

You can have multiple writers simultaneously writing to the same table in
HDFS by utilizing an open table format with concurrency control. Several
formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
Format, offer this capability. All of them provide advanced features that
will work better in different use cases according to the writing pattern,
type of queries, data characteristics, etc.

*Pol Santamaria*


On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh 
wrote:

> It is not Spark SQL that throws the error. It is the underlying Database
> or layer that throws the error.
>
> Spark acts as an ETL tool.  What is the underlying DB  where the table
> resides? Is concurrency supported. Please send the error to this list
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm building an application on Spark SQL. The cluster is set up in
>> standalone mode with HDFS as storage. The only Spark application running is
>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>> to Thrift Server using beeline.
>>
>> I have multiple queries that insert rows into the same table
>> (EventClaims). These queries work fine when run sequentially, however, some
>> individual queries don't fully utilize the resources available on the
>> cluster. I would like to run all of these queries concurrently to more
>> fully utilize available resources. When I attempt to do this, tasks
>> eventually begin to fail. The stack trace is pretty long, but here's what
>> looks like the most relevant parts:
>>
>>
>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>
>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
>> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>
>> Is it possible to have multiple concurrent writers to the same table with
>> Spark SQL? Is there any way to make this work?
>>
>> Thanks for the help.
>>
>> Patrick
>>
>


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or
layer that throws the error.

Spark acts as an ETL tool. What is the underlying DB where the table
resides? Is concurrency supported? Please send the error to this list.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 29 Jul 2023 at 12:02, Patrick Tucci  wrote:

> Hello,
>
> I'm building an application on Spark SQL. The cluster is set up in
> standalone mode with HDFS as storage. The only Spark application running is
> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
> to Thrift Server using beeline.
>
> I have multiple queries that insert rows into the same table
> (EventClaims). These queries work fine when run sequentially, however, some
> individual queries don't fully utilize the resources available on the
> cluster. I would like to run all of these queries concurrently to more
> fully utilize available resources. When I attempt to do this, tasks
> eventually begin to fail. The stack trace is pretty long, but here's what
> looks like the most relevant parts:
>
>
> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>
> org.apache.hive.service.cli.HiveSQLException: Error running query:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>
> Is it possible to have multiple concurrent writers to the same table with
> Spark SQL? Is there any way to make this work?
>
> Thanks for the help.
>
> Patrick
>


Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
 Hi Ruben,

I’m not sure if this answers your question, but if you’re interested in
exploring the underlying tables, you could always try something like the
below in a Databricks notebook:

display(spark.read.table('samples.nyctaxi.trips'))

(For vanilla Spark users, it would be
spark.read.table('samples.nyctaxi.trips').show(100, False) )

Since you’re using Databricks, you can also find the data under the Data
menu, scroll down to the samples metastore then click through to trips to
find the file location, schema, and sample data.
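
The same exploration in plain Spark SQL, for anyone driving this from a %sql cell
or the spark-sql shell rather than PySpark (the table is the Databricks sample
dataset referenced above):

-- Peek at the sample data and its metadata (location, schema, provider).
SELECT * FROM samples.nyctaxi.trips LIMIT 100;
DESCRIBE TABLE EXTENDED samples.nyctaxi.trips;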

On Jun 29, 2023 at 23:53:25, Ruben Mennes  wrote:

> Dear Apache Spark community,
>
> I hope this email finds you well. My name is Ruben, and I am an
> enthusiastic user of Apache Spark, specifically through the Databricks
> platform. I am reaching out to you today to seek your assistance and
> guidance regarding a specific use case.
>
> I have been exploring the capabilities of Spark SQL and Databricks, and I
> have encountered a challenge related to accessing the data objects used by
> queries from the query history. I am aware that Databricks provides a
> comprehensive query history that contains valuable information about
> executed queries.
>
> However, my objective is to extract the underlying data objects (tables)
> involved in each query. By doing so, I aim to analyze and understand the
> dependencies between queries and the data they operate on. This information
> will provide us new insights in how data is used across our data platform.
>
> I have attempted to leverage the Spark SQL Antlr grammar, available at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4,
> to parse the queries retrieved from the query history. Unfortunately, I
> have encountered difficulties when parsing more complex queries.
>
> As an example, I have struggled to parse queries with intricate constructs
> such as the following:
>
> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000
>
> I would greatly appreciate it if you could provide me with some guidance
> on how to overcome these challenges. Specifically, I am interested in
> understanding if there are alternative approaches or existing tools that
> can help me achieve my goal of extracting the data objects used by queries
> from the Databricks query history.
>
> Additionally, if there are any resources, documentation, or examples that
> provide further clarity on this topic, I would be more than grateful to
> receive them. Any insights you can provide would be of immense help in
> advancing my understanding and enabling me to make the most of the Spark
> SQL and Databricks ecosystem.
>
> Thank you very much for your time and support. I eagerly look forward to
> hearing from you and benefiting from your expertise.
>
> Best regards,
> Ruben Mennes
>


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, good news. You have made some progress here :)

bzip (bzip2) works (it is splittable) because it is block-oriented, whereas
gzip is stream-oriented. I also noticed that you are creating a managed ORC
file. You can bucket and partition an ORC (Optimized Row Columnar) table.
An example below:


DROP TABLE IF EXISTS dummy;

CREATE TABLE dummy (
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="1" )
;

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 26 Jun 2023 at 19:35, Patrick Tucci  wrote:

> Hi Mich,
>
> Thanks for the reply. I started running ANALYZE TABLE on the external
> table, but the progress was very slow. The stage had only read about 275MB
> in 10 minutes. That equates to about 5.5 hours just to analyze the table.
>
> This might just be the reality of trying to process a 240m record file
> with 80+ columns, unless there's an obvious issue with my setup that
> someone sees. The solution is likely going to involve increasing
> parallelization.
>
> To that end, I extracted and re-zipped this file in bzip. Since bzip is
> splittable and gzip is not, Spark can process the bzip file in parallel.
> The same CTAS query only took about 45 minutes. This is still a bit slower
> than I had hoped, but the import from bzip fully utilized all available
> cores. So we can give the cluster more resources if we need the process to
> go faster.
>
> Patrick
>
> On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> OK for now have you analyzed statistics in Hive external table
>>
>> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
>> COLUMNS;
>> spark-sql (default)> DESC EXTENDED test.stg_t2;
>>
>> Hive external tables have little optimization
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 26 Jun 2023 at 16:33, Patrick Tucci 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
>>> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
>>> and 64GB of RAM.
>>>
>>> I'm trying to process a large pipe delimited file that has been
>>> compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85
>>> columns). I uploaded the gzipped file to HDFS and created an external table
>>> using the attached script. I tried two simpler queries on the same table,
>>> and they finished in ~5 and ~10 minutes respectively:
>>>
>>> SELECT COUNT(*) FROM ClaimsImport;
>>> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>>>
>>> However, when I tried to create a table stored as ORC using this table
>>> as the input, the query ran for almost 4 hours:
>>>
>>> CREATE TABLE Claims STORED AS ORC
>>> AS
>>> SELECT *
>>> FROM ClaimsImport
>>> --Exclude the header record
>>> WHERE ClaimID <> 'ClaimID';
>>>
>>> [image: image.png]
>>>
>>> Why is there such a speed disparity between these different operations?
>>> I understand that this job cannot be parallelized because the file is
>>> compressed with gzip. I also understand that creating an ORC table from the
>>> input will take more time than a simple COUNT(*). But it doesn't feel like
>>> the CREATE TABLE operation should take more than 24x longer than a simple
>>> SELECT COUNT(*) statement.
>>>
>>> Thanks for any help. Please let me know if I can provide any additional
>>> information.
>>>
>>> Patrick
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hi Mich,

Thanks for the reply. I started running ANALYZE TABLE on the external
table, but the progress was very slow. The stage had only read about 275MB
in 10 minutes. That equates to about 5.5 hours just to analyze the table.

This might just be the reality of trying to process a 240m record file with
80+ columns, unless there's an obvious issue with my setup that someone
sees. The solution is likely going to involve increasing parallelization.

To that end, I extracted and re-zipped this file in bzip. Since bzip is
splittable and gzip is not, Spark can process the bzip file in parallel.
The same CTAS query only took about 45 minutes. This is still a bit slower
than I had hoped, but the import from bzip fully utilized all available
cores. So we can give the cluster more resources if we need the process to
go faster.
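
For context, a hypothetical sketch of the external staging table over the
re-compressed, pipe-delimited file; the real DDL was an attachment that is not
part of this archive, so everything here apart from the ClaimsImport, ClaimID and
ClaimLineID names is an assumption:

-- Hypothetical sketch only. bzip2 (.bz2) text is splittable, so Spark can scan
-- this location in parallel, unlike the original gzip file.
CREATE EXTERNAL TABLE ClaimsImport (
    ClaimID     STRING
  , ClaimLineID INT
  -- ...the remaining ~80 columns of the real file would follow here...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'hdfs://10.0.50.1:8020/user/spark/staging/claims/';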

Patrick

On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh 
wrote:

> OK for now have you analyzed statistics in Hive external table
>
> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
> COLUMNS;
> spark-sql (default)> DESC EXTENDED test.stg_t2;
>
> Hive external tables have little optimization
>
> HTH
>
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 26 Jun 2023 at 16:33, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
>> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
>> and 64GB of RAM.
>>
>> I'm trying to process a large pipe delimited file that has been
>> compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85
>> columns). I uploaded the gzipped file to HDFS and created an external table
>> using the attached script. I tried two simpler queries on the same table,
>> and they finished in ~5 and ~10 minutes respectively:
>>
>> SELECT COUNT(*) FROM ClaimsImport;
>> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>>
>> However, when I tried to create a table stored as ORC using this table as
>> the input, the query ran for almost 4 hours:
>>
>> CREATE TABLE Claims STORED AS ORC
>> AS
>> SELECT *
>> FROM ClaimsImport
>> --Exclude the header record
>> WHERE ClaimID <> 'ClaimID';
>>
>> [image: image.png]
>>
>> Why is there such a speed disparity between these different operations? I
>> understand that this job cannot be parallelized because the file is
>> compressed with gzip. I also understand that creating an ORC table from the
>> input will take more time than a simple COUNT(*). But it doesn't feel like
>> the CREATE TABLE operation should take more than 24x longer than a simple
>> SELECT COUNT(*) statement.
>>
>> Thanks for any help. Please let me know if I can provide any additional
>> information.
>>
>> Patrick
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK for now have you analyzed statistics in Hive external table

spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;

Hive external tables have little optimization

HTH



Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 26 Jun 2023 at 16:33, Patrick Tucci  wrote:

> Hello,
>
> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
> and 64GB of RAM.
>
> I'm trying to process a large pipe delimited file that has been compressed
> with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85 columns). I
> uploaded the gzipped file to HDFS and created an external table using the
> attached script. I tried two simpler queries on the same table, and they
> finished in ~5 and ~10 minutes respectively:
>
> SELECT COUNT(*) FROM ClaimsImport;
> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>
> However, when I tried to create a table stored as ORC using this table as
> the input, the query ran for almost 4 hours:
>
> CREATE TABLE Claims STORED AS ORC
> AS
> SELECT *
> FROM ClaimsImport
> --Exclude the header record
> WHERE ClaimID <> 'ClaimID';
>
> [image: image.png]
>
> Why is there such a speed disparity between these different operations? I
> understand that this job cannot be parallelized because the file is
> compressed with gzip. I also understand that creating an ORC table from the
> input will take more time than a simple COUNT(*). But it doesn't feel like
> the CREATE TABLE operation should take more than 24x longer than a simple
> SELECT COUNT(*) statement.
>
> Thanks for any help. Please let me know if I can provide any additional
> information.
>
> Patrick
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Spark SQL question

2023-01-28 Thread Bjørn Jørgensen
Hi Mich.
This is a Spark user group mailing list where people can ask *any*
questions about spark.
You know SQL and streaming, but I don't think it's necessary to start a
reply with "*LOL*" to the question that's being asked.
No question is too stupid to be asked.


lør. 28. jan. 2023 kl. 09:22 skrev Mich Talebzadeh <
mich.talebza...@gmail.com>:

> LOL
>
> First one
>
> spark-sql> select 1 as `data.group` from abc group by data.group;
> 1
> Time taken: 0.198 seconds, Fetched 1 row(s)
>
> means that you are assigning the alias data.group in the select and using that
> alias -> data.group in your group by statement
>
>
> This is equivalent to
>
>
> spark-sql> select 1 as `data.group` from abc group by 1;
>
> 1
>
> With regard to your second sql
>
>
> select 1 as *`data.group`* from tbl group by `*data.group`;*
>
>
> *will throw an error *
>
>
> *spark-sql> select 1 as `data.group` from abc group by `data.group`;*
>
> *Error in query: cannot resolve '`data.group`' given input columns:
> [spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
> line 1 pos 43;*
>
> *'Aggregate ['`data.group`], [1 AS data.group#225]*
>
> *+- SubqueryAlias spark_catalog.elayer.abc*
>
> *   +- HiveTableRelation [`elayer`.`abc`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
> [keyword#226, occurence#227L], Partition Cols: []]*
>
> `data.group` with quotes is neither the name of the column nor its alias
>
>
> *HTH*
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 27 Jan 2023 at 23:36, Kohki Nishio  wrote:
>
>> this SQL works
>>
>> select 1 as *`data.group`* from tbl group by *data.group*
>>
>>
>> Since there's no such field as *data,* I thought the SQL has to look
>> like this
>>
>> select 1 as *`data.group`* from tbl group by `*data.group`*
>>
>>
>>  But that gives an error (cannot resolve '`data.group`') ... I'm no
>> expert in SQL, but it feels like strange behavior... does anybody have a
>> good explanation for it?
>>
>> Thanks
>>
>> --
>> Kohki Nishio
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Spark SQL question

2023-01-28 Thread Mich Talebzadeh
LOL

First one

spark-sql> select 1 as `data.group` from abc group by data.group;
1
Time taken: 0.198 seconds, Fetched 1 row(s)

means that you are assigning the alias data.group in the select and using that
alias -> data.group in your group by statement


This is equivalent to


spark-sql> select 1 as `data.group` from abc group by 1;

1

With regard to your second sql


select 1 as *`data.group`* from tbl group by `*data.group`;*


*will throw an error *


*spark-sql> select 1 as `data.group` from abc group by `data.group`;*

*Error in query: cannot resolve '`data.group`' given input columns:
[spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
line 1 pos 43;*

*'Aggregate ['`data.group`], [1 AS data.group#225]*

*+- SubqueryAlias spark_catalog.elayer.abc*

*   +- HiveTableRelation [`elayer`.`abc`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
[keyword#226, occurence#227L], Partition Cols: []]*

`data.group` with quotes is neither the name of the column nor its alias
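A hedged Scala sketch of the three variants side by side (the temp view is
only a stand-in; the column names are taken from the error output above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("alias-groupby-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the abc table from this thread
Seq(("a", 1L), ("b", 2L)).toDF("keyword", "occurence").createOrReplaceTempView("abc")

// Works: GROUP BY ordinal position
spark.sql("select 1 as `data.group` from abc group by 1").show()

// Works: data.group without backticks is resolved against the SELECT alias, as described above
spark.sql("select 1 as `data.group` from abc group by data.group").show()

// Fails with "cannot resolve '`data.group`'": the backticks make it a column
// reference, and no column of that name exists
// spark.sql("select 1 as `data.group` from abc group by `data.group`").show()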


*HTH*



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 27 Jan 2023 at 23:36, Kohki Nishio  wrote:

> this SQL works
>
> select 1 as *`data.group`* from tbl group by *data.group*
>
>
> Since there's no such field as *data,* I thought the SQL has to look like
> this
>
> select 1 as *`data.group`* from tbl group by `*data.group`*
>
>  But that gives an error (cannot resolve '`data.group`') ... I'm no
> expert in SQL, but it feels like strange behavior... does anybody have a
> good explanation for it?
> good explanation for it ?
>
> Thanks
>
> --
> Kohki Nishio
>


Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described 
here.

From: Eric Hanchrow 
Date: Thursday, December 8, 2022 at 17:03
To: user@spark.apache.org 
Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read 
class org.apache.parquet.format.PageHeader
My company runs java code that uses Spark to read from, and write to, Azure 
Blob storage.  This code runs more or less 24x7.

Recently we've noticed a few failures that leave stack traces in our logs; what 
they have in common are exceptions that look variously like

Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader: Unrecognized type 0
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader : don't know what type: 14
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader Required field 'num_values' was not found 
in serialized data!
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader Required field 'uncompressed_page_size' 
was not found in serialized data!

I searched 
https://stackoverflow.com/search?q=%5Bapache-spark%5D+java.io.IOException+can+not+read+class+org.apache.parquet.format.PageHeader
 and found exactly one marginally-relevant hit -- 
https://stackoverflow.com/questions/47211392/required-field-uncompressed-page-size-was-not-found-in-serialized-data-parque
It contains a suggested workaround which I haven't yet tried, but intend to 
soon.

I searched the ASF archive for 
user@spark.apache.org; the only hit is 
https://lists.apache.org/list?user@spark.apache.org:2022-9:can%20not%20read%20class%20org.apache.parquet.format.PageHeader
 which is relevant but unhelpful.

It cites https://issues.apache.org/jira/browse/SPARK-11844 which is quite 
relevant, but again unhelpful.

Unfortunately, we cannot provide the relevant parquet file to the mailing list, 
since it of course contains proprietary data.

I've posted the stack trace at 
https://gist.github.com/erich-truveta/f30d77441186a8c30c5f22f9c44bf59f

Here are various maven dependencies that might be relevant (gotten from the 
output of `mvn dependency:tree`):

org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7 :jar:1.1.1

org.apache.hadoop:hadoop-annotations:jar:3.3.4
org.apache.hadoop:hadoop-auth   :jar:3.3.4
org.apache.hadoop:hadoop-azure  :jar:3.3.4
org.apache.hadoop:hadoop-client-api :jar:3.3.4
org.apache.hadoop:hadoop-client-runtime :jar:3.3.4
org.apache.hadoop:hadoop-client :jar:3.3.4
org.apache.hadoop:hadoop-common :jar:3.3.4
org.apache.hadoop:hadoop-hdfs-client:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-common:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-core  :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-jobclient :jar:3.3.4
org.apache.hadoop:hadoop-yarn-api   :jar:3.3.4
org.apache.hadoop:hadoop-yarn-client:jar:3.3.4
org.apache.hadoop:hadoop-yarn-common:jar:3.3.4

org.apache.hive:hive-storage-api :jar:2.7.2

org.apache.parquet:parquet-column:jar:1.12.2
org.apache.parquet:parquet-common:jar:1.12.2
org.apache.parquet:parquet-encoding  :jar:1.12.2
org.apache.parquet:parquet-format-structures :jar:1.12.2
org.apache.parquet:parquet-hadoop:jar:1.12.2
org.apache.parquet:parquet-jackson   :jar:1.12.2

org.apache.spark:spark-catalyst_2.12:jar:3.3.1
org.apache.spark:spark-core_2.12:jar:3.3.1
org.apache.spark:spark-kvstore_2.12 :jar:3.3.1
org.apache.spark:spark-launcher_2.12:jar:3.3.1
org.apache.spark:spark-network-common_2.12  :jar:3.3.1
org.apache.spark:spark-network-shuffle_2.12 :jar:3.3.1
org.apache.spark:spark-sketch_2.12  :jar:3.3.1
org.apache.spark:spark-sql_2.12 :jar:3.3.1
org.apache.spark:spark-tags_2.12:jar:3.3.1
org.apache.spark:spark-unsafe_2.12  :jar:3.3.1

Thank you for any help you can provide!


RE: Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-22 Thread Patrick Tucci

Thanks. How would I go about formally submitting a feature request for this?

On 2022/11/21 23:47:16 Andrew Melo wrote:
> I think this is the right place, just a hard question :) As far as I
> know, there's no "case insensitive flag", so YMMV
>
> On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci  wrote:
> >
> > Is this the wrong list for this type of question?
> >
> > On 2022/11/12 16:34:48 Patrick Tucci wrote:
> > > Hello,
> > >
> > > Is there a way to set string comparisons to be case-insensitive
> > globally? I
> > > understand LOWER() can be used, but my codebase contains 27k 
lines of SQL
> > > and many string comparisons. I would need to apply LOWER() to 
each string
> > > literal in the code base. I'd also need to change all the 
ETL/import code

> > > to apply LOWER() to each string value on import.
> > >
> > > Current behavior:
> > >
> > > SELECT 'ABC' = 'abc';
> > > false
> > > Time taken: 5.466 seconds, Fetched 1 row(s)
> > >
> > > SELECT 'ABC' IN ('AbC', 'abc');
> > > false
> > > Time taken: 5.498 seconds, Fetched 1 row(s)
> > >
> > > SELECT 'ABC' like 'Ab%'
> > > false
> > > Time taken: 5.439 seconds, Fetched 1 row(s)
> > >
> > > Desired behavior would be true for all of the above with the proposed
> > > case-insensitive flag set.
> > >
> > > Thanks,
> > >
> > > Patrick
> > >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Andrew Melo
I think this is the right place, just a hard question :) As far as I
know, there's no "case insensitive flag", so YMMV
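Since there is no such flag, a hedged sketch of the approach already
mentioned in this thread, normalising both sides with lower(); the helper
name is made up purely for illustration:

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{lit, lower}

val spark = SparkSession.builder().master("local[*]").appName("ci-compare-sketch").getOrCreate()
import spark.implicits._

// Illustrative helper, not a Spark API: case-insensitive equality against a literal
def ciEquals(c: Column, s: String): Column = lower(c) === lit(s.toLowerCase)

val df = Seq("ABC", "abc", "Abc", "xyz").toDF("v")
df.filter(ciEquals($"v", "aBc")).show() // matches the first three rows only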

On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci  wrote:
>
> Is this the wrong list for this type of question?
>
> On 2022/11/12 16:34:48 Patrick Tucci wrote:
>  > Hello,
>  >
>  > Is there a way to set string comparisons to be case-insensitive
> globally? I
>  > understand LOWER() can be used, but my codebase contains 27k lines of SQL
>  > and many string comparisons. I would need to apply LOWER() to each string
>  > literal in the code base. I'd also need to change all the ETL/import code
>  > to apply LOWER() to each string value on import.
>  >
>  > Current behavior:
>  >
>  > SELECT 'ABC' = 'abc';
>  > false
>  > Time taken: 5.466 seconds, Fetched 1 row(s)
>  >
>  > SELECT 'ABC' IN ('AbC', 'abc');
>  > false
>  > Time taken: 5.498 seconds, Fetched 1 row(s)
>  >
>  > SELECT 'ABC' like 'Ab%'
>  > false
>  > Time taken: 5.439 seconds, Fetched 1 row(s)
>  >
>  > Desired behavior would be true for all of the above with the proposed
>  > case-insensitive flag set.
>  >
>  > Thanks,
>  >
>  > Patrick
>  >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Patrick Tucci

Is this the wrong list for this type of question?

On 2022/11/12 16:34:48 Patrick Tucci wrote:
> Hello,
>
> Is there a way to set string comparisons to be case-insensitive 
globally? I

> understand LOWER() can be used, but my codebase contains 27k lines of SQL
> and many string comparisons. I would need to apply LOWER() to each string
> literal in the code base. I'd also need to change all the ETL/import code
> to apply LOWER() to each string value on import.
>
> Current behavior:
>
> SELECT 'ABC' = 'abc';
> false
> Time taken: 5.466 seconds, Fetched 1 row(s)
>
> SELECT 'ABC' IN ('AbC', 'abc');
> false
> Time taken: 5.498 seconds, Fetched 1 row(s)
>
> SELECT 'ABC' like 'Ab%'
> false
> Time taken: 5.439 seconds, Fetched 1 row(s)
>
> Desired behavior would be true for all of the above with the proposed
> case-insensitive flag set.
>
> Thanks,
>
> Patrick
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-18 Thread Sean Owen
Taking this off list

Start here:
https://github.com/apache/spark/blob/70ec696bce7012b25ed6d8acec5e2f3b3e127f11/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L144
Look at subclasses of JdbcDialect too, like TeradataDialect.
Note that you are using an old unsupported version, too; that's a link to
master.
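For reference, a hedged sketch of what overriding the existence check via a
custom dialect could look like; the class below and the TOP syntax are
assumptions for illustration and have not been verified against 2.4.4:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative custom dialect whose table-existence query is bounded
// (Teradata uses TOP rather than LIMIT).
object BoundedTeradataDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:teradata")

  override def getTableExistsQuery(table: String): String =
    s"SELECT TOP 1 1 FROM $table"
}

// Register it before any spark.read.jdbc(...) call so it takes precedence
JdbcDialects.registerDialect(BoundedTeradataDialect)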

On Fri, Nov 18, 2022 at 5:50 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Can you please let me know what query Spark internally fires for
> getting the count on a dataframe?
>
> Long count = dataframe.count();
>
> Is it
>
> SELECT 1 FROM (QUERY) SUB_TABL
>
> summing up all the 1s in the response, or directly
>
> SELECT COUNT(*) FROM (QUERY) SUB_TABL
>
> Can you please tell me which approach Spark will follow?
>
>
> Thanks,
> Ramakrishna Rayudu
>
> On Fri, Nov 18, 2022, 8:13 AM Ramakrishna Rayudu <
> ramakrishna560.ray...@gmail.com> wrote:
>
>> Sure I will test with latest spark and let you the result.
>>
>> Thanks,
>> Rama
>>
>> On Thu, Nov 17, 2022, 11:16 PM Sean Owen  wrote:
>>
>>> Weird, does Teradata not support LIMIT n? looking at the Spark source
>>> code suggests it won't. The syntax is "SELECT TOP"? I wonder if that's why
>>> the generic query that seems to test existence loses the LIMIT.
>>> But, that "SELECT 1" test seems to be used for MySQL, Postgres, so I'm
>>> still not sure where it's coming from or if it's coming from Spark. You're
>>> using the teradata dialect I assume. Can you use the latest Spark to test?
>>>
>>> On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <
>>> ramakrishna560.ray...@gmail.com> wrote:
>>>
 Yes I am sure that we are not generating this kind of queries. Okay
 then problem is  LIMIT is not coming up in query. Can you please suggest me
 any direction.

 Thanks,
 Rama

 On Thu, Nov 17, 2022, 10:56 PM Sean Owen  wrote:

> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure
> nothing else is generating or changing those queries?
>
> On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <
> ramakrishna560.ray...@gmail.com> wrote:
>
>> We are using spark 2.4.4 version.
>> I can see two types of queries in DB logs.
>>
>> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>>
>> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>>
>> When we see `SELECT *` which ending up with `Where 1=0`  but query
>> starts with `SELECT 1` there is no where condition.
>>
>> Thanks,
>> Rama
>>
>> On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:
>>
>>> Hm, actually that doesn't look like the queries that Spark uses to
>>> test existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... 
>>> WHERE
>>> 1=0" depending on the dialect. What version, and are you sure something
>>> else is not sending those queries?
>>>
>>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
>>> ramakrishna560.ray...@gmail.com> wrote:
>>>
 Hi Sean,

 Thanks for your response I think it has the performance impact
 because if the query return one million rows then in the response It's 
 self
 we will one million rows unnecessarily like below.

 1
 1
 1
 1
 .
 .
 1


 Its impact the performance. Can we any alternate solution for this.

 Thanks,
 Rama


 On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:

> This is a query to check the existence of the table upfront.
> It is nearly a no-op query; can it have a perf impact?
>
> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
> ramakrishna560.ray...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I am facing one issue. Can you please help me on this.
>>
>> 
>>
>>
>>
>> 
>>
>> We are connecting Tera data from spark SQL with below API
>>
>> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
>> connectionProperties);
>>
>> when we execute above logic on large table with million rows every 
>> time we are seeing below
>>
>> extra query is executing every time as this resulting performance 
>> hit on DB.
>>
>> This below information we got from DBA. We dont have any logs on
>> SPARK SQL.
>>
>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>
>> 1
>> 1
>> 1
>> 1
>> 1
>> 1
>> 1
>> 1
>> 1
>>
>> Can you please clarify why this query is executing or is there
>> any chance that this type of query is executing from our code it 
>>>

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Weird, does Teradata not support LIMIT n? looking at the Spark source code
suggests it won't. The syntax is "SELECT TOP"? I wonder if that's why the
generic query that seems to test existence loses the LIMIT.
But, that "SELECT 1" test seems to be used for MySQL, Postgres, so I'm
still not sure where it's coming from or if it's coming from Spark. You're
using the teradata dialect I assume. Can you use the latest Spark to test?

On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Yes, I am sure that we are not generating these kinds of queries. Okay, then the
> problem is that LIMIT is not coming up in the query. Can you please suggest a
> direction?
>
> Thanks,
> Rama
>
> On Thu, Nov 17, 2022, 10:56 PM Sean Owen  wrote:
>
>> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure nothing
>> else is generating or changing those queries?
>>
>> On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> We are using spark 2.4.4 version.
>>> I can see two types of queries in DB logs.
>>>
>>> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>>>
>>> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>>>
>>> When we see `SELECT *` which ending up with `Where 1=0`  but query
>>> starts with `SELECT 1` there is no where condition.
>>>
>>> Thanks,
>>> Rama
>>>
>>> On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:
>>>
 Hm, actually that doesn't look like the queries that Spark uses to test
 existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
 depending on the dialect. What version, and are you sure something else is
 not sending those queries?

 On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
 ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Thanks for your response I think it has the performance impact because
> if the query return one million rows then in the response It's self we 
> will
> one million rows unnecessarily like below.
>
> 1
> 1
> 1
> 1
> .
> .
> 1
>
>
> Its impact the performance. Can we any alternate solution for this.
>
> Thanks,
> Rama
>
>
> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>
>> This is a query to check the existence of the table upfront.
>> It is nearly a no-op query; can it have a perf impact?
>>
>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I am facing one issue. Can you please help me on this.
>>>
>>> 
>>>
>>>
>>>
>>> 
>>>
>>> We are connecting Tera data from spark SQL with below API
>>>
>>> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
>>> connectionProperties);
>>>
>>> when we execute above logic on large table with million rows every time 
>>> we are seeing below
>>>
>>> extra query is executing every time as this resulting performance hit 
>>> on DB.
>>>
>>> This below information we got from DBA. We dont have any logs on
>>> SPARK SQL.
>>>
>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>>
>>> Can you please clarify why this query is executing or is there any
>>> chance that this type of query is executing from our code it self while
>>> check for rows count from dataframe.
>>>
>>> Please provide me your inputs on this.
>>>
>>>
>>> Thanks,
>>>
>>> Rama
>>>
>>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure nothing
else is generating or changing those queries?

On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> We are using spark 2.4.4 version.
> I can see two types of queries in DB logs.
>
> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>
> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>
> When we see `SELECT *` it ends with `WHERE 1=0`, but the query that starts
> with `SELECT 1` has no WHERE condition.
>
> Thanks,
> Rama
>
> On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:
>
>> Hm, actually that doesn't look like the queries that Spark uses to test
>> existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
>> depending on the dialect. What version, and are you sure something else is
>> not sending those queries?
>>
>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks for your response I think it has the performance impact because
>>> if the query return one million rows then in the response It's self we will
>>> one million rows unnecessarily like below.
>>>
>>> 1
>>> 1
>>> 1
>>> 1
>>> .
>>> .
>>> 1
>>>
>>>
>>> Its impact the performance. Can we any alternate solution for this.
>>>
>>> Thanks,
>>> Rama
>>>
>>>
>>> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>>>
 This is a query to check the existence of the table upfront.
 It is nearly a no-op query; can it have a perf impact?

 On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
 ramakrishna560.ray...@gmail.com> wrote:

> Hi Team,
>
> I am facing one issue. Can you please help me on this.
>
> 
>
>
>
> 
>
> We are connecting Tera data from spark SQL with below API
>
> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
> connectionProperties);
>
> when we execute above logic on large table with million rows every time 
> we are seeing below
>
> extra query is executing every time as this resulting performance hit on 
> DB.
>
> This below information we got from DBA. We dont have any logs on SPARK
> SQL.
>
> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
>
> Can you please clarify why this query is executing or is there any
> chance that this type of query is executing from our code it self while
> check for rows count from dataframe.
>
> Please provide me your inputs on this.
>
>
> Thanks,
>
> Rama
>



Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Ramakrishna Rayudu
We are using spark 2.4.4 version.
I can see two types of queries in DB logs.

SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0

SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0

When we see `SELECT *` it ends with `WHERE 1=0`, but the query that starts
with `SELECT 1` has no WHERE condition.

Thanks,
Rama

On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:

> Hm, actually that doesn't look like the queries that Spark uses to test
> existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
> depending on the dialect. What version, and are you sure something else is
> not sending those queries?
>
> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
> ramakrishna560.ray...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> Thanks for your response I think it has the performance impact because if
>> the query return one million rows then in the response It's self we will
>> one million rows unnecessarily like below.
>>
>> 1
>> 1
>> 1
>> 1
>> .
>> .
>> 1
>>
>>
>> Its impact the performance. Can we any alternate solution for this.
>>
>> Thanks,
>> Rama
>>
>>
>> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>>
>>> This is a query to check the existence of the table upfront.
>>> It is nearly a no-op query; can it have a perf impact?
>>>
>>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
>>> ramakrishna560.ray...@gmail.com> wrote:
>>>
 Hi Team,

 I am facing one issue. Can you please help me on this.

 



 

 We are connecting Tera data from spark SQL with below API

 Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
 connectionProperties);

 when we execute above logic on large table with million rows every time we 
 are seeing below

 extra query is executing every time as this resulting performance hit on 
 DB.

 This below information we got from DBA. We dont have any logs on SPARK
 SQL.

 SELECT 1 FROM ONE_MILLION_ROWS_TABLE;

 1
 1
 1
 1
 1
 1
 1
 1
 1

 Can you please clarify why this query is executing or is there any
 chance that this type of query is executing from our code it self while
 check for rows count from dataframe.

 Please provide me your inputs on this.


 Thanks,

 Rama

>>>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Hm, actually that doesn't look like the queries that Spark uses to test
existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
depending on the dialect. What version, and are you sure something else is
not sending those queries?

On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Thanks for your response. I think it has a performance impact because if
> the query returns one million rows then the response itself will contain
> one million rows unnecessarily, like below.
>
> 1
> 1
> 1
> 1
> .
> .
> 1
>
>
> It impacts the performance. Is there any alternate solution for this?
>
> Thanks,
> Rama
>
>
> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>
>> This is a query to check the existence of the table upfront.
>> It is nearly a no-op query; can it have a perf impact?
>>
>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I am facing one issue. Can you please help me on this.
>>>
>>> 
>>>
>>>
>>>
>>> 
>>>
>>> We are connecting Tera data from spark SQL with below API
>>>
>>> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
>>> connectionProperties);
>>>
>>> when we execute above logic on large table with million rows every time we 
>>> are seeing below
>>>
>>> extra query is executing every time as this resulting performance hit on DB.
>>>
>>> This below information we got from DBA. We dont have any logs on SPARK
>>> SQL.
>>>
>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>>
>>> Can you please clarify why this query is executing or is there any
>>> chance that this type of query is executing from our code it self while
>>> check for rows count from dataframe.
>>>
>>> Please provide me your inputs on this.
>>>
>>>
>>> Thanks,
>>>
>>> Rama
>>>
>>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
This is a query to check the existence of the table upfront.
It is nearly a no-op query; can it have a perf impact?

On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Hi Team,
>
> I am facing one issue. Can you please help me on this.
>
>
> We are connecting to Teradata from Spark SQL with the below API:
>
> Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery,
> connectionProperties);
>
> When we execute the above logic on a large table with millions of rows, we see the
> below extra query executing every time, and this results in a performance hit on the DB.
>
> This below information we got from the DBA. We don't have any logs on Spark SQL.
>
> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
>
> Can you please clarify why this query is executing, or is there any chance
> that this type of query is being executed from our own code while checking
> the row count from the dataframe?
>
> Please provide me your inputs on this.
>
>
> Thanks,
>
> Rama
>


Re: EXT: Re: Spark SQL

2022-09-15 Thread Vibhor Gupta
Hi Mayur,

In Java, you can do futures.get with a timeout and then cancel the future once 
timeout has been reached in the catch block. There should be something similar 
in scala as well.

Eg: https://stackoverflow.com/a/16231834
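A hedged Scala sketch of that pattern combined with Spark job groups, so the
running Spark jobs themselves get cancelled rather than just the Future; it
assumes an existing SparkSession named `spark`:

import java.util.concurrent.TimeUnit
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future, TimeoutException}

def runWithTimeout(sql: String, groupId: String, timeoutMinutes: Long): Unit = {
  val f = Future {
    // Tag every job started from this thread so it can be cancelled as a group
    spark.sparkContext.setJobGroup(groupId, s"budget: $timeoutMinutes min", interruptOnCancel = true)
    spark.sql(sql).collect()
  }
  try {
    Await.result(f, Duration(timeoutMinutes, TimeUnit.MINUTES))
    println("Query ran within the time budget")
  } catch {
    case _: TimeoutException =>
      println(s"Query took more than $timeoutMinutes mins, cancelling")
      spark.sparkContext.cancelJobGroup(groupId) // stops the Spark jobs tagged above
  }
}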



Regards,
Vibhor

From: Gourav Sengupta 
Sent: Thursday, September 15, 2022 10:22 PM
To: Mayur Benodekar 
Cc: user ; i...@spark.apache.org 
Subject: EXT: Re: Spark SQL

EXTERNAL: Report suspicious emails to Email Abuse.

Okay, so for the problem to the solution 👍 that is powerful

On Thu, 15 Sept 2022, 14:48 Mayur Benodekar, 
mailto:askma...@gmail.com>> wrote:
Hi Gourav,

It’s the way the framework is


Sent from my iPhone

On Sep 15, 2022, at 02:02, Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>> wrote:


Hi,

Why spark and why scala?

Regards,
Gourav

On Wed, 7 Sept 2022, 21:42 Mayur Benodekar, 
mailto:askma...@gmail.com>> wrote:

I am new to Scala and Spark both.

I have code in Scala which executes queries in a while loop, one after the other.

What we need to do is: if a particular query takes more than a certain time,
for example 10 mins, we should be able to stop the query execution for that
particular query and move on to the next one

for example

do {
var f = Future(
   spark.sql("some query")
)

f onSuccess {
  case suc => println("Query ran in 10mins")
}

f onFailure {
 case fail => println("query took more than 10mins")
}


}while(some condition)

var result = Await.ready(f,Duration(10,TimeUnit.MINUTES))

I understand that when we call spark.sql the control is sent to spark which i 
need to kill/stop when the duration is over so that i can get back the resources

I have tried multiple things but I am not sure how to solve this. Any help 
would be welcomed as i am stuck with this.

--
Cheers,
Mayur


Re: Spark SQL

2022-09-15 Thread Gourav Sengupta
Okay, so for the problem to the solution 👍 that is powerful

On Thu, 15 Sept 2022, 14:48 Mayur Benodekar,  wrote:

> Hi Gourav,
>
> It’s the way the framework is
>
>
> Sent from my iPhone
>
> On Sep 15, 2022, at 02:02, Gourav Sengupta 
> wrote:
>
> 
> Hi,
>
> Why spark and why scala?
>
> Regards,
> Gourav
>
> On Wed, 7 Sept 2022, 21:42 Mayur Benodekar,  wrote:
>
>> I am new to Scala and Spark both.
>>
>> I have code in Scala which executes queries in a while loop, one after the
>> other.
>>
>> What we need to do is: if a particular query takes more than a certain
>> time, for example 10 mins, we should be able to stop the query execution
>> for that particular query and move on to the next one
>>
>> for example
>>
>> do {
>> var f = Future(
>>spark.sql("some query")
>> )
>>
>> f onSuccess {
>>   case suc => println("Query ran in 10mins")
>> }
>>
>> f onFailure {
>>  case fail => println("query took more than 10mins")
>> }
>>
>> }while(some condition)
>>
>> var result = Await.ready(f,Duration(10,TimeUnit.MINUTES))
>>
>> I understand that when we call spark.sql the control is sent to spark
>> which i need to kill/stop when the duration is over so that i can get back
>> the resources
>>
>> I have tried multiple things but I am not sure how to solve this. Any
>> help would be welcomed as i am stuck with this.
>>
>> --
>> Cheers,
>> Mayur
>>
>


Re: Spark SQL

2022-09-15 Thread Mayur Benodekar
Hi Gourav,

It's the way the framework is.

Sent from my iPhone

On Sep 15, 2022, at 02:02, Gourav Sengupta wrote:

Hi,

Why spark and why scala?

Regards,
Gourav

On Wed, 7 Sept 2022, 21:42 Mayur Benodekar wrote:

I am new to Scala and Spark both.

I have code in Scala which executes queries in a while loop, one after the
other.

What we need to do is: if a particular query takes more than a certain time,
for example 10 mins, we should be able to stop the query execution for that
particular query and move on to the next one.

For example:

do {
var f = Future(
   spark.sql("some query")
)

f onSuccess {
  case suc => println("Query ran in 10mins")
}

f onFailure {
 case fail => println("query took more than 10mins")
}
} while(some condition)

var result = Await.ready(f, Duration(10, TimeUnit.MINUTES))

I understand that when we call spark.sql the control is sent to Spark, which I
need to kill/stop when the duration is over so that I can get back the
resources.

I have tried multiple things but I am not sure how to solve this. Any help
would be welcomed as I am stuck with this.

--
Cheers,
Mayur



Re: Spark SQL

2022-09-14 Thread Gourav Sengupta
Hi,

Why spark and why scala?

Regards,
Gourav

On Wed, 7 Sept 2022, 21:42 Mayur Benodekar,  wrote:

> I am new to Scala and Spark both.
>
> I have code in Scala which executes queries in a while loop, one after the
> other.
>
> What we need to do is: if a particular query takes more than a certain time,
> for example 10 mins, we should be able to stop the query execution for
> that particular query and move on to the next one
>
> for example
>
> do {
> var f = Future(
>spark.sql("some query")
> )
>
> f onSuccess {
>   case suc => println("Query ran in 10mins")
> }
>
> f onFailure {
>  case fail => println("query took more than 10mins")
> }
>
> }while(some condition)
>
> var result = Await.ready(f,Duration(10,TimeUnit.MINUTES))
>
> I understand that when we call spark.sql the control is sent to spark
> which i need to kill/stop when the duration is over so that i can get back
> the resources
>
> I have tried multiple things but I am not sure how to solve this. Any help
> would be welcomed as i am stuck with this.
>
> --
> Cheers,
> Mayur
>


Re: [Spark SQL] Omit Create Table Statement in Spark Sql

2022-08-08 Thread pengyh

You have to saveAsTable or register a view first to run a SQL query against a DataFrame.
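A hedged Scala sketch of that approach (all names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("view-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// No explicit CREATE TABLE statement needed: register the DataFrame as a view...
df.createOrReplaceTempView("t")
spark.sql("SELECT name FROM t WHERE id = 2").show()

// ...or persist it as a table if it must survive the session
// df.write.saveAsTable("t_persisted")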


As in the title, does Spark SQL have a feature like the Flink Catalog to omit the
`CREATE TABLE` statement and write a SQL query directly?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread Someshwar Kale
Hi Ram,

Have you seen this stackoverflow query and response-
https://stackoverflow.com/questions/39685744/apache-spark-how-to-cancel-job-in-code-and-kill-running-tasks
if not, please have a look. seems to have a similar problem .

*Regards,*
*Someshwar Kale*


On Fri, May 20, 2022 at 7:34 AM Artemis User  wrote:

> WAITFOR is part of the Transact-SQL and it's Microsoft SQL server
> specific, not supported by Spark SQL.  If you want to impose a delay in a
> Spark program, you may want to use the thread sleep function in Java or
> Scala.  Hope this helps...
>
> On 5/19/22 1:45 PM, K. N. Ramachandran wrote:
>
> Hi Sean,
>
> I'm trying to test a timeout feature in a tool that uses Spark SQL.
> Basically, if a long-running query exceeds a configured threshold, then the
> query should be canceled.
> I couldn't see a simple way to make a "sleep" SQL statement to test the
> timeout. Instead, I just ran a "select count(*) from table" on a large
> table to act as a query with a long duration.
>
> Is there any way to trigger a "sleep" like behavior in Spark SQL?
>
> Regards,
> Ram
>
> On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:
>
>> I don't think that is standard SQL? what are you trying to do, and why
>> not do it outside SQL?
>>
>> On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
>> wrote:
>>
>>> Gentle ping. Any info here would be great.
>>>
>>> Regards,
>>> Ram
>>>
>>> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
>>> wrote:
>>>
 Hello Spark Users Group,

 I've just recently started working on tools that use Apache Spark.
 When I try WAITFOR in the spark-sql command line, I just get:

 Error: Error running query:
 org.apache.spark.sql.catalyst.parser.ParseException:

 mismatched input 'WAITFOR' expecting (.. list of allowed commands..)


 1) Why is WAITFOR not allowed? Is there another way to get a process to
 sleep for a desired period of time? I'm trying to test a timeout issue and
 need to simulate a sleep behavior.


 2) Is there documentation that outlines why WAITFOR is not supported? I
 did not find any good matches searching online.

 Thanks,
 Ram

>>>
>>>
>>> --
>>> K.N.Ramachandran
>>> Ph: 814-441-4279
>>>
>>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
>
>
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread Artemis User
WAITFOR is part of Transact-SQL and is Microsoft SQL Server
specific; it is not supported by Spark SQL.  If you want to impose a delay in
a Spark program, you may want to use the thread sleep function in Java
or Scala.  Hope this helps...
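One hedged way to get a sleep-like statement on the SQL side for this kind of
timeout test is to expose Thread.sleep as a UDF (a sketch, assuming an
existing SparkSession named `spark`; the function name is made up):

// Register a UDF that simply sleeps for the requested number of milliseconds
spark.udf.register("sleep_ms", (ms: Long) => { Thread.sleep(ms); ms })

// Roughly a 10-minute "query" for exercising a timeout
spark.sql("SELECT sleep_ms(600000L)").show()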


On 5/19/22 1:45 PM, K. N. Ramachandran wrote:

Hi Sean,

I'm trying to test a timeout feature in a tool that uses Spark SQL. 
Basically, if a long-running query exceeds a configured threshold, 
then the query should be canceled.
I couldn't see a simple way to make a "sleep" SQL statement to test 
the timeout. Instead, I just ran a "select count(*) from table" on a 
large table to act as a query with a long duration.


Is there any way to trigger a "sleep" like behavior in Spark SQL?

Regards,
Ram

On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:

I don't think that is standard SQL? what are you trying to do, and
why not do it outside SQL?

On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran
 wrote:

Gentle ping. Any info here would be great.

Regards,
Ram

On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran
 wrote:

Hello Spark Users Group,

I've just recently started working on tools that use
Apache Spark.
When I try WAITFOR in the spark-sql command line, I just get:

Error: Error running query:
org.apache.spark.sql.catalyst.parser.ParseException:

mismatched input 'WAITFOR' expecting (.. list of allowed
commands..)


1) Why is WAITFOR not allowed? Is there another way to get
a process to sleep for a desired period of time? I'm
trying to test a timeout issue and need to simulate a
sleep behavior.


2) Is there documentation that outlines why WAITFOR is not
supported? I did not find any good matches searching online.


Thanks,
Ram



-- 
K.N.Ramachandran

Ph: 814-441-4279



--
K.N.Ramachandran
Ph: 814-441-4279


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread K. N. Ramachandran
Hi Sean,

I'm trying to test a timeout feature in a tool that uses Spark SQL.
Basically, if a long-running query exceeds a configured threshold, then the
query should be canceled.
I couldn't see a simple way to make a "sleep" SQL statement to test the
timeout. Instead, I just ran a "select count(*) from table" on a large
table to act as a query with a long duration.

Is there any way to trigger a "sleep" like behavior in Spark SQL?

Regards,
Ram

On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:

> I don't think that is standard SQL? what are you trying to do, and why not
> do it outside SQL?
>
> On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
> wrote:
>
>> Gentle ping. Any info here would be great.
>>
>> Regards,
>> Ram
>>
>> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
>> wrote:
>>
>>> Hello Spark Users Group,
>>>
>>> I've just recently started working on tools that use Apache Spark.
>>> When I try WAITFOR in the spark-sql command line, I just get:
>>>
>>> Error: Error running query:
>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>>
>>> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>>>
>>>
>>> 1) Why is WAITFOR not allowed? Is there another way to get a process to
>>> sleep for a desired period of time? I'm trying to test a timeout issue and
>>> need to simulate a sleep behavior.
>>>
>>>
>>> 2) Is there documentation that outlines why WAITFOR is not supported? I
>>> did not find any good matches searching online.
>>>
>>> Thanks,
>>> Ram
>>>
>>
>>
>> --
>> K.N.Ramachandran
>> Ph: 814-441-4279
>>
>

-- 
K.N.Ramachandran
Ph: 814-441-4279


Re: [Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-18 Thread Gavin Ray
Following up on this in case anyone runs across it in the archives in the
future
From reading through the config docs and trying various combinations, I've
discovered that:

- You don't want to disable codegen. This roughly doubled the time to
perform simple, few-column/few-row queries from basic testing
  -  Can test this by setting an internal property after setting
"spark.testing" to "true" in system properties


System.setProperty("spark.testing", "true")
val spark = SparkSession.builder()
  .config("spark.sql.codegen.wholeStage", "false")
  .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")

-  The following gave the best performance. I don't know if enabling CBO
did much.

val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.unsafe", "true")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.cbo.enabled", "true")
  .config("spark.sql.cbo.joinReorder.dp.star.filter", "true")
  .config("spark.sql.cbo.joinReorder.enabled", "true")
  .config("spark.sql.cbo.planStats.enabled", "true")
  .config("spark.sql.cbo.starSchemaDetection", "true")


If you're running on more recent JDK's, you'll need to set "--add-opens"
flags for a few namespaces for "kryo.unsafe" to work.



On Mon, May 16, 2022 at 12:55 PM Gavin Ray  wrote:

> Hi all,
>
> I've not got much experience with Spark, but have been reading the
> Catalyst and
> Datasources V2 code/tests to try to get a basic understanding.
>
> I'm interested in trying Catalyst's query planner + optimizer for queries
> spanning one-or-more JDBC sources.
>
> Somewhat unusually, I'd like to do this with as minimal latency as
> possible to
> see what the experience for standard line-of-business apps is like (~90/10
> read/write ratio).
> Few rows would be returned in the reads (something on the order of
> 1-to-1,000).
>
> My question is: What configuration settings would you want to use for
> something
> like this?
>
> I imagine that doing codegen/JIT compilation of the query plan might not be
> worth the cost, so maybe you'd want to disable that and do interpretation?
>
> And possibly you'd want to use query plan config/rules that reduce the time
> spent in planning, trading efficiency for latency?
>
> Does anyone know how you'd configure Spark to test something like this?
>
> Would greatly appreciate any input (even if it's "This is a bad idea and
> will
> never work well").
>
> Thank you =)
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread Sean Owen
I don't think that is standard SQL? what are you trying to do, and why not
do it outside SQL?

On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
wrote:

> Gentle ping. Any info here would be great.
>
> Regards,
> Ram
>
> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
> wrote:
>
>> Hello Spark Users Group,
>>
>> I've just recently started working on tools that use Apache Spark.
>> When I try WAITFOR in the spark-sql command line, I just get:
>>
>> Error: Error running query:
>> org.apache.spark.sql.catalyst.parser.ParseException:
>>
>> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>>
>>
>> 1) Why is WAITFOR not allowed? Is there another way to get a process to
>> sleep for a desired period of time? I'm trying to test a timeout issue and
>> need to simulate a sleep behavior.
>>
>>
>> 2) Is there documentation that outlines why WAITFOR is not supported? I
>> did not find any good matches searching online.
>>
>> Thanks,
>> Ram
>>
>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread K. N. Ramachandran
Gentle ping. Any info here would be great.

Regards,
Ram

On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
wrote:

> Hello Spark Users Group,
>
> I've just recently started working on tools that use Apache Spark.
> When I try WAITFOR in the spark-sql command line, I just get:
>
> Error: Error running query:
> org.apache.spark.sql.catalyst.parser.ParseException:
>
> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>
>
> 1) Why is WAITFOR not allowed? Is there another way to get a process to
> sleep for a desired period of time? I'm trying to test a timeout issue and
> need to simulate a sleep behavior.
>
>
> 2) Is there documentation that outlines why WAITFOR is not supported? I
> did not find any good matches searching online.
>
> Thanks,
> Ram
>


-- 
K.N.Ramachandran
Ph: 814-441-4279


Re: {EXT} Re: Spark sql slowness in Spark 3.0.1

2022-04-15 Thread Anil Dasari
Hello,

The DF is checkpointed here, so it is written to HDFS. The DF is written in parquet
format with the default parallelism.

Thanks.

From: wilson 
Date: Thursday, April 14, 2022 at 2:54 PM
To: user@spark.apache.org 
Subject: {EXT} Re: Spark sql slowness in Spark 3.0.1
just curious, where to  write?


Anil Dasari wrote:
> We are upgrading spark from 2.4.7 to 3.0.1. we use spark sql (hive) to
> checkpoint data frames (intermediate data). DF write is very slow in
> 3.0.1 compared to 2.4.7.
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Spark sql slowness in Spark 3.0.1

2022-04-14 Thread wilson

Just curious, where does it write to?


Anil Dasari wrote:
We are upgrading spark from 2.4.7 to 3.0.1. we use spark sql (hive) to 
checkpoint data frames (intermediate data). DF write is very slow in 
3.0.1 compared to 2.4.7.




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark sql slowness in Spark 3.0.1

2022-04-14 Thread Sergey B.
The suggestion is to check (a quick way to inspect both is sketched below):

1. Used format for write
2. Used parallelism
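A minimal sketch of how both could be checked, assuming `df` is the
intermediate DataFrame being checkpointed and the output path is only a
placeholder:

// 2. used parallelism: how many tasks will the write use?
println(s"write parallelism = ${df.rdd.getNumPartitions}")

// 1. used format: write parquet explicitly, optionally repartitioning first
df.repartition(200) // illustrative value, tune to the cluster
  .write
  .mode("overwrite")
  .parquet("hdfs:///tmp/checkpoints/step1")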

On Thu, Apr 14, 2022 at 7:13 PM Anil Dasari  wrote:

> Hello,
>
>
>
> We are upgrading spark from 2.4.7 to 3.0.1. we use spark sql (hive) to
> checkpoint data frames (intermediate data). DF write is very slow in 3.0.1
> compared to 2.4.7.
>
> I have read the release notes and there were no major changes except managed
> tables and adaptive scheduling. We are not using adaptive scheduling and are
> going with the default config. We made changes to handle managed tables by
> adding explicit paths during writes and deletes.
>
>
>
> Do you have any suggestions to debug and fix the slowness problem ?
>
>
>
> Thanks,
>
>
>


Re: [Spark SQL] Structured Streaming in pyhton can connect to cassandra ?

2022-03-25 Thread Gourav Sengupta
Hi,

completely agree with Alex, also if you are just writing to Cassandra then
what is the purpose of writing to Kafka broker?

Generally people make it sound as if adding more components to their
architecture is great, but sadly it is not. Remove the Kafka broker in case
you are not broadcasting your messages to a wider set of solutions. Also,
Spark is overkill in the way you are using it.

There are fantastic solutions available in the market like Presto SQL, Big
Query, Redshift, Athena, Snowflake, etc and SPARK is just one of the tools
and often a difficult one to configure and run.

Regards,
Gourav Sengupta

On Fri, Mar 25, 2022 at 1:19 PM Alex Ott  wrote:

> You don't need to use foreachBatch to write to Cassandra. You just need to
> use Spark Cassandra Connector version 2.5.0 or higher - it supports native
> writing of stream data into Cassandra.
>
> Here is an announcement:
> https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all
>
> guillaume farcy  at "Mon, 21 Mar 2022 16:33:51 +0100" wrote:
>  gf> Hello,
>
>  gf> I am a student and I am currently doing a big data project.
>  gf> Here is my code:
>  gf> https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3
>
>  gf> My project is to retrieve messages from a twitch chat and send them
> into kafka then spark
>  gf> reads the kafka topic to perform the processing in the provided gist.
>
>  gf> I will want to send these messages into cassandra.
>
>  gf> I tested a first solution on line 72 which works but when there are
> too many messages
>  gf> spark crashes. Probably due to the fact that my function connects to
> cassandra each time
>  gf> it is called.
>
>  gf> I tried the object approach to mutualize the connection object but
> without success:
>  gf> _pickle.PicklingError: Could not serialize object: TypeError: cannot
> pickle
>  gf> '_thread.RLock' object
>
>  gf> Can you please tell me how to do this?
>  gf> Or at least give me some advice?
>
>  gf> Sincerely,
>  gf> FARCY Guillaume.
>
>
>
>  gf> -
>  gf> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> --
> With best wishes,Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [Spark SQL] Structured Streaming in pyhton can connect to cassandra ?

2022-03-25 Thread Alex Ott
You don't need to use foreachBatch to write to Cassandra. You just need to
use Spark Cassandra Connector version 2.5.0 or higher - it supports native
writing of stream data into Cassandra.

Here is an announcement: 
https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all
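A hedged Scala sketch of that streaming sink (the keyspace/table names and
option keys are assumptions based on the connector docs, and `parsedMessages`
stands in for the streaming DataFrame read from Kafka):

import org.apache.spark.sql.streaming.Trigger

val query = parsedMessages.writeStream
  .format("org.apache.spark.sql.cassandra")      // Spark Cassandra Connector >= 2.5.0 sink
  .option("keyspace", "twitch")                  // assumed keyspace
  .option("table", "messages")                   // assumed table
  .option("checkpointLocation", "/tmp/chk/twitch-messages")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .outputMode("append")
  .start()

query.awaitTermination()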

guillaume farcy  at "Mon, 21 Mar 2022 16:33:51 +0100" wrote:
 gf> Hello,

 gf> I am a student and I am currently doing a big data project.
 gf> Here is my code:
 gf> https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3

 gf> My project is to retrieve messages from a twitch chat and send them into 
kafka then spark
 gf> reads the kafka topic to perform the processing in the provided gist.

 gf> I will want to send these messages into cassandra.

 gf> I tested a first solution on line 72 which works but when there are too 
many messages
 gf> spark crashes. Probably due to the fact that my function connects to 
cassandra each time
 gf> it is called.

 gf> I tried the object approach to mutualize the connection object but without 
success:
 gf> _pickle.PicklingError: Could not serialize object: TypeError: cannot pickle
 gf> '_thread.RLock' object

 gf> Can you please tell me how to do this?
 gf> Or at least give me some advice?

 gf> Sincerely,
 gf> FARCY Guillaume.



 gf> -
 gf> To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-- 
With best wishes,Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark SQL] Structured Streaming in pyhton can connect to cassandra ?

2022-03-21 Thread Mich Talebzadeh
Dear student,


Check this article of mine on LinkedIn:


Processing Change Data Capture with Spark Structured Streaming



There is a link to GitHub
  as well.


This writes to the Google BigQuery table. You can write to Cassandra via a
JDBC connection if I am correct.



HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 21 Mar 2022 at 16:51, guillaume farcy <
guillaume.fa...@imt-atlantique.net> wrote:

> Hello,
>
> I am a student and I am currently doing a big data project.
> Here is my code:
> https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3
>
> My project is to retrieve messages from a twitch chat and send them into
> kafka then spark reads the kafka topic to perform the processing in the
> provided gist.
>
> I will want to send these messages into cassandra.
>
> I tested a first solution on line 72 which works but when there are too
> many messages spark crashes. Probably due to the fact that my function
> connects to cassandra each time it is called.
>
> I tried the object approach to mutualize the connection object but
> without success:
> _pickle.PicklingError: Could not serialize object: TypeError: cannot
> pickle '_thread.RLock' object
>
> Can you please tell me how to do this?
> Or at least give me some advice?
>
> Sincerely,
> FARCY Guillaume.
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [Spark SQL] Structured Streaming in pyhton can connect to cassandra ?

2022-03-21 Thread Sean Owen
Looks like you are trying to apply this class/function across Spark, but it
contains a non-serialized object, the connection. That has to be
initialized on use, otherwise you try to send it from the driver and that
can't work.

On Mon, Mar 21, 2022 at 11:51 AM guillaume farcy <
guillaume.fa...@imt-atlantique.net> wrote:

> Hello,
>
> I am a student and I am currently doing a big data project.
> Here is my code:
> https://gist.github.com/Balykoo/262d94a7073d5a7e16dfb0d0a576b9c3
>
> My project is to retrieve messages from a twitch chat and send them into
> kafka then spark reads the kafka topic to perform the processing in the
> provided gist.
>
> I will want to send these messages into cassandra.
>
> I tested a first solution on line 72 which works but when there are too
> many messages spark crashes. Probably due to the fact that my function
> connects to cassandra each time it is called.
>
> I tried the object approach to mutualize the connection object but
> without success:
> _pickle.PicklingError: Could not serialize object: TypeError: cannot
> pickle '_thread.RLock' object
>
> Can you please tell me how to do this?
> Or at least give me some advice?
>
> Sincerely,
> FARCY Guillaume.
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
Oh I see now: using currentRow will give the correlation per ID within the
group based on its ordering, and using unbounded on both sides will result in
the overall correlation value for the whole group?

El lun, 28 feb 2022 a las 16:33, Sean Owen () escribió:

> The results make sense then. You want a correlation per group right?
> because it's over the sums by ID within the group. Then currentRow is
> wrong; needs to be unbounded preceding and following.
>
>
> On Mon, Feb 28, 2022 at 9:22 AM Edgar H  wrote:
>
>> The window is defined as you said yes, unboundedPreceding and currentRow
>> ordering by orderCountSum.
>>
>> val initialSetWindow = Window
>>   .partitionBy("group")
>>   .orderBy("orderCountSum")
>>   .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>>
>> I'm trying to obtain the correlation for each of the members of the group
>> yes (or the accumulative per element, don't really know how to phrase
>> that), and the correlation is affected by the counter used for the column,
>> right? Top to bottom?
>>
>> Ps. Thank you so much for replying so fast!
>>
>> El lun, 28 feb 2022 a las 15:56, Sean Owen () escribió:
>>
>>> How are you defining the window? It looks like it's something like "rows
>>> unbounded preceding, current" or the reverse, as the correlation varies
>>> across the elements of the group as if it's computing them on 1, then 2,
>>> then 3 elements. Don't you want the correlation across the group? otherwise
>>> this answer is 'right' for what you're doing it seems.
>>>
>>> On Mon, Feb 28, 2022 at 7:49 AM Edgar H  wrote:
>>>
 My bad completely, missed the example by a mile sorry for that, let me
 change a couple of things.

 - Got to add "id" to the initial grouping and also add more elements to
 the initial set;

 val sampleSet = Seq(
   ("group1", "id1", 1, 1, 6),
   ("group1", "id1", 4, 4, 6),
   ("group1", "id2", 2, 2, 5),
   ("group1", "id3", 3, 3, 4),
   ("group2", "id1", 4, 4, 3),
   ("group2", "id2", 5, 5, 2),
   ("group2", "id3", 6, 6, 1),
   ("group2", "id3", 15, 6, 1)
 )

 val groupedSet = initialSet
   .groupBy(
 "group", "id"
   ).agg(
 sum("count1").as("count1Sum"),
 sum("count2").as("count2Sum"),
 sum("orderCount").as("orderCountSum")
 )
   .withColumn("cf", corr("count1Sum",
 "count2Sum").over(initialSetWindow))

 Now, with this in place, in case the correlation is applied, the
 following is shown:

 +------+---+---------+---------+-------------+------------------+
 | group| id|count1Sum|count2Sum|orderCountSum|                cf|
 +------+---+---------+---------+-------------+------------------+
 |group1|id3|        3|        3|            4|              null|
 |group1|id2|        2|        2|            5|               1.0|
 |group1|id1|        5|        5|           12|               1.0|
 |group2|id3|       21|       12|            2|              null|
 |group2|id2|        5|        5|            2|               1.0|
 |group2|id1|        4|        4|            3|0.9980460957560549|
 +------+---+---------+---------+-------------+------------------+

 Taking into account what you just mentioned... Even if the Window is
 only partitioned by "group", would it still be impossible to obtain a
 correlation? I'm trying to do like...

 group1 = id1, id2, id3 (and their respective counts) - apply the
 correlation over the set of ids within the group (without taking into
 account they are a sum)
 group2 = id1, id2, id3 (and their respective counts) - same as before

 However, the highest element is still null. When changing the
 rowsBetween call to .rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing) it will just calculate the whole subset
 correlation. Shouldn't the first element of the correlation calculate
 itself?

>>>


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
The results make sense then. You want a correlation per group right?
because it's over the sums by ID within the group. Then currentRow is
wrong; needs to be unbounded preceding and following.


On Mon, Feb 28, 2022 at 9:22 AM Edgar H  wrote:

> The window is defined as you said yes, unboundedPreceding and currentRow
> ordering by orderCountSum.
>
> val initialSetWindow = Window
>   .partitionBy("group")
>   .orderBy("orderCountSum")
>   .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>
> I'm trying to obtain the correlation for each of the members of the group
> yes (or the accumulative per element, don't really know how to phrase
> that), and the correlation is affected by the counter used for the column,
> right? Top to bottom?
>
> Ps. Thank you so much for replying so fast!
>
> El lun, 28 feb 2022 a las 15:56, Sean Owen () escribió:
>
>> How are you defining the window? It looks like it's something like "rows
>> unbounded proceeding, current" or the reverse, as the correlation varies
>> across the elements of the group as if it's computing them on 1, then 2,
>> then 3 elements. Don't you want the correlation across the group? otherwise
>> this answer is 'right' for what you're doing it seems.
>>
>> On Mon, Feb 28, 2022 at 7:49 AM Edgar H  wrote:
>>
>>> My bad completely, missed the example by a mile sorry for that, let me
>>> change a couple of things.
>>>
>>> - Got to add "id" to the initial grouping and also add more elements to
>>> the initial set;
>>>
>>> val sampleSet = Seq(
>>>   ("group1", "id1", 1, 1, 6),
>>>   ("group1", "id1", 4, 4, 6),
>>>   ("group1", "id2", 2, 2, 5),
>>>   ("group1", "id3", 3, 3, 4),
>>>   ("group2", "id1", 4, 4, 3),
>>>   ("group2", "id2", 5, 5, 2),
>>>   ("group2", "id3", 6, 6, 1),
>>>   ("group2", "id3", 15, 6, 1)
>>> )
>>>
>>> val groupedSet = initialSet
>>>   .groupBy(
>>> "group", "id"
>>>   ).agg(
>>> sum("count1").as("count1Sum"),
>>> sum("count2").as("count2Sum"),
>>> sum("orderCount").as("orderCountSum")
>>> )
>>>   .withColumn("cf", corr("count1Sum",
>>> "count2Sum").over(initialSetWindow))
>>>
>>> Now, with this in place, in case the correlation is applied, the
>>> following is shown:
>>>
>>> +------+---+---------+---------+-------------+------------------+
>>> | group| id|count1Sum|count2Sum|orderCountSum|                cf|
>>> +------+---+---------+---------+-------------+------------------+
>>> |group1|id3|        3|        3|            4|              null|
>>> |group1|id2|        2|        2|            5|               1.0|
>>> |group1|id1|        5|        5|           12|               1.0|
>>> |group2|id3|       21|       12|            2|              null|
>>> |group2|id2|        5|        5|            2|               1.0|
>>> |group2|id1|        4|        4|            3|0.9980460957560549|
>>> +------+---+---------+---------+-------------+------------------+
>>>
>>> Taking into account what you just mentioned... Even if the Window is
>>> only partitioned by "group", would it still be impossible to obtain a
>>> correlation? I'm trying to do like...
>>>
>>> group1 = id1, id2, id3 (and their respective counts) - apply the
>>> correlation over the set of ids within the group (without taking into
>>> account they are a sum)
>>> group2 = id1, id2, id3 (and their respective counts) - same as before
>>>
>>> However, the highest element is still null. When changing the
>>> rowsBetween call to .rowsBetween(Window.unboundedPreceding,
>>> Window.unboundedFollowing) it will just calculate the whole subset
>>> correlation. Shouldn't the first element of the correlation calculate
>>> itself?
>>>
>>


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
The window is defined as you said yes, unboundedPreceding and currentRow
ordering by orderCountSum.

val initialSetWindow = Window
  .partitionBy("group")
  .orderBy("orderCountSum")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

I'm trying to obtain the correlation for each of the members of the group
yes (or the accumulative per element, don't really know how to phrase
that), and the correlation is affected by the counter used for the column,
right? Top to bottom?

Ps. Thank you so much for replying so fast!

El lun, 28 feb 2022 a las 15:56, Sean Owen () escribió:

> How are you defining the window? It looks like it's something like "rows
> unbounded proceeding, current" or the reverse, as the correlation varies
> across the elements of the group as if it's computing them on 1, then 2,
> then 3 elements. Don't you want the correlation across the group? otherwise
> this answer is 'right' for what you're doing it seems.
>
> On Mon, Feb 28, 2022 at 7:49 AM Edgar H  wrote:
>
>> My bad completely, missed the example by a mile sorry for that, let me
>> change a couple of things.
>>
>> - Got to add "id" to the initial grouping and also add more elements to
>> the initial set;
>>
>> val sampleSet = Seq(
>>   ("group1", "id1", 1, 1, 6),
>>   ("group1", "id1", 4, 4, 6),
>>   ("group1", "id2", 2, 2, 5),
>>   ("group1", "id3", 3, 3, 4),
>>   ("group2", "id1", 4, 4, 3),
>>   ("group2", "id2", 5, 5, 2),
>>   ("group2", "id3", 6, 6, 1),
>>   ("group2", "id3", 15, 6, 1)
>> )
>>
>> val groupedSet = initialSet
>>   .groupBy(
>> "group", "id"
>>   ).agg(
>> sum("count1").as("count1Sum"),
>> sum("count2").as("count2Sum"),
>> sum("orderCount").as("orderCountSum")
>> )
>>   .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
>>
>> Now, with this in place, in case the correlation is applied, the
>> following is shown:
>>
>> +------+---+---------+---------+-------------+------------------+
>> | group| id|count1Sum|count2Sum|orderCountSum|                cf|
>> +------+---+---------+---------+-------------+------------------+
>> |group1|id3|        3|        3|            4|              null|
>> |group1|id2|        2|        2|            5|               1.0|
>> |group1|id1|        5|        5|           12|               1.0|
>> |group2|id3|       21|       12|            2|              null|
>> |group2|id2|        5|        5|            2|               1.0|
>> |group2|id1|        4|        4|            3|0.9980460957560549|
>> +------+---+---------+---------+-------------+------------------+
>>
>> Taking into account what you just mentioned... Even if the Window is only
>> partitioned by "group", would it still be impossible to obtain a
>> correlation? I'm trying to do like...
>>
>> group1 = id1, id2, id3 (and their respective counts) - apply the
>> correlation over the set of ids within the group (without taking into
>> account they are a sum)
>> group2 = id1, id2, id3 (and their respective counts) - same as before
>>
>> However, the highest element is still null. When changing the rowsBetween
>> call to .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
>> it will just calculate the whole subset correlation. Shouldn't the first
>> element of the correlation calculate itself?
>>
>


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
How are you defining the window? It looks like it's something like "rows
unbounded proceeding, current" or the reverse, as the correlation varies
across the elements of the group as if it's computing them on 1, then 2,
then 3 elements. Don't you want the correlation across the group? otherwise
this answer is 'right' for what you're doing it seems.

On Mon, Feb 28, 2022 at 7:49 AM Edgar H  wrote:

> My bad completely, missed the example by a mile sorry for that, let me
> change a couple of things.
>
> - Got to add "id" to the initial grouping and also add more elements to
> the initial set;
>
> val sampleSet = Seq(
>   ("group1", "id1", 1, 1, 6),
>   ("group1", "id1", 4, 4, 6),
>   ("group1", "id2", 2, 2, 5),
>   ("group1", "id3", 3, 3, 4),
>   ("group2", "id1", 4, 4, 3),
>   ("group2", "id2", 5, 5, 2),
>   ("group2", "id3", 6, 6, 1),
>   ("group2", "id3", 15, 6, 1)
> )
>
> val groupedSet = initialSet
>   .groupBy(
> "group", "id"
>   ).agg(
> sum("count1").as("count1Sum"),
> sum("count2").as("count2Sum"),
> sum("orderCount").as("orderCountSum")
> )
>   .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
>
> Now, with this in place, in case the correlation is applied, the following
> is shown:
>
> +------+---+---------+---------+-------------+------------------+
> | group| id|count1Sum|count2Sum|orderCountSum|                cf|
> +------+---+---------+---------+-------------+------------------+
> |group1|id3|        3|        3|            4|              null|
> |group1|id2|        2|        2|            5|               1.0|
> |group1|id1|        5|        5|           12|               1.0|
> |group2|id3|       21|       12|            2|              null|
> |group2|id2|        5|        5|            2|               1.0|
> |group2|id1|        4|        4|            3|0.9980460957560549|
> +------+---+---------+---------+-------------+------------------+
>
> Taking into account what you just mentioned... Even if the Window is only
> partitioned by "group", would it still be impossible to obtain a
> correlation? I'm trying to do like...
>
> group1 = id1, id2, id3 (and their respective counts) - apply the
> correlation over the set of ids within the group (without taking into
> account they are a sum)
> group2 = id1, id2, id3 (and their respective counts) - same as before
>
> However, the highest element is still null. When changing the rowsBetween
> call to .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> it will just calculate the whole subset correlation. Shouldn't the first
> element of the correlation calculate itself?
>


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
My bad completely, missed the example by a mile sorry for that, let me
change a couple of things.

- Got to add "id" to the initial grouping and also add more elements to the
initial set;

val sampleSet = Seq(
  ("group1", "id1", 1, 1, 6),
  ("group1", "id1", 4, 4, 6),
  ("group1", "id2", 2, 2, 5),
  ("group1", "id3", 3, 3, 4),
  ("group2", "id1", 4, 4, 3),
  ("group2", "id2", 5, 5, 2),
  ("group2", "id3", 6, 6, 1),
  ("group2", "id3", 15, 6, 1)
)

val groupedSet = initialSet
  .groupBy(
"group", "id"
  ).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
  .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))

Now, with this in place, in case the correlation is applied, the following
is shown:

+------+---+---------+---------+-------------+------------------+
| group| id|count1Sum|count2Sum|orderCountSum|                cf|
+------+---+---------+---------+-------------+------------------+
|group1|id3|        3|        3|            4|              null|
|group1|id2|        2|        2|            5|               1.0|
|group1|id1|        5|        5|           12|               1.0|
|group2|id3|       21|       12|            2|              null|
|group2|id2|        5|        5|            2|               1.0|
|group2|id1|        4|        4|            3|0.9980460957560549|
+------+---+---------+---------+-------------+------------------+

Taking into account what you just mentioned... Even if the Window is only
partitioned by "group", would it still be impossible to obtain a
correlation? I'm trying to do like...

group1 = id1, id2, id3 (and their respective counts) - apply the
correlation over the set of ids within the group (without taking into
account they are a sum)
group2 = id1, id2, id3 (and their respective counts) - same as before

However, the highest element is still null. When changing the rowsBetween
call to .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
it will just calculate the whole subset correlation. Shouldn't the first
element of the correlation calculate itself?

El lun, 28 feb 2022 a las 14:12, Sean Owen () escribió:

> You're computing correlations of two series of values, but each series has
> one value, a sum. Correlation is not defined in this case (both variances
> are undefined). This is sample correlation, note.
>
> On Mon, Feb 28, 2022 at 7:06 AM Edgar H  wrote:
>
>> Morning all, been struggling with this for a while and can't really seem
>> to understand what I'm doing wrong...
>>
>> Having the following DataFrame I want to apply the corr function over
>> the following DF;
>>
>> val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
>>
>> val sampleSet = Seq(
>>   ("group1", "id1", 1, 1, 6),
>>   ("group1", "id2", 2, 2, 5),
>>   ("group1", "id3", 3, 3, 4),
>>   ("group2", "id4", 4, 4, 3),
>>   ("group2", "id5", 5, 5, 2),
>>   ("group2", "id6", 6, 6, 1)
>> )
>>
>> val initialSet = sparkSession
>>   .createDataFrame(sampleSet)
>>   .toDF(sampleColumns: _*)
>>
>> - .show()
>>
>> +------+---+------+------+----------+
>> | group| id|count1|count2|orderCount|
>> +------+---+------+------+----------+
>> |group1|id1|     1|     1|         6|
>> |group1|id2|     2|     2|         5|
>> |group1|id3|     3|     3|         4|
>> |group2|id4|     4|     4|         3|
>> |group2|id5|     5|     5|         2|
>> |group2|id6|     6|     6|         1|
>> +------+---+------+------+----------+
>>
>> val initialSetWindow = Window
>>   .partitionBy("group")
>>   .orderBy("orderCountSum")
>>   .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>>
>> val groupedSet = initialSet
>>   .groupBy(
>> "group"
>>   ).agg(
>> sum("count1").as("count1Sum"),
>> sum("count2").as("count2Sum"),
>> sum("orderCount").as("orderCountSum")
>> )
>>   .withColumn("cf", corr("count1Sum", 
>> "count2Sum").over(initialSetWindow))
>>
>> - .show()
>>
>> +------+---------+---------+-------------+----+
>> | group|count1Sum|count2Sum|orderCountSum|  cf|
>> +------+---------+---------+-------------+----+
>> |group1|        6|        6|           15|null|
>> |group2|       15|       15|            6|null|
>> +------+---------+---------+-------------+----+
>>
>> When trying to apply the corr function, some of the resulting values in
>> cf are null for some reason:
>>
>> The question is, *how can I apply corr to each of the rows within their
>> subgroup (Window)?* Would like to obtain the corr value per Row and
>> subgroup (group1 and group2), and even if more nested IDs were added (group
>> + id) it should still work.
>>
>


Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
You're computing correlations of two series of values, but each series has
one value, a sum. Correlation is not defined in this case (both variances
are undefined). This is sample correlation, note.
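
Written out, the sample correlation that corr() computes is

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}

and with a single (x, y) pair every deviation from the mean is zero, so both the
numerator and the denominator are 0 and the value is undefined, which shows up
as null in the output above.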

On Mon, Feb 28, 2022 at 7:06 AM Edgar H  wrote:

> Morning all, been struggling with this for a while and can't really seem
> to understand what I'm doing wrong...
>
> Having the following DataFrame I want to apply the corr function over the
> following DF;
>
> val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
>
> val sampleSet = Seq(
>   ("group1", "id1", 1, 1, 6),
>   ("group1", "id2", 2, 2, 5),
>   ("group1", "id3", 3, 3, 4),
>   ("group2", "id4", 4, 4, 3),
>   ("group2", "id5", 5, 5, 2),
>   ("group2", "id6", 6, 6, 1)
> )
>
> val initialSet = sparkSession
>   .createDataFrame(sampleSet)
>   .toDF(sampleColumns: _*)
>
> - .show()
>
> +------+---+------+------+----------+
> | group| id|count1|count2|orderCount|
> +------+---+------+------+----------+
> |group1|id1|     1|     1|         6|
> |group1|id2|     2|     2|         5|
> |group1|id3|     3|     3|         4|
> |group2|id4|     4|     4|         3|
> |group2|id5|     5|     5|         2|
> |group2|id6|     6|     6|         1|
> +------+---+------+------+----------+
>
> val initialSetWindow = Window
>   .partitionBy("group")
>   .orderBy("orderCountSum")
>   .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>
> val groupedSet = initialSet
>   .groupBy(
> "group"
>   ).agg(
> sum("count1").as("count1Sum"),
> sum("count2").as("count2Sum"),
> sum("orderCount").as("orderCountSum")
> )
>   .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
>
> - .show()
>
> +------+---------+---------+-------------+----+
> | group|count1Sum|count2Sum|orderCountSum|  cf|
> +------+---------+---------+-------------+----+
> |group1|        6|        6|           15|null|
> |group2|       15|       15|            6|null|
> +------+---------+---------+-------------+----+
>
> When trying to apply the corr function, some of the resulting values in cf
> are null for some reason:
>
> The question is, *how can I apply corr to each of the rows within their
> subgroup (Window)?* Would like to obtain the corr value per Row and
> subgroup (group1 and group2), and even if more nested IDs were added (group
> + id) it should still work.
>


RE: Spark-SQL : Getting current user name in UDF

2022-02-22 Thread Lavelle, Shawn
Apologies, this is Spark 3.2.0.

~ Shawn

From: Lavelle, Shawn
Sent: Monday, February 21, 2022 5:39 PM
To: 'user@spark.apache.org' 
Subject: Spark-SQL : Getting current user name in UDF

Hello Spark Users,

I have a UDF I wrote for use with Spark-SQL that performs a look up.  In 
that look up, I need to get the current sql user so I can validate their 
permissions.  I was using org.apache.spark.sql.util.Utils.getCurrentUserName() 
to retrieve the current active user from within the UDF but today I discovered 
that that call returns a different user based on the context:

select myUDF();
returns the SQL user
select myUDF() from myTable ;
returns the operating system (application?) user.

I can provide a code example if needed, but it's just calling 
Utils.getCurrentUserName() from within the UDF code.

Does this sound like expected behavior or a defect?  Is there another way I can 
get the active SQL user inside a UDF?

Thanks in advance,

~ Shawn

PS I can't add username as a parameter to the UDF because I can't rely on the 
user not to submit someone else's username.
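
For reference, a minimal way to reproduce the comparison in Scala (myTable is a
placeholder; UserGroupInformation comes from the Hadoop client libraries that
ship with Spark, and is what Utils.getCurrentUserName falls back to when
SPARK_USER is not set):

import org.apache.hadoop.security.UserGroupInformation

spark.udf.register("whoami", () =>
  UserGroupInformation.getCurrentUser.getShortUserName)

// Evaluated without a table scan (the case that returned the SQL user above).
spark.sql("SELECT whoami()").show()

// Evaluated against a table (the case that returned the OS/application user above).
spark.sql("SELECT whoami() FROM myTable LIMIT 1").show()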




[OSI Logo]
Shawn Lavelle

Software Development

OSI Digital Grid Solutions
4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Email: shawn.lave...@osii.com
Website: www.osii.com
[Emerson Logo]
We are proud to
now be a part of
Emerson.


Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-04 Thread Kapoor, Rohit
My basic test is here - https://github.com/rohitkapoor1/sparkPushDownAggregate
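
For anyone skimming the thread, the DS v2 route amounts to registering a JDBC
catalog and querying the table through it, then checking the scan node for
PushedAggregates. A rough Scala sketch follows; the catalog name, connection
details and the exact option names are assumptions based on Spark 3.2's
JDBCTableCatalog, and the repository above has the actual working test.

// Register a DataSource V2 JDBC catalog; aggregate push down only applies to v2 scans.
spark.conf.set("spark.sql.catalog.pg",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.pg.url", "jdbc:postgresql://localhost:5432/postgres")
spark.conf.set("spark.sql.catalog.pg.driver", "org.postgresql.Driver")
spark.conf.set("spark.sql.catalog.pg.user", "postgres")      // placeholder credentials
spark.conf.set("spark.sql.catalog.pg.password", "secret")
spark.conf.set("spark.sql.catalog.pg.pushDownAggregate", "true")

// Query through the catalog instead of spark.read.format("jdbc").
val df = spark.sql("SELECT MAX(sal) FROM pg.public.emp WHERE empid > 1")
df.explain()  // the JDBC scan should now report the pushed-down MAX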


From: German Schiavon 
Date: Thursday, 4 November 2021 at 2:17 AM
To: huaxin gao 
Cc: Kapoor, Rohit , user@spark.apache.org 

Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS. ALWAYS 
VERIFY THE SOURCE OF MESSAGES.

Hi,

Rohit, can you share how it looks using DSv2?

Thanks!

On Wed, 3 Nov 2021 at 19:35, huaxin gao 
mailto:huaxin.ga...@gmail.com>> wrote:
Great to hear. Thanks for testing this!

On Wed, Nov 3, 2021 at 4:03 AM Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>> wrote:
Thanks for your guidance Huaxin. I have been able to test the push down 
operators successfully against Postgresql using DS v2.


From: huaxin gao mailto:huaxin.ga...@gmail.com>>
Date: Tuesday, 2 November 2021 at 12:35 AM
To: Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>>
Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS. ALWAYS 
VERIFY THE SOURCE OF MESSAGES.
No need to write a customized data source reader. You may want to follow the 
example here 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala#L40<https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala#L40>
 to use DS v2. The example uses h2 database. Please modify it to use postgresql.

Huaxin


On Mon, Nov 1, 2021 at 11:21 AM Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>> wrote:
Hi Huaxin,

Thanks a lot for your response. Do I need to write a custom data source reader 
(in my case, for PostgreSql) using the Spark DS v2 APIs, instead of the 
standard spark.read.format(“jdbc”) ?


Thanks,
Rohit

From: huaxin gao mailto:huaxin.ga...@gmail.com>>
Date: Monday, 1 November 2021 at 11:32 PM
To: Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS. ALWAYS 
VERIFY THE SOURCE OF MESSAGES.
Hi Rohit,

Thanks for testing this. Seems to me that you are using DS v1. We only support 
aggregate push down in DS v2. Could you please try again using DS v2 and let me 
know how it goes?

Thanks,
Huaxin

On Mon, Nov 1, 2021 at 10:39 AM Chao Sun 
mailto:sunc...@apache.org>> wrote:

-- Forwarded message -
From: Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>>
Date: Mon, Nov 1, 2021 at 6:27 AM
Subject: [Spark SQL]: Aggregate Push Down / Spark 3.2
To: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>

Hi,

I am testing the aggregate push down for JDBC after going through the JIRA - 
https://issues.apache.org/jira/browse/SPARK-34952<https://issues.apache.org/jira/browse/SPARK-34952>
I have the latest Spark 3.2 setup in local mode (laptop).

I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate query 
on “emp” table that has 102 rows and a simple schema with 3 columns (empid, 
ename and sal) as below:

val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"

val jdbcDF = spark.read
.format("jdbc")
.option("url", jdbcString)
.option("dbtable", "emp")
.option("pushDownAggregate","true")
.option("user", "")
.option("password", "")
.load()
.where("empid > 1")
.agg(max("SAL")).alias("max_sal")


The complete plan details are:

== Parsed Logical Plan ==
SubqueryAlias max_sal
+- Aggregate [max(SAL#2) AS max(SAL)#10]
   +- Filter (empid#0 > 1)
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Analyzed Logical Plan ==
max(SAL): int
SubqueryAlias max_sal
+- Aggregate [max(SAL#2) AS max(SAL)#10]
   +- Filter (empid#0 > 1)
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Optimized Logical Plan ==
Aggregate [max(SAL#2) AS max(SAL)#10]
+- Project [sal#2]
   +- Filter (isnotnull(empid#0) AND (empid#0 > 1))
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[max(SAL#2)], output=[max(SAL)#10])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#15]
  +- HashAggregate(ke

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-04 Thread Sunil Prabhakara
Unsubscribe.

On Mon, Nov 1, 2021 at 6:57 PM Kapoor, Rohit 
wrote:

> Hi,
>
>
>
> I am testing the aggregate push down for JDBC after going through the JIRA
> - https://issues.apache.org/jira/browse/SPARK-34952
>
> I have the latest Spark 3.2 setup in local mode (laptop).
>
>
>
> I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate
> query on “emp” table that has 102 rows and a simple schema with 3
> columns (empid, ename and sal) as below:
>
>
>
> val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"
>
>
>
> val jdbcDF = spark.read
>
> .format("jdbc")
>
> .option("url", jdbcString)
>
> .option("dbtable", "emp")
>
> .option("pushDownAggregate","true")
>
> .option("user", "")
>
> .option("password", "")
>
> .load()
>
> .where("empid > 1")
>
> .agg(max("SAL")).alias("max_sal")
>
>
>
>
>
> The complete plan details are:
>
>
>
> == Parsed Logical Plan ==
>
> SubqueryAlias max_sal
>
> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>
>+- Filter (empid#0 > 1)
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Analyzed Logical Plan ==
>
> max(SAL): int
>
> SubqueryAlias max_sal
>
> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>
>+- Filter (empid#0 > 1)
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Optimized Logical Plan ==
>
> Aggregate [max(SAL#2) AS max(SAL)#10]
>
> +- Project [sal#2]
>
>+- Filter (isnotnull(empid#0) AND (empid#0 > 1))
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Physical Plan ==
>
> AdaptiveSparkPlan isFinalPlan=false
>
> +- HashAggregate(keys=[], functions=[max(SAL#2)], output=[max(SAL)#10])
>
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#15]
>
>   +- HashAggregate(keys=[], functions=[partial_max(SAL#2)],
> output=[max#13])
>
>  +- Scan JDBCRelation(emp) [numPartitions=1] [sal#2] 
> *PushedAggregates:
> []*, PushedFilters: [*IsNotNull(empid), *GreaterThan(empid,1)],
> PushedGroupby: [], ReadSchema: struct
>
>
>
>
>
> I also checked the sql submitted to the database, querying
> pg_stat_statements, and it confirms that the aggregate was not pushed
> down to the database. Here is the query submitted to the database:
>
>
>
> SELECT "sal" FROM emp WHERE ("empid" IS NOT NULL) AND ("empid" > $1)
>
>
>
> All the rows are read and aggregated in the Spark layer.
>
>
>
> Is there any configuration I am missing here? Why is aggregate push down not
> working for me?
>
> Any pointers would be greatly appreciated.
>
>
>
>
>
> Thanks,
>
> Rohit
> --
>
> Disclaimer: The information in this email is confidential and may be
> legally privileged. Access to this Internet email by anyone else other than
> the recipient is unauthorized. Envestnet, Inc. and its affiliated companies
> do not accept time-sensitive transactional messages, including orders to
> buy and sell securities, account allocation instructions, or any other
> instructions affecting a client account, via e-mail. If you are not the
> intended recipient of this email, any disclosure, copying, or distribution
> of it is prohibited and may be unlawful. If you have received this email in
> error, please notify the sender and immediately and permanently delete it
> and destroy any copies of it that were printed out. When addressed to our
> clients, any opinions or advice contained in this email is subject to the
> terms and conditions expressed in any applicable governing terms of
> business or agreements.
> --
>


Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread German Schiavon
Hi,

Rohit, can you share how it looks using DSv2?

Thanks!

On Wed, 3 Nov 2021 at 19:35, huaxin gao  wrote:

> Great to hear. Thanks for testing this!
>
> On Wed, Nov 3, 2021 at 4:03 AM Kapoor, Rohit 
> wrote:
>
>> Thanks for your guidance Huaxin. I have been able to test the push down
>> operators successfully against Postgresql using DS v2.
>>
>>
>>
>>
>>
>> *From: *huaxin gao 
>> *Date: *Tuesday, 2 November 2021 at 12:35 AM
>> *To: *Kapoor, Rohit 
>> *Subject: *Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
>>
>>
>>
>> *EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS.
>> ALWAYS VERIFY THE SOURCE OF MESSAGES. *
>>
>>
>>
>>
>> No need to write a customized data source reader. You may want to follow
>> the example here
>> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala#L40
>> to use DS v2. The example uses h2 database. Please modify it to use
>> postgresql.
>>
>>
>>
>> Huaxin
>>
>>
>>
>>
>>
>> On Mon, Nov 1, 2021 at 11:21 AM Kapoor, Rohit 
>> wrote:
>>
>> Hi Huaxin,
>>
>>
>>
>> Thanks a lot for your response. Do I need to write a custom data source
>> reader (in my case, for PostgreSql) using the Spark DS v2 APIs, instead of
>> the standard spark.read.format(“jdbc”) ?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Rohit
>>
>>
>>
>> *From: *huaxin gao 
>> *Date: *Monday, 1 November 2021 at 11:32 PM
>> *To: *Kapoor, Rohit 
>> *Cc: *user@spark.apache.org 
>> *Subject: *Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
>>
>> *EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS.
>> ALWAYS VERIFY THE SOURCE OF MESSAGES.*
>>
>>
>> Hi Rohit,
>>
>>
>>
>> Thanks for testing this. Seems to me that you are using DS v1. We only
>> support aggregate push down in DS v2. Could you please try again using DS
>> v2 and let me know how it goes?
>>
>>
>>
>> Thanks,
>>
>> Huaxin
>>
>>
>>
>> On Mon, Nov 1, 2021 at 10:39 AM Chao Sun  wrote:
>>
>>
>>
>> -- Forwarded message -
>> From: *Kapoor, Rohit* 
>> Date: Mon, Nov 1, 2021 at 6:27 AM
>> Subject: [Spark SQL]: Aggregate Push Down / Spark 3.2
>> To: user@spark.apache.org 
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am testing the aggregate push down for JDBC after going through the
>> JIRA - https://issues.apache.org/jira/browse/SPARK-34952
>>
>> I have the latest Spark 3.2 setup in local mode (laptop).
>>
>>
>>
>> I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate
>> query on “emp” table that has 102 rows and a simple schema with 3
>> columns (empid, ename and sal) as below:
>>
>>
>>
>> val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"
>>
>>
>>
>> val jdbcDF = spark.read
>>
>> .format("jdbc")
>>
>> .option("url", jdbcString)
>>
>> .option("dbtable", "emp")
>>
>> .option("pushDownAggregate","true")
>>
>> .option("user", "")
>>
>> .option("password", "")
>>
>> .load()
>>
>> .where("empid > 1")
>>
>> .agg(max("SAL")).alias("max_sal")
>>
>>
>>
>>
>>
>> The complete plan details are:
>>
>>
>>
>> == Parsed Logical Plan ==
>>
>> SubqueryAlias max_sal
>>
>> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>>
>>+- Filter (empid#0 > 1)
>>
>>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
>> [numPartitions=1]
>>
>>
>>
>> == Analyzed Logical Plan ==
>>
>> max(SAL): int
>>
>> SubqueryAlias max_sal
>>
>> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>>
>>+- Filter (empid#0 > 1)
>>
>>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
>

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread huaxin gao
Great to hear. Thanks for testing this!

On Wed, Nov 3, 2021 at 4:03 AM Kapoor, Rohit 
wrote:

> Thanks for your guidance Huaxin. I have been able to test the push down
> operators successfully against Postgresql using DS v2.
>
>
>
>
>
> *From: *huaxin gao 
> *Date: *Tuesday, 2 November 2021 at 12:35 AM
> *To: *Kapoor, Rohit 
> *Subject: *Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
>
>
>
> *EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS.
> ALWAYS VERIFY THE SOURCE OF MESSAGES. *
>
>
>
>
> No need to write a customized data source reader. You may want to follow
> the example here
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala#L40
> to use DS v2. The example uses h2 database. Please modify it to use
> postgresql.
>
>
>
> Huaxin
>
>
>
>
>
> On Mon, Nov 1, 2021 at 11:21 AM Kapoor, Rohit 
> wrote:
>
> Hi Huaxin,
>
>
>
> Thanks a lot for your response. Do I need to write a custom data source
> reader (in my case, for PostgreSql) using the Spark DS v2 APIs, instead of
> the standard spark.read.format(“jdbc”) ?
>
>
>
>
>
> Thanks,
>
> Rohit
>
>
>
> *From: *huaxin gao 
> *Date: *Monday, 1 November 2021 at 11:32 PM
> *To: *Kapoor, Rohit 
> *Cc: *user@spark.apache.org 
> *Subject: *Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
>
> *EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS.
> ALWAYS VERIFY THE SOURCE OF MESSAGES.*
>
>
> Hi Rohit,
>
>
>
> Thanks for testing this. Seems to me that you are using DS v1. We only
> support aggregate push down in DS v2. Could you please try again using DS
> v2 and let me know how it goes?
>
>
>
> Thanks,
>
> Huaxin
>
>
>
> On Mon, Nov 1, 2021 at 10:39 AM Chao Sun  wrote:
>
>
>
> -- Forwarded message -
> From: *Kapoor, Rohit* 
> Date: Mon, Nov 1, 2021 at 6:27 AM
> Subject: [Spark SQL]: Aggregate Push Down / Spark 3.2
> To: user@spark.apache.org 
>
>
>
> Hi,
>
>
>
> I am testing the aggregate push down for JDBC after going through the JIRA
> - https://issues.apache.org/jira/browse/SPARK-34952
>
> I have the latest Spark 3.2 setup in local mode (laptop).
>
>
>
> I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate
> query on “emp” table that has 102 rows and a simple schema with 3
> columns (empid, ename and sal) as below:
>
>
>
> val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"
>
>
>
> val jdbcDF = spark.read
>
> .format("jdbc")
>
> .option("url", jdbcString)
>
> .option("dbtable", "emp")
>
> .option("pushDownAggregate","true")
>
> .option("user", "")
>
> .option("password", "")
>
> .load()
>
> .where("empid > 1")
>
> .agg(max("SAL")).alias("max_sal")
>
>
>
>
>
> The complete plan details are:
>
>
>
> == Parsed Logical Plan ==
>
> SubqueryAlias max_sal
>
> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>
>+- Filter (empid#0 > 1)
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Analyzed Logical Plan ==
>
> max(SAL): int
>
> SubqueryAlias max_sal
>
> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>
>+- Filter (empid#0 > 1)
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Optimized Logical Plan ==
>
> Aggregate [max(SAL#2) AS max(SAL)#10]
>
> +- Project [sal#2]
>
>+- Filter (isnotnull(empid#0) AND (empid#0 > 1))
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Physical Plan ==
>
> AdaptiveSparkPlan isFinalPlan=false
>
> +- HashAggregate(keys=[], functions=[max(SAL#2)], output=[max(SAL)#10])
>
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#15]
>
>   +- HashAggregate(keys=[], functions=[partial_max(SAL#2)],
> output=[max#13])
>
>  +- Scan JDBCRelation(emp) [numPartitions=1] [sal#2] 
> *PushedAggregates:
> []*, PushedFilters: [*IsNotNull(empid), *GreaterThan(empid,1)],
> Pushed

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread Kapoor, Rohit
Thanks for your guidance Huaxin. I have been able to test the push down 
operators successfully against Postgresql using DS v2.


From: huaxin gao 
Date: Tuesday, 2 November 2021 at 12:35 AM
To: Kapoor, Rohit 
Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS. ALWAYS 
VERIFY THE SOURCE OF MESSAGES.


No need to write a customized data source reader. You may want to follow the 
example here 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala#L40<https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala#L40>
 to use DS v2. The example uses h2 database. Please modify it to use postgresql.

Huaxin


On Mon, Nov 1, 2021 at 11:21 AM Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>> wrote:
Hi Huaxin,

Thanks a lot for your response. Do I need to write a custom data source reader 
(in my case, for PostgreSql) using the Spark DS v2 APIs, instead of the 
standard spark.read.format(“jdbc”) ?


Thanks,
Rohit

From: huaxin gao mailto:huaxin.ga...@gmail.com>>
Date: Monday, 1 November 2021 at 11:32 PM
To: Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS. ALWAYS 
VERIFY THE SOURCE OF MESSAGES.
Hi Rohit,

Thanks for testing this. Seems to me that you are using DS v1. We only support 
aggregate push down in DS v2. Could you please try again using DS v2 and let me 
know how it goes?

Thanks,
Huaxin

On Mon, Nov 1, 2021 at 10:39 AM Chao Sun 
mailto:sunc...@apache.org>> wrote:

-- Forwarded message -
From: Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>>
Date: Mon, Nov 1, 2021 at 6:27 AM
Subject: [Spark SQL]: Aggregate Push Down / Spark 3.2
To: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>

Hi,

I am testing the aggregate push down for JDBC after going through the JIRA - 
https://issues.apache.org/jira/browse/SPARK-34952<https://issues.apache.org/jira/browse/SPARK-34952>
I have the latest Spark 3.2 setup in local mode (laptop).

I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate query 
on “emp” table that has 102 rows and a simple schema with 3 columns (empid, 
ename and sal) as below:

val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"

val jdbcDF = spark.read
.format("jdbc")
.option("url", jdbcString)
.option("dbtable", "emp")
.option("pushDownAggregate","true")
.option("user", "")
.option("password", "")
.load()
.where("empid > 1")
.agg(max("SAL")).alias("max_sal")


The complete plan details are:

== Parsed Logical Plan ==
SubqueryAlias max_sal
+- Aggregate [max(SAL#2) AS max(SAL)#10]
   +- Filter (empid#0 > 1)
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Analyzed Logical Plan ==
max(SAL): int
SubqueryAlias max_sal
+- Aggregate [max(SAL#2) AS max(SAL)#10]
   +- Filter (empid#0 > 1)
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Optimized Logical Plan ==
Aggregate [max(SAL#2) AS max(SAL)#10]
+- Project [sal#2]
   +- Filter (isnotnull(empid#0) AND (empid#0 > 1))
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[max(SAL#2)], output=[max(SAL)#10])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#15]
  +- HashAggregate(keys=[], functions=[partial_max(SAL#2)], output=[max#13])
 +- Scan JDBCRelation(emp) [numPartitions=1] [sal#2] PushedAggregates: 
[], PushedFilters: [*IsNotNull(empid), *GreaterThan(empid,1)], PushedGroupby: 
[], ReadSchema: struct


I also checked the sql submitted to the database, querying pg_stat_statements, 
and it confirms that the aggregate was not pushed down to the database. Here is 
the query submitted to the database:

SELECT "sal" FROM emp WHERE ("empid" IS NOT NULL) AND ("empid" > $1)

All the rows are read and aggregated in the Spark layer.

Is there any configuration I am missing here? Why is aggregate push down not 
working for me?
Any pointers would be greatly appreciated.


Thanks

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-01 Thread Kapoor, Rohit
Hi Huaxin,

Thanks a lot for your response. Do I need to write a custom data source reader 
(in my case, for PostgreSql) using the Spark DS v2 APIs, instead of the 
standard spark.read.format(“jdbc”) ?


Thanks,
Rohit

From: huaxin gao 
Date: Monday, 1 November 2021 at 11:32 PM
To: Kapoor, Rohit 
Cc: user@spark.apache.org 
Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
EXTERNAL MAIL: USE CAUTION BEFORE CLICKING LINKS OR OPENING ATTACHMENTS. ALWAYS 
VERIFY THE SOURCE OF MESSAGES.


Hi Rohit,

Thanks for testing this. Seems to me that you are using DS v1. We only support 
aggregate push down in DS v2. Could you please try again using DS v2 and let me 
know how it goes?

Thanks,
Huaxin

On Mon, Nov 1, 2021 at 10:39 AM Chao Sun 
mailto:sunc...@apache.org>> wrote:

-- Forwarded message -
From: Kapoor, Rohit 
mailto:rohit.kap...@envestnet.com>>
Date: Mon, Nov 1, 2021 at 6:27 AM
Subject: [Spark SQL]: Aggregate Push Down / Spark 3.2
To: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>

Hi,

I am testing the aggregate push down for JDBC after going through the JIRA - 
https://issues.apache.org/jira/browse/SPARK-34952<https://issues.apache.org/jira/browse/SPARK-34952>
I have the latest Spark 3.2 setup in local mode (laptop).

I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate query 
on “emp” table that has 102 rows and a simple schema with 3 columns (empid, 
ename and sal) as below:

val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"

val jdbcDF = spark.read
.format("jdbc")
.option("url", jdbcString)
.option("dbtable", "emp")
.option("pushDownAggregate","true")
.option("user", "")
.option("password", "")
.load()
.where("empid > 1")
.agg(max("SAL")).alias("max_sal")


The complete plan details are:

== Parsed Logical Plan ==
SubqueryAlias max_sal
+- Aggregate [max(SAL#2) AS max(SAL)#10]
   +- Filter (empid#0 > 1)
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Analyzed Logical Plan ==
max(SAL): int
SubqueryAlias max_sal
+- Aggregate [max(SAL#2) AS max(SAL)#10]
   +- Filter (empid#0 > 1)
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Optimized Logical Plan ==
Aggregate [max(SAL#2) AS max(SAL)#10]
+- Project [sal#2]
   +- Filter (isnotnull(empid#0) AND (empid#0 > 1))
  +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp) [numPartitions=1]

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[max(SAL#2)], output=[max(SAL)#10])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#15]
  +- HashAggregate(keys=[], functions=[partial_max(SAL#2)], output=[max#13])
 +- Scan JDBCRelation(emp) [numPartitions=1] [sal#2] PushedAggregates: 
[], PushedFilters: [*IsNotNull(empid), *GreaterThan(empid,1)], PushedGroupby: 
[], ReadSchema: struct


I also checked the sql submitted to the database, querying pg_stat_statements, 
and it confirms that the aggregate was not pushed down to the database. Here is 
the query submitted to the database:

SELECT "sal" FROM emp WHERE ("empid" IS NOT NULL) AND ("empid" > $1)

All the rows are read and aggregated in the Spark layer.

Is there any configuration I am missing here? Why is aggregate push down not 
working for me?
Any pointers would be greatly appreciated.


Thanks,
Rohit


Disclaimer: The information in this email is confidential and may be legally 
privileged. Access to this Internet email by anyone else other than the 
recipient is unauthorized. Envestnet, Inc. and its affiliated companies do not 
accept time-sensitive transactional messages, including orders to buy and sell 
securities, account allocation instructions, or any other instructions 
affecting a client account, via e-mail. If you are not the intended recipient 
of this email, any disclosure, copying, or distribution of it is prohibited and 
may be unlawful. If you have received this email in error, please notify the 
sender and immediately and permanently delete it and destroy any copies of it 
that were printed out. When addressed to our clients, any opinions or advice 
contained in this email is subject to the terms and conditions expressed in any 
applicable governing terms of business or agreements.





Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-01 Thread huaxin gao
Hi Rohit,

Thanks for testing this. Seems to me that you are using DS v1. We only
support aggregate push down in DS v2. Could you please try again using DS
v2 and let me know how it goes?

Thanks,
Huaxin

On Mon, Nov 1, 2021 at 10:39 AM Chao Sun  wrote:

>
>
> -- Forwarded message -
> From: Kapoor, Rohit 
> Date: Mon, Nov 1, 2021 at 6:27 AM
> Subject: [Spark SQL]: Aggregate Push Down / Spark 3.2
> To: user@spark.apache.org 
>
>
> Hi,
>
>
>
> I am testing the aggregate push down for JDBC after going through the JIRA
> - https://issues.apache.org/jira/browse/SPARK-34952
>
> I have the latest Spark 3.2 setup in local mode (laptop).
>
>
>
> I have PostgreSQL v14 locally on my laptop. I am trying a basic aggregate
> query on “emp” table that has 102 rows and a simple schema with 3
> columns (empid, ename and sal) as below:
>
>
>
> val jdbcString = "jdbc:postgresql://" + "localhost" + ":5432/postgres"
>
>
>
> val jdbcDF = spark.read
>
> .format("jdbc")
>
> .option("url", jdbcString)
>
> .option("dbtable", "emp")
>
> .option("pushDownAggregate","true")
>
> .option("user", "")
>
> .option("password", "")
>
> .load()
>
> .where("empid > 1")
>
> .agg(max("SAL")).alias("max_sal")
>
>
>
>
>
> The complete plan details are:
>
>
>
> == Parsed Logical Plan ==
>
> SubqueryAlias max_sal
>
> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>
>+- Filter (empid#0 > 1)
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Analyzed Logical Plan ==
>
> max(SAL): int
>
> SubqueryAlias max_sal
>
> +- Aggregate [max(SAL#2) AS max(SAL)#10]
>
>+- Filter (empid#0 > 1)
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Optimized Logical Plan ==
>
> Aggregate [max(SAL#2) AS max(SAL)#10]
>
> +- Project [sal#2]
>
>+- Filter (isnotnull(empid#0) AND (empid#0 > 1))
>
>   +- Relation [empid#0,ename#1,sal#2] JDBCRelation(emp)
> [numPartitions=1]
>
>
>
> == Physical Plan ==
>
> AdaptiveSparkPlan isFinalPlan=false
>
> +- HashAggregate(keys=[], functions=[max(SAL#2)], output=[max(SAL)#10])
>
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#15]
>
>   +- HashAggregate(keys=[], functions=[partial_max(SAL#2)],
> output=[max#13])
>
>  +- Scan JDBCRelation(emp) [numPartitions=1] [sal#2] 
> *PushedAggregates:
> []*, PushedFilters: [*IsNotNull(empid), *GreaterThan(empid,1)],
> PushedGroupby: [], ReadSchema: struct
>
>
>
>
>
> I also checked the sql submitted to the database, querying
> pg_stat_statements, and it confirms that the aggregate was not pushed
> down to the database. Here is the query submitted to the database:
>
>
>
> SELECT "sal" FROM emp WHERE ("empid" IS NOT NULL) AND ("empid" > $1)
>
>
>
> All the rows are read and aggregated in the Spark layer.
>
>
>
> Is there any configuration I am missing here? Why is aggregate push down not
> working for me?
>
> Any pointers would be greatly appreciated.
>
>
>
>
>
> Thanks,
>
> Rohit
> --
>
> Disclaimer: The information in this email is confidential and may be
> legally privileged. Access to this Internet email by anyone else other than
> the recipient is unauthorized. Envestnet, Inc. and its affiliated companies
> do not accept time-sensitive transactional messages, including orders to
> buy and sell securities, account allocation instructions, or any other
> instructions affecting a client account, via e-mail. If you are not the
> intended recipient of this email, any disclosure, copying, or distribution
> of it is prohibited and may be unlawful. If you have received this email in
> error, please notify the sender and immediately and permanently delete it
> and destroy any copies of it that were printed out. When addressed to our
> clients, any opinions or advice contained in this email is subject to the
> terms and conditions expressed in any applicable governing terms of
> business or agreements.
> --
>


Re: Spark-sql can replace Hive ?

2021-06-15 Thread Mich Talebzadeh
OK you mean use spark.sql as opposed to HiveContext.sql?

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("")

replace with

spark.sql("")
?
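
In recent Spark versions HiveContext is deprecated anyway; the usual pattern is a
single SparkSession built with Hive support, which talks to the same metastore
through hive-site.xml on the classpath (database and table names below are
placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("adhoc-sql")
  .enableHiveSupport()   // reuse the existing Hive metastore and warehouse
  .getOrCreate()

spark.sql("SELECT count(*) FROM my_db.my_table").show()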


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 15 Jun 2021 at 18:00, Battula, Brahma Reddy 
wrote:

> Currently I am using hive sql engine for adhoc queries. As spark-sql also
> supports this, I want to migrate from hive.
>
>
>
>
>
>
>
>
>
> *From: *Mich Talebzadeh 
> *Date: *Thursday, 10 June 2021 at 8:12 PM
> *To: *Battula, Brahma Reddy 
> *Cc: *ayan guha , d...@spark.apache.org <
> d...@spark.apache.org>, user@spark.apache.org 
> *Subject: *Re: Spark-sql can replace Hive ?
>
> These are different things. Spark provides a computational layer and a
> dialect of SQL based on Hive.
>
>
>
> Hive is a DW on top of HDFS. What are you trying to replace?
>
>
>
> HTH
>
>
>
>
>
>
>view my Linkedin profile
> <https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linkedin.com%2Fin%2Fmich-talebzadeh-ph-d-5205b2%2F&data=04%7C01%7Cbbattula%40visa.com%7C826c6f38277b4af22f2a08d92c1e034e%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C637589329681823204%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=nt1cQcvCszeHYupi9gHEeHi0RuDEyDoNXEPTVExLxgY%3D&reserved=0>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 10 Jun 2021 at 12:09, Battula, Brahma Reddy
>  wrote:
>
> Thanks for prompt reply.
>
>
>
> I want to replace hive with spark.
>
>
>
>
>
>
>
>
>
> *From: *ayan guha 
> *Date: *Thursday, 10 June 2021 at 4:35 PM
> *To: *Battula, Brahma Reddy 
> *Cc: *d...@spark.apache.org , user@spark.apache.org <
> user@spark.apache.org>
> *Subject: *Re: Spark-sql can replace Hive ?
>
> Would you mind expanding the ask? Spark Sql can use hive by itself
>
>
>
> On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy
>  wrote:
>
> Hi
>
>
>
> Would like to know any references/docs to replace hive with spark-sql
> completely, like how to migrate the existing data in hive?
>
>
>
> thanks
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>


Re: Spark-sql can replace Hive ?

2021-06-15 Thread Battula, Brahma Reddy
Currently I am using hive sql engine for adhoc queries. As spark-sql also 
supports this, I want to migrate from hive.




From: Mich Talebzadeh 
Date: Thursday, 10 June 2021 at 8:12 PM
To: Battula, Brahma Reddy 
Cc: ayan guha , d...@spark.apache.org 
, user@spark.apache.org 
Subject: Re: Spark-sql can replace Hive ?
These are different things. Spark provides a computational layer and a dialect 
of SQL based on Hive.

Hive is a DW on top of HDFS. What are you trying to replace?

HTH





 
   view my Linkedin 
profile<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linkedin.com%2Fin%2Fmich-talebzadeh-ph-d-5205b2%2F&data=04%7C01%7Cbbattula%40visa.com%7C826c6f38277b4af22f2a08d92c1e034e%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C637589329681823204%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=nt1cQcvCszeHYupi9gHEeHi0RuDEyDoNXEPTVExLxgY%3D&reserved=0>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 10 Jun 2021 at 12:09, Battula, Brahma Reddy  
wrote:
Thanks for prompt reply.

I want to replace hive with spark.




From: ayan guha mailto:guha.a...@gmail.com>>
Date: Thursday, 10 June 2021 at 4:35 PM
To: Battula, Brahma Reddy 
Cc: d...@spark.apache.org<mailto:d...@spark.apache.org> 
mailto:d...@spark.apache.org>>, 
user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: Spark-sql can replace Hive ?
Would you mind expanding the ask? Spark Sql can use hive by itself

On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy 
 wrote:
Hi

Would like to know any references/docs to replace hive with spark-sql completely, 
like how to migrate the existing data in hive?

thanks


--
Best Regards,
Ayan Guha


Re: Spark-sql can replace Hive ?

2021-06-10 Thread Mich Talebzadeh
These are different things. Spark provides a computational layer and a
dialect of SQL based on Hive.

Hive is a DW on top of HDFS. What are you trying to replace?

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Jun 2021 at 12:09, Battula, Brahma Reddy
 wrote:

> Thanks for prompt reply.
>
>
>
> I want to replace hive with spark.
>
>
>
>
>
>
>
>
>
> *From: *ayan guha 
> *Date: *Thursday, 10 June 2021 at 4:35 PM
> *To: *Battula, Brahma Reddy 
> *Cc: *d...@spark.apache.org , user@spark.apache.org <
> user@spark.apache.org>
> *Subject: *Re: Spark-sql can replace Hive ?
>
> Would you mind expanding the ask? Spark Sql can use hive by itself
>
>
>
> On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy
>  wrote:
>
> Hi
>
>
>
> Would like to know any references/docs to replace hive with spark-sql
> completely, like how to migrate the existing data in hive?
>
>
>
> thanks
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>


Re: Spark-sql can replace Hive ?

2021-06-10 Thread Battula, Brahma Reddy
Thanks for prompt reply.

I want to replace hive with spark.




From: ayan guha 
Date: Thursday, 10 June 2021 at 4:35 PM
To: Battula, Brahma Reddy 
Cc: d...@spark.apache.org , user@spark.apache.org 

Subject: Re: Spark-sql can replace Hive ?
Would you mind expanding the ask? Spark Sql can use hive by itself

On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy 
 wrote:
Hi

Would like to know any references/docs to replace hive with spark-sql completely, 
like how to migrate the existing data in hive?

thanks


--
Best Regards,
Ayan Guha


Re: Spark-sql can replace Hive ?

2021-06-10 Thread ayan guha
Would you mind expanding the ask? Spark Sql can use hive by itself

On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy
 wrote:

> Hi
>
>
>
> Would like to know any references/docs to replace hive with spark-sql
> completely, like how to migrate the existing data in hive?
>
>
>
> thanks
>
>
>
>
>
-- 
Best Regards,
Ayan Guha


Re: [Spark SQL]: Does Spark SQL can have better performance?

2021-04-29 Thread Mich Talebzadeh
Hi,

your query

parquetFile = spark.read.parquet("path/to/hdfs")
parquetFile.createOrReplaceTempView("parquetFile")
spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY
timestamp LIMIT 1")

will be lazily evaluated and won't do anything until the sql statement is
actioned with .show etc

In local mode, there is only one executor. Assuming you are actioning the
sql statement, it will have to do a full table scan to find field1 =
'value'

scala> spark.sql("""select * from tmp where 'Account Type' = 'abc' limit
1000""").explain()
== Physical Plan ==
LocalTableScan , [Date#16,  Type#17,  Description#18,  Value#19,
Balance#20,  Account Name#21,  Account Number#22]

Try actioning it and see what happens
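
A concrete way to check both points, lazy evaluation and what actually reaches
the Parquet scan, using the column names from the question (field1 and
timestamp; the path is a placeholder):

import org.apache.spark.sql.functions.col

val parquetFile = spark.read.parquet("path/to/hdfs")

val q = parquetFile
  .filter(col("field1") === "value")
  .orderBy(col("timestamp"))
  .limit(1)

q.explain()  // inspect PushedFilters on the Parquet scan; nothing has run yet
q.show()     // the action: only now are the files actually read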


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 29 Apr 2021 at 07:30, Amin Borjian 
wrote:

> Hi.
>
> We use spark 3.0.1 in HDFS cluster and we store our files as parquet with
> snappy compression and enabled dictionary. We try to perform a simple query:
>
> parquetFile = spark.read.parquet("path/to/hdfs")
> parquetFile.createOrReplaceTempView("parquetFile")
> spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY
> timestamp LIMIT 1")
>
> Condition in 'where' clause is selected so that no record is selected
> (matched) for this query. (on purpose) However, this query takes about
> 5-6 minutes to complete on our cluster (with NODE_LOCAL) and a simple spark
> configuration. (Input data is about 8TB in the following tests but can be
> much more)
>
> We decided to test the consumption of disk, network, memory and CPU
> resources in order to detect bottlenecks in this query. However, we came
> to much more strange results, which we will discuss in the following.
>
> We provided dashboards of each network, disk, memory, and CPU usage by
> monitoring tools so that we could check the conditions when running the
> query.
>
> 1) First, we increased the amount of CPU allocated to Spark from the
> initial value to 2 and then about 2.5 times. Although the last increase
> in the total amount of dedicated CPU, all of it was not used, we did not
> see any change in the duration of the query. (As a general point, in all
> tests, the execution times were increased or decreased between 0 and 20
> seconds, but in 5 minutes, these cases were insignificant)
>
> 2) Then we similarly increased the amount of memory allocated to Spark to
> 2 to 2.5 times its original value. In this case, in the last increase,
> the entire memory allocated to the spark was not used by query. But
> again, we did not see any change in the duration of the query.
>
> 3) In all these tests, we monitored the amount of network consumption and
> sending and receiving it in the whole cluster. We can run a query whose
> network consumption is 2 or almost 3 times the consumption of the query
> mentioned in this email, and this shows that we have not reached the
> maximum of the cluster network capacity in this query. Of course, it was
> strange to us why we need the network in a query that has no record and is
> completely node local, but we assumed that it probably needs it for a
> number of reasons, and with this assumption we were still very far from the
> network capacity.
>
> 4) In all these tests, we also checked the amount of writing and reading
> from the disk. In the same way as in the third case, we can write a query
> that is about 4 times the write and read rate of the query mentioned in the
> email, and our disks are much stronger. But the observation in this query
> shows that the write rate is almost zero (We were expecting it) and the
> read rate is running at a medium speed, which is very far from the amount
> of disk rate capacity and therefore cannot be a bottleneck.
>
> After all these tests and the query running time of 5 minutes, we did not
> know exactly what could be more restrictive, and it was strange to us that
> the simple query stated in the email needed such a run time (because with
> such execution time, heavier queries take longer).
>
> Does it make sense that the duration of the query is so long? Is there
> something we need to pay attention to or can we improve by changing it?
>
> Thanks,
> Amin Borjian
>


Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe

2021-04-09 Thread ayan guha
Hi - interesting stuff. My stand always was to use spark native functions,
pandas and python native - in this order.

To OP - did you try the code? What kind of perf are you seeing? Just
curious, why do you think UDFs are bad?

On Sat, 10 Apr 2021 at 2:36 am, Sean Owen  wrote:

> Actually, good question, I'm not sure. I don't think that Spark would
> vectorize these operations over rows.
> Whereas in a pandas UDF, given a DataFrame, you can apply operations like
> sin to 1000s of values at once in native code via numpy. It's trivially
> 'vectorizable' and I've seen good wins over, at least, a single-row UDF.
>
> On Fri, Apr 9, 2021 at 9:14 AM ayan guha  wrote:
>
>> Hi Sean - absolutely open to suggestions.
>>
>> My impression was using spark native functions should provide similar
>> perf as scala ones because serialization penalty should not be there,
>> unlike native python udfs.
>>
>> Is it wrong understanding?
>>
>>
>>
>> On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru  wrote:
>>
>>> Hi All,
>>>
>>>
>>> Yes, I need to add the below scenario-based code to the executing spark
>>> job. While executing this it took a lot of time to complete. Please suggest the
>>> best way to get the below requirement without using UDF.
>>>
>>>
>>> Thanks,
>>>
>>> Ankamma Rao B
>>> --
>>> *From:* Sean Owen 
>>> *Sent:* Friday, April 9, 2021 6:11 PM
>>> *To:* ayan guha 
>>> *Cc:* Rao Bandaru ; User 
>>> *Subject:* Re: [Spark SQL]:to calculate distance between four
>>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk
>>> dataframe
>>>
>>> This can be significantly faster with a pandas UDF, note, because you
>>> can vectorize the operations.
>>>
>>> On Fri, Apr 9, 2021, 7:32 AM ayan guha  wrote:
>>>
>>> Hi
>>>
>>> We are using a haversine distance function for this, and wrapping it in
>>> udf.
>>>
>>> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
>>> from pyspark.sql.types import *
>>>
>>> def haversine_distance(long_x, lat_x, long_y, lat_y):
>>> return acos(
>>> sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>>> cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>>> cos(toRadians(long_x) - toRadians(long_y))
>>> ) * lit(6371.0)
>>>
>>> distudf = udf(haversine_distance, FloatType())
>>>
>>> in case you just want to use just Spark SQL, you can still utilize the
>>> functions shown above to implement in SQL.
>>>
>>> Any reason you do not want to use UDF?
>>>
>>> Credit
>>> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>>>
>>>
>>> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru 
>>> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I have a requirement to calculate distance between four
>>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the *pysaprk
>>> dataframe *with the help of from *geopy* import *distance *without
>>> using *UDF* (user defined function)*,*Please help how to achieve this
>>> scenario and do the needful.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ankamma Rao B
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
> --
Best Regards,
Ayan Guha


Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe

2021-04-09 Thread Sean Owen
Actually, good question, I'm not sure. I don't think that Spark would
vectorize these operations over rows.
Whereas in a pandas UDF, given a DataFrame, you can apply operations like
sin to 1000s of values at once in native code via numpy. It's trivially
'vectorizable' and I've seen good wins over, at least, a single-row UDF.
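
For reference, an untested sketch of what such a pandas UDF could look like here
(Spark 3.x style; the column names simply follow the original question):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def haversine_pd(long_x: pd.Series, lat_x: pd.Series,
                 long_y: pd.Series, lat_y: pd.Series) -> pd.Series:
    # numpy works on whole column batches at once, which is where the win comes from
    lat_x_r, lat_y_r = np.radians(lat_x), np.radians(lat_y)
    dlon = np.radians(long_x) - np.radians(long_y)
    return 6371.0 * np.arccos(
        np.sin(lat_x_r) * np.sin(lat_y_r) +
        np.cos(lat_x_r) * np.cos(lat_y_r) * np.cos(dlon)
    )

# df.withColumn("distance_km",
#               haversine_pd("Longtitude1", "Latitude1", "Longtitude2", "Latitude2"))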

On Fri, Apr 9, 2021 at 9:14 AM ayan guha  wrote:

> Hi Sean - absolutely open to suggestions.
>
> My impression was using spark native functions should provide similar perf
> as scala ones because serialization penalty should not be there, unlike
> native python udfs.
>
> Is it wrong understanding?
>
>
>
> On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru  wrote:
>
>> Hi All,
>>
>>
>> Yes, I need to add the below scenario-based code to the executing spark
>> job. While executing this it took a lot of time to complete. Please suggest the
>> best way to get the below requirement without using UDF.
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>> ----------
>> *From:* Sean Owen 
>> *Sent:* Friday, April 9, 2021 6:11 PM
>> *To:* ayan guha 
>> *Cc:* Rao Bandaru ; User 
>> *Subject:* Re: [Spark SQL]:to calculate distance between four
>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk
>> dataframe
>>
>> This can be significantly faster with a pandas UDF, note, because you can
>> vectorize the operations.
>>
>> On Fri, Apr 9, 2021, 7:32 AM ayan guha  wrote:
>>
>> Hi
>>
>> We are using a haversine distance function for this, and wrapping it in
>> udf.
>>
>> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
>> from pyspark.sql.types import *
>>
>> def haversine_distance(long_x, lat_x, long_y, lat_y):
>> return acos(
>> sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>> cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>> cos(toRadians(long_x) - toRadians(long_y))
>> ) * lit(6371.0)
>>
>> distudf = udf(haversine_distance, FloatType())
>>
>> in case you just want to use just Spark SQL, you can still utilize the
>> functions shown above to implement in SQL.
>>
>> Any reason you do not want to use UDF?
>>
>> Credit
>> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>>
>>
>> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru  wrote:
>>
>> Hi All,
>>
>>
>>
>> I have a requirement to calculate distance between four
>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the *pysaprk
>> dataframe *with the help of from *geopy* import *distance *without using
>> *UDF* (user defined function)*,*Please help how to achieve this scenario
>> and do the needful.
>>
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>> --
> Best Regards,
> Ayan Guha
>


Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe

2021-04-09 Thread ayan guha
Hi Sean - absolutely open to suggestions.

My impression was using spark native functions should provide similar perf
as scala ones because serialization penalty should not be there, unlike
native python udfs.

Is it wrong understanding?



On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru  wrote:

> Hi All,
>
>
> Yes, I need to add the below scenario-based code to the executing spark
> job. While executing this it took a lot of time to complete. Please suggest the
> best way to get the below requirement without using UDF.
>
>
> Thanks,
>
> Ankamma Rao B
> --
> *From:* Sean Owen 
> *Sent:* Friday, April 9, 2021 6:11 PM
> *To:* ayan guha 
> *Cc:* Rao Bandaru ; User 
> *Subject:* Re: [Spark SQL]:to calculate distance between four
> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk
> dataframe
>
> This can be significantly faster with a pandas UDF, note, because you can
> vectorize the operations.
>
> On Fri, Apr 9, 2021, 7:32 AM ayan guha  wrote:
>
> Hi
>
> We are using a haversine distance function for this, and wrapping it in
> udf.
>
> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
> from pyspark.sql.types import *
>
> def haversine_distance(long_x, lat_x, long_y, lat_y):
> return acos(
> sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
> cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
> cos(toRadians(long_x) - toRadians(long_y))
> ) * lit(6371.0)
>
> distudf = udf(haversine_distance, FloatType())
>
> in case you just want to use just Spark SQL, you can still utilize the
> functions shown above to implement in SQL.
>
> Any reason you do not want to use UDF?
>
> Credit
> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>
>
> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru  wrote:
>
> Hi All,
>
>
>
> I have a requirement to calculate distance between four
> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the *pysaprk
> dataframe *with the help of from *geopy* import *distance *without using
> *UDF* (user defined function)*,*Please help how to achieve this scenario
> and do the needful.
>
>
>
> Thanks,
>
> Ankamma Rao B
>
>
>
> --
> Best Regards,
> Ayan Guha
>
> --
Best Regards,
Ayan Guha


Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe

2021-04-09 Thread Rao Bandaru
Hi All,

Yes, I need to add the below scenario-based code to the executing spark
job. While executing this it took a lot of time to complete. Please suggest the
best way to get the below requirement without using UDF.

Thanks,
Ankamma Rao B

From: Sean Owen 
Sent: Friday, April 9, 2021 6:11 PM
To: ayan guha 
Cc: Rao Bandaru ; User 
Subject: Re: [Spark SQL]:to calculate distance between four 
coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk 
dataframe

This can be significantly faster with a pandas UDF, note, because you can 
vectorize the operations.

On Fri, Apr 9, 2021, 7:32 AM ayan guha 
mailto:guha.a...@gmail.com>> wrote:
Hi

We are using a haversine distance function for this, and wrapping it in udf.

from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import *

def haversine_distance(long_x, lat_x, long_y, lat_y):
return acos(
sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
cos(toRadians(long_x) - toRadians(long_y))
) * lit(6371.0)

distudf = udf(haversine_distance, FloatType())

in case you just want to use just Spark SQL, you can still utilize the 
functions shown above to implement in SQL.

Any reason you do not want to use UDF?

Credit<https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>

On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru 
mailto:rao.m...@outlook.com>> wrote:
Hi All,



I have a requirement to calculate distance between four coordinates(Latitude1, 
Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe with the help of 
from geopy import distance without using UDF (user defined function),Please 
help how to achieve this scenario and do the needful.



Thanks,

Ankamma Rao B


--
Best Regards,
Ayan Guha


Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe

2021-04-09 Thread Sean Owen
This can be significantly faster with a pandas UDF, note, because you can
vectorize the operations.

On Fri, Apr 9, 2021, 7:32 AM ayan guha  wrote:

> Hi
>
> We are using a haversine distance function for this, and wrapping it in
> udf.
>
> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
> from pyspark.sql.types import *
>
> def haversine_distance(long_x, lat_x, long_y, lat_y):
> return acos(
> sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
> cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
> cos(toRadians(long_x) - toRadians(long_y))
> ) * lit(6371.0)
>
> distudf = udf(haversine_distance, FloatType())
>
> in case you just want to use just Spark SQL, you can still utilize the
> functions shown above to implement in SQL.
>
> Any reason you do not want to use UDF?
>
> Credit
> 
>
>
> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru  wrote:
>
>> Hi All,
>>
>>
>>
>> I have a requirement to calculate distance between four
>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the *pysaprk
>> dataframe *with the help of from *geopy* import *distance *without using
>> *UDF* (user defined function)*,*Please help how to achieve this scenario
>> and do the needful.
>>
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe

2021-04-09 Thread ayan guha
Hi

We are using a haversine distance function for this, and wrapping it in
udf.

from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import *

def haversine_distance(long_x, lat_x, long_y, lat_y):
return acos(
sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
cos(toRadians(long_x) - toRadians(long_y))
) * lit(6371.0)

distudf = udf(haversine_distance, FloatType())

in case you want to use just Spark SQL, you can still utilize the
functions shown above to implement it in SQL.
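
For example, an untested sketch with the expression applied directly as a Column,
no udf wrapper (df and the column names are placeholders taken from the question):

from pyspark.sql.functions import acos, cos, sin, lit, toRadians

df2 = df.withColumn(
    "distance_km",
    acos(
        sin(toRadians("Latitude1")) * sin(toRadians("Latitude2")) +
        cos(toRadians("Latitude1")) * cos(toRadians("Latitude2")) *
        cos(toRadians("Longtitude1") - toRadians("Longtitude2"))
    ) * lit(6371.0)
)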

Any reason you do not want to use UDF?

Credit



On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru  wrote:

> Hi All,
>
>
>
> I have a requirement to calculate distance between four
> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the *pysaprk
> dataframe *with the help of from *geopy* import *distance *without using
> *UDF* (user defined function)*,*Please help how to achieve this scenario
> and do the needful.
>
>
>
> Thanks,
>
> Ankamma Rao B
>


-- 
Best Regards,
Ayan Guha


Re: [SPARK SQL] Sometimes spark does not scale down on k8s

2021-04-05 Thread Alexei
I've increased spark.scheduler.listenerbus.eventqueue.executorManagement.capacity to 10M, and this led to several things.

First, the scaler didn't break when it was expected to. I mean, maxNeededExecutors remained low (except peak values).

Second, the scaler started to behave a bit weird. Having maxExecutors=50 I saw up to 79 executors according to JVM metrics and up to 78 counted from api data (the graphs didn't match, these values changed independently). At the same time the pod count didn't change; I had 50 pods at high time as max.

And one more, as a dessert - with the 10M queue I ran out of the 10G heap in less than three days. But this was expected, so no questions :)

02.04.2021, 17:47, "Alexei" :

Hi all!

We are using spark as a constantly running sql interface to parquet on hdfs and gcs with our in-house app. We use autoscaling with the k8s backend. Sometimes (approx. once a day) something nasty happens and spark stops scaling down, staying with the max available executors. I've checked the graphs (https://imgur.com/a/6h3MfPa) and found a few strange things:

At the same time numberTargetExecutors and numberMaxNeededExecutors increase drastically and remain large even though there could be no requests at all (I've tried to remove the driver from the backend pool; this did not help to scale down even with no requests during ~20mins). There are also lots of dropped events from the executorManagement queue.

I've tried to increase the executorManagement queue size up to 3, this did not help. Is this a bug or kinda expected behavior? Shall I increase the queue size even more, or is there another thing to adjust? Thank you.

spark: 3.1.1
jvm: openjdk-11-jre-headless:amd64 11.0.10+9-0ubuntu1~18.04
k8s provider: gke

some related spark options:
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=50
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.timeout=120s
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.dynamicAllocation.schedulerBacklogTimeout=2s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
spark.scheduler.listenerbus.eventqueue.capacity=3

--
Grats, Alex.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Grats, Alex.
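
For context, settings of this kind are regular Spark conf; a purely illustrative PySpark
snippet (values are examples drawn from this thread, not a recommendation):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("k8s-autoscaling")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "5")
         .config("spark.dynamicAllocation.maxExecutors", "50")
         # queue capacity discussed above; 10M was the value tried in this thread
         .config("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "10000000")
         .getOrCreate())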

Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-23 Thread Mich Talebzadeh
Hi,

I did some investigation on this and created a dataframe on top of the
underlying view in Oracle database.

Let's assume that our Oracle view is just a normal view, as opposed to a
materialized view, something like below, where both sales and costs are FACT
tables

CREATE OR REPLACE FORCE EDITIONABLE VIEW "SH"."PROFITS" ("CHANNEL_ID",
"CUST_ID", "PROD_ID", "PROMO_ID", "TIME_ID", "UNIT_COST", "UNIT_PRICE",
"AMOUNT_SOLD", "QUANTITY_SOLD", "TOTAL_COST") AS
  SELECT
s.channel_id,
s.cust_id,
s.prod_id,
s.promo_id,
s.time_id,
c.unit_cost,
c.unit_price,
s.amount_sold,
s.quantity_sold,
c.unit_cost * s.quantity_sold TOTAL_COST
 FROM   costs c, sales s
 WHERE c.prod_id = s.prod_id
   AND c.time_id = s.time_id
   AND c.channel_id = s.channel_id
   AND c.promo_id = s.promo_id;

So it is pretty simple view with a join on sales and cost tables

You typically access this view in Spark with

scala> val df = spark.read.format("jdbc").options(
 |Map("url" -> _ORACLEserver,
 |"dbtable" -> "(SELECT * FROM sh.profits)",  // dbtable could
be on a view or any valid sql
 |"user" -> _username,
 |"password" -> _password)).load
df: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(38,10), CUST_ID:
decimal(38,10) ... 8 more fields]

scala>

scala> df.printSchema()
root
 |-- CHANNEL_ID: decimal(38,10) (nullable = true)
 |-- CUST_ID: decimal(38,10) (nullable = true)
 |-- PROD_ID: decimal(38,10) (nullable = true)
 |-- PROMO_ID: decimal(38,10) (nullable = true)
 |-- TIME_ID: timestamp (nullable = true)
 |-- UNIT_COST: decimal(10,2) (nullable = true)
 |-- UNIT_PRICE: decimal(10,2) (nullable = true)
 |-- AMOUNT_SOLD: decimal(10,2) (nullable = true)
 |-- QUANTITY_SOLD: decimal(10,2) (nullable = true)
 |-- TOTAL_COST: decimal(38,10) (nullable = true)

If you run this, Spark is simply going to fetch the result set from Oracle:
the optimisation happens within Oracle itself and the results are then
returned to Spark.

However, if you use the underlying Oracle tables themselves (create DataFrames on
top of them, here the costs and sales tables) and run the SQL code in Spark
itself, then you will get a more performant result.

scala> val sales = spark.read.format("jdbc").options(
 |Map("url" -> _ORACLEserver,
 |"dbtable" -> "(SELECT * FROM sh.sales)",
 |"user" -> _username,
 |"password" -> _password)).load
sales: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), CUST_ID:
decimal(38,10) ... 5 more fields]

scala> sales.createOrReplaceTempView("sales")

scala> val costs = spark.read.format("jdbc").options(
 |Map("url" -> _ORACLEserver,
 |"dbtable" -> "(SELECT * FROM sh.costs)",
 |"user" -> _username,
 |"password" -> _password)).load
costs: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), TIME_ID:
timestamp ... 4 more fields]

scala> costs.createOrReplaceTempView("costs")

scala> var sqltext = """
 | SELECT
 | s.channel_id,
 | s.cust_id,
 | s.prod_id,
 | s.promo_id,
 | s.time_id,
 | c.unit_cost,
 | c.unit_price,
 | s.amount_sold,
 | s.quantity_sold,
 | c.unit_cost * s.quantity_sold TOTAL_COST
 |  FROM   costs c, sales s
 |  WHERE c.prod_id = s.prod_id
 |AND c.time_id = s.time_id
 |AND c.channel_id = s.channel_id
 |AND c.promo_id = s.promo_id
 | """
sqltext: String =
"
SELECT
s.channel_id,
s.cust_id,
s.prod_id,
s.promo_id,
s.time_id,
c.unit_cost,
c.unit_price,
s.amount_sold,
s.quantity_sold,
c.unit_cost * s.quantity_sold TOTAL_COST
 FROM   costs c, sales s
 WHERE c.prod_id = s.prod_id
   AND c.time_id = s.time_id
   AND c.channel_id = s.channel_id
   AND c.promo_id = s.promo_id
"

Then you can look at what spark optimiser is doing

scala> spark.sql(sqltext).explain()
== Physical Plan ==
*(5) Project [channel_id#27, cust_id#25, prod_id#24, promo_id#28,
time_id#26, unit_cost#42, unit_price#43, amount_sold#30, quantity_sold#29,
CheckOverflow((promote_precision(unit_cost#42) *
promote_precision(quantity_sold#29)), DecimalType(21,4), true) AS
TOTAL_COST#50]
+- *(5) SortMergeJoin [prod_id#38, time_id#39, channel_id#41, promo_id#40],
[prod_id#24, time_id#26, channel_id#27, promo_id#28], Inner
   :- *(2) Sort [prod_id#38 ASC NULLS FIRST, time_id#39 ASC NULLS FIRST,
channel_id#41 ASC NULLS FIRST, promo_id#40 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(prod_id#38, time_id#39, channel_id#41,
promo_id#40, 200), ENSURE_REQUIREMENTS, [id=#37]
   : +- *(1) Scan JDBCRelation((SELECT * FROM sh.costs))
[numPartitions=1]
[PROD_ID#38,TIME_ID#39,PROMO_ID#40,CHANNEL_ID#41,UNIT_COST#42,UNIT_PRICE#43]
PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID),
*IsNotNull(C
