Re: Non string type partitions

2023-04-11 Thread Chitral Verma
Because the name of the directory cannot be an object, it has to be a
string to create partitioned dirs like "date=2023-04-10"

On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh,  wrote:

>
> Hi Team,
>
> We are running into the below error when we are trying to run a simple
> query a partitioned table in Spark.
>
> *MetaException(message:Filtering is supported only on partition keys of type 
> string)
> *
>
>
> Our the partition column has been to type *date *instead of string and
> query is a very simple SQL as shown below.
>
> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>
> Any idea why spark mandates partition columns to be of type string?. Is
> there a recommended work around for this issue?
>
>
>


Re: Non string type partitions

2023-04-12 Thread Charles vinodh
There are  other distributed execution engines (like hive, trino) that do
support non-string data types for partition columns such as date and
integer.
Any idea why this restriction exists in Spark? ..


On Tue, 11 Apr 2023 at 20:34, Chitral Verma  wrote:

> Because the name of the directory cannot be an object, it has to be a
> string to create partitioned dirs like "date=2023-04-10"
>
> On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, 
> wrote:
>
>>
>> Hi Team,
>>
>> We are running into the below error when we are trying to run a simple
>> query a partitioned table in Spark.
>>
>> *MetaException(message:Filtering is supported only on partition keys of type 
>> string)
>> *
>>
>>
>> Our the partition column has been to type *date *instead of string and
>> query is a very simple SQL as shown below.
>>
>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>
>> Any idea why spark mandates partition columns to be of type string?. Is
>> there a recommended work around for this issue?
>>
>>
>>


Re: Non string type partitions

2023-04-15 Thread Charles vinodh
bumping this up again for suggestions?.. Is the official recommendation to
not have *int* or *date* typed partition columns?

On Wed, 12 Apr 2023 at 10:44, Charles vinodh  wrote:

> There are  other distributed execution engines (like hive, trino) that do
> support non-string data types for partition columns such as date and
> integer.
> Any idea why this restriction exists in Spark? ..
>
>
> On Tue, 11 Apr 2023 at 20:34, Chitral Verma 
> wrote:
>
>> Because the name of the directory cannot be an object, it has to be a
>> string to create partitioned dirs like "date=2023-04-10"
>>
>> On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, 
>> wrote:
>>
>>>
>>> Hi Team,
>>>
>>> We are running into the below error when we are trying to run a simple
>>> query a partitioned table in Spark.
>>>
>>> *MetaException(message:Filtering is supported only on partition keys of 
>>> type string)
>>> *
>>>
>>>
>>> Our the partition column has been to type *date *instead of string and
>>> query is a very simple SQL as shown below.
>>>
>>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>>
>>> Any idea why spark mandates partition columns to be of type string?. Is
>>> there a recommended work around for this issue?
>>>
>>>
>>>


Re: Non string type partitions

2023-04-15 Thread Bjørn Jørgensen
I guess that it has to do with indexing and partitioning data to nodes.
Have a look at data partitioning system design concept

 and key range partitions



You can work around this by creating a temp view where date is casted to
string.

Note I did have to test this sometimes so I'm using .mode("overwrite") on
the file.


from pyspark.sql import SparkSession, Row
from datetime import date

spark = SparkSession.builder.getOrCreate()

mock_data = [
Row(id=1, name="John", partition_col=date(2023, 4, 11)),
Row(id=2, name="Jane", partition_col=date(2023, 4, 11)),
Row(id=3, name="Alice", partition_col=date(2023, 4, 12)),
Row(id=4, name="Bob", partition_col=date(2023, 4, 12)),
]

mock_df = spark.createDataFrame(mock_data)

parquet_data_path = "test_date"
mock_df.write.partitionBy("partition_col").mode("overwrite").parquet(parquet_data_path)

create_table_sql = f"""
CREATE TABLE IF NOT EXISTS my_table (
id INT,
name STRING)
USING parquet
PARTITIONED BY (partition_col DATE)
OPTIONS ('path' = '{parquet_data_path}')
"""

spark.sql(create_table_sql)

# temp view with the string partition column
create_view_sql = f"""
CREATE OR REPLACE TEMPORARY VIEW my_table_with_string_partition AS
SELECT *, CAST(partition_col AS STRING) AS partition_col_str FROM my_table;
"""
spark.sql(create_view_sql)

query = f"SELECT * FROM my_table_with_string_partition WHERE
partition_col_str = '2023-04-11';"
result = spark.sql(query)

result.show()


+---++-+-+
| id|name|partition_col|partition_col_str|
+---++-+-+
|  1|John|   2023-04-11|   2023-04-11|
|  2|Jane|   2023-04-11|   2023-04-11|
+---++-+-+



lør. 15. apr. 2023 kl. 21:41 skrev Charles vinodh :

>
> bumping this up again for suggestions?.. Is the official recommendation to
> not have *int* or *date* typed partition columns?
>
> On Wed, 12 Apr 2023 at 10:44, Charles vinodh 
> wrote:
>
>> There are  other distributed execution engines (like hive, trino) that do
>> support non-string data types for partition columns such as date and
>> integer.
>> Any idea why this restriction exists in Spark? ..
>>
>>
>> On Tue, 11 Apr 2023 at 20:34, Chitral Verma 
>> wrote:
>>
>>> Because the name of the directory cannot be an object, it has to be a
>>> string to create partitioned dirs like "date=2023-04-10"
>>>
>>> On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, 
>>> wrote:
>>>

 Hi Team,

 We are running into the below error when we are trying to run a simple
 query a partitioned table in Spark.

 *MetaException(message:Filtering is supported only on partition keys of 
 type string)
 *


 Our the partition column has been to type *date *instead of string and
 query is a very simple SQL as shown below.

 *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*

 Any idea why spark mandates partition columns to be of type string?. Is
 there a recommended work around for this issue?




-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297