Re: Reading too many files

2022-10-04 Thread Artemis User
Reads by default can't be parallelized within a Spark job, and doing your
own multi-threaded programming in a Spark program isn't a good idea. Adding
fast disk I/O and increasing RAM may speed things up, but won't help with
parallelization. You may have to be more creative here. One option would
be: if each file or group of files can be processed independently, you can
create a script or program on the client side to spawn multiple jobs and
achieve parallel processing that way...
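To make that last option concrete, here is a minimal sketch of a
client-side launcher that runs several spark-submit jobs in parallel, one
per group of files. The script name (process_files.py), the path globs,
and the worker count are hypothetical placeholders, not anything from the
original setup:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical input path globs; each group is handled by one Spark job.
    file_groups = [
        "/data/in/batch-00*",
        "/data/in/batch-01*",
        "/data/in/batch-02*",
    ]

    def submit(path_glob):
        # Launch an independent Spark job for this group and wait for it.
        # process_files.py is a placeholder for your own Spark application.
        return subprocess.run(
            ["spark-submit", "process_files.py", path_glob],
            check=True,
        )

    # Run up to three spark-submit processes concurrently on the client side.
    with ThreadPoolExecutor(max_workers=3) as pool:
        list(pool.map(submit, file_groups))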


On 10/3/22 7:29 PM, Henrik Pang wrote:

You may need a lot of cluster memory and fast disk I/O.


Sachit Murarka wrote:
Can anyone please suggest if there is any property to improve the
parallel reads? I am reading more than 25000 files.








Re: [Spark Core][Release] Can we consider adding SPARK-39725 to the 3.3.1 or 3.3.2 release?

2022-10-04 Thread Bjørn Jørgensen
I have made a PR for this now.

On Tue, Oct 4, 2022 at 7:02 PM Sean Owen wrote:

> I think it's fine to backport that to 3.3.x, regardless of whether it
> clearly affects Spark or not.
>
> On Tue, Oct 4, 2022 at 11:31 AM phoebe chen wrote:
>
>> Hi:
>> (Not sure if this mailing list is the right place for such a question,
>> but I'll just try my luck here, thanks.)
>>
>> SPARK-39725 has a fix for the
>> security issues CVE-2022-2047 and CVE-2022-2048 (High), which was
>> targeted for the 3.4.0 release, but that will happen in Feb 2023. Is it
>> possible to have it in an earlier release such as 3.3.1 or 3.3.2?
>>
>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [Spark Core][Release] Can we consider adding SPARK-39725 to the 3.3.1 or 3.3.2 release?

2022-10-04 Thread Sean Owen
I think it's fine to backport that to 3.3.x, regardless of whether it
clearly affects Spark or not.

On Tue, Oct 4, 2022 at 11:31 AM phoebe chen wrote:

> Hi:
> (Not sure if this mailing list is the right place for such a question,
> but I'll just try my luck here, thanks.)
>
> SPARK-39725 has a fix for the
> security issues CVE-2022-2047 and CVE-2022-2048 (High), which was
> targeted for the 3.4.0 release, but that will happen in Feb 2023. Is it
> possible to have it in an earlier release such as 3.3.1 or 3.3.2?
>
>
>


[Spark Core][Release] Can we consider adding SPARK-39725 to the 3.3.1 or 3.3.2 release?

2022-10-04 Thread phoebe chen
Hi:
(Not sure if this mailing list is the right place for such a question, but
I'll just try my luck here, thanks.)

SPARK-39725 has a fix for the security
issues CVE-2022-2047 and CVE-2022-2048 (High), which was targeted for the
3.4.0 release, but that will happen in Feb 2023. Is it possible to have it
in an earlier release such as 3.3.1 or 3.3.2?


Re: Converting None/Null into JSON in PySpark

2022-10-04 Thread Yeachan Park
You can try this (replace spark with whatever variable your SparkSession
is bound to): spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", False)
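To make that concrete, here is a minimal end-to-end sketch, assuming your
SparkSession is bound to the name spark; the toy DataFrame and its column
names are made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Keep null fields when generating JSON instead of dropping them
    # (the default for spark.sql.jsonGenerator.ignoreNullFields is true).
    spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")

    # A toy DataFrame whose col_b is entirely null.
    df = spark.createDataFrame([(1, None), (2, None)], "id INT, col_b STRING")

    df.select(F.to_json(F.struct(*df.columns)).alias("json_data")).show(truncate=False)
    # {"id":1,"col_b":null}
    # {"id":2,"col_b":null}

I believe the same option can also be passed per call rather than globally,
via to_json's options argument:
F.to_json(F.struct(*df.columns), {"ignoreNullFields": "false"})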

On Tue, Oct 4, 2022 at 4:55 PM Karthick Nk wrote:

> Thanks.
> I am using PySpark in Databricks. I have looked through multiple
> references but couldn't find the exact snippet. Could you share a sample
> snippet showing how to set that property?
>
> My step:
> df = df.selectExpr('to_json(struct(*)) as json_data')
>
On Tue, Oct 4, 2022 at 10:57 AM Yeachan Park wrote:
>
>> Hi,
>>
>> There's a config option for this. Try setting this to false in your
>> Spark conf.
>>
>> spark.sql.jsonGenerator.ignoreNullFields
>>
>> On Tuesday, October 4, 2022, Karthick Nk wrote:
>>
>>> Hi all,
>>>
>>> I need to convert a PySpark DataFrame into JSON.
>>>
>>> While converting, if all of a particular column's values are null/None,
>>> that column gets removed from the output.
>>>
>>> Could you suggest a way to do this? I need to convert the DataFrame
>>> into JSON with all columns preserved.
>>>
>>> Thanks
>>>
>>


Re: Converting None/Null into JSON in PySpark

2022-10-04 Thread Karthick Nk
Thanks.
I am using PySpark in Databricks. I have looked through multiple
references but couldn't find the exact snippet. Could you share a sample
snippet showing how to set that property?

My step:
df = df.selectExpr('to_json(struct(*)) as json_data')

On Tue, Oct 4, 2022 at 10:57 AM Yeachan Park wrote:

> Hi,
>
> There's a config option for this. Try setting this to false in your
> Spark conf.
>
> spark.sql.jsonGenerator.ignoreNullFields
>
> On Tuesday, October 4, 2022, Karthick Nk wrote:
>
>> Hi all,
>>
>> I need to convert a PySpark DataFrame into JSON.
>>
>> While converting, if all of a particular column's values are null/None,
>> that column gets removed from the output.
>>
>> Could you suggest a way to do this? I need to convert the DataFrame into
>> JSON with all columns preserved.
>>
>> Thanks
>>
>