Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Yeachan Park
Hi,

The maven option is good for testing, but I wouldn't recommend running it in
production from a security perspective; also, depending on your setup, you
might be downloading jars at the start of every Spark session.

By the way, Spark definitely does not require all the jars from Hive, since
you are only trying to connect to the metastore. Can you try pointing
spark.sql.hive.metastore.jars.path to just the following jars from Hive 3.1.3:
- hive-common-3.1.3.jar
- hive-metastore-3.1.3.jar
- hive-shims-common-3.1.3.jar
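A rough sketch of what I mean, assuming the jars are copied to /opt/hive/lib
(adjust the paths to wherever they actually live in your image):

```python
from pyspark.sql import SparkSession

# Point the isolated Hive metastore client at only the metastore-related jars.
# The paths below are assumptions - adjust to your image layout.
hive_jars = ",".join([
    "file:///opt/hive/lib/hive-common-3.1.3.jar",
    "file:///opt/hive/lib/hive-metastore-3.1.3.jar",
    "file:///opt/hive/lib/hive-shims-common-3.1.3.jar",
])

spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastore.version", "3.1.3")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", hive_jars)
    .enableHiveSupport()
    .getOrCreate()
)
```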

On Thu, Sep 7, 2023 at 3:20 PM Agrawal, Sanket 
wrote:

> Hi
>
>
>
> I tried using the maven option and it's working. But we are not allowed to
> download jars at runtime from Maven because of some security restrictions.
>
>
>
> So, I tried again by downloading Hive 3.1.3 and pointing to the location of
> its jars, and it worked this time. But now our Docker image has 40 new
> critical vulnerabilities due to Hive (scanned by AWS Inspector).
>
>
>
> So, the only solution I see here is to build *Spark 3.4.1 with Hive 3.1.3*.
> But when I do so, the build fails while compiling the files in
> /spark/sql/hive. When I build *Spark 3.4.1 with Hive 2.3.9*, the build
> completes successfully.
>
>
>
> Has anyone tried building Spark 3.4.1 with Hive 3.1.3 or higher?
>
>
>
> Thanks,
>
> Sanket A.
>
>
>
> *From:* Yeachan Park 
> *Sent:* Tuesday, September 5, 2023 8:52 PM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3
>
>
>
> What's the full traceback when you run the same thing via spark-shell? So
> something like:
>
>
>
> $SPARK_HOME/bin/spark-shell \
>--conf "spark.sql.hive.metastore.version=3.1.3" \
>--conf "spark.sql.hive.metastore.jars=path" \
>--conf "spark.sql.hive.metastore.jars.path=/opt/hive/lib/*.jar"
>
>
>
> W.r.t. building Hive, there's no need - either download it from
> https://downloads.apache.org/hive/hive-3.1.3/
> or use the maven option like Yasukazu suggested. If you do want to build it,
> make sure you are using Java 8 to do so.
>
>
>
> On Tue, Sep 5, 2023 at 12:00 PM Agrawal, Sanket 
> wrote:
>
> Hi,
>
>
>
> I tried pointing to Hive 3.1.3 using the below command, but I'm still getting
> an error. I see that spark-hive-thriftserver_2.12/3.4.1 and
> spark-hive_2.12/3.4.1 have a dependency on Hive 2.3.9.
>
>
>
> Command: pyspark --conf "spark.sql.hive.metastore.version=3.1.3" --conf
> "spark.sql.hive.metastore.jars=path" --conf
> "spark.sql.hive.metastore.jars.path=file://opt/hive/lib/*.jar"
>
>
>
> Error:
>
>
>
>
>
> Also, when I try to build Spark with Hive 3.1.3 I get the
> following error.
>
>
>
> If anyone can give me some direction, it would be of great help.
>
>
>
> Thanks,
>
> Sanket
>
>
>
> *From:* Yeachan Park 
> *Sent:* Tuesday, September 5, 2023 1:32 AM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3
>
>
>
> Hi,
>
>
>
> Why not download/build the hive 3.1.3 bundle and tell Spark to use that?
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
>
>
>
> Basically, set:
>
> spark.sql.hive.metastore.version 3.1.3
>
> spark.sql.hive.metastore.jars path
>
> spark.sql.hive.metastore.jars.path

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
supported on GCS? IIRC it wasn't, but you could check with GCP support
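For reference, a sketch of how the committer and GCS connector settings
discussed in this thread are usually passed to Spark (the values here are
illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# Illustrative only: whether algorithm version 2 behaves correctly on GCS is
# exactly the open question, and the batch values are placeholders to tune.
spark = (
    SparkSession.builder
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .config("spark.hadoop.fs.gs.batch.threads", "32")
    .config("spark.hadoop.fs.gs.max.requests.per.batch", "32")
    .getOrCreate()
)
```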


On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev  wrote:

> Thanks Jay,
>
> I will try that option.
>
> Any insight on the file committer algorithms?
>
> I tried the v2 algorithm but it's not improving the runtime. What's the best
> practice in Dataproc for dynamic updates in Spark?
>
>
> On Mon, 17 Jul 2023 at 7:05 PM, Jay  wrote:
>
>> You can try increasing fs.gs.batch.threads and
>> fs.gs.max.requests.per.batch.
>>
>> The definitions for these flags are available here -
>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>
>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>> wrote:
>>
>>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>>> Hive table on top of this.
>>> [image: image.png]
>>> When I do a dynamic partition update in Spark, it creates the new files
>>> in a staging area as shown here.
>>> But the GCS blob renaming takes a lot of time. I have partitions based on
>>> dates and I need to update around 3 years of data. It usually takes 3
>>> hours to finish the process. Any way to speed this up?
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 So you are using GCP and your Hive is installed on Dataproc which
 happens to run your Spark as well. Is that correct?

 What version of Hive are you using?

 HTH


 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 Palantir Technologies Limited
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
 wrote:

> Hi All,
>
> Of late, I have encountered the issue where I have to overwrite a lot
> of partitions of a Hive table through Spark. It looks like writing to the
> hive_staging_directory takes 25% of the total time, whereas 75% or more
> goes into moving the ORC files from the staging directory to the final
> partitioned directory structure.
>
> I found some references suggesting the use of this config during
> the Spark write:
> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>
> However, it's also mentioned that it's not safe, as a partial job failure
> might cause data loss.
>
> Is there any suggestion on the pros and cons of using this version? Or
> any ongoing Spark feature development to address this issue?
>
>
>
> With Best Regards,
>
> Dipayan Dev
>
 --
>
>
>
> With Best Regards,
>
> Dipayan Dev
> Author of *Deep Learning with Hadoop*
> M.Tech (AI), IISc, Bangalore
>


Loading in custom Hive jars for spark

2023-07-11 Thread Yeachan Park
Hi all,

We made some changes to Hive which require changes to the Hive jars that
Spark is bundled with. Since Spark 3.3.1 comes bundled with Hive 2.3.9
jars, we built our changes against Hive 2.3.9 and put the necessary jars under
$SPARK_HOME/jars (replacing the original jars that were there), and everything
works fine.

However, since I wanted to make use of spark.jars.packages to download jars
at runtime, I thought it would also work if I deleted the original
Hive jars from $SPARK_HOME/jars and downloaded the same jars at runtime.
Apparently spark.jars.packages should add these jars to the classpath.
Instead I get a NoClassDefFoundError even though the same jars are downloaded:

```
Caused by: java.lang.reflect.InvocationTargetException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
  at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
  at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:227)
  ... 87 more
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:75)
  ... 92 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
  at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown Source)
  at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
```

The class HiveException should already be available in the jars that have
been supplied by spark.jars.packages... Any idea what could be wrong?
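To make it concrete, a minimal sketch of the setup described above (the Maven
coordinates are illustrative, not the full list we pull in):

```python
from pyspark.sql import SparkSession

# Sketch: the bundled Hive jars were removed from $SPARK_HOME/jars and the
# same versions are pulled back in at runtime via spark.jars.packages.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.apache.hive:hive-exec:2.3.9,org.apache.hive:hive-metastore:2.3.9",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# In our case the NoClassDefFoundError above shows up once the Hive external
# catalog is first touched, e.g.:
spark.sql("SHOW DATABASES").show()
```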

Thanks,
Yeachan


Raise exception whilst casting instead of defaulting to null

2023-04-05 Thread Yeachan Park
Hi all,

The default behaviour of Spark is to add a null value for casts that fail,
unless ANSI SQL is enabled (SPARK-30292).

Whilst I understand that this is a subset of ANSI-compliant behaviour, I
don't understand why this feature is so tightly coupled to it. Enabling ANSI
also comes with other consequences that fall outside casting behaviour, and
not all Spark operations are done via the SQL interface (i.e. spark.sql("") ).

I can imagine it would be a pretty useful feature to have something like an
extra arg that raises an exception if casting fails (e.g. *df.age.cast("int",
raise=True)* ), without having to enable ANSI as an option.
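For comparison, a minimal sketch of the current workaround, i.e. toggling the
ANSI runtime conf around the cast:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("42",), ("not a number",)], ["age"])

# Default behaviour: the failed cast silently becomes null
df.select(F.col("age").cast("int")).show()

# Workaround today: enable ANSI mode so the same cast raises at execution
# time instead - at the cost of changing other behaviour too (overflow,
# division by zero, etc.)
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    df.select(F.col("age").cast("int")).show()
except Exception as e:
    print("cast raised:", type(e).__name__)
finally:
    spark.conf.set("spark.sql.ansi.enabled", "false")
```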

Does anyone know why this approach was chosen/have I missed something?
Would others find something like this useful?

Thanks,
Yeachan


How to check the liveness of a SparkSession

2023-01-19 Thread Yeachan Park
Hi all,

We have a long-running PySpark session in client mode that occasionally
dies.

We'd like to check whether the session is still alive. One solution we came
up with was checking whether the UI is still up, but we were wondering if
there's maybe an easier way than that.

Maybe something like spark.getActiveSession() might do the job. I noticed
that it throws a connection refused error if the current Spark session dies.
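For context, the crudest alternative would be to wrap a trivial job in a
try/except, roughly like this sketch:

```python
from pyspark.sql import SparkSession


def spark_is_alive(spark: SparkSession) -> bool:
    """Best-effort liveness probe: run a trivial job and see if it completes."""
    try:
        spark.sql("SELECT 1").collect()
        return True
    except Exception:
        # A dead driver/JVM typically surfaces here as a Py4J/connection error
        return False
```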

Are there any official/suggested ways to check this? I couldn't find much
in the docs/previous mailing lists.

Kind regards,
Yeachan


Re: Converting None/Null into json in pyspark

2022-10-04 Thread Yeachan Park
You can try this (replace spark with whatever variable your SparkSession is
bound to): spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", False)
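Putting it together with your to_json step, a small sketch (the dataframe and
column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep null fields when generating JSON
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", False)

# Toy dataframe where one column is entirely null (made-up columns)
df = spark.createDataFrame([(1, None), (2, None)], "id INT, comment STRING")

# Same step as in your message - "comment" now shows up as null in the JSON
df.selectExpr("to_json(struct(*)) as json_data").show(truncate=False)
```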

On Tue, Oct 4, 2022 at 4:55 PM Karthick Nk  wrote:

> Thanks
> I am using PySpark in Databricks. I have looked through multiple references
> but couldn't find the exact snippet. Could you share a sample snippet showing
> how to set that property?
>
> My step:
> df = df.selectExpr(f'to_json(struct(*)) as json_data')
>
> On Tue, Oct 4, 2022 at 10:57 AM Yeachan Park  wrote:
>
>> Hi,
>>
>> There's a config option for this. Try setting this to false in your spark
>> conf.
>>
>> spark.sql.jsonGenerator.ignoreNullFields
>>
>> On Tuesday, October 4, 2022, Karthick Nk  wrote:
>>
>>> Hi all,
>>>
>>> I need to convert a PySpark dataframe into JSON.
>>>
>>> While converting, if all row values are null/None for a particular
>>> column, that column gets removed from the data.
>>>
>>> Could you suggest a way to do this? I need to convert the dataframe into
>>> JSON with all columns.
>>>
>>> Thanks
>>>
>>


Re: Converting None/Null into json in pyspark

2022-10-03 Thread Yeachan Park
Hi,

There's a config option for this. Try setting this to false in your spark
conf.

spark.sql.jsonGenerator.ignoreNullFields

On Tuesday, October 4, 2022, Karthick Nk  wrote:

> Hi all,
>
> I need to convert a PySpark dataframe into JSON.
>
> While converting, if all row values are null/None for a particular
> column, that column gets removed from the data.
>
> Could you suggest a way to do this? I need to convert the dataframe into
> JSON with all columns.
>
> Thanks
>


Filtering by job group in the Spark UI / API

2022-08-18 Thread Yeachan Park
Hi All,

Is there a way to filter the jobs shown in the history server UI / returned
by Spark's API based on the job group the job belongs to?

Ideally we would like to supply a particular job group and only see the
jobs associated with that job group in the UI.
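This wouldn't cover the history server, but for a live application here is a
rough sketch of what we can already do via the status tracker (the job group
name is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Tag jobs submitted from this thread with a (made-up) job group
sc.setJobGroup("nightly-etl", "nightly ETL jobs")
spark.range(1_000_000).count()

# For a live application the status tracker can list job IDs per group;
# we're after something equivalent in the history server UI / API.
tracker = sc.statusTracker()
print(tracker.getJobIdsForGroup("nightly-etl"))
```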

Thanks,
Yeachan


Reading snappy/lz4 compressed csv/json files

2022-07-05 Thread Yeachan Park
Hi all,

We are trying to use Spark to read csv/json files that have been snappy/lz4
compressed. The files were compressed with the lz4 command line tool and the
python-snappy library.

Neither succeeded, while other formats (bzip2 & gzip) worked fine.

I've read in some places that these codecs are not fully compatible between
different implementations. Has anyone else had success with this? I'd be
happy to hear how you went about it.
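For reference, a sketch of the kind of read we're attempting (file paths are
made up; Hadoop picks the codec from the file extension):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# These work for us - the codec is chosen from the extension
df_gz = spark.read.json("/data/events.json.gz")
df_bz2 = spark.read.json("/data/events.json.bz2")

# These fail for us - files produced by the lz4 CLI / python-snappy library
df_lz4 = spark.read.json("/data/events.json.lz4")
df_snappy = spark.read.json("/data/events.json.snappy")
```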

Thanks,
Yeachan


[Spark Core]: Unexpectedly exiting executor while gracefully decommissioning

2022-04-22 Thread Yeachan Park
Hello all, we are running into some issues while attempting graceful
decommissioning of executors. We are running spark-thriftserver (3.2.0) on
Kubernetes (GKE 1.20.15-gke.2500). We enabled:

   - spark.decommission.enabled
   - spark.storage.decommission.rddBlocks.enabled
   - spark.storage.decommission.shuffleBlocks.enabled
   - spark.storage.decommission.enabled

and set spark.storage.decommission.fallbackStorage.path to a path in our
bucket.
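For completeness, a sketch of the settings above in one place (shown via a
SparkSession builder for brevity; the fallback path is a placeholder):

```python
from pyspark.sql import SparkSession

# Sketch of the decommissioning settings described above;
# the fallback storage path is a placeholder.
spark = (
    SparkSession.builder
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .config("spark.storage.decommission.fallbackStorage.path",
            "gs://our-bucket/spark-fallback/")
    .getOrCreate()
)
```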

The logs from the driver seem to suggest the decommissioning process
started but then unexpectedly exited and failed, while the executor logs
seem to suggest that decommissioning was successful.

Attached are the error logs:

https://gist.github.com/yeachan153/9bfb2f0ab9ac7f292fb626186b014bbf


Thanks in advance.

