Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Usually the job never reaches that point; it fails during the shuffle. And
storage memory and executor memory are usually low when it fails.

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS
if it runs out of memory on a per-executor basis. This could happen when
evaluating a cache operation like you have below or during shuffle
operations in joins, etc. You might try to increase executor memory, tune
shuffle operations, avoid caching, or reduce the size of your dataframe(s).

Jack
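
[A minimal sketch of the knobs mentioned above, set when building the
session. The keys are standard Spark configs; the values are illustrative
placeholders, not recommendations:]

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Illustrative values only -- size these for your cluster.
    .config("spark.executor.memory", "8g")           # memory per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom
    .config("spark.sql.shuffle.partitions", "400")   # more, smaller shuffle tasks
    .getOrCreate()
)
```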


Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Sure
df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths())
(
    df
    .where(SOME FILTER)
    .repartition(6)
    .cache()
)
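
[A side note on this pattern -- a sketch, not from the thread; the filter
name is a placeholder. DataFrame.cache() is lazy and is shorthand for
persisting at the MEMORY_AND_DISK storage level, so partitions that do not
fit in executor memory are spilled to disk once an action runs:]

```python
from pyspark import StorageLevel

# cache() == persist(StorageLevel.MEMORY_AND_DISK): partitions that don't fit
# in memory are spilled to local disk when the cache is materialized.
df_cached = df.where(SOME_FILTER).repartition(6).cache()
df_cached.count()  # caching is lazy; an action is needed to materialize it

# MEMORY_ONLY recomputes evicted partitions instead of spilling them:
df_alt = df.where(SOME_FILTER).persist(StorageLevel.MEMORY_ONLY)
```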


Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Hi Nebi, can you share the code you’re using to read and write from S3?


About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Hi all,
I am using Spark on EMR to process data. Basically, I read data from AWS S3,
apply transformations, and then write the transformed data back to S3.

Recently we have found that HDFS (/mnt/hdfs) utilization is getting too high.

I disabled `yarn.log-aggregation-enable` by setting it to false.

I am not writing any data to HDFS (/mnt/hdfs); however, Spark is creating
blocks and writing data into it. We are doing all the operations in memory.

Is there any specific operation that writes data to the DataNode (HDFS)?

Here are the HDFS dirs created:

```

15.4G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
129G   /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
129G   /mnt/hdfs/current
129G   /mnt/hdfs

```





RE: Spark 3.4.1 and Hive 3.1.3

2023-09-08 Thread Agrawal, Sanket
Hi Yasukazu,

I tried replacing the jar; the Spark code didn't work, but the vulnerability
was removed. I agree that even 3.1.3 has other vulnerabilities listed on its
Maven page, but those are medium-severity, and we are currently targeting
Critical and High vulnerabilities only.

Thanks,
Sanket

From: Nagatomi Yasukazu 
Sent: Friday, September 8, 2023 9:35 AM
To: Agrawal, Sanket 
Cc: Chao Sun ; Yeachan Park ; 
user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

Hi Sanket,

While migrating to Hive 3.1.3 may resolve many issues, the link below suggests 
that there might still be some vulnerabilities present.
Do you think the specific vulnerability you're concerned about can be addressed 
with Hive 3.1.3?

https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.3

Regards,
Yasukazu

On Fri, Sep 8, 2023 at 12:36 Agrawal, Sanket
<sankeagra...@deloitte.com.invalid> wrote:
Hi Chao,

The reason to migrate to Hive 3.1.3 is to remove a vulnerability from 
hive-exec-2.3.9.jar.

Thanks
Sanket

From: Chao Sun <sunc...@apache.org>
Sent: Thursday, September 7, 2023 10:23 PM
To: Agrawal, Sanket <sankeagra...@deloitte.com.invalid>
Cc: Yeachan Park <yeachan...@gmail.com>; user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

Hi Sanket,

Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a lot of 
work to upgrade the Hive version to 3.x and up.

Normally, though, you only need the Hive client in Spark to talk to the Hive
Metastore (HMS) for things like table or partition metadata. In this case, the
Hive 2.3.9 client used by Spark is already capable of communicating with an
HMS of other versions like Hive 3.x. So, could you share a bit of context on
why you want to use Hive 3.1.3 with Spark?

Chao
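
[A sketch of what that looks like in practice -- illustrative only; the
thrift URI is a placeholder for your metastore endpoint:]

```python
from pyspark.sql import SparkSession

# Spark's bundled Hive 2.3.9 client reading metadata from a remote
# (possibly Hive 3.x) metastore service.
spark = (
    SparkSession.builder
    .config("spark.hadoop.hive.metastore.uris", "thrift://hms-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("SHOW DATABASES").show()
```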


On Thu, Sep 7, 2023 at 6:22 AM Agrawal, Sanket
<sankeagra...@deloitte.com.invalid> wrote:
Hi

I tried using the Maven option and it works. But we are not allowed to
download jars at runtime from Maven because of security restrictions.

So I tried again, downloading Hive 3.1.3 and giving the location of the jars,
and it worked this time. But now our Docker image has 40 new Critical
vulnerabilities due to Hive (scanned by AWS Inspector).

So the only solution I see here is to build Spark 3.4.1 with Hive 3.1.3. But
when I do so, the build fails while compiling the files in /spark/sql/hive,
whereas building Spark 3.4.1 with Hive 2.3.9 completes successfully.

Has anyone tried building Spark 3.4.1 with Hive 3.1.3 or higher?

Thanks,
Sanket A.

From: Yeachan Park <yeachan...@gmail.com>
Sent: Tuesday, September 5, 2023 8:52 PM
To: Agrawal, Sanket <sankeagra...@deloitte.com>
Cc: user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

What's the full traceback when you run the same thing via spark-shell? So 
something like:

$SPARK_HOME/bin/spark-shell \
   --conf "spark.sql.hive.metastore.version=3.1.3" \
   --conf "spark.sql.hive.metastore.jars=path" \
   --conf "spark.sql.hive.metastore.jars.path=/opt/hive/lib/*.jar"

W.r.t. building Hive, there's no need - either download it from
https://downloads.apache.org/hive/hive-3.1.3/ or use the Maven option like
Yasukazu suggested. If you do want to build it, make sure you are using
Java 8 to do so.

On Tue, Sep 5, 2023 at 12:00 PM Agrawal, Sanket
<sankeagra...@deloitte.com> wrote:
Hi,

I tried pointing to Hive 3.1.3 using the below command, but I am still getting
the error. I see that spark-hive-thriftserver_2.12/3.4.1 and
spark-hive_2.12/3.4.1 have a dependency on Hive 2.3.9.

Command: pyspark --conf "spark.

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson: Awesome, it worked with "org.elasticsearch.spark.sql".
But as soon as I switched to elasticsearch-spark-20_2.12, "es" also worked.
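
[For anyone landing here later, a PySpark-flavored sketch of the combination
that worked -- the thread's own code is Scala, and elasticOptions stands in
for the ES connection settings used in this thread:]

```python
# With elasticsearch-spark-30_2.12 on Spark 3.x, use the full data source
# name; per this thread, the short "es" alias only resolved with the -20
# artifact.
(
    df.write
    .format("org.elasticsearch.spark.sql")
    .mode("overwrite")
    .options(**elasticOptions)
    .save("index_name")
)
```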


On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev  wrote:

>
> Let me try that and get back. Just wondering, is there a change in the
> way we pass the format to the connector from Spark 2 to 3?
>
>
> On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson 
> wrote:
>
>> I am pretty certain you need to change the write.format from “es” to
>> “org.elasticsearch.spark.sql”
>>
>> Sent from my iPhone
>>
>> On 8 Sep 2023, at 03:10, Dipayan Dev  wrote:
>>
>> 
>>
>> ++ Dev
>>
>> On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev 
>> wrote:
>>
>>> Hi,
>>>
>> Can you please elaborate on your last response? I don’t have any external
>> dependencies added; I just updated the Spark version as mentioned below.
>>>
>>> Can someone help me with this?
>>>
>>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>>
Could the `provided` scope be the issue?

 On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
 wrote:

> Using the following dependency for Spark 3 in the POM file (my Scala
> version is 2.12.14):
>
> <dependency>
>   <groupId>org.elasticsearch</groupId>
>   <artifactId>elasticsearch-spark-30_2.12</artifactId>
>   <version>7.12.0</version>
>   <scope>provided</scope>
> </dependency>
>
>
> The code throws an error at this line:
> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
> The same code works with Spark 2.4.0 and the following dependency:
>
> <dependency>
>   <groupId>org.elasticsearch</groupId>
>   <artifactId>elasticsearch-spark-20_2.12</artifactId>
>   <version>7.12.0</version>
> </dependency>
>
>
> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
> wrote:
>
>> What’s the version of the ES connector you are using?
>>
>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>> wrote:
>>
>>> Hi All,
>>>
>>> We're using Spark 2.4.x to write a dataframe into the Elasticsearch
>>> index.
>>> As we're upgrading to Spark 3.3.0, it is throwing this error:
>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>> at
>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>
>>> Looking at a few responses on Stack Overflow, it seems this is not yet
>>> supported by elasticsearch-hadoop.
>>>
>>> Does anyone have experience with this? Or faced/resolved this issue
>>> in Spark 3?
>>>
>>> Thanks in advance!
>>>
>>> Regards
>>> Dipayan
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>>>
>>>


Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, is there a change in the way
we pass the format to the connector from Spark 2 to 3?

