Re: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe From: Mich Talebzadeh Sent: Tuesday, August 8, 2023 4:43 PM To: user @spark Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance I am currently pondering and sharing my thoughts openly. Given our reliance on gathered

Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Mich Talebzadeh
I am currently pondering and sharing my thoughts openly. Given our reliance on gathered statistics, it prompts the question of whether we could integrate specific machine learning components into Spark Structured Streaming. Consider a scenario where we aim to adjust configuration values on the fly

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
So if you disable shuffle tracking but enable shuffle block decommissioning it should work, from memory. On Tue, Aug 8, 2023 at 4:13 AM Mich Talebzadeh wrote: > Hm. I don't think it will work > > --conf spark.dynamicAllocation.shuffleTracking.enabled=false > > In Spark 3.4.1 running spark in k8s
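For readers following along, a minimal sketch of the combination Holden describes (shuffle tracking off, shuffle block decommissioning on), expressed as session configs. The values are illustrative assumptions, not a verified recipe.

```python
from pyspark.sql import SparkSession

# Sketch of the combination discussed above: shuffle tracking disabled,
# shuffle block decommissioning enabled instead (k8s, Spark 3.2+).
# Values are illustrative assumptions, not a verified recipe.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "false")
    .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate()
)
```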

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Mich Talebzadeh
Hm. I don't think it will work --conf spark.dynamicAllocation.shuffleTracking.enabled=false In Spark 3.4.1 running Spark in k8s you get: org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled" to false. On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh wrote: > Yes, I have seen cases where the driver is gone but a couple of executors > hang on. Sounds like a code issue. > > HTH > > Mich Talebzadeh, > Solutions

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Mich Talebzadeh
Yes, I have seen cases where the driver is gone but a couple of executors hang on. Sounds like a code issue. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Spark 3.4.1 with Java 11 performance on k8s serverless/autopilot

2023-08-07 Thread Mich Talebzadeh
Hi, I would like to share my experience with Spark 3.4.1 running on k8s autopilot, or as some refer to it, serverless. My current experience is on Google GKE Autopilot. So essentially you specify the name and region and CSP

Unsubscribe

2023-08-04 Thread heri wijayanto
Unsubscribe

Re: convert pandas image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representation works anyway
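A minimal sketch of the approach Sean describes, with assumed column names (data, height, width, channels): store raw bytes plus dimensions, reshape on use.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, BinaryType,
                               IntegerType)

spark = SparkSession.builder.getOrCreate()

# One row per image: raw bytes plus the dimensions needed to reshape later.
schema = StructType([
    StructField("data", BinaryType()),
    StructField("height", IntegerType()),
    StructField("width", IntegerType()),
    StructField("channels", IntegerType()),
])

img = np.zeros((500, 333, 3), dtype=np.uint8)  # stand-in for a real image
df = spark.createDataFrame(
    [(img.tobytes(), img.shape[0], img.shape[1], img.shape[2])], schema
)

# Reshape back on use, e.g. after collect() or inside a UDF.
row = df.first()
restored = np.frombuffer(bytes(row.data), dtype=np.uint8).reshape(
    row.height, row.width, row.channels
)
```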

Unsubscribe

2023-08-03 Thread Denys Cherepanin
Unsubscribe

Re: convert pandas image column to spark dataframe

2023-08-03 Thread second_co...@yahoo.com.INVALID
Hello Adrian, here is the snippet: import tensorflow_datasets as tfds (ds_train, ds_test), ds_info = tfds.load(dataset_name, data_dir='', split=["train", "test"], with_info=True, as_supervised=True) schema = StructType([StructField("image",

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
Formally, an ICLA is required, and you can read more here: https://www.apache.org/licenses/contributor-agreements.html In practice, it's unrealistic to collect and verify an ICLA for every PR contributed by 1000s of people. We have not gated on that. But, contributions are in all cases governed

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Rinat Shangeeta
(Adding my manager Eugene Kim who will cover me as I plan to be out of the office soon) Hi Kent and Sean, Nice to meet you. I am working on the OSS legal aspects with Pavan who is planning to make the contribution request to the Spark project. I saw that Sean mentioned in his email that the

Re: convert pandas image column to spark dataframe

2023-08-03 Thread Adrian Pop-Tifrea
Hello, can you also please show us how you created the pandas dataframe? I mean, how you added the actual data into the dataframe. It would help us reproduce the error. Thank you, Pop-Tifrea Adrian On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com < second_co...@yahoo.com> wrote: >

Custom Session Windowing in Spark using Scala/Python

2023-08-03 Thread Ravi Teja
Hi, I am new to Spark and looking for help regarding the session windowing in Spark. I want to create session windows on a user activity stream with a gap duration of `x` minutes and also have
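Spark 3.2+ has a built-in session_window function that covers the fixed-gap case; a hedged PySpark sketch against an assumed stream with user_id and event_time columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed streaming source with an event_time timestamp and a user_id;
# the rate source here is just a stand-in for the real activity stream.
events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("timestamp", "event_time")
    .withColumn("user_id", F.col("value") % 10)
)

# Session window with a fixed gap of x = 5 minutes: a user's window closes
# after 5 minutes of inactivity. session_window is available in Spark 3.2+.
sessions = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy("user_id", F.session_window("event_time", "5 minutes"))
    .count()
)
```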

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hello Winston, Thanks again for this response, I will check this one out. On Wed, Aug 2, 2023 at 3:50 PM Winston Lai wrote: > > Hi Vibhatha, > > I helped you post this question to another community. There is one answer > by someone else for your reference. > > To access the logical plan or

Re: Extracting Logical Plan

2023-08-02 Thread Winston Lai
Hi Vibhatha, I helped you post this question to another community. There is one answer by someone else for your reference. To access the logical plan or optimized plan, you can register a custom QueryExecutionListener and retrieve the plans during the query execution process. Here's an

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
I understand. I sort of drew the same conclusion. But I wasn’t sure. Thanks everyone for taking time on this. On Wed, Aug 2, 2023 at 2:29 PM Ruifeng Zheng wrote: > In Spark Connect, I think the only API to show optimized plan is > `df.explain("extended")` as Winston mentioned, but it is not a

Re: Extracting Logical Plan

2023-08-02 Thread Ruifeng Zheng
In Spark Connect, I think the only API to show optimized plan is `df.explain("extended")` as Winston mentioned, but it is not a LogicalPlan object. On Wed, Aug 2, 2023 at 4:36 PM Vibhatha Abeykoon wrote: > Hello Ruifeng, > > Thank you for these pointers. Would it be different if I use the Spark

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hello Ruifeng, Thank you for these pointers. Would it be different if I use the Spark connect? I am not using the regular SparkSession. I am pretty new to these APIs. Appreciate your thoughts. On Wed, Aug 2, 2023 at 2:00 PM Ruifeng Zheng wrote: > Hi Vibhatha, >I think those APIs are still

Re: Extracting Logical Plan

2023-08-02 Thread Ruifeng Zheng
Hi Vibhatha, I think those APIs are still available? ``` Welcome to [spark-shell ASCII banner] version 3.4.1 Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19) Type in

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hi Winston, I need to use the LogicalPlan object and process it with another function I have written. In earlier Spark versions we can access that via the dataframe object. So if it can be accessed via the UI, is there an API to access the object? On Wed, Aug 2, 2023 at 1:24 PM Winston Lai

Re: Extracting Logical Plan

2023-08-02 Thread Winston Lai
Hi Vibhatha, How about reading the logical plan from the Spark UI, do you have access to the Spark UI? I am not sure what infra you run your Spark jobs on. Usually you should be able to view the logical and physical plan under the Spark UI in text form at least. It is independent of the language

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hi Winston, I am looking for a way to access the LogicalPlan object in Scala. Not sure if explain function would serve the purpose. On Wed, Aug 2, 2023 at 9:14 AM Winston Lai wrote: > Hi Vibhatha, > > Have you tried pyspark.sql.DataFrame.explain — PySpark 3.4.1 > documentation (apache.org) >

Unsubscribe

2023-08-01 Thread Zoran Jeremic
Unsubscribe

Re: Extracting Logical Plan

2023-08-01 Thread Winston Lai
Hi Vibhatha, Have you tried pyspark.sql.DataFrame.explain — PySpark 3.4.1 documentation (apache.org) before? I am not sure what infra that you have, you can

Extracting Logical Plan

2023-08-01 Thread Vibhatha Abeykoon
Hello, I recently upgraded the Spark version to 3.4.1 and I have encountered a few issues. In my previous code, I was able to extract the logical plan using `df.queryExecution` (df: DataFrame and in Scala), but it seems like in the latest API it is not supported. Is there a way to extract the
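Summarizing this thread: classic Spark exposes the plan objects via queryExecution, while Spark Connect only exposes the explain text. A hedged PySpark sketch; the _jdf route is internal API and works in classic (non-Connect) mode only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).filter("id > 5")

# Works in both classic PySpark and Spark Connect: textual plans only.
df.explain(extended=True)

# Classic (non-Connect) PySpark only: reach the JVM QueryExecution through
# the internal _jdf handle. Internal API, may break between versions.
qe = df._jdf.queryExecution()
print(qe.optimizedPlan().toString())
```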

Unsubscribe

2023-08-01 Thread Alex Landa
Unsubscribe

Re: convert pandas image column to spark dataframe

2023-07-31 Thread second_co...@yahoo.com.INVALID
I changed to ArrayType(ArrayType(ArrayType(IntegerType()))), still get the same error. Thank you for responding. On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea wrote: Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images?

Unsubscribe

2023-07-31 Thread Ali Bajwa
Unsubscribe

Unsubscribe

2023-07-30 Thread
Unsubscribe thanks! Guo. Wishing you smooth work and all the best.

Unsubscribe

2023-07-30 Thread Aayush Ostwal
Unsubscribe Thanks, Aayush Ostwal

Unsubscribe

2023-07-30 Thread Parag Chaudhari
Unsubscribe Thanks, Parag Chaudhari

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
ok so as expected the underlying database is Hive. Hive uses hdfs storage. You said you encountered limitations on concurrent writes. The order and limitations are introduced by Hive metastore so to speak. Since this is all happening through Spark, by default implementation of the Hive metastore

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol, Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster restarted so I lost the stack trace in the application UI. In the snippets I saved, it looks like the exception being thrown was from Hive. Given the feedback you've provided, I suspect the issue is with how

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick, You can have multiple writers simultaneously writing to the same table in HDFS by utilizing an open table format with concurrency control. Several formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast Format, offer this capability. All of them provide advanced features
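As one concrete illustration of what Pol describes, a sketch using Delta Lake; it assumes the delta-spark package is on the classpath, and the table name is illustrative:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath; names are illustrative.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Concurrent appends to the same Delta table go through optimistic
# concurrency control instead of failing like plain Hive inserts.
spark.range(100).write.format("delta").mode("append").saveAsTable("events")
```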

Unsubscribe

2023-07-30 Thread 王怡刚
Unsubscribe

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or layer that throws the error. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list. HTH Mich Talebzadeh, Solutions

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert rows

Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes like Google Dataproc or AWS EC2 often utilise YARN resource manager. YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively. In Cloud it is also used widely in Infrastructure as a Service such as

Unsubscribe

2023-07-27 Thread Kevin Wang
Unsubscribe please!

The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Trường Trần Phan An
Hi all, I am learning about the performance difference of Spark when performing a JOIN on serverless (K8s) and serverful (traditional server) environments. In my experiments, Spark on K8s tends to run slower than serverful. Through understanding the architecture, I know that Spark runs

Unsubscribe

2023-07-27 Thread blaz stojanovic
Unsubscribe

Unsubscribe

2023-07-27 Thread Sherif Eid
Unsubscribe

Dynamic allocation does not deallocate executors

2023-07-27 Thread Sergei Zhgirovski
Hi everyone I'm trying to use pyspark 3.3.2. I have these relevant options set: spark.dynamicAllocation.enabled=true spark.dynamicAllocation.shuffleTracking.enabled=true spark.dynamicAllocation.shuffleTracking.timeout=20s spark.dynamicAllocation.executorIdleTimeout=30s

[ANNOUNCE] Apache Celeborn(incubating) 0.3.0 available

2023-07-27 Thread zhongqiang chen
Hi all, the Apache Celeborn (Incubating) community is glad to announce the new release of Apache Celeborn (Incubating) 0.3.0. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, highly efficient service for intermediate data including

Re: convert pandas image column to spark dataframe

2023-07-27 Thread Adrian Pop-Tifrea
Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: > i

convert pandas image column to spark dataframe

2023-07-27 Thread second_co...@yahoo.com.INVALID
I have a pandas dataframe with a column 'image' holding numpy.ndarray values; the shape is (500, 333, 3) per image. My pandas dataframe has 10 rows, thus the shape is (10, 500, 333, 3). When using spark.createDataFrame(panda_dataframe, schema), I need to specify the schema: schema = StructType([

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > > An error was encountered: >
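Since list_packages() is an EMR addition rather than a Spark API, a portable sketch that inspects the driver's Python environment with the standard library (Python 3.8+):

```python
# Portable alternative to EMR's list_packages(): list what is installed in
# the driver's Python environment with the standard library (Python 3.8+).
from importlib import metadata

for dist in sorted(metadata.distributions(),
                   key=lambda d: (d.metadata["Name"] or "").lower()):
    print(dist.metadata["Name"], dist.version)
```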

spark context list_packages()

2023-07-26 Thread second_co...@yahoo.com.INVALID
I ran the following code spark.sparkContext.list_packages() on spark 3.4.1 and i get below error An error was encountered: AttributeError [Traceback (most recent call last): , File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 113, in exec

Re: Interested in contributing to SPARK-24815

2023-07-26 Thread Pavan Kotikalapudi
Thanks for the response with all the information, Sean and Kent. Is there a way to figure out if my employer (Twilio) is part of a CCLA? cc'ing: @Rinat Shangeeta our Open Source Counsel at Twilio Thank you, Pavan On Tue, Jul 25, 2023 at 10:48 PM Kent Yao wrote: > Hi Pavan, > > Refer to the ASF

Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
Hi Pavan, Refer to the ASF Source Header and Copyright Notice Policy[1], code directly submitted to ASF should include the Apache license header without any additional copyright notice. Kent Yao [1] https://www.apache.org/legal/src-headers.html#headers Sean Owen wrote on Tue, Jul 25, 2023 at 07:22: > >

Map Partition is called Multiple Times

2023-07-25 Thread Deepak Patankar
I am trying to run a spark job which performs some database operations and saves passed records in one table and the failed ones in another. Here is the code for the same: ``` log.info("Starting the spark job {}"); String sparkAppName = generateSparkAppName("reading-graph"); SparkConf sparkConf
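One common cause of a stage appearing to run multiple times is two writes branching off the same unpersisted DataFrame, so each action replays the lineage. A hedged sketch of caching before the branch; all names are hypothetical reconstructions from the snippet:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("reading-graph").getOrCreate()

# Hypothetical reconstruction of the pattern in the snippet: one expensive
# DataFrame feeding two sinks (passed vs failed records). Without persist(),
# each write replays the whole lineage, so mapPartitions runs twice.
records = spark.range(1000).withColumn("passed", F.col("id") % 2 == 0)
records.persist(StorageLevel.MEMORY_AND_DISK)

records.filter("passed").write.mode("append").saveAsTable("passed_records")
records.filter("NOT passed").write.mode("append").saveAsTable("failed_records")
records.unpersist()
```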

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF ICLA: https://www.apache.org/licenses/icla.pdf or CCLA: https://www.apache.org/licenses/cla-corporate.pdf I don't believe ASF projects ever retain an original author copyright statement, but rather source files have a

Fwd: Interested in contributing to SPARK-24815

2023-07-24 Thread Pavan Kotikalapudi
Hi Spark Dev, My name is Pavan Kotikalapudi, I work at Twilio. I am looking to contribute to this spark issue https://issues.apache.org/jira/browse/SPARK-24815. There is a clause from the company's OSS policy saying - The proposed contribution is about 100 lines of code modification in the Spark

Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
Personally I have not done it myself. CCed to the Spark user group in case some user has tried it. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Spark 3.3 with parquet 1.10.x

2023-07-24 Thread Pralabh Kumar
Hi Spark users. I have a quick question with respect to Spark 3.3. Currently Spark 3.3 is built with parquet 1.12. However, has anyone tried Spark 3.3 with parquet 1.10? We are at Uber, planning to migrate to Spark 3.3, but we have limitations requiring parquet 1.10. Has anyone tried building Spark

Re: Unable to launch Spark connect on Docker image

2023-07-22 Thread Mich Talebzadeh
This is the downloaded docker? Try this with the added configuration options as below /opt/spark/sbin/start-connect-server.sh --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" --packages org.apache.spark:spark-connect_2.12:3.4.1 And you will get starting
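Once the server is up, a client can attach via the remote URL; a minimal sketch assuming pyspark 3.4+ with the connect extras installed (15002 is the default Spark Connect port):

```python
from pyspark.sql import SparkSession

# Assumes a Spark Connect server on localhost and pyspark[connect] installed;
# 15002 is the default Spark Connect port.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()
```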

Unable to launch Spark connect on Docker image

2023-07-21 Thread Edmondo Porcu
Hello, I am trying to launch Spark connect on Docker Image ❯ docker run -it apache/spark:3.4.1-scala2.12-java11-r-ubuntu /bin/bash spark@aa0a670f7433:/opt/spark/work-dir$ /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.1 starting

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Mich Talebzadeh
this link might help https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Dipayan Dev
I used the following config and the performance has improved a lot: .config("spark.sql.orc.splits.include.file.footer", true). I am not able to find the default value of this config anywhere. Can someone please share the default value of this config? Is it false? Also just curious what this

unsubscribe

2023-07-19 Thread Josh Patterson
unsubscribe

Argo for general purpose k8s scheduling

2023-07-19 Thread Mich Talebzadeh
Hi, Is there any update on the use case of Argo for k8s? As I understand it, Kubeflow uses it for scheduling. Outside of machine learning and MLOps on Kubernetes, has anyone used Argo for standard ETL as well, and if so, any experience? Thanks Mich

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
Thank you. Will try out these options. With Best Regards, On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh wrote: > Sounds like if the mv command is inherently slow, there is little that can > be done. > > The only suggestion I can make is to create the staging table as > compressed to

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Mich Talebzadeh
Sounds like if the mv command is inherently slow, there is little that can be done. The only suggestion I can make is to create the staging table as compressed to reduce its size and hence mv? Is that feasible? Also the managed table can be created with SNAPPY compression STORED AS ORC

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
Hi Mich, Ok, my use-case is a bit different. I have a Hive table partitioned by dates and need to do dynamic partition updates (insert overwrite) daily for the last 30 days (partitions). The ETL inside the staging directories is completed in hardly 5 minutes, but then renaming takes a lot of time as

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that hive staging directory. That directory belongs to Hive and Spark simply does ETL there, loading to the Hive managed table in your case, which ends up in the staging directory. I suggest that you review your design and use an external hive table with explicit location

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance but not significantly. I am just wondering: once Spark creates that staging directory along with the SUCCESS file, can we just do a gsutil rsync command and move these files to the original directory? Has anyone tried this approach, or does anyone foresee any concern? On Mon, 17 Jul 2023

Re: Spark Scala SBT Local build fails

2023-07-18 Thread Varun Shah
++ DEV community On Mon, Jul 17, 2023 at 4:14 PM Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project locally for my 1st > Open Source Contribution, by building and creating a package as mentioned

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team, I am still looking for a guidance here. Really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, is there any suggestion on how much I can increase those parameters? On Mon, 17 Jul 2023 at 8:25 PM, Jay wrote: > Fileoutputcommitter v2 is supported in GCS but the rename is a metadata > copy and delete operation in GCS and therefore if there are many number of > files it will take a

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
Hi, Holden Karau has some fantastic videos on her channel which will be quite helpful. Thanks Gourav On Sun, 16 Jul 2023, 19:15 Brian Huynh, wrote: > Good morning Dipayan, > > Happy to see another contributor! > > Please go through this document for contributors. Please note the >

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
Fileoutputcommitter v2 is supported in GCS, but the rename is a metadata copy and delete operation in GCS, and therefore if there are a large number of files it will take a long time to perform this step. One workaround will be to create a smaller number of larger files if that is possible from Spark and
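For reference, the v2 committer is switched on through a Hadoop property; from Spark it looks like the sketch below. Whether it actually helps on GCS is exactly what this thread is debating.

```python
from pyspark.sql import SparkSession

# The v2 committer discussed above, set as a Hadoop property via Spark.
# On GCS the "rename" is still a copy plus delete per object, so the gain
# may be modest; this only mirrors the thread, it is not a verified fix.
spark = (
    SparkSession.builder
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)
```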

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
You said this Hive table was a managed table partitioned by date -->${TODAY} How do you define your Hive managed table? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Unsubscribe

2023-07-17 Thread mojianan2015
Unsubscribe

Unsubscribe

2023-07-17 Thread Zoran Jeremic
Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It does support it; it doesn't error out for me at least. But it took around 4 hours to finish the job. Interestingly, it took only 10 minutes to write the output in the staging directory and the rest of the time it took to rename the objects. That's the concern. Looks like a known issue as Spark behaves

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote: > Thanks Jay, > > I will try that option. > > Any insight on the file committer algorithms? > > I

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it's not enhancing the runtime. What's the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increasing

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark
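A sketch of raising the two GCS connector flags Jay references, passed through Spark as Hadoop properties; the values are illustrative assumptions to be tuned per workload:

```python
from pyspark.sql import SparkSession

# Illustrative values only, to be tuned per workload; both flags are GCS
# connector options, passed through Spark as Hadoop properties.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.gs.batch.threads", "32")
    .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")
    .getOrCreate()
)
```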

Unsubscribe

2023-07-17 Thread Bode, Meikel
Unsubscribe

Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Resending this message with a proper Subject line Hi Spark Community, I am trying to set up my forked apache/spark project locally for my 1st Open Source Contribution, by building and creating a package as mentioned here under Running Individual Tests

Re: Unsubscribe

2023-07-17 Thread srini subramanian
Unsubscribe  On Monday, July 17, 2023 at 11:19:41 AM GMT+5:30, Bode, Meikel wrote: Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. [image: image.png] When I do a dynamic partition update in Spark, it creates the new file in a staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Hi All, Of late, I have encountered the issue where I have to overwrite a lot of partitions of the Hive table through Spark. It looks like writing to hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes into moving the ORC files from the staging directory to the final

Unsubscribe

2023-07-16 Thread Bode, Meikel
Unsubscribe

Re: Contributing to Spark MLLib

2023-07-16 Thread Brian Huynh
Good morning Dipayan, Happy to see another contributor! Please go through this document for contributors. Please note the MLlib-specific contribution guidelines section in particular. https://spark.apache.org/contributing.html Since you are looking for something to start with, take a look at

Contributing to Spark MLLib

2023-07-16 Thread Dipayan Dev
Hi Spark Community, A very good morning to you. I have been using Spark for the last few years now, and am new to the community. I am very much interested in becoming a contributor. I am looking to contribute to Spark MLLib. Can anyone please suggest how to start contributing to any new MLLib feature? Is

[no subject]

2023-07-16 Thread Varun Shah
Hi Spark Community, I am trying to set up my forked apache/spark project locally by building and creating a package as mentioned here under Running Individual Tests. Here are the steps I have followed: >> ./build/sbt # this

[Spark RPC]: Yarn - Application Master / executors to Driver communication issue

2023-07-14 Thread Sunayan Saikia
Hey Spark Community, Our Jupyterhub/Jupyterlab (with spark client) runs behind two layers of HAProxy and the Yarn cluster runs remotely. We want to use deploy mode 'client' so that we can capture the output of any spark sql query in jupyterlab. I'm aware of other technologies like Livy and Spark
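Not a full answer, but the driver-side knobs usually involved when executors must reach a driver behind proxies are sketched below; hostnames and ports are placeholders, and Spark's RPC traffic must be routable directly rather than terminated by plain HTTP proxying:

```python
from pyspark.sql import SparkSession

# Placeholders throughout: executors must be able to reach these host/ports
# directly, since Spark RPC is not plain HTTP that HAProxy can terminate.
spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.driver.host", "jupyter-gateway.example.com")
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.driver.port", "40000")
    .config("spark.blockManager.port", "40001")
    .getOrCreate()
)
```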

Re: Unable to populate spark metrics using custom metrics API

2023-07-13 Thread Surya Soma
Gentle reminder on this. On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote: > Hello, > > I am trying to publish custom metrics using Spark CustomMetric API as > supported since spark 3.2 https://github.com/apache/spark/pull/31476, > > >

Re: Spark Not Connecting

2023-07-12 Thread Artemis User
Well, in that case, you may want to make sure your Spark server is running properly and that you can access the Spark UI using your browser. If you don't own the spark cluster, contact your spark admin. On 7/12/23 1:56 PM, timi ayoade wrote: I can't even connect to the spark UI On Wed, Jul

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache spark community, I am a Data Engineer. I have been using Apache spark for some time now. I recently tried to use it

Spark Not Connecting

2023-07-12 Thread timi ayoade
Hi Apache spark community, I am a Data Engineer. I have been using Apache spark for some time now. I recently tried to use it but I have been getting some errors. I have tried debugging the error but to no avail. The screenshot is attached below. I will be glad if responded to. Thanks

Re: Loading in custom Hive jars for spark

2023-07-11 Thread Mich Talebzadeh
Are you using Spark 3.4? Under directory $SPARK_HOME get a list of jar files for hive and hadoop. This one is for version 3.4.0 /opt/spark/jars> ltr *hive* *hadoop* -rw-r--r--. 1 hduser hadoop 717820 Apr 7 03:43 spark-hive_2.12-3.4.0.jar -rw-r--r--. 1 hduser hadoop 563632 Apr 7 03:43

Loading in custom Hive jars for spark

2023-07-11 Thread Yeachan Park
Hi all, We made some changes to hive which require changes to the hive jars that Spark is bundled with. Since Spark 3.3.1 comes bundled with Hive 2.3.9 jars, we built our changes in Hive 2.3.9 and put the necessary jars under $SPARK_HOME/jars (replacing the original jars that were there),
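An alternative to replacing the bundled jars, sketched below, is pointing Spark at the custom Hive build through the metastore-jars confs (Spark 3.1+); the path is a placeholder:

```python
from pyspark.sql import SparkSession

# Alternative to swapping files under $SPARK_HOME/jars: point Spark at a
# custom Hive build. The path is a placeholder; "path" mode and the
# .jars.path conf require Spark 3.1+.
spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastore.version", "2.3.9")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path",
            "file:///opt/custom-hive/lib/*.jar")
    .enableHiveSupport()
    .getOrCreate()
)
```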

Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju
We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset and Dataframes. There is a small piece of code in our custom library that makes rdd calls, for example to check if the dataframe is empty: df.rdd.getNumPartitions == 0. When I enable AQE for these jobs,
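A hedged sketch of the usual workaround: if the intent is an emptiness check, a DataFrame-native probe avoids the RDD conversion that forces a separate materialization (isEmpty is available in PySpark 3.3+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).filter("id > 100")

# df.rdd.getNumPartitions() converts to an RDD and can force the plan to
# materialize again under AQE; the DataFrame-native check stays in the
# SQL engine (PySpark 3.3+).
if df.isEmpty():
    print("no rows")
```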

Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
Finally I was able to solve this issue by setting this conf: "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/my_user/temp_folder" Thanks all! On Sat, 8 Jul 2023 at 3:45 AM, Brian Huynh wrote: > Hi Khalid, > > Elango mentioned the file is working fine in another environment
