Elasticsearch support for Spark 3.x

2023-08-26 Thread Dipayan Dev
Hi All, We're using Spark 2.4.x to write a dataframe into an Elasticsearch index. As we're upgrading to Spark 3.3.0, it is throwing an error: Caused by: java.lang.ClassNotFoundException: es.DefaultSource at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476) at

Spark 2.4.7

2023-08-25 Thread Harry Jamison
I am using python 3.7 and Spark 2.4.7. I am not sure what the best way to do this is. I have a dataframe with a url in one of the columns, and I want to download the contents of that url and put it in a new column. Can someone point me in the right direction on how to do this? I looked at the UDFs
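One possible direction for the question above, offered as a minimal sketch and not a definitive answer: wrap a plain download function in a UDF. The helper names `fetch_text` and `with_contents`, the column names `url` and `contents`, and the 10-second timeout are assumptions made for illustration; `pyspark.sql.functions.udf` and `DataFrame.withColumn` are the only Spark APIs used.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Download url and return the body as text, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # a malformed URL raises ValueError.
        return None


def with_contents(df, url_col="url", out_col="contents"):
    """Hypothetical helper: add a column holding the downloaded body of each URL."""
    # Imported lazily so fetch_text can be tried without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    fetch_udf = udf(fetch_text, StringType())
    return df.withColumn(out_col, fetch_udf(df[url_col]))
```

A call such as `with_contents(df)` would run the downloads on the executors, one row at a time, so rate limits on the remote server are worth keeping in mind.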

Re: mysterious spark.sql.utils.AnalysisException Union in spark 3.3.2, but not seen in 3.4.0+

2023-08-25 Thread Mich Talebzadeh
Hi Srivatsan, Ground investigation: 1. Does this union explicitly exist in your code? If not, where are the 7 and 6 column counts coming from? 2. On 3.3.1 have you looked at the Spark UI and the relevant DAG diagram? 3. Check the query execution plan using explain() functionality 4. Can

mysterious spark.sql.utils.AnalysisException Union in spark 3.3.2, but not seen in 3.4.0+

2023-08-25 Thread Srivatsan vn
Hello Users, I have been seeing some weird issues since I upgraded my EMR setup to 6.11 (which uses Spark 3.3.2). The call stack seems to point to a code location where there is no explicit union, also I have unionByName everywhere in the codebase with allowMissingColumns set

Spark Connect: API mismatch in SparkSession#execute

2023-08-25 Thread Stefan Hagedorn
Hi everyone, I’m trying to use the “extension” feature of the Spark Connect CommandPlugin (Spark 3.4.1). I created a simple protobuf message `MyMessage` that I want to send from the connect client-side to the connect server (where I registered my plugin). The SparkSession class in `spark

Re: $SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} failing

2023-08-23 Thread Mich Talebzadeh
Hi Jeremy, This error concerns me "23/08/23 20:01:03 ERROR LevelDBProvider: error opening leveldb file file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications org.fusesource.leveldbjni.internal.Nat

$SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} failing

2023-08-23 Thread Jeremy Brent
Hi Spark Community, We have a cluster running with Spark 3.3.1. All nodes are AWS EC2’s with an Ubuntu OS version 22.04. One of the workers disconnected from the main node. When we run $SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} it appears to run successfully; there is no

[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3! Spark 3.3.3 is a maintenance release containing stability fixes. This release is based on the branch-3.3 maintenance branch of Spark. We strongly recommend all 3.3 users to upgrade to this stable release. To download Spark 3.3.3

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Bjørn Jørgensen
In your file /home/spark/real-estate/pullhttp/pull_apartments.py replace import org.apache.spark.SparkContext with from pyspark import SparkContext On Mon, 21 Aug 2023 at 15:13, Kal Stevens wrote: > I am getting a class not found error > import org.apache.spark.SparkContext >

Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
I am getting a class not found error import org.apache.spark.SparkContext It sounds like this is because pyspark is not installed, but as far as I can tell it is. Pyspark is installed in the correct python version root@namenode:/home/spark/# pip3.10 install pyspark Requirement already

Spark doesn’t create SUCCESS file when external path is passed

2023-08-21 Thread Dipayan Dev
Hi Team, I need some help: could someone replicate the issue at their end, or let me know if I am doing anything wrong? https://issues.apache.org/jira/browse/SPARK-44884 We have recently upgraded to Spark 3.3.0 in our Production Dataproc. We have a lot of downstream applications that rely

Re: k8s+ YARN Spark

2023-08-21 Thread Mich Talebzadeh
Interesting. Spark supports the following cluster managers - Standalone: A cluster-manager, limited in features, shipped with Spark. - Apache Hadoop YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Mich Talebzadeh
This should work; check your path. Running which pyspark should return /opt/spark/bin/pyspark And your installation should contain cd $SPARK_HOME /opt/spark> ls LICENSE NOTICE R README.md RELEASE bin conf data examples jars kubernetes licenses logs python sbin yarn You sho
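The path and installation check described above could be scripted roughly as follows. This is only a sketch: the helper name `missing_spark_dirs` is hypothetical, and treating `bin`, `conf`, `jars`, `python` and `sbin` as the required subset of the listing is an assumption.

```python
import os
import shutil

# Subset of the directories listed above; treating these as required is an assumption.
EXPECTED_DIRS = ("bin", "conf", "jars", "python", "sbin")


def missing_spark_dirs(spark_home, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent under spark_home."""
    return [d for d in expected if not os.path.isdir(os.path.join(spark_home, d))]


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME", "")
    print("pyspark on PATH:", shutil.which("pyspark"))
    print("SPARK_HOME:", home or "(not set)")
    print("missing directories:", missing_spark_dirs(home) if home else "n/a")
```

An empty "missing directories" list together with a `pyspark` path under `$SPARK_HOME/bin` would match the layout shown in the message.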

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
Nevermind I was doing something dumb On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens wrote: > Are there installation instructions for Spark 3.4.1? > > I defined SPARK_HOME as it describes here > > https://spark.apache.org/docs/latest/api/python/getting_started/install.html > >

k8s+ YARN Spark

2023-08-21 Thread Крюков Виталий Семенович
Good afternoon. Perhaps you will be discouraged by what I will write below, but nevertheless, I ask for help in solving my problem. Perhaps the architecture of our solution will not seem correct to you. There are backend services that communicate with a service that implements spark-driver

Problem with spark 3.4.1 not finding spark java classes

2023-08-20 Thread Kal Stevens
Are there installation instructions for Spark 3.4.1? I defined SPARK_HOME as it describes here https://spark.apache.org/docs/latest/api/python/getting_started/install.html ls $SPARK_HOME/python/lib py4j-0.10.9.7-src.zip PY4J_LICENSE.txt pyspark.zip I am getting a class not found error

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
ill in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Sat, 19 Aug 2023 at 21:36, Dipayan Dev wrote: > >> Hi Everyone, >> >> I'm stuck with one problem, where I need to provide a custom GCS locat

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Mich Talebzadeh
ed to provide a custom GCS location > for the Hive table from Spark. The code fails while doing an *'insert > into'* whenever my Hive table has a flat GS location like > gs://, but works for nested locations like > gs://bucket_name/blob_name. > > Is anyone aware if it

Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone, I'm stuck with one problem, where I need to provide a custom GCS location for the Hive table from Spark. The code fails while doing an *'insert into'* whenever my Hive table has a flat GS location like gs://, but works for nested locations like gs://bucket_name/blob_n

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Nebi Aydin
rency and Thread Issues: If there are too many concurrent > connections or thread limitations, > it could result in failed connections. *Adjust > spark.shuffle.io.clientThreads* > - It might be prudent to do the same to *spark.shuffle.io.serverThreads* > - Check how stable your envi

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Mich Talebzadeh
, it could result in failed connections. *Adjust spark.shuffle.io.clientThreads* - It might be prudent to do the same to *spark.shuffle.io.serverThreads* - Check how stable your environment is. Observe any issues reported in Spark UI HTH Mich Talebzadeh, Solutions Architect/Engineering Lead

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
trust that you are familiar with the concept of shuffle in Spark. > Spark Shuffle is an expensive operation since it involves the following > >- > >Disk I/O >- > >Involves data serialization and deserialization >- > >Network I/O > > Bas

Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Mich Talebzadeh
Hi, These two threads that you sent seem to be duplicates of each other? Anyhow I trust that you are familiar with the concept of shuffle in Spark. Spark Shuffle is an expensive operation since it involves the following - Disk I/O - Involves data serialization and deserialization

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.

RE: Re: Spark Vulnerabilities

2023-08-18 Thread Sankavi Nagalingam
s back. Thanks, Sankavi From: Bjørn Jørgensen Sent: Monday, August 14, 2023 6:11 PM To: Sankavi Nagalingam Cc: user@spark.apache.org; Vijaya Kumar Mathupaiyan Subject: [EXT MSG] Re: Spark Vulnerabilities EXTERNAL source. Be CAREFUL with links / attachments I have added links to the github

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. So the broadcast DF size seems to be between 1 and 4GB. So I suggest that you leave it as it is. I have not used the standalone mode since spark-2.4.3 so I may be missing a fair bit of context here. I am sure there are others like you that are still using it! HTH Mich

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
from > such loss, damage or destruction. > > > > > On Thu, 17 Aug 2023 at 21:01, Patrick Tucci > wrote: > >> Hi Mich, >> >> Here are my config values from spark-defaults.conf: >> >> spark.eventLog.enabled true >> spark.eventLog.dir hdfs:/

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
ny monetary damages arising from such loss, damage or destruction. On Thu, 17 Aug 2023 at 21:01, Patrick Tucci wrote: > Hi Mich, > > Here are my config values from spark-defaults.conf: > > spark.eventLog.enabled true > spark.eventLog.dir hdfs://10.0.50.1:8020/spark-

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Here are my config values from spark-defaults.conf: spark.eventLog.enabled true spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hello Patrick, As a matter of interest, what parameters and their respective values do you use in spark-submit? I assume it is running in YARN mode. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/m

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Yes, that's the sequence of events. I think the big breakthrough is that (for now at least) Spark is throwing errors instead of the queries hanging. Which is a big step forward. I can at least troubleshoot issues if I know what they are. When I reflect on the issues I faced an

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hi Patrick, glad that you have managed to sort this problem out. Hopefully it will go away for good. Still we are in the dark about how this problem is going away and coming back :( As I recall the chronology of events was as follows: 1. The Issue with hanging Spark job reported 2

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Everyone, I just wanted to follow up on this issue. This issue has continued since our last correspondence. Today I had a query hang and couldn't resolve the issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After doing so, instead of the query hanging, I got an error me

Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493 Thanks, Cheng Pan > On Aug 14, 2023, at 16:50, Sankavi Nagalingam > wrote: > > Hi Team, > We could see there are many dependent vulnerabilities present in the latest > spark-core:3.4.

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstrea

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen
I have added links to the github PR. Or comment for those that I have not seen before. Apache Spark has very many dependencies, some can easily be upgraded while others are very hard to fix. Please feel free to open a PR if you wanna help. On Mon, 14 Aug 2023 at 14:06, Sankavi Nagalingam wrote

Spark Vulnerabilities

2023-08-14 Thread Sankavi Nagalingam
Hi Team, We could see there are many dependent vulnerabilities present in the latest spark-core:3.4.1.jar. PFA. Could you please let us know when a fixed version will be available for users? Thanks, Sankavi The information in this e-mail and any attachments is confidential and may be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
r install an additional Java version, I attempted to use the > latest alpha as well. This appears to have worked, although I couldn't > figure out how to get it to use the metastore_db from Spark. > > After turning my attention back to Spark, I determined the issue. After > much trouble

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
tions suggest it might be a Java incompatibility issue. Since I didn't want to downgrade or install an additional Java version, I attempted to use the latest alpha as well. This appears to have worked, although I couldn't figure out how to get it to use the metastore_db from Spark. After turni

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
to migrate >>> to Delta Lake and see if that solves the issue. >>> >>> Thanks again for your feedback. >>> >>> Patrick >>> >>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh < >>> mich.talebza...@gmail.com> wrote: >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> Hi Patrick, >>> >>> There is not anything wrong with Hive On-premise it is the best data >>> warehouse there is >>> >>> Hive handles bo

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
..@gmail.com> wrote: > >> Hi Patrick, >> >> There is not anything wrong with Hive On-premise it is the best data >> warehouse there is >> >> Hive handles both ORC and Parquet formats well. They are both columnar >> implementations of relational mod

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
both ORC and Parquet formats well. They are both columnar > implementations of relational model. What you are seeing is the Spark API > to Hive which prefers Parquet. I found out a few years ago. > > From your point of view I suggest you stick to parquet format with Hive > specific t

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick, There is nothing wrong with Hive. On-premise it is the best data warehouse there is. Hive handles both ORC and Parquet formats well. They are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet. I found out a few

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich. Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark and the host serve

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. This limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its database persistence. *However it is limited to only one Spark session at any time for the purposes

Re: Spark Connect, Master, and Workers

2023-08-10 Thread Brian Huynh
Hi Kezhi, Yes, you no longer need to start a master to make the client work. Please see the quickstart. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html You can think of Spark Connect as an API on top of Master so workers can be added to the cluster same

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, I don't believe Hive is installed. I set up this cluster from scratch. I installed Hadoop and Spark by downloading them from their project websites. If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm running the Thrift server distributed

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-10 Thread Mich Talebzadeh
Hi Mark, I created a spark3.4.1 docker file. Details from spark-py-3.4.1-scala_2.12-11-jre-slim-buster <https://hub.docker.com/repository/docker/michtalebzadeh/spark_dockerfiles/tags?page=1&ordering=last_updated> Pull instructions are given docker pull michtalebzadeh/spark_dockerfile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
damages arising from > such loss, damage or destruction. > > > > > On Thu, 10 Aug 2023 at 20:02, Patrick Tucci > wrote: > >> Hi Mich, >> >> Thanks for the reply. Unfortunately I don't have Hive set up on my >> cluster. I can explore this if th

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
my > cluster. I can explore this if there are no other ways to troubleshoot. > > I'm using beeline to run commands against the Thrift server. Here's the > command I use: > > ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f > command.sql > > Thanks

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, Thanks for the reply. Unfortunately I don't have Hive set up on my cluster. I can explore this if there are no other ways to troubleshoot. I'm using beeline to run commands against the Thrift server. Here's the command I use: ~/spark/bin/beeline -u jdbc:hive2://10.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
mail's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Thu, 10 Aug 2023 at 18:39, Patrick Tucci wrote: > Hello, > > I'm attempting to run a query on Spar

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON

Spark Connect, Master, and Workers

2023-08-09 Thread Kezhi Xiong
Hi, I've recently been learning Spark Connect but have some questions regarding the connect server's relation with master or workers: so when I'm using the connect server, I don't have to start a master alongside to make clients work. Is the connect server simply using "local[

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mich Talebzadeh
Hi Mark, you can build it yourself, no big deal :) REPOSITORY TAG IMAGE ID CREATED SIZE sparkpy/spark-py 3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile a876102b2206 1 second ago

dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello, I noticed that the apache/spark-py image for Spark's 3.4.1 release is not available (apache/spark@3.4.1 is available). Would it be possible to get the 3.4.1 release build for the apache/spark-py image published? Thanks, Mark -- This communication, together wit

Re: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe From: Mich Talebzadeh Sent: Tuesday, August 8, 2023 4:43 PM To: user @spark Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance I am currently pondering and sharing my thoughts openly. Given our reliance on gathered

Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Mich Talebzadeh
I am currently pondering and sharing my thoughts openly. Given our reliance on gathered statistics, it prompts the question of whether we could integrate specific machine learning components into Spark Structured Streaming. Consider a scenario where we aim to adjust configuration values on the fly

Spark 3.4.1 with Java 11 performance on k8s serverless/autopilot

2023-08-07 Thread Mich Talebzadeh
Hi, I would like to share my experience of Spark 3.4.1 running on k8s autopilot, which some refer to as serverless. My current experience is on Google GKE autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>. So essentially you specify the name and region a

Re: convert pandas image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representati
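The suggestion above (store the image as flat bytes plus its dimensions, then reshape on use) might look like the following sketch. It is written in plain Python so it runs without extra packages; the function names are hypothetical, and pixel values are assumed to be integers in the range 0 to 255. With NumPy, `ndarray.tobytes()` and `numpy.frombuffer(...).reshape(...)` would play the same roles.

```python
def flatten_image(pixels):
    """Flatten a height x width x channels nested list into (bytes, height, width, channels)."""
    height, width, channels = len(pixels), len(pixels[0]), len(pixels[0][0])
    flat = bytes(value for row in pixels for pixel in row for value in pixel)
    return flat, height, width, channels


def reshape_image(data, height, width, channels):
    """Rebuild the nested list from the flat bytes and the stored dimensions."""
    if len(data) != height * width * channels:
        raise ValueError("byte length does not match height * width * channels")
    row_size = width * channels
    return [
        [
            list(data[r * row_size + c * channels : r * row_size + (c + 1) * channels])
            for c in range(width)
        ]
        for r in range(height)
    ]
```

Each row of the DataFrame would then carry four simple columns (binary data, height, width, channels) in place of a triply nested array.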

Re: convert pandas image column to spark dataframe

2023-08-03 Thread second_co...@yahoo.com.INVALID
Hello Adrian, here is the snippet: import tensorflow_datasets as tfds (ds_train, ds_test), ds_info = tfds.load( dataset_name, data_dir='', split=["train", "test"], with_info=True, as_supervised=True ) schema = StructType([ StructField("image", ArrayType(ArrayType(ArrayType(Integer

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
will cover me as I plan to be out of the > office soon) > > Hi Kent and Sean, > > Nice to meet you. I am working on the OSS legal aspects with Pavan who is > planning to make the contribution request to the Spark project. I saw that > Sean mentioned in his email that the contribu

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Rinat Shangeeta
(Adding my manager Eugene Kim who will cover me as I plan to be out of the office soon) Hi Kent and Sean, Nice to meet you. I am working on the OSS legal aspects with Pavan who is planning to make the contribution request to the Spark project. I saw that Sean mentioned in his email that the

Re: convert pandas image column to spark dataframe

2023-08-03 Thread Adrian Pop-Tifrea
Hello, can you also please show us how you created the pandas dataframe? I mean, how you added the actual data into the dataframe. It would help us for reproducing the error. Thank you, Pop-Tifrea Adrian On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com < second_co...@yahoo.com> wrote: > i

Custom Session Windowing in Spark using Scala/Python

2023-08-03 Thread Ravi Teja
Hi, I am new to Spark and looking for help regarding the session windowing <https://spark.apache.org/docs/3.4.1/structured-streaming-programming-guide.html#types-of-time-windows> in Spark. I want to create session windows on a user activity stream with a gap duration of `x` minutes and als
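To make the gap-based semantics in the question concrete, here is a small plain-Python sketch for a single user's event times. It is not Spark's implementation: the function name `sessionize` is hypothetical, and treating an event that arrives exactly `gap_seconds` after the previous one as the start of a new session is an assumption. In Spark 3.2 and later, `pyspark.sql.functions.session_window` is the built-in counterpart.

```python
def sessionize(timestamps, gap_seconds):
    """Group event times (in seconds) into sessions split wherever the gap reaches gap_seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap_seconds:
            # Still within the gap of the latest event: extend the current session.
            sessions[-1].append(ts)
        else:
            # First event, or the gap has elapsed: start a new session.
            sessions.append([ts])
    return sessions
```

For example, with a 300-second gap the events `[0, 60, 100, 500, 520, 2000]` fall into three sessions, because 500 arrives 400 seconds after 100 and 2000 arrives long after 520.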

Re: convert pandas image column to spark dataframe

2023-07-31 Thread second_co...@yahoo.com.INVALID
I changed to ArrayType(ArrayType(ArrayType(IntegerType( , but still get the same error. Thank you for responding On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea wrote: Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images? Becaus

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
ok so as expected the underlying database is Hive. Hive uses hdfs storage. You said you encountered limitations on concurrent writes. The order and limitations are introduced by Hive metastore so to speak. Since this is all happening through Spark, by default implementation of the Hive metastore

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
4:28 PM Mich Talebzadeh > wrote: > >> It is not Spark SQL that throws the error. It is the underlying Database >> or layer that throws the error. >> >> Spark acts as an ETL tool. What is the underlying DB where the table >> resides? Is concurrency supported.

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
that will work better in different use cases according to the writing pattern, type of queries, data characteristics, etc. *Pol Santamaria* On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database > or layer that

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying Database or layer that throws the error. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list HTH Mich Talebzadeh, Solutions Architect

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert

Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes like Google Dataproc or AWS EC2 often utilise YARN resource manager. YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively. In Cloud it is also used widely in Infrastructure as a Service such as

The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Trường Trần Phan An
Hi all, I am learning about the performance difference of Spark when performing a JOIN problem on Serverless (K8S) and Serverful (Traditional server) environments. Through experiment, Spark on K8s tends to run slower than Serverful. Through understanding the architecture, I know that Spark runs

Re: convert pandas image column to spark dataframe

2023-07-27 Thread Adrian Pop-Tifrea
Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: > i h
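The reasoning behind "3 layers" can be sketched as follows: the number of `ArrayType` layers follows the shape of one row's value, not the shape of the whole dataset. The helper `array_ddl` is hypothetical, and using `int` as the element type is an assumption based on the `IntegerType` mentioned in the thread.

```python
def array_ddl(ndim, element="int"):
    """Return a Spark DDL type string with ndim nested array layers around element."""
    if ndim < 0:
        raise ValueError("ndim must be non-negative")
    return "array<" * ndim + element + ">" * ndim


# One image per row has shape (500, 333, 3): three dimensions, so three array layers.
per_row_shape = (500, 333, 3)
image_schema = "image " + array_ddl(len(per_row_shape))
```

The resulting string, `image array<array<array<int>>>`, is a DDL-formatted schema of the kind `spark.createDataFrame` should accept as its `schema` argument; the leading 10 in (10, 500, 333, 3) is the row count and adds no layer.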

convert pandas image column to spark dataframe

2023-07-27 Thread second_co...@yahoo.com.INVALID
I have a pandas dataframe with column 'image' using numpy.ndarray. Shape is (500, 333, 3) per image. My pandas dataframe has 10 rows, thus the shape is (10, 500, 333, 3). When using spark.createDataFrame(panda_dataframe, schema), I need to specify the schema: schema = StructType([ StructField(

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > >

spark context list_packages()

2023-07-26 Thread second_co...@yahoo.com.INVALID
I ran the following code spark.sparkContext.list_packages() on spark 3.4.1 and i get below error An error was encountered: AttributeError [Traceback (most recent call last): , File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 113, in exec self._exec

Re: Interested in contributing to SPARK-24815

2023-07-26 Thread Pavan Kotikalapudi
A with Twilio and consider > establishing that to govern contributions. > > > > On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi < > pkotikalap...@twilio.com.invalid> wrote: > >> > >> Hi Spark Dev, > >> > >> My name is Pavan Kotikalapudi,

Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
ributed to the project is assumed > to have been licensed per above already. > > It might be wise to review the CCLA with Twilio and consider establishing > that to govern contributions. > > On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi > wrote: >> >> Hi S

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
e wise to review the CCLA with Twilio and consider establishing that to govern contributions. On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi wrote: > Hi Spark Dev, > > My name is Pavan Kotikalapudi, I work at Twilio. > > I am looking to contribute to this spark issue > h

Fwd: Interested in contributing to SPARK-24815

2023-07-24 Thread Pavan Kotikalapudi
Hi Spark Dev, My name is Pavan Kotikalapudi, I work at Twilio. I am looking to contribute to this spark issue https://issues.apache.org/jira/browse/SPARK-24815. There is a clause from the company's OSS saying - The proposed contribution is about 100 lines of code modification in the

Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
personally I have not done it myself. CCed to spark user group if some user has tried it among users. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-p

Re: Unable to launch Spark connect on Docker image

2023-07-22 Thread Mich Talebzadeh
This is the downloaded docker? Try this with the added configuration options as below /opt/spark/sbin/start-connect-server.sh *--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" *--packages org.apache.spark:spark-connect_2.12:3.4.1 And you will get

Unable to launch Spark connect on Docker image

2023-07-21 Thread Edmondo Porcu
Hello, I am trying to launch Spark connect on Docker Image ❯ docker run -it apache/spark:3.4.1-scala2.12-java11-r-ubuntu /bin/bash spark@aa0a670f7433:/opt/spark/work-dir$ /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.1 starting

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Mich Talebzadeh
this link might help https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebza

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Dipayan Dev
t; partition updates(insert overwrite) daily for the last 30 days >>> (partitions). >>> The ETL inside the staging directories is completed in hardly 5minutes, >>> but then renaming takes a lot of time as it deletes and copies the >>> partitions. >>> My issue is somethi

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
rtitions). >> The ETL inside the staging directories is completed in hardly 5minutes, >> but then renaming takes a lot of time as it deletes and copies the >> partitions. >> My issue is something related to this - >> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhyt

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Mich Talebzadeh
tories is completed in hardly 5minutes, > but then renaming takes a lot of time as it deletes and copies the > partitions. > My issue is something related to this - > https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1 > > > > With Best Regards, > >

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
it deletes and copies the partitions. My issue is something related to this - https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1 With Best Regards, Dipayan Dev On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh wrote: > Spark has no role in creating that hive stag

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that Hive staging directory. That directory belongs to Hive, and Spark simply does ETL there, loading to the Hive managed table in your case, which ends up in the staging directory. I suggest that you review your design and use an external hive table with explicit location

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance but not significantly. I am just wondering, once Spark creates that staging directory along with the SUCCESS file, can we just do a gsutil rsync command and move these files to original directory? Anyone tried this approach or foresee any concern? On Mon, 17 Jul 2023

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
++ DEV community On Mon, Jul 17, 2023 at 4:14 PM Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project locally for my 1st > Open Source Contribution, by building and creating a pa

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team, I am still looking for guidance here. I would really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apach

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
ll take a long time to perform this step. One workaround will be > to create smaller number of larger files if that is possible from Spark and > if this is not possible then those configurations allow for configuring the > threadpool which does the metadata copy. > > You can go thr

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
ote the > MLlib-specific contribution guidelines section in particular. > > https://spark.apache.org/contributing.html > > Since you are looking for something to start with, take a look at this > Jira query for starter issues. > > > https://issues.apache.org/jira/browse/S

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
FileOutputCommitter v2 is supported in GCS, but the rename is a metadata copy and delete operation in GCS, so with a large number of files this step takes a long time. One workaround is to create a smaller number of larger files if that is possible from Spark and
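
The "fewer, larger files" workaround mentioned above can be sketched in PySpark as follows. This is an illustrative sketch under stated assumptions, not something taken from the thread: the helper names, the 128 MiB target size, and the Parquet output format are all hypothetical choices, and the caller is assumed to know the approximate output size in bytes.

```python
import math

# Assumed target size per output file; tune for your workload.
DEFAULT_TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=DEFAULT_TARGET_FILE_BYTES):
    """Return how many output files give roughly target_file_bytes each."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_with_fewer_files(df, output_path, total_bytes):
    """Write a Spark DataFrame with a reduced number of output files.

    Fewer files means fewer objects for the committer to copy and
    delete during the rename step on GCS.
    """
    num_files = target_file_count(total_bytes)
    # coalesce() reduces the partition count without a full shuffle,
    # so each remaining partition becomes one larger output file.
    df.coalesce(num_files).write.mode("overwrite").parquet(output_path)
```

`coalesce` avoids a shuffle but can leave partitions unevenly sized; `repartition` is the alternative when evenly sized files matter more than the shuffle cost.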

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
restingly, it took only 10 minutes to write the output in the staging > directory and rest of the time it took to rename the objects. Thats the > concern. > > Looks like a known issue as spark behaves with GCS but not getting any > workaround for this. > > > On Mon, 17 Jul
