Re: k8s+ YARN Spark

2023-08-21 Thread Mich Talebzadeh
Interesting. Spark supports the following cluster managers - Standalone: a cluster manager, limited in features, shipped with Spark. - Apache Hadoop YARN: the most widely used resource manager, not just for Spark but for other artefacts as well. On-premise, YARN is used extensively.

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Mich Talebzadeh
This should work; check your PATH. which pyspark should return /opt/spark/bin/pyspark. And your installation should contain: cd $SPARK_HOME /opt/spark> ls LICENSE NOTICE R README.md RELEASE bin conf data examples jars kubernetes licenses logs python sbin yarn You should

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
Nevermind I was doing something dumb On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens wrote: > Are there installation instructions for Spark 3.4.1? > > I defined SPARK_HOME as it describes here > > https://spark.apache.org/docs/latest/api/python/getting_started/install.html > > ls

k8s+ YARN Spark

2023-08-21 Thread Крюков Виталий Семенович
Good afternoon. Perhaps you will be discouraged by what I will write below, but nevertheless, I ask for help in solving my problem. Perhaps the architecture of our solution will not seem correct to you. There are backend services that communicate with a service that implements spark-driver.

Problem with spark 3.4.1 not finding spark java classes

2023-08-20 Thread Kal Stevens
Are there installation instructions for Spark 3.4.1? I defined SPARK_HOME as it describes here https://spark.apache.org/docs/latest/api/python/getting_started/install.html ls $SPARK_HOME/python/lib py4j-0.10.9.7-src.zip PY4J_LICENSE.txt pyspark.zip I am getting a class not found error
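For readers hitting the same class-not-found error: a common cause with a hand-downloaded Spark is that PYTHONPATH does not include the two zips under $SPARK_HOME/python/lib. A minimal shell sketch (the /opt/spark prefix is an assumption; adjust to your layout and py4j version):

```shell
# Hypothetical install prefix; adjust to where you unpacked Spark.
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
# PySpark needs both the pyspark sources and the py4j source zip on PYTHONPATH.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
echo "$PYTHONPATH"
```

With these set, `import pyspark` from a plain Python shell should resolve without spark-submit.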

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich, It's not specific to ORC, and looks like a bug from Hadoop Common project. I have raised a bug and am happy to contribute to Hadoop 3.3.0 version. Do you know if anyone could help me to set the Assignee? https://issues.apache.org/jira/browse/HADOOP-18856 With Best Regards, Dipayan Dev

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Mich Talebzadeh
Under gs directory "gs://test_dd1/abc/" What do you see? gsutil ls gs://test_dd1/abc and the same gs://test_dd1/ gsutil ls gs://test_dd1 I suspect you need a folder for multiple ORC slices! Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin

Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone, I'm stuck with one problem, where I need to provide a custom GCS location for the Hive table from Spark. The code fails while doing an *'insert into'* whenever my Hive table has a flat GCS location like gs://, but works for nested locations like gs://bucket_name/blob_name. Is anyone

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Nebi Aydin
Here's the executor logs ``` java.io.IOException: Connection from ip-172-31-16-143.ec2.internal/172.31.16.143:7337 closed at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146) at

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Mich Talebzadeh
That error message *FetchFailedException: Failed to connect to on port 7337 *happens when a task running on one executor node tries to fetch data from another executor node but fails to establish a connection to the specified port (7337 in this case). In a nutshell it is performing network IO

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
Hi, sorry for the duplicates. First time user :) I keep getting FetchFailedException (port 7337 closed), which is the external shuffle service port. I was trying to tune these parameters. I have around 1000 executors and 5000 cores. I tried to set spark.shuffle.io.serverThreads to 2k. Should I also set

Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Mich Talebzadeh
Hi, These two threads that you sent seem to be duplicates of each other? Anyhow I trust that you are familiar with the concept of shuffle in Spark. Spark Shuffle is an expensive operation since it involves the following - Disk I/O - Involves data serialization and deserialization

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.
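For context, these keys would be set like any other Spark conf; a spark-defaults.conf sketch with purely illustrative values (the numbers are assumptions, not recommendations):

```
# spark-defaults.conf (illustrative values only)
spark.shuffle.io.serverThreads   128
spark.shuffle.io.clientThreads   64
spark.rpc.io.serverThreads       64
spark.rpc.io.clientThreads       64
```

The shuffle.* pair sizes the Netty transport pools used for shuffle traffic, while the rpc.* pair covers the control-plane RPC transport; when unset, thread pools generally fall back to a core-count-based default.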

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.

[no subject]

2023-08-18 Thread Dipayan Dev
Unsubscribe -- With Best Regards, Dipayan Dev Author of *Deep Learning with Hadoop * M.Tech (AI), IISc, Bangalore

Re: read dataset from only one node in YARN cluster

2023-08-18 Thread Mich Talebzadeh
Hi, Where do you see this? In spark UI. So data is skewed most probably as one node gets all the data and others nothing as I understand? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

read dataset from only one node in YARN cluster

2023-08-18 Thread marc nicole
Hi, Spark 3.2, Hadoop 3.2, using YARN cluster mode, if one wants to read a dataset that is found in one node of the cluster and not in the others, how to tell Spark that? I expect through DataframeReader and using path like *IP:port/pathOnLocalNode* PS: loading the dataset in HDFS is not an

RE: Re: Spark Vulnerabilities

2023-08-18 Thread Sankavi Nagalingam
Hi @Bjørn Jørgensen, Thank you for your quick response. Based on the PR shared, we are doing analysis from our side. For a few jars you have requested the CVE id; I have updated it in the attached document. Kindly verify it from your side and revert back to us.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. So the broadcast DF size seems to be between 1 and 4GB. So I suggest that you leave it as it is. I have not used the standalone mode since spark-2.4.3 so I may be missing a fair bit of context here. I am sure there are others like you that are still using it! HTH Mich

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
No, the driver memory was not set explicitly. So it was likely the default value, which appears to be 1GB. On Thu, Aug 17, 2023, 16:49 Mich Talebzadeh wrote: > One question, what was the driver memory before setting it to 4G? Did you > have it set at all before? > > HTH > > Mich Talebzadeh, >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
One question, what was the driver memory before setting it to 4G? Did you have it set at all before? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Here are my config values from spark-defaults.conf: spark.eventLog.enabled true spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hello Patrick, As a matter of interest, what parameters and their respective values do you use in spark-submit? I assume it is running in YARN mode. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Yes, that's the sequence of events. I think the big breakthrough is that (for now at least) Spark is throwing errors instead of the queries hanging. Which is a big step forward. I can at least troubleshoot issues if I know what they are. When I reflect on the issues I faced and the

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hi Patrick, glad that you have managed to sort this problem out. Hopefully it will go away for good. Still, we are in the dark about how this problem is going away and coming back :( As I recall, the chronology of events was as follows: 1. The issue with the hanging Spark job reported 2.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Everyone, I just wanted to follow up on this issue. This issue has continued since our last correspondence. Today I had a query hang and couldn't resolve the issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After doing so, instead of the query hanging, I got an error message

Managing python modules in docker for PySpark?

2023-08-16 Thread Mich Talebzadeh
Hi, This is a bit of an old hat but worth getting opinions on it. Current options that I believe apply are: 1. Installing them individually via pip in the docker build process 2. Installing them together via pip in the build process via requirements.txt 3. Installing them to a
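Option 2 above can be sketched as a Dockerfile fragment (the base image tag and the non-root uid 185 are assumptions based on the official apache/spark-py images; adjust to your base):

```dockerfile
# Hypothetical base image; any Spark-compatible Python image works.
FROM apache/spark-py:v3.4.0

USER root
# A pinned requirements.txt in the build context keeps images reproducible.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Drop back to the non-root spark user (uid 185 in the official images).
USER 185
```

Baking dependencies into the image this way keeps all executors identical, which is usually the deciding factor versus per-job pip installs.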

Re: why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread XiDuo You
CoalesceShufflePartitions will merge small partitions into bigger ones. Say, if you set maxShuffledHashJoinLocalMapThreshold to 32MB but the advisoryPartitionSize is 64MB, then the final size of each reducer partition will be close to 64MB. That breaks the maxShuffledHashJoinLocalMapThreshold. So we
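The interaction described above can be illustrated with a toy model of partition coalescing (plain Python, not Spark internals; the greedy merge below is a simplification of what AQE does):

```python
def coalesce(sizes_mb, advisory_mb):
    """Greedily merge small shuffle partitions up to roughly advisory_mb,
    mimicking (loosely) AQE's CoalesceShufflePartitions."""
    merged, current = [], 0
    for s in sizes_mb:
        if current + s > advisory_mb and current > 0:
            merged.append(current)
            current = 0
        current += s
    if current:
        merged.append(current)
    return merged

# Sixteen 8 MB map outputs, advisory partition size 64 MB.
parts = coalesce([8] * 16, advisory_mb=64)
print(parts)                                  # [64, 64]
# A 32 MB shuffled-hash-join threshold is now violated by every partition.
print(all(p <= 32 for p in parts))            # False
```

This is why the threshold must be at least the advisory size: after coalescing, each reduce-side partition can grow to roughly the advisory size.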

why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread ??????
Dear community, I want to set maxShuffledHashJoinLocalMapThreshold to enable converting SMJ to hash join, but I found maxShuffledHashJoinLocalMapThreshold must be larger than advisoryPartitionSize. I want to know what happens if maxShuffledHashJoinLocalMapThreshold

Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493 Thanks, Cheng Pan > On Aug 14, 2023, at 16:50, Sankavi Nagalingam > wrote: > > Hi Team, > We could see there are many dependent vulnerabilities present in the latest > spark-core:3.4.1.jar. PFA > Could

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstream dependencies. But

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen
I have added links to the github PR, or a comment for those that I have not seen before. Apache Spark has very many dependencies; some can easily be upgraded while others are very hard to fix. Please feel free to open a PR if you want to help. On Mon, Aug 14, 2023 at 14:06, Sankavi Nagalingam wrote:

Spark Vulnerabilities

2023-08-14 Thread Sankavi Nagalingam
Hi Team, We could see there are many dependent vulnerabilities present in the latest spark-core:3.4.1.jar. PFA. Could you please let us know when the fix version will be available for the users. Thanks, Sankavi

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
OK, I use Hive 3.1.1. My suggestion is to put your Hive issues to u...@hive.apache.org, also for Java version compatibility. They will give you better info. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
I attempted to install Hive yesterday. The experience was similar to other attempts at installing Hive: it took a few hours and at the end of the process, I didn't have a working setup. The latest stable release would not run. I never discovered the cause, but similar StackOverflow questions

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
OK, you would not have known unless you went through the process, so to speak. Let us do something revolutionary here: install Hive and its metastore. You already have Hadoop anyway. https://cwiki.apache.org/confluence/display/hive/adminmanual+installation hive metastore

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise. Unfortunately after installing Delta Lake and re-writing all tables as Delta tables, the issue persists. On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh wrote: > ok sure. > > Is this Delta Lake going to be on-premise? > > Mich Talebzadeh, > Solutions Architect/Engineering

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ok sure. Is this Delta Lake going to be on-premise? Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm going to migrate to

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick, There is not anything wrong with Hive. On-premise, it is the best data warehouse there is. Hive handles both ORC and Parquet formats well. They are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet. I found out a few

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich. Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark and the host servers to make sure

Re: unsubscribe

2023-08-11 Thread Mich Talebzadeh
To unsubscribe e-mail: user-unsubscr...@spark.apache.org Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at

unsubscribe

2023-08-11 Thread Yifan LI
unsubscribe

Re: Extracting Logical Plan

2023-08-11 Thread Vibhatha Abeykoon
Hello Winston, I looked into the suggested code snippet. But I am getting the following error ``` value listenerManager is not a member of org.apache.spark.sql.SparkSession ``` Although I can see it is available in the API.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly; this limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its database persistence. *However, it is limited to only one Spark session at any time for the purposes
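For reference, moving off embedded Derby usually means pointing the metastore at a client-server database in hive-site.xml; a sketch with a hypothetical PostgreSQL host, database name, and user:

```xml
<!-- hive-site.xml sketch; host, database, and credentials are hypothetical -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://metastore-host:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
```

A client-server metastore database is what lifts the one-session-at-a-time restriction that embedded Derby imposes.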

Re: Spark Connect, Master, and Workers

2023-08-10 Thread Brian Huynh
Hi Kezhi, Yes, you no longer need to start a master to make the client work. Please see the quickstart. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html You can think of Spark Connect as an API on top of Master so workers can be added to the cluster same

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, I don't believe Hive is installed. I set up this cluster from scratch. I installed Hadoop and Spark by downloading them from their project websites. If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm running the Thrift server distributed with Spark, like so:

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-10 Thread Mich Talebzadeh
Hi Mark, I created a spark3.4.1 docker file. Details from spark-py-3.4.1-scala_2.12-11-jre-slim-buster Pull instructions are given docker pull

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
sorry host is 10.0.50.1 Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Hi Patrick, That beeline on port 1 is a Hive thrift server running on your Hive host 10.0.50.1:1. If you can access that host, you should be able to log into Hive by typing hive. The OS user is hadoop in your case, and it sounds like there is no password! Once inside that host, hive logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, Thanks for the reply. Unfortunately I don't have Hive set up on my cluster. I can explore this if there are no other ways to troubleshoot. I'm using beeline to run commands against the Thrift server. Here's the command I use: ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself? Are you using this command or similar for your thrift server? beeline -u jdbc:hive2:///1/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON ME.ID =

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
I pasted your text to chatgtp and this is what I got back Your problem arises due to how Apache Spark serializes Python objects to be used in Spark tasks. When a User-Defined Function (UDF) is defined, Spark uses Python's `pickle` library to serialize the Python function and any required objects
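A common remedy for this class of PickleException, sketched below, is to convert NumPy results to plain Python types (e.g. via .tolist()) before returning from the UDF. The pad_2d helper and its shapes are hypothetical, not the original poster's code:

```python
import numpy as np

def pad_2d(arr, target_rows):
    """Pad a 2D array with zero rows up to target_rows.

    Returns plain nested lists of Python ints, which Spark's pickler
    handles cleanly, instead of NumPy arrays/scalars, which it may not.
    """
    a = np.asarray(arr)
    padded = np.zeros((target_rows, a.shape[1]), dtype=a.dtype)
    padded[: a.shape[0]] = a
    return padded.tolist()   # .tolist() converts to native Python types

print(pad_2d([[1, 2], [3, 4]], 3))  # [[1, 2], [3, 4], [0, 0]]
```

Inside a UDF you would return pad_2d(...) rather than the raw NumPy array, keeping the serialization boundary free of NumPy objects.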

[PySpark][UDF][PickleException]

2023-08-10 Thread Sanket Sharma
Hi, I've been trying to debug a Spark UDF for a couple of days now but I can't seem to figure out what is going on. The UDF essentially pads a 2D array to a certain fixed length. When the code uses NumPy, it fails with a PickleException. When I rewrite it using plain Python, it works like a charm:

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
Yes, ls -l /tmp/app-submodules.zip and hdfs dfs -ls /tmp/app-submodules.zip can both show the file. On 2023/8/9 22:48, Mich Talebzadeh wrote: If you are running in cluster mode, that zip file should exist on all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead

Spark Connect, Master, and Workers

2023-08-09 Thread Kezhi Xiong
Hi, I'm recently learning Spark Connect but have some questions regarding the connect server's relation with the master or workers: when I'm using the connect server, I don't have to start a master alongside it to make clients work. Is the connect server simply using "local[*]" as master? Then, if I

unsubscribe

2023-08-09 Thread heri wijayanto
unsubscribe

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mich Talebzadeh
Hi Mark, you can build it yourself, no big deal :) REPOSITORY TAG IMAGE ID CREATED SIZE sparkpy/spark-py 3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile a876102b2206 1 second ago

dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello, I noticed that the apache/spark-py image for Spark's 3.4.1 release is not available (apache/spark@3.4.1 is available). Would it be possible to get the 3.4.1 release build for the apache/spark-py image published? Thanks, Mark

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread Mich Talebzadeh
If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

[PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
Hi, I am using Spark 3.4.1, running on YARN. Hadoop runs on a single-node in a pseudo-distributed mode. spark-submit --master yarn --deploy-mode cluster --py-files /tmp/app-submodules.zip app.py The YARN application ran successfully, but have a warning log message:

unsubscribe

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe

Re: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe From: Mich Talebzadeh Sent: Tuesday, August 8, 2023 4:43 PM To: user @spark Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance I am currently pondering and sharing my thoughts openly. Given our reliance on gathered

Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Mich Talebzadeh
I am currently pondering and sharing my thoughts openly. Given our reliance on gathered statistics, it prompts the question of whether we could integrate specific machine learning components into Spark Structured Streaming. Consider a scenario where we aim to adjust configuration values on the fly

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
So if you disable shuffle tracking but enable shuffle block decommissioning it should work from memory On Tue, Aug 8, 2023 at 4:13 AM Mich Talebzadeh wrote: > Hm. I don't think it will work > > --conf spark.dynamicAllocation.shuffleTracking.enabled=false > > In Spark 3.4.1 running spark in k8s
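The combination described would look roughly like this as spark-submit flags (property names from Spark 3.4; a sketch, not a tested recipe):

```
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.shuffleTracking.enabled=false
--conf spark.decommission.enabled=true
--conf spark.storage.decommission.enabled=true
--conf spark.storage.decommission.shuffleBlocks.enabled=true
```

With shuffle tracking disabled, decommissioning migrates shuffle blocks off an executor before it is removed, which is what allows downscaling without an external shuffle service.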

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Mich Talebzadeh
Hm. I don't think it will work --conf spark.dynamicAllocation.shuffleTracking.enabled=false In Spark 3.4.1 running spark in k8s you get : org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled" to false. On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh wrote: > Yes I have seen cases where the driver gone but a couple of executors > hanging on. Sounds like a code issue. > > HTH > > Mich Talebzadeh, > Solutions

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Mich Talebzadeh
Yes I have seen cases where the driver gone but a couple of executors hanging on. Sounds like a code issue. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Spark 3.41 with Java 11 performance on k8s serverless/autopilot

2023-08-07 Thread Mich Talebzadeh
Hi, I would like to share experience on Spark 3.4.1 running on k8s autopilot, or what some refer to as serverless. My current experience is on Google GKE Autopilot. So essentially you specify the name and region and CSP

Unsubscribe

2023-08-04 Thread heri wijayanto
Unsubscribe

Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representation works anyway
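The suggestion above (ship flat bytes plus shape, reshape on use) can be sketched with NumPy alone; the encode/decode helpers are illustrative and omit the Spark DataFrame step:

```python
import numpy as np

def encode(img):
    """Flatten an HxWxC image into bytes plus its shape metadata."""
    h, w, c = img.shape
    return {"data": img.astype(np.uint8).tobytes(),
            "height": h, "width": w, "channels": c}

def decode(row):
    """Rebuild the image from the flat bytes at the point of use."""
    return np.frombuffer(row["data"], dtype=np.uint8).reshape(
        row["height"], row["width"], row["channels"])

img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
assert np.array_equal(decode(encode(img)), img)  # lossless round trip
```

In Spark each dict would become one row (BinaryType for the bytes, IntegerType for the dimensions), which sidesteps the nested-ArrayType schema problems discussed in this thread.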

Unsubscribe

2023-08-03 Thread Denys Cherepanin
Unsubscribe

Re: conver panda image column to spark dataframe

2023-08-03 Thread second_co...@yahoo.com.INVALID
Hello Adrian,    here is the snippet import tensorflow_datasets as tfds (ds_train, ds_test), ds_info = tfds.load(     dataset_name, data_dir='',  split=["train", "test"], with_info=True, as_supervised=True ) schema = StructType([     StructField("image",

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
Formally, an ICLA is required, and you can read more here: https://www.apache.org/licenses/contributor-agreements.html In practice, it's unrealistic to collect and verify an ICLA for every PR contributed by 1000s of people. We have not gated on that. But, contributions are in all cases governed

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Rinat Shangeeta
(Adding my manager Eugene Kim who will cover me as I plan to be out of the office soon) Hi Kent and Sean, Nice to meet you. I am working on the OSS legal aspects with Pavan who is planning to make the contribution request to the Spark project. I saw that Sean mentioned in his email that the

Re: conver panda image column to spark dataframe

2023-08-03 Thread Adrian Pop-Tifrea
Hello, can you also please show us how you created the pandas dataframe? I mean, how you added the actual data into the dataframe. It would help us for reproducing the error. Thank you, Pop-Tifrea Adrian On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com < second_co...@yahoo.com> wrote: >

Custom Session Windowing in Spark using Scala/Python

2023-08-03 Thread Ravi Teja
Hi, I am new to Spark and looking for help regarding the session windowing in Spark. I want to create session windows on a user activity stream with a gap duration of `x` minutes and also have

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hello Winston, Thanks again for this response, I will check this one out. On Wed, Aug 2, 2023 at 3:50 PM Winston Lai wrote: > > Hi Vibhatha, > > I helped you post this question to another community. There is one answer > by someone else for your reference. > > To access the logical plan or

Re: Extracting Logical Plan

2023-08-02 Thread Winston Lai
Hi Vibhatha, I helped you post this question to another community. There is one answer by someone else for your reference. To access the logical plan or optimized plan, you can register a custom QueryExecutionListener and retrieve the plans during the query execution process. Here's an

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
I understand. I sort of drew the same conclusion. But I wasn’t sure. Thanks everyone for taking time on this. On Wed, Aug 2, 2023 at 2:29 PM Ruifeng Zheng wrote: > In Spark Connect, I think the only API to show optimized plan is > `df.explain("extended")` as Winston mentioned, but it is not a

Re: Extracting Logical Plan

2023-08-02 Thread Ruifeng Zheng
In Spark Connect, I think the only API to show optimized plan is `df.explain("extended")` as Winston mentioned, but it is not a LogicalPlan object. On Wed, Aug 2, 2023 at 4:36 PM Vibhatha Abeykoon wrote: > Hello Ruifeng, > > Thank you for these pointers. Would it be different if I use the Spark

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hello Ruifeng, Thank you for these pointers. Would it be different if I use the Spark connect? I am not using the regular SparkSession. I am pretty new to these APIs. Appreciate your thoughts. On Wed, Aug 2, 2023 at 2:00 PM Ruifeng Zheng wrote: > Hi Vibhatha, >I think those APIs are still

Re: Extracting Logical Plan

2023-08-02 Thread Ruifeng Zheng
Hi Vibhatha, I think those APIs are still available? ``` Welcome to Spark (spark-shell welcome banner) version 3.4.1 Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19) Type in

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hi Winston, I need to use the LogicalPlan object and process it with another function I have written. In earlier Spark versions we can access that via the dataframe object. So if it can be accessed via the UI, is there an API to access the object? On Wed, Aug 2, 2023 at 1:24 PM Winston Lai

Re: Extracting Logical Plan

2023-08-02 Thread Winston Lai
Hi Vibhatha, How about reading the logical plan from Spark UI, do you have access to the Spark UI? I am not sure what infra you run your Spark jobs on. Usually you should be able to view the logical and physical plan under Spark UI in text version at least. It is independent from the language

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hi Winston, I am looking for a way to access the LogicalPlan object in Scala. Not sure if explain function would serve the purpose. On Wed, Aug 2, 2023 at 9:14 AM Winston Lai wrote: > Hi Vibhatha, > > Have you tried pyspark.sql.DataFrame.explain — PySpark 3.4.1 > documentation (apache.org) >

Unsubscribe

2023-08-01 Thread Zoran Jeremic
Unsubscribe

Re: Extracting Logical Plan

2023-08-01 Thread Winston Lai
Hi Vibhatha, Have you tried pyspark.sql.DataFrame.explain (PySpark 3.4.1 documentation, apache.org) before? I am not sure what infra you have; you can

Extracting Logical Plan

2023-08-01 Thread Vibhatha Abeykoon
Hello, I recently upgraded the Spark version to 3.4.1 and I have encountered a few issues. In my previous code, I was able to extract the logical plan using `df.queryExecution` (df: DataFrame and in Scala), but it seems like in the latest API it is not supported. Is there a way to extract the

Unsubscribe

2023-08-01 Thread Alex Landa
Unsubscribe

Re: conver panda image column to spark dataframe

2023-07-31 Thread second_co...@yahoo.com.INVALID
I changed to ArrayType(ArrayType(ArrayType(IntegerType()))), still I get the same error. Thank you for responding. On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea wrote: Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images?

Unsubscribe

2023-07-31 Thread Ali Bajwa
Unsubscribe

Unsubscribe

2023-07-30 Thread
Unsubscribe, thanks! Guo. Wishing you smooth work and every success.

Unsubscribe

2023-07-30 Thread Aayush Ostwal
Unsubscribe *Thanks,Aayush Ostwal*

Unsubscribe

2023-07-30 Thread Parag Chaudhari
Unsubscribe *Thanks,Parag Chaudhari*

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
ok so as expected the underlying database is Hive. Hive uses hdfs storage. You said you encountered limitations on concurrent writes. The order and limitations are introduced by Hive metastore so to speak. Since this is all happening through Spark, by default implementation of the Hive metastore

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol, Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster restarted so I lost the stack trace in the application UI. In the snippets I saved, it looks like the exception being thrown was from Hive. Given the feedback you've provided, I suspect the issue is with how

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick, You can have multiple writers simultaneously writing to the same table in HDFS by utilizing an open table format with concurrency control. Several formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast Format, offer this capability. All of them provide advanced features
