Re: Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Wenchen Fan
Hi Tomas, thanks for reporting this bug! Is it possible to share your dataset so that other people can reproduce and debug it?

Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Tomas Bartalos
When I try to do a Broadcast Hash Join on a bigger table (6 million rows) I get an incorrect result of 0 rows. val rightDF = spark.read.format("parquet").load("table-a") val leftDF = spark.read.format("parquet").load("table-b") //needed to activate dynamic pruning subquery .where('part_ts ===
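For readers unfamiliar with the mechanics being debugged here: a broadcast hash join builds an in-memory hash table from the smaller side and probes it with rows of the larger side, while dynamic partition pruning uses the join keys from the broadcast side to skip partitions of the larger table that cannot match. A minimal plain-Python sketch of the idea (the tables and columns below are hypothetical, not from the report):

```python
from collections import defaultdict

def broadcast_hash_join(small_rows, large_partitions, key):
    """Join a small (broadcast) table against a partitioned large table,
    skipping large-table partitions whose partition key never appears on
    the broadcast side (the idea behind dynamic partition pruning)."""
    # Build phase: hash the broadcast side by join key.
    hash_table = defaultdict(list)
    for row in small_rows:
        hash_table[row[key]].append(row)

    # Dynamic partition pruning: only scan partitions whose partition
    # value occurs among the broadcast-side keys.
    matching = set(hash_table)
    results = []
    for part_value, rows in large_partitions.items():
        if part_value not in matching:
            continue  # partition skipped entirely, never read
        # Probe phase: look up each large-side row in the hash table.
        for row in rows:
            for match in hash_table[row[key]]:
                results.append({**match, **row})
    return results

small = [{"part_ts": 1, "name": "a"}, {"part_ts": 2, "name": "b"}]
large = {
    1: [{"part_ts": 1, "val": 10}],
    3: [{"part_ts": 3, "val": 30}],  # pruned: 3 never occurs on the broadcast side
}
print(broadcast_hash_join(small, large, "part_ts"))
```

A wrong-results bug like the one reported would correspond to the pruning step dropping partitions that should have matched.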

How to adapt PySpark to optimize handling of large no. of partitions?

2021-04-07 Thread iashiq5
https://stackoverflow.com/questions/66991825/how-to-adapt-pyspark-to-optimize-handling-of-large-no-of-partitions Can someone help me with this question?

Re: Apache ML Algorithm Solution

2021-04-07 Thread Mich Talebzadeh
LOL. Yes indeed. The previous one was from Lloyds Banking Group and this one is from TCS, which provides services to the same company. Just slightly different wording but the same content.

Re: Apache ML Algorithm Solution

2021-04-07 Thread Sean Owen
I think this question was asked just a week ago? Same company and setup. https://mail-archives.apache.org/mod_mbox/spark-user/202104.mbox/%3CLNXP123MB2604758548BE38E8D3F369EC8A7B9%40LNXP123MB2604.GBRP123.PROD.OUTLOOK.COM%3E

Re: Apache ML Algorithm Solution

2021-04-07 Thread Adi Polak
Hi Anupama, a couple of questions: - Where are you running your PySpark application? How many executors do you have available, and how much do they use? - What is the data format and actual size in MB/GB/PB? - Did you see any failures in the Spark History Server? As a distributed computing engine,

Apache ML Algorithm Solution

2021-04-07 Thread SRITHALAM, ANUPAMA (Risk Value Stream)
Hi Team, We are trying to use the Gradient Boosting Classification algorithm; in Python we tried using the sklearn library and in PySpark we are using the ML library. We have a training dataset of around 45k records, and training on that dataset is taking around 3 to 4 hours in python

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Mich Talebzadeh
Hi Amit, Many thanks for your suggestion. My problem is that I am using PySpark in this particular case, so there is no SBT or Maven build, as there would be with Scala, to produce an uber jar with shading. So regrettably the only way I could resolve the problem was by adding the jar file in

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Amit Joshi
Hi Mich, If I correctly understood your problem, it is that the spark-kafka jar is shadowed by the installed kafka client jar at run time. I had been in that place earlier. I can recommend resolving the issue using the shade plugin. The example I am pasting here works for pom.xml. I am very sure
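For anyone looking for the general shape of the approach being recommended here, a typical maven-shade-plugin section relocates the conflicting package so the bundled copy cannot be shadowed by the cluster's jars. The fragment below is an illustrative sketch, not Amit's exact configuration:

```xml
<!-- pom.xml fragment: shade and relocate kafka-clients so the uber jar's
     copy is used instead of whatever version ships on the cluster -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.kafka</pattern>
            <shadedPattern>shaded.org.apache.kafka</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Relocation rewrites the bytecode references inside the uber jar, so the application loads its own copy of the classes regardless of what is on the cluster classpath.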

Re: Mesos + Spark users going forward?

2021-04-07 Thread dmcwhorter
We are using the Mesos integration at Premier (https://www.premierinc.com/). Obviously, with the move to the attic, we will likely move away from Mesos in the future. I think deprecating the Mesos integration makes sense. We would probably continue to utilize the Spark Mesos components for

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Mich Talebzadeh
Did some tests. The concern is a Spark Structured Streaming (SSS) job running under YARN. Scenario 1: use spark-sql-kafka-0-10_2.12-3.1.0.jar - Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from anywhere on the CLASSPATH, including $SPARK_HOME/jars - Added the said jar file to spark-submit in client mode (the only mode
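As a point of reference for the scenarios being tested, passing the connector explicitly at submit time (rather than installing it under $SPARK_HOME/jars) looks roughly like the following; paths, file names, and the choice of deploy mode are illustrative:

```shell
# Option 1: let Spark resolve the connector and its transitive
# dependencies (kafka-clients, commons-pool2) from Maven Central
spark-submit \
  --master yarn --deploy-mode client \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  my_streaming_job.py

# Option 2: ship an explicit local jar instead; transitive
# dependencies must then be supplied the same way
spark-submit \
  --master yarn --deploy-mode client \
  --jars /path/to/spark-sql-kafka-0-10_2.12-3.1.1.jar \
  my_streaming_job.py
```

The practical difference is that --packages pulls the whole dependency tree, while --jars ships only the files listed, which is one way version mismatches like the ones in this thread arise.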

RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Oh, Tzahi, I misread the metrics in the first reply. It’s about reads indeed, not writes.

Re: Invite Spark community as Pulsar Summit NA 2021 Community Partner

2021-04-07 Thread Dianjin Wang
Hi everyone, I would like to follow up on my previous email to see if the Spark community is interested in being a community partner of Pulsar Summit NA 2021. We would love to see your support for the Pulsar community and conference. Our Summit is now open for registration; you are welcome to have a

Re: Spark performance over S3

2021-04-07 Thread Tzahi File
Hi Hariharan, Thanks for your reply. In both cases we are writing the data to S3. The difference is that in the first case we read the data from S3 and in the second we read from HDFS. We are using the ListObjectsV2 API in S3A. The S3 bucket and

Re: Mesos + Spark users going forward?

2021-04-07 Thread Mridul Muralidharan
Unfortunate about Mesos; +1 on deprecation of the Mesos integration. Regards, Mridul

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Gabor Somogyi
+1 on Sean's opinion.

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Sean Owen
You shouldn't be modifying your cluster install. You may at this point have conflicting, excess JARs in there somewhere. I'd start it over if you can.

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Gabor Somogyi
Not sure what you mean by not working. You've added 3.1.1 to --packages, which uses: * 2.6.0 kafka-clients: https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L136 * 2.6.2 commons pool:
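When the versions listed above need to be checked against what is actually present on a cluster, inspecting the jars directly can settle it. A quick sketch, assuming typical locations (paths and the jar name are illustrative):

```shell
# Find every kafka-clients copy Spark might load, including ones
# resolved by --packages into the local ivy cache
find "$SPARK_HOME/jars" ~/.ivy2/jars -name 'kafka-clients*.jar' 2>/dev/null

# Read the version baked into a jar's manifest
unzip -p kafka-clients-2.6.0.jar META-INF/MANIFEST.MF | grep -i version
```

If more than one copy turns up, the classpath ordering decides which wins, which matches the shadowing symptoms described earlier in the thread.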

Mesos + Spark users going forward?

2021-04-07 Thread Sean Owen
I noted that Apache Mesos is moving to the attic, so won't be actively developed soon: https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E That doesn't mean people will stop using it as a Spark resource manager soon. But it

Re: Spark performance over S3

2021-04-07 Thread Vladimir Prus
A VPC endpoint can also make a major difference in costs. Without it, access to S3 incurs data transfer and NAT costs, and these can be large.

Re: Spark performance over S3

2021-04-07 Thread Hariharan
Hi Tzahi, Comparing the first two cases: - > reads the parquet files from S3 and also writes to S3, it takes 22 min - > reads the parquet files from S3 and writes to its local hdfs, it takes the same amount of time (±22 min) It looks like most of the time is being spent in reading, and the time
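If the read path is indeed the bottleneck, the usual knobs to examine are the S3A input policy and connection settings. A sketch of Hadoop configuration passed through Spark; the values are illustrative starting points, not tuned recommendations:

```properties
# spark-defaults.conf (or --conf) fragment; values are illustrative
# 'random' suits seek-heavy columnar reads such as Parquet/ORC
spark.hadoop.fs.s3a.experimental.input.fadvise  random
spark.hadoop.fs.s3a.readahead.range             1M
spark.hadoop.fs.s3a.connection.maximum          200
spark.hadoop.fs.s3a.threads.max                 64
```

The fadvise policy matters most for Parquet: a sequential policy forces full-stream reads that are wasteful when the reader seeks between column chunks.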

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Mich Talebzadeh
Hi Gabor et al., To be honest I am not convinced this package (--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1) is really working! I know for certain that spark-sql-kafka-0-10_2.12-3.1.0.jar works fine. I reported the package working before because under $SPARK_HOME/jars on all nodes

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Gabor Somogyi
Good to hear it's working. Happy Spark usage. G On Tue, 6 Apr 2021, 21:56 Mich Talebzadeh wrote: > OK, we found out the root cause of this issue. We were writing to Redis from Spark and downloaded a recently compiled version of the Redis jar built with Scala 2.12.

Data Lakes using Spark

2021-04-07 Thread Boris Litvak
Hi Friends, I’d like to publish a document to Medium about data lakes using Spark. Its latter parts include info that is not widely known, unless you have experience with data lakes.

RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Hi Tzahi, I don’t know the reasons for that, though I’d check that the fs.s3a implementation is using multipart uploads, which I assume it does. I would say that none of the comments in the link are relevant to you, as the VPC endpoint is more of a security feature than a performance one. I