Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-09 Thread Russell Spitzer
The 2.4.3 binary is out now, and they did change back to Scala 2.11. https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz On Mon, May 6, 2019 at 9:21 PM Russell Spitzer wrote: > Spark 2.4.2 was incorrectly released with the default package binaries set > to Scala 2.12 >
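
For anyone hitting the NoClassDefFoundError, a minimal build.sbt sketch of pinning everything to the _2.11 artifacts (the version numbers are illustrative, not prescriptive):

    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      // %% appends the Scala binary version, selecting the _2.11 artifacts
      "org.apache.spark"   %% "spark-sql"                 % "2.4.3" % "provided",
      "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.1"
    )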

Spark not doing a broadcast join inspite of the table being well below spark.sql.autoBroadcastJoinThreshold

2019-05-09 Thread V0lleyBallJunki3
I have a small table, well below 50 MB, that I want to broadcast join with a larger table. However, even if I set spark.sql.autoBroadcastJoinThreshold to 100 MB, Spark still decides to do a SortMergeJoin instead of a broadcast join. I have to set an explicit broadcast hint on the table for it to do the
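
For reference, a minimal Scala sketch of the explicit hint (assuming DataFrames named large and small joined on a column named id):

    import org.apache.spark.sql.functions.broadcast

    // The hint forces the small side to be broadcast, regardless of
    // spark.sql.autoBroadcastJoinThreshold and the planner's size estimates.
    val joined = large.join(broadcast(small), "id")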

Question about SaveMode.Ignore behaviour

2019-05-09 Thread Juho Autio
Does Spark handle 'ignore' mode at the file level or the partition level? My code is like this: df.write \ .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \ .mode('ignore') \ .partitionBy('p') \ .orc(target_path) When I used mode('append') my job
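
For readability, the write from the question as a multi-line PySpark sketch (df and target_path are assumed to be in scope):

    df.write \
        .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
        .mode('ignore') \
        .partitionBy('p') \
        .orc(target_path)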

Spark foreach retry

2019-05-09 Thread hemant singh
Hi, I want to know what happens if foreach fails for some record. Does foreach retry like any general task, i.e. up to 4 times? Say I am pushing some payload to an API; if it fails for some record, will that record get retried, or is it bypassed while the rest of the records are processed? Thanks, Hemant
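
Spark's own retries happen at the task level (spark.task.maxFailures, 4 by default), and a failed task is rerun as a whole, so per-record retries have to be done by hand. A minimal Scala sketch, where rdd and pushToApi are hypothetical stand-ins:

    import scala.util.{Failure, Success, Try}

    // Retry a block up to `attempts` times, rethrowing the last failure.
    def withRetry[T](attempts: Int)(f: => T): T = Try(f) match {
      case Success(v)                 => v
      case Failure(_) if attempts > 1 => withRetry(attempts - 1)(f)
      case Failure(e)                 => throw e
    }

    rdd.foreach { record =>
      withRetry(3)(pushToApi(record)) // pushToApi is a placeholder API call
    }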

How to fix ClosedChannelException

2019-05-09 Thread u9g
Hey, When I run Spark on Alluxio, I encounter the following error. How can I fix this? Thanks Lost task 63.0 in stage 0.0 (TID 63, 172.28.172.165, executor 7): java.io.IOException: java.util.concurrent.ExecutionException: java.nio.channels.ClosedChannelException Best, Andy Li

How to configure alluxio cluster with spark in yarn

2019-05-09 Thread u9g
Hey, I want to speed up Spark tasks running in the YARN cluster through Alluxio. Is it recommended to run Alluxio in the same YARN cluster in YARN mode? Should I deploy Alluxio independently on the nodes of the YARN cluster, or deploy a separate cluster? Best, Andy Li

Re: The following Java MR code works for a small dataset but throws an ArrayIndexOutOfBounds error for a large dataset

2019-05-09 Thread Gerard Maas
Hi, I'm afraid you sent this email to the wrong mailing list. This is the Spark users mailing list. We could probably tell you how to do this with Spark, but I think that's not your intention :) kr, Gerard. On Thu, May 9, 2019 at 11:03 AM Balakumar iyer S wrote: > Hi All, > > I am trying to

The following Java MR code works for a small dataset but throws an ArrayIndexOutOfBounds error for a large dataset

2019-05-09 Thread Balakumar iyer S
Hi All, I am trying to read an ORC file and perform a groupBy operation on it, but when I run it on a large dataset we face the following error message. Input format of the input data: |178111256| 107125374| |178111256| 107148618| |178111256| 107175361| |178111256| 107189910| and we are
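
For the Spark route Gerard mentions in the reply above, a minimal Scala sketch (the column names c1 and c2 and the paths are assumptions, since the sample rows are unnamed):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_list

    val spark = SparkSession.builder().appName("orc-groupby").getOrCreate()

    // Read the ORC input and collect the second column per value of the first.
    spark.read.orc("/path/to/input")
      .groupBy("c1")
      .agg(collect_list("c2").as("c2_values"))
      .write.orc("/path/to/output")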

Re: Generic Dataset[T] Query

2019-05-09 Thread Ramandeep Singh Nanda
You need to supply a RowEncoder. Regards, Ramandeep Singh On Thu, May 9, 2019, 11:33 SNEHASISH DUTTA wrote: > Hi , > > I am trying to write a generic method which will return custom type > datasets as well as spark.sql.Row > > def read[T](params: Map[String, Any])(implicit encoder:
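
A minimal Scala sketch of that suggestion (the "format"/"path" params and the schema are placeholders):

    import org.apache.spark.sql.{Dataset, Encoder, Row}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // spark: SparkSession and params: Map[String, Any] are assumed in scope.
    def read[T](params: Map[String, Any])(implicit encoder: Encoder[T]): Dataset[T] =
      spark.read
        .format(params("format").toString)
        .load(params("path").toString)
        .as[T]

    // Custom types: the encoder is derived via import spark.implicits._
    // val people = read[Person](params)

    // Dataset[Row]: supply the encoder explicitly, built from the schema.
    val schema = StructType(Seq(StructField("name", StringType)))
    implicit val rowEncoder: Encoder[Row] = RowEncoder(schema)
    val rows: Dataset[Row] = read[Row](params)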

[ANNOUNCE] Announcing Apache Spark 2.4.3

2019-05-09 Thread Xiao Li
We are happy to announce the availability of Spark 2.4.3! Spark 2.4.3 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend that all 2.4 users upgrade to this stable release. Note that 2.4.3 switched the default Scala version back to Scala 2.11.

Re: Train ML models on each partition

2019-05-09 Thread Dillon Dukek
Hi Qian, The way I have gotten around this type of problem in the past is to do a groupBy on the dimensions you want to build a model for, and then initialize and train a model (using a package like scikit-learn) for each group in something like a grouped-map pandas UDF. If you need these
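
A minimal PySpark sketch of that pattern, using the Spark 2.4-era grouped-map API (the column names group, x, y and the model choice are illustrative):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    out_schema = StructType([
        StructField("group", StringType()),
        StructField("coef", DoubleType()),
    ])

    @pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
    def train_model(pdf):
        # Fit one model per group; emit a one-row summary for that group.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    results = df.groupBy("group").apply(train_model)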

Generic Dataset[T] Query

2019-05-09 Thread SNEHASISH DUTTA
Hi, I am trying to write a generic method which will return custom-type datasets as well as spark.sql.Row. def read[T](params: Map[String, Any])(implicit encoder: Encoder[T]): Dataset[T] is my method signature, which works fine for custom types, but when I am trying to obtain a Dataset[Row]