Writing protobuf RDD to parquet

2023-01-20 Thread David Diebold
Hello, I'm trying to write some RDD[T] to Parquet, where T is a protobuf message, in Scala. I am wondering what the best option to do this is, and I would be interested in your insights. So far, I see two possibilities: - use the PairRDD method *saveAsNewAPIHadoopFile*, and I guess I need to call
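A minimal sketch of the saveAsNewAPIHadoopFile route, assuming the parquet-protobuf module (org.apache.parquet.proto.ProtoParquetOutputFormat) is on the classpath; `MyMessage` is a placeholder for the generated protobuf class:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.proto.ProtoParquetOutputFormat
import org.apache.spark.rdd.RDD

def writeProtoRdd(rdd: RDD[MyMessage], path: String): Unit = {
  val job = Job.getInstance()
  // Tell the output format which protobuf class it has to serialize.
  ProtoParquetOutputFormat.setProtobufClass(job, classOf[MyMessage])
  rdd
    .map(msg => (null: Void, msg))          // saveAsNewAPIHadoopFile works on pair RDDs
    .saveAsNewAPIHadoopFile(
      path,
      classOf[Void],
      classOf[MyMessage],
      classOf[ProtoParquetOutputFormat[MyMessage]],
      job.getConfiguration)
}
```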

Question about bucketing and custom partitioners

2022-04-11 Thread David Diebold
Hello, I have a few questions related to bucketing and custom partitioning in the DataFrame API. I am considering bucketing to get a join that is shuffle-free on one side in incremental jobs, but there is one thing that I'm not happy with. Data is likely to grow/skew over time. At some point, I would need to
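For reference, a minimal sketch of writing a bucketed table (column name and bucket count are illustrative); note that the bucket count is fixed at write time, which is exactly the constraint that becomes painful when data grows or skews:

```scala
// Both sides of the later join must be bucketed on the same column with the same
// number of buckets for the shuffle to be avoided on that side.
df.write
  .bucketBy(64, "customer_id")
  .sortBy("customer_id")
  .mode("append")
  .saveAsTable("events_bucketed")   // bucketBy only works with saveAsTable
```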

Re: Question about spark.sql min_by

2022-02-21 Thread David Diebold
rk 3.3, up for release soon. It exists in SQL. You can still use it in SQL with `spark.sql(...)` in Python though, not hard. On Mon, Feb 21, 2022 at 4:01 AM David Diebold wrote: Hello all, I'm trying to use the spar
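A sketch of the workaround described above, shown here in Scala (the same SQL expression works from PySpark through `spark.sql` or `F.expr`): on Spark 3.2, `min_by` is not yet exposed as a DataFrame function, but it is available as a SQL aggregate expression:

```scala
import org.apache.spark.sql.functions.expr

// For each product, pick the seller with the lowest price; column names follow the
// question below.
val cheapest = df
  .groupBy("productId")
  .agg(
    expr("min_by(sellerId, price)").as("sellerId"),
    expr("min(price)").as("minPrice"))
```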

Question about spark.sql min_by

2022-02-21 Thread David Diebold
Hello all, I'm trying to use the spark.sql min_by aggregation function with PySpark. I'm relying on this distribution of Spark: spark-3.2.1-bin-hadoop3.2. I have a dataframe made of these columns: - productId : int - sellerId : int - price : double. For each product, I want to get the seller who

Re: groupMapReduce

2022-01-14 Thread David Diebold
Hello, In the RDD API, you must be looking for reduceByKey. Cheers. On Fri, Jan 14, 2022 at 11:56, frakass wrote: Is there an RDD API which is similar to Scala's groupMapReduce? https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/ Thank you.
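A short sketch of the mapping from Scala's collection method to the RDD API (keyOf and valueOf are placeholders for whatever grouping key and mapped value you need):

```scala
// Scala collections:  xs.groupMapReduce(keyOf)(valueOf)(_ + _)
// RDD equivalent: build (key, value) pairs, then combine values per key.
val reduced = rdd
  .map(x => (keyOf(x), valueOf(x)))
  .reduceByKey(_ + _)
```

Unlike groupByKey followed by a reduce, reduceByKey combines values map-side before shuffling.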

Re: Pyspark debugging best practices

2022-01-03 Thread David Diebold
Hello Andy, Are you sure you want to perform lots of join operations, and not simple unions? Are you doing inner joins or outer joins? Can you give us a rough idea of your list size and of each individual dataset size? Having a look at the execution plan would help; maybe the high amount of
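As an illustration of the union-instead-of-joins suggestion, a sketch that assumes all the DataFrames in `parts` share the same schema (column names are placeholders):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Combine many same-schema DataFrames with a single union, then aggregate once,
// instead of chaining one join per DataFrame.
def combine(parts: Seq[DataFrame]): DataFrame =
  parts
    .reduce(_ unionByName _)
    .groupBy("id")
    .agg(sum("value").as("value"))
```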

question about data skew and memory issues

2021-12-14 Thread David Diebold
Hello all, I was wondering whether it is possible to encounter out-of-memory exceptions on Spark executors when doing some aggregation, when a dataset is skewed. Let's say we have a dataset with two columns: - key : int - value : float. And I want to aggregate values by key. Let's say that we have tons
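One common mitigation, sketched below with illustrative column names, is a two-stage "salted" aggregation, so that a hot key is first reduced across many tasks before the final combine:

```scala
import org.apache.spark.sql.functions._

// Stage 1: spread each key over up to 100 salt values and pre-aggregate.
val partial = df
  .withColumn("salt", (rand() * 100).cast("int"))
  .groupBy("key", "salt")
  .agg(sum("value").as("partialSum"))

// Stage 2: combine the per-salt partial sums into the final result per key.
val result = partial
  .groupBy("key")
  .agg(sum("partialSum").as("value"))
```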

Re: Trying to hash cross features with mllib

2021-10-04 Thread David Diebold
ing for https://spark.apache.org/docs/latest/ml-features.html#interaction ? That's the closest built-in thing I can think of. Otherwise you can make custom transformations. On Fri, Oct 1, 2021, 8:44 AM David Diebold wrote: Hello everyone,
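For context, a minimal sketch of wiring the Interaction transformer mentioned in that link (column names are illustrative); note that Interaction multiplies numeric/vector inputs element-wise, so categorical bucket indices would normally be one-hot encoded before being crossed this way:

```scala
import org.apache.spark.ml.feature.Interaction

// Produces a single vector column containing the products of the input columns.
val interaction = new Interaction()
  .setInputCols(Array("ageBucketVec", "fareBucketVec"))
  .setOutputCol("ageXfare")
```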

Trying to hash cross features with mllib

2021-10-01 Thread David Diebold
Hello everyone, In MLlib, I’m trying to rely essentially on pipelines to create features out of the Titanic dataset, and showcase the power of feature hashing. I want to: - Apply bucketization on some columns (QuantileDiscretizer is fine) - Then I want to cross all my columns
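A possible sketch of that pipeline under illustrative column names: bucketize two numeric columns, make the cross explicit as a string column (FeatureHasher hashes each input column independently, so the cross has to be built first), then hash it:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{FeatureHasher, QuantileDiscretizer, SQLTransformer}

// 1) Bucketize numeric columns into quantile-based bucket ids.
val bucketize = new QuantileDiscretizer()
  .setInputCols(Array("age", "fare"))
  .setOutputCols(Array("ageBucket", "fareBucket"))
  .setNumBuckets(5)

// 2) Build the cross feature as a single categorical (string) column.
val cross = new SQLTransformer()
  .setStatement("SELECT *, concat_ws('_', ageBucket, fareBucket) AS ageXfare FROM __THIS__")

// 3) Hash the crossed column into a fixed-size feature vector.
val hash = new FeatureHasher()
  .setInputCols(Array("ageXfare"))
  .setOutputCol("features")
  .setNumFeatures(1 << 12)

val pipeline = new Pipeline().setStages(Array(bucketize, cross, hash))
```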

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-11 Thread David Diebold
Hi Mich, I don't quite understand why the driver node is using so much CPU, but it may be unrelated to your executors being underused. On that front, I would first check that your job generated enough tasks. Then I would check the spark.executor.cores and spark.task.cpus parameters
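A quick way to sanity-check the parallelism side of this (the conf keys are the standard Spark ones; the defaults shown are assumptions for when they are unset):

```scala
// Task slots the cluster offers versus tasks the job actually produces.
val coresPerExecutor = spark.conf.get("spark.executor.cores", "1").toInt
val cpusPerTask      = spark.conf.get("spark.task.cpus", "1").toInt
val numExecutors     = spark.conf.get("spark.executor.instances", "1").toInt
val totalSlots       = numExecutors * (coresPerExecutor / cpusPerTask)

println(s"task slots: $totalSlots, input partitions: ${df.rdd.getNumPartitions}")
// If the partition count is well below totalSlots, executors sit idle: repartition
// the input or tune spark.sql.files.maxPartitionBytes to create more tasks.
```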