Re: how to classify column

2022-02-11 Thread frakass
That's good, thanks. On 2022/2/12 12:11, Raghavendra Ganesh wrote: .withColumn("newColumn", expr(s"case when score>3 then 'good' else 'bad' end"))

unsubscribe

2022-02-11 Thread Basavaraj
unsubscribe

Unable to access Google buckets using spark-submit

2022-02-11 Thread karan alang
Hello All, I'm trying to access GCP buckets while running spark-submit locally, and I'm running into issues. I'm getting this error:
```
22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread
```
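A common cause of this kind of failure is a missing GCS connector or missing credentials. Below is a minimal, hypothetical PySpark sketch of wiring up the connector; it assumes the gcs-connector jar is already on the classpath (e.g. via --jars), and the key-file path and bucket name are purely illustrative:

```python
from pyspark.sql import SparkSession

# Sketch: register the GCS filesystem implementation and service-account
# credentials so that gs:// paths resolve. Paths and bucket are illustrative.
spark = (
    SparkSession.builder
    .appName("gcs-access-example")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account.json")
    .getOrCreate()
)

df = spark.read.parquet("gs://my-bucket/some/prefix/")  # illustrative bucket
df.show(5)
```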

Re: how to classify column

2022-02-11 Thread Raghavendra Ganesh
You could use expr() function to achieve the same. .withColumn("newColumn",expr(s"case when score>3 then 'good' else 'bad' end")) -- Raghavendra On Fri, Feb 11, 2022 at 5:59 PM frakass wrote: > Hello > > I have a column whose value (Int type as score) is from 0 to 5. > I want to query that,
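For reference, a minimal PySpark sketch of the same idea; the column name and threshold come from the thread, while the sample data is made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (3,), (5,)], ["score"])

# Classify with a SQL CASE expression; when()/otherwise() would work equally well.
classified = df.withColumn(
    "newColumn",
    F.expr("CASE WHEN score > 3 THEN 'good' ELSE 'bad' END"),
)
classified.show()
```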

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-11 Thread Sean Owen
3.0.x is about EOL now, and I hadn't heard anyone come forward to push a final maintenance release. Is there a specific issue you're concerned about? On Fri, Feb 11, 2022 at 4:24 PM Rajesh Krishnamurthy < rkrishnamur...@perforce.com> wrote: > Hi there, > > We are just wondering if there are

Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-11 Thread Rajesh Krishnamurthy
Hi there, We are just wondering whether the Spark community has any plans to actively pursue development on the 3.0.x line. I know the latest version of Spark is 3.2.x, but we are wondering if there are any development plans to have the vulnerabilities fixed on the 3.0.x

RE: determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
The output I sent originally with WEEKOFMONTHF is when LEGACY is set; when EXCEPTION is set this is the result, which is also different:

+---+----------+------------+
|  a|         b|WEEKOFMONTHF|
+---+----------+------------+
|  1|2014-03-07|           7|
|  2|2014-03-08|           1|
|

Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-11 Thread Mich Talebzadeh
The equivalent of Google GKE Autopilot in AWS is AWS Fargate. I have not used AWS Fargate, so I can only mention Google's GKE Autopilot. This is developed from the concept of

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Gourav Sengupta
Hi, just trying to understand the problem before solving it.
1. You mentioned "The primary key of the static table is non-unique". This appears to be a design flaw to me.
2. You once again mentioned "The Pandas UDF is then applied to the resulting stream-static join and stored in a table. To

Re: Spark 3.1.2 full thread dumps

2022-02-11 Thread Maksim Grinman
Thanks for these suggestions. Regarding hot nodes, are you referring to the same as in this article? https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x. I am also curious where the 10MB heuristic came from, though I have heard a similar heuristic with respect to the size of a

Re: determine week of month from date in spark3

2022-02-11 Thread Sean Owen
Here is some back-story: https://issues.apache.org/jira/browse/SPARK-32683 I think the answer may be: use "F"? On Fri, Feb 11, 2022 at 12:43 PM Appel, Kevin wrote: > Previously in Spark2 we could use the spark function date_format with the > “W” flag and it will provide back the week of month

Re: Using Avro file format with SparkSQL

2022-02-11 Thread Gourav Sengupta
Hi Anna, Avro libraries should be built into Spark, if I am not wrong. Any particular reason why you are using a deprecated, or soon-to-be-deprecated, version of Spark? Spark 3.2.1 is fantastic. Please do let us know about your setup if possible. Regards, Gourav Sengupta On Thu, Feb 10,
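For completeness, Avro support ships as the external spark-avro module rather than in core Spark. A minimal sketch, assuming the job is launched with something like --packages org.apache.spark:spark-avro_2.12:3.2.1 and with an illustrative path:

```python
from pyspark.sql import SparkSession

# Assumes the spark-avro module is on the classpath (e.g. via --packages).
spark = SparkSession.builder.getOrCreate()

df = spark.read.format("avro").load("/path/to/data.avro")  # illustrative path
df.printSchema()
```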

determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
Previously, in Spark 2, we could use the date_format function with the "W" pattern and it would return the week of month for that date. In Spark 3, trying this produces an error: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading
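For context, this exception typically points at the legacy datetime parser policy. A minimal sketch of restoring the Spark 2 behaviour for this pattern; the DataFrame below is illustrative, and whether falling back to LEGACY is acceptable depends on the rest of the job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Fall back to the pre-3.0 SimpleDateFormat patterns, which still accept "W".
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = (
    spark.createDataFrame([("2014-03-07",), ("2014-03-08",)], ["b"])
    .withColumn("b", F.to_date("b"))
)
df.withColumn("week_of_month", F.date_format("b", "W")).show()
```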

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Adam Binford
Writing to Delta might not support the write.option method. We set spark.hadoop.parquet.block.size in our spark config for writing to Delta. Adam On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho wrote: > I tried re-writing the table with the updated block size but it doesn't > appear to have an
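A sketch of that suggestion, setting the Parquet row-group size through the Hadoop configuration at session build time; the 16 MB value and paths are illustrative, and whether Delta honours it can depend on the Delta/Spark versions in use:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # parquet.block.size is the Parquet row-group size, in bytes.
    .config("spark.hadoop.parquet.block.size", str(16 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.format("delta").load("/path/to/source")       # illustrative path
df.write.format("delta").mode("overwrite").save("/path/to/target")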

Re: Spark 3.1.2 full thread dumps

2022-02-11 Thread Lalwani, Jayesh
You can probably tune writing to Elasticsearch by:
1. Increasing the number of partitions so you are writing smaller batches of rows to Elasticsearch
2. Using Elasticsearch's bulk API
3. Scaling up the number of hot nodes on the Elasticsearch cluster to support writing in parallel.
You
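A rough sketch of points 1 and 2 using the elasticsearch-hadoop connector; the option names are as I recall them from that library, and the node address, index name, partition count, and batch sizes are all illustrative:

```python
# Repartition so each task writes a smaller batch, then let the connector's
# bulk settings bound the size of each bulk request.
df = spark.table("events")  # stand-in for the real DataFrame

(df.repartition(200)
   .write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "es-host:9200")        # illustrative address
   .option("es.batch.size.entries", "500")    # docs per bulk request
   .option("es.batch.size.bytes", "1mb")      # bytes per bulk request
   .mode("append")
   .save("my-index"))                         # illustrative index name
```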

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
I tried re-writing the table with the updated block size but it doesn't appear to have an effect on the row group size.
```pyspark
df = spark.read.format("delta").load("/path/to/source1")
df.write \
    .format("delta") \
    .mode("overwrite") \
    .options(**{
        "parquet.block.size": "1m",
    }) \
```

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Sean Owen
It should just be parquet.block.size indeed. spark.write.option("parquet.block.size", "16m").parquet(...) This is an issue in how you write, not read, the parquet. On Fri, Feb 11, 2022 at 8:26 AM Chris Coutinho wrote: > Hi Adam, > > Thanks for the explanation on the empty partitions. > > We

Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
Hi, I am in a meeting, but you can look out for a setting that tells Spark how many bytes to read from a file in one go. I use SQL, which is far better, in case you are using dataframes. As we still do not know which Spark version you are using, it may cause issues around skew, and

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
Hi Adam, Thanks for the explanation on the empty partitions. We have the freedom to adjust how the source table is written, so if there are any improvements we can implement on the source side we'd be happy to look into that. It's not yet clear to me how you can reduce the row group size of the

Repartitioning dataframe by file wite size and preserving order

2022-02-11 Thread Danil Suetin
Hello, I want to be able to write a dataframe with a set average file size, using ORC or Parquet. Also, preserving the dataframe's sort order is important. The task is that I receive a dataframe I know nothing about as an argument, and I need to write ORC or Parquet files of roughly constant size with

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Adam Binford
The smallest unit of work you can do on a parquet file (under the delta hood) is based on the parquet row group size, which by default is 128mb. If you specify maxPartitionBytes of 10mb, what that will basically do is create a partition for each 10mb of a file, but whatever partition covers the
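A small sketch of the interaction being described, with illustrative paths; the Parquet row-group size still sets the floor on how finely the files can actually be split, so many of the extra partitions end up empty:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask Spark to target ~10 MB per input partition. Actual splits can be no finer
# than one Parquet row group (128 MB by default), hence the empty partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(10 * 1024 * 1024))

df = spark.read.format("delta").load("/path/to/static")  # illustrative path
print(df.rdd.getNumPartitions())
```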

how to classify column

2022-02-11 Thread frakass
Hello, I have a column whose value (an Int score) ranges from 0 to 5. I want to query it so that when the score > 3 it is classified as "good", otherwise as "bad". How do I implement that? With a UDF, something like this?
scala> implicit class Foo(i:Int) {
     |   def classAs(f:Int=>String) = f(i)

Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
Hello, We have a Spark Structured Streaming job that includes a stream-static join and a Pandas UDF, streaming to/from Delta tables. The primary key of the static table is non-unique, meaning that the streaming join results in multiple records per input record - in our case a 100x increase. The
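For readers following along, a rough sketch of the pipeline shape being described; all paths, column names, and the UDF body are illustrative, and Delta Lake is assumed to be on the classpath:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

static_df = spark.read.format("delta").load("/path/to/static")   # illustrative

@F.pandas_udf("double")
def score(v: pd.Series) -> pd.Series:
    return v * 2.0  # placeholder computation

stream_df = (
    spark.readStream.format("delta").load("/path/to/stream")     # illustrative
    .join(static_df, on="key", how="inner")   # non-unique key fans out rows
    .withColumn("result", score(F.col("value")))
)

query = (
    stream_df.writeStream.format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/output")
)
```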

Re: data size exceeds the total ram

2022-02-11 Thread Mich Talebzadeh
Check this: https://sparkbyexamples.com/spark/spark-partitioning-understanding/

Re: data size exceeds the total ram

2022-02-11 Thread frakass
Hello list, I have imported the data into Spark and I found there is disk I/O on every node. Memory did not overflow, but this query is quite slow: >>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show() May I ask: 1. Since I have 3 nodes (that is, 3 executors?), are there

Re: data size exceeds the total ram

2022-02-11 Thread frakass
On 2022/2/11 6:16, Gourav Sengupta wrote: What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the SPARK version that you are using? It's a well-formed CSV file (uncompressed)

Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
Hi, just so that we understand the problem first: What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the Spark version that you are using? Thanks and Regards, Gourav Sengupta On Fri,

Re: data size exceeds the total ram

2022-02-11 Thread Mich Talebzadeh
Well, one experiment is worth many times more than asking what-if scenario questions.
1. Try running it first to see how Spark handles it
2. Go to the Spark GUI (on port 4044) and look at the Storage tab to see what it says
3. Unless you explicitly persist the data, Spark will read the
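As an illustration of point 3, a minimal sketch of explicitly persisting a dataset larger than memory so that partitions which do not fit spill to disk; the input path is illustrative:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("/path/to/1tb/input", header=True)   # illustrative path
df.persist(StorageLevel.MEMORY_AND_DISK)  # spill what doesn't fit in memory
df.count()  # materialize, then inspect the Storage tab in the Spark UI
```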

data size exceeds the total ram

2022-02-11 Thread frakass
Hello, I have three nodes with 128 GB of memory each, 128 GB x 3 = 384 GB in total, but the input data is about 1 TB. How can Spark handle this case? Thanks.