Many thanks all, especially to Mich. That is what I was looking for.

On Friday, 15 October 2021, 09:28:24 BST, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Spark allows one to define the column format as a StructType or a list. By default, Spark assumes that all fields are nullable when creating a dataframe. To change nullability, you need to provide the structure of the columns.

Assume that I have created an RDD in the form

    rdd = sc.parallelize(Range). \
             map(lambda x: (x, usedFunctions.clustered(x,numRows), \
                               usedFunctions.scattered(x,numRows), \
                               usedFunctions.randomised(x,numRows), \
                               usedFunctions.randomString(50), \
                               usedFunctions.padString(x," ",50), \
                               usedFunctions.padSingleChar("x",4000)))

For the above, I create a schema with StructType as below:

    # The type classes live in pyspark.sql.types
    from pyspark.sql.types import (StructType, StructField, IntegerType,
                                   FloatType, StringType)

    Schema = StructType([ StructField("ID", IntegerType(), False),
                          StructField("CLUSTERED", FloatType(), True),
                          StructField("SCATTERED", FloatType(), True),
                          StructField("RANDOMISED", FloatType(), True),
                          StructField("RANDOM_STRING", StringType(), True),
                          StructField("SMALL_VC", StringType(), True),
                          StructField("PADDING", StringType(), True)
                        ])

Note that the first column, ID, is defined as NOT NULL (the third argument to StructField is the nullable flag).

Then I can create a dataframe df as below:

    df = spark.createDataFrame(rdd, schema = Schema)
    df.printSchema()

    root
     |-- ID: integer (nullable = false)
     |-- CLUSTERED: float (nullable = true)
     |-- SCATTERED: float (nullable = true)
     |-- RANDOMISED: float (nullable = true)
     |-- RANDOM_STRING: string (nullable = true)
     |-- SMALL_VC: string (nullable = true)
     |-- PADDING: string (nullable = true)
HTH

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID <ashok34...@yahoo.com.invalid> wrote:

Gurus,

I have an RDD in PySpark that I can convert to a DF through

    df = rdd.toDF()

However, when I do

    df.printSchema()

I see that the columns are nullable = true by default:

    root
     |-- COL-1: long (nullable = true)
     |-- COl-2: double (nullable = true)
     |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE?

Thanking you