Yup, sorry, I should have explained myself better. I know which columns are numeric and which aren't because I have a StructType, and its internal StructFields tell me which ones have a numeric DataType and which don't.
So assuming I have a JSON string that has double quotes on numbers when it shouldn't, and I have the correct schema in a StructType, how can I iterate over them to programmatically create the new dataframe in the correct format? Do I iterate over the columns in the StructType, or iterate over the columns in the dataframe and try to match them against the StructType?

I hope that clears things up. What I wouldn't give for a drawing board right now!

On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <guha.a...@gmail.com> wrote:

> Umm... I think the premise is you need to "know" beforehand which columns
> are numeric. Unless you know it, how would you apply the schema?
>
> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Thanks ayan, but I meant how to derive the list automatically.
>>
>> In your example you are specifying the numeric columns, and I would like
>> it to be applied to any schema, if that makes sense.
>>
>> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Simple (pyspark) example:
>>>
>>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>> df.printSchema()
>>> root
>>>  |-- customerid: string (nullable = true)
>>>  |-- foo: string (nullable = true)
>>>
>>> numeric_field_list = ['customerid']
>>>
>>> for k in numeric_field_list:
>>> ...     df = df.withColumn(k, df[k].cast("long"))
>>> ...
>>> df.printSchema()
>>> root
>>>  |-- customerid: long (nullable = true)
>>>  |-- foo: string (nullable = true)
>>>
>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Ok, thanks Michael!
>>>
>>> Can I get an idea on where to start? Assuming I have the end schema and
>>> the current dataframe...
>>> How can I loop through it and create a new dataframe using withColumn?
>>>
>>> Am I iterating through the dataframe or the schema?
>>>
>>> I'm assuming it's easier to iterate through the columns in the old df.
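To answer the iterate-over-which question directly: iterate over the schema, since the StructType alone knows which columns should be numeric. A minimal stdlib-only sketch of the idea, with a plain dict standing in for the StructType (the field names and type names here are made up for illustration; with a real pyspark StructType you would walk `schema.fields` and test `isinstance(field.dataType, NumericType)` instead):

```python
import json

# Hypothetical stand-in for a StructType: field name -> type name.
schema = {"customerid": "long", "foo": "string"}
NUMERIC_TYPES = {"long", "int", "double", "float"}

def numeric_fields(schema):
    # Iterate over the schema, not the data: the schema is the source
    # of truth for which columns should be numeric.
    return [name for name, typ in schema.items() if typ in NUMERIC_TYPES]

def cast_record(record, schema):
    # Cast quoted numbers back to real numbers, per the schema.
    fixed = dict(record)
    for name in numeric_fields(schema):
        if isinstance(fixed.get(name), str):
            caster = int if schema[name] in ("long", "int") else float
            fixed[name] = caster(fixed[name])
    return fixed

rows = json.loads('[{"customerid": "535137", "foo": "bar"}]')
fixed_rows = [cast_record(r, schema) for r in rows]
```

In Spark itself the loop body would instead be a `withColumn` cast per numeric field, as in ayan's pyspark example above; only the derivation of the field list changes.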
>>> For each column, cast it correctly and generate a new df?
>>>
>>> Would you recommend that?
>>>
>>> Regards
>>> Sam
>>>
>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>> If you already have the expected schema, and you know that all numbers
>>> will always be formatted as strings in the input JSON, you could probably
>>> derive this list automatically.
>>>
>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>> quotes?
>>>
>>> I think this is likely to be a slower and less robust solution. You
>>> would have to make sure that you got all the corner cases right (i.e.
>>> escaping and whatnot).
>>>
>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> I see, so for the connector I need to pass in an array/list of numerical
>>> columns?
>>>
>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>> quotes?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>> Specifying the schema when parsing JSON will only let you pick between
>>> similar datatypes (i.e. should this be a short, long, float, double, etc.).
>>> It will not let you perform conversions like string <-> number. This has
>>> to be done with explicit casts after the data has been loaded.
>>>
>>> I think you can make a solution that uses select or withColumn generic.
>>> Just load the dataframe with a "parse schema" that treats numbers as
>>> strings. Then construct a list of columns that should be numbers and apply
>>> the necessary conversions.
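The "parse schema" described here can itself be derived from the expected schema, so nothing has to be listed by hand. A stdlib-only sketch, again using a dict as a hypothetical stand-in for a StructType (with a real one you would rebuild each StructField, swapping any NumericType for StringType):

```python
# Expected schema (hypothetical stand-in for a StructType).
expected = {"customerid": "long", "foo": "string"}
NUMERIC_TYPES = {"long", "int", "double", "float"}

# The "parse schema": identical, except every numeric field is read as a
# string, so quoted numbers load without corruption.
parse_schema = {name: ("string" if typ in NUMERIC_TYPES else typ)
                for name, typ in expected.items()}

# The list of columns to cast afterwards falls out of the same walk.
numeric_columns = [name for name, typ in expected.items()
                   if typ in NUMERIC_TYPES]
```

One pass over the expected schema thus yields both inputs the generic solution needs: the schema to read with and the columns to cast afterwards.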
>>>
>>> import org.apache.spark.sql.functions.col
>>> var df = spark.read.schema(parseSchema).json("...")
>>> numericColumns.foreach { columnName =>
>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>> }
>>>
>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Thanks Michael
>>>
>>> I've been spending the past few days researching this.
>>>
>>> The problem is that the generated JSON has double quotes on fields that
>>> are numbers, because the producing datastore doesn't want to lose precision.
>>>
>>> I can change the data type, true, but that would be specific to one job
>>> rather than a generic streaming job. I'm writing a structured streaming
>>> connector and I have the schema the generated dataframe should match.
>>>
>>> Unfortunately using withColumn won't help me here, since the solution
>>> needs to be generic.
>>>
>>> To summarise, assume I have the following JSON:
>>>
>>> [{
>>>   "customerid": "535137",
>>>   "foo": "bar"
>>> }]
>>>
>>> and I know the schema should be:
>>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>>
>>> What's the best way of solving this?
>>>
>>> My current approach is to iterate over the JSON, identify which fields
>>> are numbers and which aren't, then recreate the JSON.
>>>
>>> But to be honest that doesn't seem like the cleanest approach, so I'm
>>> happy for advice on this.
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>> -dev
>>>
>>> You can use withColumn to change the type after the data has been loaded
>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>> .
>>>
>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Hi Dirceu
>>>
>>> Thanks, you're right!
>>> That did work!
>>>
>>> But now I'm facing an even bigger problem, since I don't have access to
>>> change the underlying data. I just want to apply a schema over something
>>> that was written via sparkContext.newAPIHadoopRDD.
>>>
>>> Basically I am reading in an RDD[JsonObject] and would like to convert
>>> it into a dataframe which I pass the schema into.
>>>
>>> What's the best way to do this?
>>>
>>> I doubt removing all the quotes in the JSON is the best solution, is it?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>> dirceu.semigh...@gmail.com> wrote:
>>>
>>> Hi Sam
>>> Remove the " from the number and it will work.
>>>
>>> On Feb 4, 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a JSON, but when
>>> trying to map a number to a Double it fails; I tried FloatType and IntType
>>> with no joy!
>>>
>>> When inferring the schema, customerid is set to String, and I would like
>>> to cast it as Double, so df1 comes out corrupted while df2 shows the data
>>> fine.
>>>
>>> Also FYI, I need this to be generic, as I would like to apply it to any
>>> JSON; I specified the below schema as an example of the issue I am facing.
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>>> DoubleType, FloatType, StructType, LongType, DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>>> val df1 =
>>> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 =
>>> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>> Any help would be appreciated. I am sure I am missing something obvious,
>>> but for the life of me I can't tell what it is!
>>>
>>> Kind Regards
>>> Sam
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>
> --
> Best Regards,
> Ayan Guha
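Michael's warning about the regex route is easy to demonstrate: a quote-stripping regex has no schema, so it also unquotes string fields whose values merely look like numbers. A small stdlib sketch (the `zip` field is a made-up example of a digits-only value that must remain a string):

```python
import json
import re

raw = '[{"customerid": "535137", "zip": "90210"}]'

# Naively strip the quotes around anything that looks like an integer.
naive = re.sub(r'"(\d+)"', r'\1', raw)
rows = json.loads(naive)

# customerid is now a number, as intended -- but so is zip, which should
# have stayed a string. Escaped quotes and digits embedded in longer
# strings add further corner cases; casting via the schema avoids them all.
```

This is exactly the "corner cases right (i.e. escaping and whatnot)" problem: only the schema can say which digit strings are actually numbers.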