Ah ok
Thanks for clearing it up, Ayan! I will give that a go. Thank you all for
your help, this mailing list is awesome!

On Mon, Feb 6, 2017 at 9:07 AM, ayan guha <guha.a...@gmail.com> wrote:

> If I am not missing anything here, "So I know which columns are numeric
> and which aren't because I have a StructType and all the internal
> StructFields will tell me which ones have a DataType which is numeric
> and which aren't" will lead to a list of fields which should be numeric.
>
> Essentially, you will create a list of numeric fields from your
> "should-be" struct type. Then you will load your raw data using the
> built-in JSON reader. At this point, your data has the wrong schema.
> Now you will need to correct it. How? You will loop over the list of
> numeric fields (or you can do it directly on the struct type) and try
> to match the types. If you find a mismatch, you add a withColumn clause
> to cast to the correct data type (from your "should-be" struct).
>
> HTH?
>
> Best
> Ayan
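To make sure I actually understood that recipe, here is a minimal sketch
of it (untested, in Scala; `expectedSchema` is my name for the
"should-be" StructType, and both helper names are made up):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{NumericType, StructField, StructType}

// Derive the list of fields that should be numeric from the expected
// ("should-be") schema.
def numericFields(expectedSchema: StructType): Seq[StructField] =
  expectedSchema.fields.filter(_.dataType.isInstanceOf[NumericType]).toSeq

// Cast each such column (loaded as a string) to the type the expected
// schema demands, building up a new dataframe as Ayan describes.
def castToExpected(df: DataFrame, expectedSchema: StructType): DataFrame =
  numericFields(expectedSchema).foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name, acc(field.name).cast(field.dataType))
  }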
> On Mon, Feb 6, 2017 at 8:00 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Yup, sorry, I should have explained myself better.
>>
>> So I know which columns are numeric and which aren't because I have a
>> StructType, and all the internal StructFields will tell me which ones
>> have a DataType which is numeric and which aren't.
>>
>> So assuming I have a JSON string which has double quotes on numbers
>> when it shouldn't, and I have the correct schema in a StructType, how
>> can I iterate over them to programmatically create the new dataframe
>> in the correct format?
>>
>> Do I iterate over the columns in the StructType, or iterate over the
>> columns in the dataframe and try to match them with the StructType?
>>
>> I hope I cleared things up. What I wouldn't do for a drawing board
>> right now!
>>
>> On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Umm... I think the premise is you need to "know" beforehand which
>>> columns are numeric... Unless you know it, how would you apply the
>>> schema?
>>>
>>> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ayan, but I meant how to derive the list automatically.
>>>>
>>>> In your example you are specifying the numeric columns, and I would
>>>> like it to be applied to any schema, if that makes sense.
>>>>
>>>> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Simple (pyspark) example:
>>>>>
>>>>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>>>> >>> df.printSchema()
>>>>> root
>>>>>  |-- customerid: string (nullable = true)
>>>>>  |-- foo: string (nullable = true)
>>>>>
>>>>> >>> numeric_field_list = ['customerid']
>>>>>
>>>>> >>> for k in numeric_field_list:
>>>>> ...     df = df.withColumn(k, df[k].cast("long"))
>>>>> ...
>>>>> >>> df.printSchema()
>>>>> root
>>>>>  |-- customerid: long (nullable = true)
>>>>>  |-- foo: string (nullable = true)
>>>>>
>>>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Ok, thanks Michael!
>>>>>
>>>>> Can I get an idea on where to start? Assuming I have the end schema
>>>>> and the current dataframe, how can I loop through it and create a
>>>>> new dataframe using withColumn?
>>>>>
>>>>> Am I iterating through the dataframe or the schema?
>>>>>
>>>>> I'm assuming it's easier to iterate through the columns in the old
>>>>> df, and for each column cast it correctly and generate a new df?
>>>>>
>>>>> Would you recommend that?
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>> If you already have the expected schema, and you know that all
>>>>> numbers will always be formatted as strings in the input JSON, you
>>>>> could probably derive this list automatically.
>>>>>
>>>>> Wouldn't it be simpler to just regex replace the numbers to remove
>>>>> the quotes?
>>>>>
>>>>> I think this is likely to be a slower and less robust solution. You
>>>>> would have to make sure that you got all the corner cases right
>>>>> (i.e. escaping and whatnot).
>>>>>
>>>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I see, so for the connector I need to pass in an array/list of
>>>>> numerical columns?
>>>>>
>>>>> Wouldn't it be simpler to just regex replace the numbers to remove
>>>>> the quotes?
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>> Specifying the schema when parsing JSON will only let you pick
>>>>> between similar datatypes (i.e. should this be a short, long,
>>>>> float, double, etc). It will not let you perform conversions like
>>>>> string <-> number. This has to be done with explicit casts after
>>>>> the data has been loaded.
>>>>>
>>>>> I think you can make a solution that uses select or withColumn
>>>>> generic. Just load the dataframe with a "parse schema" that treats
>>>>> numbers as strings. Then construct a list of columns that should be
>>>>> numbers and apply the necessary conversions.
>>>>>
>>>>> import org.apache.spark.sql.functions.col
>>>>> var df = spark.read.schema(parseSchema).json("...")
>>>>> numericColumns.foreach { columnName =>
>>>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>>>> }
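Michael's "parse schema" above can itself be derived from the expected
schema rather than hand-written, which is the generic behaviour I am
after. Again only a sketch (same assumed `expectedSchema`; it walks
top-level fields only, so nested structs would need recursion):

import org.apache.spark.sql.types.{NumericType, StringType, StructType}

// Downgrade every numeric field to a string so the quoted numbers in
// the JSON load without being nulled out.
def parseSchemaFor(expectedSchema: StructType): StructType =
  StructType(expectedSchema.fields.map { f =>
    if (f.dataType.isInstanceOf[NumericType]) f.copy(dataType = StringType)
    else f
  })

// Load with the derived parse schema, then cast back with the
// castToExpected helper sketched earlier.
val raw = spark.read.schema(parseSchemaFor(expectedSchema)).json("...")
val fixed = castToExpected(raw, expectedSchema)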
>>>>>
>>>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Thanks Michael,
>>>>>
>>>>> I've been spending the past few days researching this.
>>>>>
>>>>> The problem is that the generated JSON has double quotes on fields
>>>>> that are numbers, because the producing datastore doesn't want to
>>>>> lose precision.
>>>>>
>>>>> I can change the data type, true, but that would be specific to one
>>>>> job rather than generic. I'm writing a structured streaming
>>>>> connector, and I have the schema the generated dataframe should
>>>>> match.
>>>>>
>>>>> Unfortunately using withColumn won't help me here, since the
>>>>> solution needs to be generic.
>>>>>
>>>>> To summarise, assume I have the following JSON:
>>>>>
>>>>> [{
>>>>>   "customerid": "535137",
>>>>>   "foo": "bar"
>>>>> }]
>>>>>
>>>>> and I know the schema should be:
>>>>>
>>>>> StructType(Array(StructField("customerid",LongType,true),
>>>>>   StructField("foo",StringType,true)))
>>>>>
>>>>> What's the best way of solving this?
>>>>>
>>>>> My current approach is to iterate over the JSON and identify which
>>>>> fields are numbers and which aren't, then recreate the JSON.
>>>>>
>>>>> But to be honest that doesn't seem like the cleanest approach, so
>>>>> I'm happy for advice on this.
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>> -dev
>>>>>
>>>>> You can use withColumn to change the type after the data has been
>>>>> loaded
>>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>>>>>
>>>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Dirceu,
>>>>>
>>>>> Thanks, you're right! That did work.
>>>>>
>>>>> But now I'm facing an even bigger problem: since I don't have
>>>>> access to change the underlying data, I just want to apply a schema
>>>>> over something that was written via sparkContext.newAPIHadoopRDD.
>>>>>
>>>>> Basically I am reading in an RDD[JsonObject] and would like to
>>>>> convert it into a dataframe which I pass the schema into.
>>>>>
>>>>> What's the best way to do this?
>>>>>
>>>>> I doubt removing all the quotes in the JSON is the best solution,
>>>>> is it?
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho
>>>>> <dirceu.semigh...@gmail.com> wrote:
>>>>>
>>>>> Hi Sam,
>>>>> Remove the " from the number and it will work.
>>>>>
>>>>> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to specify a schema when reading from a JSON, but when
>>>>> trying to map a number to a Double it fails. I tried FloatType and
>>>>> IntType with no joy!
>>>>>
>>>>> When inferring the schema, customerid is set to String, and I would
>>>>> like to cast it as Double, so df1 below is corrupted while df2
>>>>> displays fine.
>>>>>
>>>>> Also, FYI, I need this to be generic, as I would like to apply it
>>>>> to any JSON; I specified the below schema as an example of the
>>>>> issue I am facing:
>>>>>
>>>>> import org.apache.spark.sql.types.{BinaryType, StringType,
>>>>>   StructField, DoubleType, FloatType, StructType, LongType,
>>>>>   DecimalType}
>>>>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>>>>> val df1 = spark.read.schema(testSchema).json(
>>>>>   sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>>> val df2 = spark.read.json(
>>>>>   sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>>> df1.show(1)
>>>>> df2.show(1)
>>>>>
>>>>> Any help would be appreciated. I am sure I am missing something
>>>>> obvious, but for the life of me I can't tell what it is!
>>>>>
>>>>> Kind Regards
>>>>> Sam
>
> --
> Best Regards,
> Ayan Guha
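PS: for my earlier RDD[JsonObject] question further up the thread, I
think the same two helpers apply once the objects are rendered back to
JSON strings. A sketch, assuming gson's JsonObject (whose toString emits
JSON text) and the parseSchemaFor/castToExpected helpers sketched
earlier:

import org.apache.spark.rdd.RDD

// `rdd` stands for the hypothetical RDD[JsonObject] that came out of
// sparkContext.newAPIHadoopRDD.
val jsonStrings: RDD[String] = rdd.map(_.toString)
val parsed = spark.read.schema(parseSchemaFor(expectedSchema)).json(jsonStrings)
val df = castToExpected(parsed, expectedSchema)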