Thanks Ayan, but I meant how to derive the list automatically. In your example you are specifying the numeric columns, and I would like it to be applied to any schema, if that makes sense.
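A minimal sketch of what deriving the list automatically could look like, assuming the target schema is available as a StructType (the names expectedSchema and df below are hypothetical: df is the frame read with the numbers still parsed as strings):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.NumericType

// Every field whose target type is numeric needs a cast.
val numericFields = expectedSchema.fields.filter(_.dataType.isInstanceOf[NumericType])

// Fold one withColumn cast per numeric field over the dataframe.
val converted = numericFields.foldLeft(df) { (acc, field) =>
  acc.withColumn(field.name, col(field.name).cast(field.dataType))
}

This way nothing is hand-maintained: any field the target schema declares as numeric gets cast, and everything else passes through untouched.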
On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:

> Simple (pyspark) example:
>
> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
> >>> df.printSchema()
> root
>  |-- customerid: string (nullable = true)
>  |-- foo: string (nullable = true)
>
> >>> numeric_field_list = ['customerid']
>
> >>> for k in numeric_field_list:
> ...     df = df.withColumn(k, df[k].cast("long"))
> ...
> >>> df.printSchema()
> root
>  |-- customerid: long (nullable = true)
>  |-- foo: string (nullable = true)
>
> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Ok, thanks Michael!
>
> Can I get an idea of where to start? Assuming I have the end schema and
> the current dataframe, how can I loop through it and create a new
> dataframe using withColumn?
>
> Am I iterating through the dataframe or the schema?
>
> I'm assuming it's easier to iterate through the columns in the old df:
> for each column, cast it correctly and generate a new df?
>
> Would you recommend that?
>
> Regards
> Sam
>
> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com> wrote:
>
> If you already have the expected schema, and you know that all numbers
> will always be formatted as strings in the input JSON, you could probably
> derive this list automatically.
>
> > Wouldn't it be simpler to just regex replace the numbers to remove the
> > quotes?
>
> I think this is likely to be a slower and less robust solution. You would
> have to make sure that you got all the corner cases right (i.e. escaping
> and whatnot).
>
> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> I see, so for the connector I need to pass in an array/list of numerical
> columns?
>
> Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?
>
> Regards
> Sam
>
> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> Specifying the schema when parsing JSON will only let you pick between
> similar datatypes (i.e. should this be a short, long, float, double, etc.).
> It will not let you perform conversions like string <-> number. This has
> to be done with explicit casts after the data has been loaded.
>
> I think you can make a solution that uses select or withColumn generic.
> Just load the dataframe with a "parse schema" that treats numbers as
> strings. Then construct a list of columns that should be numbers and
> apply the necessary conversions.
>
> import org.apache.spark.sql.functions.col
> var df = spark.read.schema(parseSchema).json("...")
> numericColumns.foreach { columnName =>
>   df = df.withColumn(columnName, col(columnName).cast("long"))
> }
>
> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Thanks Michael
>
> I've been spending the past few days researching this.
>
> The problem is that the generated JSON has double quotes on fields that
> are numbers, because the producing datastore doesn't want to lose
> precision.
>
> I can change the data type, true, but that would be specific to one job
> rather than a generic streaming job. I'm writing a structured streaming
> connector and I have the schema the generated dataframe should match.
>
> Unfortunately using withColumn won't help me here since the solution
> needs to be generic.
>
> To summarise, assume I have the following JSON:
>
> [{
>   "customerid": "535137",
>   "foo": "bar"
> }]
>
> and I know the schema should be:
>
> StructType(Array(StructField("customerid", LongType, true), StructField("foo", StringType, true)))
>
> What's the best way of solving this?
>
> My current approach is to iterate over the JSON and identify which fields
> are numbers and which aren't, then recreate the JSON.
>
> But to be honest that doesn't seem like the cleanest approach, so I'm
> happy for advice on this.
>
> Regards
> Sam
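For the exact record above, a hedged end-to-end sketch (shown as a single JSON object, since the line-delimited reader expects one object per line): let Spark infer every column, then cast each one to its expected type in a single select. Casting a column that already has the right type is a no-op; this assumes every expected field is actually present in the input.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val expected = StructType(Array(
  StructField("customerid", LongType, true),
  StructField("foo", StringType, true)))

// Infer everything (quoted numbers arrive as strings), then cast
// column-by-column according to the expected schema.
val raw = spark.read.json(sc.parallelize(Array("""{"customerid": "535137", "foo": "bar"}""")))
val typed = raw.select(expected.fields.map(f => col(f.name).cast(f.dataType)): _*)
typed.printSchema()
// root
//  |-- customerid: long (nullable = true)
//  |-- foo: string (nullable = true)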
> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com> wrote:
>
> -dev
>
> You can use withColumn to change the type after the data has been loaded
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>
> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Hi Dirceu
>
> Thanks, you're right! That did work.
>
> But now I'm facing an even bigger problem: since I don't have access to
> change the underlying data, I just want to apply a schema over something
> that was written via sparkContext.newAPIHadoopRDD.
>
> Basically I am reading in an RDD[JsonObject] and would like to convert it
> into a dataframe which I pass the schema into.
>
> What's the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution, is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>
> Hi Sam,
> Remove the " from the number and it will work.
>
> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> wrote:
>
> Hi All
>
> I would like to specify a schema when reading from a JSON file, but when
> trying to map a number to a Double it fails; I tried FloatType and
> IntegerType with no joy!
>
> When inferring the schema, customerid is set to String, and I would like
> to cast it as Double. In the example below df1 comes out corrupted while
> df2 shows the data.
>
> Also, FYI, I need this to be generic as I would like to apply it to any
> JSON; I specified the below schema as an example of the issue I am facing.
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType, FloatType, StructType, LongType, DecimalType}
> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
> Any help would be appreciated; I am sure I am missing something obvious,
> but for the life of me I can't tell what it is!
>
> Kind Regards
> Sam
>
> --
> Best Regards,
> Ayan Guha
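Tying back to the failing example at the bottom of the thread: reading the quoted number directly with a DoubleType schema yields a corrupted row, while the cast-after-load approach Michael links to sidesteps that. A sketch, under the same single-record setup as the original:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Let the quoted value arrive as a string, then cast it explicitly.
val df = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val fixed = df.withColumn("customerid", col("customerid").cast(DoubleType))
fixed.show(1)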