Thanks Ayan, but I meant how to derive the list automatically.

In your example you are specifying the numeric columns explicitly, and I would like
it to be applied to any schema, if that makes sense.
On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:

> Simple (pyspark) example:
>
> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
> >>> df.printSchema()
> root
>  |-- customerid: string (nullable = true)
>  |-- foo: string (nullable = true)
>
> >>> numeric_field_list = ['customerid']
>
> >>> for k in numeric_field_list:
> ...     df = df.withColumn(k,df[k].cast("long"))
> ...
> >>> df.printSchema()
> root
>  |-- customerid: long (nullable = true)
>  |-- foo: string (nullable = true)
>
>
> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
> Ok thanks Michael!
>
>
> Can I get an idea of where to start? Assuming I have the end schema and
> the current dataframe...
> How can I loop through it and create a new dataframe using withColumn?
>
>
> Am I iterating through the dataframe or the schema?
>
> I'm assuming it's easier to iterate through the columns in the old df. For
> each column, cast it correctly and generate a new df?
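>
> Something like this, maybe (a rough sketch, assuming endSchema is the
> target StructType):
>
> val converted = endSchema.fields.foldLeft(df) { (acc, field) =>
>   acc.withColumn(field.name, acc(field.name).cast(field.dataType))
> }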
>
>
> Would you recommend that?
>
> Regards
> Sam
> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> If you already have the expected schema, and you know that all numbers
> will always be formatted as strings in the input JSON, you could probably
> derive this list automatically.
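>
> For instance (a sketch, assuming expectedSchema is the schema you want to
> end up with):
>
> import org.apache.spark.sql.types._
>
> // every field whose target type is numeric must have been quoted in the input
> val numericColumns = expectedSchema.fields
>   .filter(_.dataType.isInstanceOf[NumericType])
>   .map(_.name)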
>
> Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> I think this is likely to be a slower and less robust solution.  You would
> have to make sure that you got all the corner cases right (e.g. escaping
> and whatnot).
>
> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
> I see, so for the connector I need to pass in an array/list of numerical
> columns?
>
> Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?
>
>
> Regards
> Sam
>
> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> Specifying the schema when parsing JSON will only let you pick between
> similar datatypes (i.e. should this be a short, long, float, double, etc.).  It
> will not let you perform conversions like string <-> number.  This has to
> be done with explicit casts after the data has been loaded.
>
> I think you can make a solution that uses select or withColumn generic.
> Just load the dataframe with a "parse schema" that treats numbers as
> strings.  Then construct a list of columns that should be numbers and apply
> the necessary conversions.
>
> // Load with a parse schema that reads the numeric fields as strings.
> import org.apache.spark.sql.functions.col
> var df = spark.read.schema(parseSchema).json("...")
> // Then cast each known-numeric column to its real type.
> numericColumns.foreach { columnName =>
>   df = df.withColumn(columnName, col(columnName).cast("long"))
> }
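>
> The parse schema itself could also be derived from the expected schema,
> e.g. (a sketch, assuming the expected schema is in scope as expectedSchema
> and that numbers always arrive quoted):
>
> import org.apache.spark.sql.types._
>
> val parseSchema = StructType(expectedSchema.map {
>   case f if f.dataType.isInstanceOf[NumericType] => f.copy(dataType = StringType)
>   case f => f
> })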
>
>
>
> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
> Thanks Michael
>
> I've been spending the past few days researching this
>
> The problem is the generated JSON has double quotes on fields that are
> numbers, because the producing datastore doesn't want to lose precision
>
> I can change the data type, true, but that would be specific to one job
> rather than a generic streaming job. I'm writing a structured streaming
> connector and I have the schema the generated dataframe should match.
>
> Unfortunately using withColumn won't help me here since the solution needs
> to be generic
>
> To summarise assume I have the following json
>
> [{
>   "customerid": "535137",
>   "foo": "bar"
> }]
>
>
> and I know the schema should be:
>
> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>
> What's the best way of solving this?
>
> My current approach is to iterate over the JSON and identify which fields
> are numbers and which aren't, then recreate the JSON
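>
> i.e. something along these lines (a rough sketch, assuming Gson's
> JsonObject and that the numeric fields fit in a long):
>
> import com.google.gson.{JsonObject, JsonPrimitive}
>
> def unquoteNumbers(obj: JsonObject, numericFields: Set[String]): JsonObject = {
>   // replace each quoted numeric value with an unquoted numeric one
>   numericFields.filter(obj.has).foreach { name =>
>     obj.add(name, new JsonPrimitive(obj.get(name).getAsLong))
>   }
>   obj
> }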
>
> But to be honest that doesn't seem like the cleanest approach, so I'm happy
> for advice on this
>
> Regards
> Sam
>
> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> -dev
>
> You can use withColumn to change the type after the data has been loaded
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
> .
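>
> For example, a minimal illustration with the df2 from your snippet:
>
> val fixed = df2.withColumn("customerid", df2("customerid").cast("double"))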
>
> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
> Hi Dirceu
>
> Thanks, you're right! That did work.
>
>
> But now I'm facing an even bigger problem: since I don't have access to
> change the underlying data, I just want to apply a schema over something
> that was written via sparkContext.newAPIHadoopRDD
>
> Basically I am reading in an RDD[JsonObject] and would like to convert it
> into a dataframe to which I pass the schema
>
> What's the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution, is it?
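>
> The only other thing I can think of is round-tripping through strings,
> something like (a rough sketch, assuming Gson's JsonObject; jsonRdd and
> mySchema are placeholder names):
>
> val jsonStrings = jsonRdd.map(_.toString)   // RDD[JsonObject] -> RDD[String]
> val df = spark.read.schema(mySchema).json(jsonStrings)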
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
> Hi Sam
> Remove the " from the numbers and it will work
>
> On Feb 4, 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
> wrote:
>
> Hi All
>
> I would like to specify a schema when reading from JSON, but when trying
> to map a number to a Double it fails. I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema, customerid is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows the data fine
>
>
> Also, FYI, I need this to be generic as I would like to apply it to any
> JSON; I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>   DoubleType, FloatType, StructType, LongType, DecimalType}
> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I can't tell what it is!
>
>
> Kind Regards
> Sam
>
>
> --
> Best Regards,
> Ayan Guha
>
