Re: specifing schema on dataframe

Sam Elamin Sun, 05 Feb 2017 15:15:06 -0800

I see so for the connector I need to pass in an array/list of numerical
columns?


Wouldnt it be simpler to just regex replace the numbers to remove the
quotes?


Regards
Sam

On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Specifying the schema when parsing JSON will only let you pick between
> similar datatypes (i.e should this be a short, long float, double etc).  It
> will not let you perform conversions like string <-> number.  This has to
> be done with explicit casts after the data has been loaded.
>
> I think you can make a solution that uses select or withColumn generic.
> Just load the dataframe with a "parse schema" that treats numbers as
> strings.  Then construct a list of columns that should be numbers and apply
> the necessary conversions.
>
> import org.apache.spark.sql.functions.col
> var df = spark.read.schema(parseSchema).json("...")
> numericColumns.foreach { columnName =>
>   df = df.withColumn(columnName, col(columnName).cast("long"))
> }
>
>
>
> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Thanks Micheal
>>
>> I've been spending the past few days researching this
>>
>> The problem is the generated json has double quotes on fields that are
>> numbers because the producing datastore doesn't want to lose precision
>>
>> I can change the data type true but that would be on specific to a job
>> rather than a generic streaming job. I'm writing a structured streaming
>> connector and I have the schema the generated dataframe should match.
>>
>> Unfortunately using withColumn won't help me here since the solution
>> needs to be generic
>>
>> To summarise assume I have the following json
>>
>> [{
>> "customerid": "535137",
>> "foo": "bar"
>> }]
>>
>>
>> and I know the schema should be:
>> StructType(Array(StructField("customerid",LongType,true),Str
>> uctField("foo",StringType,true)))
>>
>> Whats the best way of solving this?
>>
>> My current approach is to iterate over the JSON and identify which fields
>> are numbers and which arent then recreate the json
>>
>> But to be honest that doesnt seem like the cleanest approach, so happy
>> for advice on this
>>
>> Regards
>> Sam
>>
>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> -dev
>>>
>>> You can use withColumn to change the type after the data has been loaded
>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>> .
>>>
>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Hi Direceu
>>>
>>> Thanks your right! that did work
>>>
>>>
>>> But now im facing an even bigger problem since i dont have access to
>>> change the underlying data, I just want to apply a schema over something
>>> that was written via the sparkContext.newAPIHadoopRDD
>>>
>>> Basically I am reading in a RDD[JsonObject] and would like to convert it
>>> into a dataframe which I pass the schema into
>>>
>>> Whats the best way to do this?
>>>
>>> I doubt removing all the quotes in the JSON is the best solution is it?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>> dirceu.semigh...@gmail.com> wrote:
>>>
>>> Hi Sam
>>> Remove the " from the number that it will work
>>>
>>> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
>>> escreveu:
>>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a json but when
>>> trying to map a number to a Double it fails, I tried FloatType and IntType
>>> with no joy!
>>>
>>>
>>> When inferring the schema customer id is set to String, and I would like
>>> to cast it as Double
>>>
>>> so df1 is corrupted while df2 shows
>>>
>>>
>>> Also FYI I need this to be generic as I would like to apply it to any
>>> json, I specified the below schema as an example of the issue I am facing
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, 
>>> DoubleType,FloatType, StructType, LongType,DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>>> val df1 = 
>>> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 = 
>>> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>>
>>> Any help would be appreciated, I am sure I am missing something obvious
>>> but for the life of me I cant tell what it is!
>>>
>>>
>>> Kind Regards
>>> Sam
>>>
>>>
>>>
>>>
>

Re: specifing schema on dataframe

Reply via email to