On Fri, Jun 17, 2016 at 1:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Ok, a bit of a challenge.
>
> Have you tried the Databricks stuff? It can read compressed files, and
> it might work here:
>
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("inferSchema", "true")
>   .option("header", "true")
>   .load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
>
> case class Accounts(TransactionDate: String, TransactionType: String,
>   Description: String, Value: Double, Balance: Double,
>   AccountName: String, AccountNumber: String)
>
> // Map the columns to names
> val a = df.filter(col("Date") > "").map(p =>
>   Accounts(p(0).toString, p(1).toString, p(2).toString,
>     p(3).toString.toDouble, p(4).toString.toDouble,
>     p(5).toString, p(6).toString))
>
> // Create a Spark temporary table
> a.toDF.registerTempTable("tmp")
>
> HTH

Yes, I looked at their spark-csv package -- it'd be great for CSV (or even
a large swath of delimited file formats). In some cases, though, I have
file formats that aren't delimited in a way compatible with it, so I was
rolling my own string lines => DataFrames conversion. Also, the record
formats are arbitrary, so I don't want to be restricted to a compile-time
value class -- hence the need to create the schema manually.

> On 17 June 2016 at 21:02, Everett Anderson <ever...@nuna.com> wrote:
>
>> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are these mainly in csv format?
>>
>> Alas, no -- lots of different formats. Many are fixed-width files, where
>> I have outside information about which byte ranges correspond to which
>> columns. Some have odd null representations or non-comma delimiters
>> (though many of those cases might fit within the configurability of the
>> spark-csv package).
>>
>>> On 17 June 2016 at 20:38, Everett Anderson <ever...@nuna.com.invalid>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a system with files in a variety of non-standard input formats,
>>>> though they're generally flat text files. I'd like to dynamically
>>>> create DataFrames of string columns.
>>>>
>>>> What's the best way to go from an RDD<String> to a DataFrame of
>>>> StringType columns?
>>>>
>>>> My current plan is:
>>>>
>>>> - Call map() on the RDD<String> with a function that splits each
>>>>   String into columns and calls RowFactory.create() with the resulting
>>>>   array, producing an RDD<Row>
>>>> - Construct a StructType schema using the column names and StringType
>>>> - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>>
>>>> Does that make sense?
>>>>
>>>> I looked through the spark-csv package a little and noticed that it's
>>>> using baseRelationToDataFrame(), but BaseRelation looks like it might
>>>> be a restricted developer API. Anyone know if it's recommended for
>>>> use?
>>>>
>>>> Thanks!
>>>>
>>>> - Everett
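
For reference, below is a minimal sketch of the plan described above:
map an RDD[String] to an RDD[Row], build an all-StringType schema at
runtime, and pair the two with SQLContext.createDataFrame(). It targets
the Spark 1.x shell API used elsewhere in this thread (sc, sqlContext);
the fixed-width layout, column names, and HDFS path are illustrative
assumptions, not taken from the thread.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical fixed-width layout: (column name, start offset, end offset).
// These names and byte ranges are illustrative only.
val layout = Seq(("name", 0, 10), ("city", 10, 30), ("zip", 30, 35))

// Build an all-StringType schema dynamically -- no compile-time case class.
val schema = StructType(layout.map { case (name, _, _) =>
  StructField(name, StringType, nullable = true)
})

// Slice each line into trimmed string columns and wrap them in a Row.
def parseLine(line: String): Row =
  Row.fromSeq(layout.map { case (_, start, end) =>
    if (start >= line.length) ""
    else line.substring(start, math.min(end, line.length)).trim
  })

val lines: RDD[String] = sc.textFile("hdfs:///path/to/fixed-width-data")
val rows: RDD[Row] = lines.map(parseLine)

// createDataFrame(RDD[Row], StructType) is the public API for this step,
// so there's no need to touch BaseRelation/baseRelationToDataFrame.
val df = sqlContext.createDataFrame(rows, schema)
df.registerTempTable("parsed")

Because the layout is plain data rather than a case class, the same code
can handle any record format supplied as configuration at runtime.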