> Ok a bit of a challenge.
> Have you tried using the Databricks stuff? They can read compressed files
> and they might work here.
> import org.apache.spark.sql.functions.col
>
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("inferSchema", "true")
>   .option("header", "true")
>   .load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
>
> case class Accounts(TransactionDate: String, TransactionType: String,
>   Description: String, Value: Double, Balance: Double,
>   AccountName: String, AccountNumber: String)
>
> // Map the columns to names
> val a = df.filter(col("Date") > "").map(p =>
>   Accounts(p(0).toString, p(1).toString, p(2).toString,
>     p(3).toString.toDouble, p(4).toString.toDouble,
>     p(5).toString, p(6).toString))
>
> // Create a Spark temporary table
> a.toDF.registerTempTable("tmp")

Yes, I looked at their spark-csv package -- it'd be great for CSV (or even
a large swath of delimited file formats). Some of my file formats aren't
delimited in a way that's compatible with it, though, so I was rolling my
own conversion from string lines to DataFrames.
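
As a rough sketch of that kind of hand-rolled split for a fixed-width
format: the helper name sliceFixedWidth and the column offsets below are
made up for illustration, and the code slices by character offset, which
only lines up with byte ranges for single-byte encodings.

```scala
// Hypothetical helper: cut one fixed-width line into string columns.
// `ranges` holds (start, end) character offsets, end exclusive.
def sliceFixedWidth(line: String, ranges: Seq[(Int, Int)]): Seq[String] =
  ranges.map { case (start, end) =>
    // Clamp to the line length so short lines yield empty strings
    // instead of throwing StringIndexOutOfBoundsException.
    val s = math.min(start, line.length)
    val e = math.min(end, line.length)
    line.substring(s, e).trim
  }

// Made-up layout: columns at offsets 0-8, 8-12 and 12-22.
val layout = Seq((0, 8), (8, 12), (12, 22))
val cols = sliceFixedWidth("20160617DEB 000123.45", layout)
```

Trimming each field is a choice, not a requirement; formats where padding
is significant would want to drop the trim.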

Also, the record formats are arbitrary, and I don't want to restrict myself
to a class defined at compile time, hence the need to create the schema
manually.
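
A minimal sketch of building the schema at runtime, assuming the Spark 1.x
SQLContext API used elsewhere in this thread. The helper name
linesToDataFrame and its parameters are hypothetical; the column names and
the split function would come from whatever describes the record format.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: build an all-string DataFrame from raw lines.
// `columnNames` and `split` are supplied at runtime, so no case class
// is needed.
def linesToDataFrame(
    sqlContext: SQLContext,
    lines: RDD[String],
    columnNames: Seq[String],
    split: String => Seq[String]): DataFrame = {
  // One nullable StringType field per column name.
  val schema = StructType(
    columnNames.map(name => StructField(name, StringType, nullable = true)))
  // Each line becomes a Row; split is expected to return exactly
  // columnNames.length values per line.
  val rows = lines.map(line => Row.fromSeq(split(line)))
  sqlContext.createDataFrame(rows, schema)
}
```

For a pipe-delimited file, the split argument could be something like
`_.split("\\|", -1).toSeq`, where the -1 limit keeps trailing empty
columns. Lines that produce the wrong number of values would only fail
once the DataFrame is evaluated, so validating inside split is probably
worthwhile.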

On 17 June 2016 at 21:02, Everett Anderson wrote:
On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh wrote:
>>> Are these mainly in csv format?
>> Alas, no -- lots of different formats. Many are fixed width files, where
>> I have outside information to know which byte ranges correspond to which
>> columns. Some have odd null representations or non-comma delimiters (though
>> many of those cases might fit within the configurability of the spark-csv
>> package).
On 17 June 2016 at 20:38, Everett Anderson wrote:
>>>> Hi,
>>>> I have a system with files in a variety of non-standard input formats,
>>>> though they're generally flat text files. I'd like to dynamically create
>>>> DataFrames of string columns.
>>>> What's the best way to go from an RDD<String> to a DataFrame of
>>>> StringType columns?
>>>> My current plan is
>>>>    - Call map() on the RDD<String> with a function to split the String
>>>>    into columns and call RowFactory.create() with the resulting array,
>>>>    creating an RDD<Row>
>>>>    - Construct a StructType schema using column names and StringType
>>>>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>> Does that make sense?
>>>> I looked through the spark-csv package a little and noticed that it's
>>>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>>>> restricted developer API. Anyone know if it's recommended for use?
>>>> Thanks!
>>>> - Everett

