On Fri, Jun 17, 2016 at 1:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Ok, a bit of a challenge.
>
> Have you tried the Databricks stuff? It can read compressed files, and
> it might work here:
>
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("inferSchema", "true")
>   .option("header", "true")
>   .load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
>
> case class Accounts(TransactionDate: String, TransactionType: String,
>   Description: String, Value: Double, Balance: Double,
>   AccountName: String, AccountNumber: String)
>
> // Map the columns to names
> val a = df.filter(col("Date") > "").map(p =>
>   Accounts(p(0).toString, p(1).toString, p(2).toString,
>     p(3).toString.toDouble, p(4).toString.toDouble,
>     p(5).toString, p(6).toString))
>
> // Create a Spark temporary table
> a.toDF.registerTempTable("tmp")
>
> HTH

Yes, I looked at their spark-csv package -- it'd be great for CSV (or even
a large swath of delimited file formats). In some cases, though, I have
file formats that aren't delimited in a way compatible with it, so I was
rolling my own string lines => DataFrames conversion. Also, the record
formats are arbitrary, so I don't want to be restricted to a compile-time
value class -- hence the need to create the schema manually.

> On 17 June 2016 at 21:02, Everett Anderson <ever...@nuna.com> wrote:
>
>> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are these mainly in csv format?
>>
>> Alas, no -- lots of different formats. Many are fixed-width files, where
>> I have outside information about which byte ranges correspond to which
>> columns. Some have odd null representations or non-comma delimiters
>> (though many of those cases might fit within the configurability of the
>> spark-csv package).
>>
>>> On 17 June 2016 at 20:38, Everett Anderson <ever...@nuna.com.invalid>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a system with files in a variety of non-standard input formats,
>>>> though they're generally flat text files. I'd like to dynamically
>>>> create DataFrames of string columns.
>>>>
>>>> What's the best way to go from an RDD<String> to a DataFrame of
>>>> StringType columns?
>>>>
>>>> My current plan is:
>>>>
>>>> - Call map() on the RDD<String> with a function that splits each
>>>>   String into columns and calls RowFactory.create() with the resulting
>>>>   array, producing an RDD<Row>
>>>> - Construct a StructType schema using the column names and StringType
>>>> - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>>
>>>> Does that make sense?
>>>>
>>>> I looked through the spark-csv package a little and noticed that it's
>>>> using baseRelationToDataFrame(), but BaseRelation looks like it might
>>>> be a restricted developer API. Anyone know if it's recommended for
>>>> use?
>>>>
>>>> Thanks!
>>>>
>>>> - Everett
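
For reference, below is a minimal sketch of the plan described above:
map an RDD[String] to an RDD[Row], build an all-StringType schema at
runtime, and pair the two with SQLContext.createDataFrame(). It targets
the Spark 1.x shell API used elsewhere in this thread (sc, sqlContext);
the fixed-width layout, column names, and HDFS path are illustrative
assumptions, not taken from the thread.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical fixed-width layout: (column name, start offset, end offset).
// These names and byte ranges are illustrative only.
val layout = Seq(("name", 0, 10), ("city", 10, 30), ("zip", 30, 35))

// Build an all-StringType schema dynamically -- no compile-time case class.
val schema = StructType(layout.map { case (name, _, _) =>
  StructField(name, StringType, nullable = true)
})

// Slice each line into trimmed string columns and wrap them in a Row.
def parseLine(line: String): Row =
  Row.fromSeq(layout.map { case (_, start, end) =>
    if (start >= line.length) ""
    else line.substring(start, math.min(end, line.length)).trim
  })

val lines: RDD[String] = sc.textFile("hdfs:///path/to/fixed-width-data")
val rows: RDD[Row] = lines.map(parseLine)

// createDataFrame(RDD[Row], StructType) is the public API for this step,
// so there's no need to touch BaseRelation/baseRelationToDataFrame.
val df = sqlContext.createDataFrame(rows, schema)
df.registerTempTable("parsed")

Because the layout is plain data rather than a case class, the same code
can handle any record format supplied as configuration at runtime.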