Read all the columns from a file in spark sql
Hi,

I am a newbie to Spark SQL and I would like to know how to read all the columns from a file in Spark SQL. I have referred to the programming guide here:
http://people.apache.org/~tdas/spark-1.0-docs/sql-programming-guide.html

The example says:

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

But instead of explicitly specifying p(0), p(1), I would like to read all the columns from a file. That would be unwieldy if my source dataset had a larger number of columns. Is there any shortcut for that?

Also, instead of a single file, I would like to read multiple files that share a similar structure from a directory.

Could you please share your thoughts on this? It would be great if you could point me to any documentation that has details on these.

Thanks
Re: Read all the columns from a file in spark sql
I think what you might be looking for is the ability to programmatically specify the schema, which is coming in 1.1. Here's the JIRA: SPARK-2179 <https://issues.apache.org/jira/browse/SPARK-2179>

On Wed, Jul 16, 2014 at 8:24 AM, pandees waran wrote:
> ...
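For a sense of what that looks like, here is a minimal sketch of the programmatic-schema API as it stands in the SPARK-2179 patch (names could still change before the 1.1 release); the data/people/*.txt glob is a hypothetical stand-in for a directory of similarly structured CSV files:

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    // Column names could come from a header file or config rather than being hard-coded.
    val schemaString = "name age"

    // Build the schema programmatically: one StructField per column (all strings here).
    val schema = StructType(
      schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))

    // sc.textFile accepts glob patterns, so one call reads every matching file in the directory.
    val rowRDD = sc.textFile("data/people/*.txt")
      .map(_.split(","))
      .map(p => Row(p: _*))  // the whole split array becomes a Row; no explicit p(0), p(1)

    val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
    peopleSchemaRDD.registerTempTable("people")

Because the Row is built from the full split array, adding columns only means changing schemaString, not the mapping code.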
Re: Read all the columns from a file in spark sql
Hi Pandees,

You may also be helped by looking into the ability to read and write Parquet files, which is available in the present release. Parquet files allow you to store columnar data in HDFS. At present, Spark "infers" the schema from the Parquet file. In pyspark, some of the methods you'd be interested in are "parquetFile" and "inferSchema" in SQLContext, and "saveAsParquetFile" in SchemaRDD.

Hope that helps.

-Brad

On Wed, Jul 16, 2014 at 4:31 PM, Michael Armbrust wrote:
> ...
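For reference, a short sketch of the Parquet round trip in Scala (the Scala counterparts of the pyspark methods named above, following the 1.0 programming guide's Parquet example; parquetPeople is just an arbitrary table name):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicitly turns an RDD of case classes into a SchemaRDD

    case class Person(name: String, age: Int)

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    // Write out as Parquet; column names and types are stored in the file itself.
    people.saveAsParquetFile("people.parquet")

    // Read it back: every column comes along automatically, no per-column indexing needed.
    val parquetFile = sqlContext.parquetFile("people.parquet")
    parquetFile.registerAsTable("parquetPeople")
    sqlContext.sql("SELECT name FROM parquetPeople WHERE age >= 13 AND age <= 19")
      .collect()
      .foreach(println)

Since the schema travels with the data, a directory of Parquet part-files can be read back without re-specifying the columns.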