Re: Splitting columns from a text file
Just use Spark CSV; every other way of splitting and working with this is reinventing the wheel and a monumental waste of time.

Regards,
Gourav

On Mon, Sep 5, 2016 at 1:48 PM, Ashok Kumar wrote:

> Hi,
>
> I have a text file as below that I read in:
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows on the "," separator:
>
> scala> textFile.map(x => x.toString).split(",")
> <console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
>        textFile.map(x => x.toString).split(",")
>
> However, the above throws an error. Any ideas what is wrong, or how I can do this while avoiding the conversion to String?
>
> Thanks
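To make the Spark CSV suggestion concrete, a rough, untested sketch (assuming Spark 2.0+, where the CSV source is built in; on 1.x the same idea works through the external spark-csv package; the column names id/ts/value are only placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-columns").getOrCreate()
import spark.implicits._

// Read the comma-separated file straight into a DataFrame; inferSchema
// makes the third column come back as a double rather than a string.
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("/tmp/mytextfile.txt")
  .toDF("id", "ts", "value")   // placeholder column names

// Filter on the third column and keep only it.
df.filter($"value" > 50.0).select("value").show(5)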
Re: Splitting columns from a text file
sc.textFile("filename").map(_.split(",")).filter(arr => arr.length == 3 && arr(2).toDouble > 50).collect this will give you a Array[Array[String]] do as you may wish with it. And please read through abt RDD On 5 Sep 2016 8:51 pm, "Ashok Kumar"wrote: > Thanks everyone. > > I am not skilled like you gentlemen > > This is what I did > > 1) Read the text file > > val textFile = sc.textFile("/tmp/myfile.txt") > > 2) That produces an RDD of String. > > 3) Create a DF after splitting the file into an Array > > val df = textFile.map(line => line.split(",")).map(x=>(x(0). > toInt,x(1).toString,x(2).toDouble)).toDF > > 4) Create a class for column headers > > case class Columns(col1: Int, col2: String, col3: Double) > > 5) Assign the column headers > > val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, > p(2).toString.toDouble)) > > 6) Only interested in column 3 > 50 > > h.filter(col("Col3") > 50.0) > > 7) Now I just want Col3 only > > h.filter(col("Col3") > 50.0).select("col3").show(5) > +-+ > | col3| > +-+ > |95.42536350467836| > |61.56297588648554| > |76.73982017179868| > |68.86218120274728| > |67.64613810115105| > +-+ > only showing top 5 rows > > Does that make sense. Are there shorter ways gurus? Can I just do all this > on RDD without DF? > > Thanking you > > > > > > > > On Monday, 5 September 2016, 15:19, ayan guha wrote: > > > Then, You need to refer third term in the array, convert it to your > desired data type and then use filter. > > > On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar wrote: > > Hi, > I want to filter them for values. > > This is what is in array > > 74,20160905-133143,98. 11218069128827594148 > > I want to filter anything > 50.0 in the third column > > Thanks > > > > > On Monday, 5 September 2016, 15:07, ayan guha wrote: > > > Hi > > x.split returns an array. So, after first map, you will get RDD of arrays. > What is your expected outcome of 2nd map? > > On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar > wrote: > > Thank you sir. > > This is what I get > > scala> textFile.map(x=> x.split(",")) > res52: org.apache.spark.rdd.RDD[ Array[String]] = MapPartitionsRDD[27] at > map at :27 > > How can I work on individual columns. I understand they are strings > > scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0)) > | ) > :27: error: value getString is not a member of Array[String] >textFile.map(x=> x.split(",")).map(x => (x.getString(0)) > > regards > > > > > On Monday, 5 September 2016, 13:51, Somasundaram Sekar tigeranalytics.com > wrote: > > > Basic error, you get back an RDD on transformations like map. > sc.textFile("filename").map(x => x.split(",") > > On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: > > Hi, > > I have a text file as below that I read in > > 74,20160905-133143,98. 11218069128827594148 > 75,20160905-133143,49. 52776998815916807742 > 76,20160905-133143,56. 08029957123980984556 > 77,20160905-133143,46. 63689526544407522777 > 78,20160905-133143,84. 88227141164402181551 > 79,20160905-133143,68. 72408602520662115000 > > val textFile = sc.textFile("/tmp/mytextfile. txt") > > Now I want to split the rows separated by "," > > scala> textFile.map(x=>x.toString). split(",") > :27: error: value split is not a member of > org.apache.spark.rdd.RDD[ String] >textFile.map(x=>x.toString). split(",") > > However, the above throws error? > > Any ideas what is wrong or how I can do this if I can avoid converting it > to String? > > Thanking > > > > > > > -- > Best Regards, > Ayan Guha > > > > > > -- > Best Regards, > Ayan Guha > > >
Re: Splitting columns from a text file
Thanks everyone.

I am not skilled like you gentlemen.

This is what I did:

1) Read the text file

val textFile = sc.textFile("/tmp/myfile.txt")

2) That produces an RDD of String.

3) Create a DF after splitting the file into an Array

val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt, x(1).toString, x(2).toDouble)).toDF

4) Create a class for the column headers

case class Columns(col1: Int, col2: String, col3: Double)

5) Assign the column headers

val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble))

6) Only interested in column 3 > 50

h.filter(col("Col3") > 50.0)

7) Now I just want col3 only

h.filter(col("Col3") > 50.0).select("col3").show(5)
+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows

Does that make sense? Are there shorter ways, gurus? Can I just do all this on the RDD without a DF?

Thanking you

On Monday, 5 September 2016, 15:19, ayan guha wrote:

> Then you need to refer to the third term in the array, convert it to your desired data type, and then use filter.
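One shorter variant of steps 3) to 5) — an untested sketch, assuming the spark-shell where the SQL implicits are already in scope — maps straight to the case class before calling toDF, so the column names come from the case class fields and the separate header-assignment step disappears:

case class Columns(col1: Int, col2: String, col3: Double)

// Split, convert types, and name the columns in one pass.
val df = sc.textFile("/tmp/myfile.txt")
  .map(_.split(","))
  .map(a => Columns(a(0).toInt, a(1), a(2).toDouble))
  .toDF()

// Filter on col3 and project it, as in steps 6) and 7).
df.filter($"col3" > 50.0).select("col3").show(5)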
Re: Splitting columns from a text file
Then you need to refer to the third term in the array, convert it to your desired data type, and then use filter.

On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar wrote:

> Hi,
>
> I want to filter them for values. This is what is in the array:
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column.
>
> Thanks

--
Best Regards,
Ayan Guha
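In code, that suggestion comes out roughly like this (untested sketch; it uses the textFile val from above and assumes every line really has three comma-separated fields):

// Split each line, convert the third term to Double, then filter on it.
val rows = textFile
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1), a(2).toDouble))   // typed (Int, String, Double) tuples
  .filter(_._3 > 50.0)

rows.take(5).foreach(println)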
Re: Splitting columns from a text file
Ask yourself how to access the third element of an array in Scala.

On 05.09.2016 at 16:14, Ashok Kumar wrote:

> Hi,
>
> I want to filter them for values. This is what is in the array:
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column.
>
> Thanks
Re: Splitting columns from a text file
Hi,

I want to filter them for values. This is what is in the array:

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column.

Thanks

On Monday, 5 September 2016, 15:07, ayan guha wrote:

> Hi,
>
> x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome of the second map?
Re: Splitting columns from a text file
Hi,

x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome of the second map?

On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:

> Thank you sir.
>
> This is what I get:
>
> scala> textFile.map(x => x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27
>
> How can I work on individual columns? I understand they are strings.
>
> scala> textFile.map(x => x.split(",")).map(x => (x.getString(0))
>      | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x => x.split(",")).map(x => (x.getString(0))
>
> regards

--
Best Regards,
Ayan Guha
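For example, if the aim of the second map is to pick out individual columns, it could look like this (untested sketch, using the textFile val from above) — each element after the first map is an Array[String], so plain index access applies:

// First map: line -> Array[String]; second map: pick columns by index.
val firstAndThird = textFile
  .map(_.split(","))
  .map(a => (a(0), a(2)))   // keep the first and third columns as strings

firstAndThird.take(3).foreach(println)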
Re: Splitting columns from a text file
Please have a look at the documentation for information on how to work with RDDs. Start with this: http://spark.apache.org/docs/latest/quick-start.html

On 5 Sep 2016 7:00 pm, "Ashok Kumar" wrote:

> Thank you sir.
>
> This is what I get:
>
> scala> textFile.map(x => x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27
>
> How can I work on individual columns? I understand they are strings.
>
> scala> textFile.map(x => x.split(",")).map(x => (x.getString(0))
>      | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x => x.split(",")).map(x => (x.getString(0))
>
> regards
Re: Splitting columns from a text file
Thank you sir.

This is what I get:

scala> textFile.map(x => x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x => x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x => x.split(",")).map(x => (x.getString(0))

regards

On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:

> Basic error, you get back an RDD on transformations like map.
>
> sc.textFile("filename").map(x => x.split(","))
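The getString error comes from treating the array like a Row: getString is a method on org.apache.spark.sql.Row, not on Array[String]. Plain index access does the job — a small untested sketch using the same textFile:

// Elements of the split result are accessed with apply(), i.e. x(0), x(1), x(2).
val firstColumn = textFile
  .map(x => x.split(","))
  .map(x => x(0))   // first column, still a String

firstColumn.take(3).foreach(println)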
Re: Splitting columns from a text file
Basic error, you get back an RDD on transformations like map.

sc.textFile("filename").map(x => x.split(","))

On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote:

> Hi,
>
> I have a text file as below that I read in:
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows on the "," separator:
>
> scala> textFile.map(x => x.toString).split(",")
> <console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
>        textFile.map(x => x.toString).split(",")
>
> However, the above throws an error. Any ideas what is wrong, or how I can do this while avoiding the conversion to String?
>
> Thanks
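To see what that returns in the shell, something along these lines works (untested sketch) — the map gives back an RDD[Array[String]], so an action such as take is needed to actually look at the data:

val parts = sc.textFile("/tmp/mytextfile.txt").map(x => x.split(","))

// take() runs the job and brings a few rows back to the driver.
parts.take(3).foreach(arr => println(arr.mkString(" | ")))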