Re: Splitting columns from a text file

2016-09-05 Thread Gourav Sengupta
Just use Spark CSV; all other ways of splitting are just reinventing the
wheel and a monumental waste of time.
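
For example, a minimal sketch, assuming Spark 2.x (where the CSV reader is
built in) and the three-column layout from the post below; the column names
are invented for illustration:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("ts", StringType),
  StructField("value", DoubleType)))

val df = spark.read.schema(schema).csv("/tmp/mytextfile.txt")
df.filter(df("value") > 50.0).select("value").show(5)

On Spark 1.x the same idea is available through the spark-csv package
(sqlContext.read.format("com.databricks.spark.csv")).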


Regards,
Gourav

On Mon, Sep 5, 2016 at 1:48 PM, Ashok Kumar wrote:

> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong, or how I can do this while avoiding the
> conversion to String?
>
> Thanking
>
>


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
sc.textFile("filename").map(_.split(",")).filter(arr => arr.length == 3 &&
arr(2).toDouble > 50).collect

This will give you an Array[Array[String]]; do as you wish with it. And
please read up on RDDs.
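
A commented sketch of that same pipeline, assuming the file layout from the
original post:

val rows = sc.textFile("/tmp/mytextfile.txt")
  .map(_.split(","))                 // RDD[Array[String]]
  .filter(arr => arr.length == 3 &&  // guard against short rows
          arr(2).toDouble > 50)      // third column, compared as Double
rows.map(_(2)).collect().foreach(println)  // print just the third column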

On 5 Sep 2016 8:51 pm, "Ashok Kumar"  wrote:

> Thanks everyone.
>
> I am not skilled like you gentlemen.
>
> This is what I did
>
> 1) Read the text file
>
> val textFile = sc.textFile("/tmp/myfile.txt")
>
> 2) That produces an RDD of String.
>
> 3) Create a DF after splitting the file into an Array
>
> val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt,
> x(1).toString, x(2).toDouble)).toDF
>
> 4) Create a class for column headers
>
>  case class Columns(col1: Int, col2: String, col3: Double)
>
> 5) Assign the column headers
>
> val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString,
> p(2).toString.toDouble))
>
> 6) Only interested in column 3 > 50
>
>  h.filter(col("Col3") > 50.0)
>
> 7) Now I just want Col3 only
>
> h.filter(col("Col3") > 50.0).select("col3").show(5)
> +-----------------+
> |             col3|
> +-----------------+
> |95.42536350467836|
> |61.56297588648554|
> |76.73982017179868|
> |68.86218120274728|
> |67.64613810115105|
> +-----------------+
> only showing top 5 rows
>
> Does that make sense? Are there shorter ways, gurus? Can I just do all
> this on an RDD without a DF?
>
> Thanking you
>
>
>
>
>
>
>
> On Monday, 5 September 2016, 15:19, ayan guha  wrote:
>
>
> Then you need to refer to the third term in the array, convert it to your
> desired data type, and then use filter.
>
>
> On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar  wrote:
>
> Hi,
> I want to filter them for values.
>
> This is what is in the array:
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
>
>
>
>
> On Monday, 5 September 2016, 15:07, ayan guha  wrote:
>
>
> Hi
>
> x.split returns an array. So, after the first map, you will get an RDD of
> arrays. What is your expected outcome of the 2nd map?
>
> On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:
>
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns? I understand they are strings.
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>      | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:
>
>
> Basic error: you get back an RDD from transformations like map.
> sc.textFile("filename").map(x => x.split(","))
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong, or how I can do this while avoiding the
> conversion to String?
>
> Thanking
>
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Thanks everyone.

I am not skilled like you gentlemen.

This is what I did:

1) Read the text file

val textFile = sc.textFile("/tmp/myfile.txt")

2) That produces an RDD of String.

3) Create a DF after splitting the file into an Array

val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt,
x(1).toString, x(2).toDouble)).toDF

4) Create a class for column headers

case class Columns(col1: Int, col2: String, col3: Double)

5) Assign the column headers

val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString,
p(2).toString.toDouble))

6) Only interested in column 3 > 50

h.filter(col("Col3") > 50.0)

7) Now I just want Col3 only

h.filter(col("Col3") > 50.0).select("col3").show(5)
+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows

Does that make sense? Are there shorter ways, gurus? Can I just do all this
on an RDD without a DF?
Thanking you
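
One shorter route, as a sketch: define the case class first and convert the
split lines straight to a DataFrame (this assumes a spark-shell session with
the implicits in scope, i.e. import sqlContext.implicits._ on Spark 1.x or
spark.implicits._ on 2.x; the column names are the invented ones above):

case class Columns(col1: Int, col2: String, col3: Double)

val df = sc.textFile("/tmp/myfile.txt")
  .map(_.split(","))                                   // RDD[Array[String]]
  .map(a => Columns(a(0).toInt, a(1), a(2).toDouble))  // typed rows
  .toDF()                                              // named columns from the case class

df.filter(df("col3") > 50.0).select("col3").show(5)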




 

On Monday, 5 September 2016, 15:19, ayan guha  wrote:
 

Then you need to refer to the third term in the array, convert it to your
desired data type, and then use filter.

On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar  wrote:

Hi,

I want to filter them for values.

This is what is in the array:

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column.

Thanks

 

On Monday, 5 September 2016, 15:07, ayan guha  wrote:
 

Hi

x.split returns an array. So, after the first map, you will get an RDD of
arrays. What is your expected outcome of the 2nd map?
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:

Thank you sir.

This is what I get:

scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))

regards

 

On Monday, 5 September 2016, 13:51, Somasundaram Sekar  wrote:
 

Basic error: you get back an RDD from transformations like map.

sc.textFile("filename").map(x => x.split(","))
On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

Hi,

I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of
org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")

However, the above throws an error.

Any ideas what is wrong, or how I can do this while avoiding the conversion
to String?
Thanking



   



-- 
Best Regards,
Ayan Guha


   



-- 
Best Regards,
Ayan Guha


   

Re: Splitting columns from a text file

2016-09-05 Thread ayan guha
Then you need to refer to the third term in the array, convert it to your
desired data type, and then use filter.
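
A one-line sketch of that suggestion, reusing the names from the thread:

textFile.map(_.split(","))       // RDD[Array[String]]
  .filter(_(2).toDouble > 50.0)  // third element, converted to Double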


On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar  wrote:

> Hi,
> I want to filter them for values.
>
> This is what is in the array:
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
>
>
>
>
> On Monday, 5 September 2016, 15:07, ayan guha  wrote:
>
>
> Hi
>
> x.split returns an array. So, after the first map, you will get an RDD of
> arrays. What is your expected outcome of the 2nd map?
>
> On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:
>
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns? I understand they are strings.
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>      | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:
>
>
> Basic error: you get back an RDD from transformations like map.
> sc.textFile("filename").map(x => x.split(","))
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong, or how I can do this while avoiding the
> conversion to String?
>
> Thanking
>
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Splitting columns from a text file

2016-09-05 Thread Fridtjof Sander

Ask yourself how to access the third element in an array in Scala.
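
For instance, a quick sketch in the Scala REPL, using the sample row quoted
below:

val arr = "74,20160905-133143,98.11218069128827594148".split(",")
arr(2)           // "98.11218069128827594148": the third element, still a String
arr(2).toDouble  // the same value as a Double, ready for a numeric filter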


On 05.09.2016 at 16:14, Ashok Kumar wrote:

Hi,
I want to filter them for values.

This is what is in the array:

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column

Thanks




On Monday, 5 September 2016, 15:07, ayan guha  wrote:


Hi

x.split returns an array. So, after the first map, you will get an RDD of
arrays. What is your expected outcome of the 2nd map?


On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:


Thank you sir.

This is what I get

scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))

regards




On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:


Basic error: you get back an RDD from transformations like map.
sc.textFile("filename").map(x => x.split(","))

On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote:

Hi,

I have a text file as below that I read in

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of
org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")

However, the above throws an error.

Any ideas what is wrong, or how I can do this while avoiding the conversion
to String?

Thanking






--
Best Regards,
Ayan Guha






Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Hi,

I want to filter them for values.

This is what is in the array:

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column.

Thanks

 

On Monday, 5 September 2016, 15:07, ayan guha  wrote:
 

Hi

x.split returns an array. So, after the first map, you will get an RDD of
arrays. What is your expected outcome of the 2nd map?
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:

Thank you sir.

This is what I get:

scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))

regards

 

On Monday, 5 September 2016, 13:51, Somasundaram Sekar  wrote:
 

Basic error: you get back an RDD from transformations like map.

sc.textFile("filename").map(x => x.split(","))
On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

Hi,

I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of
org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")

However, the above throws an error.

Any ideas what is wrong, or how I can do this while avoiding the conversion
to String?

Thanking



   



-- 
Best Regards,
Ayan Guha


   

Re: Splitting columns from a text file

2016-09-05 Thread ayan guha
Hi

x.split returns an array. So, after the first map, you will get an RDD of
arrays. What is your expected outcome of the 2nd map?
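
For example, a sketch of what the second map could do with each array (field
positions assumed from the sample file):

val parts = textFile.map(_.split(","))             // RDD[Array[String]]
parts.map(arr => (arr(0).toInt, arr(2).toDouble))  // pick fields and convert types
  .take(3)
  .foreach(println)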

On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote:

> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns? I understand they are strings.
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>      | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:
>
>
> Basic error: you get back an RDD from transformations like map.
> sc.textFile("filename").map(x => x.split(","))
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong, or how I can do this while avoiding the
> conversion to String?
>
> Thanking
>
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Please have a look at the documentation for information on how to work with
RDDs. Start with this: http://spark.apache.org/docs/latest/quick-start.html
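
As for the error quoted below: getString is a method on Row, not on Array,
so with an Array[String] you simply index it. A minimal sketch:

textFile.map(_.split(","))  // RDD[Array[String]]
  .map(arr => arr(0))       // first column: use arr(0), not getString(0)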

On 5 Sep 2016 7:00 pm, "Ashok Kumar"  wrote:

> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns? I understand they are strings.
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>      | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:
>
>
> Basic error: you get back an RDD from transformations like map.
> sc.textFile("filename").map(x => x.split(","))
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong, or how I can do this while avoiding the
> conversion to String?
>
> Thanking
>
>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Thank you sir.

This is what I get:

scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))

regards

 

On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote:
 

Basic error: you get back an RDD from transformations like map.

sc.textFile("filename").map(x => x.split(","))
On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

Hi,

I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of
org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")

However, the above throws an error.

Any ideas what is wrong, or how I can do this while avoiding the conversion
to String?
Thanking



   

Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Basic error: you get back an RDD from transformations like map.

sc.textFile("filename").map(x => x.split(","))
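
A minimal sketch of why: split is a String method, so it must be applied per
line inside map rather than on the RDD itself:

val textFile = sc.textFile("/tmp/mytextfile.txt")  // RDD[String]
val parts    = textFile.map(_.split(","))          // RDD[Array[String]]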

On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong, or how I can do this while avoiding the
> conversion to String?
>
> Thanking
>
>