[Structured Streaming]: Structured Streaming into Redshift sink

2018-01-18 Thread Somasundaram Sekar
Is it possible to write a DataFrame backed by a Kafka streaming source into
AWS Redshift? In the past we have used
https://github.com/databricks/spark-redshift to write into Redshift, but I
presume it will not work with *writeStream*. Writing through the JDBC
connector with a ForeachWriter also may not be a good idea, given the way
Redshift works.



One possible approach that I have come across, from a Yelp blog post (
https://engineeringblog.yelp.com/2016/10/redshift-connector.html), is to
write the files into S3 and then invoke Redshift COPY (
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) with a *manifest
file listing the S3 object paths*. In the case of Structured Streaming, how
can I control the files I write to S3 and have a separate trigger create a
manifest file after writing, say, 5 files into S3?
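
A rough PySpark sketch of the file-sink half of that approach plus a manifest
builder. The Kafka broker, topic, S3 bucket, prefixes and the target table
below are all placeholder assumptions, and the COPY statement is shown only as
a comment; this is not a complete answer to the "every 5 files" trigger.

from pyspark.sql import SparkSession
import boto3, json

spark = SparkSession.builder.appName("kafka-to-s3-to-redshift").getOrCreate()

# 1) Kafka source -> S3 file sink; each trigger interval lands as files under the path
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "events")                      # assumed topic
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

query = (events.writeStream
         .format("csv")
         .option("path", "s3a://my-bucket/stream-output/")          # assumed bucket
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
         .trigger(processingTime="5 minutes")
         .start())

# 2) Separate job: build a Redshift COPY manifest from the files written so far
def build_manifest(bucket, prefix, manifest_key):
    s3 = boto3.client("s3")
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    entries = [{"url": "s3://%s/%s" % (bucket, o["Key"]), "mandatory": True}
               for o in objs if o["Key"].endswith(".csv")]
    s3.put_object(Bucket=bucket, Key=manifest_key,
                  Body=json.dumps({"entries": entries}))
    # Then, over a normal Redshift connection:
    # COPY my_table FROM 's3://my-bucket/manifests/latest.manifest'
    # IAM_ROLE '...' CSV MANIFEST;

Controlling exactly "every 5 files" is awkward with the file sink; in practice
the manifest/COPY step is driven on a schedule or per micro-batch, and on
Spark 2.4+ writeStream.foreachBatch makes it possible to issue the COPY from
inside the streaming job itself.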



Any other possible solutions are also appreciated. Thanks in advance.



Regards,

Somasundaram S



Writing to Redshift from Kafka Streaming source

2018-01-18 Thread Somasundaram Sekar
Hi,



Is it possible to write a DataFrame backed by a Kafka streaming source into
AWS Redshift? In the past we have used
https://github.com/databricks/spark-redshift to write into Redshift, but I
presume it will not work with DataFrame#writeStream(). Writing through the
JDBC connector with a ForeachWriter also may not be a good idea, given the
way Redshift works.



One possible approach that I have come across, from a Yelp blog post (
https://engineeringblog.yelp.com/2016/10/redshift-connector.html), is to
write the files into S3 and then invoke Redshift COPY (
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) with a manifest
file listing the S3 object paths. In the case of Structured Streaming, how
can I control the files I write to S3 and have a separate trigger create a
manifest file after writing, say, 5 files into S3?



Any other possible solutions are also appreciated. Thanks in advance.



Regards,

Somasundaram S



Re: learning Spark

2017-12-04 Thread Somasundaram Sekar
Learning Spark (the O'Reilly book) as a starter, plus the official documentation.

On 4 Dec 2017 9:19 am, "Manuel Sopena Ballesteros" 
wrote:

> Dear Spark community,
>
>
>
> Is there any resource (books, online courses, etc.) available that you know
> of to learn about Spark? I am interested in the sysadmin side of it: the
> different parts inside Spark, how Spark works internally, the best ways to
> install/deploy/monitor it, and how to get the best performance possible.
>
>
>
> Any suggestion?
>
>
>
> Thank you very much
>
>
>
> *Manuel Sopena Ballesteros *| Systems Engineer
> *Garvan Institute of Medical Research *
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> 
> *T:* + 61 (0)2 9355 5760 | *F:* +61 (0)2 9295 8507 | *E:*
> manuel...@garvan.org.au
>
>
>



Equivalent of Redshift ListAgg function in Spark (PySpark)

2017-10-08 Thread Somasundaram Sekar
Hi,



I want to concatenate multiple values into a single column after grouping the
DataFrame.



I am looking for a functional equivalent of the Redshift LISTAGG function:



pg_catalog.LISTAGG(column, '|') WITHIN GROUP (ORDER BY column) AS name


LISTAGG function: for each group in a query, the LISTAGG aggregate function
orders the rows for that group according to the ORDER BY expression, then
concatenates the values into a single string.
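
A possible PySpark sketch (the column names are made up for illustration):
collect_list gathers the grouped values and concat_ws joins them with '|'.
Unlike LISTAGG's WITHIN GROUP (ORDER BY ...), collect_list does not guarantee
an ordering inside the group.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", "x"), ("a", "y"), ("b", "z")],
                           ["grp", "name"])

# Roughly LISTAGG(name, '|') AS names, minus the ordering guarantee
result = df.groupBy("grp").agg(
    F.concat_ws("|", F.collect_list("name")).alias("names"))

result.show()   # grp=a -> names="x|y", grp=b -> names="z"

If the order matters, one option is to collect structs of (sort_key, value),
apply sort_array, and then join the extracted values.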


DataFrame multiple agg on the same column

2017-10-07 Thread Somasundaram Sekar
Hi,

I have a GroupedData object on which I perform aggregation of a few columns.
Since GroupedData.agg() takes a dict, I cannot perform multiple aggregates on
the same column; say I want both the max and min of amount.

So the below line of code will return only one aggregate per column

grouped_txn.agg({'*' : 'count', 'amount' : 'sum', 'amount' : 'max',
'created_time' : 'min', 'created_time' : 'max'})

What are the possible alternatives? I could define a new column that is just
a copy of the original and aggregate that, but that looks ugly. Any
suggestions?

Thanks,
Somasundaram S
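
One common alternative, shown as a sketch with made-up transaction data rather
than the original grouped_txn: pass a list of aggregate expressions from
pyspark.sql.functions to agg() instead of a dict, so the same column can be
aggregated more than once.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

txns = spark.createDataFrame(
    [(1, 10.0, "2017-10-01"), (1, 25.0, "2017-10-03"), (2, 5.0, "2017-10-02")],
    ["customer_id", "amount", "created_time"])

# Several aggregates, two of them on the same column ("amount")
summary = txns.groupBy("customer_id").agg(
    F.count("*").alias("txn_count"),
    F.sum("amount").alias("total_amount"),
    F.max("amount").alias("max_amount"),
    F.min("created_time").alias("first_txn"),
    F.max("created_time").alias("last_txn"))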


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
sc.textFile("filename").map(_.split(",")).filter(arr => arr.length == 3 &&
arr(2).toDouble > 50).collect
This will give you an Array[Array[String]]; do with it as you wish. And
please read up on RDDs.

On 5 Sep 2016 8:51 pm, "Ashok Kumar" <ashok34...@yahoo.com> wrote:

> Thanks everyone.
>
> I am not skilled like you gentlemen
>
> This is what I did
>
> 1) Read the text file
>
> val textFile = sc.textFile("/tmp/myfile.txt")
>
> 2) That produces an RDD of String.
>
> 3) Create a DF after splitting the file into an Array
>
> val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt,
> x(1).toString, x(2).toDouble)).toDF
>
> 4) Create a class for column headers
>
>  case class Columns(col1: Int, col2: String, col3: Double)
>
> 5) Assign the column headers
>
> val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString,
> p(2).toString.toDouble))
>
> 6) Only interested in column 3 > 50
>
>  h.filter(col("Col3") > 50.0)
>
> 7) Now I just want Col3 only
>
> h.filter(col("Col3") > 50.0).select("col3").show(5)
> +-----------------+
> |             col3|
> +-----------------+
> |95.42536350467836|
> |61.56297588648554|
> |76.73982017179868|
> |68.86218120274728|
> |67.64613810115105|
> +-----------------+
> only showing top 5 rows
>
> Does that make sense? Are there shorter ways, gurus? Can I just do all this
> on RDD without DF?
>
> Thanking you
>
>
>
>
>
>
>
> On Monday, 5 September 2016, 15:19, ayan guha <guha.a...@gmail.com> wrote:
>
>
> Then, You need to refer third term in the array, convert it to your
> desired data type and then use filter.
>
>
> On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>
> Hi,
> I want to filter them for values.
>
> This is what is in array
>
> 74,20160905-133143,98.11218069128827594148
>
> I want to filter anything > 50.0 in the third column
>
> Thanks
>
>
>
>
> On Monday, 5 September 2016, 15:07, ayan guha <guha.a...@gmail.com> wrote:
>
>
> Hi
>
> x.split returns an array. So, after first map, you will get RDD of arrays.
> What is your expected outcome of 2nd map?
>
> On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
> wrote:
>
> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns? I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>  | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar
> <somasundar.sekar@tigeranalytics.com> wrote:
>
>
> Basic error: you get back an RDD from transformations like map.
> sc.textFile("filename").map(x => x.split(","))
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Please have a look at the documentation for information on how to work with
RDD. Start with this http://spark.apache.org/docs/latest/quick-start.html

On 5 Sep 2016 7:00 pm, "Ashok Kumar" <ashok34...@yahoo.com> wrote:

> Thank you sir.
>
> This is what I get
>
> scala> textFile.map(x=> x.split(","))
> res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at
> map at <console>:27
>
> How can I work on individual columns? I understand they are strings
>
> scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>  | )
> <console>:27: error: value getString is not a member of Array[String]
>        textFile.map(x=> x.split(",")).map(x => (x.getString(0))
>
> regards
>
>
>
>
> On Monday, 5 September 2016, 13:51, Somasundaram Sekar
> <somasundar.sekar@tigeranalytics.com> wrote:
>
>
> Basic error: you get back an RDD from transformations like map.
> sc.textFile("filename").map(x => x.split(","))
>
> On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote:
>
> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>
>
>


Re: Splitting columns from a text file

2016-09-05 Thread Somasundaram Sekar
Basic error: you get back an RDD from transformations like map.

sc.textFile("filename").map(x => x.split(","))

On 5 Sep 2016 6:19 pm, "Ashok Kumar"  wrote:

> Hi,
>
> I have a text file as below that I read in
>
> 74,20160905-133143,98.11218069128827594148
> 75,20160905-133143,49.52776998815916807742
> 76,20160905-133143,56.08029957123980984556
> 77,20160905-133143,46.63689526544407522777
> 78,20160905-133143,84.88227141164402181551
> 79,20160905-133143,68.72408602520662115000
>
> val textFile = sc.textFile("/tmp/mytextfile.txt")
>
> Now I want to split the rows separated by ","
>
> scala> textFile.map(x=>x.toString).split(",")
> <console>:27: error: value split is not a member of
> org.apache.spark.rdd.RDD[String]
>        textFile.map(x=>x.toString).split(",")
>
> However, the above throws an error.
>
> Any ideas what is wrong or how I can do this if I can avoid converting it
> to String?
>
> Thanking
>
>


Resources for learning Spark administration

2016-09-04 Thread Somasundaram Sekar
Please suggest some good resources to learn Spark administration.


Re: Spark transformations

2016-09-04 Thread Somasundaram Sekar
Can you try this

https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-examples-gaurav-singh

On 4 Sep 2016 9:38 pm, "janardhan shetty"  wrote:

> Hi,
>
> Is there any chance that we can send multiple columns to a UDF and
> generate a new column for Spark ML?
> I see a similar approach in VectorAssembler, but I am not able to use a few
> classes/traits like HasInputCols, HasOutputCol, DefaultParamsWritable since
> they are private.
>
> Any leads/examples are appreciated in this regard.
>
> Requirement:
> *Input*: Multiple columns of a Dataframe
> *Output*:  Single new modified column
>
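
Outside the ML Pipeline/Transformer machinery the question is really about, a
multi-column UDF sketch in PySpark (the column names and combining logic are
made up for illustration) would look roughly like this:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 3.0)], ["a", "b", "c"])

# Any logic over several input columns, producing one new column
combine = F.udf(lambda a, b, c: a + b + c, DoubleType())

df.withColumn("combined", combine("a", "b", "c")).show()

Wrapping the same logic as a reusable ML Transformer would still run into the
private traits mentioned above, so this only covers the plain DataFrame side.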


Re: Importing large file with SparkContext.textFile

2016-09-03 Thread Somasundaram Sekar
If the file is not splittable (can I assume the log file is splittable,
though?), can you advise on how Spark handles such a case? And if Spark
can't, what is the widely used practice?

On 3 Sep 2016 7:29 pm, "Raghavendra Pandey" <raghavendra.pan...@gmail.com>
wrote:

If your file format is splittable, say TSV, CSV, etc., it will be distributed
across all executors.
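
A quick way to see that behaviour (the paths below are placeholders): a
splittable format comes back as many partitions, while a gzipped file
collapses to a single partition no matter what minPartitions hint is given.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

tsv = sc.textFile("hdfs:///data/big_file.tsv", minPartitions=8)
gz = sc.textFile("hdfs:///data/big_file.tsv.gz", minPartitions=8)

print(tsv.getNumPartitions())  # > 1: splittable, spread across executors
print(gz.getNumPartitions())   # 1: gzip is not splittable

The usual workaround for a non-splittable file is to repartition() right after
reading (which parallelises the downstream work, not the read itself), or to
store the data as plain text or with a splittable compression such as bzip2.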

On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar
<somasundar.sekar@tigeranalytics.com> wrote:

> Hi All,
>
>
>
> Would like to gain some understanding on the questions listed below,
>
>
>
> 1.   When processing a large file with Apache Spark, with, say,
> sc.textFile("somefile.xml"), does it split it for parallel processing
> across executors, or will it be processed as a single chunk in a single
> executor?
>
> 2.   When using DataFrames with the implicit XMLContext from Databricks,
> is there any optimization prebuilt for such large file processing?
>
>
>
> Please help!!!
>
>
>
> http://stackoverflow.com/questions/39305310/does-spark-proce
> ss-large-file-in-the-single-worker
>
>
>
> Regards,
>
> Somasundaram S
>


Importing large file with SparkContext.textFile

2016-09-03 Thread Somasundaram Sekar
Hi All,



Would like to gain some understanding on the questions listed below,



1.   When processing a large file with Apache Spark, with, say,
sc.textFile("somefile.xml"), does it split it for parallel processing
across executors, or will it be processed as a single chunk in a single
executor?

2.   When using DataFrames with the implicit XMLContext from Databricks, is
there any optimization prebuilt for such large file processing?



Please help!!!



http://stackoverflow.com/questions/39305310/does-spark-process-large-file-in-the-single-worker



Regards,

Somasundaram S