Re: is it ok to make I/O calls in UDF? In other words, is it a standard practice?

2018-04-23 Thread Sathish Kumaran Vairavelu
I have made a simple REST call within a UDF and it worked, but I am not sure if it can be applied to large datasets; it may work for small lookup files. Thanks On Mon, Apr 23, 2018 at 4:28 PM kant kodali wrote: > Hi All, > > Is it ok to make I/O calls in UDF? other words is it a
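For the small-lookup case, a minimal sketch in Scala of what such a UDF could look like (the endpoint, column names, and the DataFrame df are assumptions, not from the thread; the HTTP call runs once per row, so it will not scale to large datasets):

    import org.apache.spark.sql.functions.{col, udf}
    import scala.io.Source
    import scala.util.Try

    // Hypothetical lookup endpoint; a failed request yields null instead of failing the task.
    val restLookup = udf { (key: String) =>
      Try {
        val src = Source.fromURL(s"http://lookup-service/api/$key")
        try src.mkString finally src.close()
      }.getOrElse(null)
    }

    val enriched = df.withColumn("lookup_value", restLookup(col("key")))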

Re: Spark querying C* in Scala

2018-01-22 Thread Sathish Kumaran Vairavelu
You have to register the Cassandra table in Spark as a DataFrame: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md Thanks Sathish On Mon, Jan 22, 2018 at 7:43 AM Conconscious wrote: > Hi list, > > I have a Cassandra table with two
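A minimal sketch of that registration (keyspace and table names are placeholders; assumes the spark-cassandra-connector is on the classpath, the connection host is configured, and spark is an existing SparkSession):

    // Load the Cassandra table as a DataFrame, then expose it to SQL.
    val cassandraDf = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    cassandraDf.createOrReplaceTempView("my_table")
    spark.sql("SELECT * FROM my_table LIMIT 10").show()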

Re: PySpark - Expand rows into dataframes via function

2017-10-03 Thread Sathish Kumaran Vairavelu
= > (processed_rdds.toDF().withColumnRenamed('_1','ip').withColumnRenamed('_2','registryid')) > > And then after that I split and subset the IP column into what I wanted. > > On Mon, Oct 2, 2017 at 7:52 PM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wrote

Re: PySpark - Expand rows into dataframes via function

2017-10-02 Thread Sathish Kumaran Vairavelu
It's possible with the array function combined with the struct constructor. Below is a SQL example: select array(struct(ip1, hashkey), struct(ip2, hashkey)) from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc, hashkey from object) a If you want dynamic IP ranges, you need to dynamically
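The same idea expressed with the DataFrame API and expanded into one row per IP via explode (a sketch; the source DataFrame df and the column names follow the SQL above and are otherwise assumptions):

    import org.apache.spark.sql.functions.{array, col, explode, struct, substring}

    val expanded = df
      .select(
        substring(col("col1"), 1, 2).as("ip1"),
        substring(col("col1"), 3, 3).as("ip2"),
        col("hashkey"))
      // one struct per IP, gathered into an array and exploded into separate rows
      .select(explode(array(
        struct(col("ip1").as("ip"), col("hashkey")),
        struct(col("ip2").as("ip"), col("hashkey")))).as("r"))
      .select(col("r.ip"), col("r.hashkey"))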

Re: spark.write.csv is not able to write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-11 Thread Sathish Kumaran Vairavelu
I think you can collect the results in the driver through the toLocalIterator method of the RDD and save the result from the driver program, rather than writing it to a file on the local disk and collecting it separately. If your data is small enough and you have enough cores/memory, try processing
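A minimal sketch of the toLocalIterator approach (output path and CSV formatting are assumptions); the driver pulls one partition at a time instead of collecting everything at once:

    import java.io.PrintWriter

    val writer = new PrintWriter("/tmp/result.csv")
    try {
      // toLocalIterator streams partitions to the driver one at a time
      df.rdd.toLocalIterator.foreach(row => writer.println(row.mkString(",")))
    } finally {
      writer.close()
    }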

Re: Does Spark SQL use Calcite?

2017-08-10 Thread Sathish Kumaran Vairavelu
I think it is for the Hive dependency. On Thu, Aug 10, 2017 at 4:14 PM kant kodali <kanth...@gmail.com> wrote: > Since I see a calcite dependency in Spark I wonder where Calcite is being > used? > > On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu < > vsathish

Re: Does Spark SQL use Calcite?

2017-08-10 Thread Sathish Kumaran Vairavelu
Spark SQL doesn't use Calcite. On Thu, Aug 10, 2017 at 3:14 PM kant kodali wrote: > Hi All, > > Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has > Catalyst, which generates its own logical plans, physical plans and > other optimizations. > >

Re: Spark Streaming: Async action scheduling inside foreachRDD

2017-08-04 Thread Sathish Kumaran Vairavelu
A ForkJoinPool with task support would help in this case, where you can create a thread pool with a configured number of threads (make sure you have enough cores) and submit jobs, i.e. actions, to the pool. On Fri, Aug 4, 2017 at 8:54 AM Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Did you
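A minimal sketch of that pattern for a streaming job (Scala 2.12-style parallel collections; the paths, the actions, and the pool size are assumptions). Caching the RDD first would avoid recomputing it for each action:

    import java.util.concurrent.ForkJoinPool
    import scala.collection.parallel.ForkJoinTaskSupport

    stream.foreachRDD { rdd =>
      val outputs = Seq("/out/path1", "/out/path2").par
      outputs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
      // each saveAsTextFile is a separate Spark action, submitted concurrently from the pool
      outputs.foreach(path => rdd.saveAsTextFile(path))
    }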

Re: some Ideas on expressing Spark SQL using JSON

2017-07-26 Thread Sathish Kumaran Vairavelu
sql function was misspelled when using the > dsl as opposed to the plain sql string which is only parsed at runtime. > Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> wrote on Tue, 25 > Jul 2017 at 23:42: > >> Just a thought. SQL itself is a DSL. Why a DSL on top of another DSL? >

Re: some Ideas on expressing Spark SQL using JSON

2017-07-25 Thread Sathish Kumaran Vairavelu
Just a thought. SQL itself is a DSL. Why a DSL on top of another DSL? On Tue, Jul 25, 2017 at 4:47 AM kant kodali wrote: > Hi All, > > I am thinking of expressing Spark SQL using JSON in the following way. > > For Example: > > *Query using Spark DSL* > >

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-04-02 Thread Sathish Kumaran Vairavelu
Please let me know if anybody has any thoughts on this issue. On Thu, Mar 30, 2017 at 10:37 PM Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Also, is it possible to cache the logical plan and parsed query so that in > subsequent executions they can be reused? It would imp

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Also, is it possible to cache the logical plan and parsed query so that in subsequent executions they can be reused? It would improve overall query performance, particularly in streaming jobs. On Thu, Mar 30, 2017 at 10:06 PM Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hi Ay

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
to avoid > such scenarios > > On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wrote: > > Hi Everyone, > > I have complex SQL with approx 2000 lines of code and works with 50+ > tables with 50+ left joins and transfor

Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Everyone, I have a complex SQL query with approx. 2000 lines of code that works with 50+ tables, 50+ left joins, and transformations. All the tables are fully cached in memory with sufficient storage memory and working memory. The issue is that after the launch of the query for execution, the query

Re: Spark Job trigger in production

2016-07-20 Thread Sathish Kumaran Vairavelu
If you are using Mesos, then you can use Chronos or Marathon. On Wed, Jul 20, 2016 at 6:08 AM Rabin Banerjee wrote: > ++ crontab :) > > On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich > wrote: > >> Another option is Oozie with the spark action: >>

Re: Best practice for handing tables between pipeline components

2016-06-27 Thread Sathish Kumaran Vairavelu
Alluxio off-heap memory would help to share cached objects. On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson wrote: > Hi, > > We have a pipeline of components strung together via Airflow running on > AWS. Some of them are implemented in Spark, but some aren't. Generally

Re: Spark 1.5 on Mesos

2016-03-02 Thread Sathish Kumaran Vairavelu
" >> due to too many failures; is Spark installed on it? >> WARN TaskSchedulerImpl: Initial job has not accepted any resources; >> check your cluster UI to ensure that workers are registered and have >> sufficient resources >> >> >> On Mon

Re: Passing multiple jar files to spark-shell

2016-02-14 Thread Sathish Kumaran Vairavelu
--jars takes comma-separated values. On Sun, Feb 14, 2016 at 5:35 PM Mich Talebzadeh wrote: > Hi, > > > > Is there any way one can pass multiple --driver-class-path and multiple > --jars to spark-shell? > > > > For example something as below with two jar file entries for

Re: Spark, Mesos, Docker and S3

2016-01-29 Thread Sathish Kumaran Vairavelu
all > docker options to spark. > > -Mao > > On Thu, Jan 28, 2016 at 1:55 PM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wrote: > >> Thank you, I figured it out. I have set executor memory to minimal and >> it works. >> >>

Re: Spark, Mesos, Docker and S3

2016-01-28 Thread Sathish Kumaran Vairavelu
fine. > > Best, > Mao > > On Wed, Jan 27, 2016 at 5:00 PM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wrote: > >> Hi, >> >> On the same Spark/Mesos/Docker setup, I am getting warning "Initial Job >> has not accepted any resourc

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Sathish Kumaran Vairavelu
lp. I have updated both docker.properties and spark-defaults.conf with spark.mesos.executor.docker.image and other properties. Thanks Sathish On Wed, Jan 27, 2016 at 9:58 AM Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Thanks a lot for your info! I will try this toda
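For reference, a minimal sketch of setting those properties programmatically (the image name and memory value are placeholders; the same keys can equally go in spark-defaults.conf or docker.properties):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.mesos.executor.docker.image", "myrepo/spark-mesos:latest") // placeholder image
      .set("spark.executor.memory", "512m")                                  // keep within the Mesos offers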

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Sathish Kumaran Vairavelu
file will take effect so that the driver can access the protected s3 > files. > > Similarly, Mesos slaves also run Spark executor docker container in > --net=host mode, so that the AWS profile of Mesos slaves will take effect. > > Hope it helps, > Mao > > On

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Sathish Kumaran Vairavelu
Hi Mao, I want to check on accessing S3 from the Spark docker container in Mesos. The EC2 instance that I am using has the AWS profile/IAM role included. Should we build the docker image with any AWS profile settings, or does the --net=host docker option take care of it? Please help. Thanks Sathish On Tue, Jan 26,

Re: Docker/Mesos with Spark

2016-01-19 Thread Sathish Kumaran Vairavelu
Hi Tim, Do you have any materials/blogs on running Spark in a container in a Mesos cluster environment? I have googled it but couldn't find info on it. The Spark documentation says it is possible, but no details are provided. Please help. Thanks Sathish On Mon, Sep 21, 2015 at 11:54 AM Tim Chen

Re: Docker/Mesos with Spark

2016-01-19 Thread Sathish Kumaran Vairavelu
> have been doing this for our DCOS spark for our past releases and has been > working well so far. > > Thanks! > > Tim > > On Tue, Jan 19, 2016 at 12:28 PM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wrote: > >> Hi Tim >> >>

Re: Can a tempTable registered by sqlContext be used inside a forEachRDD?

2016-01-03 Thread Sathish Kumaran Vairavelu
I think you can use foreachPartition instead of foreachRDD. Sathish On Sun, Jan 3, 2016 at 5:51 AM SRK wrote: > Hi, > > Can a tempTable registered in sqlContext be used to query inside forEachRDD > as shown below? > My requirement is that I have a set of data in the

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread Sathish Kumaran Vairavelu
I think you can use mapPartitions, which returns a pair RDD, followed by foreachPartition for saving it. On Wed, Nov 18, 2015 at 9:31 AM swetha kasireddy wrote: > Looks like I can use mapPartitions but can it be done using > forEachPartition? > > On Tue, Nov 17, 2015 at
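A minimal sketch of that two-step pattern (assumes an input RDD[String] named rdd; the parsing logic and the sink are placeholders):

    // Build the pair RDD...
    val pairs = rdd.mapPartitions { iter =>
      iter.map { line =>
        val fields = line.split(",")
        (fields(0), fields(1))
      }
    }

    // ...then save it, doing any per-partition setup (e.g. opening a connection) once per partition.
    pairs.foreachPartition { iter =>
      iter.foreach { case (k, v) => println(s"$k -> $v") } // replace with the real sink
    }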

Re: JDBC thrift server

2015-10-08 Thread Sathish Kumaran Vairavelu
Which version of Spark are you using? You might encounter SPARK-6882 if Kerberos is enabled. -Sathish On Thu, Oct 8, 2015 at 10:46 AM Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > We’ve been using the JDBC thrift server

Reading Hive Tables using SQLContext

2015-09-24 Thread Sathish Kumaran Vairavelu
Hello, Is it possible to access Hive tables directly from SQLContext instead of HiveContext? I am facing errors while doing it. Please let me know. Thanks Sathish

Re: Reading Hive Tables using SQLContext

2015-09-24 Thread Sathish Kumaran Vairavelu
Thanks Michael. I just want to check if there is a roadmap to support Hive tables from SQLContext. -Sathish On Thu, Sep 24, 2015 at 7:46 PM Michael Armbrust <mich...@databricks.com> wrote: > No, you have to use a HiveContext. > > On Thu, Sep 24, 2015 at 2:47 PM, Sathish Ku
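A minimal Spark 1.x sketch of the HiveContext route (the table name is a placeholder; sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.sql("SELECT * FROM my_hive_table")
    df.show()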

Re: Best way to import data from Oracle to Spark?

2015-09-10 Thread Sathish Kumaran Vairavelu
I guess a Data Pump export from Oracle could be a fast option. Hive now has an Oracle Data Pump SerDe: https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm On Wed, Sep 9, 2015 at 4:41 AM Reynold Xin wrote: > Using the JDBC data source is probably the best way. >

Re: How to set environment of worker applications

2015-08-23 Thread Sathish Kumaran Vairavelu
spark-env.sh works for me in Spark 1.4, but not spark.executor.extraJavaOptions. On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I think the only way to pass environment variables to worker nodes is to write them in the spark-env.sh file on each worker node.
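Besides spark-env.sh, another option that may apply is setting executor environment variables on the SparkConf from the driver (variable name and value are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("env-example")
      .setExecutorEnv("MY_ENV_VAR", "some-value") // propagated to each executor's environment
    val sc = new SparkContext(conf)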

Re: How do we control output part files created by Spark job?

2015-07-06 Thread Sathish Kumaran Vairavelu
Try the coalesce function to limit the number of part files. On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote: Hi, I have a couple of Spark jobs which process thousands of files every day. File sizes may vary from MBs to GBs. After finishing the job I usually save using the following code
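A minimal sketch of the coalesce approach with the newer DataFrame writer API (output path and partition count are assumptions); the same idea applies to an RDD via rdd.coalesce(n).saveAsTextFile(...):

    df.coalesce(8)          // cap the output at 8 part files
      .write
      .mode("overwrite")
      .csv("/output/path")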

Re: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException in spark with mysql database

2015-07-06 Thread Sathish Kumaran Vairavelu
Try including an alias in the query: val query = "(select * from " + table + ") a" On Mon, Jul 6, 2015 at 3:38 AM Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi! I am trying to load data from my MySQL database using the following code: val query = "select * from " + table; val url = "jdbc:mysql://" +
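A sketch of passing the aliased subquery as the dbtable option of the JDBC source (Spark 1.4+ reader syntax; connection details are placeholders and table is the table name from the original code):

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:mysql://host:3306/mydb?user=u&password=p",
      "driver"  -> "com.mysql.jdbc.Driver",
      "dbtable" -> s"(select * from $table) a" // MySQL requires an alias on a derived table
    )).load()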

Re: Spark SQL JDBC Source data skew

2015-06-25 Thread Sathish Kumaran Vairavelu
Can someone help me here, please? On Sat, Jun 20, 2015 at 9:54 AM Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hi, In the Spark SQL JDBC data source there is an option to specify upper/lower bounds and the number of partitions. How does Spark handle data distribution if we do not give

Spark SQL JDBC Source data skew

2015-06-20 Thread Sathish Kumaran Vairavelu
Hi, In the Spark SQL JDBC data source there is an option to specify upper/lower bounds and the number of partitions. How does Spark handle data distribution if we do not give the upper/lower bounds or number of partitions? Will all the data from the external data source be skewed onto one executor? In many situations, we do not
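For context, a sketch of the partitioned JDBC read (Spark 1.4+ reader syntax; connection details, column, and bounds are placeholders). When partitionColumn/lowerBound/upperBound/numPartitions are omitted, the source is read through a single partition, i.e. one task on one executor:

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://host:5432/db",
      "dbtable"         -> "public.orders",
      "partitionColumn" -> "order_id", // must be a numeric column in Spark 1.x
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "16"
    )).load()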

Re: lowerupperBound not working/spark 1.3

2015-06-14 Thread Sathish Kumaran Vairavelu
Hi, I am also facing the same issue. Is it possible to view the actual query passed to the database? Has anyone tried that? Also, what if we don't give the upper and lower bound partitions? Would we end up with data skew? Thanks Sathish On Sun, Jun 14, 2015 at 5:02 AM Sujeevan suje...@gmail.com wrote:

Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello Everyone, I pulled 2 different tables from the JDBC source and then joined them using the cust_id *decimal* column. A simple join, as below. This simple join works perfectly in the database but not in Spark SQL. I am importing the 2 tables as data frames/registering temp tables and firing SQL on

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Thank you, it works in Spark 1.4. On Sun, Jun 14, 2015 at 3:51 PM Michael Armbrust mich...@databricks.com wrote: Sounds like SPARK-5456 https://issues.apache.org/jira/browse/SPARK-5456, which is fixed in Spark 1.4. On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu vsathishkuma

Spark SQL - Complex query pushdown

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello, Is there a way in Spark where I define the data source (say the JDBC source) and the list of tables to be used on that data source? Like a JDBC connection, where we define the connection and execute statements based on that connection. In the current external table implementation,

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-06-11 Thread Sathish Kumaran Vairavelu
Hi Nathan, I am also facing this issue with Spark 1.3. Did you find any workaround for it? Please help. Thanks Sathish On Thu, Apr 16, 2015 at 6:03 AM Nathan McCarthy nathan.mccar...@quantium.com.au wrote: It's JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/ I

Drools in Spark

2015-04-07 Thread Sathish Kumaran Vairavelu
Hello, I just want to check if anyone has tried Drools with Spark. Please let me know. Are there any alternative rule engines that work well with Spark? Thanks Sathish

Error in SparkSQL/Scala IDE

2015-04-02 Thread Sathish Kumaran Vairavelu
Hi Everyone, I am getting the following error while registering a table using the Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1. import sqlContext.createSchemaRDD val empFile = sc.textFile("/tmp/emp.csv", 4).map(_.split(","))
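For reference, a minimal sketch of the full Spark 1.2-era pattern the snippet appears to follow (the case class fields and file layout are assumptions):

    case class Emp(id: Int, name: String)

    import sqlContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD conversion

    val emp = sc.textFile("/tmp/emp.csv", 4)
      .map(_.split(","))
      .map(p => Emp(p(0).trim.toInt, p(1).trim))

    emp.registerTempTable("emp")
    sqlContext.sql("SELECT name FROM emp").collect().foreach(println)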

Checking Data Integrity in Spark

2015-03-27 Thread Sathish Kumaran Vairavelu
Hello, I want to check if there is any way to verify the data integrity of data files. The use case is to perform a data integrity check on large files with 100+ columns and reject records (writing them to another file) that do not meet criteria (such as NOT NULL, date format, etc.). Since there are a lot of
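A minimal sketch of one way to do this with plain filters (column positions and the rules are assumptions): evaluate each record against the rules and write accepted and rejected records to separate paths.

    val records = sc.textFile("/data/input.csv").map(_.split(","))

    // example rules: at least 2 fields, field 0 NOT NULL, field 1 formatted as yyyy-MM-dd
    val isValid = (f: Array[String]) =>
      f.length >= 2 && f(0).nonEmpty && f(1).matches("""\d{4}-\d{2}-\d{2}""")

    records.filter(isValid).map(_.mkString(",")).saveAsTextFile("/data/accepted")
    records.filter(f => !isValid(f)).map(_.mkString(",")).saveAsTextFile("/data/rejected")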

reduceByKey vs countByKey

2015-02-24 Thread Sathish Kumaran Vairavelu
Hello, quick question. I am trying to understand the difference between reduceByKey and countByKey. Which one gives better performance, reduceByKey or countByKey? While we can perform the same count operation using reduceByKey, why do we need countByKey/countByValue? Thanks Sathish
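For illustration, the same count both ways on made-up data: countByKey is an action that returns a Map to the driver, while the reduceByKey version stays as a distributed RDD that can feed further stages, which is why both exist.

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // transformation: result remains distributed
    val countsRdd = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

    // action: result is collected into a scala.collection.Map on the driver
    val countsMap: scala.collection.Map[String, Long] = pairs.countByKey()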

Re: Publishing streaming results to web interface

2015-01-02 Thread Sathish Kumaran Vairavelu
Try and see if this helps. http://zeppelin-project.org/ -Sathish On Fri Jan 02 2015 at 8:20:54 PM Pankaj Narang pankajnaran...@gmail.com wrote: Thomus, Spark does not provide any web interface directly. There might be third party apps providing dashboards but I am not aware of any for the

Spark SQL JSON dataset query nested datastructures

2014-08-09 Thread Sathish Kumaran Vairavelu
I have a simple JSON dataset as below. How do I query all parts.lock for id=1? JSON: { "id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"], "parts": [ { "lock": "One lock", "key": "single key" }, { "lock": "2 lock", "key": "2 key" } ] } Query: select id, name, price, parts.lock from product where
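One way to express that query, as a sketch in current Spark SQL syntax (the file path is a placeholder), flattening the parts array with LATERAL VIEW explode:

    val product = spark.read.json("/data/product.json")
    product.createOrReplaceTempView("product")

    spark.sql("""
      SELECT id, name, price, p.lock
      FROM product
      LATERAL VIEW explode(parts) t AS p
      WHERE id = 1
    """).show()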

Spark SQL dialect

2014-08-08 Thread Sathish Kumaran Vairavelu
Hi, Can anyone point me to where I can find the SQL dialect for Spark SQL? Unlike HQL, there are a lot of tasks involved in creating and querying tables, which is very cumbersome. If we have to fire multiple queries on 10s and 100s of tables then it is very difficult at this point. Given Spark

Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Hello, I am trying to use the Python IDE PyCharm for Spark application development. How can I use pyspark with a Python IDE? Can anyone help me with this? Thanks Sathish

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
. from pyspark import SparkContext from pyspark import SparkConf Execution works from within pycharm... Though my next step is to figure out autocompletion and I bet there are better ways to develop apps for spark.. On Wed, Aug 6, 2014 at 4:16 PM, Sathish Kumaran Vairavelu vsathishkuma