issue with regexp_replace

2019-10-26 Thread amit kumar singh
Hi Team,


I am trying to use regexp_replace in Spark SQL and it is throwing an error:

expected , but found Scalar
 in 'reader', line 9, column 45:
 ... select translate(payload, '"', '"') as payload


I am trying to replace all occurrences of \\\" with ".
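
For reference, a minimal sketch of doing this with regexp_replace (the column name and sample payload are assumptions, and the "expected , but found Scalar" message reads like it comes from a YAML/config reader parsing the query text rather than from Spark itself). A literal backslash has to be escaped twice, once for the Scala string and once for the regex engine:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder().appName("regexp-replace-demo").getOrCreate()
import spark.implicits._

// Sample payload containing escaped quotes, e.g. {\"id\":1}
val df = Seq("""{\"id\":1,\"name\":\"abc\"}""").toDF("payload")

// The pattern "\\\\\"" reaches the regex engine as \\" (a literal backslash followed
// by a double quote); the replacement is a plain double quote.
val cleaned = df.select(regexp_replace($"payload", "\\\\\"", "\"").as("payload"))
cleaned.show(false)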


convert json string in spark sql

2019-10-16 Thread amit kumar singh
Hi Team,

I have Kafka messages where the JSON comes in as a string. How can I create a table
after converting the JSON string to JSON using Spark SQL?
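
A minimal sketch of one way to do this (the broker, topic, and JSON schema below are assumptions): parse the string value with from_json against an explicit schema, then register the result as a temporary view so it can be queried with Spark SQL:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("json-from-kafka").getOrCreate()
import spark.implicits._

// Hypothetical schema of the JSON payload.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

val raw = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
  .option("subscribe", "events")                     // placeholder topic
  .load()

// Kafka delivers the message body as binary; cast it to string and parse it.
val parsed = raw
  .select(from_json($"value".cast("string"), schema).as("payload"))
  .select("payload.*")

parsed.createOrReplaceTempView("events_json")
spark.sql("SELECT id, name FROM events_json").show()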


Re: Spark-hive integration on HDInsight

2019-02-21 Thread amit kumar singh
Hey Jay,

How are you creating your cluster? Are you using a Spark cluster?

All of this should be set up automatically.



Sent from my iPhone

> On Feb 21, 2019, at 12:12 PM, Felix Cheung  wrote:
> 
> You should check with HDInsight support
> 
> From: Jay Singh 
> Sent: Wednesday, February 20, 2019 11:43:23 PM
> To: User
> Subject: Spark-hive integration on HDInsight
>  
> I am trying to integrate Spark with Hive on an HDInsight Spark cluster.
> I copied hive-site.xml into the spark/conf directory. In addition, I added Hive 
> metastore properties like the JDBC connection info in Ambari as well. But still 
> the databases and tables created using spark-sql are not visible in Hive. 
> I also changed the 'spark.sql.warehouse.dir' value to point to the Hive warehouse 
> directory.
> Spark does work with Hive without LLAP on. What am I missing in the 
> configuration to integrate Spark with Hive? Any pointers will be appreciated.
>  
> thx  
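
For reference, a minimal sketch of the usual setup (the warehouse path is a placeholder): with hive-site.xml visible to Spark, a session built with enableHiveSupport() talks to the shared Hive metastore, so tables created from Spark SQL are the same tables Hive sees:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-integration-check")
  // Should match the warehouse directory Hive itself uses.
  .config("spark.sql.warehouse.dir", "/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Created through the shared metastore, so it should be visible from Hive as well.
spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT) STORED AS PARQUET")
spark.sql("SHOW TABLES").show()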


executing stored procedure through spark

2018-08-12 Thread amit kumar singh
Hi Team,

Currently we call a Java program to execute the stored procedure.
Is there any way we can achieve the same using PySpark?


Re: Can we deploy python script on a spark cluster

2018-08-02 Thread amit kumar singh
Hi Lehak,

You can make a Scala project with an Oozie class
and one run class which will ship your Python file to the cluster.

Define an Oozie coordinator with a Spark action or a shell action.

We are deploying PySpark-based machine learning code this way.


Sent from my iPhone

> On Aug 2, 2018, at 8:46 AM, Lehak Dharmani  
> wrote:
> 
> 
> We are trying to deploy a python script on a spark cluster. However, as per the
> documentation, it is not possible to deploy python applications on a
> cluster. Is there any alternative?
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> 




Pyspark is not picking up correct python version on azure hdinsight

2018-06-25 Thread amit kumar singh
Hi Guys,


PySpark is not picking up the correct Python version on Azure HDInsight.

Properties set up in spark2-env:



PYSPARK_PYTHON=${PYSPARK3_PYTHON:-/usr/bin/anaconda/envs/py35/bin/python3}
export PYSPARK_DRIVER_PYTHON=${PYSPARK3_PYTHON:-/usr/bin/anaconda/envs/py35/bin/python3}



Thanks


Re: help needed in performance improvement of spark structured streaming

2018-05-30 Thread amit kumar singh
Hi Team,

Any help with this?



I have a use case where I need to call a stored procedure through structured
streaming.

I am able to send a Kafka message and call the stored procedure,

but since the foreach sink keeps executing the stored procedure per message,

I want to combine all the messages into a single dataframe and then call the stored
procedure once.

Is it possible to do this?


current code

select('value cast "string",'topic)
  .select('topic,concat_ws(",", 'value cast "string") as 'value1)
 .groupBy('topic cast "string").count()
.coalesce(1)
.as[String]
.writeStream
.trigger(ProcessingTime("60 seconds"))
.option("checkpointLocation", checkpointUrl)
.foreach(new SimpleSqlServerSink(jdbcUrl, connectionProperties))
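
A sketch of one way to get a single call per trigger instead of per message (Spark 2.4+; streamingDF, the staging table, and the procedure name are assumptions, while jdbcUrl, checkpointUrl, and connectionProperties are the same values used above): foreachBatch hands the sink the whole micro-batch as one DataFrame, which can be written in bulk and followed by a single stored-procedure call:

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val query = streamingDF.writeStream
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .option("checkpointLocation", checkpointUrl)
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Stage the whole micro-batch with one bulk JDBC write ...
    batch.write.mode("append").jdbc(jdbcUrl, "dbo.staging_table", connectionProperties)
    // ... then call the stored procedure once for the batch.
    val conn = DriverManager.getConnection(jdbcUrl, connectionProperties)
    try {
      val stmt = conn.prepareCall("{call dbo.process_staging()}")  // hypothetical procedure
      stmt.execute()
      stmt.close()
    } finally conn.close()
  }
  .start()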







On Sat, May 5, 2018 at 12:20 PM, amit kumar singh 
wrote:

> Hi Community,
>
> I have a use case where i need to call stored procedure through structured
> streaming.
>
> I am able to send kafka message and call stored procedure ,
>
> but since foreach sink keeps on executing stored procedure per message
>
> i want to combine all the messages in a single dataframe and then call
> stored procedure at once
>
> is it possible to do
>
>
> current code
>
> select('value cast "string",'topic)
>   .select('topic,concat_ws(",", 'value cast "string") as 'value1)
>  .groupBy('topic cast "string").count()
> .coalesce(1)
> .as[String]
> .writeStream
> .trigger(ProcessingTime("60 seconds"))
> .option("checkpointLocation", checkpointUrl)
> .foreach(new SimpleSqlServerSink(jdbcUrl, connectionProperties))
>
>
>
>
> thanks
> rohit
>


How can we group by messages coming in per batch of structured streaming

2018-05-30 Thread amit kumar singh
Hi Team,

I have a requirement where I need to combine all the JSON messages coming in
a batch of structured streaming into one single JSON message, separated by a
comma or any other delimiter, and store it.

I have tried grouping by Kafka partition and I tried using concat, but it's not
working.

thanks
Amit
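
A sketch of one approach (Spark 2.4+; the input stream streamingDF, the column names, and the output path are assumptions): inside foreachBatch each micro-batch is a plain DataFrame, so all of its messages can be concatenated into a single delimited string per topic and stored once per trigger:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_list, concat_ws}

streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Gather every message in the batch into one comma-separated string per topic.
    val combined = batch
      .selectExpr("topic", "CAST(value AS STRING) AS json")
      .groupBy("topic")
      .agg(concat_ws(",", collect_list("json")).as("payload"))
    combined.write.mode("append").json(s"/output/json_batches/batch_$batchId")  // placeholder path
  }
  .start()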


help in copying data from one azure subscription to another azure subscription

2018-05-21 Thread amit kumar singh
Hi Team,

We are trying to move data from one Azure subscription to another Azure
subscription. Is there a faster way to do this through Spark?

I am using distcp and it is taking forever.

thanks
rohit
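
If the data is in a format Spark can read (an assumption; the storage accounts, containers, paths, and keys below are placeholders), a sketch of a parallel copy is simply a read from the source account and a write to the destination account, with both account keys set on the Hadoop configuration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cross-subscription-copy").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Credentials for both storage accounts (placeholders).
hadoopConf.set("fs.azure.account.key.sourceaccount.blob.core.windows.net", "<source-key>")
hadoopConf.set("fs.azure.account.key.destaccount.blob.core.windows.net", "<dest-key>")

// Every executor reads and writes its own partitions in parallel.
spark.read.parquet("wasbs://data@sourceaccount.blob.core.windows.net/path")
  .write.mode("overwrite")
  .parquet("wasbs://data@destaccount.blob.core.windows.net/path")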


help needed in performance improvement of spark structured streaming

2018-05-05 Thread amit kumar singh
Hi Community,

I have a use case where I need to call a stored procedure through structured
streaming.

I am able to send a Kafka message and call the stored procedure,

but since the foreach sink keeps executing the stored procedure per message,

I want to combine all the messages into a single dataframe and then call the stored
procedure once.

Is it possible to do this?


current code

select('value cast "string",'topic)
  .select('topic,concat_ws(",", 'value cast "string") as 'value1)
 .groupBy('topic cast "string").count()
.coalesce(1)
.as[String]
.writeStream
.trigger(ProcessingTime("60 seconds"))
.option("checkpointLocation", checkpointUrl)
.foreach(new SimpleSqlServerSink(jdbcUrl, connectionProperties))




thanks
rohit


User class threw exception: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

2018-04-27 Thread amit kumar singh
Hi Team,

I am working on structured streaming.

I have added all the libraries in build.sbt, but it is still not picking up the
right library and is failing with the error:

User class threw exception: java.lang.ClassNotFoundException: Failed to
find data source: kafka. Please find packages at
http://spark.apache.org/third-party-projects.html

I am using Jenkins to deploy this task.

thanks
amit
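
The kafka source lives in a separate artifact, so a sketch of the usual fix (the version is a placeholder and should match the cluster's Spark and Scala versions) is to declare it in build.sbt and make sure it actually reaches the driver and executors at runtime, for example via an assembly (fat) jar or by passing --packages to spark-submit; a compile-time dependency that never ships with the job typically produces exactly this error:

// build.sbt (sketch): the structured streaming Kafka source for Spark 2.x.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"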


how to call stored procedure from spark

2018-04-26 Thread amit kumar singh
Hi Guys,

I have a stored procedure which does a transformation and writes the result to a
SQL Server table.

I am not able to execute this through Spark. Is there any way a Spark
streaming sink can simply call this stored procedure?
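
For the call itself, a minimal sketch (the connection string and procedure name are placeholders): a stored procedure is just a JDBC CallableStatement, which can be executed from the driver or from inside a streaming sink such as the foreachBatch sketch in the 2018-05-30 thread above:

import java.sql.DriverManager

// Placeholder connection string and procedure name.
val jdbcUrl = "jdbc:sqlserver://host:1433;databaseName=db;user=user;password=pass"
val conn = DriverManager.getConnection(jdbcUrl)
try {
  val stmt = conn.prepareCall("{call dbo.my_transform_procedure(?)}")
  stmt.setInt(1, 42)  // example parameter
  stmt.execute()
  stmt.close()
} finally conn.close()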


How to bulk insert using spark streaming job

2018-04-19 Thread amit kumar singh
How to bulk insert using spark streaming job 

Sent from my iPhone
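
A sketch of one way (Spark 2.4+; streamingDF, the URL, the table name, and the connection properties are placeholders): writing each micro-batch with the JDBC writer inserts rows in bulk, with the batchsize option controlling how many rows go into each JDBC batch rather than issuing one insert per record:

import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write
      .mode("append")
      .option("batchsize", "10000")  // rows per JDBC batch
      .jdbc(jdbcUrl, "dbo.target_table", connectionProperties)
  }
  .start()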

Re: optimize hive query to move a subset of data from one partition table to another table

2018-02-11 Thread amit kumar singh
Hi

create table emp as select * from emp_full where join_date >= date_sub(join_date, 2)

I am trying to select from one table and insert into another table.

I need a way to select the last 2 months of data every time.

The table is partitioned on year, month, and day.
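
One hedged sketch (table and partition column names are from the thread; the partition columns are assumed numeric): note that join_date >= date_sub(join_date, 2) compares the column against itself minus two days, so it does not restrict anything. Building a literal predicate on the partition columns instead lets Spark/Hive prune partitions rather than scan the full table:

import java.time.LocalDate
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val cutoff = LocalDate.now().minusMonths(2)
// Literal comparisons on the partition columns, so only ~2 months of partitions are read.
val pred =
  s"(year > ${cutoff.getYear}) OR " +
  s"(year = ${cutoff.getYear} AND month > ${cutoff.getMonthValue}) OR " +
  s"(year = ${cutoff.getYear} AND month = ${cutoff.getMonthValue} AND day >= ${cutoff.getDayOfMonth})"

spark.sql(s"CREATE TABLE emp AS SELECT * FROM emp_full WHERE $pred")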

On Sun, Feb 11, 2018 at 4:30 PM, Richard Qiao <richardqiao2...@gmail.com>
wrote:

> Would you mind share your code with us to analyze?
>
> > On Feb 10, 2018, at 10:18 AM, amit kumar singh <amitiem...@gmail.com>
> wrote:
> >
> > Hi Team,
> >
> > We have hive external  table which has 50 tb of data partitioned on year
> month day
> >
> > i want to move last 2 month of data into another table
> >
> > when i try to do this through spark ,more than 120k task are getting
> created
> >
> > what is the best way to do this
> >
> > thanks
> > Rohit
>
>


optimize hive query to move a subset of data from one partition table to another table

2018-02-10 Thread amit kumar singh
Hi Team,

We have a Hive external table which has 50 TB of data, partitioned on year,
month, and day.

I want to move the last 2 months of data into another table.

When I try to do this through Spark, more than 120k tasks get created.

What is the best way to do this?

thanks
Rohit


How to configure spark with java

2017-07-23 Thread amit kumar singh
Hello everyone,

I want to use Spark with the Java API.

Please let me know how I can configure it.


Thanks
A




Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Amit Kumar
Hi Evan, Patrick and Tobias,

So, it worked for what I needed it to do. I followed Yana's suggestion of
using the parameterized type [T <: Product : ClassTag : TypeTag].

More concretely, I was trying to make the query process a bit more fluent.
Some pseudocode, but with correct types:

val table: SparkTable[POJO] =
  new SparkTable[POJO](sqlContext, extractor: String => POJO)
val data =
  table.atLocation("hdfs://")
    .withName(tableName)
    .makeRDD("SELECT * FROM tableName")


class SparkTable[T <: Product : ClassTag : TypeTag](val sqlContext: SQLContext,
    val extractor: (String) => T) {
  private[this] var location: Option[String] = None
  private[this] var name: Option[String] = None
  private[this] val sc = sqlContext.sparkContext

  def withName(name: String): SparkTable[T] = {..}
  def atLocation(path: String): SparkTable[T] = {.. }
  def makeRDD(sqlQuery: String): SchemaRDD = {
    ...
    import sqlContext._
    val rdd: RDD[String] = sc.textFile(this.location.get)
    val rddT: RDD[T] = rdd.map(extractor)
    val schemaRDD = createSchemaRDD(rddT)
    schemaRDD.registerAsTable(name.get)
    val all = sqlContext.sql(sqlQuery)
    all
  }

}

Best,
Amit

On Tue, Aug 19, 2014 at 9:13 PM, Evan Chan velvia.git...@gmail.com wrote:

 That might not be enough.  Reflection is used to determine what the
 fields are, thus your class might actually need to have members
 corresponding to the fields in the table.

 I heard that a more generic method of inputting stuff is coming.

 On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:
  Hi,
 
  On Tue, Aug 19, 2014 at 7:01 PM, Patrick McGloin 
 mcgloin.patr...@gmail.com
  wrote:
 
  I think the type of the data contained in your RDD needs to be a known
  case class and not abstract for createSchemaRDD.  This makes sense when
 you
  think it needs to know about the fields in the object to create the
 schema.
 
 
  Exactly this. The actual message pointing to that is:
 
  inferred type arguments [T] do not conform to method createSchemaRDD's
  type parameter bounds [A <: Product]
 
  All case classes are automatically subclasses of Product, but otherwise
 you
  will have to extend Product and add the required methods yourself.
 
  Tobias
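
To make the last point concrete, a minimal sketch (class and field names are hypothetical) of a non-case class made acceptable to createSchemaRDD by implementing Product by hand; a case class gets all of this generated automatically:

// A plain class satisfying the [A <: Product] bound manually.
class Pojo(val id: Int, val name: String) extends Product with Serializable {
  def productArity: Int = 2
  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
  def canEqual(that: Any): Boolean = that.isInstanceOf[Pojo]
}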
 



type issue: found RDD[T] expected RDD[A]

2014-08-05 Thread Amit Kumar
Hi All,

I am having some trouble trying to write generic code that uses sqlContext
and RDDs. Can you suggest what might be wrong?

 class SparkTable[T : ClassTag](val sqlContext:SQLContext, val extractor:
(String) => (T) ) {

  private[this] var location:Option[String] =None
  private[this] var name:Option[String]=None
  private[this] val sc = sqlContext.sparkContext
  ...

def makeRDD(sqlQuery:String):SchemaRDD={
require(this.location!=None)
require(this.name!=None)
import sqlContext._
val rdd:RDD[String] = sc.textFile(this.location.get)
val rddT:RDD[T] = rdd.map(extractor)
val schemaRDD:SchemaRDD= createSchemaRDD(rddT)
schemaRDD.registerAsTable(name.get)
val all = sqlContext.sql(sqlQuery)
all
  }

}

I use it as below:

 def extractor(line:String):POJO={
  val splits= line.split(pattern).toList
  POJO(splits(0),splits(1),splits(2),splits(3))
}

   val pojoTable:SparkTable[POJO] = new
SparkTable[POJO](sqlContext,extractor)

val identityData: SchemaRDD =
  pojoTable.atLocation("hdfs://location/table")
    .withName("pojo")
    .makeRDD("SELECT * FROM pojo")


I get compilation failure

inferred type arguments [T] do not conform to method createSchemaRDD's type
parameter bounds [A <: Product]
[error] val schemaRDD:SchemaRDD= createSchemaRDD(rddT)
[error]  ^
[error]  SparkTable.scala:37: type mismatch;
[error]  found   : org.apache.spark.rdd.RDD[T]
[error]  required: org.apache.spark.rdd.RDD[A]
[error] val schemaRDD:SchemaRDD= createSchemaRDD(rddT)
[error]  ^
[error] two errors found

I am probably missing something basic either in scala reflection/types or
implicits?

Any hints would be appreciated.

Thanks
Amit