Issue: KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread Shyam P
Hi, I am using spark-sql-2.4.1v with Kafka and I am facing a slow consumer issue. I see the warning "KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey(spark-kafka-source-33321dde-bfad-49f3-bdf7-09f95883b6e9--1249540122-executor)" in the logs. More on the same
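For reference, in Spark 2.4 the executor-side Kafka consumer cache is a soft limit controlled by spark.sql.kafkaConsumerCache.capacity (default 64). A minimal sketch of raising it for a structured-streaming read follows; the broker address and topic name are placeholders, not values from the thread:

    // Sketch: raise the Kafka consumer cache soft limit above the number of
    // topic partitions each executor reads (Spark 2.4 setting, default 64).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-cache-demo")
      .config("spark.sql.kafkaConsumerCache.capacity", "128")
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "my_topic")                   // placeholder
      .load()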

How to handle this use-case in spark-sql-streaming

2019-09-30 Thread Shyam P
Hi, I have a scenario like the one below: https://stackoverflow.com/questions/58134379/how-to-handle-backup-scenario-in-spark-structured-streaming-using-joins How to handle this use-case (back-up scenario) in spark-structured-streaming? Any clues would be highly appreciated. Thanks, Shyam

Can anyone suggest what is wrong with my spark job here?

2019-09-16 Thread Shyam P
Hi, though my spark job works fine locally, in the spark cluster it has an issue. Can anyone suggest what is wrong here? https://stackoverflow.com/questions/57960569/accessing-external-yml-file-in-my-spark-job-code-not-working-throwing-cant-con Regards, Shyam

Re: how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-06 Thread Shyam P
The difficult things in Spark are debugging and tuning.

Re: how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-06 Thread Shyam P
Cool, but did you find a way, or any help or clue? On Fri, Sep 6, 2019 at 11:40 PM David Zhou wrote: > I have the same question as you > > On Thu, Sep 5, 2019 at 9:18 PM Shyam P wrote: > >> Hi, >> >> I am using spark-sql-2.4.1v for streaming in my PoC.

how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-05 Thread Shyam P
Hi, I am using spark-sql-2.4.1v for streaming in my PoC. How do I refresh a dataframe loaded from an HDFS/Cassandra table every time a new batch of the stream is processed? What is the general practice for handling this kind of scenario? Below is the SO link for more details.
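One common pattern for this (a sketch, not necessarily the answer the thread settled on) is to re-read the static table inside foreachBatch, which Spark 2.4 supports, so every micro-batch picks up fresh data. The lookup path, join key, and output path below are assumptions:

    import org.apache.spark.sql.DataFrame

    // spark is the job's SparkSession; streamDf is the input stream already
    // defined in the job.
    streamDf.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Re-load the non-streaming data on every micro-batch so changes
        // in HDFS/Cassandra are picked up.
        val lookupDf = spark.read.parquet("/data/lookup") // placeholder path
        batch.join(lookupDf, Seq("id"))                   // "id" is an assumed key
          .write.mode("append").parquet("/data/out")      // placeholder sink
      }
      .start()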

Re: Even after VO fields are mapped using @Table and @Column annotations get error NoSuchElementException

2019-09-04 Thread Shyam P
Now I am getting a different error, as below: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object [] of type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema to com.datastax.driver.core.LocalDate. at
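A guess at the cause, not a confirmed fix: a GenericRowWithSchema in this error usually means the DataFrame column bound to a Cassandra date column is a nested struct rather than a plain DateType. A one-line sketch with a hypothetical column name:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DateType

    // "data_date" is a hypothetical column; make sure it reaches the
    // connector as a plain DateType, not a struct.
    val fixedDf = df.withColumn("data_date", col("data_date").cast(DateType))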

Re: Control Sqoop job from Spark job

2019-09-03 Thread Shyam P
J Franke, leave sqoop aside, I am just asking about Spark for ETL from Oracle...? Thanks, Shyam

Re: Control Sqoop job from Spark job

2019-09-03 Thread Shyam P
Hi Mich, a lot of people say that Spark does not have the proven record in migrating data from Oracle that sqoop has, at least in production. Please correct me if I am wrong, and suggest how to deal with shuffling when dealing with groupBy? Thanks, Shyam On Sat, Aug 31, 2019 at 12:17 PM Mich

Re: Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Shyam P
ndra connector library written for spark > streaming because we wrote one ourselves when we wanted to do the same. > > Regards > Prathmesh Ranaut > https://linkedin.com/in/prathmeshranaut > > On Aug 29, 2019, at 7:21 AM, Shyam P wrote: > > Hi, > > I need to

Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Shyam P
Hi, I need to do a PoC for a business use-case. *Use case:* Need to update a record in a Cassandra table if it exists. Will spark streaming support comparing each record and updating the existing Cassandra record? For each record received from the kafka topic, if I want to check and compare each record
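Worth noting (not from the thread itself): Cassandra writes are upserts on the primary key, so a plain spark-cassandra-connector save already overwrites a row when the key exists and inserts it otherwise. A sketch with placeholder keyspace/table names:

    import org.apache.spark.sql.SaveMode

    // Every Cassandra write is an upsert on the primary key, so appending
    // the batch updates matching rows in place and inserts the rest.
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // placeholders
      .mode(SaveMode.Append)
      .save()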

How to improve loading data into Cassandra table in this scenario?

2019-08-28 Thread Shyam P
> > updated the issue content. > https://stackoverflow.com/questions/57684972/how-to-improve-performance-my-spark-job-here-to-load-data-into-cassandra-table Thank you.

Are groupBy and partition similar in this scenario? Do I still need to do partitioning here to save into Cassandra?

2019-08-27 Thread Shyam P
Hi, are groupBy and partition similar in this scenario? I know they are not similar and are meant for different purposes, but I am confused here. Do I still need to do partitioning here to save into Cassandra? Below is my scenario. I am using spark-sql-2.4.1, spark-cassandra-connector_2.11-2.4.1 with
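To make the distinction concrete (a sketch with an assumed key column): groupBy is a logical aggregation that collapses rows, while repartition only changes the physical distribution of the same rows, which is what matters before a Cassandra save:

    import org.apache.spark.sql.functions.col

    // groupBy collapses rows into aggregates...
    val aggregated = df.groupBy(col("company_id")).count()

    // ...while repartition keeps every row and only redistributes them, e.g.
    // to co-locate rows that share a partition key before saving.
    val redistributed = df.repartition(col("company_id"))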

Any advice on how to do this use-case in spark sql?

2019-08-13 Thread Shyam P
Hi, any advice on how to do this in spark sql? I have a scenario as below: dataframe1 = loaded from an HDFS Parquet file. dataframe2 = read from a Kafka stream. If the column1 value of dataframe1 is in the columnX values of dataframe2, then I need to replace the column1 value of dataframe1.
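A sketch of one way to express this with a left join plus when/otherwise; the column names and the source of the new value ("replacement") are assumptions, since the message is cut off before they are given:

    import org.apache.spark.sql.functions.{col, when}

    // "replacement" is an assumed column carrying the new value for column1.
    val updates = dataframe2.select(col("columnX"), col("replacement"))

    val joined = dataframe1.join(updates,
      dataframe1("column1") === updates("columnX"), "left")

    // Where a match exists, swap in the replacement; otherwise keep column1.
    val result = joined
      .withColumn("column1",
        when(updates("columnX").isNotNull, updates("replacement"))
          .otherwise(dataframe1("column1")))
      .drop("columnX", "replacement")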

How to pass Datasets as arguments to a user-defined function of a class

2019-07-11 Thread Shyam P
Hi, any help is appreciated. https://stackoverflow.com/questions/56991447/in-spark-dataset-s-can-be-passed-as-input-args-to-a-function-to-get-out-put-args Regards, Shyam

Re: A basic question

2019-06-17 Thread Shyam P
reference/html/springandhadoop-spark.html > > > On Mon, Jun 17, 2019 at 12:27 PM Shyam P wrote: > >> I am developing a spark job using Java 1.8. >> >> Is it possible to write a spark app using spring-boot technology? >> Has anyone tried it? If so, how

A basic question

2019-06-17 Thread Shyam P
I am developing a spark job using Java 1.8. Is it possible to write a spark app using spring-boot technology? Has anyone tried it? If so, how should it be done? Regards, Shyam

Why does my spark job show STATE --> RUNNING but FINALSTATE --> UNDEFINED?

2019-06-11 Thread Shyam P
Hi, any clue why my spark job goes into the UNDEFINED state? More details are in the URL: https://stackoverflow.com/questions/56545644/why-my-spark-sql-job-stays-in-state-runningfinalstatus-undefined I appreciate your help. Regards, Shyam

Has anyone used spark-structured streaming successfully in production?

2019-06-10 Thread Shyam P
https://stackoverflow.com/questions/56428367/any-clue-how-to-join-this-spark-structured-stream-joins

How are spark structured streaming consumers initiated and invoked while reading multi-partitioned kafka topics?

2019-06-10 Thread Shyam P
Hi, any suggestions regarding the issue below? https://stackoverflow.com/questions/56524921/how-spark-structured-streaming-consumers-initiated-and-invoked-while-reading-mul Thanks, Shyam

Re: Read hdfs files in spark streaming

2019-06-10 Thread Shyam P
Hi Deepak, why are you getting paths from a kafka topic? Any specific reason to do so? Regards, Shyam On Mon, Jun 10, 2019 at 10:44 AM Deepak Sharma wrote: > The context is different here. > The file paths are coming as messages in a kafka topic. > Spark streaming (structured) consumes from this

How to handle small file problem in spark structured streaming?

2019-06-10 Thread Shyam P
https://stackoverflow.com/questions/56524539/how-to-handle-small-file-problem-in-spark-structured-streaming Regards, Shyam

Re: java.util.NoSuchElementException: Columns not found

2019-06-03 Thread Shyam P
Thank you so much Alex Ott. On Fri, May 31, 2019 at 6:05 PM Alex Ott wrote: > Check the answer on SO... > > On Fri, May 31, 2019 at 1:04 PM Shyam P wrote: > >> Trying to save sample data into a C* table >> >> I am getting the below error: >> >> *java.util

java.util.NoSuchElementException: Columns not found

2019-05-31 Thread Shyam P
While trying to save sample data into a C* table I am getting the below error: *java.util.NoSuchElementException: Columns not found in table abc.company_vals: companyId, companyName* Though I have all the columns and have re-checked them again and again, I don't see any issue with the columns. I am using
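One frequent cause of this particular connector error (an assumption here, since the thread defers to the SO answer) is identifier casing: Cassandra stores unquoted column names in lowercase, so camelCase DataFrame columns such as companyId will not match. A one-line sketch:

    // Rename every DataFrame column to lowercase to match the unquoted
    // (lowercased) column names Cassandra actually stores.
    val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)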

design question related to kafka.

2019-05-17 Thread Shyam P
Hi, https://stackoverflow.com/questions/56181135/design-can-kafka-producer-written-as-spark-job Thank you, Shyam

IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] while using spark-sql-2.4.1v to read data from oracle

2019-05-08 Thread Shyam P
Hi, I have an oracle table which has a column with schema DATA_DATE DATE, holding values like 31-MAR-02. I am trying to retrieve data from oracle using spark-sql-2.4.1. I tried to set the JdbcOptions as below: .option("lowerBound", "2002-03-31 00:00:00"); .option("upperBound",
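For context, a sketch of the partitioned JDBC read these options belong to; in Spark 2.4 the bounds must parse as yyyy-MM-dd HH:mm:ss timestamps when partitionColumn is a DATE/TIMESTAMP column. All connection details below are placeholders:

    val oracleDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE") // placeholder
      .option("dbtable", "MY_SCHEMA.MY_TABLE")                  // placeholder
      .option("user", "app_user")                               // placeholder
      .option("password", "app_pass")                           // placeholder
      .option("partitionColumn", "DATA_DATE")
      .option("lowerBound", "2002-03-31 00:00:00")
      .option("upperBound", "2019-05-08 00:00:00")
      .option("numPartitions", "8")
      .load()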

Re: Spark SQL Teradata load is very slow

2019-05-03 Thread Shyam P
Asmath, why is upperBound set to 300? How many cores do you have? Check how the data is distributed in the TeraData DB table: SELECT itm_bloon_seq_no, count(*) as cc FROM TABLE GROUP BY itm_bloon_seq_no ORDER BY itm_bloon_seq_no desc; Is this column "itm_bloon_seq_no" already in the table, or did you derive it in spark

Re: spark df.write.partitionBy runs very slowly

2019-05-02 Thread Shyam P
Junfeng Chen > > > On Thu, Mar 14, 2019 at 2:26 PM Shyam P wrote: > >> cool. >> >> On Tue, Mar 12, 2019 at 9:08 AM JF Chen wrote: >> >>> Hi >>> Finally I found the reason... >>> It was caused by some long GC pauses on some datanodes. Afte

spark stddev() giving '?' as output, how to handle it? i.e. replace with null/0

2019-04-24 Thread Shyam P
https://stackoverflow.com/questions/55823608/how-to-handle-spark-stddev-function-output-value-when-there-there-is-no-data Regards, Shyam

Re: Is there any spark API function to handle a group of companies at once in this scenario?

2019-04-08 Thread Shyam P
On Fri, 5 Apr 2019 at 10:51, Shyam P wrote: > >> Hi, >> In my scenario I have a few companies, for which I need to calculate a few >> stats, like averages, that need to be stored in Cassandra

Is there any spark API function to handle a group of companies at once in this scenario?

2019-04-05 Thread Shyam P
Hi, in my scenario I have a few companies for which I need to calculate a few stats, like averages, that need to be stored in Cassandra. For the next set of records I need to get the previously calculated stats, and over them I need to calculate accumulated results (i.e. the present set of data + the previously stored stats) and

Re: spark df.write.partitionBy runs very slowly

2019-03-14 Thread Shyam P
kes a long time. > Now I have decommissioned the broken data nodes, and now my spark runs > well. > I am trying to increase the heap size of data node to check if it can > resolve the problem > > Regard, > Junfeng Chen > > > On Fri, Mar 8, 2019 at 8:54 PM Shyam P wrote

Re: spark df.write.partitionBy runs very slowly

2019-03-08 Thread Shyam P
e spark UI I can ensure data is not skewed. There is only about >> 100MB for each task, where most tasks take several seconds to write the >> data to hdfs, and some tasks take minutes. >> >> Regard, >> Junfeng Chen >> >> >> On Wed, Mar

Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread Shyam P
It would be better if you shared a code block, to understand it better; otherwise it would be difficult to provide an answer. ~Shyam On Wed, Mar 6, 2019 at 8:38 AM JF Chen wrote: > When my kafka executor reads data from kafka, sometimes it throws the > error "java.lang.AssertionError: assertion failed:

Re: spark df.write.partitionBy runs very slowly

2019-03-05 Thread Shyam P
ngs, some tasks in the write-to-hdfs stage cost > much more time than others, while the amount of data written is similar. > How to solve it? > > Regard, > Junfeng Chen > > > On Tue, Mar 5, 2019 at 3:05 PM Shyam P wrote: > >> Hi JF , >>

Re: How to group a dataframe year-wise, iterate through the groups, and send each year's dataframe to an executor?

2019-03-05 Thread Shyam P
Thanks a lot Roman. But the provided link has several ways to deal with the problem. Why do we need to do the operation on an RDD instead of a dataframe/dataset? Do I need a custom partitioner in my case, and how do I invoke it in spark-sql? Can anyone provide a sample of handling skewed data with spark-sql? Thanks,
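For what a dataframe-only approach can look like, here is a salting sketch (not necessarily what Roman's link suggests; "year" and "amount" are assumed columns and the bucket count of 10 is arbitrary): spread the hot key over salted sub-keys, pre-aggregate, then aggregate again without the salt:

    import org.apache.spark.sql.functions.{col, floor, rand, sum}

    // Stage 1: split each year into 10 salted buckets and pre-aggregate,
    // so no single task receives the whole skewed key.
    val partial = df
      .withColumn("salt", floor(rand() * 10))
      .groupBy(col("year"), col("salt"))
      .agg(sum("amount").as("partial_sum"))

    // Stage 2: merge the per-bucket partials back into one row per year.
    val totals = partial.groupBy(col("year")).agg(sum("partial_sum").as("total"))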

[no subject]

2019-03-05 Thread Shyam P
Hi all, I need to save a huge data frame as a parquet file. As it is huge, it is taking several hours. To improve performance it is known that I have to send it group-wise, but when I do partition(columns*)/groupBy(columns*), the driver spills a lot of data and performance suffers again. So how
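A hedged sketch of the usual remedy: repartition on the grouping column first so each task holds complete groups, then let partitionBy lay out the directories (the column name and output path are assumptions):

    import org.apache.spark.sql.functions.col

    // Align the shuffle with the output layout: rows for one group land in
    // one task, so the write does not funnel data through a few tasks.
    df.repartition(col("group_col"))
      .write
      .mode("overwrite")
      .partitionBy("group_col")
      .parquet("/data/output") // placeholder path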

Re: error in spark sql

2019-03-04 Thread Shyam P
Something is wrong with the query. Add a code snippet of exactly what you are trying to do. ~Shyam On Fri, Mar 1, 2019 at 1:07 PM yuvraj singh <19yuvrajsing...@gmail.com> wrote: > Hi, > > I am running spark as a service; when we change some sql schema we are > facing some problems. > > ERROR

Re: spark df.write.partitionBy run very slow

2019-03-04 Thread Shyam P
Hi JF, try to execute this before df.write:

    // count by partition_id
    import org.apache.spark.sql.functions.spark_partition_id
    df.groupBy(spark_partition_id).count.show()

You will come to know how the data has been partitioned inside df. A small trick we can apply here while

Re: Looking for an apache spark mentor

2019-02-19 Thread Shyam P
Which IRC channel should we join? On Tue, 19 Feb 2019, 17:56 Robert Kaye, wrote: > Hello! > > I’m Robert Kaye from the MetaBrainz Foundation — we’re the people behind > MusicBrainz ( https://musicbrainz.org ) and more recently ListenBrainz ( > https://listenbrainz.org ). ListenBrainz is aiming

Does Cassandra Support Populating Reference Table from Master Table ?

2019-02-15 Thread Shyam P
Hi, I have a scenario where I need to ingest data into a master table which has many columns, including a few like "Country_Id", "CountryName", "Date", etc. Every time I load new records, this "Date" changes to the data-generation date. Every time, each country's data might