Re: ETL Using Spark

2020-05-24 Thread vijay.bvp
Hi Avadhut Narayan Joshi, the use case is achievable using Spark. Connecting to SQL Server is possible, as Mich mentioned below, as long as there is a JDBC driver that can connect to SQL Server. For production workloads, important points to consider: >> What are the QoS requirements for your case? at least

Re: [apache-spark]-spark-shuffle

2020-05-24 Thread vijay.bvp
How a Spark job reads data sources depends on the underlying source system and on the job configuration for the number of executors and cores per executor. https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets About shuffle operations.

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread vijay.bvp
Have you looked at http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html and the post mentioned there, https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html? Also try compressing the output

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp
Instead of spark-shell, have you tried running it as a job? How many executors and cores? Can you share the RDD graph and event timeline from the UI? Did you find which of the tasks are taking more time, and was there any GC? Please look at the UI if you have not already; it can provide a lot of information

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-23 Thread vijay.bvp
Thanks for adding the RDD lineage graph. I could see 18 parallel tasks for the HDFS read; was it changed? What is the Spark job configuration: how many executors and cores per executor? I would say keep the partitioning a multiple of (no. of executors * cores) for all the RDDs. If you have 3 executors
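
The rule of thumb above (keep the partition count a multiple of executors times cores) can be sketched as a small helper. This is a minimal illustration in plain Python, not a Spark API. The function name `round_up_partitions` and the cluster shape of 3 executors with 4 cores each are assumptions chosen for the example; only the figure of 18 tasks comes from the message.

```python
def round_up_partitions(current_partitions: int, executors: int, cores_per_executor: int) -> int:
    """Round a partition count up to the next multiple of the total task slots."""
    if executors <= 0 or cores_per_executor <= 0:
        raise ValueError("executors and cores_per_executor must be positive")
    slots = executors * cores_per_executor
    if current_partitions <= 0:
        return slots
    # Ceiling division gives the number of full "waves" of tasks needed.
    waves = -(-current_partitions // slots)
    return waves * slots


# Hypothetical cluster: 3 executors with 4 cores each gives 12 task slots.
print(round_up_partitions(18, executors=3, cores_per_executor=4))  # 24
```

With 18 partitions on 12 slots, the second wave would keep only 6 slots busy; rounding up to 24 fills both waves. The resulting number is the kind of value one might pass to `rdd.repartition(n)`.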

Re: sqoop import job not working when spark thrift server is running.

2018-02-23 Thread vijay.bvp
It is surely not able to get sufficient resources from YARN to start the containers. Is it only with this import job, or does any other job you submit also fail to start? As a test, just try to run another Spark job or a MapReduce job and see if the job can be started. Reduce the thrift server

Re: Can spark handle this scenario?

2018-02-23 Thread vijay.bvp
When an HTTP connection is opened, you are opening a connection from one specific machine (with its IP and NIC) to another specific machine, so it can't be serialized and used on another machine, right? This isn't a Spark limitation. I made a simple diagram if it helps. The objects created at driver
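
The point that a live connection cannot be serialized is not specific to Spark, and a small stdlib-only Python sketch can illustrate it. The helper name `can_be_pickled` and the sample dictionary are hypothetical; the socket is never connected to anything, it only stands in for an open connection.

```python
import pickle
import socket


def can_be_pickled(obj) -> bool:
    """Return True if obj survives pickle serialization, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Plain data serializes fine and could be shipped to another machine.
print(can_be_pickled({"symbol": "XYZ"}))

# A socket wraps an OS-level handle that only has meaning on this machine,
# so Python refuses to pickle it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    print(can_be_pickled(sock))
finally:
    sock.close()
```

The same reasoning applies to any object that holds a local resource, which is why such objects are usually created where they are used instead of being shipped from the driver.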

Re: Can spark handle this scenario?

2018-02-20 Thread vijay.bvp
I am assuming the pullSymbolFromYahoo function opens a connection to the Yahoo API with some token passed. In the code provided so far, if you have 2000 symbols, it will make 2000 new connections and 2000 API calls! Connection objects can't/shouldn't be serialized and sent to executors; they should
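
The message is cut off here, so the following is only an assumption about one common way to avoid a connection per symbol: create the client inside the task, once per partition, and reuse it for every symbol in that partition. The sketch is plain Python with no Spark dependency. `FakeQuoteClient`, `fetch_partition`, the symbol names, and the choice of 4 partitions are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


class FakeQuoteClient:
    """Hypothetical stand-in for an API client; counts how often it is opened."""

    opened = 0

    def __init__(self) -> None:
        FakeQuoteClient.opened += 1

    def fetch(self, symbol: str) -> Tuple[str, int]:
        # Placeholder result; a real client would call the remote API here.
        return (symbol, len(symbol))

    def close(self) -> None:
        pass


def fetch_partition(symbols: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Open one client for the whole partition and reuse it for every symbol."""
    client = FakeQuoteClient()
    try:
        for symbol in symbols:
            yield client.fetch(symbol)
    finally:
        client.close()


# Simulate 2000 symbols split across 4 partitions.
symbols = [f"SYM{i}" for i in range(2000)]
partitions: List[List[str]] = [symbols[i::4] for i in range(4)]
results = [row for part in partitions for row in fetch_partition(part)]
print(len(results), FakeQuoteClient.opened)  # 2000 4
```

In Spark, a function with this shape (iterator in, iterator out) is what one would pass to `rdd.mapPartitions`, so the client is constructed on the executor and never has to be serialized.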

Re: sqoop import job not working when spark thrift server is running.

2018-02-20 Thread vijay.bvp
What was the error when you tried to run the MapReduce import job while the thrift server was running? Is this the only config changed? What was the config before? Also share the Spark thrift server job config, such as no. of executors, cores, memory, etc. My guess is your MapReduce job is unable to

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread vijay.bvp
Apologies for the long answer. Understanding partitioning at each stage of the RDD graph/lineage is important for efficient parallelism and a balanced load. This applies to working with any source, streaming or static. You have a tricky situation here of one source, Kafka with 9

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-01-31 Thread vijay.bvp
Here are my two cents; experts, please correct me if I am wrong. It's important to understand why to choose one over the other and for what kind of use case. There might be some time in the future when the low-level APIs are abstracted and become legacy, but for now in Spark the RDD API is the core and low-level API, all higher

Re: Spark Streaming Cluster queries

2018-01-31 Thread vijay.bvp
Assuming you are talking about Spark Streaming. 1) How to analyze what part of code executes on Spark Driver and what part of code executes on the executors? RDDs can be understood as a set of data transformations or a set of jobs. Your understanding deepens as you do more programming with Spark.

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-01-31 Thread vijay.bvp
Summarizing: 1) The static data set read from Parquet files in HDFS as a DataFrame has an initial parallelism of 90 (based on the number of input files). 2) The static data set DataFrame is converted to an RDD, and the RDD has a parallelism of 18. This was not expected; dataframe.rdd is lazily evaluated, there must be some operation

Re: Type Casting Error in Spark Data Frame

2018-01-31 Thread vijay.bvp
formatted = Assuming the MessageHelper.sqlMapping schema is correctly mapped to the input JSON (it would help if the schema and a sample JSON were shared), here is the explode function with DataFrames; similar functionality is available with SQL. import sparkSession.implicits._ import

Re: Type Casting Error in Spark Data Frame

2018-01-31 Thread vijay.bvp
Assuming the MessageHelper.sqlMapping schema is correctly mapped to the input JSON (it would help if the schema and a sample JSON were shared), here is the explode function with DataFrames; similar functionality is available with SQL. import sparkSession.implicits._ import org.apache.spark.sql.functions._ val

Re: Spark Streaming not reading missed data

2018-01-16 Thread vijay.bvp
You are creating a streaming context each time: val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(config.getInt(Constants.Properties.BatchInterval))) If you want fault tolerance, to read from where it stopped between Spark job restarts, the correct way is to restore