Re: jdbc spark streaming

2018-12-28 Thread Thakrar, Jayesh
Yes, you can certainly use Spark streaming, but reading from the original source table may still be time-consuming and resource-intensive. Having some context on the RDBMS platform, the data size/volumes involved, and the tolerable lag (between changes being created and them being processed by Spark)
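
A minimal sketch of the kind of incremental JDBC pull this implies, assuming the source table has a monotonically increasing change-timestamp column (hypothetically named updated_at here; the URL, credentials and paths are illustrative too):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-poll").getOrCreate()

    // Pull only rows changed since the last run; lastSeen would be
    // persisted between runs (e.g. in a checkpoint file or table).
    val lastSeen = "2018-12-27 00:00:00"
    val changes = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", s"(SELECT * FROM src_table WHERE updated_at > '$lastSeen') t")
      .option("user", "etl")
      .option("password", "...")
      .load()

    changes.write.mode("append").parquet("/data/changes")

The WHERE clause in the pushed-down subquery is what keeps each poll from re-scanning the whole table.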

Re: Custom Metric Sink on Executor Always ClassNotFound

2018-12-21 Thread Thakrar, Jayesh
Just curious - is this HttpSink your own custom sink or a Dropwizard configuration? If it is your own custom code, I would suggest looking at/trying out Dropwizard. See http://spark.apache.org/docs/latest/monitoring.html#metrics and https://metrics.dropwizard.io/4.0.0/ Also, from what I know, the metrics
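
For reference, a Dropwizard-backed sink is wired up in conf/metrics.properties — e.g. the built-in console sink:

    *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
    *.sink.console.period=10
    *.sink.console.unit=seconds

If the sink is your own class, keep in mind that the executors instantiate it as well, so its jar must be on the executor classpath (e.g. shipped via --jars or spark.executor.extraClassPath) — a jar missing there is a common cause of exactly this kind of ClassNotFoundException.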

Re: How to track batch jobs in spark ?

2018-12-06 Thread Thakrar, Jayesh
See if https://spark.apache.org/docs/latest/monitoring.html helps. Essentially, whether you run an app via spark-shell or spark-submit (local, Spark standalone cluster, YARN, Kubernetes, Mesos), the driver will provide a UI on port 4040. You can monitor via the UI and via a REST API. E.g. running
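
For example, assuming the driver is reachable on localhost, the REST API can be polled like this (<app-id> comes from the first call):

    curl http://localhost:4040/api/v1/applications
    curl http://localhost:4040/api/v1/applications/<app-id>/jobs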

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
are still excessive. From: Vitaliy Pisarev Date: Thursday, November 15, 2018 at 1:58 PM To: "Thakrar, Jayesh" Cc: Shahbaz, user, David Markovitz Subject: Re: How to address seemingly low core utilization on a spark workload? Small update, my initial estimate was incorrect. I have on

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
save. From: Vitaliy Pisarev Date: Thursday, November 15, 2018 at 1:03 PM To: Shahbaz Cc: "Thakrar, Jayesh", user, "dudu.markov...@microsoft.com" Subject: Re: How to address seemingly low core utilization on a spark workload? Agree, and I will try it. One clarificatio

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
ittle work. Question is what can I do about it. On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote: Can you shed more light on what kind of processing you are doing? One common pattern that I have seen for active core/executor utilization dropping to zero

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
Can you shed more light on what kind of processing you are doing? One common pattern that I have seen for active core/executor utilization dropping to zero is while reading ORC data and the driver seems (I think) to be doing schema validation. In my case I would have hundreds of thousands of
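
If that inference is the bottleneck, one workaround to try is supplying the schema yourself so the driver does not have to open files to derive it — a sketch with a hypothetical two-column schema:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)
    ))

    // With an explicit schema, Spark can skip driver-side schema inference.
    val df = spark.read.schema(schema).orc("/data/orc/path")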

Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-28 Thread Thakrar, Jayesh
Not sure I get what you mean…. I ran the query that you had – and don’t get the same hash as you. From: Gokula Krishnan D Date: Friday, September 28, 2018 at 10:40 AM To: "Thakrar, Jayesh" Cc: user Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value thoug

Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-26 Thread Thakrar, Jayesh
Cannot reproduce your situation. Can you share Spark version? [spark-shell welcome banner: Spark version 2.2.0] Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92) Type
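
A quick check you can run from spark-shell — Spark SQL's hash() is Murmur3-based, so distinct keys should almost always hash differently (the key values below are made up):

    spark.sql("SELECT hash('keyA') AS h1, hash('keyB') AS h2").show()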

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Thakrar, Jayesh
Disclaimer - I use Spark with Scala and not Python. But I am guessing that Jörn's reference to modularization is to ensure that you do the processing inside methods/functions and call those methods sequentially. I believe that as long as an RDD/dataset variable is in scope, its memory may not
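
In Scala terms (the Python translation is mechanical), the modularization idea looks roughly like this — each stage's references go out of scope when the method returns, which makes its cached data eligible for release (the paths are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("staged-job").getOrCreate()

    def stageOne(spark: SparkSession): Unit = {
      val df = spark.read.parquet("/data/in")
      df.cache()
      df.count()      // the stage's real work would go here
      df.unpersist()  // explicitly release the cached blocks
    }                 // df itself goes out of scope here

    stageOne(spark)
    // stageTwo(spark), stageThree(spark), ... called sequentially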

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread Thakrar, Jayesh
com> Date: Monday, May 21, 2018 at 10:20 PM To: ayan guha <guha.a...@gmail.com> Cc: "Thakrar, Jayesh" <jthak...@conversantmedia.com>, user <user@spark.apache.org> Subject: Re: How to skip nonexistent file when read files with spark? Thanks ayan, Also I have tried

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread Thakrar, Jayesh
Probably you can do some preprocessing/checking of the paths before you attempt to read it via Spark. Whether it is local or hdfs filesystem, you can try to check for existence and other details by using the "FileSystem.globStatus" method from the Hadoop API. From: JF Chen
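
A sketch of that pre-check (run from spark-shell; the glob patterns are illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val candidates = Seq("/data/2018/05/20/*", "/data/2018/05/21/*")

    // keep only the patterns that actually match at least one file
    val existing = candidates.filter { p =>
      val matches = fs.globStatus(new Path(p))
      matches != null && matches.nonEmpty
    }

    val df = spark.read.json(existing: _*)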

Re: Spark Monitoring using Jolokia

2018-01-08 Thread Thakrar, Jayesh
And here's some more info on Spark Metrics https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling From: Maximiliano Felice Date: Monday, January 8, 2018 at 8:14 AM To: Irtiza Ali Cc: Subject: Re: Spark

Re: Access to Applications metrics

2017-12-05 Thread Thakrar, Jayesh
You can also get the metrics from the Spark application event log file. See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling From: "Qiao, Richard" Date: Monday, December 4, 2017 at 6:09 PM To: Nick Dimiduk ,

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Thakrar, Jayesh
What you have is sequential code and hence sequential processing. Also, Spark/Scala are not parallel programming languages. But even if they were, statements are executed sequentially unless you exploit the parallel/concurrent execution features. Anyway, see if this works: val (RDD1, RDD2) =
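
i.e. something along these lines, using Scala futures so the two actions are submitted concurrently and Spark can schedule the jobs in parallel (rdd1/rdd2 and the output paths are placeholders):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val f1 = Future { rdd1.saveAsTextFile("/out/one") }
    val f2 = Future { rdd2.saveAsTextFile("/out/two") }

    Await.result(f1, Duration.Inf)
    Await.result(f2, Duration.Inf)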

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-12 Thread Thakrar, Jayesh
Could this be due to https://issues.apache.org/jira/browse/HIVE-6 ? From: Patrik Medvedev Date: Monday, June 12, 2017 at 2:31 AM To: Jörn Franke , vaquar khan Cc: Jean Georges Perrin , User

Re: spark-submit config via file

2017-03-27 Thread Thakrar, Jayesh
Roy - can you check if you have HADOOP_CONF_DIR and YARN_CONF_DIR set to the directory containing the HDFS and YARN configuration files? From: Sandeep Nemuri Date: Monday, March 27, 2017 at 9:44 AM To: Saisai Shao Cc: Yong Zhang
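
For example (the paths vary by distribution — these are typical locations):

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export YARN_CONF_DIR=/etc/hadoop/conf
    spark-submit --master yarn --deploy-mode cluster ...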

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-14 Thread Thakrar, Jayesh
management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable `spark.memory.useLegacyMode` (not recommended). On Mon, Feb 13, 2017 at 11:23 PM, Thakrar, Jayesh <jtha

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread Thakrar, Jayesh
Nancy, As your log output indicated, your executor exceeded its 11 GB memory limit. While you might want to address the root cause/data volume as suggested by Jon, you can do an immediate test by changing your command as follows: spark-shell --master yarn --deploy-mode client --driver-memory 16G
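
A typical shape for such a test command, with illustrative values — bumping executor memory and the YARN overhead is usually what stops the container kills:

    spark-shell --master yarn --deploy-mode client \
      --driver-memory 16G \
      --executor-memory 12G \
      --conf spark.yarn.executor.memoryOverhead=3072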

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Thakrar, Jayesh
Ben, Also look at Phoenix (Apache project), which provides one of the best SQL/JDBC layers on top of HBase. http://phoenix.apache.org/ Cheers, Jayesh From: vincent gromakowski Date: Monday, October 17, 2016 at 1:53 PM To: Benjamin Kim
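
Connecting to it from code is plain JDBC — a sketch, assuming a local ZooKeeper quorum and a hypothetical table name:

    import java.sql.DriverManager

    // The Phoenix JDBC URL points at the HBase ZooKeeper quorum
    val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")
    val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM MY_TABLE")
    while (rs.next()) println(rs.getLong(1))
    conn.close()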

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Thakrar, Jayesh
Yes, iterating over a dataframe and making changes is not uncommon. Of course, RDDs, dataframes and datasets are immutable, but there is some optimization in the optimizer that can potentially help to dampen the effect/impact of creating a new rdd, df or ds. Also, the use-case you cited is
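
As a sketch of that pattern — each pass derives a new (immutable) dataframe, and reassigning the variable is what makes it look like an in-place update (the input path and score column are hypothetical):

    import org.apache.spark.sql.functions.{col, lit}

    var df = spark.read.parquet("/data/in")

    // "update" the dataframe repeatedly; each iteration builds a new plan
    for (_ <- 1 to 3) {
      df = df.withColumn("score", col("score") * lit(1.1))
    }
    df.write.mode("overwrite").parquet("/data/out")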