Re: Databricks Spark Parallelism and Shuffle Partitions

2021-02-04 Thread Subhash Sriram
Hi Erica, On your cluster details, you can click on "Advanced", and then set those parameters in the "Spark" tab. Hope that helps. Thanks, Subhash On Thu, Feb 4, 2021 at 5:27 PM Erica Lin wrote: > Hello! > > Is there a way to set spark.sql.shuffle.partitions > and spark.default.parallelism in
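
A minimal sketch (not part of the original thread) of setting the two properties in code rather than in the cluster UI; the app name is hypothetical, the property names are standard Spark configs:

    import org.apache.spark.sql.SparkSession

    // Set both properties at session startup (hypothetical app name).
    val spark = SparkSession.builder()
      .appName("shuffle-tuning-example")
      .config("spark.sql.shuffle.partitions", "200")  // partitions used by DataFrame/SQL shuffles
      .config("spark.default.parallelism", "200")     // default partition count for RDD operations
      .getOrCreate()

    // spark.sql.shuffle.partitions can also be changed at runtime:
    spark.conf.set("spark.sql.shuffle.partitions", "400")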

Spark events log behavior in interactive vs batch job

2020-08-01 Thread Sriram Ganesh
and this better? -- *Sriram G* *Tech*

optimising cluster performance

2019-12-19 Thread Sriram Bhamidipati
ate the rdd size as ${total count} * ${sample data size} / >> ${sample rdd count} >> >> The code is here >> <https://github.com/kellyzly/sparkcode/blob/master/EstimateDataSetSize.scala#L24> >> . >> >> My question >> 1. can i use above way to solve the problem? If can not, where is wrong? >> 2. Is there any existed solution ( existed API in spark) to solve the >> problem? >> >> >> >> Best Regards >> Kelly Zhang >> >> >> >> > > > -- > -Sriram > -- -Sriram
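
A rough sketch (my own illustration, not from the thread) of the sampling-based estimate discussed above, using Spark's SizeEstimator; the input path and sample fraction are hypothetical, and note this estimates in-memory size, not on-disk size:

    import org.apache.spark.util.SizeEstimator

    // estimated size ≈ totalCount * sampleSizeBytes / sampleCount
    val rdd = spark.sparkContext.textFile("hdfs:///path/to/data")      // hypothetical input
    val totalCount = rdd.count()

    val sample = rdd.sample(withReplacement = false, fraction = 0.01).collect()
    val sampleSizeBytes = SizeEstimator.estimate(sample)               // in-memory size of the sample
    val estimatedBytes = totalCount.toDouble * sampleSizeBytes / sample.length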

Re: How to estimate the rdd size before the rdd result is written to disk

2019-12-19 Thread Sriram Bhamidipati
If can not, where is wrong? > 2. Is there any existed solution ( existed API in spark) to solve the > problem? > > > > Best Regards > Kelly Zhang > > > > -- -Sriram

Re: Monitor executor and task memory getting used

2019-10-24 Thread Sriram Ganesh
I was wrong here. I am using spark standalone cluster and I am not using YARN or MESOS. Is it possible to track spark execution memory?. On Mon, Oct 21, 2019 at 5:42 PM Sriram Ganesh wrote: > I looked into this. But I found it is possible like this > > https://github.com/apache/s

Re: Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Roman, wrote: > Take a look in this thread > <https://stackoverflow.com/questions/48768188/spark-execution-memory-monitoring#_=_> > > On Mon, Oct 21, 2019 at 1:45 PM, Sriram Ganesh () > wrote: > >> Hi, >> >> I want to monitor how much memory each executor and

Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Hi, I want to monitor how much memory each executor and task uses for a given job. Is there a direct method available to track this metric? -- *Sriram G* *Tech*
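
A hedged sketch (not from the thread) of one way to track this with a SparkListener; it logs per-task peak execution memory, which covers shuffle/aggregation memory rather than total JVM usage:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs per-task peak execution memory as tasks finish.
    class MemoryLoggingListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"peakExecutionMemory=${metrics.peakExecutionMemory} bytes")
        }
      }
    }

    // Register on an existing SparkContext:
    // sc.addSparkListener(new MemoryLoggingListener)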

Re: Core allocation is scattered

2019-07-25 Thread Srikanth Sriram
de1=16 cores > and node 2=4 cores . but cores are allocated like node1=2 node > =1-node 14=1 like that. Is there any conf property i need to > change. I know with dynamic allocation we can use below but without dynamic > allocation is there any? > --conf "spark.dynamicAllocation.maxExecutors=2" > > > Thanks > Amit > -- Regards, Srikanth Sriram

Re: Run SQL on files directly

2018-12-08 Thread Subhash Sriram
Hi David, I’m not sure if that is possible, but why not just read the CSV file using the Scala API, specifying those options, and then query it using SQL by creating a temp view? Thanks, Subhash Sent from my iPhone > On Dec 8, 2018, at 12:39 PM, David Markovitz > wrote: > > Hi > Spark SQL
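
A minimal sketch of the suggested approach (the file path, options, and column names are hypothetical): read the CSV with the Scala API, register a temp view, and query it with SQL.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-sql-example").getOrCreate()

    // Hypothetical path and options; adjust to the file in question.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ",")
      .csv("/path/to/data.csv")

    df.createOrReplaceTempView("my_csv")
    val result = spark.sql("SELECT col_a, COUNT(*) AS cnt FROM my_csv GROUP BY col_a")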

JdbcRDD - schema always resolved as nullable=true

2018-08-15 Thread Subhash Sriram
Hi Spark Users, We do a lot of processing in Spark using data that is in MS SQL server. Today, I created a DataFrame against a table in SQL Server using the following: val dfSql=spark.read.jdbc(connectionString, table, props) I noticed that every column in the DataFrame showed as *nullable=true,
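
A hedged workaround sketch (not from the original message): since spark.read.jdbc reports every column as nullable, rebuild the schema with the nullability you know to be correct and re-apply it. The column names below are hypothetical.

    import org.apache.spark.sql.types.{StructField, StructType}

    val dfSql = spark.read.jdbc(connectionString, table, props)

    val notNullCols = Set("id", "created_at")   // columns known to be NOT NULL in SQL Server
    val fixedSchema = StructType(dfSql.schema.map {
      case StructField(name, dataType, _, metadata) =>
        StructField(name, dataType, nullable = !notNullCols.contains(name), metadata)
    })

    // Re-apply the corrected schema to the same rows.
    val dfFixed = spark.createDataFrame(dfSql.rdd, fixedSchema)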

Re: Spark DF to Hive table with both Partition and Bucketing not working

2018-06-19 Thread Subhash Sriram
Hi Umar, Could it be that spark.sql.sources.bucketing.enabled is not set to true? Thanks, Subhash Sent from my iPhone > On Jun 19, 2018, at 11:41 PM, umargeek wrote: > > Hi Folks, > > I am trying to save a spark data frame after reading from ORC file and add > two new columns and finally tr
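
A minimal sketch (names and bucket count hypothetical) of a partitioned plus bucketed write; bucketBy only takes effect with saveAsTable, and the config mentioned above must be enabled for reads to use the bucketing:

    // Enable bucketing for the session.
    spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

    df.write
      .mode("overwrite")
      .partitionBy("load_date")
      .bucketBy(8, "customer_id")
      .sortBy("customer_id")
      .format("orc")
      .saveAsTable("mydb.my_bucketed_table")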

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Subhash Sriram
Hi Raymond, If you set your master to local[*] instead of yarn-client, it should run on your local machine. Thanks, Subhash Sent from my iPhone > On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote: > > Hello, > > I am wondering how can I run spark job in my environment which is a single > Ubu
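
A short sketch of the suggestion, assuming a SparkSession is built in code; local[*] runs everything in a single JVM on the local machine, using one worker thread per core, with no Hadoop or YARN required:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("local-example")
      .master("local[*]")   // run locally instead of yarn-client
      .getOrCreate()

    // With spark-submit, the equivalent flag is: --master local[*]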

Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
're going to have (but it must be set in stone), and then they can > try to pre-optimize the bucket for you. > >> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram >> wrote: >> Hey Spark user community, >> >> I am writing Parquet files from Spark to S3 using S3

Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Hey Spark user community, I am writing Parquet files from Spark to S3 using S3a. I was reading this article about improving S3 bucket performance, specifically about how it can help to introduce randomness to your key names so that data is written to different partitions. https://aws.amazon.com/p
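
A hedged sketch of the key-randomization idea from the linked article (bucket and path are hypothetical): prepend a short random prefix to the key so objects spread across S3 partitions.

    import scala.util.Random

    // Short random prefix per write batch.
    val prefix = Random.alphanumeric.take(4).mkString.toLowerCase

    df.write
      .mode("append")
      .parquet(s"s3a://my-bucket/$prefix/events/")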

Spark JDBC bulk insert

2018-02-01 Thread Subhash Sriram
Hey everyone, I have a use case where I will be processing data in Spark and then writing it back to MS SQL Server. Is it possible to use bulk insert functionality and/or batch the writes back to SQL? I am using the DataFrame API to write the rows: sqlContext.write.jdbc(...) Thanks in advance
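
A minimal sketch of batching the JDBC writes (URL, table, and credentials are hypothetical): the DataFrame JDBC writer groups inserts into batches, and the "batchsize" connection property controls how many rows go per round trip.

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "spark_user")
    props.setProperty("password", "***")
    props.setProperty("batchsize", "10000")   // rows per insert batch

    df.write
      .mode("append")
      .jdbc("jdbc:sqlserver://dbhost:1433;databaseName=mydb", "dbo.target_table", props)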

Re: is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread Subhash Sriram
If you have the temp view name (table, for example), couldn't you do something like this? val dfWithColumn=spark.sql("select *, as new_column from table") Thanks, Subhash On Thu, Feb 1, 2018 at 11:18 AM, kant kodali wrote: > Hi, > > Are you talking about df.withColumn() ? If so, thats not wha
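
The expression in the snippet above is elided by the archive; purely as an illustration of the pattern (not the original code), monotonically_increasing_id() stands in for whatever ID generator is wanted:

    df.createOrReplaceTempView("table")
    val dfWithColumn = spark.sql(
      "select *, monotonically_increasing_id() as new_column from table")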

Re: Writing data in HDFS high available cluster

2018-01-18 Thread Subhash Sriram
Hi Soheil, We have a high availability cluster as well, but I never have to specify the active master when writing, only the cluster name. It works regardless of which node is the active master. Hope that helps. Thanks, Subhash Sent from my iPhone > On Jan 18, 2018, at 5:49 AM, Soheil Pourb
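
A one-line sketch of the point above (nameservice ID and path hypothetical): write against the logical HA cluster name from hdfs-site.xml, not a specific namenode host, and HDFS resolves the active namenode.

    df.write.parquet("hdfs://mycluster/data/output")   // "mycluster" is the nameservice ID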

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Subhash Sriram
There are some more properties specifically for YARN here: http://spark.apache.org/docs/latest/running-on-yarn.html Thanks, Subhash On Wed, Dec 13, 2017 at 2:32 PM, Subhash Sriram wrote: > http://spark.apache.org/docs/latest/configuration.html > > On Wed, Dec 13, 2017 at 2:31 PM, T
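
A hedged sketch of the YARN property that usually governs this: spark.yarn.maxAppAttempts caps the number of application attempts (itself bounded by YARN's yarn.resourcemanager.am.max-attempts), and setting it to 1 disables retries.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("single-attempt-example")
      .config("spark.yarn.maxAppAttempts", "1")   // one attempt, no automatic retries
      .getOrCreate()

    // Or with spark-submit: --conf spark.yarn.maxAppAttempts=1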

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Subhash Sriram
http://spark.apache.org/docs/latest/configuration.html On Wed, Dec 13, 2017 at 2:31 PM, Toy wrote: > Hi, > > Can you point me to the config for that please? > > On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin wrote: > >> On Wed, Dec 13, 2017 at 11:21 AM, Toy wrote: >> > I'm wondering why am I seei

Re: Json to csv

2017-12-12 Thread Subhash Sriram
I was curious about this too, and found this. You may find it helpful: http://www.tegdesign.com/converting-a-nested-json-document-to-csv-using-scala-hadoop-and-apache-spark/ Thanks, Subhash Sent from my iPhone > On Dec 12, 2017, at 1:44 AM, Prabha K wrote: > > Any help on converting json to
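
A small sketch of the general approach in the linked article (field names hypothetical): read the nested JSON, pull each nested field up to a top-level column, then write CSV, since CSV cannot hold nested structures.

    import org.apache.spark.sql.functions.col

    val json = spark.read.json("/path/to/input.json")

    val flat = json.select(
      col("id"),
      col("user.name").as("user_name"),
      col("user.address.city").as("city"))

    flat.write.option("header", "true").csv("/path/to/output_csv")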

Re: Structured Stream in Spark

2017-10-25 Thread Subhash Sriram
No problem! Take a look at this: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing Thanks, Subhash On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi Sriram, > >

Re: Structured Stream in Spark

2017-10-25 Thread Subhash Sriram
Hi Asmath, Here is an example of using structured streaming to read from Kafka: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredKafkaWordCount.scala In terms of parsing the JSON, there is a from_json function that you can use.
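
A hedged sketch combining the two replies in this thread (broker, topic, schema, and paths are hypothetical): read from Kafka, parse the JSON payload with from_json, and set a checkpoint location so the query can recover from failures.

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    val schema = new StructType()
      .add("event_id", StringType)
      .add("payload", StringType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), schema).as("data"))
      .select("data.*")

    val query = parsed.writeStream
      .format("parquet")
      .option("path", "/data/events_out")
      .option("checkpointLocation", "/checkpoints/events")   // enables failure recovery
      .start()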

Re: Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Subhash Sriram
Hi Devender, I have always gone with the 2nd approach, only so I don't have to chain a bunch of ".option()" calls together. You should be able to use either. Thanks, Subhash Sent from my iPhone > On Apr 26, 2017, at 3:26 AM, Devender Yadav > wrote: > Hi All, > > > I am using Spark 1.6.2
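
For illustration, a sketch of the two equivalent approaches being compared (URL, table, and credentials are hypothetical):

    import java.util.Properties

    // 1) option() chaining
    val df1 = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "spark_user")
      .option("password", "***")
      .load()

    // 2) read.jdbc with a Properties object
    val props = new Properties()
    props.setProperty("user", "spark_user")
    props.setProperty("password", "***")
    val df2 = spark.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", "public.my_table", props)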

Re: Concurrent DataFrame.saveAsTable into non-existant tables fails the second job despite Mode.APPEND

2017-04-20 Thread Subhash Sriram
Would it be an option to just write the results of each job into separate tables and then run a UNION on all of them at the end into a final target table? Just thinking of an alternative! Thanks, Subhash Sent from my iPhone > On Apr 20, 2017, at 3:48 AM, Rick Moritz wrote: > > Hi List, > >
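
A hedged sketch of that alternative (table names hypothetical): each concurrent job writes its own staging table, and a final step unions them into the target.

    // In job 1:  resultDf1.write.mode("overwrite").saveAsTable("staging.results_job1")
    // In job 2:  resultDf2.write.mode("overwrite").saveAsTable("staging.results_job2")

    // Final consolidation step, run after both jobs finish:
    val combined = spark.table("staging.results_job1")
      .union(spark.table("staging.results_job2"))

    combined.write.mode("append").saveAsTable("prod.results_final")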

Re: Two Nodes :SparkContext Null Pointer

2017-04-10 Thread Sriram
Fixed it by submitting the second job as a child process. Thanks, Sriram. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Two-Nodes-SparkContext-Null-Pointer-tp28582p28585.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Two Nodes :SparkContext Null Pointer

2017-04-10 Thread Sriram
the shell and when the cron triggers other job launched using java spark launcher from the first job. Both the jobs runs fine on same worker node, but when master chooses different nodes its unable to create a spark context in the second job. Any idea?. Thanks, Sriram. -- View this message in

Re: Assigning a unique row ID

2017-04-07 Thread Subhash Sriram
Hi, We use monotonically_increasing_id() as well, but just cache the table first like Ankur suggested. With that method, we get the same keys in all derived tables. Thanks, Subhash Sent from my iPhone > On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote: > > Hi, > > Thanks, but that's usi
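
A small sketch of the approach described above (column names hypothetical): assign the ID once, cache and materialize, then derive further tables from the cached DataFrame so the keys stay stable.

    import org.apache.spark.sql.functions.monotonically_increasing_id

    val withId = df.withColumn("row_id", monotonically_increasing_id()).cache()
    withId.count()   // materialize the cache before deriving other tables

    val derivedA = withId.filter("status = 'A'")
    val derivedB = withId.filter("status = 'B'")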

Re: spark-sql use case beginner question

2017-03-09 Thread Subhash Sriram
We have a similar use case. We use the DataFrame API to cache data out of Hive tables, and then run pretty complex scripts on them. You can register your Hive UDFs to be used within Spark SQL statements if you want. Something like this: sqlContext.sql("CREATE TEMPORARY FUNCTION as ''") If you h
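
The function name and class in the snippet above are elided by the archive; a hedged illustration with placeholder names (the UDF class must be a real Hive UDF on the classpath, and Hive support must be enabled):

    sqlContext.sql(
      "CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.udf.MyUpper'")

    // The registered function can then be used inside Spark SQL statements:
    val out = sqlContext.sql("SELECT my_upper(name) FROM my_hive_table")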

Re: Spark JDBC reads

2017-03-07 Thread Subhash Sriram
Could you create a view of the table on your JDBC data source and just query that from Spark? Thanks, Subhash Sent from my iPhone > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas wrote: > > As an example, this is basically what I'm doing: > > val myDF = originalDataFrame.select(col(column
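
A related sketch (my own illustration, with hypothetical URL and columns): instead of a database-side view, a subquery can be passed as the "table" so the filtering and projection run on the JDBC source rather than in Spark.

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "spark_user")
    props.setProperty("password", "***")

    val pushed = "(SELECT id, amount FROM big_table WHERE load_date = '2017-03-01') AS t"
    val myDF = spark.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", pushed, props)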

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Subhash Sriram
Hi Allan, Where is the data stored right now? If it's in a relational database, and you are using Spark with Hadoop, I feel like it would make sense to import the data into HDFS, just because it would be faster to access the data. You could use Sqoop to do that. In terms of having a l

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Subhash Sriram
If I am understanding your problem correctly, I think you can just create a new DataFrame that is a transformation of sample_data by first registering sample_data as a temp table. //Register temp table sample_data.createOrReplaceTempView("sql_sample_data") //Create new DataSet with transformed va
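
The snippet above is truncated; a hedged completion of the same pattern, with an illustrative transformation and hypothetical column names:

    // Register temp table
    sample_data.createOrReplaceTempView("sql_sample_data")

    // Create a new DataFrame with transformed values
    val transformed = spark.sql(
      "SELECT id, UPPER(name) AS name_upper FROM sql_sample_data")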