Re: Databricks Spark Parallelism and Shuffle Partitions

2021-02-04 Thread Subhash Sriram
Hi Erica, On your cluster details, you can click on "Advanced", and then set those parameters in the "Spark" tab. Hope that helps. Thanks, Subhash On Thu, Feb 4, 2021 at 5:27 PM Erica Lin wrote: > Hello! > > Is there a way to set spark.sql.shuffle.partitions > and spark.default.parallelism in
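The same values can also be set programmatically when the session is built; a minimal sketch, where the numbers below are only placeholders to be tuned for the cluster:

    import org.apache.spark.sql.SparkSession

    // Placeholder values; tune for cluster size and data volume.
    val spark = SparkSession.builder()
      .appName("parallelism-example")
      .config("spark.sql.shuffle.partitions", "200")
      .config("spark.default.parallelism", "200")
      .getOrCreate()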

Re: Run SQL on files directly

2018-12-08 Thread Subhash Sriram
Hi David, I’m not sure if that is possible, but why not just read the CSV file using the Scala API, specifying those options, and then query it using SQL by creating a temp view? Thanks, Subhash Sent from my iPhone > On Dec 8, 2018, at 12:39 PM, David Markovitz > wrote: > > Hi > Spark
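A minimal sketch of that suggestion, assuming an existing SparkSession named spark; the path, options, and view name are hypothetical:

    // Read the CSV with whatever options the file needs (these are examples).
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/input.csv")

    // Expose it to SQL through a temp view and query it.
    df.createOrReplaceTempView("csv_data")
    val result = spark.sql("SELECT * FROM csv_data WHERE some_column > 10")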

JdbcRDD - schema always resolved as nullable=true

2018-08-15 Thread Subhash Sriram
Hi Spark Users, We do a lot of processing in Spark using data that is in MS SQL server. Today, I created a DataFrame against a table in SQL Server using the following: val dfSql=spark.read.jdbc(connectionString, table, props) I noticed that every column in the DataFrame showed as
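For context, a sketch of the read plus a quick way to inspect the inferred nullability; the connection string, table, and credentials are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "...")      // placeholder credentials
    props.setProperty("password", "...")

    val connectionString = "jdbc:sqlserver://host:1433;databaseName=mydb" // placeholder
    val dfSql = spark.read.jdbc(connectionString, "dbo.my_table", props)

    // Every field is reported as nullable = true, regardless of the SQL Server definition.
    dfSql.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))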

Re: Spark DF to Hive table with both Partition and Bucketing not working

2018-06-19 Thread Subhash Sriram
Hi Umar, Could it be that spark.sql.sources.bucketing.enabled is not set to true? Thanks, Subhash Sent from my iPhone > On Jun 19, 2018, at 11:41 PM, umargeek wrote: > > Hi Folks, > > I am trying to save a spark data frame after reading from ORC file and add > two new columns and finally
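A sketch of enabling that flag and the intended writer calls for a partitioned, bucketed table, assuming an existing DataFrame df; the column and table names are made up:

    spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

    df.write
      .mode("overwrite")
      .partitionBy("load_date")         // hypothetical partition column
      .bucketBy(8, "customer_id")       // hypothetical bucket column and count
      .sortBy("customer_id")
      .saveAsTable("analytics.orders")  // bucketBy requires saveAsTable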

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Subhash Sriram
Hi Raymond, If you set your master to local[*] instead of yarn-client, it should run on your local machine. Thanks, Subhash Sent from my iPhone > On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote: > > Hello, > > I am wondering how can I run spark job in my environment which is a single >
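For illustration, the same thing expressed in code when building the session; the app name is arbitrary:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("local-run")
      .master("local[*]")   // use all local cores; no YARN or Hadoop installation required
      .getOrCreate()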

Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
…are going to have (but it must be set in stone), and then they can > try to pre-optimize the bucket for you. > >> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram <subhash.sri...@gmail.com> >> wrote: >> Hey Spark user community, >> >> I am writing Parquet fi

Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Hey Spark user community, I am writing Parquet files from Spark to S3 using S3a. I was reading this article about improving S3 bucket performance, specifically about how introducing randomness into your key names can help, so that data is written to different partitions.
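As an illustration of the idea in that article, a sketch that prepends a short random prefix to the S3 key, assuming an existing DataFrame df; the bucket and path are hypothetical:

    import scala.util.Random

    // A short random prefix spreads writes across different S3 key ranges.
    val prefix = Random.alphanumeric.take(4).mkString
    df.write
      .mode("append")
      .parquet(s"s3a://my-bucket/$prefix/events/")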

Spark JDBC bulk insert

2018-02-01 Thread Subhash Sriram
Hey everyone, I have a use case where I will be processing data in Spark and then writing it back to MS SQL Server. Is it possible to use bulk insert functionality and/or batch the writes back to SQL? I am using the DataFrame API to write the rows: df.write.jdbc(...) Thanks in advance
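A sketch of batching the writes through the DataFrame writer, assuming an existing DataFrame df; the batchsize value and connection details are placeholders, and true BULK INSERT semantics would need a SQL Server-specific connector:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "...")      // placeholder credentials
    props.setProperty("password", "...")

    df.write
      .mode("append")
      .option("batchsize", "10000")       // rows per JDBC batch; tune as needed
      .jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "dbo.target_table", props)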

Re: is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread Subhash Sriram
If you have the temp view name (table, for example), couldn't you do something like this? val dfWithColumn=spark.sql("select *, as new_column from table") Thanks, Subhash On Thu, Feb 1, 2018 at 11:18 AM, kant kodali wrote: > Hi, > > Are you talking about df.withColumn() ?
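One way to fill in that expression is a registered UDF; a sketch, assuming an existing DataFrame df, where the UDF name is made up and java.util.UUID produces a random UUID rather than a Cassandra-style timeuuid:

    // Register a UDF that returns a UUID string, then call it from raw SQL.
    spark.udf.register("make_uuid", () => java.util.UUID.randomUUID().toString)

    df.createOrReplaceTempView("table_view")
    val dfWithColumn = spark.sql("SELECT *, make_uuid() AS new_column FROM table_view")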

Re: Writing data in HDFS high available cluster

2018-01-18 Thread Subhash Sriram
Hi Soheil, We have a high availability cluster as well, but I never have to specify the active master when writing, only the cluster name. It works regardless of which node is the active master. Hope that helps. Thanks, Subhash Sent from my iPhone > On Jan 18, 2018, at 5:49 AM, Soheil
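For illustration, a write that targets the HA nameservice rather than a specific namenode; the nameservice name "mycluster" is hypothetical and comes from dfs.nameservices in hdfs-site.xml:

    // Client-side failover resolves "mycluster" to whichever namenode is active.
    df.write
      .mode("overwrite")
      .parquet("hdfs://mycluster/data/output/")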

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Subhash Sriram
There are some more properties specifically for YARN here: http://spark.apache.org/docs/latest/running-on-yarn.html Thanks, Subhash On Wed, Dec 13, 2017 at 2:32 PM, Subhash Sriram <subhash.sri...@gmail.com> wrote: > http://spark.apache.org/docs/latest/configuration.html > >
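The property on that page most relevant to the number of attempts is spark.yarn.maxAppAttempts; a sketch of capping it at one, still subject to the cluster's yarn.resourcemanager.am.max-attempts limit:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("yarn-attempts-example")
      .config("spark.yarn.maxAppAttempts", "1") // fail fast instead of retrying the application master
      .getOrCreate()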

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Subhash Sriram
http://spark.apache.org/docs/latest/configuration.html On Wed, Dec 13, 2017 at 2:31 PM, Toy wrote: > Hi, > > Can you point me to the config for that please? > > On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin wrote: > >> On Wed, Dec 13, 2017 at 11:21 AM,

Re: Json to csv

2017-12-12 Thread Subhash Sriram
I was curious about this too, and found this. You may find it helpful: http://www.tegdesign.com/converting-a-nested-json-document-to-csv-using-scala-hadoop-and-apache-spark/ Thanks, Subhash Sent from my iPhone > On Dec 12, 2017, at 1:44 AM, Prabha K wrote: > > Any
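The approach in that article boils down to flattening nested fields with column expressions before writing CSV; a sketch with made-up paths and field names:

    import org.apache.spark.sql.functions.col

    val json = spark.read.json("/data/input.json")

    // Pull nested struct fields up to the top level so they fit a flat CSV.
    val flat = json.select(
      col("id"),
      col("user.name").as("user_name"),
      col("user.address.city").as("city")
    )

    flat.write.option("header", "true").csv("/data/output_csv")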

Re: Structured Stream in Spark

2017-10-25 Thread Subhash Sriram
> Thanks. This is what I was looking for. > > one question, where do we need to specify the checkpoint directory in case > of structured streaming? > > Thanks, > Asmath > > On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram <subhash.sri...@gmail.com> > wrote: > >>
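On the checkpoint question, it goes on the streaming writer; a minimal, self-contained sketch where the rate source, sink path, and checkpoint path are only placeholders:

    // The rate source just generates rows for illustration.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    val query = stream.writeStream
      .format("parquet")
      .option("path", "/data/stream_out")                       // hypothetical sink path
      .option("checkpointLocation", "/checkpoints/stream_out")  // where offsets and state are stored
      .start()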

Re: Structured Stream in Spark

2017-10-25 Thread Subhash Sriram
Hi Asmath, Here is an example of using structured streaming to read from Kafka: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredKafkaWordCount.scala In terms of parsing the JSON, there is a from_json function that you can
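A sketch of both pieces together, reading from Kafka and parsing the payload with from_json; the broker, topic, and message schema are hypothetical:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    // Hypothetical schema for the JSON messages on the topic.
    val schema = new StructType().add("id", StringType).add("event", StringType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("data"))
      .select("data.*")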

Re: Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Subhash Sriram
Hi Devender, I have always gone with the 2nd approach, only so I don't have to chain a bunch of ".option()" calls together. You should be able to use either. Thanks, Subhash Sent from my iPhone > On Apr 26, 2017, at 3:26 AM, Devender Yadav > wrote: > > Hi All,
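For comparison, a sketch of both styles; the URL, table name, and credentials are placeholders:

    import java.util.Properties

    // Approach 1: chained .option() calls.
    val df1 = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "...")
      .option("password", "...")
      .load()

    // Approach 2: a Properties object passed to read.jdbc().
    val props = new Properties()
    props.setProperty("user", "...")
    props.setProperty("password", "...")
    val df2 = spark.read.jdbc("jdbc:postgresql://host:5432/mydb", "public.my_table", props)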

Re: Concurrent DataFrame.saveAsTable into non-existant tables fails the second job despite Mode.APPEND

2017-04-20 Thread Subhash Sriram
Would it be an option to just write the results of each job into separate tables and then run a UNION on all of them at the end into a final target table? Just thinking of an alternative! Thanks, Subhash Sent from my iPhone > On Apr 20, 2017, at 3:48 AM, Rick Moritz wrote:
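A sketch of that alternative, assuming each job has already written its own staging table; all table names here are made up:

    // One final job unions the per-job outputs into the target table.
    val combined = spark.table("staging.results_job1")
      .union(spark.table("staging.results_job2"))

    combined.write.mode("append").saveAsTable("final.results")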

Re: Assigning a unique row ID

2017-04-07 Thread Subhash Sriram
Hi, We use monotonically_increasing_id() as well, but just cache the table first like Ankur suggested. With that method, we get the same keys in all derived tables. Thanks, Subhash Sent from my iPhone > On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote: > > Hi,
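A sketch of that pattern, assuming an existing DataFrame df; the column and field names are arbitrary:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Cache after assigning the ID so every derived DataFrame sees the same values.
    val withId = df.withColumn("row_id", monotonically_increasing_id()).cache()
    withId.count() // materialize the cache

    val derivedA = withId.select("row_id", "col_a")
    val derivedB = withId.select("row_id", "col_b")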

Re: spark-sql use case beginner question

2017-03-09 Thread Subhash Sriram
We have a similar use case. We use the DataFrame API to cache data out of Hive tables, and then run pretty complex scripts on them. You can register your Hive UDFs to be used within Spark SQL statements if you want. Something like this: sqlContext.sql("CREATE TEMPORARY FUNCTION as ''") If you
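The elided statement follows the usual Hive syntax; a sketch with placeholder names for the function and its implementing class, assuming the jar is already on the classpath:

    // Register an existing Hive UDF class so it can be called from Spark SQL.
    sqlContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUdf'")

    val scored = sqlContext.sql("SELECT my_udf(some_column) AS scored_value FROM my_hive_table")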

Re: Spark JDBC reads

2017-03-07 Thread Subhash Sriram
Could you create a view of the table on your JDBC data source and just query that from Spark? Thanks, Subhash Sent from my iPhone > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas wrote: > > As an example, this is basically what I'm doing: > > val myDF =
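A sketch of that idea, pointing the JDBC read at a database-side view instead of the base table; the view name and connection details are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "...")      // placeholder credentials
    props.setProperty("password", "...")

    // "dbo.my_filtered_view" is a view defined on the database side that does the heavy lifting.
    val viewDF = spark.read.jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "dbo.my_filtered_view", props)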

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Subhash Sriram
Hi Allan, Where is the data stored right now? If it's in a relational database, and you are using Spark with Hadoop, I feel like it would make sense to import the data into HDFS, just because it would be faster to access the data. You could use Sqoop to do that. In terms of having a

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Subhash Sriram
If I am understanding your problem correctly, I think you can just create a new DataFrame that is a transformation of sample_data by first registering sample_data as a temp table. //Register temp table sample_data.createOrReplaceTempView("sql_sample_data") //Create new DataSet with transformed
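Continuing that sketch past where the snippet cuts off; the column names and the SQL transformation itself are hypothetical:

    // Register the DataFrame so it can be queried with SQL.
    sample_data.createOrReplaceTempView("sql_sample_data")

    // A hypothetical transformation expressed over the temp view.
    val transformed = spark.sql(
      "SELECT id, UPPER(name) AS name_upper FROM sql_sample_data WHERE value > 0")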