Hi Erica,
On your cluster details, you can click on "Advanced", and then set those
parameters in the "Spark" tab. Hope that helps.
Thanks,
Subhash
On Thu, Feb 4, 2021 at 5:27 PM Erica Lin
wrote:
> Hello!
>
> Is there a way to set spark.sql.shuffle.partitions
> and spark.default.parallelism in
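Besides the cluster UI, those properties can also be set when building the session; a minimal sketch, assuming a standalone Scala job (the values are illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Set the properties from the thread when building the session.
val spark = SparkSession.builder()
  .appName("shuffle-config-example")
  .config("spark.sql.shuffle.partitions", "200")
  .config("spark.default.parallelism", "200")
  .getOrCreate()

// spark.sql.shuffle.partitions can also be changed mid-session:
spark.conf.set("spark.sql.shuffle.partitions", "100")
```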
Hi David,
I’m not sure if that is possible, but why not just read the CSV file using the
Scala API, specifying those options, and then query it using SQL by creating a
temp view?
Thanks,
Subhash
Sent from my iPhone
> On Dec 8, 2018, at 12:39 PM, David Markovitz
> wrote:
>
> Hi
> Spark SQL
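The suggestion above might look like this; a sketch, with the path, options, and query as placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-via-sql").getOrCreate()

// Read the CSV with the Scala API, specifying options explicitly.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv")

// Expose it to SQL through a temp view.
df.createOrReplaceTempView("csv_data")
spark.sql("SELECT COUNT(*) FROM csv_data").show()
```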
Hi Spark Users,
We do a lot of processing in Spark using data that is in MS SQL server.
Today, I created a DataFrame against a table in SQL Server using the
following:
val dfSql=spark.read.jdbc(connectionString, table, props)
I noticed that every column in the DataFrame showed as nullable=true,
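For reference, a fuller sketch of that read (connection details are placeholders):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Placeholder connection details, for illustration only.
val connectionString = "jdbc:sqlserver://myhost:1433;databaseName=mydb"
val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "mypassword")

val dfSql = spark.read.jdbc(connectionString, "dbo.MyTable", props)
dfSql.printSchema() // columns report nullable = true
```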
Hi Umar,
Could it be that spark.sql.sources.bucketing.enabled is not set to true?
Thanks,
Subhash
Sent from my iPhone
> On Jun 19, 2018, at 11:41 PM, umargeek wrote:
>
> Hi Folks,
>
> I am trying to save a spark data frame after reading from ORC file and add
> two new columns and finally tr
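A quick way to check and set that property, plus a sketch of a bucketed save, assuming an existing SparkSession (spark) and DataFrame (df); the bucket count, column, and table names are hypothetical:

```scala
// Check the current value, then enable bucketing if needed.
println(spark.conf.get("spark.sql.sources.bucketing.enabled"))
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

// Hypothetical bucketed save; bucketBy requires saveAsTable.
df.write
  .bucketBy(8, "id")
  .sortBy("id")
  .saveAsTable("bucketed_example")
```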
Hi Raymond,
If you set your master to local[*] instead of yarn-client, it should run on
your local machine.
Thanks,
Subhash
Sent from my iPhone
> On Jun 17, 2018, at 2:32 PM, Raymond Xie wrote:
>
> Hello,
>
> I am wondering how can I run spark job in my environment which is a single
> Ubu
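A minimal sketch of that change, assuming the session is built in code:

```scala
import org.apache.spark.sql.SparkSession

// local[*] runs Spark in-process, using all available cores,
// instead of submitting to a YARN cluster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-run")
  .getOrCreate()
```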
're going to have (but it must be set in stone), and then they can
> try to pre-optimize the bucket for you.
>
>> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram
>> wrote:
>> Hey Spark user community,
>>
>> I am writing Parquet files from Spark to S3 using S3
Hey Spark user community,
I am writing Parquet files from Spark to S3 using S3a. I was reading this
article about improving S3 bucket performance, specifically about how it
can help to introduce randomness to your key names so that data is written
to different partitions.
https://aws.amazon.com/p
Hey everyone,
I have a use case where I will be processing data in Spark and then writing
it back to MS SQL Server.
Is it possible to use bulk insert functionality and/or batch the writes
back to SQL?
I am using the DataFrame API to write the rows:
df.write.jdbc(...)
Thanks in advance
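Spark's JDBC writer does batch its inserts, and the batch size is tunable; a sketch with placeholder connection details, assuming an existing DataFrame df:

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "myuser")          // placeholder
props.setProperty("password", "mypassword")  // placeholder
// Rows per JDBC batch insert; 1000 is the Spark default.
props.setProperty("batchsize", "10000")

df.write
  .mode("append")
  .jdbc("jdbc:sqlserver://myhost:1433;databaseName=mydb", "dbo.TargetTable", props)
```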
If you have the temp view name (table, for example), couldn't you do
something like this?
val dfWithColumn=spark.sql("select *, as new_column from
table")
Thanks,
Subhash
On Thu, Feb 1, 2018 at 11:18 AM, kant kodali wrote:
> Hi,
>
> Are you talking about df.withColumn() ? If so, thats not wha
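With the missing expression filled in (upper(name) here is purely a stand-in), the suggestion above would look like:

```scala
// "table" is the existing temp view; upper(name) is a placeholder
// for whatever expression should populate the new column.
val dfWithColumn = spark.sql(
  "SELECT *, upper(name) AS new_column FROM table")
```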
Hi Soheil,
We have a high availability cluster as well, but I never have to specify the
active master when writing, only the cluster name. It works regardless of which
node is the active master.
Hope that helps.
Thanks,
Subhash
Sent from my iPhone
> On Jan 18, 2018, at 5:49 AM, Soheil Pourb
There are some more properties specifically for YARN here:
http://spark.apache.org/docs/latest/running-on-yarn.html
Thanks,
Subhash
On Wed, Dec 13, 2017 at 2:32 PM, Subhash Sriram
wrote:
> http://spark.apache.org/docs/latest/configuration.html
>
> On Wed, Dec 13, 2017 at 2:31 PM, T
http://spark.apache.org/docs/latest/configuration.html
On Wed, Dec 13, 2017 at 2:31 PM, Toy wrote:
> Hi,
>
> Can you point me to the config for that please?
>
> On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin wrote:
>
>> On Wed, Dec 13, 2017 at 11:21 AM, Toy wrote:
>> > I'm wondering why am I seei
I was curious about this too, and found this. You may find it helpful:
http://www.tegdesign.com/converting-a-nested-json-document-to-csv-using-scala-hadoop-and-apache-spark/
Thanks,
Subhash
Sent from my iPhone
> On Dec 12, 2017, at 1:44 AM, Prabha K wrote:
>
> Any help on converting json to
Thanks. This is what I was looking for.
>
> one question, where do we need to specify the checkpoint directory in case
> of structured streaming?
>
> Thanks,
> Asmath
>
> On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram
> wrote:
>
>> Hi Asmath,
>>
>>
Hi Asmath,
Here is an example of using structured streaming to read from Kafka:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredKafkaWordCount.scala
In terms of parsing the JSON, there is a from_json function that you can
use.
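Putting both pieces together, assuming an existing SparkSession (spark); the broker, topic, and JSON schema are placeholders:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// Hypothetical schema for the JSON payload.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder
  .option("subscribe", "my-topic")                // placeholder
  .load()

// Kafka values arrive as bytes; cast to string, then parse the JSON.
val parsed = kafkaDf
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")
```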
Hi Devender,
I have always gone with the 2nd approach, only so I don't have to chain a bunch
of .option() calls together. You should be able to use either.
Thanks,
Subhash
Sent from my iPhone
> On Apr 26, 2017, at 3:26 AM, Devender Yadav
> wrote:
>
> Hi All,
>
>
> I am using Spark 1.6.2
Would it be an option to just write the results of each job into separate
tables and then run a UNION on all of them at the end into a final target
table? Just thinking of an alternative!
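Sketched out, with hypothetical table names:

```scala
// Each job writes its own results table; a final step unions them.
val combined = spark.table("results_job1")
  .union(spark.table("results_job2"))
  .union(spark.table("results_job3"))

combined.write.mode("overwrite").saveAsTable("results_final")
```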
Thanks,
Subhash
Sent from my iPhone
> On Apr 20, 2017, at 3:48 AM, Rick Moritz wrote:
>
> Hi List,
>
>
Hi,
We use monotonically_increasing_id() as well, but just cache the table first
like Ankur suggested. With that method, we get the same keys in all derived
tables.
Thanks,
Subhash
Sent from my iPhone
> On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote:
>
> Hi,
>
> Thanks, but that's usi
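The cache-first pattern looks roughly like this, assuming an existing DataFrame df:

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// Cache after assigning IDs so every derived table sees the same keys;
// without the cache, re-evaluation can produce different IDs.
val withIds = df
  .withColumn("row_id", monotonically_increasing_id())
  .cache()

val derivedA = withIds.filter("category = 'a'") // placeholder predicates
val derivedB = withIds.filter("category = 'b'")
```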
We have a similar use case. We use the DataFrame API to cache data out of
Hive tables, and then run pretty complex scripts on them. You can register
your Hive UDFs to be used within Spark SQL statements if you want.
Something like this:
sqlContext.sql("CREATE TEMPORARY FUNCTION as ''")
If you h
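With hypothetical names filled in, the registration and use would look like:

```scala
// Function name and implementing class are hypothetical.
sqlContext.sql(
  "CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUdf'")

sqlContext.sql("SELECT my_udf(some_column) FROM some_hive_table").show()
```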
Could you create a view of the table on your JDBC data source and just query
that from Spark?
Thanks,
Subhash
Sent from my iPhone
> On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas wrote:
>
> As an example, this is basically what I'm doing:
>
> val myDF = originalDataFrame.select(col(column
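Besides a database-side view, the JDBC source also accepts an inline subquery as the table name; a sketch, with placeholder connection details and query:

```scala
// The parenthesized query runs on the database side, so only the
// projected/filtered rows cross the wire. Names are placeholders.
val viewDf = spark.read.jdbc(
  connectionString,
  "(SELECT col1, col2 FROM my_table WHERE active = 1) AS pushed_down",
  props)
```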
Hi Allan,
Where is the data stored right now? If it's in a relational database, and you
are using Spark with Hadoop, I feel like it would make sense to import the data
into HDFS, just because it would be faster to access. You could use Sqoop to do
that.
In terms of having a l
If I am understanding your problem correctly, I think you can just create a
new DataFrame that is a transformation of sample_data by first registering
sample_data as a temp table.
//Register temp table
sample_data.createOrReplaceTempView("sql_sample_data")
//Create new DataSet with transformed values
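A hypothetical completion of that last step (the upper(name) expression is a stand-in):

```scala
val transformed = spark.sql(
  "SELECT id, upper(name) AS name_upper FROM sql_sample_data")
```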