Is there any Job/Career channel

2023-01-15 Thread Chetan Khatri
Hi Spark Users, Is there any Job/Career channel for Apache Spark? Thank you

to find Difference of locations in Spark Dataframe rows

2022-06-07 Thread Chetan Khatri
Hi Dear Spark Users, It has been many years since I worked on Spark, please help me. Thanks much. I have different cities and their co-ordinates in DataFrame[Row], I want to find the distance in KMs and then show only those records/cities which are 10 KMs far. I have a function created that can
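
A minimal sketch of one way to do this, assuming the coordinate pairs already sit in columns named lat1/lon1/lat2/lon2 of a hypothetical cityPairsDF: a haversine UDF computes the distance in kilometres and a filter keeps the rows that are more than 10 km apart.

    import org.apache.spark.sql.functions._

    // Haversine distance in kilometres between two (lat, lon) points.
    val haversineKm = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
      val r = 6371.0
      val dLat = math.toRadians(lat2 - lat1)
      val dLon = math.toRadians(lon2 - lon1)
      val a = math.pow(math.sin(dLat / 2), 2) +
        math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
      2 * r * math.asin(math.sqrt(a))
    }

    val withDistance = cityPairsDF.withColumn("distance_km",
      haversineKm(col("lat1"), col("lon1"), col("lat2"), col("lon2")))
    withDistance.filter(col("distance_km") > 10).show(false)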

Need help on migrating Spark on Hortonworks to Kubernetes Cluster

2022-05-08 Thread Chetan Khatri
Hi Everyone, I need help with my Airflow DAG which has a Spark Submit, and now I have a Kubernetes cluster instead of the Hortonworks Linux distributed Spark cluster. My existing spark-submit is through BashOperator as below: calculation1 = '/usr/hdp/2.6.5.0-292/spark2/bin/spark-submit --conf
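
For reference, a hedged sketch of the equivalent submission against Kubernetes (the API server address, namespace, image name and jar path are placeholders); the application-specific --conf flags from the existing command carry over unchanged:

    spark-submit \
      --master k8s://https://<k8s-api-server>:6443 \
      --deploy-mode cluster \
      --name calculation1 \
      --conf spark.kubernetes.namespace=<namespace> \
      --conf spark.kubernetes.container.image=<spark-image> \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      local:///opt/app/calculation1.jar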

Re: Usage of DropDuplicate in Spark

2021-06-22 Thread Chetan Khatri
I am looking for any built-in API if at all exists? On Tue, Jun 22, 2021 at 1:16 PM Chetan Khatri wrote: > this has been very slow > > On Tue, Jun 22, 2021 at 1:15 PM Sachit Murarka > wrote: > >> Hi Chetan, >> >> You can substract the data frame or use excep

Re: Usage of DropDuplicate in Spark

2021-06-22 Thread Chetan Khatri
gt; > hope this helps > > Thanks > Sachit > > On Tue, Jun 22, 2021, 22:23 Chetan Khatri > wrote: > >> Hi Spark Users, >> >> I want to use DropDuplicate, but those records which I discard. I >> would like to log to the instrumental table. >> >> What would be the best approach to do that? >> >> Thanks >> >

Usage of DropDuplicate in Spark

2021-06-22 Thread Chetan Khatri
Hi Spark Users, I want to use DropDuplicate, but I would like to log the records which it discards to the instrumental table. What would be the best approach to do that? Thanks
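
There is no built-in flag on dropDuplicates for this; a common workaround (a sketch, assuming a key column "id" and hypothetical table names) is to diff the de-duplicated frame against the original and persist the difference, as the replies in this thread suggest:

    val deduped   = sourceDF.dropDuplicates("id")
    // Rows that dropDuplicates discarded (exceptAll keeps duplicate copies, unlike except).
    val discarded = sourceDF.exceptAll(deduped)

    discarded.write.mode("append").saveAsTable("instrumentation.dropped_records")
    deduped.write.mode("overwrite").saveAsTable("clean_records")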

Re: Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
May 5, 2021 at 10:15 PM Chetan Khatri wrote: > Hi All, Collect in spark is taking huge time. I want to get list of values > of one column to Scala collection. How can I do this? > val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF > .select(col("

Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
Hi All, Collect in spark is taking a huge amount of time. I want to get the list of values of one column into a Scala collection. How can I do this? val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF .select(col("reporting_table")).except(clientSchemaDF)
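
A minimal sketch using the column name from the snippet above; the typed collect brings the values back to the driver as a Scala collection, so it is only appropriate when the result is small:

    import org.apache.spark.sql.functions.col
    import spark.implicits._

    // collect() materialises the values on the driver; the time is usually dominated by
    // the upstream plan (here the except), not by collect itself.
    val reportingTables: Seq[String] = newDynamicFieldTablesDF
      .select(col("reporting_table"))
      .as[String]
      .collect()
      .toSeq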

DropNa in Spark for Columns

2021-02-26 Thread Chetan Khatri
Hi Users, What is the equivalent of *df.dropna(axis='columns')* of Pandas in Spark/Scala? Thanks
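
There is no single built-in equivalent in Spark; a hedged sketch that reproduces pandas' dropna(axis='columns') (drop every column containing at least one null), assuming the input DataFrame is df:

    import org.apache.spark.sql.functions._

    // One aggregation row holding the null count of every column.
    val nullCounts = df.select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*).first()
    val keepCols   = df.columns.filter(c => nullCounts.getAs[Long](c) == 0L)
    val result     = df.select(keepCols.map(col): _*)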

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Thanks, you meant in a for loop. could you please put pseudocode in spark On Fri, Jun 19, 2020 at 8:39 AM Jörn Franke wrote: > Make every json object a line and then read t as jsonline not as multiline > > Am 19.06.2020 um 14:37 schrieb Chetan Khatri >: > >  > All

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
All transactions in JSON, It is not a single array. On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner wrote: > It's an interesting problem. What is the structure of the file? One big > array? On hash with many key-value pairs? > > Stephan > > On Thu, Jun 18, 2020 at 6:12 AM Chet

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Yes On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta wrote: > Hi, > So you have a single JSON record in multiple lines? > And all the 50 GB is in one file? > > Regards, > Gourav > > On Thu, 18 Jun 2020, 14:34 Chetan Khatri, > wrote: > >> It is dynamicall

Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
> a role. > > Otherwise if it is one large object or array I would not recommend it. > > > Am 18.06.2020 um 15:12 schrieb Chetan Khatri < > chetan.opensou...@gmail.com>: > > > >  > > Hi Spark Users, > > > > I have a 50GB of JSON file, I wou

Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
u can use your executors to > perform the reading instead of the driver. > > On Thu, Jun 18, 2020 at 9:12 AM Chetan Khatri > wrote: > >> Hi Spark Users, >> >> I have a 50GB of JSON file, I would like to read and persist at HDFS so >> it can be tak

Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
Hi Spark Users, I have a 50GB JSON file that I would like to read and persist to HDFS so it can be taken into the next transformation. I am trying to read it as spark.read.json(path) but this gives an Out of memory error on the driver. Obviously, I can't afford having 50 GB of driver memory. In general,
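
The suggestions in this thread boil down to using line-delimited JSON so the read can be split across executors instead of being assembled on the driver; a hedged sketch (paths are placeholders):

    // JSON Lines (one object per line) is splittable; multiLine JSON is not.
    val df = spark.read
      .option("multiLine", "false")   // the default, shown for clarity
      .json("hdfs:///data/input/events.jsonl")

    df.write.mode("overwrite").parquet("hdfs:///data/output/events_parquet")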

Join on Condition provide at run time

2020-06-02 Thread Chetan Khatri
Hi Spark Users, How can I provide the join ON condition at run time in the form of a String to the code? Can someone please help me?
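
A minimal sketch: a SQL boolean expression supplied as a String (for example read from configuration) can be turned into a join condition with expr(); the aliases and column names here are hypothetical:

    import org.apache.spark.sql.functions.expr

    val joinCondition = "a.customer_id = b.customer_id AND a.order_date >= b.valid_from"
    val joined = leftDF.as("a").join(rightDF.as("b"), expr(joinCondition), "inner")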

Re: Calling HTTP Rest APIs from Spark Job

2020-05-15 Thread Chetan Khatri
tor slots. But even then, you're > welcome to, say, use thread pools to execute even more concurrently as > most are I/O bound. Your code can do what you want. > > On Thu, May 14, 2020 at 6:14 PM Chetan Khatri > wrote: > > > > Thanks, that means number of executor = numb

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
ber of executor slots. Yes you can only > simultaneously execute as many tasks as slots regardless of partitions. > > On Thu, May 14, 2020, 5:19 PM Chetan Khatri > wrote: > >> Thanks Sean, Jerry. >> >> Default Spark DataFrame partitions are 200 right? does it have >>

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
you do this within the context of an operation that is > already parallelized such as a map, the work will be distributed to > executors and they will do it in parallel. I could be wrong about this as I > never investigated this specific use case, though. > > > > On Thu, May 1

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
> > On Thu, May 14, 2020 at 5:03 PM Chetan Khatri > wrote: > >> Hi Spark Users, >> >> How can I invoke the Rest API call from Spark Code which is not only >> running on Spark Driver but distributed / parallel? >> >> Spark with Scala is my tech stack. >> >> Thanks >> >> >> > > -- > http://www.google.com/profiles/grapesmoker >

Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Hi Spark Users, How can I invoke REST API calls from Spark code so that they are not only running on the Spark Driver but distributed / parallel? Spark with Scala is my tech stack. Thanks
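
One hedged approach: run the HTTP calls inside a partition-wise transformation so they execute on the executors rather than on the driver (the endpoint and the input Dataset of ids are hypothetical):

    import spark.implicits._
    import scala.io.Source

    // Each partition is handled by an executor task, so calls run in parallel across the cluster.
    val responses = idsDF.as[String].mapPartitions { ids =>
      ids.map { id =>
        val body = Source.fromURL(s"https://api.example.com/items/$id").mkString   // hypothetical endpoint
        (id, body)
      }
    }.toDF("id", "response")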

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-13 Thread Chetan Khatri
x.html#xpath>, there > appears to be no reason to expect that to work. > > On Tue, May 12, 2020 at 2:09 PM Chetan Khatri > wrote: > >> Can someone please help.. Thanks in advance. >> >> On Mon, May 11, 2020 at 5:29 PM Chetan Khatri < >> chetan.opensou..

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-12 Thread Chetan Khatri
che.org/docs/2.4.5/api/sql/index.html#xpath>, there > appears to be no reason to expect that to work. > > On Tue, May 12, 2020 at 2:09 PM Chetan Khatri > wrote: > >> Can someone please help.. Thanks in advance. >> >> On Mon, May 11, 2020 at 5:29 PM Chetan Khatri &l

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-12 Thread Chetan Khatri
Can someone please help.. Thanks in advance. On Mon, May 11, 2020 at 5:29 PM Chetan Khatri wrote: > Hi Spark Users, > > I want to parse xml coming in the query columns and get the value, I am > using *xpath_int* which works as per my requirement but When I am > embedding in the

XPATH_INT behavior - XML - Function in Spark

2020-05-11 Thread Chetan Khatri
Hi Spark Users, I want to parse xml coming in the query columns and get the value. I am using *xpath_int*, which works as per my requirement, but when I am embedding it in the Spark SQL query columns it is failing. select timesheet_profile_id, *xpath_int(timesheet_profile_code,

Re: AnalysisException - Infer schema for the Parquet path

2020-05-11 Thread Chetan Khatri
; >> >> http://talebzadehmich.wordpress.com >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical conten

AnalysisException - Infer schema for the Parquet path

2020-05-09 Thread Chetan Khatri
Hi Spark Users, I've a spark job where I am reading the parquet path, and that parquet path data is generated by other systems; some of the parquet paths don't contain any data, which is possible. Is there any way to read the parquet so that if no data is found I can create a dummy dataframe and go
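
A hedged sketch of the fallback described above: try the read, and on failure (missing path, or nothing to infer a schema from) return an empty DataFrame with a known schema so the job can continue:

    import scala.util.{Failure, Success, Try}
    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.StructType

    def readParquetOrEmpty(path: String, schema: StructType): DataFrame =
      Try(spark.read.schema(schema).parquet(path)) match {
        case Success(df) => df
        case Failure(_)  => spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      }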

Re: How can I add extra mounted disk to HDFS

2020-04-30 Thread Chetan Khatri
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html> > > > > Le mar. 28 avr. 2020 à 23:18, Chetan Khatri > a écrit : > >> Hi Spark Users, >> >> My spark job gave me an error No Space left on the device >> >

Re: Read Hive ACID Managed table in Spark

2020-04-28 Thread Chetan Khatri
> to drop me a mail or raise an issue here: > https://github.com/qubole/spark-acid/issues > > Regards, > Amogh > > On Tue, Mar 10, 2020 at 4:20 AM Chetan Khatri > wrote: > >> Hi Venkata, >> Thanks for your reply. I am using HDP 2.6 and I don't think above wi

Re: Unable to get to_timestamp with Timezone Information

2020-04-28 Thread Chetan Khatri
gt; through strings, you should format them as above. If you use Dataset.map > you can access the timestamps as java.sql.Timestamp objects (but that might > not be necessary): > > import java.sql.Timestamp > case class Times(value: Timestamp) > timestampDF.as[Times].map(t => t.va

How can I add extra mounted disk to HDFS

2020-04-28 Thread Chetan Khatri
Hi Spark Users, My spark job gave me an error No Space left on the device

Re: Unable to get to_timestamp with Timezone Information

2020-03-31 Thread Chetan Khatri
ng (non >> ISO8601) strings and showing timestamps. >> >> br, >> >> Magnus >> >> On Tue, Mar 31, 2020 at 6:14 PM Chetan Khatri < >> chetan.opensou...@gmail.com> wrote: >> >>> Hi Spark Users, >>> >>> I am losi

Unable to get to_timestamp with Timezone Information

2020-03-31 Thread Chetan Khatri
Hi Spark Users, I am losing the timezone value from the below format; I tried a couple of formats but was not able to make it work. Can someone throw some light? scala> val sampleDF = Seq("2020-04-11T20:40:00-0500").toDF("value") sampleDF: org.apache.spark.sql.DataFrame = [value: string] scala>
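
For reference, a hedged sketch: the offset can be parsed with a pattern that includes it, but Spark timestamps are stored as UTC instants and rendered in the session time zone, so the literal "-0500" is not kept on the column itself; if the original offset must be preserved, keep it in a separate string column.

    import spark.implicits._
    import org.apache.spark.sql.functions._

    val sampleDF = Seq("2020-04-11T20:40:00-0500").toDF("value")
    val parsed = sampleDF.withColumn("ts", to_timestamp(col("value"), "yyyy-MM-dd'T'HH:mm:ssZ"))

    // Controls how the instant is rendered; it does not change the stored value.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    parsed.show(false)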

Re: Best Practice: Evaluate Expression from Spark DataFrame Column

2020-03-28 Thread Chetan Khatri
ot;), (1, 0, "miki", "NUM IS NOT NULL AND FLAG!=0 AND WORD == 'MIKI'") ).toDF("num", "flag", "word", "expression") val derivedDF = sampleDF.withColumn("status", expr(sampleDF.col("expression").as[String].toString())

Best Practice: Evaluate Expression from Spark DataFrame Column

2020-03-27 Thread Chetan Khatri
Hi Spark Users, I want to evaluate an expression from dataframe column values on other columns in the same dataframe for each row. Please suggest the best approach to deal with this, given that it should not impact the performance of the job. Thanks Sample code: val sampleDF = Seq( (8, 1, "bat", "NUM IS
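
One hedged approach when the set of distinct expression strings is small: collect them, then fold them into a single when/otherwise column so each row is evaluated against its own expression (expr() needs a literal string, which is why the per-row column cannot be passed to it directly):

    import spark.implicits._
    import org.apache.spark.sql.functions._

    val expressions = sampleDF.select("expression").distinct().as[String].collect()

    // One conditional branch per distinct expression string.
    val status = expressions.foldLeft(lit(null).cast("boolean")) { (acc, e) =>
      when(col("expression") === e, expr(e)).otherwise(acc)
    }
    val derivedDF = sampleDF.withColumn("status", status)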

Re: Read Hive ACID Managed table in Spark

2020-03-10 Thread Chetan Khatri
cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html > > On Thu, Mar 5, 2020, 6:51 AM Chetan Khatri > wrote: > >> Just followup, if anyone has worried on this before >> >> On Wed, Mar 4, 2020 at

Re: Read Hive ACID Managed table in Spark

2020-03-05 Thread Chetan Khatri
Just followup, if anyone has worried on this before On Wed, Mar 4, 2020 at 12:09 PM Chetan Khatri wrote: > Hi Spark Users, > I want to read Hive ACID managed table data (ORC) in Spark. Can someone > help me here. > I've tried, https://github.com/qubole/spark-acid but no success. > > Thanks >

Read Hive ACID Managed table in Spark

2020-03-04 Thread Chetan Khatri
Hi Spark Users, I want to read Hive ACID managed table data (ORC) in Spark. Can someone help me here? I've tried https://github.com/qubole/spark-acid but no success. Thanks

Re: Compute the Hash of each row in new column

2020-03-02 Thread Chetan Khatri
(ds.columns.map(col) ++ ds.columns.map(column => > md5(col(column)).as(s"$column hash")): _*).show(false) > > Enrico > > Am 02.03.20 um 11:10 schrieb Chetan Khatri: > > Thanks Enrico > I want to compute hash of all the columns value in the row. &g

Re: Compute the Hash of each row in new column

2020-03-02 Thread Chetan Khatri
st with this Dataset ds: > > import org.apache.spark.sql.types._ > val ds = spark.range(10).select($"id".cast(StringType)) > > Available are md5, sha, sha1, sha2 and hash: > https://spark.apache.org/docs/2.4.5/api/sql/index.html > > Enrico > > > Am 28.02.20 um 13:56

Compute the Hash of each row in new column

2020-02-28 Thread Chetan Khatri
Hi Spark Users, How can I compute a Hash of each row and store it in a new column of the Dataframe? Could someone help me. Thanks
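
A minimal sketch covering the whole row rather than one hash per column: concatenate all columns with a separator and hash the result (md5, sha1, sha2 and hash are all built in); df is the hypothetical input:

    import org.apache.spark.sql.functions._

    // Nulls are skipped by concat_ws; cast everything to string first for stable input.
    val withHash = df.withColumn(
      "row_hash",
      sha2(concat_ws("||", df.columns.map(c => col(c).cast("string")): _*), 256))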

Apache Spark Use cases - my first talk

2019-12-25 Thread Chetan Khatri
Hi Spark Users, Thank you for all support over the mailing list. Contributors - thanks for your all contributions. This is my first 5 mins talk with Apache Spark - https://youtu.be/bBqItpgT8xQ Thanks.

Re: How more than one spark job can write to same partition in the parquet file

2019-12-11 Thread Chetan Khatri
Thanks, If you can share alternative change in design. I would love to hear from you. On Wed, Dec 11, 2019 at 9:34 PM ayan guha wrote: > No we faced problem with that setup. > > On Thu, 12 Dec 2019 at 11:14 am, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >&g

How more than one spark job can write to same partition in the parquet file

2019-12-11 Thread Chetan Khatri
Hi Spark Users, would it be possible to write to the same partition of the parquet file through two concurrent spark jobs with different spark sessions? thanks

Re: Spark - configuration setting doesn't work

2019-10-29 Thread Chetan Khatri
Oct 2019 at 11:02 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Could someone please help me. >> >> On Thu, Oct 17, 2019 at 7:29 PM Chetan Khatri < >> chetan.opensou...@gmail.com> wrote: >> >>> Hi Users, >>> >

Re: Spark Cluster over yarn cluster monitoring

2019-10-29 Thread Chetan Khatri
Thanks Jörn On Sun, Oct 27, 2019 at 8:01 AM Jörn Franke wrote: > Use yarn queues: > > > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html > > Am 27.10.2019 um 06:41 schrieb Chetan Khatri >: > >  > Could someone pleas

Re: Spark Cluster over yarn cluster monitoring

2019-10-26 Thread Chetan Khatri
Could someone please help me to understand better.. On Thu, Oct 17, 2019 at 7:41 PM Chetan Khatri wrote: > Hi Users, > > I do submit *X* number of jobs with Airflow to Yarn as a part of workflow > for *Y *customer. I could potentially run workflow for customer *Z *but I > need to

Re: Spark - configuration setting doesn't work

2019-10-26 Thread Chetan Khatri
Could someone please help me. On Thu, Oct 17, 2019 at 7:29 PM Chetan Khatri wrote: > Hi Users, > > I am setting spark configuration in below way; > > val spark = SparkSession.builder().appName(APP_NAME).getOrCreate() > > spark.conf.set("spark.speculation&q

Spark Cluster over yarn cluster monitoring

2019-10-17 Thread Chetan Khatri
Hi Users, I do submit *X* number of jobs with Airflow to Yarn as a part of a workflow for *Y* customer. I could potentially run the workflow for customer *Z*, but I need to check how many resources are available over the cluster so jobs for the next customer can start. Could you please tell what is
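
One hedged option is to ask the YARN ResourceManager REST API for free capacity before triggering the next customer's workflow; the host name and the thresholds to compare against are placeholders:

    import scala.io.Source

    // /ws/v1/cluster/metrics returns JSON with fields such as availableMB and availableVirtualCores.
    val metricsJson = Source.fromURL("http://<resourcemanager-host>:8088/ws/v1/cluster/metrics").mkString
    // Parse the JSON with your preferred library and compare availableMB / availableVirtualCores
    // against the resources the next workflow needs before submitting it.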

Re: Control Sqoop job from Spark job

2019-10-17 Thread Chetan Khatri
Shyam, As mark said - if we boost the parallelism with spark we can reach to performance of sqoop or better than that. On Tue, Sep 3, 2019 at 6:35 PM Shyam P wrote: > J Franke, > Leave alone sqoop , I am just asking about spark in ETL of Oracle ...? > > Thanks, > Shyam > >>

Spark - configuration setting doesn't work

2019-10-17 Thread Chetan Khatri
Hi Users, I am setting spark configuration in the below way: val spark = SparkSession.builder().appName(APP_NAME).getOrCreate() spark.conf.set("spark.speculation", "false") spark.conf.set("spark.broadcast.compress", "true") spark.conf.set("spark.sql.broadcastTimeout", "36000")
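
A likely explanation and a hedged fix: core properties such as spark.speculation and spark.broadcast.compress are read when the SparkContext starts, so setting them with spark.conf.set after getOrCreate() has no effect (only spark.sql.* runtime options can be changed that way); pass them to the builder or to spark-submit instead:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName(APP_NAME)
      .config("spark.speculation", "false")
      .config("spark.broadcast.compress", "true")
      .config("spark.sql.broadcastTimeout", "36000")
      .getOrCreate()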

Re: Control Sqoop job from Spark job

2019-09-02 Thread Chetan Khatri
e > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Thu, 29 Aug 2019 at 21:01, Chetan Khatri > wrote: > >> Hi Users,

Re: Control Sqoop job from Spark job

2019-09-02 Thread Chetan Khatri
ri, 30 Aug 2019 at 06:02, Chetan Khatri > wrote: > >> Sorry, >> I call sqoop job from above function. Can you help me to resolve this. >> >> Thanks >> >> On Fri, Aug 30, 2019 at 1:31 AM Chetan Khatri < >> chetan.opensou...@gmail.com> wrote: &g

Re: Control Sqoop job from Spark job

2019-08-29 Thread Chetan Khatri
Sorry, I call sqoop job from above function. Can you help me to resolve this. Thanks On Fri, Aug 30, 2019 at 1:31 AM Chetan Khatri wrote: > Hi Users, > I am launching a Sqoop job from Spark job and would like to FAIL Spark job > if Sqoop job fails. > > def executeSqoopOrig

Control Sqoop job from Spark job

2019-08-29 Thread Chetan Khatri
Hi Users, I am launching a Sqoop job from a Spark job and would like to FAIL the Spark job if the Sqoop job fails. def executeSqoopOriginal(serverName: String, schemaName: String, username: String, password: String, query: String, splitBy: String, fetchSize: Int, numMappers: Int,
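
A hedged sketch of the usual fix: when Sqoop is launched as an external process, check its exit code and throw so the Spark application fails too. The arguments reuse the parameter names from the function signature above; the exact Sqoop options needed will depend on the real job:

    import scala.sys.process._

    val sqoopCmd = Seq("sqoop", "import",
      "--connect", s"jdbc:sqlserver://$serverName;databaseName=$schemaName",
      "--username", username, "--password", password,
      "--query", query, "--split-by", splitBy,
      "--fetch-size", fetchSize.toString, "--num-mappers", numMappers.toString)

    val exitCode = sqoopCmd.!   // runs the process and returns its exit code
    if (exitCode != 0) {
      throw new RuntimeException(s"Sqoop job failed with exit code $exitCode")
    }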

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-10 Thread Chetan Khatri
path in HDFS. I would suggest to write the > parquet files to a different path, perhaps to a project space or user home, > rather than at the root directory. > > HTH, > Deng > > On Sat, Jun 8, 2019 at 8:00 AM Chetan Khatri > wrote: > >> Hello Dear Spark Users, >

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-07 Thread Chetan Khatri
Also anyone has any idea to resolve this issue - https://stackoverflow.com/questions/56390492/spark-metadata-0-doesnt-exist-while-compacting-batch-9-structured-streaming-er On Fri, Jun 7, 2019 at 5:59 PM Chetan Khatri wrote: > Hello Dear Spark Users, > > I am trying to write data f

Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-07 Thread Chetan Khatri
Hello Dear Spark Users, I am trying to write data from a Kafka Topic to Parquet on HDFS with Structured Streaming but I am getting failures. Please do help. val spark: SparkSession = SparkSession.builder().appName("DemoSparkKafka").getOrCreate() import spark.implicits._ val dataFromTopicDF = spark
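
A hedged end-to-end sketch (broker, topic and paths are placeholders); as suggested later in the thread, the parquet and checkpoint locations should live under a project or user path rather than the HDFS root:

    val dataFromTopicDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "demo_topic")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    val query = dataFromTopicDF.writeStream
      .format("parquet")
      .option("path", "hdfs:///user/spark/demo_topic_parquet")
      .option("checkpointLocation", "hdfs:///user/spark/demo_topic_checkpoint")
      .start()

    query.awaitTermination()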

Re: Update / Delete records in Parquet

2019-05-03 Thread Chetan Khatri
quot; > *Date: *Tuesday, 23 April 2019 at 11:35 > *To: *Chetan Khatri , Jason Nerothin < > jasonnerot...@gmail.com> > *Cc: *user > *Subject: *Re: Update / Delete records in Parquet > > > > Hi Chetan, > > > > I also agree that for this usecase parquet would

Re: Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
s like it might be the wrong sink for a high-frequency change > scenario. > > What are you trying to accomplish? > > Thanks, > Jason > > On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri > wrote: > >> Hello All, >> >> If I am doing incremental load / delta

Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello All, If I am doing incremental load / delta and would like to update / delete the records in parquet, I understand that parquet is immutable and can't be deleted / updated; theoretically only append / overwrite can be done. But I can see utility tools which claim to add value for that.
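
Since plain parquet cannot be updated in place, the usual workaround is to rewrite only the affected partitions: read the partition, merge the incremental changes, and overwrite it. A rough sketch under those assumptions (the path, the join key and the "op" marker column on deltaDF are hypothetical):

    import org.apache.spark.sql.functions._

    val partitionPath = "hdfs:///warehouse/orders/load_date=2019-04-22"
    val existing = spark.read.parquet(partitionPath)

    // Remove rows that the delta updates or deletes, then add back the updated versions.
    val merged = existing
      .join(deltaDF.select("order_id"), Seq("order_id"), "left_anti")
      .unionByName(deltaDF.filter(col("op") =!= "D").drop("op"))

    merged.write.mode("overwrite").parquet(partitionPath + ".tmp")   // swap directories afterwards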

Usage of Explicit Future in Spark program

2019-04-21 Thread Chetan Khatri
Hello Spark Users, Someone has suggested breaking 5-5 unpredictable transformation blocks into Future[ONE STRING ARGUMENT] and claims this can tune the performance. I am wondering, is this a valid use of explicit Futures in Spark? Sample code is below: def writeData( tableName: String):
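
For context, a hedged sketch of what that suggestion usually looks like: wrapping independent write actions in Futures lets the driver submit them concurrently, so the scheduler can interleave their stages when resources allow (it does not speed up any single job). Table names and the output path are placeholders:

    import scala.concurrent.duration._
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global

    def writeData(tableName: String): Future[Unit] = Future {
      spark.table(tableName).write.mode("overwrite").parquet(s"hdfs:///out/$tableName")
    }

    val jobs = Seq("table_a", "table_b", "table_c").map(writeData)
    Await.result(Future.sequence(jobs), 2.hours)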

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread Chetan Khatri
el("OFF") > > > spark.table("").show(100,truncate=false) > > But is there any specific reason you want to write it to hdfs? Is this for > human consumption? > > Regards, > Nuthan > > On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri > wrote: > &

How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Chetan Khatri
Hello Users, In spark, when I have a DataFrame and do .show(100), I want to save the printed output as-is to a txt file in HDFS. How can I do this? Thanks
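
show() only renders to the console; a hedged alternative is to persist the same first rows directly (the output path is a placeholder):

    df.limit(100)
      .coalesce(1)                                   // single output file
      .write.mode("overwrite")
      .option("header", "true")
      .csv("hdfs:///user/spark/df_preview")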

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
> >> How much memory do you have per partition? >> >> On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri >> wrote: >> >>> I will get the information and will share with you. >>> >>> On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari >>&g

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
anything that is faster. When I ran is on my data ~8-9GB > I think it took less than 5 mins (don't remember exact time) > > On Thu, Apr 4, 2019 at 1:09 PM Chetan Khatri > wrote: > >> Thanks for awesome clarification / explanation. >> >> I have cases where update_time can be s

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
eems like it's meant for cases where you > literally have redundant duplicated data. And not for filtering to get > first/last etc. > > > On Thu, Apr 4, 2019 at 11:46 AM Chetan Khatri > wrote: > >> Hello Abdeali, Thank you for your response. >> >> Can you p

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
gt; The min() is faster than doing an orderBy() and a row_number(). > And the dropDuplicates at the end ensures records with two values for the > same 'update_time' don't cause issues. > > > On Thu, Apr 4, 2019 at 10:22 AM Chetan Khatri > wrote: > >> Hello Dear Spark Users, >> >

dropDuplicate on timestamp based column unexpected output

2019-04-03 Thread Chetan Khatri
Hello Dear Spark Users, I am using dropDuplicate on a DataFrame generated from a large parquet file (from HDFS), doing dropDuplicate based on a timestamp-based column; every time I run it, it drops different rows for the same timestamp. What I tried and worked: val wSpec =
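
dropDuplicates gives no guarantee about which of the duplicate rows survives, which explains the run-to-run differences; the fix discussed in this thread is to make the rule explicit, for example with a window (the key column name is hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val wSpec = Window.partitionBy("record_id").orderBy(col("update_time").desc)

    val latestPerKey = df
      .withColumn("rn", row_number().over(wSpec))
      .filter(col("rn") === 1)
      .drop("rn")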

Re: Increase time for Spark Job to be in Accept mode in Yarn

2019-01-23 Thread Chetan Khatri
wrote: > Hi , please tell me why you need to increase the time? > > > > > > At 2019-01-22 18:38:29, "Chetan Khatri" > wrote: > > Hello Spark Users, > > Can you please tell me how to increase the time for Spark job to be in > *Accept* mode in Yarn. > > Thank you. Regards, > Chetan > > > > >

Increase time for Spark Job to be in Accept mode in Yarn

2019-01-22 Thread Chetan Khatri
Hello Spark Users, Can you please tell me how to increase the time for Spark job to be in *Accept* mode in Yarn. Thank you. Regards, Chetan

Re: How to Keep Null values in Parquet

2018-11-21 Thread Chetan Khatri
gt; See also https://issues.apache.org/jira/browse/SPARK-10943. > > — Soumya > > > On Nov 21, 2018, at 9:29 PM, Chetan Khatri > wrote: > > Hello Spark Users, > > I have a Dataframe with some of Null Values, When I am writing to parquet > it is failing with below error: >

How to Keep Null values in Parquet

2018-11-21 Thread Chetan Khatri
Hello Spark Users, I have a Dataframe with some Null Values; when I am writing to parquet it is failing with the below error: Caused by: java.lang.RuntimeException: Unsupported data type NullType. at scala.sys.package$.error(package.scala:27) at
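
That error usually means a column's type is NullType (for example one built with lit(null) and never cast), which parquet cannot store; a hedged sketch that casts any such columns to a concrete type before writing (the output path is a placeholder):

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{NullType, StringType}

    val writable = df.schema.fields.foldLeft(df) { (acc, field) =>
      if (field.dataType == NullType) acc.withColumn(field.name, col(field.name).cast(StringType))
      else acc
    }

    writable.write.mode("overwrite").parquet("hdfs:///out/writable")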

Spark 2.3.0 with HDP got completed successfully but status FAILED with error

2018-11-21 Thread Chetan Khatri
Hello Spark Users, I am working with Spark 2.3.0 with the HDP distribution, where my spark job completed successfully but the final job status is failed with the below error. What is the best way to prevent this kind of error? Thanks 8/11/21 17:38:15 INFO ApplicationMaster: Final app status: SUCCEEDED,

How to do efficient self join with Spark-SQL and Scala

2018-09-21 Thread Chetan Khatri
Dear Spark Users, I came across a little weird MSSQL query to replace with Spark and I have no clue how to do it in an efficient way with Scala + SparkSQL. Can someone please throw light? I can create a view of the DataFrame and do it as *spark.sql* (query) but I would like to do it with Scala + Spark
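
Without the exact query it is hard to be specific, but the general DataFrame pattern mirrors what spark.sql would do: alias the same DataFrame twice and join on qualified columns (all names here are hypothetical):

    import org.apache.spark.sql.functions.col

    val current  = recordsDF.as("cur")
    val previous = recordsDF.as("prev")

    val selfJoined = current.join(
      previous,
      col("cur.entity_id") === col("prev.entity_id") && col("cur.version") === col("prev.version") + 1,
      "inner")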

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-15 Thread Chetan Khatri
python.html > > We will continue adding more there. > > Feel free to ping me directly in case of questions. > > Thanks, > Jayant > > > On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri > wrote: > >> Hello Jayant, >> >> Thank you so much for suggestion.

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-09 Thread Chetan Khatri
like Pandas Dataframe for processing and finally write the > results back. > > In the Spark/Scala/Java code, you get an RDD of string, which we convert > back to a Dataframe. > > Feel free to ping me directly in case of questions. > > Thanks, > Jayant > > > On Thu, Jul 5

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Chetan Khatri
Prem sure, Thanks for suggestion. On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote: > try .pipe(.py) on RDD > > Thanks, > Prem > > On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri > wrote: > >> Can someone please suggest me , thanks >> >> On Tue 3 J

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Chetan Khatri
Can someone please suggest me , thanks On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, wrote: > Hello Dear Spark User / Dev, > > I would like to pass Python user defined function to Spark Job developed > using Scala and return value of that function would be returned to DF / > Datas

Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev, I would like to pass a Python user defined function to a Spark Job developed using Scala, and the return value of that function would be returned to the DF / Dataset API. Can someone please guide me on which would be the best approach to do this. The Python function would be mostly
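
One hedged option raised later in this thread is RDD.pipe: stream the rows through an external Python script that reads stdin and writes stdout, then rebuild a DataFrame from the output (the script path is a placeholder):

    import spark.implicits._

    // Send each row as a JSON line to the Python process and read its JSON output back.
    val pipedJson = inputDF.toJSON.rdd.pipe("python /path/to/transform.py")
    val resultDF  = spark.read.json(spark.createDataset(pipedJson))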

Re: Apply Core Java Transformation UDF on DataFrame

2018-06-05 Thread Chetan Khatri
Can anyone throw light on this? It would be helpful. On Tue, Jun 5, 2018 at 1:41 AM, Chetan Khatri wrote: > All, > > I would like to Apply Java Transformation UDF on DataFrame created from > Table, Flat Files and retrun new Data Frame Object. Any suggestions, with > respect to

Apply Core Java Transformation UDF on DataFrame

2018-06-04 Thread Chetan Khatri
All, I would like to apply a Core Java Transformation UDF on a DataFrame created from a Table or Flat Files and return a new Data Frame object. Any suggestions, with respect to Spark Internals. Thanks.
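
A minimal sketch: wrap the existing Java transformation in a UDF and apply it column-wise; MyJavaTransform.normalize stands in for whatever Java method holds the logic (hypothetical), and column names are placeholders:

    import org.apache.spark.sql.functions.{col, udf}

    // The Java class only needs to be on the executor classpath (e.g. packaged in the fat jar).
    val normalizeUdf = udf((value: String) => MyJavaTransform.normalize(value))

    val transformedDF = inputDF.withColumn("normalized_value", normalizeUdf(col("raw_value")))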

Re: 答复: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
ct(): Dataset[T] = dropDuplicates() > > … > > def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan { > > … > > Aggregate(groupCols, aggCols, logicalPlan) > } > > > > > > > > > > *发件人**:* Chetan Khatri [mailto:chetan.opensou.

Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
Georg, Sorry for dumb question. Help me to understand - if i do DF.select(A,B,C,D)*.distinct() *that would be same as above groupBy without agg in sql right ? On Wed, May 30, 2018 at 12:17 AM, Chetan Khatri wrote: > I don't want to get any aggregation, just want to know rather saying > di

Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
8 AM, Georg Heiler > > wrote: >> >>> Why do you group if you do not want to aggregate? >>> Isn't this the same as select distinct? >>> >>> Chetan Khatri schrieb am Di., 29. Mai >>> 2018 um 20:21 Uhr: >>> >>>> All, >

Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
e same as select distinct? > > Chetan Khatri schrieb am Di., 29. Mai 2018 > um 20:21 Uhr: > >> All, >> >> I have scenario like this in MSSQL Server SQL where i need to do groupBy >> without Agg function: >> >> Pseudocode: >> >> >

GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
All, I have a scenario like this in MSSQL Server SQL where I need to do groupBy without an Agg function: Pseudocode: select m.student_id, m.student_name, m.student_std, m.student_group, m.student_dob from student as m inner join general_register g on m.student_id = g.student_id group by
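
As the replies point out, a GROUP BY over all selected columns with no aggregates is equivalent to a distinct; a hedged sketch of the same query with DataFrames (DataFrame names are hypothetical):

    import org.apache.spark.sql.functions.col

    val result = studentDF.as("m")
      .join(generalRegisterDF.as("g"), col("m.student_id") === col("g.student_id"), "inner")
      .select("m.student_id", "m.student_name", "m.student_std", "m.student_group", "m.student_dob")
      .distinct()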

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-25 Thread Chetan Khatri
ave had this >> issue in the past where all spark slaves tend to send lots of data at once >> to SQL and that slows down the latency of the rest of the system. We >> overcame this by using sqoop and running it in a controlled environment. >> >> On Wed, May 23, 2018 a

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Super, just giving a high-level idea of what I want to do. I have one source schema which is MS SQL Server 2008 and the target is also MS SQL Server 2008. Currently there is a C#-based ETL application which does extract, transform and load as a customer-specific schema, including indexing etc. Thanks On Wed,

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
his https://docs.microsoft.com/en-us/azure/sql-database/sql- > database-spark-connector > > > > > > *From: *Chetan Khatri <chetan.opensou...@gmail.com> > *Date: *Wednesday, May 23, 2018 at 7:47 AM > *To: *user <user@spark.apache.org> > *Subject: *Bul

Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
All, I am looking for an approach to do bulk read / write with MSSQL Server and Apache Spark 2.2; please let me know if there is any library / driver for the same. Thank you. Chetan
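
Besides the dedicated SQL Server connector mentioned in the replies, the built-in JDBC source already gets reasonably far when the read is partitioned and the write uses large batches; a hedged sketch (connection details, bounds and table names are placeholders):

    val jdbcUrl = "jdbc:sqlserver://<host>:1433;databaseName=<db>"

    val sourceDF = spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "dbo.orders")
      .option("user", "<user>").option("password", "<password>")
      .option("partitionColumn", "order_id")       // numeric column used to split the read
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")               // parallel JDBC connections
      .load()

    sourceDF.write.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "dbo.orders_copy")
      .option("user", "<user>").option("password", "<password>")
      .option("batchsize", "10000")                // larger insert batches
      .mode("append")
      .save()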

Livy Failed error on Yarn with Spark

2018-05-09 Thread Chetan Khatri
All, I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I run the same spark job using spark-submit it succeeds with all transformations done. When I try to do the spark submit using Livy, the Spark Job is getting invoked and getting success but

Re: NLTK with Spark Streaming

2017-11-26 Thread Chetan Khatri
But you can still use Stanford NLP library and distribute through spark right ! On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau wrote: > So it’s certainly doable (it’s not super easy mind you), but until the > arrow udf release goes out it will be rather slow. > > On Sun,

Re: Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-22 Thread Chetan Khatri
Did anybody reply to this? On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: > > Hello Spark Users, > > I am getting below error, when i am trying to write dataset to parquet > location. I have enough disk space available. Last time i w

Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-21 Thread Chetan Khatri
Hello Spark Users, I am getting the below error when I am trying to write a dataset to a parquet location. I have enough disk space available. Last time I was facing the same kind of error, which was resolved by increasing the number of cores in the hyper parameters. Currently the result set data size is almost 400Gig

Re: No space left on device

2017-10-17 Thread Chetan Khatri
Process data in micro batch On 18-Oct-2017 10:36 AM, "Chetan Khatri" <chetan.opensou...@gmail.com> wrote: > Your hard drive don't have much space > On 18-Oct-2017 10:35 AM, "Mina Aslani" <aslanim...@gmail.com> wrote: > >> Hi, >> >&g

Re: No space left on device

2017-10-17 Thread Chetan Khatri
Your hard drive doesn't have much space. On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote: > Hi, > > I get "No space left on device" error in my spark worker: > > Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr > java.io.IOException: No space left on device

Re: Spark - Partitions

2017-10-12 Thread Chetan Khatri
Use repartition On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote: > Hi, > > I am reading hive query and wiriting the data back into hive after doing > some transformations. > > I have changed setting spark.sql.shuffle.partitions to 2000 and since then > job completes

Re: Write only one output file in Spark SQL

2017-08-11 Thread Chetan Khatri
What you can do is: in Hive, create a partition column (for example, date) and use val finalDf = dataFrame.repartition(col("date-column")), and later say insert overwrite tablename partition(date-column) select * from tempview. That would work as expected. On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed"
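
A cleaned-up sketch of that suggestion, assuming a Hive table partitioned by a hypothetical date_column (dynamic partition settings may also be required in the environment):

    import org.apache.spark.sql.functions.col

    val finalDf = dataFrame.repartition(col("date_column"))   // one shuffle partition per date
    finalDf.createOrReplaceTempView("tempview")

    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("INSERT OVERWRITE TABLE target_table PARTITION (date_column) SELECT * FROM tempview")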

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-03 Thread Chetan Khatri
eed to increase it). Honestly most people > find this number for their job "experimentally" (e.g. they try a few > different things). > > On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri <chetan.opensou...@gmail.com > > wrote: > >> Ryan, >> Thank y
