Re: Read Time from a remote data source

2018-12-18 Thread jiaan.geng
You said your HDFS cluster and Spark cluster are running on different clusters. This is not a good idea, because you should consider data locality. Your Spark nodes need the HDFS client configuration. A Spark job is composed of stages, and each stage has one or more partitions. Parallelism of the job is decided by
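
A minimal sketch of the client-configuration point, assuming a remote namenode at remote-nn:8020 (host, port, and path are placeholders, not from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("remote-hdfs-read")
      // Alternatively, ship the remote cluster's core-site.xml/hdfs-site.xml
      // to every Spark node via HADOOP_CONF_DIR instead of setting this here.
      .config("spark.hadoop.fs.defaultFS", "hdfs://remote-nn:8020")
      .getOrCreate()

    // With the clusters separated there is no data locality: every task
    // fetches its block over the network from the remote HDFS.
    val rdd = spark.sparkContext.textFile("hdfs://remote-nn:8020/data/events.log")
    println(rdd.count())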

Re: How to update structured streaming apps gracefully

2018-12-18 Thread Yuta Morisawa
Hi Priya and Vincent, Thank you for your reply! It looks like the new feature is implemented only in the latest version. But I'm using Spark 2.3.0, so, in my understanding, I need to stop and reload apps. Thanks On 2018/12/19 9:09, vincent gromakowski wrote: I totally missed this new feature.

Re: How to update structured streaming apps gracefully

2018-12-18 Thread vincent gromakowski
I totally missed this new feature. Thanks for the pointer. On Tue, Dec 18, 2018 at 21:18, Priya Matpadi wrote: > Changes in a streaming query that allow or disallow recovery from a checkpoint > are clearly documented in >

Re: How to update structured streaming apps gracefully

2018-12-18 Thread Priya Matpadi
Changes in a streaming query that allow or disallow recovery from a checkpoint are clearly documented in https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query . On Tue, Dec 18, 2018 at 9:45 AM vincent gromakowski <

Dataset experimental interfaces

2018-12-18 Thread Andrew Old
We are running Spark 2.2.0 in a Hadoop cluster, and I worked on a proof of concept to read event-based data into Spark Datasets and operate over those sets to calculate differences between the event data. More specifically, ordered position data with odometer values, wanting to calculate the
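
A hedged sketch of that kind of difference calculation with a window function; the vehicleId/eventTime/odometer column names are assumptions, not the poster's schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag}

    val spark = SparkSession.builder().appName("odometer-deltas").getOrCreate()
    import spark.implicits._

    val events = Seq(
      ("v1", 1L, 100.0), ("v1", 2L, 130.0), ("v1", 3L, 175.0)
    ).toDF("vehicleId", "eventTime", "odometer")

    // Difference between each odometer reading and the previous one, per vehicle
    val w = Window.partitionBy("vehicleId").orderBy("eventTime")
    events.withColumn("delta", col("odometer") - lag("odometer", 1).over(w)).show()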

Read Time from a remote data source

2018-12-18 Thread swastik mittal
Hi, I am new to Spark. I am running an HDFS file system on a remote cluster, whereas my Spark workers are on another cluster. When my textFile RDD gets executed, do the Spark workers read from the file according to HDFS partitions, task by task, or do they read it once when the BlockManager sets after
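
One way to see the split in spark-shell (sc is the shell's predefined SparkContext; the path is a placeholder): textFile creates roughly one partition per HDFS block, and each partition is read by its own task when an action runs, not in a single up-front pass:

    val rdd = sc.textFile("hdfs://remote-nn:8020/data/events.log")
    println(rdd.getNumPartitions)  // roughly one partition per HDFS block
    // Each task consumes only its own partition's data
    rdd.mapPartitions(it => Iterator(it.size)).collect()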

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-18 Thread Mich Talebzadeh
Thanks, Jörn. I will try that. It requires installing sbt etc. on an ephemeral compute server in Google Cloud to build an uber jar file. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: How to clean up logs-dirs and local-dirs of running spark streaming in yarn cluster mode

2018-12-18 Thread shyla deshpande
Is there a way to do this without stopping the streaming application in yarn cluster mode? On Mon, Dec 17, 2018 at 4:42 PM shyla deshpande wrote: > I get the ERROR > 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: > /var/log/hadoop-yarn/containers > > Is there a way to clean up these

Re: Questions about caching

2018-12-18 Thread Reza Safi
Hi Andrew, 1) df2 will cache all the columns. 2) In Spark 2 you will receive a warning like: WARN execution.CacheManager: Asked to cache already cached data. I don't recall whether it is the same in 1.6. It seems you are not using Spark 2. 2a) Not sure whether you are suggesting a feature in
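
A small spark-shell sketch of both behaviours, using a stand-in DataFrame:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()
    import spark.implicits._

    val df2 = Seq((1, "a"), (2, "b")).toDF("id", "name")
    df2.cache()   // registers the plan for caching; all columns are cached
    df2.count()   // first action materializes the cache
    df2.cache()   // in Spark 2, logs: WARN CacheManager: Asked to cache already cached data.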

Re: How to update structured streaming apps gracefully

2018-12-18 Thread vincent gromakowski
Checkpointing is only used for failure recovery, not for app upgrades. You need to manually code the unload/load and save it to a persistent store. On Tue, Dec 18, 2018 at 17:29, Priya Matpadi wrote: > Using checkpointing for graceful updates is my understanding as well, > based on the writeup

Re: How to update structured streaming apps gracefully

2018-12-18 Thread Priya Matpadi
Using checkpointing for graceful updates is my understanding as well, based on the writeup in https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing, and some prototyping. Have you faced any missed events? On Mon, Dec 17, 2018
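
A minimal sketch of the checkpoint-recovery pattern being discussed; the rate source and the paths are stand-ins for a real pipeline:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
    val stream = spark.readStream.format("rate").load()  // built-in test source

    val query = stream.writeStream
      .format("parquet")
      .option("path", "/tmp/stream-out")
      .option("checkpointLocation", "/tmp/checkpoints/app1")
      .start()

    // To upgrade: stop, deploy the new jar, and restart with the SAME
    // checkpointLocation; only the query changes listed in the guide above
    // are recoverable.
    query.stop()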

Re: [Apache Beam] Custom DataSourceV2 instantiation: parameters passing and Encoders

2018-12-18 Thread Etienne Chauchot
Hi everyone, Does anyone have comments on this question? CCing user ML. Thanks, Etienne On Tuesday, December 11, 2018 at 19:02 +0100, Etienne Chauchot wrote: > Hi Spark guys, > I'm Etienne Chauchot and I'm a committer on the Apache Beam project. > We have what we call runners. They are pieces of
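
On the parameter-passing half of the question, a hedged sketch against the Spark 2.3-era DataSourceV2 API; CustomSource, CustomReader, and the host option are made-up names. Spark builds the source class by reflection through a no-arg constructor, so per-query parameters arrive only through DataSourceOptions, i.e. the .option(...) calls on the DataFrameReader:

    import java.util.{List => JList}
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    class CustomSource extends DataSourceV2 with ReadSupport {
      // Spark calls this after instantiating the class reflectively;
      // options carry everything passed via .option(...)
      override def createReader(options: DataSourceOptions): DataSourceReader =
        new CustomReader(options.get("host").orElse("localhost"))
    }

    class CustomReader(host: String) extends DataSourceReader {
      override def readSchema(): StructType =
        StructType(Seq(StructField("value", StringType)))
      // Stub: no partitions, so the source yields an empty result
      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
        Seq.empty[DataReaderFactory[Row]].asJava
    }

    // Usage: spark.read.format("com.example.CustomSource").option("host", "h1").load()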

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Devender Yadav
Thanks, Yunus. It solved my problem. Regards, Devender From: Shahab Yunus Sent: Tuesday, December 18, 2018 8:27:51 PM To: Devender Yadav Cc: user@spark.apache.org Subject: Re: Add column value in the dataset on the basis of a condition Sorry Devender, I hit the

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
Sorry Devender, I hit the send button too soon by mistake. I meant to add more info. So what I was trying to say was that you can use withColumn with when/otherwise clauses to add a column conditionally. See an example here:

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
Have you tried using withColumn? You can add a boolean column based on whether the age exists or not and then drop the old age column. You wouldn't need a union of dataframes then. On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav wrote: > Hi All, > > > useful code: > > public class EmployeeBean
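
A sketch of that suggestion against the EmployeeBean fields from the question; the hasAge column name is an assumption:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, lit, when}

    val spark = SparkSession.builder().appName("conditional-column").getOrCreate()
    import spark.implicits._

    val employees = Seq(
      (1L, "Alice", 50000L, Some(30)),
      (2L, "Bob",   60000L, None: Option[Int])
    ).toDF("id", "name", "salary", "age")

    // Add the flag conditionally, then drop the original column
    employees
      .withColumn("hasAge", when(col("age").isNotNull, lit(true)).otherwise(lit(false)))
      .drop("age")
      .show()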

Add column value in the dataset on the basis of a condition

2018-12-18 Thread Devender Yadav
Hi All, Useful code: public class EmployeeBean implements Serializable { private Long id; private String name; private Long salary; private Integer age; // getters and setters } Relevant Spark code: SparkSession spark =

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-18 Thread Jörn Franke
Maybe the guava version in your Spark lib folder is not compatible (if your Spark version bundles a guava library)? In this case I propose to create a fat/uber jar, potentially with a shaded guava dependency. > On 18.12.2018 at 11:26, Mich Talebzadeh wrote: > > Hi, > > I am writing a small test
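
For the shading route, a hypothetical build.sbt fragment using the sbt-assembly plugin; the shaded.com.google.common target name is an arbitrary choice:

    // In project/plugins.sbt: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
    // Rename Guava packages inside the uber jar so the connector's Guava
    // cannot clash with the version bundled with Spark.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
    )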

Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-18 Thread Mich Talebzadeh
Hi, I am writing a small test in spark-shell with the attached jar dependencies: spark-shell --jars