Re: Hostname :BUG

2020-03-04 Thread Zahid Rahman
Please explain why you think that, if there is a reason other than this one: if you think so because the header of /etc/hostname says "hosts", that is because I copied the file header from /etc/hosts to /etc/hostname. On Wed, 4 Mar 2020, 21:14 Andrew Melo, wrote: > Hello Zahid, > >

Re: Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Hariharan
If you're using Hadoop 2.7 or below, you may also need to set the following Hadoop properties: fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
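
A minimal sketch of applying those properties at runtime, assuming a spark-shell session where `spark` is already defined; the property values are exactly the ones quoted above:

```
// Sketch only: set the quoted Hadoop properties on the running session's
// Hadoop configuration (in spark-shell, `spark` is predefined).
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")
```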

Re: Stateful Structured Spark Streaming: Timeout is not getting triggered

2020-03-04 Thread Tathagata Das
Make sure that you are continuously feeding data into the query to trigger the batches; only then are timeouts processed. See the timeout behavior details here - https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.GroupState On Wed, Mar 4, 2020 at 2:51 PM
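
To illustrate the point, here is a minimal, self-contained sketch; the types, the rate source, and all names are placeholders, not taken from the thread. The timeout branch of the state function only runs when a new micro-batch fires, which requires data to keep arriving:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Sketch only: the event, state, and output types below are placeholders.
case class Event(key: String, value: Long)
case class KeyCount(key: String, count: Long)

def updateState(key: String, events: Iterator[Event],
                state: GroupState[Long]): KeyCount = {
  if (state.hasTimedOut) {
    // This branch is only reached when a micro-batch actually runs.
    val count = state.getOption.getOrElse(0L)
    state.remove()
    KeyCount(key, count)
  } else {
    val count = state.getOption.getOrElse(0L) + events.size
    state.update(count)
    state.setTimeoutDuration("2 minutes")   // re-armed on every batch
    KeyCount(key, count)
  }
}

val spark = SparkSession.builder.appName("timeout-sketch").getOrCreate()
import spark.implicits._

// The rate source keeps producing rows, so micro-batches keep firing and
// expired state gets a chance to be processed.
val events = spark.readStream.format("rate").load()
  .select(($"value" % 10).cast("string").as("key"), $"value")
  .as[Event]

events.groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateState _)
  .writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .option("checkpointLocation", "/tmp/chk-timeout-sketch")
  .start()
```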

Re: SPARK Suitable IDE

2020-03-04 Thread Holden Karau
I work in Emacs with ENSIME. I think really any IDE is OK, so go with the one you feel most at home in. On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio wrote: > We use IntelliJ IDEA, whether it's Java, Scala or Python > > >

Re: SPARK Suitable IDE

2020-03-04 Thread tianlangstudio
We use IntelliJ IDEA, whether it's Java, Scala or Python. TianlangStudio Some of the biggest lies: I will start tomorrow/Others are better than me/I am not good enough/I don't have time/This is the way I am -- From: Zahid

Spark DataSet class is not truly private[sql]

2020-03-04 Thread Nirav Patel
I see the Spark Dataset is defined as: class Dataset[T] private[sql]( @transient val sparkSession: SparkSession, @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution, encoder: Encoder[T]) However, it has public constructors which allow Dataset to
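
For readability, the definition quoted above is the following (same content, just formatted):

```
class Dataset[T] private[sql](
    @transient val sparkSession: SparkSession,
    @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
    encoder: Encoder[T])
```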

Stateful Structured Spark Streaming: Timeout is not getting triggered

2020-03-04 Thread Something Something
I've set the timeout duration to "2 minutes" as follows: def updateAcrossEvents (tuple3: Tuple3[String, String, String], inputs: Iterator[R00tJsonObject], oldState: GroupState[MyState]): OutputRow = { println(" Inside updateAcrossEvents with : " + tuple3._1 + ",

Re: Stateful Spark Streaming: Required attribute 'value' not found

2020-03-04 Thread Something Something
By simply adding 'toJSON' before 'writeStream' the problem was fixed. Maybe it will help someone. On Tue, Mar 3, 2020 at 6:02 PM Something Something wrote: > In a Stateful Spark Streaming application I am writing the 'OutputRow' in > the 'updateAcrossEvents' but I keep getting this error
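
A minimal sketch of that fix, assuming the query writes to a sink that expects a 'value' column; the sink, servers, topic, paths, and the `resultDs` name are placeholders, not taken from the thread:

```
// Sketch only: .toJSON turns each row into a JSON string in a single column
// named "value", which satisfies sinks such as Kafka that require it.
resultDs                     // the Dataset produced by the stateful step (placeholder name)
  .toJSON
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoint-sketch")
  .start()
```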

Spark 2.4.5 - Structured Streaming - Failed Jobs expire from the UI

2020-03-04 Thread puneetloya
Hi, I have been using Spark 2.4.5 for the past month. When a structured streaming query fails, it appears on the UI as a failed job. But after a while these failed jobs expire (disappear) from the UI. Is there a setting that expires failed jobs? I was using Spark 2.2 before this; I have never
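
One setting worth checking, sketched below as an assumption rather than something the thread confirms: the UI only retains a bounded number of jobs (spark.ui.retainedJobs, default 1000), and older entries, failed or not, are dropped once the limit is exceeded.

```
import org.apache.spark.sql.SparkSession

// Sketch only: raise the UI retention limits so failed jobs stay visible longer.
val spark = SparkSession.builder
  .appName("ui-retention-sketch")
  .config("spark.ui.retainedJobs", "10000")    // default 1000
  .config("spark.ui.retainedStages", "10000")  // default 1000
  .getOrCreate()
```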

Re: Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Steven Stetzler
To successfully read from S3 using s3a, I've had to also set ``` spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem ``` in addition to `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key`. I've also needed to ensure Spark has access to the AWS SDK jar. I have
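
A minimal sketch combining those settings, assuming credentials are supplied via environment variables; the bucket, prefix, and variable names are placeholders:

```
import org.apache.spark.sql.SparkSession

// Sketch only: hadoop-aws and a matching AWS SDK jar must be on the classpath
// (e.g. via --jars or --packages at submit time).
val spark = SparkSession.builder
  .appName("s3a-read-sketch")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

val df = spark.read.parquet("s3a://some-bucket/some/prefix/")
```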

Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Devin Boyer
Hello, I'm attempting to run Spark within a Docker container with the hope of eventually running Spark on Kubernetes. Nearly all the data we currently process with Spark is stored in S3, so I need to be able to interface with it using the S3A filesystem. I feel like I've gotten close to getting

Hostname :BUG

2020-03-04 Thread Zahid Rahman
Hi, I found the problem was that on my Linux operating system /etc/hostname was blank. *STEP 1* I searched Google for the error message and found an answer suggesting I add "127.0.0.1 [hostname] localhost" to /etc/hostname. I did that but there was still an error, this
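
For reference, and as an assumption rather than something spelled out in the truncated message, the two files conventionally look like this on a Debian-style system (myhost is a placeholder):

```
# /etc/hostname -- contains only the machine's name
myhost

# /etc/hosts -- maps addresses to names
127.0.0.1   localhost
127.0.1.1   myhost
```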

Re: Schema store for Parquet

2020-03-04 Thread Magnus Nilsson
Apache Atlas is the Apache data catalog; you may want to look into that. It depends on what your use case is. On Wed, Mar 4, 2020 at 8:01 PM Ruijing Li wrote: > Thanks Lucas and Magnus, > > Would there be any open source solutions other than Apache Hive metastore, > if we don’t wish to use Apache

Re: Schema store for Parquet

2020-03-04 Thread Ruijing Li
Thanks Lucas and Magnus, Would there be any open source solutions other than the Apache Hive metastore, if we don’t wish to use Apache Hive and Spark? Thanks. On Wed, Mar 4, 2020 at 10:40 AM lucas.g...@gmail.com wrote: > Or AWS glue catalog if you're in AWS > > On Wed, 4 Mar 2020 at 10:35, Magnus

Re: Schema store for Parquet

2020-03-04 Thread lucas.g...@gmail.com
Or the AWS Glue catalog if you're in AWS. On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson wrote: > Google hive metastore. > > On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: > >> Hi all, >> >> Has anyone explored efforts to have a centralized storage of schemas of >> different parquet files? I know

Re: Schema store for Parquet

2020-03-04 Thread Magnus Nilsson
Google "Hive metastore". On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: > Hi all, > > Has anyone explored efforts to have a centralized storage of schemas of > different parquet files? I know there is schema management for Avro, but > couldn’t find solutions for parquet schema management.
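
A minimal sketch of the metastore suggestion, assuming Hive support is available in the Spark build; table and path names are placeholders:

```
import org.apache.spark.sql.SparkSession

// Sketch only: saving as a table records the Parquet schema in the metastore,
// and any application pointed at the same metastore can read it back later
// without opening the Parquet files themselves.
val spark = SparkSession.builder
  .appName("metastore-sketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.parquet("/data/some_dataset")
df.write.mode("overwrite").saveAsTable("default.some_dataset")

// Later, from any job sharing the metastore:
println(spark.table("default.some_dataset").schema.treeString)
```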

Read Hive ACID Managed table in Spark

2020-03-04 Thread Chetan Khatri
Hi Spark Users, I want to read Hive ACID managed table data (ORC) in Spark. Can someone help me here? I've tried https://github.com/qubole/spark-acid but had no success. Thanks
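
For context, the usage documented by that project looks roughly like the following; the format name and option are taken from the qubole/spark-acid README, are not verified here against any particular release, and evidently did not work for the poster:

```
// Sketch only: assumes a spark-shell session (`spark` predefined) with the
// spark-acid jar on the classpath and Hive ACID tables reachable.
val df = spark.read
  .format("HiveAcid")
  .options(Map("table" -> "default.acid_table"))   // placeholder table name
  .load()
df.show()
```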

Re: Way to get the file name of the output when doing ORC write from dataframe

2020-03-04 Thread Manjunath Shetty H
Or is there any way to provide a unique file name to the ORC write function itself? Any suggestions would be helpful. Regards, Manjunath Shetty From: Manjunath Shetty H Sent: Wednesday, March 4, 2020 2:28 PM To: user Subject: Way to get the file name of the

Re: How to collect Spark dataframe write metrics

2020-03-04 Thread Manjunath Shetty H
Thanks Zohar, will try that. - Manjunath From: Zohar Stiro Sent: Tuesday, March 3, 2020 1:49 PM To: Manjunath Shetty H Cc: user Subject: Re: How to collect Spark dataframe write metrics Hi, to get DataFrame-level write metrics you can take a look at the
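
The preview cuts off before the actual suggestion, so the following is only one common approach, not necessarily what the reply went on to describe: register a QueryExecutionListener and read the metrics of the executed plan after each write completes.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Sketch only: print the SQL metrics (e.g. number of output rows) after each action.
val spark = SparkSession.builder.appName("write-metrics-sketch").getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    val metrics = qe.executedPlan.metrics
    println(s"$funcName finished in ${durationNs / 1e6} ms: " +
      metrics.map { case (name, m) => s"$name=${m.value}" }.mkString(", "))
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})

// Any subsequent DataFrame action, e.g. df.write.orc("/tmp/out"), will trigger the listener.
```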

Way to get the file name of the output when doing ORC write from dataframe

2020-03-04 Thread Manjunath Shetty H
Hi, I wanted to know if there is any way to get the output file name that `Dataframe.orc()` will write to? This is needed to track which file is written by which job during incremental batch jobs. Thanks, Manjunath
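
The preview contains no answer to this, so the following is only a possible workaround, stated as an assumption: write each incremental batch to its own directory, then list the part files Spark actually produced. Directory layout and names are placeholders, and `spark` and `df` are assumed to exist already.

```
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: one directory per batch, then enumerate what was written.
val outputDir = s"/data/orc/batch_${java.util.UUID.randomUUID()}"
df.write.orc(outputDir)

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val writtenFiles = fs.listStatus(new Path(outputDir))
  .map(_.getPath.toString)
  .filter(_.contains("part-"))
writtenFiles.foreach(println)
```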