Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-10 Thread Rishi Shah
Thank you both for your input! To calculate a moving average of active users, could you comment on whether to go for an RDD-based implementation or a DataFrame? If a DataFrame, will a window function work here? In general, how would Spark behave when working with a DataFrame with date, week, month, quarter,
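For readers following the question above, here is a minimal plain-Python sketch of the computation being asked about: count distinct active users per day, then average the counts over a trailing window of rows. It is not taken from the thread. The function names, the `(day, user_id)` event shape, and the 3-row window are assumptions made for illustration. In PySpark the analogous DataFrame construct would be an average taken over a `Window` ordered by date and bounded with `rowsBetween`. Whether that performs well on the poster's data is not something this sketch addresses.

```python
from collections import defaultdict, deque


def daily_active_users(events):
    """Count distinct users per day.

    `events` is an iterable of (day, user_id) pairs; `day` is assumed to be
    an ISO date string so that lexical sort order equals date order.
    """
    users_by_day = defaultdict(set)
    for day, user_id in events:
        users_by_day[day].add(user_id)
    return [(day, len(users_by_day[day])) for day in sorted(users_by_day)]


def trailing_average(daily_counts, window=3):
    """Average each day's count with up to `window - 1` preceding rows.

    This is row-based (like a rows-between window): days with no events
    are simply absent and do not count as zero.
    """
    recent = deque(maxlen=window)  # automatically drops the oldest count
    result = []
    for day, count in daily_counts:
        recent.append(count)
        result.append((day, sum(recent) / len(recent)))
    return result


if __name__ == "__main__":
    sample_events = [
        ("2019-06-01", "a"), ("2019-06-01", "b"), ("2019-06-01", "a"),
        ("2019-06-02", "a"),
        ("2019-06-03", "a"), ("2019-06-03", "b"), ("2019-06-03", "c"),
        ("2019-06-04", "b"),
    ]
    for day, avg in trailing_average(daily_active_users(sample_events)):
        print(day, round(avg, 2))
```

The same two-step shape (aggregate to one row per period, then apply the window) carries over to week, month, or quarter buckets by changing the key used in the first step.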

RE: Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Dave, As part of driver pod bringup, a configmap is created using all the Spark configuration parameters (with name spark.properties) and mounted to /opt/spark/conf. So all the other files present in /opt/spark/conf will be overwritten. The same is happening with the log4j.properties in this

Re: ARM CI for spark

2019-06-10 Thread 김영우
Hi Tianhua, I read a similar question to yours on the HBase mailing list, so I'd like to let you know about efforts on supporting AArch64 from Apache Bigtop[1]. I don't believe that the CI and distribution of Bigtop are exactly what you are looking for, but folks from Linaro and Arm are contributing

ARM CI for spark

2019-06-10 Thread Tianhua huang
Hi all, The CI testing for Apache Spark is supported by AMPLab Jenkins, and I find there are some computers (most of them Linux (amd64) arch) for the CI development, but it seems there is no AArch64 computer for Spark CI testing. Recently, I built and ran tests for Spark (master and branch-2.4) on

Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Dave Jaffe
I am using Spark on Kubernetes from Spark 2.4.3. I have created a log4j.properties file in my local spark/conf directory and modified it so that the console (or, in the case of Kubernetes, the log) only shows warnings and higher (log4j.rootCategory=WARN, console). I then added the command COPY

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-10 Thread Joe Ammann
Hi all, it took me some time to get the issues extracted into a piece of standalone code. I created the following gist https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 It has messages for 4 topics A/B/C/D and a simple Python program which shows 6 use cases, with my expectations

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-10 Thread Chetan Khatri
Hello Deng, Thank you for your email. Issue was with Spark - Hadoop / HDFS configuration settings. Thanks On Mon, Jun 10, 2019 at 5:28 AM Deng Ching-Mallete wrote: > Hi Chetan, > > Best to check if the user account that you're using to run the job has > permission to write to the path in HDFS.

Re: Spark SQL

2019-06-10 Thread Russell Spitzer
Spark can use the Hive Metastore as a catalog, but it doesn't use the Hive parser or optimization engine. Instead it uses Catalyst, see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html On Mon, Jun 10, 2019 at 2:07 PM naresh Goud wrote: > Hi Team, > > Is

Spark SQL

2019-06-10 Thread naresh Goud
Hi Team, Does Spark SQL use the Hive engine to run queries? My understanding is that Spark SQL uses the Hive metastore to get metadata information to run queries. Thank you, Naresh -- Thanks, Naresh www.linkedin.com/in/naresh-dulam http://hadoopandspark.blogspot.com/

[Spark Core]: What is the release date for Spark 3 ?

2019-06-10 Thread Alex Dettinger
Hi guys, I was not able to find the foreseen release date for Spark 3. Would anyone have any information on this, please? Many thanks, Alex

Fwd: Spark kafka streaming job stopped

2019-06-10 Thread Amit Sharma
We have a Spark Kafka streaming job running on a standalone Spark cluster. We have the Kafka architecture below: 1. Two clusters running in two data centers. 2. There is an LTM on top of each data center (load balancer). 3. There is a GSLB on top of the LTM. I observed that whenever any of the nodes in the Kafka cluster is

Does anyone used spark-structured streaming successfully in production ?

2019-06-10 Thread Shyam P
https://stackoverflow.com/questions/56428367/any-clue-how-to-join-this-spark-structured-stream-joins

Re: Read hdfs files in spark streaming

2019-06-10 Thread Deepak Sharma
Thanks All. I managed to get this working. Marking this thread as closed. On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma wrote: > This is the project requirement , where paths are being streamed in kafka > topic. > Seems it's not possible using spark structured streaming. > > > On Mon, Jun 10,

How spark structured streaming consumers initiated and invoked while reading multi-partitioned kafka topics?

2019-06-10 Thread Shyam P
Hi, Any suggestions regarding below issue? https://stackoverflow.com/questions/56524921/how-spark-structured-streaming-consumers-initiated-and-invoked-while-reading-mul Thanks, Shyam

Re: Read hdfs files in spark streaming

2019-06-10 Thread Shyam P
Hi Deepak, Why are you getting paths from a Kafka topic? Any specific reason to do so? Regards, Shyam On Mon, Jun 10, 2019 at 10:44 AM Deepak Sharma wrote: > The context is different here. > The file paths are coming as messages in a Kafka topic. > Spark streaming (structured) consumes from this

How to handle small file problem in spark structured streaming?

2019-06-10 Thread Shyam P
https://stackoverflow.com/questions/56524539/how-to-handle-small-file-problem-in-spark-structured-streaming Regards, Shyam

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-10 Thread Deng Ching-Mallete
Hi Chetan, Best to check if the user account that you're using to run the job has permission to write to the path in HDFS. I would suggest writing the Parquet files to a different path, perhaps to a project space or user home, rather than to the root directory. HTH, Deng On Sat, Jun 8, 2019 at