Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Jungtaek Lim
I can't spend too much time explaining every point one by one. I strongly encourage you to do a deep dive instead of just looking around, since you want to know the "details" - that's how open source works. I'll go through a general explanation instead of replying inline; probably I'd write a blog doc if

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Dongjoon Hyun
Thank you so much for your feedback, Koert. Yes, SPARK-20202 was created in April 2017 and has been targeted for 3.1.0 since Nov 2019. However, I believe Apache Spark 3.1.0 (the Hadoop 3.2/Hive 2.3 distribution) will work with old Hadoop 2.x clusters if you isolate the classpath via SPARK-31960.
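
(For the mechanics, here is a minimal sketch of my own, not from the thread: as far as I recall, SPARK-31960 introduced spark.yarn.populateHadoopClasspath. The app name and the idea of setting it from the session builder are assumptions; in practice you may prefer passing it via spark-submit --conf.)

    # Hedged sketch: run a Spark 3.x "with-Hadoop" build on a Hadoop 2.x
    # YARN cluster without picking up the cluster's Hadoop jars.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("classpath-isolation-sketch")  # hypothetical name
        # SPARK-31960: don't add the cluster's Hadoop classpath to the
        # YARN containers; use the Hadoop jars bundled with Spark instead.
        .config("spark.yarn.populateHadoopClasspath", "false")
        .getOrCreate()
    )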

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Koert Kuipers
It seems to me that with SPARK-20202 we are no longer planning to support Hadoop 2 + Hive 1.2. Is that correct? So basically Spark 3.1 will no longer run on, say, CDH 5.x or HDP 2.x with Hive? My use case is building Spark 3.1 and launching it on these existing clusters that are not managed by me. E.g. I do

Re: Hive on Spark in Kubernetes.

2020-10-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Thank you very much! Sent from my iPhone > On 7 Oct 2020, at 17:38, mykidong wrote: > > Hi all, > > I have recently written a blog about Hive on Spark in a Kubernetes > environment: > - https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1 > > In this blog, you can find how to run

[Spark Core] - Installation issue - "java.lang.UnsatisfiedLinkError: no zstd-jni in java.library.path"

2020-10-07 Thread jelvis
Dear all, I have set up two Spark standalone test clusters, both of which suffered from the same problem. I have a workaround, but it's bad. I would appreciate some help and input. I'm too much of a beginner to conclude that it's a bug, but I found someone else having the exact same issue on Stack
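
(A hedged sketch of one common workaround, my illustration rather than the poster's: steer Spark away from zstd so the zstd-jni native library is never loaded. The config keys are standard Spark 3.x settings; whether this resembles the poster's "bad workaround" is an assumption.)

    # Hedged sketch: avoid loading zstd-jni by switching codecs to lz4.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("avoid-zstd-sketch")  # hypothetical name
        # Spark 3.x compresses shuffle map statuses with zstd by default.
        .config("spark.shuffle.mapStatus.compression.codec", "lz4")
        # General block/IO compression codec.
        .config("spark.io.compression.codec", "lz4")
        .getOrCreate()
    )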

Hive on Spark in Kubernetes.

2020-10-07 Thread mykidong
Hi all, I have recently written a blog about Hive on Spark in a Kubernetes environment: - https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1 In this blog, you can find how to run Hive on Kubernetes using the Spark Thrift Server, which is compatible with HiveServer2. Cheers, - Kidong.

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Sergey Oboguev
Hi Jungtaek, *> I meant the subdirectory inside the directory you're providing as "checkpointLocation", as there're several directories in that directory...* There are two: *my-spark-checkpoint-dir/MainApp*, created by sparkSession.sparkContext().setCheckpointDir(), contains only an empty subdir
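
(To make the two locations concrete, a minimal sketch of my own assuming a rate source and console sink; all paths are hypothetical.)

    # Hedged sketch: the two distinct "checkpoint" settings in play.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-dirs-sketch").getOrCreate()

    # 1) RDD checkpoint directory, set via setCheckpointDir(); used by
    #    rdd.checkpoint(), not by structured streaming queries.
    spark.sparkContext.setCheckpointDir("/tmp/my-spark-checkpoint-dir/MainApp")

    # 2) Structured streaming checkpoint, set per query; this is where
    #    the offsets/, commits/ and state/ subdirectories appear.
    stream = spark.readStream.format("rate").load()
    query = (
        stream.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/my-spark-checkpoint-dir/query1")
        .start()
    )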

Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
I think a lot will depend on what the scripts do. I've seen some legacy Hive scripts which were written in an awkward way (e.g. lots of subqueries, nested explodes) because pre-Spark it was the only way to express certain logic. For fairly straightforward operations I expect Catalyst would reduce
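
(An illustrative sketch, mine rather than Patrick's; the table and column names are invented.)

    # Hedged sketch: a legacy HiveQL pattern vs. the plain DataFrame
    # equivalent that Catalyst optimizes directly.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("legacy-vs-df-sketch").getOrCreate()

    # Legacy style: nested subquery + LATERAL VIEW explode, common in
    # pre-Spark Hive scripts.
    legacy = spark.sql("""
        SELECT t.id, x.item
        FROM (SELECT id, items FROM events WHERE id IS NOT NULL) t
        LATERAL VIEW explode(t.items) x AS item
    """)

    # Straightforward DataFrame equivalent of the same logic.
    direct = (
        spark.table("events")
        .where(F.col("id").isNotNull())
        .select("id", F.explode("items").alias("item"))
    )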

[SparkR] gapply with strings with arrow

2020-10-07 Thread Jacek Pliszka
Hi! Is there any place I can find information on how to use gapply with Arrow? I've tried something very simple:

    collect(gapply(
      df,
      c("ColumnA"),
      function(key, x) {
        data.frame(out = c("dfs"), stringAsFactors = FALSE)
      },
      "out String"
    ))

But it fails - similar code with integers or

reading a csv.gz file from sagemaker using pyspark kernel mode

2020-10-07 Thread cloudytech43
I am trying to read a compressed CSV file in PySpark, but I am unable to read it in PySpark kernel mode in SageMaker. The same file I can read using pandas when the kernel is conda-python3 (in SageMaker). What I tried:

    file1 = 's3://testdata/output1.csv.gz'
    file1_df = spark.read.csv(file1,
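
(A hedged sketch of what typically works outside EMR: the s3a:// scheme, which needs the hadoop-aws jar and AWS credentials available to the cluster. The header/inferSchema options are assumptions about the file, and an existing SparkSession named spark is assumed, as in the pyspark kernel.)

    # Hedged sketch: read a gzipped CSV from S3; Spark infers the gzip
    # codec from the .gz extension.
    file1 = "s3a://testdata/output1.csv.gz"
    file1_df = (
        spark.read
        .option("header", "true")        # assumption: header row present
        .option("inferSchema", "true")   # assumption: types unknown
        .csv(file1)
    )
    file1_df.show(5)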