Why does SparkSQL change the table owner when performing alter table operations?

2018-03-12 Thread 张万新
Hi, when using spark.sql() to perform alter table operations, I found that Spark changes the table owner property to the execution user. I then dug into the source code and found that in HiveClientImpl, the alterTable function sets the owner of the table to the current execution user. Besides,
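
A minimal way to observe this, assuming an active SparkSession named spark (the table name mydb.t is hypothetical, and the exact label of the owner row in DESCRIBE FORMATTED output varies across Spark versions):

    // Run any ALTER TABLE as some execution user, then inspect the owner.
    spark.sql("ALTER TABLE mydb.t SET TBLPROPERTIES ('touched' = 'true')")
    // After the ALTER, the owner row shows the execution user rather than
    // the user who originally created the table.
    spark.sql("DESCRIBE FORMATTED mydb.t")
      .filter(org.apache.spark.sql.functions.col("col_name").contains("Owner"))
      .show(truncate = false)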

Re: How to run spark shell using YARN

2018-03-12 Thread vermanurag
This does not look like a Spark error. It looks like YARN has not been able to allocate resources for the Spark driver. If you check the Resource Manager UI, you are likely to see the Spark application waiting for resources. Try reducing the driver memory and/or addressing other bottlenecks based on what you see
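
For example (the values here are illustrative, not from the thread), relaunching with ./spark-shell --master yarn --deploy-mode client --executor-memory 512m requests smaller executor containers, and in client mode spark.yarn.am.memory controls the size of the YARN container for the application master.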

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-12 Thread Tathagata Das
You have understood the problem right. However, note that your interpretation of the output (K, leftValue, null), (K, leftValue, rightValue1), (K, leftValue, rightValue2) is subject to the knowledge of the semantics of the join. That is, if you are processing the output rows manually, you are
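
For context, the kind of query under discussion is a 2.3.0 stream-stream left outer join with watermarks; a sketch along these lines (source, key, and column names are my own illustration, assuming an active SparkSession named spark):

    import org.apache.spark.sql.functions.expr

    // Both sides need watermarks and a time-range join condition so Spark can
    // decide when a left row is final and may be emitted as (K, leftValue, null).
    val left = spark.readStream.format("rate").load()
      .selectExpr("value % 10 AS leftKey", "timestamp AS leftTime", "value AS leftValue")
      .withWatermark("leftTime", "10 seconds")

    val right = spark.readStream.format("rate").load()
      .selectExpr("value % 10 AS rightKey", "timestamp AS rightTime", "value AS rightValue")
      .withWatermark("rightTime", "10 seconds")

    val joined = left.join(
      right,
      expr("leftKey = rightKey AND rightTime BETWEEN leftTime AND leftTime + interval 10 seconds"),
      "leftOuter")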

Re: How to run spark shell using YARN

2018-03-12 Thread Marcelo Vanzin
Looks like you either have a misconfigured HDFS service, or you're using the wrong configuration on the client. BTW, as I said in the previous response, the message you saw initially is *not* an error. If you're just trying things out, you don't need to do anything and Spark should still work.

Re: How to run spark shell using YARN

2018-03-12 Thread kant kodali
Hi, I have read that doc several times now. I am stuck on the error message below when I run ./spark-shell --master yarn --deploy-mode client. I have HADOOP_CONF_DIR set to /usr/local/hadoop-2.7.3/etc/hadoop and SPARK_HOME set to /usr/local/spark on all 3 machines (1 node for Resource Manager

Re: How to run spark shell using YARN

2018-03-12 Thread Marcelo Vanzin
That's not an error, just a warning. The docs [1] have more info about the config options mentioned in that message. [1] http://spark.apache.org/docs/latest/running-on-yarn.html

How to run spark shell using YARN

2018-03-12 Thread kant kodali
Hi All, I am trying to use YARN for the very first time. I believe I configured the resource manager and name node fine. I then ran the command ./spark-shell --master yarn --deploy-mode client. I get the output below and it hangs there forever (I had been waiting over 10 minutes)

Re: OutOfDirectMemoryError for Spark 2.2

2018-03-12 Thread Dave Cameron
I believe jmap is only showing you the Java heap used, but the program is running out of direct memory space; they are two different pools of memory. I haven't had to diagnose a direct memory problem before, but this blog post has some suggestions on how to do it:
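
If it is useful, one common starting point (my suggestion, not from the thread) is to raise the JVM direct memory ceiling on the executors, since by default it is tied to the heap size:

    import org.apache.spark.sql.SparkSession

    // -XX:MaxDirectMemorySize is a standard JVM flag; the same setting can be
    // passed on the command line via
    // --conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2g
    val spark = SparkSession.builder()
      .appName("direct-memory-example")
      .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
      .getOrCreate()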

Spark UI Streaming batch time interval does not match batch interval

2018-03-12 Thread Jordan Pilat
Hello, I am running a streaming app on Spark 2.1.2. The batch interval is set to 5000 ms, and when I go to the "Streaming" tab in the Spark UI, it correctly reports a 5 second batch interval, but the list of batches below only shows one batch every two minutes (i.e. the batch time for each batch
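
For reference, the batch interval being described is the one passed when constructing the StreamingContext, along these lines (the app name is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("five-second-batches")
    // A 5000 ms batch interval: the Streaming tab should list a new batch
    // every 5 seconds, not one every two minutes.
    val ssc = new StreamingContext(conf, Seconds(5))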

Time Series Functionality with Spark

2018-03-12 Thread Li Jin
Hi All, This is Li Jin. We (my fellow colleagues at Two Sigma and I) have been using Spark for time series analysis for the past two years, and it has been a success in scaling up our time series analysis. Recently, we started a conversation with Reynold about potential opportunities to collaborate

Re: Spark 2.3 submit on Kubernetes error

2018-03-12 Thread purna pradeep
Thanks Yinan, I'm able to get kube-dns endpoints when I run this command: kubectl get ep kube-dns --namespace=kube-system. Do I need to deploy under kube-system instead of the default namespace? And please let me know if you have any insights on Error1.

Re: Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

2018-03-12 Thread msinton
I think I understand that in the second case the DataFrame is created as a Local object, so it lives in the memory of the driver and is serialized as part of the Task that gets sent to each executor. Though I think the implicit conversion here is something that others could also misunderstand -

Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

2018-03-12 Thread msinton
Hi, using Scala, Spark version 2.3.0 (also 2.2.0): I've come across two main ways to create a DataFrame from a sequence. The more common: (0 until 10).toDF("value") (good), and the less common (but still prevalent): (0 until 10).toDF("value") (bad). The latter results in much worse performance
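
A sketch of the contrast being drawn, between a driver-local sequence (serialized into each task via the localSeqToDatasetHolder implicit, as the reply above describes) and a distributed source that generates rows on the executors (my illustration, not the original poster's exact code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("local-seq-vs-range").getOrCreate()
    import spark.implicits._

    // Driver-local sequence: the whole Seq lives in driver memory and is
    // shipped to the executors as part of every task.
    val local = (0 until 1000000).toDF("value")

    // Distributed alternative: rows are generated on the executors.
    val distributed = spark.range(1000000).toDF("value")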