Re: How does spark sql evaluate case statements?

2020-04-16 Thread ZHANG Wei
Are you looking for this: https://spark.apache.org/docs/2.4.0/api/sql/#when ? The code generated will look like this in a `do { ... } while (false)` loop: do { ${cond.code} if (!${cond.isNull} && ${cond.value}) { ${res.code} $resultState = (byte)(${res.isNull} ? $HAS_NULL :
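The control flow of that generated code can be modelled in plain Python. This is only a rough sketch of the semantics, not Spark's implementation: conditions are tried in order, a branch is taken only when its condition is non-null and true, later branches are never evaluated once one matches, and the ELSE value (or NULL if there is none) is used when nothing matches. The names `case_when`, `gt`, `branches` and `otherwise` are made up for illustration, and `None` stands in for SQL NULL.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Row = dict
Condition = Callable[[Row], Optional[bool]]
Result = Callable[[Row], Any]


def case_when(
    row: Row,
    branches: Sequence[Tuple[Condition, Result]],
    otherwise: Optional[Result] = None,
) -> Any:
    """Return the result of the first branch whose condition is non-null and true."""
    for condition, result in branches:
        matched = condition(row)
        # Mirrors `!cond.isNull && cond.value`: a NULL condition does not match.
        if matched is not None and matched:
            # Returning here plays the role of leaving the do { ... } while (false) block.
            return result(row)
    # No branch matched: use ELSE if present, otherwise NULL.
    return otherwise(row) if otherwise is not None else None


def gt(column: str, threshold: int) -> Condition:
    """Build a `column > threshold` condition that yields None for NULL input."""
    def condition(row: Row) -> Optional[bool]:
        value = row[column]
        return None if value is None else value > threshold
    return condition


branches = [
    (gt("x", 10), lambda row: "big"),
    (gt("x", 0), lambda row: "small"),
]
label_big = case_when({"x": 42}, branches, lambda row: "other")
label_null = case_when({"x": None}, branches, lambda row: "other")
label_no_else = case_when({"x": -5}, branches)
```

With these inputs the first row stops at the first branch, the NULL row falls through to the ELSE value, and the last row has no match and no ELSE, so it yields `None`.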

Re: Is there any way to set the location of the history for the spark-shell per session?

2020-04-16 Thread ZHANG Wei
You are welcome! It's not in the Spark source code; it's in the Scala source: https://github.com/scala/scala/blob/2.11.x/src/repl-jline/scala/tools/nsc/interpreter/jline/FileBackedHistory.scala#L26 Reference Code: // For a history file in the standard location, always try to restrict permission,

Using startingOffsets latest - no data from structured streaming kafka query

2020-04-16 Thread Ruijing Li
Hi all, Apologies if this has been asked before, but I could not find the answer to this question. We have a structured streaming job, but for some reason, if we use startingOffsets = latest with foreachBatch mode, it doesn’t produce any data. Rather, in the logs I see it repeat the message “

Re: Save Spark dataframe as dynamic partitioned table in Hive

2020-04-16 Thread Mich Talebzadeh
Thanks Patrick, The partition broadcastId is static, defined as a value below: val broadcastValue = "123456789" // I assume this will be sent as a constant for the batch // Create a DF on top of XML val df = spark.read. format("com.databricks.spark.xml").

Re: Get Size of a column in Bytes for a Pyspark Dataframe

2020-04-16 Thread Yeikel
As far as I know, one option is to persist it and check its size in the Spark UI. df.select("field").persist().count() // I'd like to hear other options too. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To
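As one other option, a rough per-value estimate can be made outside the Spark UI. The sketch below is not from the thread and does not measure Spark's in-memory size: it swaps in a different technique, JSON-encoding each value and counting the UTF-8 bytes. `json_size_bytes` and the sample values are hypothetical. With PySpark, values for a struct column could plausibly be obtained from `df.select("field").toLocalIterator()` and converted with `Row.asDict(recursive=True)`, but that part is an assumption and is left out so the block runs on its own.

```python
import json
from typing import Any


def json_size_bytes(value: Any) -> int:
    """Approximate the size of one value as the UTF-8 length of its JSON text."""
    # ensure_ascii=False keeps non-ASCII characters as-is so they are counted
    # at their real UTF-8 width; default=str covers dates, decimals and similar.
    return len(json.dumps(value, ensure_ascii=False, default=str).encode("utf-8"))


# Hypothetical struct values, written as plain dicts.
values = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "é"},
]
sizes = [json_size_bytes(v) for v in values]
total_bytes = sum(sizes)
largest_bytes = max(sizes)
```

The largest single value is often the more useful number when a load fails on one oversized field, while the total gives a feel for the column as a whole.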

Re: Is there any way to set the location of the history for the spark-shell per session?

2020-04-16 Thread Yeikel
Thank you. That's what I was looking for. I only found a PR from Scala when I googled it, so if you remember your source, please share it. Thanks!

Understanding spark structured streaming checkpointing system

2020-04-16 Thread Ruijing Li
Hi all, I have a question on how structured streaming does checkpointing. I’m noticing that spark is not reading from the max / latest offset it’s seen. For example, in HDFS, I see it stored offset file 30 which contains partition: offset {1: 2000} But instead after stopping the job and

Re: Going it alone.

2020-04-16 Thread u...@moosheimer.com
A good idea in principle. But there are also reasons why it is not a good idea. Some companies forbid that their company name is on a mailing list in an OpenSource project. Or that their name is related to OpenSource. Even though they use OpenSource. This would exclude those persons, which in

Re: Going it alone.

2020-04-16 Thread Sudhanshu
I wonder if it makes sense, or is easy, to put a vetting process in place before people are allowed to write to this email group. For example, they need to provide their LinkedIn profile or verify using their work address or something like that. Just some food for thought. -Sudhanshu On Thu, Apr 16, 2020 at

Re: Going it alone.

2020-04-16 Thread Stephen Boesch
The warning signs were there from the first email sent from that person. I wonder if there is any way to deal with this more proactively. On Thu, 16 Apr 2020 at 10:54, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > good for you. right move > > Dr Mich Talebzadeh > > > > LinkedIn * >

Get Size of a column in Bytes Pyspark Dataframe

2020-04-16 Thread anbutech
Hello All, I have a column in a dataframe which is of struct type. I want to find the size of the column in bytes; it fails while loading into Snowflake. I can see size functions available to get the length. How can I calculate the size in bytes of a column in a PySpark dataframe?

Re: Going it alone.

2020-04-16 Thread Mich Talebzadeh
good for you. right move Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own

Re: Going it alone.

2020-04-16 Thread Sean Owen
Absolutely unacceptable even if this were the only one. I'm contacting INFRA right now. On Thu, Apr 16, 2020 at 11:57 AM Holden Karau wrote: > I want to be clear I believe the language in janethrope1s email is > unacceptable for the mailing list and possibly a violation of the Apache > code of

unsubscribe

2020-04-16 Thread Rong, Jialei

unsubscribe

2020-04-16 Thread sridhararao mutluri

Spark structured streaming - performance tuning

2020-04-16 Thread Srinivas V
Hello, Can someone point me to a good video or document which talks about performance tuning for a structured streaming app? I am looking especially at listening to Kafka topics, say 5 topics each with 100 partitions. Trying to figure out the best cluster size and the number of executors and cores required.

Re: wot no toggle ?

2020-04-16 Thread Mich Talebzadeh
Look, I believe what Sean Owen referred to as third party is true. We had one a few weeks ago. There are many trolls who are masquerading as different individuals but the language and ferocity are the same. *Simple. Do not feed the trolls! Don't answer and they will move on. You are feeding them.*

Re: Going it alone.

2020-04-16 Thread Mich Talebzadeh
I refer you to the answer I gave in a similar thread. Cheers, Dr Mich Talebzadeh

Re: Going it alone.

2020-04-16 Thread Holden Karau
I want to be clear I believe the language in janethrope1s email is unacceptable for the mailing list and possibly a violation of the Apache code of conduct. I’m glad we don’t see messages like this often. I know this is a stressful time for many of us, but let’s try and do our best to not take it

Re: wot no toggle ?

2020-04-16 Thread Yeikel
I have been reading the list for the last few days and I have seen a lot of messages with similar grammar and criticism that does not add any value. If you don't like something about Spark, discuss it politely or create a Jira ticket. Please stop attacking the community or you'll be blocked.

Re: wot no toggle ?

2020-04-16 Thread Sean Owen
Yes, this kind of message is not welcome on this list. At best the wording is ... odd, but the tone is combative. It's not even clear what the question is. This user has posted several other messages with the same type of issue, uncannily like those of "Zahid Raman" last month. This list has

Re: wot no toggle ?

2020-04-16 Thread Seemanto Barua
Your interpretation of what those are is so far off from what they really are, and then you abuse everyone based on that interpretation. It's actually hilarious how wrong you are. On Thu, Apr 16, 2020 at 2:13 AM jane thorpe wrote: >

[Spark SQL] AnalysisException: cannot resolve '`column_name`' given input columns

2020-04-16 Thread Joshua Conlin
Hello, I have the following Spark SQL query: SELECT column_name, * from table_name; I have multiple spark clusters, this query has been running daily on all of the clusters. After a recent redeployment, it fails on just one of the clusters with the following exception: AnalysisException:

Can I run Spark executors in a Hadoop cluster from a Kubernetes container

2020-04-16 Thread mailfordebu
Hi, I want to deploy a Spark client in a Kubernetes container. Further on, I want to run the Spark job in a Hadoop cluster (meaning the resources of the Hadoop cluster will be leveraged) but call it from the K8S container. My question is whether this mode of implementation is possible. Do let me

Re: Save Spark dataframe as dynamic partitioned table in Hive

2020-04-16 Thread Patrick McCarthy
What happens if you change your insert statement to be INSERT INTO TABLE michtest.BroadcastStaging PARTITION (broadcastId = broadcastValue, brand) and then add the value for brand into the select as SELECT ocis_party_id AS partyId , target_mobile_no AS phoneNumber ,

Re: How to pass a constant value to a partitioned hive table in spark

2020-04-16 Thread ayan guha
Hi Mich, Add it in the DF first: from pyspark.sql.functions import lit df = df.withColumn('broadcastId', lit(broadcastValue)) Then you will be able to access the column in the temp view. Re: Partitioning, DataFrame.write also supports a partitionBy clause and you can use it along with saveAsTable.

Re: How to pass a constant value to a partitioned hive table in spark

2020-04-16 Thread Mich Talebzadeh
Thanks Zhang, That is not working. I need to send the value for variable broadcastValue, it cannot interpret it. scala> sqltext = """ | INSERT INTO TABLE michtest.BroadcastStaging PARTITION (broadcastId = broadcastValue, brand = "dummy") | SELECT | ocis_party_id
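The name `broadcastValue` inside the SQL text is just characters to the SQL parser, so the value has to be substituted into the string before it is passed to `spark.sql`. A minimal sketch of that substitution is below, written in Python for illustration; in the Scala shell the `s"""..."""` string interpolator plays the same role. The function name `build_insert_sql` is hypothetical, the SELECT list is cut down to the two columns visible in this thread (the real table may need more), and the quoting check is a deliberately crude assumption rather than proper SQL escaping.

```python
def build_insert_sql(broadcast_value: str, brand: str = "dummy") -> str:
    """Build the INSERT statement with the partition values written in as literals."""
    for value in (broadcast_value, brand):
        # Crude guard: refuse anything that could break out of the quoted literal.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported character in partition value: {value!r}")
    return (
        "INSERT INTO TABLE michtest.BroadcastStaging "
        f"PARTITION (broadcastId = '{broadcast_value}', brand = '{brand}')\n"
        "SELECT\n"
        "  ocis_party_id AS partyId,\n"
        "  target_mobile_no AS phoneNumber\n"
        "FROM tmp"
    )


broadcast_value = "123456789"
sqltext = build_insert_sql(broadcast_value)
# With a SparkSession named `spark` (an assumption here), this would be run as:
#   spark.sql(sqltext)
```

Both partition columns are static in this form, so neither appears in the SELECT list; a dynamic `brand` partition would instead drop the `= 'dummy'` part and select the brand value as the last column.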

Re: How to pass a constant value to a partitioned hive table in spark

2020-04-16 Thread ZHANG Wei
> scala> spark.sql($sqltext) > :41: error: not found: value $sqltext > spark.sql($sqltext) ^ +-- should be Scala language Try this: scala> spark.sql(sqltext) -- Cheers, -z On Thu, 16 Apr 2020 08:49:40 +0100 Mich Talebzadeh wrote: > I have

unsubscribe

2020-04-16 Thread Jiang, Lan
-- Lan Jiang https://hpi.de/naumann/people/lan-jiang Hasso-Plattner-Institut an der Universität Potsdam Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam Tel +49 331 5509 280

Re: [Spark Core]: Does an executor only cache the partitions it requires for its computations or always the full RDD?

2020-04-16 Thread ZHANG Wei
As far as I know, if you are talking about RDD.cache(), the answer is that an executor caches only the partitions it requires. Cheers, -z From: zwithouta Sent: Tuesday, April 14, 2020 18:28 To: user@spark.apache.org Subject: [Spark Core]: Does an executor

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-16 Thread ZHANG Wei
The thread dump table in the Spark UI can provide some clues for tracking down thread lock issues, such as:
Thread ID | Thread Name | Thread State | Thread Locks
13 | NonBlockingInputStreamThread | WAITING | Blocked by Thread Some(48)

Re: wot no toggle ?

2020-04-16 Thread David Hesson
You may want to read about the JVM and have some degree of understanding what you're talking about, and then you'd know that those options have different meanings. You can view both at the same time, for example. On Thu, Apr 16, 2020, 2:13 AM jane thorpe wrote: >

How to pass a constant value to a partitioned hive table in spark

2020-04-16 Thread Mich Talebzadeh
I have a variable to be passed to a partition column as shown below: val broadcastValue = "123456789" // I assume this will be sent as a constant for the batch // Create a DF on top of XML df.createOrReplaceTempView("tmp") // Need to create and populate target Parquet table

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-16 Thread Jungtaek Lim
Take thread dumps continuously, at a fixed interval (like 1s), and watch how the stack / lock of each thread changes. (This is not easy to do in the UI, so doing it manually may be the only option. Not sure whether the Spark UI provides the same; I haven't used it at all.) It will tell which thread is being
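Comparing successive dumps can be sketched as below. This is an illustration under stated assumptions, not a tool from the thread: each snapshot is assumed to be already parsed into a dict mapping thread name to a `(state, top stack frame)` pair, for example from the output of the JDK's `jstack`, and the parsing itself is left out. The function name `unchanged_threads` and all sample data are invented.

```python
from typing import Dict, List, Tuple

# thread name -> (thread state, top stack frame)
Snapshot = Dict[str, Tuple[str, str]]


def unchanged_threads(snapshots: List[Snapshot]) -> Snapshot:
    """Return threads whose state and top frame are identical in every snapshot."""
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots to compare")
    first, rest = snapshots[0], snapshots[1:]
    return {
        name: info
        for name, info in first.items()
        # A thread missing from a later snapshot compares unequal and is dropped.
        if all(later.get(name) == info for later in rest)
    }


# Invented sample: three dumps taken one second apart.
dumps = [
    {"main": ("RUNNABLE", "frame_a"), "jdbc-reader": ("WAITING", "socketRead")},
    {"main": ("RUNNABLE", "frame_b"), "jdbc-reader": ("WAITING", "socketRead")},
    {"main": ("TIMED_WAITING", "frame_c"), "jdbc-reader": ("WAITING", "socketRead")},
]
suspects = unchanged_threads(dumps)
```

A thread that sits in the same state on the same frame across every dump is only a candidate; idle pool threads look the same, so the frame still has to be read to decide whether it is the hang.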

Re: Spark structured streaming - Fallback to earliest offset

2020-04-16 Thread Ruijing Li
Thanks Jungtaek, that makes sense. I tried Burak’s solution of just setting failOnDataLoss to false, but instead of failing, the job is stuck. I’m guessing that the offsets are being deleted faster than the job can process them and it will be stuck unless I increase resources? Or does once the

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-16 Thread Ruijing Li
Once I do a thread dump, what should I be looking for to tell where it is hanging? I am seeing a lot of timed_waiting and waiting on the driver. The driver is also being blocked by the Spark UI. If there are no tasks, is there a point in doing a thread dump of the executors? On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi

Re: Is there any way to set the location of the history for the spark-shell per session?

2020-04-16 Thread ZHANG Wei
From my understanding, you are talking about spark-shell command history, aren't you? If yes, you can try adding `--conf 'spark.driver.extraJavaOptions=-Dscala.shell.histfile=` into spark-shell command arguments since Spark shell is leveraging Scala REPL JLine file backend history settings.
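A small sketch of assembling that argument per session follows. The history path is elided in the message above, so the one used here (`/tmp/session-a.history`) is purely hypothetical, as is the helper name `spark_shell_command`; only the `--conf` key and the `scala.shell.histfile` property come from the reply.

```python
import shlex
from typing import List


def spark_shell_command(histfile: str) -> List[str]:
    """Build a spark-shell argument list that points the REPL at `histfile`."""
    java_opts = f"-Dscala.shell.histfile={histfile}"
    return ["spark-shell", "--conf", f"spark.driver.extraJavaOptions={java_opts}"]


# Hypothetical per-session path; substitute whatever location the session should use.
cmd = spark_shell_command("/tmp/session-a.history")
# shlex.join (Python 3.8+) renders the list as a shell-safe command line.
command_line = shlex.join(cmd)
```

Note that a value given this way replaces any `spark.driver.extraJavaOptions` already set in `spark-defaults.conf`, so existing options would need to be carried over into the same string.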

Re: Going it alone.

2020-04-16 Thread Subash Prabakar
Looks like he had a very bad appraisal this year.. Fun fact : the coming year would be too :) On Thu, 16 Apr 2020 at 12:07, Qi Kang wrote: > Well man, check your attitude, you’re way over the line > > > On Apr 16, 2020, at 13:26, jane thorpe > wrote: > > F*U*C*K O*F*F > C*U*N*T*S > > >

Re: Going it alone.

2020-04-16 Thread Qi Kang
Well man, check your attitude, you’re way over the line > On Apr 16, 2020, at 13:26, jane thorpe wrote: > > F*U*C*K O*F*F > C*U*N*T*S > > > > On Thursday, 16 April 2020 Kelvin Qin > wrote: > > No wonder I said why I can't understand what the mail expresses, it

Re: wot no toggle ?

2020-04-16 Thread Ashley Hoff
OK, we get it. You are not satisfied that Spark is easy enough to be used by mere mortals. Please stop. Maybe you should look at Databricks? On Thu, Apr 16, 2020 at 3:43 PM jane thorpe wrote: > https://spark.apache.org/docs/3.0.0-preview/web-ui.html#storage-tab > > On the link in one of the

wot no toggle ?

2020-04-16 Thread jane thorpe
https://spark.apache.org/docs/3.0.0-preview/web-ui.html#storage-tab On the link in one of the screen shot there are two  checkboxes.ON HEAP MEMORYOFF HEAP MEMORY. That is as useful as a pussy on as Barry Humphries wearing a gold dress as Dame Edna average. Which monkey came up with that