Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Jeff Evans
Why not perform a df.select(...) before the final write to ensure a consistent ordering? On Thu, Mar 4, 2021, 7:39 AM Oldrich Vlasic wrote: > Thanks for the reply! Is there something to be done, setting a config property > for example? I'd like to prevent users (mainly data scientists) from >
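
For illustration, a minimal sketch of that suggestion (column names are hypothetical; insertInto resolves columns by position, so align the DataFrame to the table's column order first):

    // align column order to the target table before writing
    val aligned = df.select("col1", "col2", "col3")
    aligned.write.insertInto("target_table")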

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Jeff Evans
at 12:45 PM, Jeff Evans > wrote: > > > > If the data is already in Parquet files, I don't see any reason to > involve JDBC at all. You can read Parquet files directly into a > DataFrame. > https://spark.apache.org/docs/latest/sql-data-sources-parquet.html > >

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Jeff Evans
If the data is already in Parquet files, I don't see any reason to involve JDBC at all. You can read Parquet files directly into a DataFrame. https://spark.apache.org/docs/latest/sql-data-sources-parquet.html On Thu, Feb 18, 2021 at 1:42 PM Scott Ribe wrote: > I need a little help figuring out
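
A minimal sketch of the direct read (path is hypothetical):

    // read the Parquet files straight into a DataFrame -- no JDBC layer needed
    val df = spark.read.parquet("/data/events.parquet")
    df.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()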

Re: How to modify a field in a nested struct using pyspark

2021-01-29 Thread Jeff Evans
If you need to do this in 2.x, this library does the trick: https://github.com/fqaiser94/mse On Fri, Jan 29, 2021 at 12:15 PM Adam Binford wrote: > I think they're voting on the next release candidate starting sometime > next week. So hopefully barring any other major hurdles within the next
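
For what it's worth, Spark 3.1+ adds Column.withField, which makes this a one-liner; a sketch assuming a struct column "address" containing a "city" field:

    import org.apache.spark.sql.functions.{col, lit}
    // replace a single field inside a nested struct (Spark 3.1+)
    val updated = df.withColumn("address", col("address").withField("city", lit("Chicago")))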

Re: unsubscribe

2020-12-16 Thread Jeff Evans
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Wed, Dec 16, 2020, 6:45 AM 张洪斌 wrote: > how to unsubscribe this? > > Sent from NetEase Mail Master > On 2020-12-16 20:43, 张洪斌 > wrote: > > > unsubscribe > Student 张洪斌 > Email: hongbinzh...@163.com > >

Re: Unsubscribe

2020-12-09 Thread Jeff Evans
That's not how to unsubscribe. https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Wed, Dec 9, 2020 at 9:26 AM Bhavya Jain wrote: > unsubscribe >

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Jeff Evans
In your situation, I'd try to do one of the following (in decreasing order of personal preference): 1. Restructure things so that you can operate on a local data file, at least for the purpose of developing your driver logic. Don't rely on the Metastore or HDFS until you have to.
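
As a sketch of option 1 (paths and table names are hypothetical):

    // develop the driver logic against a small local sample...
    val dev = spark.read.option("header", "true").csv("file:///tmp/sample.csv")
    // ...and only switch to the real source (e.g. a Hive table) once the logic works
    val prod = spark.table("prod_db.events")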

Re: Spark as computing engine vs spark cluster

2020-10-12 Thread Jeff Evans
Spark is a computation engine that runs on a set of distributed nodes. You must "bring your own" hardware, although of course there are hosted solutions available. On Sat, Oct 10, 2020 at 9:24 AM Santosh74 wrote: > Is spark compute engine only or it's also cluster which comes with set of >

Re: Distribute entire columns to executors

2020-09-24 Thread Jeff Evans
I think you can just select the columns you need into new DataFrames, then process those separately:

    val dfFirstTwo = ds.select("Col1", "Col2")
    // do whatever with this one
    dfFirstTwo.sort(...)
    // similar for the next two columns
    val dfNextTwo = ds.select("Col3", "Col4")
    dfNextTwo.sort(...)

These

Re: Unsubscribe

2020-08-26 Thread Jeff Evans
That is not how you unsubscribe. See here for instructions: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Wed, Aug 26, 2020, 4:22 PM Annabel Melongo wrote: > Please remove me from the mailing list >

Re: Garbage collection issue

2020-07-20 Thread Jeff Evans
What is your heap size, and JVM vendor/version? Generally, G1 only outperforms CMS on large heap sizes (ex: 31GB or larger). On Mon, Jul 20, 2020 at 1:22 PM Amit Sharma wrote: > Please help on this. > > > Thanks > Amit > > On Fri, Jul 17, 2020 at 2:34 PM Amit Sharma wrote: > >> Hi All, i am
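
If you do want to experiment with collectors, they can be switched per job; a sketch (sizes are illustrative):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
      --conf spark.executor.memory=31g \
      ...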

Re: Using spark.jars conf to override jars present in spark default classpath

2020-07-16 Thread Jeff Evans
If you can't avoid it, you need to make use of the spark.driver.userClassPathFirst and/or spark.executor.userClassPathFirst properties. On Thu, Jul 16, 2020 at 2:03 PM Russell Spitzer wrote: > I believe the main issue here is that spark.jars is a bit "too late" to > actually prepend things to
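
A sketch of passing those properties (note both are flagged as experimental in the Spark configuration docs):

    spark-submit \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      --jars /path/to/your-newer-library.jar \
      ...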

Re: Mock spark reads and writes

2020-07-15 Thread Jeff Evans
Why do you need to mock the read/write at all? Why not have your test CSV file, invoke your logic (which will perform the real Spark DF read of the CSV), write the result, and assert on the output? On Tue, Jul 14, 2020 at 12:19 PM Dark Crusader wrote: > Sorry I wasn't very clear in my last email. > > I have a
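
A minimal sketch of that style of test, assuming a transform(df) function under test:

    // real read, real transform, assert on the real output -- no mocking
    val input = spark.read.option("header", "true").csv("src/test/resources/input.csv")
    val output = transform(input)
    assert(output.count() == 42)  // expected row count for the test fixture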

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-09 Thread Jeff Evans
There are various sample JDBC URLs documented here, depending on the driver vendor, Kerberos (or not), and SSL (or not). Often, unsurprisingly, SSL is used in conjunction with Kerberos. Even if you don't use StreamSets software at all, you might find these useful.
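
For illustration, a typical Hive JDBC URL combining Kerberos and SSL looks something like this (host, port, principal, and realm are all hypothetical):

    jdbc:hive2://hive-host:10000/default;principal=hive/_HOST@EXAMPLE.COM;ssl=true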

Re: unsubscribe

2020-06-30 Thread Jeff Evans
That is not how you unsubscribe. See here for instructions: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Tue, Jun 30, 2020 at 1:31 PM Bartłomiej Niemienionek < b.niemienio...@gmail.com> wrote: >

Re: File Not Found: /tmp/spark-events in Spark 3.0

2020-06-30 Thread Jeff Evans
This should only be needed if the spark.eventLog.enabled property was set to true. Is it possible the job configuration is different between your two environments? On Mon, Jun 29, 2020 at 9:21 AM ArtemisDev wrote: > While launching a spark job from Zeppelin against a standalone spark > cluster
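
For reference, the properties involved (directory value is illustrative; the directory must already exist, or the job fails at startup with an error like the one above):

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs:///spark-logs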

Re: unsubscribe

2020-06-27 Thread Jeff Evans
That is not how you unsubscribe. See here for instructions: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Sat, Jun 27, 2020, 6:08 PM Sri Kris wrote: > > > > > Sent from Mail for > Windows 10 > > >

Re: Where are all the jars gone ?

2020-06-24 Thread Jeff Evans
If I'm understanding this correctly, you are building Spark from source and using the built artifacts (jars) in some other project. Correct? If so, then why are you concerning yourself with the directory structure that Spark uses internally when building its artifacts? It should be a black

Re: apache-spark mongodb dataframe issue

2020-06-23 Thread Jeff Evans
As far as I know, in general, there isn't a way to distinguish explicit null values from missing ones. (Someone please correct me if I'm wrong, since I would love to be able to do this for my own reasons). If you really must do it, and don't care about performance at all (since it will be
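
A rough sketch of that brute-force approach, assuming one JSON object per line and leaning on the Jackson library that Spark already ships with:

    import com.fasterxml.jackson.databind.ObjectMapper
    import org.apache.spark.sql.functions.{col, udf}

    // re-parse each raw line just to test key presence (slow, as noted above)
    val hasName = udf { raw: String => new ObjectMapper().readTree(raw).has("name") }
    val raw = spark.read.text("records.json")
    raw.select(col("value"), hasName(col("value")).as("has_name")).show(false)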

Re: unsubscribe

2020-06-17 Thread Jeff Evans
That is not how you unsubscribe. See here: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Wed, Jun 17, 2020 at 8:56 AM DIALLO Ibrahima (BPCE-IT - Consultime) wrote: > > > > > *Ibrahima DIALLO* > > *Consultant Big Data – Architecte - Analyste* > > *Consultime * - *Pour

Re: unsubscribe

2020-06-17 Thread Jeff Evans
That is not how you unsubscribe. See here: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e On Wed, Jun 17, 2020 at 5:39 AM Ferguson, Jon wrote: > > > This message is confidential and subject to terms at: > https://www.jpmorgan.com/emaildisclaimer including on confidential, >

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-12 Thread Jeff Evans
It sounds like you're expecting the XPath expression to evaluate embedded Spark SQL expressions? From the documentation, there appears to be no reason to expect that to work. On Tue, May 12, 2020 at 2:09 PM Chetan Khatri wrote: >
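
For reference, xpath_int evaluates a literal XPath expression against the XML string, e.g.:

    spark.sql("SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show()  // yields 3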

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-07 Thread Jeff Evans
You might want to double check your Hadoop config files. From the stack trace it looks like this is happening when simply trying to load configuration (XML files). Make sure they're well formed. On Thu, May 7, 2020 at 6:12 AM Hrishikesh Mishra wrote: > Hi > > I am getting out of memory error
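
One quick way to check well-formedness, assuming xmllint is available (paths will vary):

    # exits non-zero and prints the parse error if a file is malformed
    xmllint --noout /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml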

Re: [Spark SQL][Beginner] Spark throw Catalyst error while writing the dataframe in ORC format

2020-05-07 Thread Jeff Evans
You appear to be hitting the broadcast timeout. See: https://stackoverflow.com/a/41126034/375670 On Thu, May 7, 2020 at 8:56 AM Deepak Garg wrote: > Hi, > > I am getting following error while running a spark job. Error > occurred when Spark is trying to write the dataframe in ORC format . I am
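
The usual knobs here, passed to spark-submit (values illustrative):

    --conf spark.sql.broadcastTimeout=600            # raise the timeout, in seconds (default 300)
    --conf spark.sql.autoBroadcastJoinThreshold=-1   # or disable automatic broadcast joins entirely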

Re: Which SQL flavor does Spark SQL follow?

2020-05-06 Thread Jeff Evans
https://docs.databricks.com/spark/latest/spark-sql/language-manual/index.html https://spark.apache.org/docs/latest/api/sql/index.html On Wed, May 6, 2020 at 3:35 PM Aakash Basu wrote: > Hi, > > Wish to know, which type of SQL syntax is followed when we write a plain > SQL query inside

Re: [Meta] Moderation request diversion?

2020-04-24 Thread Jeff Evans
Thanks, Sean; much appreciated. On Fri, Apr 24, 2020 at 1:09 PM Sean Owen wrote: > The mailing lists are operated by the ASF. I've asked whether it's > possible here: https://issues.apache.org/jira/browse/INFRA-20186 > > On Fri, Apr 24, 2020 at 12:39 PM Jeff Evans > wrote

Re: [Meta] Moderation request diversion?

2020-04-24 Thread Jeff Evans
the subject? It appears to be possible; see: http://untroubled.org/ezmlm/faq/Restricting-posts-based-on-the-Subject-line.html#Restricting-posts-based-on-the-Subject-line On Mon, Jun 24, 2019 at 3:45 PM Jeff Evans wrote: > There seem to be a lot of people trying to unsubscribe via the main > addr

Re: SPARK Suitable IDE

2020-03-02 Thread Jeff Evans
For developing Spark itself, or applications built using Spark? In either case, IntelliJ IDEA works well. For the former case, there is even a page explaining how to set it up. https://spark.apache.org/developer-tools.html On Mon, Mar 2, 2020, 4:43 PM Zahid Rahman wrote: > Hi, > > Can you

Re: What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread Jeff Evans
Did you try this? https://stackoverflow.com/a/2114387/375670 On Tue, Feb 25, 2020 at 10:23 AM yeikel valdes wrote: > I am currently using a third party library(Lucene) with Spark that is not > serializable. Due to that reason, it generates the following exception : > > Job aborted due to
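
Failing that, a common workaround is to construct the non-serializable object on each executor rather than shipping it from the driver; a sketch using a stand-in class:

    // the enclosing object is never serialized; 'lazy val' defers construction
    // until first use inside each executor JVM
    object Holder {
      lazy val formatter = java.text.NumberFormat.getInstance()  // stand-in for, e.g., a Lucene analyzer
    }
    // given some rdd of numeric values
    rdd.map(x => Holder.formatter.format(x))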

Re: Possible to limit number of IPC retries on spark-submit?

2020-01-31 Thread Jeff Evans
On Wed, Jan 22, 2020 at 5:02 PM Jeff Evans wrote: > Greetings, > > Is it possible to limit the number of times the IPC client retries upon a > spark-submit invocation? For context, see this StackOverflow post > <https://stackoverflow.com/questions/59863850/how-to-control-the-nu

Possible to limit number of IPC retries on spark-submit?

2020-01-22 Thread Jeff Evans
Greetings, Is it possible to limit the number of times the IPC client retries upon a spark-submit invocation? For context, see this StackOverflow post. In essence, I am
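
One possibility (a guess, not verified): these retries are governed by Hadoop client properties, which can be passed through via the spark.hadoop.* prefix, e.g.:

    spark-submit \
      --conf spark.hadoop.ipc.client.connect.max.retries=3 \
      --conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=10000 \
      ...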

Re: Is there a way to get the final web URL from an active Spark context

2020-01-22 Thread Jeff Evans
It can be done (given an instance of the Hadoop Configuration object): https://gist.github.com/jeff303/8dab0e52dc227741b6605f576a317798 On Fri, Jan 17, 2020 at 4:09 PM Jeff Evans wrote: > Given a session/context, we can get the UI web URL like this: > > sparkSession.sparkContext.uiWebUrl

Is there a way to get the final web URL from an active Spark context

2020-01-17 Thread Jeff Evans
Given a session/context, we can get the UI web URL like this: sparkSession.sparkContext.uiWebUrl This gives me something like http://node-name.cluster-name:4040. If opening this from outside the cluster (ex: my laptop), this redirects via HTTP 302 to something like

What's the deal with --proxy-user?

2019-11-06 Thread Jeff Evans
Hi all, I'm trying to understand whether the --proxy-user parameter to spark-submit is deprecated, or something similar. The reason I ask is that it's hard to find documentation really talking about it. The Spark Security doc doesn't mention it

Deleting columns within nested arrays/structs?

2019-10-29 Thread Jeff Evans
The starting point for the code is the various answers to this StackOverflow question. Fixing some of the issues there, I end up with the following: def dropColumn(df: DataFrame, colName: String):
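
For a single level of nesting, the core idea reduces to rebuilding the struct without the unwanted field; a sketch assuming a struct column "a" from which field "b" should be dropped:

    import org.apache.spark.sql.functions.{col, struct}
    import org.apache.spark.sql.types.StructType

    // keep every field of struct 'a' except 'b', then overwrite 'a'
    val kept = df.schema("a").dataType.asInstanceOf[StructType]
      .fieldNames.filterNot(_ == "b")
      .map(n => col(s"a.$n").alias(n))
    val result = df.withColumn("a", struct(kept: _*))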

Distinguishing between field missing and null in individual record?

2019-06-25 Thread Jeff Evans
Suppose we have the following JSON, which we parse into a DataFrame (using the multiline option). [{ "id": 8541, "value": "8541 changed again value" },{ "id": 51109, "name": "newest bob", "value": "51109 changed again" }] Regardless of whether we explicitly define a schema, or allow it
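
A quick way to see the problem (file name hypothetical):

    val df = spark.read.option("multiLine", "true").json("records.json")
    df.select("id", "name", "value").show()
    // the row for id 8541 shows name as null -- exactly as it would
    // if the JSON had contained an explicit "name": null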

[Meta] Moderation request diversion?

2019-06-24 Thread Jeff Evans
There seem to be a lot of people trying to unsubscribe via the main address, rather than following the instructions from the welcome email. Of course, this is not all that surprising, but it leads to a lot of pointless threads*. Is there a way to enable automatic detection and diversion of such

Is there a difference between --proxy-user or HADOOP_USER_NAME in a non-Kerberized YARN cluster?

2019-05-16 Thread Jeff Evans
Let's suppose we're dealing with a non-secured (i.e. not Kerberized) YARN cluster. When I invoke spark-submit, is there a practical difference between specifying --proxy-user=foo (supposing impersonation is properly set up) or setting the environment variable HADOOP_USER_NAME=foo? Thanks for any
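
Concretely, the two invocations being compared (user, class, and jar are hypothetical):

    # impersonation via spark-submit itself
    spark-submit --proxy-user foo --class com.example.Main app.jar
    # vs. the Hadoop user environment variable
    HADOOP_USER_NAME=foo spark-submit --class com.example.Main app.jar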

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Jeff Evans
ll make Spark code print the command to stderr. Not > > optimal but I think it's the only current option. > > > > On Wed, Apr 24, 2019 at 1:55 PM Jeff Evans > > wrote: > > > > > > The org.apache.spark.launcher.SparkLauncher is used to construct a > &

Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Jeff Evans
The org.apache.spark.launcher.SparkLauncher is used to construct a spark-submit invocation programmatically, via a builder pattern. In our application, which uses a SparkLauncher internally, I would like to log the full spark-submit command that it will invoke to our log file, in order to aid in
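
For context, a minimal sketch of the builder usage in question (paths and class are hypothetical):

    import org.apache.spark.launcher.SparkLauncher

    val launcher = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.Main")
      .setMaster("yarn")
    // the full spark-submit command assembled internally here is
    // what we would like to log
    val process = launcher.launch()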

Why does this spark-shell invocation get suspended due to tty output?

2019-04-04 Thread Jeff Evans
Hi all, I am trying to make our application check the Spark version before attempting to submit a job, to ensure the user is on a new enough version (in our case, 2.3.0 or later). I realize that there is a --version argument to spark-shell, but that prints the version next to some ASCII art so a
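
The kind of extraction being attempted, roughly (a sketch; the version banner appears to go to stderr):

    spark-submit --version 2>&1 | grep -oE 'version [0-9]+\.[0-9]+\.[0-9]+' | head -1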