Spark 2.2.1 Dataframes multiple joins bug?

2020-03-23 Thread Dipl.-Inf. Rico Bergmann
Hi all! Is it possible that Spark, under certain circumstances, creates duplicate rows when doing multiple joins? What I did:
buse.count
res0: Long = 20554365
buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4").count
res1: Long = 20554365
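For what it's worth, row multiplication in joins is plain relational semantics rather than anything Spark-specific: every matching (left, right) key pair produces one output row, so a key that is duplicated on one side multiplies rows on the other. A minimal sketch in plain Python (no Spark; data values are made up):

```python
from collections import defaultdict

def inner_join_count(left_keys, right_keys):
    """Count rows produced by an inner join on a single key column.

    Each matching (left, right) key pair contributes one output row,
    so duplicate keys on either side multiply the result.
    """
    right_counts = defaultdict(int)
    for k in right_keys:
        right_counts[k] += 1
    return sum(right_counts[k] for k in left_keys)

# One left row joined against two right rows sharing its key
# yields two output rows, not one.
left = ["a", "b"]
right = ["a", "a", "c"]
print(inner_join_count(left, right))  # 2
```

So an unchanged count after a join (as in the example above) simply indicates the join key was unique on the other side for every matched row.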

Re: Spark 2.2.1 - Operation not allowed: alter table replace columns

2018-12-19 Thread Jiaan Geng
This SQL syntax is not supported at the moment. Please use ALTER TABLE ... CHANGE COLUMN instead. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Spark 2.2.1 - Operation not allowed: alter table replace columns

2018-12-17 Thread Nirav Patel
I see that a similar issue is fixed for the `ALTER TABLE table_name ADD COLUMNS(..)` statement: https://issues.apache.org/jira/browse/SPARK-19261 Is it also fixed for `REPLACE COLUMNS` in any subsequent version? Thanks
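For reference, the shape of the two statements being discussed looks roughly like this (table and column names are illustrative, not from the thread; whether CHANGE COLUMN accepts a type change varies by Spark/Hive version, so treat this as syntax only):

```sql
-- Rejected by Spark 2.2.1 with "Operation not allowed":
ALTER TABLE my_table REPLACE COLUMNS (id INT, name STRING);

-- The suggested workaround, altering one column at a time:
ALTER TABLE my_table CHANGE COLUMN name name STRING;
```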

Re: [External Sender] re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
>> I'm new to spark streaming and have trouble understanding spark batch >> "composition" (google search keeps giving me an older spark streaming >> concept). Would appreciate any help and clarifications. >> I'm using spark 2.2.1 for a streaming workload (see quoted c

Re: re: streaming, batch / spark 2.2.1

2018-08-02 Thread zakhavan
Yes, I am loading a text file from my local machine into a Kafka topic using the script below, and I'd like to calculate the number of samples per second consumed by the Kafka consumer.
if __name__ == "__main__":
    print("hello spark")
    sc = SparkContext(appName="STALTA")
    ssc =

Re: [External Sender] re: streaming, batch / spark 2.2.1

2018-08-02 Thread Jayesh Lalwani
Peter Liu wrote: > Hello there, > I'm new to spark streaming and have trouble understanding spark batch > "composition" (google search keeps giving me an older spark streaming > concept). Would appreciate any help and clarifications. > I'm using spark 2.2.1 for a streaming wo

Re: re: streaming, batch / spark 2.2.1

2018-08-02 Thread zakhavan
Hello, I just had a question. Could you refer me to a link, or tell me how you calculated figures such as *300K msg/sec to a kafka broker, 220 bytes per message*? I'm loading a text file with 36000 records into a Kafka topic and I'd like to calculate the data rate (#samples per sec) in Kafka.
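The arithmetic behind figures like "300K msg/sec, 220 bytes per message" is just record count divided by elapsed time, multiplied by average message size for bandwidth. A sketch of the calculation (the 12-second ingest window below is invented for illustration; measure your own load duration):

```python
def message_rate(num_messages: int, elapsed_seconds: float) -> float:
    """Messages (samples) per second."""
    return num_messages / elapsed_seconds

def throughput_bytes_per_sec(num_messages: int, elapsed_seconds: float,
                             bytes_per_message: int) -> float:
    """Approximate bandwidth: message rate times average message size."""
    return message_rate(num_messages, elapsed_seconds) * bytes_per_message

# 36,000 records loaded into the topic over a hypothetical 12 seconds:
print(message_rate(36_000, 12))                   # 3000.0 msg/sec
print(throughput_bytes_per_sec(36_000, 12, 220))  # 660000.0 bytes/sec
```

To measure the elapsed time in practice, timestamp the first and last produced records (or read the consumer's own metrics) rather than guessing.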

re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
Hello there, I'm new to spark streaming and have trouble understanding spark batch "composition" (google search keeps giving me an older spark streaming concept). Would appreciate any help and clarifications. I'm using spark 2.2.1 for a streaming workload (see quoted code in

Re: Strange codegen error for SortMergeJoin in Spark 2.2.1

2018-06-08 Thread Rico Bergmann
entry with a small program that can reproduce this problem? > Best Regards, > Kazuaki Ishizaki > From: Rico Bergmann > To: "user@spark.apache.org" > Date: 2018/06/05 19:58 > Subject: Stran

Re: Strange codegen error for SortMergeJoin in Spark 2.2.1

2018-06-07 Thread Kazuaki Ishizaki
codegen error for SortMergeJoin in Spark 2.2.1 Hi! I get a strange error when executing a complex SQL-query involving 4 tables that are left-outer-joined: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 18: failed

Strange codegen error for SortMergeJoin in Spark 2.2.1

2018-06-05 Thread Rico Bergmann
Hi! I get a strange error when executing a complex SQL-query involving 4 tables that are left-outer-joined: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 18: failed to compile: org.codehaus.commons.compiler.CompileException: File
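One commonly used mitigation while a codegen bug like this is open (it works around the symptom rather than fixing the root cause) is to disable whole-stage code generation so the query falls back to the interpreted execution path; `spark.sql.codegen.wholeStage` is a standard Spark SQL conf:

```sql
-- In spark-shell / a SQL session, before running the failing query:
SET spark.sql.codegen.wholeStage=false;
```

This usually costs performance, so it is best scoped to the affected job or session only.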

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Gourav Sengupta
> … some inputformats need a (local) tmp directory. Sometimes this cannot be avoided. > See also the source: > https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
>> …hbase/mapred/TableSnapshotInputFormat.java >> On 7. Apr 2018, at 20:26, Saad Mufti <saad.mu...@gmail.com> wrote: >> Hi, >> I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data is HBase files in

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
> On 7. Apr 2018, at 20:26, Saad Mufti <saad.mu...@gmail.com> wrote: > Hi, > I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data is HBase files in AWS S3 using EMRFS, but there is no HBase running on the Spark cluster itself. It is r

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Jörn Franke
https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java > On 7. Apr 2018, at 20:26, Saad Mufti <saad.mu...@gmail.com> wrote: > Hi, > I have a simple ETL Spark job running on AWS EMR with Spark 2.2.

High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
Hi, I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data is HBase files in AWS S3 using EMRFS, but there is no HBase running on the Spark cluster itself. It is restoring the HBase snapshot into files on disk in another S3 folder used for temporary storage

DataFrameWriter in pyspark ignoring hdfs attributes (using spark-2.2.1-bin-hadoop2.7)?

2018-03-10 Thread Chuan-Heng Hsiao
Hi all, I am using spark-2.2.1-bin-hadoop2.7 in stand-alone mode (Python version: 3.5.2, on Ubuntu 16.04). I intended to have DataFrame write to HDFS with a customized block size but failed. However, the corresponding RDD can successfully write with the customized block size. Could you help me

Spark 2.2.1 EMR 5.11.1 Encrypted S3 bucket overwriting parquet file

2018-02-13 Thread Stephen Robinson
Hi All, I am using the latest version of EMR to overwrite Parquet files to an S3 bucket encrypted with a KMS key. I am seeing the attached error whenever I overwrite a Parquet file. For example, the below code produces the attached error and stacktrace:

New to spark 2.2.1 - Problem with finding tables between different metastore db

2018-02-06 Thread Subhajit Purkayastha
All, I am new to Spark 2.2.1. I have a single node cluster and also have enabled thriftserver for my Tableau application to connect to my persisted table. I feel that the spark cluster metastore is different from the thrift-server metastore. If this assumption is valid, what do I need
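If the shell and the Thrift server each start their own embedded Derby metastore, they indeed will not see each other's tables. The usual approach is to point both at one shared metastore via a hive-site.xml in Spark's conf directory; a sketch follows, where the JDBC URL, host, and database name are assumptions to adapt (the property names are standard Hive metastore settings):

```xml
<!-- conf/hive-site.xml, shared by spark-shell and the Thrift server -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- hypothetical MySQL-backed metastore; substitute your own -->
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>
```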

Running Spark 2.2.1 with extra packages

2018-02-02 Thread Conconscious
Hi list, I have a Spark cluster with 3 nodes. I'm calling spark-shell with some packages to connect to AWS S3 and Cassandra:
spark-shell \
  --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,datastax:spark-cassandra-connector:2.0.6-s_2.11 \
  --conf

Re: spark 2.2.1

2018-02-02 Thread Mihai Iacob
…Mihai Iacob <mia...@ca.ibm.com> Cc: User <user@spark.apache.org> Subject: Re: spark 2.2.1 Date: Fri, Feb 2, 2018 8:23 AM What version of java? On Feb 1, 2018 11:30 AM, "Mihai Iacob" <mia...@ca.ibm.com> wrote: I am setting up a spark 2.2.1 cluster, however, when I bring up the maste

Re: spark 2.2.1

2018-02-02 Thread Bill Schwanitz
What version of java? On Feb 1, 2018 11:30 AM, "Mihai Iacob" <mia...@ca.ibm.com> wrote: > I am setting up a spark 2.2.1 cluster, however, when I bring up the master > and workers (both on spark 2.2.1) I get this error. I tried spark 2.2.0 and > get the same error. It

spark 2.2.1

2018-02-01 Thread Mihai Iacob
I am setting up a spark 2.2.1 cluster, however, when I bring up the master and workers (both on spark 2.2.1) I get this error. I tried spark 2.2.0 and get the same error. It works fine on spark 2.0.2. Have you seen this before? Any idea what's wrong? I found this, but it's in a different

Re: flatMapGroupsWithState not timing out (spark 2.2.1)

2018-01-12 Thread Tathagata Das
Aah, okay! How are you testing whether there is a timeout? The situation that would lead to the *EventTimeTimeout* would be the following. 1. Send a bunch of data to group1, to set the timeout timestamp using event-time. 2. Then send more data to group2 only, to advance the watermark (since it's based on

Re: flatMapGroupsWithState not timing out (spark 2.2.1)

2018-01-12 Thread Tathagata Das
Hello Dan, From your code, it seems like you are setting the timeout timestamp based on the current processing-time / wall-clock-time, while the watermark is being calculated on the event-time ("when" column). The semantics of EventTimeTimeout are that when the last set timeout timestamp of a
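The semantics described above can be sketched in plain Python, independent of Spark: an event-time timeout for a key fires only once the watermark, which is driven by incoming event times rather than the wall clock, advances past that key's last set timeout timestamp. Everything below is an illustrative model, not the Spark API (whether the boundary is inclusive is a detail of this model):

```python
def expired_keys(timeout_ts_by_key, watermark):
    """Return keys whose event-time timeout has fired, i.e. whose last
    set timeout timestamp now lies at or before the watermark."""
    return sorted(k for k, ts in timeout_ts_by_key.items() if ts <= watermark)

# group1 set a timeout at event-time 100; the watermark is only at 90,
# so no timeout fires yet, no matter how much wall-clock time passes.
timeouts = {"group1": 100}
print(expired_keys(timeouts, 90))   # []

# New data for group2 carries later event times, pushing the watermark
# to 110 -- only now does group1's timeout fire.
print(expired_keys(timeouts, 110))  # ['group1']
```

This is why sending data only to other groups can still time out an idle group: the watermark is global to the query, while timeout timestamps are per key.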

flatMapGroupsWithState not timing out (spark 2.2.1)

2018-01-12 Thread daniel williams
Hi, I’m attempting to leverage flatMapGroupsWithState to handle some arbitrary aggregations and am noticing a couple of things:
- *ProcessingTimeTimeout* + *setTimeoutDuration* timeout not being honored
- *EventTimeTimeout* + watermark value not being honored
- *EventTimeTimeout* +

Regression in Spark SQL UI Tab in Spark 2.2.1

2018-01-11 Thread Yuval Itzchakov
Hi, I've recently installed Spark 2.2.1, and it seems like the SQL tab isn't getting updated at all; although the "Jobs" tab gets updated with new incoming jobs, the SQL tab remains empty all the time. I was wondering if anyone has noticed such a regression in 2.2.1? -- Best Rega

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
My current best guess is that Spark does not fully support Hadoop 3.x, because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for Hadoop 3.x) has not been resolved. There

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
…a similar question on stackoverflow.com <https://stackoverflow.com/questions/47920005/how-is-hadoop-3-0-0-s-compatibility-with-older-versions-of-hive-pig-sqoop-and>, Mr. jacek-laskowski <https://stackoverflow.com/users/1305344/jacek-laskowski> replied that spark-2.2.1 does

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread akshay naidu
…<https://stackoverflow.com/questions/47920005/how-is-hadoop-3-0-0-s-compatibility-with-older-versions-of-hive-pig-sqoop-and>, Mr. jacek-laskowski <https://stackoverflow.com/users/1305344/jacek-laskowski> replied that spark-2.2.1 doesn't support hadoop-3. So I am just looking for more clarity on this doubt before moving on to upgrades. Thanks

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Saisai Shao
…Thanks, Jerry. 2018-01-08 4:50 GMT+08:00 Raj Adyanthaya <raj...@gmail.com>: > Hi Akshay > On the Spark Download page when you select Spark 2.2.1 it gives you an > option to select package type. In that, there is an option to select > "Pre-Built for Apache Hadoop 2.

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Raj Adyanthaya
Hi Akshay, On the Spark Download page, when you select Spark 2.2.1 it gives you an option to select the package type. In that, there is an option to select "Pre-Built for Apache Hadoop 2.7 and later". I am assuming this means that it does support Hadoop 3.0. http://spark.apache.org/downloads.html

Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-06 Thread akshay naidu
Hello users, I need to know whether we can run the latest Spark on the latest Hadoop version, i.e., spark-2.2.1 (released on 1st Dec) with hadoop-3.0.0 (released on 13th Dec). Thanks.

Re: Spark 2.2.1 worker invocation

2017-12-26 Thread Felix Cheung
Subject: Spark 2.2.1 worker invocation I need to set java.library.path to get access to some native code. Following directions, I made a spark-env.sh: #!/usr/bin/env bash export LD_LIBRARY_PATH="/usr/local/lib/libcdfNativeLibrary.so:/usr/local/lib/libcdf.so:${LD_LIBRARY_PATH}"

Spark 2.2.1 worker invocation

2017-12-26 Thread Christopher Piggott
I need to set java.library.path to get access to some native code. Following directions, I made a spark-env.sh:
#!/usr/bin/env bash
export LD_LIBRARY_PATH="/usr/local/lib/libcdfNativeLibrary.so:/usr/local/lib/libcdf.so:${LD_LIBRARY_PATH}"
export
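Besides spark-env.sh, the library path can also be pushed through standard Spark configs: `spark.driver.extraLibraryPath` / `spark.executor.extraLibraryPath` and the `extraJavaOptions` variants are documented settings, though whether they fit depends on how the workers are launched. A sketch for spark-defaults.conf, reusing the poster's /usr/local/lib location:

```
# spark-defaults.conf
spark.driver.extraLibraryPath      /usr/local/lib
spark.executor.extraLibraryPath    /usr/local/lib
# or, equivalently, via JVM options:
spark.driver.extraJavaOptions      -Djava.library.path=/usr/local/lib
spark.executor.extraJavaOptions    -Djava.library.path=/usr/local/lib
```

Note that java.library.path and the extraLibraryPath settings take directories, not individual .so files as in the LD_LIBRARY_PATH line above.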

Re: Why Spark 2.2.1 still bundles old Hive jars?

2017-12-11 Thread Jacek Laskowski
/jaceklaskowski On Mon, Dec 11, 2017 at 7:43 AM, An Qin <a...@qilinsoft.com> wrote: > Hi, all, > > > > I want to include Sentry 2.0.0 in my Spark project. However it bundles > Hive 2.3.2. I find the newest Spark 2.2.1 still bundles old Hive jars, for > example, hive-exec-1.2.1.spar

Why Spark 2.2.1 still bundles old Hive jars?

2017-12-10 Thread An Qin
Hi all, I want to include Sentry 2.0.0 in my Spark project; however, it bundles Hive 2.3.2. I find that the newest Spark 2.2.1 still bundles old Hive jars, for example hive-exec-1.2.1.spark2.jar. Why doesn't it upgrade to the newer Hive? Are they compatible? Regards, Qin An.