exhaustive list of configuration options

2018-11-19 Thread Shiyuan
Hi Spark Users, Is there a way I can get the exhaustive list of configuration options and their default values? The documentation page https://spark.apache.org/docs/latest/configuration.html is not exhaustive. The Spark UI/environment tab is not exhaustive either. Thank you!
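A partial workaround sketch, not an exhaustive answer: the two calls below only cover explicitly-set options and the documented spark.sql.* options respectively, but together they surface more than the documentation page.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Options that were explicitly set for this application (not the defaults).
    for key, value in spark.sparkContext.getConf().getAll():
        print(key, value)

    # SQL configuration keys with their current values and descriptions;
    # "SET -v" includes the documented spark.sql.* options and their defaults.
    spark.sql("SET -v").show(truncate=False)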

Exception thrown in awaitResult during application launch in yarn cluster

2018-05-18 Thread Shiyuan
Hi Spark-users, I am using pyspark on a yarn cluster. One of my Spark applications failed to launch: only the driver container had started before it failed in the ACCEPTED state. The error message is very short and I cannot make sense of it. The error message is attached below. Any possible causes

Submit many spark applications

2018-05-16 Thread Shiyuan
Hi Spark-users, I want to submit as many spark applications as the resources permit. I am using cluster mode on a yarn cluster. Yarn can queue and launch these applications without problems. The problem lies in spark-submit itself: spark-submit starts a JVM which could fail due to insufficient
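A hedged sketch of one way to bound the submission side, assuming the applications are independent; the application path and arguments are hypothetical placeholders. Only a fixed number of spark-submit client JVMs are alive at once, so the submitting host does not exhaust its memory while YARN queues the launched applications.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical list of applications to launch.
    APPS = [["my_job.py", "--date", d] for d in ["2018-05-01", "2018-05-02"]]

    def submit(app_args):
        cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"] + app_args
        # check=True surfaces spark-submit failures instead of dropping them silently.
        return subprocess.run(cmd, check=True)

    # At most 4 spark-submit JVMs at a time; YARN queues the applications themselves.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(submit, APPS))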

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Here it is: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2991198123660769/823198936734135/866038034322120/latest.html On Wed, Apr 11, 2018 at 10:55 AM, Alessandro Solimando <alessandro.solima...@gmail.com> wrote: > Hi Shiyuan, >

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
, Apr 10, 2018 at 9:03 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi Shiyuan, > > I do not know whether I am right, but I would prefer to avoid expressions > in Spark as: > > df = <> > > > Regards, > Gourav Sengupta > > On Tue, Apr 1

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
ct("ID","score","LABEL","kk") df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL")>1) df = df.join(df_t.select("ID"),["ID"]) df_sw = df.groupby(["ID","

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
you could kindly try using the below statement and > go through your use case once again (I am yet to go through all the lines): > > > > from pyspark.sql import Row > > df = spark.createDataFrame([Row(score = 1.0,ID="abc",LABEL=True,k=2), > Row(score = 1.0,ID=

A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Shiyuan
Hi Spark Users, The following code snippet has an "attribute missing" error while the attribute exists. This bug is triggered by a particular sequence of "select", "groupby" and "join". Note that if I take away the "select" in #line B, the code runs without error. However, the
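A hedged reconstruction from the fragments quoted in the replies above (the column names ID, score, LABEL, kk come from those fragments; the sample rows and the exact contents of #line B are guesses to make the snippet self-contained). It only sketches the select -> groupby/agg -> join sequence the thread is about.

    from pyspark.sql import Row, SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([Row(score=1.0, ID="abc", LABEL=True, kk=2),
                                Row(score=1.0, ID="abc", LABEL=False, kk=3)])

    df = df.select("ID", "score", "LABEL", "kk")   # the select at #line B
    df_t = (df.groupby("ID")
              .agg(F.countDistinct("LABEL").alias("nL"))
              .filter(F.col("nL") > 1))
    df = df.join(df_t.select("ID"), ["ID"])        # join back onto a derived frame
    df_sw = df.groupby(["ID", "kk"]).count().withColumnRenamed("count", "cnt1")
    df = df.join(df_sw, ["ID", "kk"])
    df.show()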

Uncaught exception in thread heartbeat-receiver-event-loop-thread

2018-04-02 Thread Shiyuan
Hi, I got an error of Uncaught exception in thread heartbeat-receiver-event-loop-thread. Does this error indicate that some node is too overloaded to be responsive? Thanks! ERROR Utils: Uncaught exception in thread heartbeat-receiver-event-loop-thread java.lang.NullPointerException

Re: strange behavior of joining dataframes

2018-03-23 Thread Shiyuan
.col("nL")>1) df = df.join(df_t.select("ID"),["ID"]) df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1") df = df.join(df_sw, ["ID","kk"]) On Tue, Mar 20, 2018 at 9:58 PM, Shiyuan <gshy

strange behavior of joining dataframes

2018-03-20 Thread Shiyuan
Hi Spark-users: I have a dataframe "df_t" which was generated from other dataframes by several transformations. Then I did something very simple, just counting the rows, with the following code: (A) df_t_1 = df_t.groupby(["Id","key"]).count().withColumnRenamed("count", "cnt1") df_t_2 =
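A hedged sketch of the pattern described here, with df_t replaced by an inline stand-in (the real df_t came from earlier transformations the preview does not show, and df_t_2 is a guess at what the truncated line did):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Stand-in for df_t; the real one was built from other dataframes.
    df_t = spark.createDataFrame([("a", "k1"), ("a", "k1"), ("b", "k2")], ["Id", "key"])

    # (A) count rows per (Id, key); rename "count" so a later join does not collide.
    df_t_1 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt1")
    df_t_2 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt2")
    df_t_1.join(df_t_2, ["Id", "key"]).show()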

Insufficient memory for Java Runtime

2018-03-13 Thread Shiyuan
Hi Spark-Users, I encountered the problem of "insufficient memory". The error is logged in a file named "hs_err_pid86252.log" (attached at the end of this email). I launched the Spark job with "spark-submit --driver-memory 40g --master yarn --deploy-mode client". The Spark session was

Re: Why dataframe can be more efficient than dataset?

2017-04-09 Thread Shiyuan
> ds.filter("age < 20") res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint] On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers <ko...@tresata.com> wrote: > how would you use only relational transformations on dataset? > > On Sat, Apr 8, 2017 at 2:15

Why dataframe can be more efficient than dataset?

2017-04-08 Thread Shiyuan
Dataset even if we only use relational transformations on the Dataset? If so, can anyone explain why that is? Is there any benchmark comparing Dataset vs. DataFrame? Thank you! Shiyuan

best practice for parallelizing model training

2017-01-24 Thread Shiyuan
Hi spark users, I am looking for a way to parallelize #A and #B in the code below. Since dataframes in Spark are immutable, #A and #B are completely separate operations. My question is: 1) As of Spark 2.1, #B only starts when #A is completed. Is that right? 2) What's the best way to
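A hedged sketch of one way to overlap #A and #B, assuming they really are independent; the dataframes and the count() actions are stand-ins for the actual training steps, which the preview does not show. Actions submitted from separate driver threads let the Spark scheduler run the two jobs concurrently.

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df_a = spark.range(1000000)   # stand-in for the data behind #A
    df_b = spark.range(2000000)   # stand-in for the data behind #B

    def job_a():
        return df_a.count()       # stand-in for training step #A

    def job_b():
        return df_b.count()       # stand-in for training step #B

    # Two driver threads submit two independent Spark jobs; the scheduler can
    # run them concurrently instead of back to back.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa, fb = pool.submit(job_a), pool.submit(job_b)
        print(fa.result(), fb.result())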

Why StringIndexer uses double instead of int for indexing?

2017-01-21 Thread Shiyuan
Hi Spark, StringIndexer uses double instead of int for indexing: http://spark.apache.org/docs/latest/ml-features.html#stringindexer. What's the rationale for using double to index? Would it be more appropriate to use int to index (which is consistent with other places like Vector.sparse)? Shiyuan
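A small illustration of the behavior being asked about (it does not answer the "why"): the output column of StringIndexer is DoubleType even though the index values are whole numbers.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0, "a"), (1, "b"), (2, "a"), (3, "c")],
                               ["id", "category"])

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    indexed = indexer.fit(df).transform(df)
    indexed.printSchema()   # categoryIndex is double
    indexed.show()          # index values appear as 0.0, 1.0, 2.0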