Hi Spark Users,
Is there a way to get an exhaustive list of configuration options and their
default values? The documentation page
https://spark.apache.org/docs/latest/configuration.html is not exhaustive,
and the Spark UI's Environment tab is not exhaustive either. Thank you!
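(A partial answer that can be sketched in pyspark, for the SQL settings at
least: "SET -v" returns every spark.sql.* key with its default and
documentation, while getConf().getAll() lists only the options that were
explicitly set.)

# Lists every SQL configuration key with its default value and description.
spark.sql("SET -v").show(n=1000, truncate=False)

# Lists only the core options that were explicitly set for this context.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)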
Hi Spark-users,
I am using pyspark on a YARN cluster. One of my Spark applications failed to
launch: only the driver container had started before it failed in the
ACCEPTED state. The error message is very short and I cannot make sense of
it. The error message is attached below. Any possible causes?
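(The one-line message from the client rarely shows the root cause when a
YARN application dies in the ACCEPTED state; the full diagnostics live in
the aggregated YARN logs. Assuming the application id that spark-submit
printed, something like

yarn logs -applicationId <application id>

usually reveals why the container exited.)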
Hi Spark-users,
I want to submit as many Spark applications as the resources permit. I am
using cluster mode on a YARN cluster. YARN can queue and launch these
applications without problems. The problem lies in spark-submit itself:
spark-submit starts a local JVM, which could fail due to insufficient memory
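(One knob worth noting here, assuming the jobs are fire-and-forget: in
cluster mode,

spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=false my_job.py

makes the spark-submit JVM exit as soon as YARN accepts the application, so
far fewer client JVMs stay alive at once; my_job.py is a placeholder for the
actual application.)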
Here it is :
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2991198123660769/823198936734135/866038034322120/latest.html
On Wed, Apr 11, 2018 at 10:55 AM, Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:
> Hi Shiyuan,
>
On Tue, Apr 10, 2018 at 9:03 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:
> Hi Shiyuan,
>
> I do not know whether I am right, but I would prefer to avoid expressions
> in Spark such as:
>
> df = <>
>
>
> Regards,
> Gourav Sengupta
>
> On Tue, Apr 1
ct("ID","score","LABEL","kk")
df_t =
df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL")>1)
df = df.join(df_t.select("ID"),["ID"])
df_sw = df.groupby(["ID","
> Could you kindly try using the statement below and
> go through your use case once again (I am yet to go through all the lines):
>
>
>
> from pyspark.sql import Row
>
> df = spark.createDataFrame([Row(score = 1.0,ID="abc",LABEL=True,k=2),
> Row(score = 1.0, ID="
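(The quoted statement is cut off above; a runnable reconstruction might look
like the following, where the second row's values are invented for
illustration.)

from pyspark.sql import Row

# The second row is a placeholder; the original message is truncated here.
df = spark.createDataFrame([Row(score=1.0, ID="abc", LABEL=True, k=2),
                            Row(score=1.0, ID="xyz", LABEL=False, k=3)])
df.show()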
Hi Spark Users,
The following code snippet raises an "attribute missing" error even though
the attribute exists. The bug is triggered by a particular sequence of
"select", "groupby" and "join". Note that if I take away the "select" in
#line B, the code runs without error. However, the
Hi,
I got an "Uncaught exception in
thread heartbeat-receiver-event-loop-thread" error. Does this error indicate
that some node is too overloaded to be responsive? Thanks!
ERROR Utils: Uncaught exception in thread
heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
.col("nL")>1)
df = df.join(df_t.select("ID"),["ID"])
df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID","kk"])
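(For readers trying to reproduce the issue in this thread, here is a
self-contained sketch of the same select/groupby/join sequence; the toy rows
and SparkSession setup are assumptions, the transformations are the ones
quoted above.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Toy data with the column names used in the thread.
df = spark.createDataFrame(
    [("abc", 1.0, True, 2), ("abc", 0.5, False, 2), ("def", 0.3, True, 1)],
    ["ID", "score", "LABEL", "kk"])

df = df.select("ID", "score", "LABEL", "kk")
# Keep only IDs that carry more than one distinct LABEL.
df_t = (df.groupby("ID")
          .agg(F.countDistinct("LABEL").alias("nL"))
          .filter(F.col("nL") > 1))
df = df.join(df_t.select("ID"), ["ID"])
# Per-(ID, kk) row counts, joined back onto the main frame.
df_sw = df.groupby(["ID", "kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID", "kk"])
df.show()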
On Tue, Mar 20, 2018 at 9:58 PM, Shiyuan <gshy
Hi Spark-users:
I have a dataframe "df_t" which was generated from other dataframes by
several transformations. And then I did something very simple, just
counting the rows, that is the following code:
(A)
df_t_1 = df_t.groupby(["Id","key"]).count().withColumnRenamed("count", "cnt1")
df_t_2 =
Hi Spark-Users,
I encountered an "insufficient memory" problem. The error is logged
in a file named "hs_err_pid86252.log" (attached at the end of this
email).
I launched the Spark job with "spark-submit --driver-memory 40g --master
yarn --deploy-mode client". The spark session was
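(A note on this failure mode: in client mode, the 40g requested with
--driver-memory is allocated by the JVM on the submitting machine itself,
and hs_err_pid*.log is the JVM fatal-error log written when the OS cannot
satisfy an allocation. Running the driver inside the cluster instead, e.g.

spark-submit --driver-memory 40g --master yarn --deploy-mode cluster <app>

takes the 40g from cluster resources rather than from the local machine.)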
> ds.filter("age < 20")
res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
> how would you use only relational transformations on dataset?
>
> On Sat, Apr 8, 2017 at 2:15
Is DataFrame more efficient than Dataset even
if we only use relational transformations on the Dataset? If so, can anyone
give some explanation why it is so? Any benchmark comparing Dataset vs.
DataFrame? Thank you!
Shiyuan
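(One way to probe this from pyspark, with a toy DataFrame as an assumption:
relational expressions are visible to the Catalyst optimizer, so the two
filter styles below compile to the same plan; typed lambda operations on a
Scala/Java Dataset are opaque to the optimizer and force deserialization,
which is where the overhead usually comes from.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ann", 18), ("Bob", 25)], ["name", "age"])

# Both forms produce the same optimized plan.
df.filter("age < 20").explain()
df.filter(F.col("age") < 20).explain()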
Hi spark users,
I am looking for a way to parallelize #A and #B in the code below. Since
dataframes in Spark are immutable, #A and #B are completely independent
operations.
My questions are:
1) As of Spark 2.1, #B only starts after #A completes. Is that right?
2) What's the best way to
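(A minimal sketch of one common answer, assuming dfA and dfB stand in for
the DataFrames behind #A and #B: actions block the calling thread, so in
Spark 2.x two independent jobs overlap only if the driver submits them from
separate threads.)

from concurrent.futures import ThreadPoolExecutor

# dfA and dfB are placeholders for the two independent DataFrames.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_a = pool.submit(dfA.count)  # job A
    fut_b = pool.submit(dfB.count)  # job B
    print(fut_a.result(), fut_b.result())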
Hi Spark,
StringIndexer uses double instead of int for indexing:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer. What's
the rationale for using double as the index type? Would it be more
appropriate to use int (which would be consistent with other places like
Vectors.sparse)?
Shiyuan
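(A small illustration of the behavior in question, with a toy column as an
assumption: StringIndexer emits DoubleType indices, which can be cast down
afterwards if integer labels are needed.)

from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])
indexed = (StringIndexer(inputCol="category", outputCol="categoryIndex")
           .fit(df).transform(df))
# categoryIndex is DoubleType; cast it if an int index is required.
indexed.withColumn("categoryIndexInt",
                   F.col("categoryIndex").cast("int")).show()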