I am performing a join operation. If I convert the reduce-side join to a
map-side join (so that no shuffle happens), I assume this error shouldn't
occur. Let me know if this understanding is correct.
On Tue, May 1, 2018 at 9:37 PM, Ryan Blue wrote:
> This is usually caused by
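For what it's worth, a minimal sketch of that conversion in the DataFrame API, under the assumption that "map-side join" means a broadcast join: the hint ships the small side to every executor so the join needs no shuffle of the large table. Paths and the join column are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

val large = spark.read.parquet("hdfs:///data/events")      // hypothetical input
val small = spark.read.parquet("hdfs:///data/dimensions")  // hypothetical input

// broadcast() copies the small table to every executor, so the join
// runs map-side and the large table is never shuffled.
val joined = large.join(broadcast(small), Seq("id"))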
Hi,
I used pyspark to create a Logistic Regression model, trained it on my
training data, and evaluated it on my test data using the ML API. However,
to use the model in my program, I saved the model (e.g. the Logistic
Regression model), and when I tried to load it in pyspark using
sameModel =
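A hedged sketch of that save/load round trip, shown with the Scala ML API (pyspark's ml API mirrors it); the path, function name, and data are assumptions, not from the original post.

import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

def saveAndReload(trainingData: DataFrame): LogisticRegressionModel = {
  val modelPath = "hdfs:///models/lr-model"  // hypothetical location
  // trainingData is assumed to have "label" and "features" columns
  val model = new LogisticRegression().fit(trainingData)
  model.save(modelPath)
  // In another program, reload the fitted model the same way:
  LogisticRegressionModel.load(modelPath)
}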
Hello guys,
I've started migrating my Spark jobs from Accumulators V1 to AccumulatorV2
and ran into the following issues:
1. LegacyAccumulatorWrapper now requires the result type of the
AccumulableParam to implement equals. Otherwise the
AccumulableParam, automatically wrapped into
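For context, a minimal sketch of what the V2 side of such a migration looks like: a custom AccumulatorV2 in place of an AccumulableParam-based accumulator. The element type and names are illustrative assumptions.

import org.apache.spark.util.AccumulatorV2

class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var set: Set[String] = Set.empty

  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.set = set
    acc
  }
  override def reset(): Unit = set = Set.empty
  override def add(v: String): Unit = set += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit =
    set ++= other.value
  override def value: Set[String] = set
}

// Usage: val acc = new StringSetAccumulator
//        spark.sparkContext.register(acc, "strings")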
And reading through the comments in that issue,
https://issues.apache.org/jira/browse/SPARK-20977, it looks like it was just
ignored but nevertheless marked resolved.
I have encountered the below exception running Spark 2.1.0 on EMR. The
exception is the same as the one reported in "Serialization of accumulators
in heartbeats is not thread-safe":
https://issues.apache.org/jira/browse/SPARK-17463
Pull requests were made and merged, and that issue was marked as resolved.
I have also encountered the NullPointerException in CollectionAccumulator. It
looks like there was an issue filed for this:
https://issues.apache.org/jira/browse/SPARK-20977.
Hi,
Sorted! I just replaced s3 with s3a. I think I recall similar issues in
the past with the AWS libraries.
Thanks anyway for getting back.
Kind regards
On Wed, May 2, 2018, 4:57 PM Paul Tremblay wrote:
> I would like to see the full error. However, S3 can give misleading
> messages
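For reference, a minimal sketch of the fix described above: the same write with the s3a:// scheme in place of the legacy s3:// one. The function name, bucket, and prefix are hypothetical.

import org.apache.spark.sql.DataFrame

def persistToS3(df: DataFrame): Unit =
  df.write
    .mode("overwrite")
    .parquet("s3a://my-bucket/output/")  // was: "s3://my-bucket/output/"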
My setup is that I have a Spark master (using the Spark scheduler) and 32
workers registered with it, but they are on a private network. I can
connect to that private network via OpenVPN.
I would like to be able to run Spark applications from a local (on my
desktop) IntelliJ but have them use the
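A hedged sketch of what such a setup usually requires once the VPN is up: point the locally-run driver at the remote master and advertise an address the workers can reach. The addresses and ports below are assumptions, not from the original post.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://10.8.0.1:7077")           // master's address on the private network
  .config("spark.driver.host", "10.8.0.100") // this desktop's VPN address, reachable by workers
  .config("spark.driver.bindAddress", "0.0.0.0")
  .appName("remote-from-intellij")
  .getOrCreate()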
I would like to see the full error. However, S3 can give misleading
messages if you don't have the correct permissions.
On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote:
> Hi all,
> I am using the following code for persisting data into S3 (AWS keys are
> already stored
I have a Python Jupyter notebook set up to create a Spark context by default, and sometimes this fails with the following error:
18/04/30 18:03:27 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
18/04/30 18:03:27 ERROR SparkContext: Error initializing
You may want to think about reducing the number of iterations. Right now you
have it set to 500.
Thank You,
Irving Duran
On Fri, Apr 27, 2018 at 7:15 PM Thodoris Zois wrote:
> I am on CentOS 7 and I use Spark 2.3.0. Below I have posted my code.
> Logistic regression took 85
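A minimal sketch of that suggestion in the ML API; the value 100 is an arbitrary illustration, and tightening the convergence tolerance is an alternative to relying on a high fixed cap.

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)  // the original job used 500
  .setTol(1e-6)     // training also stops early once this tolerance is met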
You need to pass the config before creating a session:
import org.apache.spark.SparkConf

val conf = new SparkConf()
// The first call sets a JVM system property on the executors; the two
// calls after it are equivalent ways to set an executor environment variable.
conf.set("spark.executor.extraJavaOptions", "-Dbasicauth=myuser:mypassword")
conf.set("spark.executorEnv.basicauth", "myuser:mypassword")
conf.setExecutorEnv("basicauth", "myuser:mypassword")
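A short sketch of wiring that conf into the session; executor-side settings like these only take effect if they are in place before the executors launch.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config(conf)  // the SparkConf built above
  .getOrCreate()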
There is a trade-off involved here. If you have a Spark application with a
complicated logical graph, you can either cache data at certain points in the
DAG or not cache it at all. The side effect of caching data is higher memory
usage; the side effect of not caching data is higher CPU usage, since
intermediate results get recomputed.
Hi all,
I am having trouble with caching and unpersisting a dataset.
I have a loop that filters my dataset at each iteration.
I realized that caching every x steps (e.g., every 50 steps) gives good
performance. However, after a certain number of caching operations, it seems
that the memory used for
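A hedged sketch of the pattern being described: re-cache every N iterations and unpersist the previous cached copy so its blocks can be freed. The function name, filter condition, and column name are illustrative assumptions.

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

def iterativeFilter(initial: Dataset[Row], totalSteps: Int, n: Int = 50): Dataset[Row] = {
  var ds = initial
  var previous: Option[Dataset[Row]] = None
  for (step <- 1 to totalSteps) {
    ds = ds.filter(col("score") > step)  // hypothetical per-iteration filter
    if (step % n == 0) {
      val cached = ds.persist(StorageLevel.MEMORY_AND_DISK)
      cached.count()                     // materialize before dropping the old cache
      previous.foreach(_.unpersist())    // release the previous copy's blocks
      previous = Some(cached)
      ds = cached
    }
  }
  ds
}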
Hi All,
What is the query language used for GraphX? Are there any plans to
introduce Gremlin, or has that idea been dropped in favor of Spark SQL?
Thanks!
Hi all,
I wrote an application that needs an environment variable. I can set this
variable with
--conf 'spark.executor.extraJavaOptions=-Dbasicauth=myuser:mypwd'
in spark-submit, and it works well in standalone cluster mode.
But I want to set it inside the application code, because the variable
(Please note this question was previously posted to
https://stackoverflow.com/questions/49943655/spark-schedules-single-task-although-rdd-has-48-partitions)
We are running Spark 2.3 / Python 3.5.2. For a job we run the following code
(please note that the input txt files are just a simplified
No, the underlying abstraction of DStreams is the RDD, so they do not
leverage any Spark SQL features. I think you should use Structured Streaming
instead, which is based on Spark SQL.
On Wed, May 2, 2018 at 4:51 PM, Khaled Zaouk wrote:
> Hi,
>
> I have a question regarding the execution engine of
Hi,
I have a question regarding the execution engine of Spark Streaming
(DStream API): Do Spark Streaming jobs run on the Spark SQL engine?
For example, if I change a configuration parameter related to Spark SQL
(like spark.sql.streaming.minBatchesToRetain or
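To make the suggested alternative concrete, a minimal Structured Streaming sketch (the classic socket word count), which does run on the Spark SQL engine; the source, host, and port are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// Read lines from a TCP socket as an unbounded streaming DataFrame.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split into words and count them; this is planned by the SQL engine.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()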