Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-02 Thread Pralabh Kumar
I am performing a join operation. If I convert the reduce-side join to a map-side join (so no shuffle will happen), I assume this error shouldn't occur. Let me know if this understanding is correct. On Tue, May 1, 2018 at 9:37 PM, Ryan Blue wrote: > This is usually caused by
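A minimal sketch of a map-side (broadcast) join in Scala, assuming the smaller side comfortably fits in executor memory; the table paths and join key are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Hypothetical inputs; `small` must be small enough to ship to every executor.
val large = spark.read.parquet("/data/large_table")
val small = spark.read.parquet("/data/small_table")

// broadcast() replicates `small` to all executors, so the join is performed
// map-side and the large side is never shuffled.
val joined = large.join(broadcast(small), Seq("id"))
joined.write.parquet("/data/joined")

Because the large side is not shuffled, a join written this way should not by itself produce the oversized shuffle blocks that the "Too large frame" fetch failure typically points to.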

MappingException - org.apache.spark.mllib.classification.LogisticRegressionModel.load

2018-05-02 Thread Mina Aslani
Hi, I used pyspark to create a Logistic Regression model, train it on my training data, and evaluate it on my test data using the ML API. However, to use the model in my program, I saved the model (e.g. the Logistic Regression model), and when I tried to load it in pyspark using sameModel =
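This kind of MappingException can occur when a model saved with the spark.ml API is loaded through the spark.mllib class (or vice versa); that is an assumption about the case above, not something the thread confirms. A minimal Scala sketch of a matching spark.ml save/load pair, with an illustrative path (the pyspark classes mirror it):

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lr-save-load-sketch").getOrCreate()

// Tiny illustrative training set with "label" and "features" columns.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

val model = new LogisticRegression().fit(training)

// Save and load through the same package (spark.ml here); loading this model
// with org.apache.spark.mllib.classification.LogisticRegressionModel would fail.
model.write.overwrite().save("/tmp/lr-model")
val sameModel = LogisticRegressionModel.load("/tmp/lr-model")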

AccumulatorV2 vs AccumulableParam (V1)

2018-05-02 Thread Sergey Zhemzhitsky
Hello guys, I've started to migrate my Spark jobs which use Accumulators V1 to AccumulatorV2 and ran into the following issues: 1. LegacyAccumulatorWrapper now requires the result type of AccumulableParam to implement equals. Otherwise the AccumulableParam, automatically wrapped into
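For context, a minimal Scala sketch of a custom AccumulatorV2 (a set-of-strings accumulator, purely illustrative) showing the methods the V2 API requires:

import scala.collection.mutable
import org.apache.spark.util.AccumulatorV2

// Collects the distinct strings seen across tasks.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val set = mutable.Set.empty[String]

  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.set ++= set
    acc
  }
  override def reset(): Unit = set.clear()
  override def add(v: String): Unit = set += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = other match {
    case o: StringSetAccumulator => set ++= o.set
    case _ => throw new UnsupportedOperationException(s"Cannot merge with ${other.getClass}")
  }
  override def value: Set[String] = set.toSet
}

// Register it on the SparkContext before use, e.g.:
// val acc = new StringSetAccumulator
// sc.register(acc, "seenKeys")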

Re: Uncaught exception in thread heartbeat-receiver-event-loop-thread

2018-05-02 Thread ccherng
And reading through the comments in that issue https://issues.apache.org/jira/browse/SPARK-20977 it looks like it was just ignored but marked resolved.

ConcurrentModificationException

2018-05-02 Thread ccherng
I have encountered the below exception running Spark 2.1.0 on EMR. The exception is the same as reported in "Serialization of accumulators in heartbeats is not thread-safe" (https://issues.apache.org/jira/browse/SPARK-17463). Pull requests were made and merged, and that issue was marked as resolved

Re: Uncaught exception in thread heartbeat-receiver-event-loop-thread

2018-05-02 Thread ccherng
I have also encountered the NullPointerException in CollectionAccumulator. It looks like there was an issue filed for this: https://issues.apache.org/jira/browse/SPARK-20977.

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Marco Mistroni
Hi, sorted it. I just replaced s3 with s3a. I think I recall similar issues in the past with AWS libraries. Thanks anyway for getting back. Kr On Wed, May 2, 2018, 4:57 PM Paul Tremblay wrote: > I would like to see the full error. However, S3 can give misleading > messages
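For reference, a minimal Scala sketch of the scheme change, assuming the hadoop-aws connector and AWS credentials are already configured; the bucket and path are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-write-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Use the s3a:// scheme (the actively maintained Hadoop S3 connector)
// rather than the older s3:// or s3n:// schemes.
df.write.mode("overwrite").parquet("s3a://my-bucket/spark-output/")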

Running apps over a VPN

2018-05-02 Thread Christopher Piggott
My setup is that I have a Spark master (using the Spark scheduler) and 32 workers registered with it, but they are on a private network. I can connect to that private network via OpenVPN. I would like to be able to run Spark applications from a local (on my desktop) IntelliJ but have them use the
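A minimal sketch of what pointing a locally run driver at a remote standalone master can look like; the addresses and the spark.driver.host setting are assumptions for illustration, not taken from the thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("from-intellij-sketch")
  // Master URL as reachable over the VPN (placeholder address).
  .master("spark://10.8.0.100:7077")
  // The workers must be able to connect back to the driver, so advertise
  // the desktop's VPN address rather than a LAN address they cannot route to.
  .config("spark.driver.host", "10.8.0.6")
  .getOrCreate()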

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Paul Tremblay
I would like to see the full error. However, S3 can give misleading messages if you don't have the correct permissions. On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote: > HI all > i am using the following code for persisting data into S3 (aws keys are > already stored

[no subject]

2018-05-02 Thread Filippo Balicchia

how to trace sparkDriver context creation for pyspark

2018-05-02 Thread Mihai Iacob
I have a Python Jupyter notebook set up to create a Spark context by default, and sometimes these fail with the following error:
18/04/30 18:03:27 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
18/04/30 18:03:27 ERROR SparkContext: Error initializing
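With bind failures like this, one thing that is often worth trying is pinning the driver's addresses and port explicitly; a sketch with placeholder values, offered as an assumption rather than a confirmed fix for this case:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notebook-session-sketch")
  .config("spark.driver.bindAddress", "0.0.0.0")          // local interface to bind
  .config("spark.driver.host", "driver-host.example.com") // address executors connect back to
  .config("spark.driver.port", "40000")                   // fixed port instead of a random one
  .getOrCreate()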

Re: ML Linear and Logistic Regression - Poor Performance

2018-05-02 Thread Irving Duran
You may want to think about reducing the number of iterations. Right now you have it set at 500. Thank you, Irving Duran On Fri, Apr 27, 2018 at 7:15 PM Thodoris Zois wrote: > I am on CentOS 7 and I use Spark 2.3.0. Below I have posted my code. > Logistic regression took 85
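A minimal Scala sketch of capping the iterations; the values are illustrative:

import org.apache.spark.ml.classification.LogisticRegression

// Fewer iterations trade some convergence for training time; the tolerance
// lets the solver stop early once the objective stabilizes.
val lr = new LogisticRegression()
  .setMaxIter(100)   // instead of 500
  .setTol(1e-6)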

Re: spark.executor.extraJavaOptions inside application code

2018-05-02 Thread Vadim Semenov
You need to pass config before creating a session:
val conf = new SparkConf()
// All three methods below are equivalent
conf.set("spark.executor.extraJavaOptions", "-Dbasicauth=myuser:mypassword")
conf.set("spark.executorEnv.basicauth", "myuser:mypassword")
conf.setExecutorEnv("basicauth",
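A slightly fuller sketch of the same idea, building the session from the conf; the option value is a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  // Must be set before the session/context is created: executor JVMs are
  // launched with these options and cannot pick them up afterwards.
  .set("spark.executor.extraJavaOptions", "-Dbasicauth=myuser:mypassword")

val spark = SparkSession.builder().config(conf).getOrCreate()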

Re: smarter way to "forget" DataFrame definition and stick to its values

2018-05-02 Thread Lalwani, Jayesh
There is a trade-off involved here. If you have a Spark application with a complicated logical graph, you can either cache data at certain points in the DAG or not cache it. The side effect of caching data is higher memory usage; the side effect of not caching data is higher CPU usage
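A small Scala sketch of that trade-off: caching an intermediate result that feeds two downstream aggregations; the input path and column names are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-tradeoff-sketch").getOrCreate()
import spark.implicits._

// Caching pays off when the same intermediate result is reused downstream,
// at the cost of holding it in memory.
val base = spark.read.parquet("/data/events").filter($"status" === "ok")
base.cache()

val byUser = base.groupBy("user").count()
val byDay  = base.groupBy("day").count()

byUser.show()    // first action materializes the cache of `base`
byDay.show()     // second action reads the cached data instead of recomputing
base.unpersist()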

Dataset Caching and Unpersisting

2018-05-02 Thread Daniele Foroni
Hi all, I am having trouble with caching and unpersisting a dataset. I have a loop that filters my dataset at each iteration. I realized that caching every x steps (e.g., 50 steps) gives good performance. However, after a certain number of caching operations, it seems that the memory used for
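A minimal Scala sketch of one way to structure this, materializing the new cache before dropping the previous one; the loop body and the every-50-steps threshold are illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("iterative-cache-sketch").getOrCreate()
import spark.implicits._

var df: DataFrame = spark.range(0, 1000000).toDF("id")
var previous: DataFrame = null

for (i <- 1 to 200) {
  df = df.filter($"id" % 2 === 0) // stand-in for the real per-iteration filter
  if (i % 50 == 0) {
    df = df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()                                  // materialize the new cache first
    if (previous != null) previous.unpersist()  // then release the old blocks
    previous = df
  }
}

Unpersisting only after the new cache is materialized avoids recomputing from the full lineage; if the lineage itself keeps growing across iterations, df.localCheckpoint() is another option for truncating it.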

what is the query language used for graphX?

2018-05-02 Thread kant kodali
Hi all, what is the query language used for GraphX? Are there any plans to introduce Gremlin, or has that idea been dropped in favor of Spark SQL? Thanks!

spark.executor.extraJavaOptions inside application code

2018-05-02 Thread Agostino Calamita
Hi all, I wrote an application that needs an environment variable. I can set this variable with --conf 'spark.executor.extraJavaOptions=-Dbasicauth=myuser:mypwd' in spark-submit and it works well in standalone cluster mode. But I want to set it inside the application code, because the variable

[Spark scheduling] Spark schedules single task although rdd has 48 partitions?

2018-05-02 Thread Paul Borgmans
(Please note this question was previously posted to https://stackoverflow.com/questions/49943655/spark-schedules-single-task-although-rdd-has-48-partitions) We are running Spark 2.3 / Python 3.5.2. For a job we run the following code (please note that the input txt files are just a simplified

Re: [Spark Streaming]: Does DStream workload run over Spark SQL engine?

2018-05-02 Thread Saisai Shao
No, the underlying abstraction of DStream is the RDD, so it will not leverage any Spark SQL-related features. I think you should use Structured Streaming instead, which is based on Spark SQL. Khaled Zaouk wrote on Wednesday, May 2, 2018 at 4:51 PM: > Hi, > > I have a question regarding the execution engine of
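A minimal Structured Streaming sketch in Scala, which does run on the Spark SQL engine; the socket source and console sink are only illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()

// Illustrative source; in practice this would typically be Kafka, files, etc.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Streaming DataFrames go through the Spark SQL planner, so spark.sql.*
// streaming options apply here, unlike with the DStream API.
val counts = lines.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()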

[Spark Streaming]: Does DStream workload run over Spark SQL engine?

2018-05-02 Thread Khaled Zaouk
Hi, I have a question regarding the execution engine of Spark Streaming (the DStream API): do Spark Streaming jobs run over the Spark SQL engine? For example, if I change a configuration parameter related to Spark SQL (like spark.sql.streaming.minBatchesToRetain or