Re: Spark Conf
Hi,

"In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file."
https://spark.apache.org/docs/latest/submitting-applications.html

Perhaps this will help, Vinyas: look at args.sparkProperties in
https://github.com/apache/spark/blob/v2.3.0/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

On Thu, Mar 15, 2018 at 1:53 AM, Vinyas Shetty wrote:
>
> Hi,
>
> I am trying to understand the Spark internals, so I was looking at the Spark
> code flow. In a scenario where I do a spark-submit in yarn cluster mode
> with --executor-memory 8g via the command line, how does Spark know about
> this executor memory value? In SparkContext I see:
>
>   _executorMemory = _conf.getOption("spark.executor.memory")
>     .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
>     .orElse(Option(System.getenv("SPARK_MEM")))
>
> Now SparkConf loads its defaults from Java system properties, but I
> did not find where the command-line value is added to sys.props in yarn
> cluster mode, i.e. I did not see a call to Utils.loadDefaultSparkProperties.
> How is this command-line value reaching the SparkConf that is part of
> SparkContext?
>
> Regards,
> Vinyas
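The lookup chain quoted from SparkContext can be sketched in plain Python (a simulation for illustration only, not Spark API; `resolve_executor_memory` is a hypothetical name). The key point from the reply: spark-submit parses --executor-memory into `spark.executor.memory` inside args.sparkProperties and seeds the application's SparkConf with it, so by the time SparkContext runs, the first branch of the chain already wins:

```python
import os

def resolve_executor_memory(conf, default="1g"):
    """Mirror of SparkContext's lookup order for executor memory:
    explicit SparkConf entry, then SPARK_EXECUTOR_MEMORY, then the
    legacy SPARK_MEM env var, then a default."""
    return (conf.get("spark.executor.memory")
            or os.environ.get("SPARK_EXECUTOR_MEMORY")
            or os.environ.get("SPARK_MEM")
            or default)

# spark-submit has already written --executor-memory into the conf,
# so the SparkConf entry takes precedence over any env var:
print(resolve_executor_memory({"spark.executor.memory": "8g"}))  # prints: 8g
```

This is why no separate call to Utils.loadDefaultSparkProperties is needed in yarn cluster mode: the property arrives through the SparkConf itself, not through sys.props.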
Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0
Can you provide a code sample please?

On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote:
> Hi all -
>
> since upgrading to 2.2.0, we've noticed a significant increase in
> read.parquet(...) times. The parquet files are being read from S3. Upon entry
> at the interactive terminal (pyspark in this case), the terminal will sit
> "idle" for several minutes (as many as 10) before returning:
>
> "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
> partition metadata from memory due to size constraints
> (spark.sql.hive.filesourcePartitionFileCacheSize = 20 bytes).
> This may impact query planning performance."
>
> In the Spark UI, there are no jobs being run during this idle period.
> Subsequently, a short 1-task job lasting approximately 10 seconds runs, and
> then another idle period of roughly 2-3 minutes follows before
> returning to the terminal/CLI.
>
> Can someone explain what is happening here in the background? Is there a
> misconfiguration we should be looking for? We are using the Hive metastore on
> the EMR cluster.
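The warning quoted above is governed by a single setting. The cache size reported in the log is far below Spark's documented default of 262144000 bytes (250 MB), which suggests it was overridden somewhere; restoring the default in spark-defaults.conf may help (a sketch, not a confirmed fix for this report):

```
spark.sql.hive.filesourcePartitionFileCacheSize  262144000
```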
Re: Looking at EMR Logs
If you modify spark.eventLog.dir to point to an S3 path, you may encounter the following exception in the Spark history server log at /var/log/spark/spark-history-server.out:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)

To move past this issue, we can do the following (this is for EMR release emr-5.4.0):

cd /usr/lib/spark/jars
sudo ln -s /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.15.0.jar emrfs.jar

Now the Spark history server will start up correctly and you can review the Spark event logs on S3.

On Fri, Mar 31, 2017 at 4:46 PM, Vadim Semenov wrote:
> You can provide your own log directory, where the Spark log will be saved, and
> which you can replay afterwards.
>
> Set this in your job: `spark.eventLog.dir=s3://bucket/some/directory` and
> run it.
> Note! The path `s3://bucket/some/directory` must exist before you run your
> job; it will not be created automatically.
>
> The Spark HistoryServer on EMR won't show you anything because it's
> looking for logs in `hdfs:///var/log/spark/apps` by default.
>
> After that you can either copy the log files from S3 to the HDFS path
> above, or you can copy them locally to `/tmp/spark-events` (the default
> directory for Spark logs) and run the history server like:
> ```
> cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
> sbin/start-history-server.sh
> ```
> and then open http://localhost:18080
>
> On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay wrote:
>
>> I am looking for tips on evaluating my Spark job after it has run.
>>
>> I know that right now I can look at the history of jobs through the web
>> UI. I also know how to look at the current resources being used by a
>> similar web UI.
>>
>> However, I would like to look at the logs after the job is finished to
>> evaluate such things as how many tasks were completed, how many executors
>> were used, etc. I currently save my logs to S3.
>>
>> Thanks!
>>
>> Henry
>>
>> --
>> Paul Henry Tremblay
>> Robert Half Technology
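Putting the advice from this thread together, a minimal spark-defaults.conf sketch for S3-backed event logs might look like the following (the bucket and path are placeholders; spark.history.fs.logDirectory is the history server's counterpart to spark.eventLog.dir):

```
spark.eventLog.enabled           true
spark.eventLog.dir               s3://bucket/some/directory
spark.history.fs.logDirectory    s3://bucket/some/directory
```

With both properties pointing at the same pre-created S3 prefix, the history server reads the same logs the jobs write, so no manual copying to HDFS or /tmp/spark-events is needed.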
Re: spark 2.02 error when writing to s3
Can you test by enabling EMRFS consistent view and using the s3:// URI?
http://docs.aws.amazon.com/emr/latest/ManagementGuide/enable-consistent-view.html

Original message
From: Steve Loughran
Date: 20/01/2017 21:17 (GMT+02:00)
To: "VND Tremblay, Paul"
Cc: Takeshi Yamamuro, user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call may still show it. You may be seeing that effect; even after the DELETE has got rid of the files, a listing still sees something there. I suspect the time it takes for the listing to "go away" will depend on the total number of entries underneath, as there are more deletion markers ("tombstones") to propagate around S3.

Try deleting the path and then waiting a short period.

On 20 Jan 2017, at 18:54, VND Tremblay, Paul wrote:

I am using an EMR cluster, and the latest version offered is 2.02. The link below indicates that that user had the same problem, which seems unresolved.

Thanks,

Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP

From: Takeshi Yamamuro [mailto:linguin@gmail.com]
Sent: Thursday, January 19, 2017 9:27 PM
To: VND Tremblay, Paul
Cc: user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

Hi,

Do you get the same exception also in v2.1.0? Anyway, I saw another guy reporting the same error, I think.
https://www.mail-archive.com/user@spark.apache.org/msg60882.html

// maropu

On Fri, Jan 20, 2017 at 5:15 AM, VND Tremblay, Paul wrote:

I have come across a problem when writing CSV files to S3 in Spark 2.02. The problem does not exist in Spark 1.6.

19:09:20 Caused by: java.io.IOException: File already exists: s3://stx-apollo-pr-datascience-internal/revenue_model/part-r-00025-c48a0d52-9600-4495-913c-64ae6bf888bd.csv

My code is this:

new_rdd\
    .map(add_date_diff)\
    .map(sid_offer_days)\
    .groupByKey()\
    .map(custom_sort)\
    .map(before_rev_date)\
    .map(lambda x, num_weeks=args.num_weeks: create_columns(x, num_weeks))\
    .toDF()\
    .write.csv(
        sep="|",
        header=True,
        nullValue='',
        quote=None,
        path=path
    )

In order to get the path (the last argument), I call this function:

def _get_s3_write(test):
    if s3_utility.s3_data_already_exists(_get_write_bucket_name(), _get_s3_write_dir(test)):
        s3_utility.remove_s3_dir(_get_write_bucket_name(), _get_s3_write_dir(test))
    return make_s3_path(_get_write_bucket_name(), _get_s3_write_dir(test))

In other words, I am removing the directory if it exists before I write.

Notes:
* If I use a small set of data, then I don't get the error.
* If I use Spark 1.6, I don't get the error.
* If I read in a simple dataframe and then write to S3, I still get the error (without doing any transformations).
* If I do the previous step with a smaller set of data, I don't get the error.
* I am using pyspark, with python 2.7.
* The thread at https://forums.aws.amazon.com/thread.jspa?threadID=152470 indicates the problem is caused by a sync problem: with large datasets, Spark tries to write multiple times and causes the error. The suggestion is to turn off speculation, but I believe speculation is turned off by default in pyspark.

Thanks!

Paul

--
---
Takeshi Yamamuro
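Steve's "delete the path, then wait a short period" advice can be sketched as a small poller. This is illustrative only: `wait_until_absent` is a hypothetical helper, and the `exists` callable stands in for a real listing check (e.g. one built on an S3 client), here simulated so the sketch is self-contained:

```python
import time

def wait_until_absent(exists, timeout_s=60.0, interval_s=2.0):
    """Poll until an (eventually consistent) listing no longer shows
    the deleted prefix, or raise after timeout_s seconds.
    `exists` is a zero-arg callable returning True while the listing
    still shows entries under the prefix."""
    deadline = time.monotonic() + timeout_s
    while exists():
        if time.monotonic() >= deadline:
            raise TimeoutError("prefix still visible after delete")
        time.sleep(interval_s)

# Simulated eventually consistent listing: still visible for two
# polls after the DELETE, then the tombstones have propagated.
polls = iter([True, True, False])
wait_until_absent(lambda: next(polls), interval_s=0.01)
print("prefix no longer listed; safe to write")
```

Running a check like this between the remove_s3_dir call and the write would make the "File already exists" race less likely, though EMRFS consistent view (as suggested above) addresses the same problem at the filesystem layer.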
Re: Running Spark on EMR
Hello,

Can you drop the URL spark://master:7077? That URL is used when running Spark in standalone mode, which is not how EMR runs Spark.

Regards

Original message
From: Marco Mistroni
Date: 15/01/2017 16:34 (GMT+02:00)
To: User
Subject: Running Spark on EMR

hi all,
could anyone assist here? I am trying to run Spark 2.0.0 on an EMR cluster, but I am having issues connecting to the master node. Below is a snippet of what I am doing:

sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as an input parameter, so that I can run the script locally on my local Spark instance as well as submit it to any cluster I want.

Now I have:
1 - set up a cluster on EMR
2 - connected to the master node
3 - launched the command: spark-submit myscripts.py spark://master:7077

But that results in a connection refused exception. I then tried to remove the .master call above and launch the script with the following command:

spark-submit --master spark://master:7077 myscript.py

but I am still getting a connection refused exception. I am using Spark 2.0.0. Could anyone advise on how I should build the Spark session and how I can submit a python script to the cluster?

kr
marco
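On EMR, Spark runs on YARN rather than as a standalone cluster, so a submission from the master node would look something like the following sketch (script name is a placeholder; deploy mode depends on your setup):

```
spark-submit --master yarn --deploy-mode client myscript.py
```

With --master yarn on the command line, the script can omit the .master(...) call entirely, which also keeps it portable between local runs and the cluster.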
Re: [Spark Core] - Spark dynamoDB integration
Hello,

Good examples of how to interface with DynamoDB from Spark here:
https://aws.amazon.com/blogs/big-data/using-spark-sql-for-etl/
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/

Thanks

On Mon, Dec 12, 2016 at 7:56 PM, Marco Mistroni wrote:
> Hi
> If it can help:
> 1. Check the Java docs for when that method was introduced.
> 2. Are you building a fat jar? Check which libraries have been included; some
> other dependency might have forced an old copy to be included.
> 3. If you take the code outside Spark, does it work successfully?
> 4. Send a short sample.
> Hth
>
> On 12 Dec 2016 11:03 am, "Pratyaksh Sharma" wrote:
>
> Hey, I am using Apache Spark for one streaming application. I am trying to
> store the processed data into DynamoDB using the Java SDK. Getting the
> following exception:
>
> 16/12/08 23:23:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0:
> java.lang.NoSuchMethodError: com.amazonaws.SDKGlobalConfiguration.isInRegionOptimizedModeEnabled()Z
>     at com.amazonaws.ClientConfigurationFactory.getConfig(ClientConfigurationFactory.java:35)
>     at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.<init>(AmazonDynamoDBClient.java:374)
>
> Spark version - 1.6.1
> Scala version - 2.10.5
> aws sdk version - 1.11.33
>
> Has anyone faced this issue? Any help will be highly appreciated.
>
> --
> Regards
>
> Pratyaksh Sharma
> 12105EN013
> Department of Electronics Engineering
> IIT Varanasi
> Contact No +91-8127030223
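Following point 2 of the reply, a NoSuchMethodError on an AWS SDK class usually means an older aws-java-sdk ended up on the classpath. One way to check (a sketch; the jar path is a placeholder, and this assumes a Maven build):

```
# Which aws-java-sdk versions did the build resolve, and via which paths?
mvn dependency:tree -Dincludes=com.amazonaws

# Did the assembly actually ship the class whose method is missing?
unzip -l target/myapp-assembly.jar | grep SDKGlobalConfiguration
```

If an older SDK appears in the tree (Spark 1.6 on EMR pulls in its own AWS SDK), the usual remedies are pinning the version via dependencyManagement or shading the SDK into the fat jar.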
Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0
Hi,

Can you set the following parameters in your mapred-site.xml file please:

<property>
  <name>mapred.output.direct.EmrFileSystem</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.direct.NativeS3FileSystem</name>
  <value>true</value>
</property>

You can also configure this at cluster launch time with the following classification via the EMR console:

classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]

Thank you

On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov wrote:
> I checked the previous EMR config (emr-3.8); mapred-site.xml has the
> following setting:
>
> <property>
>   <name>mapred.output.committer.class</name>
>   <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
> </property>
>
> On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov wrote:
>
>> Should I use DirectOutputCommitter?
>> spark.hadoop.mapred.output.committer.class
>> com.appsflyer.spark.DirectOutputCommitter
>>
>> On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov wrote:
>>
>>> I run spark 1.4.1 on Amazon AWS EMR 4.0.0.
>>>
>>> For some reason Spark saveAsTextFile is very slow on emr-4.0.0 in
>>> comparison to emr-3.8 (was 5 sec, now 95 sec).
>>>
>>> Actually saveAsTextFile says that it's done in 4.356 sec, but after that
>>> I see lots of INFO messages with 404 errors from the com.amazonaws.latency
>>> logger for the next 90 sec:
>>>
>>> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
>>> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>>>
>>> 2015-09-01 21:16:17,637 INFO [dag-scheduler-event-loop]
>>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
>>> (saveAsTextFile at <console>:22) finished in 4.356 s
>>> 2015-09-01 21:16:17,637 INFO [task-result-getter-2]
>>> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0,
>>> whose tasks have all completed, from pool
>>> 2015-09-01 21:16:17,637 INFO [main] scheduler.DAGScheduler
>>> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
>>> <console>:22, took 4.547829 s
>>> 2015-09-01 21:16:17,638 INFO [main] s3n.S3NativeFileSystem
>>> (S3NativeFileSystem.java:listStatus(896)) - listStatus
>>> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
>>> 2015-09-01 21:16:17,651 INFO [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>>> ID: 3B2F06FD11682D22), S3 Extended Request ID:
>>> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
>>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>>> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], Exception=1,
>>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
>>> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
>>> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
>>> 2015-09-01 21:16:17,723 INFO [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>>> ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>>> RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
>>> HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
>>> RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
>>> HttpClientSendRequestTime=[0.089],
>>> 2015-09-01 21:16:17,756 INFO [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>>> ID: 62C6B413965447FD), S3 Extended Request ID:
>>> 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
>>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>>> AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], Exception=1,
>>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
>>> HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
>>> RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
>>> 2015-09-01 21:16:17,774 INFO [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>>> ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>>> RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
>>> HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
>>> RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
>>>
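For Spark jobs specifically, the same direct-output settings can be forwarded through Spark's Hadoop configuration (Spark passes any spark.hadoop.* property into the job's Hadoop conf), e.g. in spark-defaults.conf — a sketch based on the property names discussed above:

```
spark.hadoop.mapred.output.direct.EmrFileSystem       true
spark.hadoop.mapred.output.direct.NativeS3FileSystem  true
```

This avoids editing mapred-site.xml on every node when only Spark's saveAsTextFile output path needs the direct committer behavior.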