Re: Spark Conf

2018-03-15 Thread Neil Jonkers
Hi

"In general, configuration values explicitly set on a SparkConf take the
highest precedence, then flags passed to spark-submit, then values in the
defaults file."
https://spark.apache.org/docs/latest/submitting-applications.html

Perhaps this will help Vinyas:
Look at args.sparkProperties in
https://github.com/apache/spark/blob/v2.3.0/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
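
Roughly: SparkSubmit maps --executor-memory onto the spark.executor.memory
property; in yarn cluster mode the resolved properties are written to a properties
file that is shipped to the YARN ApplicationMaster, the AM copies them into
sys.props, and the SparkConf created inside SparkContext (loadDefaults = true)
picks them up from there. A minimal sketch of that last step, as an illustration
only (not the actual SparkSubmit/ApplicationMaster code):

import org.apache.spark.SparkConf

val cliFlags = Map("--executor-memory" -> "8g")            // from spark-submit
val sparkProps = cliFlags.collect {
  case ("--executor-memory", v) => "spark.executor.memory" -> v
}
sparkProps.foreach { case (k, v) => sys.props(k) = v }     // done by the YARN AM

val conf = new SparkConf(loadDefaults = true)              // what SparkContext builds
assert(conf.get("spark.executor.memory") == "8g")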

On Thu, Mar 15, 2018 at 1:53 AM, Vinyas Shetty wrote:

>
> Hi,
>
> I am trying to understand the Spark internals, so I was looking at the Spark
> code flow. Now, in a scenario where I do a spark-submit in yarn cluster mode
> with --executor-memory 8g on the command line, how does Spark know about
> this executor memory value, since in SparkContext I see:
>
> _executorMemory = _conf.getOption("spark.executor.memory")
>   .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
>   .orElse(Option(System.getenv("SPARK_MEM")))
>
>
> Now SparkConf loads the defaults from Java system properties, but I did not
> find where the command-line value is added to the Java system properties
> (sys.props) in yarn cluster mode, i.e. I did not see a call to
> Utils.loadDefaultSparkProperties. How does this command-line value reach the
> SparkConf that is part of SparkContext?
>
> Regards,
> Vinyas
>
>


Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-08 Thread Neil Jonkers
Can you provide a code sample please?
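
In the meantime, given the SharedInMemoryCache warning quoted below, one hedged
thing to try is raising the partition file metadata cache size (Scala shown, but
the same config applies from pyspark; the S3 path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // default is 262144000 bytes (~250 MB); raise it if the table has many partitions
  .config("spark.sql.hive.filesourcePartitionFileCacheSize", 512L * 1024 * 1024)
  .getOrCreate()

val df = spark.read.parquet("s3://your-bucket/your-table/")  // placeholder path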

On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony  wrote:

> Hi all -
>
>
> since upgrading to 2.2.0, we've noticed a significant increase in the time
> taken by read.parquet(...) ops. The parquet files are being read from S3. Upon
> entering the call at the interactive terminal (pyspark in this case), the
> terminal will sit "idle" for several minutes (as many as 10) before returning:
>
>
> "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
> partition metadata from memory due to size constraints
> (spark.sql.hive.filesourcePartitionFileCacheSize = 20 bytes).
> This may impact query planning performance."
>
>
> In the spark UI, there are no jobs being run during this idle period.
> Subsequently, a short 1-task job lasting approximately 10 seconds runs, and
> then another idle time of roughly 2-3 minutes follows thereafter before
> returning to the terminal/CLI.
>
>
> Can someone explain what is happening here in the background? Is there a
> misconfiguration we should be looking for? We are using Hive metastore on
> the EMR cluster.
>
>
>


Re: Looking at EMR Logs

2017-03-31 Thread Neil Jonkers
If you modify spark.eventLog.dir to point to an S3 path, you will encounter the
following exception in the Spark History Server log at
/var/log/spark/spark-history-server.out:


Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)

To move past this issue, we can do the following. This is for EMR Release:
emr-5.4.0

cd /usr/lib/spark/jars
sudo ln -s /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.15.0.jar emrfs.jar

Now Spark history server will startup correctly and you can review the
Spark event logs on S3.
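
For completeness, a hedged sketch of enabling event logging to S3 from the job
itself (the bucket/path is a placeholder and must already exist):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3://my-bucket/spark-events/")  // placeholder
  .getOrCreate()

On the History Server side, pointing spark.history.fs.logDirectory at the same S3
path in spark-defaults.conf should let it read those logs directly once the EMRFS
jar above is on its classpath.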


On Fri, Mar 31, 2017 at 4:46 PM, Vadim Semenov 
wrote:

> You can provide your own log directory, where the Spark event logs will be
> saved, and which you can replay afterwards.
>
> Set in your job this: `spark.eventLog.dir=s3://bucket/some/directory` and
> run it.
> Note! The path `s3://bucket/some/directory` must exist before you run your
> job, it'll not be created automatically.
>
> The Spark HistoryServer on EMR won't show you anything because it's
> looking for logs in `hdfs:///var/log/spark/apps` by default.
>
> After that you can either copy the log files from s3 to the hdfs path
> above, or you can copy them locally to `/tmp/spark-events` (the default
> directory for spark logs) and run the history server like:
> ```
> cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
> sbin/start-history-server.sh
> ```
> and then open http://localhost:18080
>
>
>
>
> On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay 
> wrote:
>
>> I am looking for tips on evaluating my Spark job after it has run.
>>
>> I know that right now I can look at the history of jobs through the web
>> ui. I also know how to look at the current resources being used by a
>> similar web ui.
>>
>> However, I would like to look at the logs after the job is finished to
>> evaluate such things as how many tasks were completed, how many executors
>> were used, etc. I currently save my logs to S3.
>>
>> Thanks!
>>
>> Henry
>>
>> --
>> Paul Henry Tremblay
>> Robert Half Technology
>>
>
>


Re: spark 2.02 error when writing to s3

2017-01-20 Thread Neil Jonkers
Can you test by enabling EMRFS consistent view and using the s3:// URI?

http://docs.aws.amazon.com/emr/latest/ManagementGuide/enable-consistent-view.html
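
If you enable it at cluster launch, a hedged example of the classification (per
the linked docs; adjust property names for your EMR release):

classification=emrfs-site,properties=[fs.s3.consistent=true]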

-------- Original message --------
From: Steve Loughran
Date: 20/01/2017 21:17 (GMT+02:00)
To: "VND Tremblay, Paul"
Cc: Takeshi Yamamuro, user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3
AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call
may still show it. You may be seeing that effect; even after the DELETE has got rid
of the files, a listing sees something there, and I suspect the time it takes for
the listing to "go away" will depend on the total number of entries underneath, as
there are more deletion markers ("tombstones") to propagate around S3.

Try deleting the path and then waiting a short period
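
To rule out the speculative-retry theory mentioned further down the thread, a
hedged sketch for pinning speculation off explicitly and checking the effective
value (Scala shown; the same settings apply from pyspark):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.speculation", "false")  // already the default, but explicit
  .getOrCreate()

println(spark.conf.get("spark.speculation", "false"))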


On 20 Jan 2017, at 18:54, VND Tremblay, Paul  wrote:

I am using an EMR cluster, and the latest version offered is 2.0.2. The link below
indicates that another user had the same problem, which seems unresolved.
 
Thanks
 
Paul
 

From: Takeshi Yamamuro [mailto:linguin@gmail.com] 
Sent: Thursday, January 19, 2017 9:27 PM
To: VND Tremblay, Paul
Cc: user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3
 
Hi,
 
Do you get the same exception also in v2.1.0?
Anyway, I saw another guy reporting the same error, I think.
https://www.mail-archive.com/user@spark.apache.org/msg60882.html
 
// maropu
 
 
On Fri, Jan 20, 2017 at 5:15 AM, VND Tremblay, Paul  
wrote:
I have come across a problem when writing CSV files to S3 in Spark 2.0.2. The
problem does not exist in Spark 1.6.
 
19:09:20 Caused by: java.io.IOException: File already 
exists:s3://stx-apollo-pr-datascience-internal/revenue_model/part-r-00025-c48a0d52-9600-4495-913c-64ae6bf888bd.csv
 
 
My code is this:
 
new_rdd\
    .map(add_date_diff)\
    .map(sid_offer_days)\
    .groupByKey()\
    .map(custom_sort)\
    .map(before_rev_date)\
    .map(lambda x, num_weeks=args.num_weeks: create_columns(x, num_weeks))\
    .toDF()\
    .write.csv(
        sep="|",
        header=True,
        nullValue='',
        quote=None,
        path=path
    )
 
In order to get the path (the last argument), I call this function:
 
def _get_s3_write(test):
    if s3_utility.s3_data_already_exists(_get_write_bucket_name(), _get_s3_write_dir(test)):
        s3_utility.remove_s3_dir(_get_write_bucket_name(), _get_s3_write_dir(test))
    return make_s3_path(_get_write_bucket_name(), _get_s3_write_dir(test))
 
In other words, I am removing the directory if it exists before I write. 
 
Notes:
 
* If I use a small set of data, then I don't get the error
 
* If I use Spark 1.6, I don't get the error
 
* If I read in a simple dataframe and then write to S3, I still get the error 
(without doing any transformations)
 
* If I do the previous step with a smaller set of data, I don't get the error.
 
* I am using pyspark, with python 2.7
 
* The thread at this link:
https://forums.aws.amazon.com/thread.jspa?threadID=152470 indicates the problem is
caused by a sync issue. With large datasets, Spark tries to write multiple times,
which causes the error. The suggestion is to turn off speculation, but I believe
speculation is turned off by default in pyspark.
 
Thanks!
 
Paul
 
 


 
-- 
---
Takeshi Yamamuro



Re: Running Spark on EMR

2017-01-15 Thread Neil Jonkers
Hello,

Can you drop the URL:

 spark://master:7077

That URL is only used when running Spark in standalone mode.
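
On EMR the cluster manager is YARN, so the submit would typically look like this
(the script name is a placeholder; on EMR spark-submit usually defaults to YARN
already):

spark-submit --master yarn --deploy-mode client myscript.py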

Regards

-------- Original message --------
From: Marco Mistroni
Date: 15/01/2017 16:34 (GMT+02:00)
To: User
Subject: Running Spark on EMR
hi all
 could anyone assist here?
I am trying to run Spark 2.0.0 on an EMR cluster, but I am having issues
connecting to the master node.
So, below is a snippet of what I am doing


sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as an input parameter. The idea was that I could run the script
locally on my local Spark instance as well as submit scripts to any cluster I want.


Now I have:
1 - set up a cluster on EMR
2 - connected to the master node
3 - launched the command: spark-submit myscripts.py spark://master:7077

But that results in a connection refused exception.
Then I tried to remove the .master call above and launch the script with
the following command:

spark-submit --master spark://master:7077   myscript.py

but I still get a connection refused exception.


I am using Spark 2.0.0. Could anyone advise on how I should build the Spark
session and how I can submit a python script to the cluster?

kr
 marco  

Re: [Spark Core] - Spark dynamoDB integration

2016-12-12 Thread Neil Jonkers
Hello,

Good examples on how to interface with DynamoDB from Spark here:

https://aws.amazon.com/blogs/big-data/using-spark-sql-for-etl/
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
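
If the NoSuchMethodError quoted below is the blocker, a hedged way to see which
aws-java-sdk jar actually got loaded (the class name is taken from the stack trace):

val sdkJar = classOf[com.amazonaws.SDKGlobalConfiguration]
  .getProtectionDomain.getCodeSource.getLocation
println(sdkJar)  // compare against the SDK version you built with (1.11.33)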

Thanks

On Mon, Dec 12, 2016 at 7:56 PM, Marco Mistroni  wrote:

> Hi
> If it can help:
> 1. Check the Javadocs for when that method was introduced.
> 2. Are you building a fat jar? Check which libraries have been included; some
> other dependencies might have forced an old copy to be included.
> 3. If you take the code outside Spark, does it work successfully?
> 4. Send a short sample.
> Hth
>
> On 12 Dec 2016 11:03 am, "Pratyaksh Sharma" wrote:
>
> Hey, I am using Apache Spark for a streaming application. I am trying to
> store the processed data into DynamoDB using the Java SDK. Getting the
> following exception -
> 16/12/08 23:23:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0:
> java.lang.NoSuchMethodError: com.amazonaws.SDKGlobalConfiguration.isInRegionOptimizedModeEnabled()Z
> at com.amazonaws.ClientConfigurationFactory.getConfig(ClientConfigurationFactory.java:35)
> at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.<init>(AmazonDynamoDBClient.java:374)
>
> Spark version - 1.6.1
> Scala version - 2.10.5
> aws sdk version - 1.11.33
>
> Has anyone faced this issues? Any help will be highly appreciated.
>
> --
> Regards
>
> Pratyaksh Sharma
> 12105EN013
> Department of Electronics Engineering
> IIT Varanasi
> Contact No +91-8127030223
>
>
>


Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Neil Jonkers
Hi,

Can you set the following parameters in your mapred-site.xml file please:

<property>
  <name>mapred.output.direct.EmrFileSystem</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.direct.NativeS3FileSystem</name>
  <value>true</value>
</property>

You can also config this at cluster launch time with the following
Classification via EMR console:

classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]
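
If editing mapred-site.xml on a running cluster isn't convenient, the same settings
can usually be applied per job through the Hadoop configuration that Spark exposes
(a hedged sketch from the spark shell):

sc.hadoopConfiguration.set("mapred.output.direct.EmrFileSystem", "true")
sc.hadoopConfiguration.set("mapred.output.direct.NativeS3FileSystem", "true")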


Thank you

On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov 
wrote:

> I checked the previous EMR config (emr-3.8);
> mapred-site.xml has the following setting:
> 
> <property>
>   <name>mapred.output.committer.class</name>
>   <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
> </property>
> 
>
>
> On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov 
> wrote:
>
>> Should I use DirectOutputCommitter?
>> spark.hadoop.mapred.output.committer.class
>>  com.appsflyer.spark.DirectOutputCommitter
>>
>>
>>
>> On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov wrote:
>>
>>> I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.
>>>
>>> For some reason Spark saveAsTextFile is very slow on emr-4.0.0 in
>>> comparison to emr-3.8 (it was 5 sec, now it is 95 sec).
>>>
>>> Actually saveAsTextFile says that it's done in 4.356 sec, but after that
>>> I see lots of INFO messages with 404 errors from the com.amazonaws.latency
>>> logger for the next 90 sec.
>>>
>>> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
>>> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>>>
>>> 2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
>>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
>>> (saveAsTextFile at :22) finished in 4.356 s
>>> 2015-09-01 21:16:17,637 INFO  [task-result-getter-2]
>>> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0,
>>> whose tasks have all completed, from pool
>>> 2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
>>> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
>>> :22, took 4.547829 s
>>> 2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
>>> (S3NativeFileSystem.java:listStatus(896)) - listStatus
>>> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
>>> 2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>>> ID: 3B2F06FD11682D22), S3 Extended Request ID:
>>> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
>>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>>> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], Exception=1,
>>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
>>> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
>>> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
>>> 2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>>> ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>>> RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
>>> HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
>>> RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
>>> HttpClientSendRequestTime=[0.089],
>>> 2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>>> ID: 62C6B413965447FD), S3 Extended Request ID:
>>> 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
>>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>>> AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], Exception=1,
>>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
>>> HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
>>> RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
>>> 2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>>> ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
>>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>>> RequestCount=1, HttpClientPoolPendingCount=0,
>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
>>> HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
>>> RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
>>>