Re: Spark UI

2020-07-20 Thread ArtemisDev
Thanks Xiao for the info.  I was looking for this, too.  This page 
wasn't linked from anywhere on the main doc page (Overview) or any of 
the pull-down menus.  Someone should remind the doc team to update the 
table of contents on the Overview page.


-- ND

On 7/19/20 10:30 PM, Xiao Li wrote:
https://spark.apache.org/docs/3.0.0/web-ui.html is the official doc 
for Spark UI.


Xiao

On Sun, Jul 19, 2020 at 1:38 PM venkatadevarapu <ramesh.biexp...@gmail.com> wrote:


Hi,

I'm looking for a tutorial/video/material that explains the content of
the various tabs in the Spark Web UI.
Can someone point me to the relevant info?

Thanks



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




--



Using Spark UI with Running Spark on Hadoop Yarn

2020-07-13 Thread ArtemisDev
Is there any way to make the Spark process visible via the Spark UI when 
running Spark 3.0 on a Hadoop YARN cluster?  The Spark documentation talks 
about replacing the Spark UI with the Spark history server, but doesn't 
give many details.  I would therefore assume it is still possible to use 
the Spark UI when running Spark on a Hadoop YARN cluster.  Is this 
correct?  Does the Spark history server offer the same functionality as 
the Spark UI?


But how could this be possible (using the Spark UI) if the Spark master 
isn't active, given that all of the job scheduling and resource allocation 
is handled by the YARN servers?


Thanks!

-- ND


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



org.apache.spark.deploy.yarn.ExecutorLauncher not found when running Spark 3.0 on Hadoop

2020-07-13 Thread ArtemisDev
I've been trying to set up the latest stable version of Spark 3.0 on a 
Hadoop cluster using YARN.  When running spark-submit in client mode, I 
always get an error that org.apache.spark.deploy.yarn.ExecutorLauncher was 
not found.  This happens when I preload the Spark jar files onto HDFS and 
point the spark.yarn.jars property at the HDFS location (i.e. set 
spark.yarn.jars to hdfs:///spark-3/jars or 
hdfs://namenode:8020/spark-3/jars).  I've checked the /spark-3/jars 
directory on HDFS and all the jar files are accessible.  The exception 
messages are listed below.


The problem does not occur when I comment out the spark.yarn.jars line in 
the spark-defaults.conf file; spark-submit then finishes without any 
problems.


Any ideas what I have done wrong?  Thanks!

-- ND

==

Exception in thread "main" org.apache.spark.SparkException: Application 
application_1594664166056_0005 failed 2 times due to AM Container for 
appattempt_1594664166056_0005_02 exited with exitCode: 1
Failing this attempt.Diagnostics: [2020-07-13 20:07:20.882]Exception 
from container-launch.

Container id: container_1594664166056_0005_02_01
Exit code: 1

[2020-07-13 20:07:20.886]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err.

Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
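
For reference, this is roughly how I understand spark.yarn.jars is meant
to be set: the docs describe it as a comma-separated list of globs, so a
sketch like the following (the glob is my assumption, not a verified fix)
would point it at the jar files rather than at the bare directory:

from pyspark.sql import SparkSession

# A hedged sketch: the glob below matches the preloaded jar files
# themselves; the HDFS path is taken from the message above.
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-jars-check")
         .config("spark.yarn.jars", "hdfs://namenode:8020/spark-3/jars/*.jar")
         .getOrCreate())

The same value can also go into spark-defaults.conf as a spark.yarn.jars
line.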





Re: File Not Found: /tmp/spark-events in Spark 3.0

2020-07-05 Thread ArtemisDev
Thank you all for the responses.  I believe the user shouldn't have to 
worry about creating the log directory explicitly.  Event logging should 
behave like the other logs (e.g. master or worker) in that the directory 
is created automatically if it doesn't exist.


-- ND

On 7/2/20 9:19 AM, Zero wrote:


This could be the result of not setting the event log location properly. 
By default it is /tmp/spark-events, and since files in the /tmp directory 
are cleaned up regularly, you could run into this problem.


-- Original --
*From:* "Xin Jinhan"<18183124...@163.com>;
*Date:* Thu, Jul 2, 2020 08:39 PM
*To:* "user";
*Subject:* Re: File Not Found: /tmp/spark-events in Spark 3.0

Hi,

First, /tmp/spark-events is the default storage location for the Spark
event log, but the log is only stored when you set
'spark.eventLog.enabled=true', which your Spark 2.4.6 setup may have had
set to false. So you can simply set it to false and the error will
disappear.

Second, I suggest enabling the event log and specifying the log location
with 'spark.eventLog.dir' (either a distributed filesystem or a local
path), because you may want to check the logs later (you can simply use
the Spark history server).

Regards
Jinhan
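
For reference, a minimal sketch of the settings described above (the HDFS
path is a placeholder; note that Spark expects the event log directory to
already exist, it does not create it):

from pyspark.sql import SparkSession

# Minimal sketch: hdfs:///spark-events is assumed to have been created
# beforehand, since Spark does not create the event log directory itself.
spark = (SparkSession.builder
         .appName("eventlog-example")
         .config("spark.eventLog.enabled", "true")             # turn logging on
         .config("spark.eventLog.dir", "hdfs:///spark-events")
         .getOrCreate())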



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread ArtemisDev
Could you share your code?  Are you sure your Spark 2.4 cluster had indeed 
read anything?  It looks like the Input Size field is empty under 2.4.


-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:


I have a large number of JSON files that Spark 2.4 can read in 36 seconds, 
but Spark 3.0 takes almost 33 minutes to read the same data. On closer 
analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 
2.4. Does anyone have any idea what is going on? Is there a configuration 
problem with Spark 3.0?


Here are the details:

*Spark 2.4*


Summary Metrics for 2203 Completed Tasks

Metric     Min      25th percentile   Median   75th percentile   Max
Duration   0.0 ms   0.0 ms            0.0 ms   1.0 ms            62.0 ms
GC Time    0.0 ms   0.0 ms            0.0 ms   0.0 ms            11.0 ms

Aggregated Metrics by Executor

Executor ID   Address          Task Time   Total Tasks   Failed Tasks   Killed Tasks   Succeeded Tasks   Blacklisted
driver        10.0.0.8:49159   36 s        2203          0              0              2203              false



*Spark 3.0*


Summary Metrics for 8 Completed Tasks

Metric                 Min                25th percentile    Median             75th percentile    Max
Duration               3.8 min            4.0 min            4.1 min            4.4 min            5.0 min
GC Time                3 s                3 s                3 s                4 s                4 s
Input Size / Records   15.6 MiB / 51028   16.2 MiB / 53303   16.8 MiB / 55259   17.8 MiB / 58148   20.2 MiB / 71624

Aggregated Metrics by Executor

Executor ID   Address          Task Time   Total Tasks   Failed Tasks   Killed Tasks   Succeeded Tasks   Blacklisted   Input Size / Records
driver        10.0.0.8:50224   33 min      8             0              0              8                 false         136.1 MiB / 451999




The DAG is also different.

Spark 2.4 DAG: [attachment: Screenshot 2020-06-27 16.30.26.png]

Spark 3.0 DAG: [attachment: Screenshot 2020-06-27 16.32.32.png]
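
For reference (not the original poster's code), a minimal sketch of the
kind of read being compared; supplying an explicit schema skips the
inference scan over the input, which can help separate inference cost from
the actual read time. The path and fields below are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("json-read-timing").getOrCreate()

# Hypothetical schema; replace with the real fields of the JSON files.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Inferring the schema scans the input; an explicit schema skips that pass.
df_inferred = spark.read.json("/data/events/*.json")
df_explicit = spark.read.schema(schema).json("/data/events/*.json")
print(df_explicit.count())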




File Not Found: /tmp/spark-events in Spark 3.0

2020-06-29 Thread ArtemisDev
While launching a Spark job from Zeppelin against a standalone Spark 
cluster (Spark 3.0 with multiple workers, without Hadoop), we encountered 
a Spark interpreter exception caused by an I/O File Not Found error due to 
the non-existence of the /tmp/spark-events directory.  We had to create 
the /tmp/spark-events directory manually to resolve the problem.


For reference, the same notebook code ran on Spark 2.4.6 (also a 
standalone cluster) without any problems.


What is /tmp/spark-events for, and is there any way to pre-define this 
directory via a config parameter so we don't end up manually adding it 
under /tmp?


Thanks!

-- ND


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Where are all the jars gone ?

2020-06-24 Thread ArtemisDev
If you are using Maven to manage your jar dependencies, the jar files are 
located in the Maven repository in your home directory, usually under the 
.m2 directory.


Hope this helps.

-ND

On 6/23/20 3:21 PM, Anwar AliKhan wrote:

Hi,

I prefer to do most of my projects in Python, and for that I use Jupyter.
I have been downloading the compiled version of Spark.

I do not normally like the source-code version because the build process 
makes me nervous.

You know, with lines of stuff scrolling up the screen.
What am I going to do if a build fails? I am a user!

I decided to risk it, and it was only one mvn command to build (45 minutes 
later).

Everything is great. Success.

I removed all JVMs except JDK 8 for compilation.

I used JDK 8 so I know which libraries were linked in the build process.
I also used my local version of Maven, not the apt-installed version.

I used JDK 8 because if you go to this Scala site, 
http://scala-ide.org/download/sdk.html, they say JDK 8 is required for the 
IDE, even for Scala 2.12.
They don't say JDK 8 or higher, just JDK 8.

So anyway, once in a while I do Spark projects in Scala with Eclipse.

For that I don't use Maven or anything; I prefer to use the build path and 
external jars. This way I know exactly which libraries I am linking 
against.


Creating a jar in Eclipse is straightforward for spark-submit.


Anyway, as you can see (below), I am pointing Jupyter at Spark with 
findspark.init('/opt/spark').

That's OK, everything is fine.

With the compiled version of Spark there is a jars directory, which I have 
been using in Eclipse.




With my own compiled-from-source version there is no jars directory.


Where are all the jars gone?



I am not sure how findspark.init('/opt/spark') is locating the libraries, 
unless it is finding them from Anaconda.


import findspark
findspark.init('/opt/spark')
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('Titanic Data') \
    .getOrCreate()
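
As a side check (not from the original message), the following prints what
the session above actually picked up:

import os
import pyspark

print(os.environ.get('SPARK_HOME'))   # the install findspark.init() exported
print(pyspark.__file__)               # which pyspark package was imported
print(spark.version)                  # version reported by the running session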



Structured Streaming using File Source - How to handle live files

2020-06-07 Thread ArtemisDev
We were trying to use Structured Streaming with the file source, but had 
problems getting the files read by Spark properly.  We have another 
process generating the data files in the Spark data source directory on a 
continuous basis.  What we observed was that the moment a data file was 
created, before the data-producing process had finished writing it, Spark 
read it immediately without reaching EOF, and then never revisited the 
file.  So we ended up with empty data content.  The only way to make it 
work was to produce the data files in a separate directory (e.g. /tmp) and 
move them into Spark's file source directory after the data generation 
completes.


My questions: Is this behavior by design, or is there any way to keep the 
Spark streaming process from ingesting a file while it is still being 
written by another process?  In other words, do we have to use the 
tmp-dir-and-move approach, or can the data-producing process and Spark 
share the same directory?
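
For reference, a minimal sketch of the write-then-move workaround described
above (the paths and schema are placeholders, not our actual setup); a
rename within the same filesystem is atomic, so Spark never picks up a
half-written file:

import os
import shutil
import uuid

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("file-source-stream").getOrCreate()

# File sources require an explicit schema (unless streaming schema
# inference is enabled).
schema = StructType([StructField("line", StringType())])

stream = (spark.readStream
          .schema(schema)
          .json("/data/incoming"))     # the directory Spark watches

query = (stream.writeStream
         .format("console")
         .outputMode("append")
         .start())

# Producer side: write the file elsewhere, then move it into the watched
# directory in one step once it is fully written.
def publish(tmp_path, watched_dir):
    final_path = os.path.join(watched_dir, "%s.json" % uuid.uuid4())
    shutil.move(tmp_path, final_path)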


Thanks!

-- Nick


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org