Experience with centralised logging for Spark?

2015-07-03 Thread Edward Sargisson
Hi all,
I'm wondering if anybody has any experience with centralised logging for
Spark - or has even felt that there was a need for this, given the Web UI.

At my organization we use Log4j2 and Flume as the front end of our
centralised logging system. I was looking into modifying Spark to use that
system and I'm reconsidering my approach. I thought I'd ask the community
to see what people have tried.

Log4j2 is important because it works nicely with Flume. The problem I've
got is that all of the Spark processes (master, worker, spark-submit) use
the same conf directory and so would get the same log4j2.xml. This means
they would all try to use the same directory for the Flume file channel
(which fails because Flume locks its directory). Secondly, if I want to
add an interceptor that stamps every event with the component name, I
cannot tell the components apart - everything would be stamped
'apache-spark'.
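
For concreteness, the sort of log4j2.xml I'd want is roughly this (a sketch
only - it assumes each process could be started with its own
-Dspark.component system property, and the Flume agent address and paths
are made up):

  <?xml version="1.0" encoding="UTF-8"?>
  <Configuration status="warn">
    <Appenders>
      <!-- one file-channel directory per component, keyed off a per-process property -->
      <Flume name="flume" type="persistent"
             dataDir="/var/spark/flume-channel/${sys:spark.component:-apache-spark}">
        <Agent host="flume.example.com" port="4141"/>
        <RFC5424Layout appName="${sys:spark.component:-apache-spark}"/>
      </Flume>
    </Appenders>
    <Loggers>
      <Root level="info">
        <AppenderRef ref="flume"/>
      </Root>
    </Loggers>
  </Configuration>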

This could be fixed by modifying the start-up scripts to pass the component
name around, but that's more modification than I really want to make.

So are people generally happy with the Web UI approach for getting access
to stderr and stdout, or have other people rolled better solutions?

Yes, I'm aware of https://issues.apache.org/jira/browse/SPARK-6305 and the
associated pull request.

Many thanks, in advance, for your thoughts.

Cheers,
Edward


Application on standalone cluster never changes state to be stopped

2015-05-22 Thread Edward Sargisson
Hi,
Environment: a Spark standalone cluster running with a master and a worker on a
small Vagrant VM. The Jetty webapp on the same node calls the spark-submit
script to start the job.

From the contents of stdout I can see that the job runs successfully.
However, the spark-submit process never seems to complete (even after 2
minutes) and the application's state in the Web UI remains RUNNING.
The application's main method calls SparkContext.stop and exits with zero.
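In outline it looks like this (a sketch, not our exact code - the job body
is just a stand-in):

  import org.apache.spark.{SparkConf, SparkContext}

  object Job {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("vagrant-job"))
      try {
        sc.parallelize(1 to 100).count()   // stand-in for the real work
      } finally {
        sc.stop()                          // should tell the master the application is done
      }
      System.exit(0)                       // in case a stray non-daemon thread keeps the JVM alive
    }
  }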

What are the criteria for an application to be considered finished?

Thanks in advance!
Edward


Fwd: Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Edward Sargisson
Hi Sean and Ted,
Thanks for your replies.

I don't have our current problems nicely written up as good questions yet.
I'm still sorting out classpath issues, etc.
In case it is of help, I'm seeing:
* Exception in thread "Spark Context Cleaner"
java.lang.NoClassDefFoundError: 0
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)
* A colleague and I have been getting clashing dependencies because of the
aforementioned classpath issue
* The clashing dependencies are also causing issues with which Jetty
libraries are available in the classloader from Spark and whether they
clash with the libraries we already have.

More anon,

Cheers,
Edward



 Original Message 
Subject: Re: spark 1.3.1 jars in repo1.maven.org
Date: 2015-05-20 00:38
From: Sean Owen so...@cloudera.com
To: Edward Sargisson esa...@pobox.com
Cc: user user@spark.apache.org


Yes, the published artifacts can only refer to one version of anything
(OK, modulo publishing a large number of variants under classifiers).

You aren't intended to rely on Spark's transitive dependencies for
anything. Compiling against the Spark API has no relation to what
version of Hadoop it binds against because it's not part of any API.
You can even mark the Spark dependency as provided in your build and get
all the Spark/Hadoop bindings at runtime from your cluster.
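For example, in a pom that would look something like:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
  </dependency>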

What problem are you experiencing?


On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson esa...@pobox.com wrote:

Hi,
I'd like to confirm an observation I've just made: specifically, that Spark
is only available in repo1.maven.org built against one Hadoop variant.

The Spark source can be compiled against a number of different Hadoop
versions using profiles. Yay.
However, the Spark jars in repo1.maven.org appear to be compiled against one
specific Hadoop version, and no other differentiation is made. (I can see the
difference: hadoop-client is 2.2.0 in the repo1.maven.org artifact and 1.0.4
in the version I compiled locally.)

The implication here is that if you have a pom file asking for
spark-core_2.10 version 1.3.1, then Maven will only give you a Hadoop 2
build. Maven assumes that non-snapshot artifacts never change, so trying to
get a Hadoop 1 build will end in tears.

This means that if you compile code against spark-core, you will probably
hit NoClassDefFound classpath issues unless the Hadoop 2 version is exactly
the one you want.

Have I gotten this correct?

It happens that our little app uses a SparkContext directly from a
Jetty webapp, and the classpath differences have been causing some confusion.
We are currently installing a Hadoop 1 Spark master and worker.

Thanks a lot!
Edward


spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Edward Sargisson
Hi,
I'd like to confirm an observation I've just made: specifically, that Spark
is only available in repo1.maven.org built against one Hadoop variant.

The Spark source can be compiled against a number of different Hadoop
versions using profiles. Yay.
However, the Spark jars in repo1.maven.org appear to be compiled against one
specific Hadoop version, and no other differentiation is made. (I can see the
difference: hadoop-client is 2.2.0 in the repo1.maven.org artifact and 1.0.4
in the version I compiled locally.)

The implication here is that if you have a pom file asking for
spark-core_2.10 version 1.3.1, then Maven will only give you a Hadoop 2
build. Maven assumes that non-snapshot artifacts never change, so trying to
get a Hadoop 1 build will end in tears.

This means that if you compile code against spark-core, you will probably
hit NoClassDefFound classpath issues unless the Hadoop 2 version is exactly
the one you want.

Have I gotten this correct?

It happens that our little app uses a SparkContext directly from a
Jetty webapp, and the classpath differences have been causing some confusion.
We are currently installing a Hadoop 1 Spark master and worker.
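
If I've got that right, I assume the fix on our side is to exclude the
transitive hadoop-client from spark-core and pin the Hadoop 1 version we
actually run against, something like this (versions illustrative):

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
    <exclusions>
      <exclusion>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.0.4</version>
  </dependency>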

Thanks a lot!
Edward


How do you use the thrift-server to get data from a Spark program?

2014-10-26 Thread Edward Sargisson
Hi all,
This feels like a dumb question but bespeaks my lack of understanding: what
is the Spark thrift-server for? Especially if there's an existing Hive
installation.

Background:
We want to use Spark to do some processing starting from files (probably in
MapRFS). We want to be able to read the results using SQL so that we can
report on them using Eclipse BIRT.

My confusion:
Spark 1.1 includes a thrift-server for accessing data via JDBC. However, I
don't understand how to make data available through it from the rest of Spark.

I have a small program that does what I want in spark-shell. It reads some
JSON, does some manipulation using SchemaRDDs and then has the data ready.
If I've started the shell with hive-site.xml pointing to a Hive
installation, I can use SchemaRDD.saveAsTable to put it into Hive - and then
I can use beeline to read it.
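
In spark-shell it's roughly this (paths and table names are made up; sc is
the context the shell provides):

  import org.apache.spark.sql.hive.HiveContext

  val hive = new HiveContext(sc)                  // picks up hive-site.xml from the conf directory
  val events = hive.jsonFile("/data/events.json") // SchemaRDD with an inferred schema
  events.registerTempTable("raw_events")
  val summary = hive.sql("SELECT name, COUNT(*) AS n FROM raw_events GROUP BY name")
  summary.saveAsTable("event_summary")            // lands in the Hive metastore; readable from beeline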

But that's using the *Hive* thrift-server and not the Spark thrift-server,
which doesn't seem to be the intention of having a separate thrift-server in
Spark. Before I started on this I assumed that you could run a Spark
program (in, say, Java) and then make those results accessible through the
JDBC interface.

So, please, fill me in. What am I missing?

Many thanks,
Edward