Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Marcelo Vanzin
On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu  wrote:
> If there is no option to let the shell skip processing @VisibleForTesting,
> should the annotation be dropped?

That's what we did last time this showed up.
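
For reference, dropping it is a one-line change per member; a rough sketch (the
method name below is made up, not the actual QueryExecution member):

    // before: pulls in com.google.common.annotations, which the shaded assembly hides
    @VisibleForTesting def assertSomething(): Unit = { /* ... */ }

    // after: plain scaladoc, no Guava annotation needed
    /** Exposed for testing. */
    def assertSomething(): Unit = { /* ... */ }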

> On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin  wrote:
>>
>> We've had this in the past when using "@VisibleForTesting" in classes
>> that for some reason the shell tries to process. QueryExecution.scala
>> seems to use that annotation and that was added recently, so that's
>> probably the issue.
>>
>> BTW, if anyone knows how Scala can find a reference to the original
>> Guava class even after shading, I'd really like to know. I've looked
>> several times and never found where the original class name is stored.
>>
>> On Mon, Nov 9, 2015 at 10:37 AM, Zhan Zhang 
>> wrote:
>> > Hi Folks,
>> >
>> > Does anybody meet the following issue? I use "mvn package -Phive
>> > -DskipTests” to build the package.
>> >
>> > Thanks.
>> >
>> > Zhan Zhang
>> >
>> >
>> >
>> > bin/spark-shell
>> > ...
>> > Spark context available as sc.
>> > error: error while loading QueryExecution, Missing dependency 'bad
>> > symbolic
>> > reference. A signature in QueryExecution.class refers to term
>> > annotations
>> > in package com.google.common which is not available.
>> > It may be completely missing from the current classpath, or the version
>> > on
>> > the classpath might be incompatible with the version used when compiling
>> > QueryExecution.class.', required by
>> >
>> > /Users/zzhang/repo/spark/assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.2.0.jar(org/apache/spark/sql/execution/QueryExecution.class)
>> > :10: error: not found: value sqlContext
>> >import sqlContext.implicits._
>> >   ^
>> > :10: error: not found: value sqlContext
>> >import sqlContext.sql
>> >   ^
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to import SharedSparkContext

2015-11-18 Thread Marcelo Vanzin
On Wed, Nov 18, 2015 at 11:08 AM, njoshi  wrote:
> Doesn't *SharedSparkContext* come with spark-core? Do I need to include any
> special package in the library dependancies for using SharedSparkContext?

That's a test class. It's not part of the Spark API.
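
If you just need something similar for your own tests, a minimal sketch of a
ScalaTest helper trait (the names here are mine, not Spark's):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, Suite}

    // Creates one local SparkContext per suite and stops it afterwards.
    trait LocalSparkContext extends BeforeAndAfterAll { self: Suite =>
      @transient var sc: SparkContext = _

      override def beforeAll(): Unit = {
        super.beforeAll()
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = {
        if (sc != null) sc.stop()
        super.afterAll()
      }
    }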

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Port Control for YARN-Aware Spark

2015-11-23 Thread Marcelo Vanzin
On Mon, Nov 23, 2015 at 6:24 PM, gpriestley  wrote:
> Questions I have are:
> 1) How does the spark.yarn.am.port relate to defined ports within Spark
> (driver, executor, block manager, etc.)?
> 2) Does the spark.yarn.am.port parameter only relate to the spark
> driver.port?
> 3) Is the spark.yarn.am.port applicable to Yarn-Cluster or Yarn-Client modes
> or both?

All the "yarn.am" options are specific to the client-mode application
master (so they don't affect cluster mode), and are unrelated to any
other Spark service (such as the block manager).
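
So, roughly (client mode only; the port numbers and jar name below are just
placeholders):

    # spark.yarn.am.* controls the client-mode AM; the driver and block manager
    # have their own, separate settings.
    spark-submit --master yarn-client \
      --conf spark.yarn.am.port=47000 \
      --conf spark.driver.port=47001 \
      --conf spark.blockManager.port=47002 \
      yourapp.jar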

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Marcelo Vanzin
On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu  wrote:
> I have a long running backend server where I will create a short-lived Spark
> job in response to each user request. Based on the fact that, by default,
> multiple Spark contexts cannot be created in the same JVM, it looks like I have
> 2 choices
>
> 2) run my jobs in yarn-cluster mode instead of yarn-client

There's nothing in yarn-client mode that prevents you from doing what
you describe. If you write some server for users to submit jobs to, it
should work whether you start the context in yarn-client or
yarn-cluster mode. It just might be harder to find out where it's
running if you do it in cluster mode.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-02 Thread Marcelo Vanzin
On Tue, Dec 1, 2015 at 9:43 PM, Anfernee Xu  wrote:
> But I have a single server(JVM) that is creating SparkContext, are you
> saying Spark supports multiple SparkContext in the same JVM? Could you
> please clarify on this?

I'm confused. Nothing you said so far requires multiple contexts. From
your original message:

> I have a long running backend server where I will create a short-lived Spark 
> job

You can have a single SparkContext and submit multiple jobs to it. And
that works regardless of cluster manager or deploy mode.
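
A minimal sketch of that pattern (names and paths are made up):

    import scala.concurrent.{ExecutionContext, Future}
    import org.apache.spark.{SparkConf, SparkContext}

    object Backend {
      // One long-lived context shared by every user request.
      val sc = new SparkContext(new SparkConf().setAppName("backend-server"))
      implicit val ec: ExecutionContext = ExecutionContext.global

      // Each request becomes a job on the same context; SparkContext is safe to
      // use from multiple threads for submitting jobs concurrently.
      def handleRequest(path: String): Future[Long] = Future {
        sc.textFile(path).count()
      }
    }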

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ClassLoader resources on executor

2015-12-02 Thread Marcelo Vanzin
On Tue, Dec 1, 2015 at 12:45 PM, Charles Allen
 wrote:
> Is there a way to pass configuration file resources to be resolvable through
> the classloader?

Not in general. If you're using YARN, you can cheat and use
"spark.yarn.dist.files" which will place those files in the classpath;
the same for "--files" in yarn cluster mode (but *not* client mode!).

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Marcelo Vanzin
(bcc: user@spark, since this is Hive code.)

You're probably including unneeded Spark jars in Hive's classpath
somehow. Either the whole assembly or spark-hive, both of which will
contain Hive classes, and in this case contain old versions that
conflict with the version of Hive you're running.

On Thu, Dec 3, 2015 at 9:54 AM, Mich Talebzadeh  wrote:
> Trying to run Hive on Spark 1.3 engine, I get
>
>
>
> conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:18 [stderr-redir-1]: INFO client.SparkClientImpl: Spark
> assembly has been built with Hive, including Datanucleus jars on classpath
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.connect.timeout=1000
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.threads=8
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property:
> hive.spark.client.server.connect.timeout=9
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: 15/12/03
> 17:53:19 INFO client.RemoteDriver: Connecting to: rhes564:36577
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Exception
> in thread "main" java.lang.NoSuchFieldError:
> SPARK_RPC_CLIENT_CONNECT_TIMEOUT
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:46)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:146)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> java.lang.reflect.Method.invoke(Method.java:606)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> Any clues?
>
>
>
>
>
> Mich Talebzadeh
>



-- 
Marcelo

-

Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Marcelo Vanzin
On Thu, Dec 3, 2015 at 10:32 AM, Mich Talebzadeh  wrote:

> hduser@rhes564::/usr/lib/spark/logs> hive --version
> SLF4J: Found binding in
> [jar:file:/usr/lib/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

As I suggested before, you have Spark's assembly in the Hive
classpath. That's not the way to configure hive-on-spark; if the
documentation you're following tells you to do that, it's wrong.

(And sorry Ted, but please ignore Ted's suggestion. Hive-on-Spark
should work fine with Spark 1.3 if it's configured correctly. You
really don't want to be overriding Hive classes with the ones shipped
in the Spark assembly, regardless of the version of Spark being used.)

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.authenticate=true YARN mode doesn't work

2015-12-05 Thread Marcelo Vanzin
On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy  wrote:
> I am running Spark YARN and trying to enable authentication by setting
> spark.authenticate=true. After enabling authentication I am not able to run
> Spark word count or any other program.

Define "I am not able to run". What doesn't work? What error do you get?

None of the things Ted mentioned should affect this. Enabling that
option should be all that's needed. If you're using the external
shuffle service, make sure the option is also enabled in the service's
configuration.
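
In other words, something along these lines (exact file locations depend on
your setup):

    # spark-defaults.conf (or --conf at submit time)
    spark.authenticate  true

    # If the external shuffle service is in use, the same flag also has to be
    # turned on in that service's own configuration -- for the YARN aux service
    # that typically means the NodeManager side (e.g. yarn-site.xml).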

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.authenticate=true YARN mode doesn't work

2015-12-05 Thread Marcelo Vanzin
Hi Prasad, please reply to the list so that others can benefit / help.

On Sat, Dec 5, 2015 at 4:06 PM, Prasad Reddy  wrote:
> Have you had a chance to try this authentication for any of your projects
> earlier.

Yes, we run with authenticate=true by default. It works fine.

> Got the following exception. Any help would be appreciated.

You have to take a look at your executor logs, not just the driver
logs, to find out the root cause of the problem. You can use "yarn
logs" for that.


-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.authenticate=true YARN mode doesn't work

2015-12-07 Thread Marcelo Vanzin
Prasad,

As I mentioned in my first reply, you need to enable
spark.authenticate in the shuffle service's configuration too for this
to work. It doesn't seem like you have done that.

On Sun, Dec 6, 2015 at 5:09 PM, Prasad Reddy  wrote:
> Hi Marcelo,
>
> I am attaching all container logs. can you please take a look at it when you
> get a chance.
>
> Thanks
> Prasad
>
> On Sat, Dec 5, 2015 at 2:30 PM, Marcelo Vanzin  wrote:
>>
>> On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy  wrote:
>> > I am running Spark YARN and trying to enable authentication by setting
>> > spark.authenticate=true. After enabling authentication I am not able to
>> > run Spark word count or any other program.
>>
>> Define "I am not able to run". What doesn't work? What error do you get?
>>
>> None of the things Ted mentioned should affect this. Enabling that
>> option should be all that's needed. If you're using the external
>> shuffle service, make sure the option is also enabled in the service's
>> configuration.
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark classpath issue duplicate jar with diff versions

2015-07-24 Thread Marcelo Vanzin
(bcc: user@spark, cc: cdh-user@cloudera)

This is a CDH issue, so I'm moving it to the CDH mailing list.

We're taking a look at how we're packaging dependencies so that these
issues happen less when running on CDH. But in the meantime, instead of
using "--jars", you could instead add the newer jars to
spark.driver.extraClassPath and spark.executor.extraClassPath (which will
prepend entries to Spark's classpath).
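
For instance (the path is a placeholder; the jar must exist at that location on
every node, since these options do not copy files around):

    spark-submit \
      --conf spark.driver.extraClassPath=/opt/jars/httpcore-4.4.1.jar \
      --conf spark.executor.extraClassPath=/opt/jars/httpcore-4.4.1.jar \
      yourapp.jar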

On Fri, Jul 24, 2015 at 1:02 PM, Shushant Arora 
wrote:

> Hi
>
> I am running a spark stream app on yarn and using apache httpasyncclient
> 4.1
> This client Jar internally has a dependency on jar http-core4.4.1.jar.
>
> This jar's( http-core .jar) old version i.e. httpcore-4.2.5.jar is also
> present in class path and has higher priority in classpath(coming earlier
> in classpath)
> Jar is at /apps/cloudera/parcels/CDH/jars/httpcore-4.2.5.jar
>
> This is conflicting with job and making job to kill.
>
> I have packaged my jobs jar using maven.
> When I ran the job - it killed the executors with below exception :
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 3.0 (TID 76, ip):
> java.lang.NoSuchFieldError: INSTANCE
> at
> org.apache.http.impl.nio.codecs.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
> at
> org.apache.http.impl.nio.codecs.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
>
>
>
> When I specify latets version of http-core in --jars argument then also it
> picked old versioned jar only.
>
> Is there any way to make it not to use spark path's jar rather my jar
> while executing the job ?
>
> Thanks
>



-- 
Marcelo


Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Marcelo Vanzin
Hi Stephen,

There is no such directory currently. If you want to add an existing jar to
every app's classpath, you need to modify two config values:
spark.driver.extraClassPath and spark.executor.extraClassPath.
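
For example, after copying the jar to the same path on every node, something
like this in spark-defaults.conf (the path is a placeholder):

    spark.driver.extraClassPath      /opt/extra-libs/mylib.jar
    spark.executor.extraClassPath    /opt/extra-libs/mylib.jar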

On Mon, Jul 27, 2015 at 10:22 PM, Stephen Boesch  wrote:

> when using spark-submit: which directory contains third party libraries
> that will be loaded on each of the slaves? I would like to scp one or more
> libraries to each of the slaves instead of shipping the contents in the
> application uber-jar.
>
> Note: I did try adding to $SPARK_HOME/lib_managed/jars.   But the
> spark-submit still results in a ClassNotFoundException for classes included
> in the added library.
>
>


-- 
Marcelo


Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
This might be an issue with how pyspark propagates the error back to the
AM. I'm pretty sure this does not happen for Scala / Java apps.

Have you filed a bug?

On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov 
wrote:

> Thanks Corey for your answer,
>
> Do you mean that "final status : SUCCEEDED" in terminal logs means that
> YARN RM could clean the resources after the application has finished
> (application finishing does not necessarily mean succeeded or failed) ?
>
> With that logic it totally makes sense.
>
> Basically the YARN logs does not say anything about the Spark job itself.
> It just says that Spark job resources have been cleaned up after the job
> completed and returned back to Yarn.
>
> It would be great if Yarn logs could also say about the consequence of the
> job, because the user is interested in more about the job final status.
>
> Yarn related logs can be found in RM ,NM, DN, NN log files in detail.
>
> Thanks again.
>
> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet  wrote:
>
>> Elkhan,
>>
>> What does the ResourceManager say about the final status of the job?
>> Spark jobs that run as Yarn applications can fail but still successfully
>> clean up their resources and give them back to the Yarn cluster. Because of
>> this, there's a difference between your code throwing an exception in an
>> executor/driver and the Yarn application failing. Generally you'll see a
>> yarn application fail when there's a memory problem (too much memory being
>> allocated or not enough causing executors to fail multiple times not
>> allowing your job to finish).
>>
>> What I'm seeing from your post is that you had an exception in your
>> application which was caught by the Spark framework which then proceeded to
>> clean up the job and shut itself down- which it did successfully. When you
>> aren't running in the Yarn modes, you aren't seeing any Yarn status that's
>> telling you the Yarn application was successfully shut down, you are just
>> seeing the failure(s) from your drivers/executors.
>>
>>
>>
>> On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov 
>> wrote:
>>
>>> Any updates on this bug ?
>>>
>>> Why Spark log results & Job final status does not match ? (one saying
>>> that job has failed, another stating that job has succeeded)
>>>
>>> Thanks.
>>>
>>>
>>> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov 
>>> wrote:
>>>
 Hi all,

 While running Spark Word count python example with intentional mistake
 in *Yarn cluster mode*, Spark terminal states final status as
 SUCCEEDED, but log files state correct results indicating that the job
 failed.

 Why terminal log output & application log output contradict each other ?

 If i run same job on *local mode* then terminal logs and application
 logs match, where both state that job has failed to expected error in
 python script.

 More details: Scenario

 While running Spark Word count python example on *Yarn cluster mode*,
 if I make intentional error in wordcount.py by changing this line (I'm
 using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0
 versions - which i tested):

 lines = sc.textFile(sys.argv[1], 1)

 into this line:

 lines = sc.textFile(*nonExistentVariable*,1)

 where nonExistentVariable variable was never created and initialized.

 then i run that example with this command (I put README.md into HDFS
 before running this command):

 *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

 The job runs and finishes successfully according the log printed in the
 terminal :
 *Terminal logs*:
 ...
 15/07/23 16:19:17 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:18 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:19 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:20 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:21 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: FINISHED)
 15/07/23 16:19:21 INFO yarn.Client:
  client token: N/A
  diagnostics: Shutdown hook called before final status was reported.
  ApplicationMaster host: 10.0.53.59
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1437693551439
  final status: *SUCCEEDED*
  tracking URL:
 http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
  user: edadashov
 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
 15/07/23 16:19:21 INFO util.Utils: Deleting directory
 /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444

 But if look at log files generated for this application in HDFS - it
 indica

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
BTW this is most probably caused by this line in PythonRunner.scala:

System.exit(process.waitFor())

The YARN backend doesn't like applications calling System.exit().


On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin 
wrote:

> This might be an issue with how pyspark propagates the error back to the
> AM. I'm pretty sure this does not happen for Scala / Java apps.
>
> Have you filed a bug?
>
> On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov 
> wrote:
>
>> Thanks Corey for your answer,
>>
>> Do you mean that "final status : SUCCEEDED" in terminal logs means that
>> YARN RM could clean the resources after the application has finished
>> (application finishing does not necessarily mean succeeded or failed) ?
>>
>> With that logic it totally makes sense.
>>
>> Basically the YARN logs does not say anything about the Spark job itself.
>> It just says that Spark job resources have been cleaned up after the job
>> completed and returned back to Yarn.
>>
>> It would be great if Yarn logs could also say about the consequence of
>> the job, because the user is interested in more about the job final status.
>>
>> Yarn related logs can be found in RM ,NM, DN, NN log files in detail.
>>
>> Thanks again.
>>
>> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet  wrote:
>>
>>> Elkhan,
>>>
>>> What does the ResourceManager say about the final status of the job?
>>> Spark jobs that run as Yarn applications can fail but still successfully
>>> clean up their resources and give them back to the Yarn cluster. Because of
>>> this, there's a difference between your code throwing an exception in an
>>> executor/driver and the Yarn application failing. Generally you'll see a
>>> yarn application fail when there's a memory problem (too much memory being
>>> allocated or not enough causing executors to fail multiple times not
>>> allowing your job to finish).
>>>
>>> What I'm seeing from your post is that you had an exception in your
>>> application which was caught by the Spark framework which then proceeded to
>>> clean up the job and shut itself down- which it did successfully. When you
>>> aren't running in the Yarn modes, you aren't seeing any Yarn status that's
>>> telling you the Yarn application was successfully shut down, you are just
>>> seeing the failure(s) from your drivers/executors.
>>>
>>>
>>>
>>> On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov 
>>> wrote:
>>>
>>>> Any updates on this bug ?
>>>>
>>>> Why Spark log results & Job final status does not match ? (one saying
>>>> that job has failed, another stating that job has succeeded)
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> While running Spark Word count python example with intentional mistake
>>>>> in *Yarn cluster mode*, Spark terminal states final status as
>>>>> SUCCEEDED, but log files state correct results indicating that the job
>>>>> failed.
>>>>>
>>>>> Why terminal log output & application log output contradict each other
>>>>> ?
>>>>>
>>>>> If i run same job on *local mode* then terminal logs and application
>>>>> logs match, where both state that job has failed to expected error in
>>>>> python script.
>>>>>
>>>>> More details: Scenario
>>>>>
>>>>> While running Spark Word count python example on *Yarn cluster mode*,
>>>>> if I make intentional error in wordcount.py by changing this line (I'm
>>>>> using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0
>>>>> versions - which i tested):
>>>>>
>>>>> lines = sc.textFile(sys.argv[1], 1)
>>>>>
>>>>> into this line:
>>>>>
>>>>> lines = sc.textFile(*nonExistentVariable*,1)
>>>>>
>>>>> where nonExistentVariable variable was never created and initialized.
>>>>>
>>>>> then i run that example with this command (I put README.md into HDFS
>>>>> before running this command):
>>>>>
>>>>> *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*
>>>>>
>>>&g

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
First, it's kinda confusing to change subjects in the middle of a thread...

On Tue, Jul 28, 2015 at 1:44 PM, Elkhan Dadashov 
wrote:

> @Marcelo
> *Question1*:
> Do you know why launching Spark job through SparkLauncher in Java, stdout
> logs (i.e., INFO Yarn.Client) are written into error stream
> (spark.getErrorStream()) instead of output stream ?
>

All Spark jobs write that information to stderr. If you run "spark-submit
... 2>/dev/null" you won't see any of those logs.


> *Question2*:
>
> What is the best way to know about Spark job progress & final status in
> Java ?
>

There's no API for that. You'd have to write something, probably
implementing a SparkListener.
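
A rough sketch of what that could look like (the logging calls are just
illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

    // Reports job start/end from inside the application.
    class ProgressListener extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"Job ${jobStart.jobId} started, ${jobStart.stageInfos.size} stages")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
    }

    // register it on the context:
    //   sc.addSparkListener(new ProgressListener())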

-- 
Marcelo


Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Marcelo Vanzin
Can you run the windows batch files (e.g. spark-submit.cmd) from the cygwin
shell?

On Tue, Jul 28, 2015 at 7:26 PM, Proust GZ Feng  wrote:

> Hi, Owen
>
> Adding back the cygwin classpath detection gets past the issue mentioned
> before, but there seems to be a lack of further support in the launch lib;
> see the stacktrace below
>
> LAUNCH_CLASSPATH:
> C:\spark-1.4.0-bin-hadoop2.3\lib\spark-assembly-1.4.0-hadoop2.3.0.jar
> java -cp
> *C:\spark-1.4.0-bin-hadoop2.3\lib\spark-assembly-1.4.0-hadoop2.3.0.jar*
> org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit
> --driver-class-path ../thirdparty/lib/db2-jdbc4-95fp6a/db2jcc4.jar
> --properties-file conf/spark.properties
> target/scala-2.10/price-scala-assembly-15.4.0-SNAPSHOT.jar
> Exception in thread "main" java.lang.IllegalStateException: Library
> directory '*C:\c\spark-1.4.0-bin-hadoop2.3\lib_managed\jars*' does not
> exist.
> at
> org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:229)
> at
> org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:215)
> at
> org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:115)
> at
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:192)
> at
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:117)
> at org.apache.spark.launcher.Main.main(Main.java:74)
>
> Thanks
> Proust
>
>
>
>
> From:Sean Owen 
> To:Proust GZ Feng/China/IBM@IBMCN
> Cc:user 
> Date:07/28/2015 06:54 PM
> Subject:Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0
> --
>
>
>
> Does adding back the cygwin detection and this clause make it work?
>
> if $cygwin; then
>  CLASSPATH="`cygpath -wp "$CLASSPATH"`"
> fi
>
> If so I imagine that's fine to bring back, if that's still needed.
>
> On Tue, Jul 28, 2015 at 9:49 AM, Proust GZ Feng  wrote:
> > Thanks Owen, the problem under Cygwin is that while running spark-submit
> > under 1.4.0, it simply reports
> >
> > Error: Could not find or load main class org.apache.spark.launcher.Main
> >
> > This is because under Cygwin spark-class make the LAUNCH_CLASSPATH as
> > "/c/spark-1.4.0-bin-hadoop2.3/lib/spark-assembly-1.4.0-hadoop2.3.0.jar"
> > But under Cygwin java in Windows cannot recognize the classpath, so below
> > command simply error out
> >
> >  java -cp
> > /c/spark-1.4.0-bin-hadoop2.3/lib/spark-assembly-1.4.0-hadoop2.3.0.jar
> > org.apache.spark.launcher.Main
> > Error: Could not find or load main class org.apache.spark.launcher.Main
> >
> > Thanks
> > Proust
> >
> >
> >
> > From:Sean Owen 
> > To:Proust GZ Feng/China/IBM@IBMCN
> > Cc:user 
> > Date:07/28/2015 02:20 PM
> > Subject:Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0
> > 
> >
> >
> >
> > It wasn't removed, but rewritten. Cygwin is just a distribution of
> > POSIX-related utilities so you should be able to use the normal .sh
> > scripts. In any event, you didn't say what the problem is?
> >
> > On Tue, Jul 28, 2015 at 5:19 AM, Proust GZ Feng 
> wrote:
> >> Hi, Spark Users
> >>
> >> Looks like Spark 1.4.0 cannot work with Cygwin due to the removing of
> >> Cygwin
> >> support in bin/spark-class
> >>
> >> The changeset is
> >>
> >>
> https://github.com/apache/spark/commit/517975d89d40a77c7186f488547eed11f79c1e97#diff-fdf4d3e600042c63ffa17b692c4372a3
> >>
> >> The changeset said "Add a library for launching Spark jobs
> >> programmatically", but how to use it in Cygwin?
> >> I'm wondering any solutions available to make it work in Windows?
> >>
> >>
> >> Thanks
> >> Proust
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


-- 
Marcelo


Re: Problem submiting an script .py against an standalone cluster.

2015-07-30 Thread Marcelo Vanzin
Can you share the part of the code in your script where you create the
SparkContext instance?

On Thu, Jul 30, 2015 at 7:19 PM, fordfarline  wrote:

> Hi All,
>
> I'm having an issue when launching an app (python) against a standalone
> cluster: it runs locally instead, as it doesn't reach the cluster.
> It's the first time I try the cluster; in local mode it works ok.
>
> i made this:
>
> -> /home/user/Spark/spark-1.3.0-bin-hadoop2.4/sbin/start-all.sh # Master
> and
> worker are up in localhost:8080/4040
> -> /home/user/Spark/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --master
> spark://localhost:7077 Script.py
>* The script runs ok but in local :(i can check it in
> localhost:4040, but i don't see any job in cluster UI
>
> The only warning it's:
> WARN Utils: Your hostname, localhost resolves to a loopback address:
> 127.0.0.1; using 192.168.1.132 instead (on interface eth0)
>
> I set SPARK_LOCAL_IP=127.0.0.1 to solve this; at least the warning
> disappears,
> but the script keeps executing locally, not on the cluster.
>
> I think it has something to do with my virtual server:
> -> Host Server: Linux Mint
> -> The Virtual Server (workstation 10) where runs Spark is Linux Mint as
> well.
>
> Any ideas what am i doing wrong?
>
> Thanks in advance for any suggestion, i getting mad on it!!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-submiting-an-script-py-against-an-standalone-cluster-tp24091.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Marcelo


Re: How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread Marcelo Vanzin
"file" can be a directory (look at all children) or even a glob
("/path/*.ext", for example).

On Fri, Jul 31, 2015 at 11:35 AM, swetha  wrote:

> Hi,
>
> How to add multiple sequence files from HDFS to a Spark Context to do Batch
> processing? I have something like the following in my code. Do I have to
> add
> Comma separated list of Sequence file paths to the Spark Context.
>
>  val data  = if(args.length>0 && args(0)!=null)
>   sc.sequenceFile(file,  classOf[LongWritable], classOf[Text]).
> map{case (x, y) => (x.toString, y.toString)}
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-multiple-sequence-files-from-HDFS-to-a-Spark-Context-to-do-Batch-processing-tp24102.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Marcelo


Re: No event logs in yarn-cluster mode

2015-08-01 Thread Marcelo Vanzin
On Sat, Aug 1, 2015 at 9:25 AM, Akmal Abbasov 
wrote:

> When I run locally (./run-example SparkPi), the event logs are being
> created, and I can start history server.
> But when I am trying
> ./spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-cluster file:///opt/hadoop/spark/examples/src/main/python/pi.py
>

Did you look for the event log on the machine where the Spark driver ran?
You're using a "file:" URL and on yarn-cluster, that is in some random
machine in the cluster, not your local machine launching the job.

Which is why you should probably write these logs to HDFS.
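
For instance, something like this in spark-defaults.conf (the directory name is
a placeholder and must already exist in HDFS):

    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///user/spark/applicationHistory
    spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory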


Re: Contributors group and starter task

2015-08-03 Thread Marcelo Vanzin
Hi Namit,

There's no need to assign a bug to yourself to say you're working on it.
The recommended way is to just post a PR on github - the bot will update
the bug saying that you have a patch open to fix the issue.


On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya 
wrote:

> My username on the Apache JIRA is katariya.namit. Could one of the admins
> please add me to the contributors group so that I can have a starter task
> assigned to myself?
>
> Thanks,
> Namit
>
>


-- 
Marcelo


Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Marcelo Vanzin
That should not be a fatal error, it's just a noisy exception.

Anyway, it should go away if you add YARN gateways to those nodes (aside
from Spark gateways).

On Mon, Aug 3, 2015 at 7:10 PM, Upen N  wrote:

> Hi,
> I recently installed Cloudera CDH 5.4.4. Spark comes shipped with this
> version. I created Spark gateways. But I get the following error when run
> Spark shell from the gateway. Does anyone have any similar experience ? If
> so, please share the solution. Google shows to copy the Conf files from
> data nodes to gateway nodes. But I highly doubt if that is the right fix.
>
> Thanks
> Upender
>
> etc/hadoop/conf.cloudera.yarn/topology.py
> java.io.IOException: Cannot run program
> "/etc/hadoop/conf.cloudera.yarn/topology.py"
>
>


-- 
Marcelo


Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Marcelo Vanzin
On Fri, Aug 14, 2015 at 2:11 PM, Varadhan, Jawahar <
varad...@yahoo.com.invalid> wrote:

> And hence, I was planning to use Spark Streaming with Kafka or Flume with
> Kafka. But flume runs on a JVM and may not be the best option as the huge
> file will create memory issues. Please suggest someway to run it inside the
> cluster.
>

I'm not sure why you think memory would be a problem. You don't need to
read all 10GB into memory to transfer the file.

I'm far from the best person to give advice about Flume, but this seems
like it would be a job more in line with what Sqoop does; although a quick
search seems to indicate Sqoop cannot yet read from FTP.

But writing your own code to read from an FTP server when a message arrives
from Kafka shouldn't really be hard.
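
A rough sketch of that kind of streaming copy (the URL and target path are made
up); it never holds more than a small buffer in memory:

    import java.net.URL
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    def copyFtpToHdfs(ftpUrl: String, target: String): Unit = {
      // e.g. ftpUrl = "ftp://user:pass@ftphost/big-file.dat"
      val in = new URL(ftpUrl).openStream()
      val fs = FileSystem.get(new Configuration())
      val out = fs.create(new Path(target))
      // copies with a fixed-size buffer and closes both streams when done
      IOUtils.copyBytes(in, out, 64 * 1024, true)
    }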

-- 
Marcelo


Re: Scala: How to match a java object????

2015-08-18 Thread Marcelo Vanzin
On Tue, Aug 18, 2015 at 12:59 PM,  wrote:
>
> 5 match { case java.math.BigDecimal => 2 }

5 match { case _: java.math.BigDecimal => 2 }

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala: How to match a java object????

2015-08-18 Thread Marcelo Vanzin
On Tue, Aug 18, 2015 at 1:19 PM,   wrote:
> Hi, Can you please elaborate? I am confused :-)

You did note that the two pieces of code are different, right?

See http://docs.scala-lang.org/tutorials/tour/pattern-matching.html
for how to match things in Scala, especially the "typed pattern"
example.
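
For example:

    def classify(x: Any): Int = x match {
      case _: java.math.BigDecimal => 2   // typed pattern: checks the runtime type
      case _: Int                  => 1
      case _                       => 0
    }

    // classify(new java.math.BigDecimal("5")) == 2, classify(5) == 1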

> -Original Message-
> From: Marcelo Vanzin [mailto:van...@cloudera.com]
> Sent: Tuesday, August 18, 2015 5:15 PM
> To: Ellafi, Saif A.
> Cc: wrbri...@gmail.com; user@spark.apache.org
> Subject: Re: Scala: How to match a java object
>
> On Tue, Aug 18, 2015 at 12:59 PM,  wrote:
>>
>> 5 match { case java.math.BigDecimal => 2 }
>
> 5 match { case _: java.math.BigDecimal => 2 }
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: build spark 1.4.1 with JDK 1.6

2015-08-21 Thread Marcelo Vanzin
That was only true until Spark 1.3. Spark 1.4 can be built with JDK7
and pyspark will still work.

On Fri, Aug 21, 2015 at 8:29 AM, Chen Song  wrote:
> Thanks Sean.
>
> So how PySpark is supported. I thought PySpark needs jdk 1.6.
>
> Chen
>
> On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen  wrote:
>>
>> Spark 1.4 requires Java 7.
>>
>>
>> On Fri, Aug 21, 2015, 3:12 PM Chen Song  wrote:
>>>
>>> I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
>>> PySpark, I used JDK 1.6.
>>>
>>> I got the following error,
>>>
>>> [INFO] --- scala-maven-plugin:3.2.0:testCompile
>>> (scala-test-compile-first) @ spark-streaming_2.10 ---
>>>
>>> java.lang.UnsupportedClassVersionError: org/apache/hadoop/io/LongWritable
>>> : Unsupported major.minor version 51.0
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>> at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
>>> at
>>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>>
>>> I know that is due to the hadoop jar for cdh5.4.0 is built with JDK 7.
>>> Anyone has done this before?
>>>
>>> Thanks,
>>>
>>> --
>>> Chen Song
>>>
>
>
>
> --
> Chen Song
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "CDH Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cdh-user+unsubscr...@cloudera.org.
> For more options, visit https://groups.google.com/a/cloudera.org/d/optout.



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
Hi Utkarsh,

Unfortunately that's not going to be easy. Since Spark bundles all
dependent classes into a single fat jar file, to remove that
dependency you'd need to modify Spark's assembly jar (potentially in
all your nodes). Doing that per-job is even trickier, because you'd
probably need some kind of script to inject the correct binding into
Spark's classpath.

That being said, that message is not an error, it's more of a noisy
warning. I'd expect slf4j to use the first binding available - in your
case, logback-classic. Is that not the case?


On Mon, Aug 24, 2015 at 2:50 PM, Utkarsh Sengar  wrote:
> Continuing this discussion:
> http://apache-spark-user-list.1001560.n3.nabble.com/same-log4j-slf4j-error-in-spark-9-1-td5592.html
>
> I am getting this error when I use logback-classic.
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> I need to use logback-classic for my current project, so I am trying to
> ignore "slf4j-log4j12" from spark:
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>1.4.1</version>
>     <exclusions>
>         <exclusion>
>             <groupId>org.slf4j</groupId>
>             <artifactId>slf4j-log4j12</artifactId>
>         </exclusion>
>     </exclusions>
> </dependency>
>
> Now, when I run my job from Intellij (which sets the classpath), things work
> perfectly.
>
> But when I run my job via spark-submit:
> ~/spark-1.4.1-bin-hadoop2.4/bin/spark-submit --class runner.SparkRunner
> spark-0.1-SNAPSHOT-jar-with-dependencies.jar
> My job fails because spark-submit sets up the classpath and it re-adds the
> slf4j-log4j12.
>
> I am not adding spark jar to the uber-jar via the maven assembly plugin:
>  
> 
> ..
> false
> 
> org.apache.spark:spark-core_2.10
> 
> 
> 
>
> So how can I exclude "slf4j-log4j12.jar" when I submit a job via
> spark-submit (on a per job basis)?
>
> --
> Thanks,
> -Utkarsh



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
Hi Utkarsh,

A quick look at slf4j's source shows it loads the first
"StaticLoggerBinder" in your classpath. How are you adding the logback
jar file to spark-submit?

If you use "spark.driver.extraClassPath" and
"spark.executor.extraClassPath" to add the jar, it should take
precedence over the log4j binding embedded in the Spark assembly.
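
Something along these lines, for example (the paths are placeholders, and the
jars need to be present at those paths on the driver and executor machines):

    spark-submit \
      --conf spark.driver.extraClassPath=/opt/logback/logback-classic-1.1.2.jar:/opt/logback/logback-core-1.1.2.jar \
      --conf spark.executor.extraClassPath=/opt/logback/logback-classic-1.1.2.jar:/opt/logback/logback-core-1.1.2.jar \
      --class runner.SparkRunner spark-0.1-SNAPSHOT-jar-with-dependencies.jar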


On Mon, Aug 24, 2015 at 3:15 PM, Utkarsh Sengar  wrote:
> Hi Marcelo,
>
> When I add this exclusion rule to my pom:
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>1.4.1</version>
>     <exclusions>
>         <exclusion>
>             <groupId>org.slf4j</groupId>
>             <artifactId>slf4j-log4j12</artifactId>
>         </exclusion>
>     </exclusions>
> </dependency>
>
> The SparkRunner class works fine (from IntelliJ) but when I build a jar and
> submit it to spark-submit:
>
> I get this error:
> Caused by: java.lang.ClassCastException: org.slf4j.impl.Log4jLoggerFactory
> cannot be cast to ch.qos.logback.classic.LoggerContext
> at
> com.opentable.logging.AssimilateForeignLogging.assimilate(AssimilateForeignLogging.java:68)
> at
> com.opentable.logging.AssimilateForeignLoggingHook.automaticAssimilationHook(AssimilateForeignLoggingHook.java:28)
> at com.opentable.logging.Log.(Log.java:31)
>
> Which is this here (our logging lib is open sourced):
> https://github.com/opentable/otj-logging/blob/master/logging/src/main/java/com/opentable/logging/AssimilateForeignLogging.java#L68
>
> Thanks,
> -Utkarsh
>
>
>
>
> On Mon, Aug 24, 2015 at 3:04 PM, Marcelo Vanzin  wrote:
>>
>> Hi Utkarsh,
>>
>> Unfortunately that's not going to be easy. Since Spark bundles all
>> dependent classes into a single fat jar file, to remove that
>> dependency you'd need to modify Spark's assembly jar (potentially in
>> all your nodes). Doing that per-job is even trickier, because you'd
>> probably need some kind of script to inject the correct binding into
>> Spark's classpath.
>>
>> That being said, that message is not an error, it's more of a noisy
>> warning. I'd expect slf4j to use the first binding available - in your
>> case, logback-classic. Is that not the case?
>>
>>
>> On Mon, Aug 24, 2015 at 2:50 PM, Utkarsh Sengar 
>> wrote:
>> > Continuing this discussion:
>> >
>> > http://apache-spark-user-list.1001560.n3.nabble.com/same-log4j-slf4j-error-in-spark-9-1-td5592.html
>> >
>> > I am getting this error when I use logback-classic.
>> >
>> > SLF4J: Class path contains multiple SLF4J bindings.
>> > SLF4J: Found binding in
>> >
>> > [jar:file:.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> > SLF4J: Found binding in
>> >
>> > [jar:file:.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >
>> > I need to use logback-classic for my current project, so I am trying to
>> > ignore "slf4j-log4j12" from spark:
>> > 
>> > org.apache.spark
>> > spark-core_2.10
>> > 1.4.1
>> > 
>> > 
>> > org.slf4j
>> > slf4j-log4j12
>> > 
>> > 
>> > 
>> >
>> > Now, when I run my job from Intellij (which sets the classpath), things
>> > work
>> > perfectly.
>> >
>> > But when I run my job via spark-submit:
>> > ~/spark-1.4.1-bin-hadoop2.4/bin/spark-submit --class runner.SparkRunner
>> > spark-0.1-SNAPSHOT-jar-with-dependencies.jar
>> > My job fails because spark-submit sets up the classpath and it re-adds
>> > the
>> > slf4j-log4j12.
>> >
>> > I am not adding spark jar to the uber-jar via the maven assembly plugin:
>> >  
>> > 
>> > ..
>> > false
>> > 
>> > org.apache.spark:spark-core_2.10
>> > 
>> > 
>> > 
>> >
>> > So how can I exclude "slf4j-log4j12.jar" when I submit a job via
>> > spark-submit (on a per job basis)?
>> >
>> > --
>> > Thanks,
>> > -Utkarsh
>>
>>
>>
>> --
>> Marcelo
>
>
>
>
> --
> Thanks,
> -Utkarsh



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
On Mon, Aug 24, 2015 at 3:58 PM, Utkarsh Sengar  wrote:
> That didn't work since "extraClassPath" flag was still appending the jars at
> the end, so its still picking the slf4j jar provided by spark.

Out of curiosity, how did you verify this? The "extraClassPath"
options are supposed to prepend entries to the classpath, and the code
seems to be doing that. If it's not really doing that in some case,
it's a bug that needs to be fixed.

Another option is setting the "SPARK_CLASSPATH" env variable,
which is deprecated, but might come in handy in case there is actually
a bug in handling those options.


-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Marcelo Vanzin
This probably means your app is failing and the second attempt is
hitting that issue. You may fix the "directory already exists" error
by setting
spark.eventLog.overwrite=true in your conf, but most probably that
will just expose the actual error in your app.

On Tue, Aug 25, 2015 at 9:37 AM, Varadhan, Jawahar
 wrote:
> Here is the error
>
>
> yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason:
> User class threw exception: Log directory
> hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302
> already exists!)
>
>
> I am using cloudera 5.3.2 with Spark 1.2.0
>
>
> Any help is appreciated.
>
>
> Thanks
>
> Jay
>
>
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar  wrote:
> Now I am going to try it out on our mesos cluster.
> I assumed "spark.executor.extraClassPath" takes a comma-separated list of jars
> the way "--jars" does, but it should be ":"-separated like a regular classpath.

Ah, yes, those options are just raw classpath strings. Also, they
don't cause jars to be copied to the cluster. You'll need the jar to
be available at the same location on all cluster machines.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 1:50 PM, Utkarsh Sengar  wrote:
> So do I need to manually copy these 2 jars on my spark executors?

Yes. I can think of a way to work around that if you're using YARN,
but not with other cluster managers.

> On Tue, Aug 25, 2015 at 10:51 AM, Marcelo Vanzin 
> wrote:
>>
>> On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar 
>> wrote:
>> > Now I am going to try it out on our mesos cluster.
>> > I assumed "spark.executor.extraClassPath" takes csv as jars the way
>> > "--jars"
>> > takes it but it should be ":" separated like a regular classpath jar.
>>
>> Ah, yes, those options are just raw classpath strings. Also, they
>> don't cause jars to be copied to the cluster. You'll need the jar to
>> be available at the same location on all cluster machines.


-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Marcelo Vanzin
On Wed, Aug 26, 2015 at 2:03 PM, Jerry  wrote:
> Assuming your submitting the job from terminal; when main() is called, if I
> try to open a file locally, can I assume the machine is always the one I
> submitted the job from?

See the "--deploy-mode" option. "client" works as you describe;
"cluster" does not.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Ranger-like Security on Spark

2015-09-03 Thread Marcelo Vanzin
On Thu, Sep 3, 2015 at 5:15 PM, Matei Zaharia  wrote:
> Even simple Spark-on-YARN should run as the user that submitted the job,
> yes, so HDFS ACLs should be enforced. Not sure how it plays with the rest of
> Ranger.

It's slightly more complicated than that (without kerberos, the
underlying process runs as the same user running the YARN daemons, but
the connections to HDFS and other Hadoop services identify as the user
who submitted the application), but the end effect is what Matei
describes. I also do not know about how Ranger enforces things.

Also note that "simple authentication" is not secure at all. You're
basically just asking your users to be nice instead of actually
enforcing anything. Any user can tell YARN that he's actually someone
else when starting the application, and YARN will believe him. Just
say "HADOOP_USER_NAME=somebodyelse" and you're good to go!

> On Sep 3, 2015, at 4:57 PM, Jörn Franke  wrote:
>
> Well if it needs to read from hdfs then it will adhere to the permissions
> defined there And/or in ranger. However, I am not aware that you can protect
> dataframes, tables or streams in general in Spark.
>
>
> Le jeu. 3 sept. 2015 à 21:47, Daniel Schulz  a
> écrit :
>>
>> Hi Matei,
>>
>> Thanks for your answer.
>>
>> My question is regarding simple authenticated Spark-on-YARN only, without
>> Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my
>> HDFS user and only be able to access files I am entitled to read/write? Will
>> it enforce HDFS ACLs and Ranger policies as well?
>>
>> Best regards, Daniel.
>>
>> > On 03 Sep 2015, at 21:16, Matei Zaharia  wrote:
>> >
>> > If you run on YARN, you can use Kerberos, be authenticated as the right
>> > user, etc in the same way as MapReduce jobs.
>> >
>> > Matei
>> >
>> >> On Sep 3, 2015, at 1:37 PM, Daniel Schulz
>> >>  wrote:
>> >>
>> >> Hi,
>> >>
>> >> I really enjoy using Spark. An obstacle to sell it to our clients
>> >> currently is the missing Kerberos-like security on a Hadoop with simple
>> >> authentication. Are there plans, a proposal, or a project to deliver a
>> >> Ranger plugin or something similar to Spark. The target is to 
>> >> differentiate
>> >> users and their privileges when reading and writing data to HDFS? Is
>> >> Kerberos my only option then?


-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Marcelo Vanzin
Hi,

Just "spark.executor.userClassPathFirst" is not enough. You should
also set "spark.driver.userClassPathFirst". Also not that I don't
think this was really tested with the shell, but that should work with
regular apps started using spark-submit.

If that doesn't work, I'd recommend shading, as others already have.
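
For example (the protobuf jar name and path are placeholders):

    spark-submit \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      --jars /path/to/protobuf-java-3.0.0-beta-1.jar \
      yourapp.jar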

On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang  wrote:
> I used the --conf spark.files.userClassPathFirst=true  in the spark-shell
> option, it still gave me the error: java.lang.NoSuchFieldError: unknownFields
> if I use protobuf 3.
>
> The output says spark.files.userClassPathFirst is deprecated and suggest
> using spark.executor.userClassPathFirst. I tried that and it did not work
> either.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Marcelo Vanzin
On Mon, Sep 14, 2015 at 6:55 AM, Adrian Bridgett  wrote:
> 15/09/14 13:00:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 10.1.200.245): java.lang.IllegalArgumentException:
> java.net.UnknownHostException: nameservice1
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)

This looks like you're trying to connect to an HA HDFS service but you
have not provided the proper hdfs-site.xml for your app; then, instead
of recognizing "nameservice1" as an HA nameservice, it thinks it's an
actual NN address, tries to connect to it, and fails.

If you provide the correct hdfs-site.xml to your app (by placing it in
$SPARK_HOME/conf or setting HADOOP_CONF_DIR to point to the conf
directory), it should work.
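
For example (adjust the directory to wherever your HA client configs live):

    # must contain the hdfs-site.xml that defines the "nameservice1" nameservice
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    spark-submit --master mesos://<master>:5050 yourapp.jar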

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exception initializing JavaSparkContext

2015-09-21 Thread Marcelo Vanzin
What Spark package are you using? In particular, which hadoop version?

On Mon, Sep 21, 2015 at 9:14 AM, ekraffmiller
 wrote:
> Hi,
> I’m trying to run a simple test program to access Spark through Java.  I’m
> using JDK 1.8, and Spark 1.5.  I’m getting an Exception from the
> JavaSparkContext constructor.  My initialization code matches all the sample
> code I’ve found online, so not sure what I’m doing wrong.
>
> Here is my code:
>
> SparkConf conf = new SparkConf().setAppName("Simple Application");
> conf.setMaster("local");
> conf.setAppName("my app");
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> The stack trace of the Exception:
>
> java.lang.ExceptionInInitializerError: null
> at java.lang.Class.getField(Class.java:1690)
> at
> org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:220)
> at
> org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
> at
> org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
> at
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:189)
> at
> org.apache.spark.util.ShutdownHookManager$.(ShutdownHookManager.scala:58)
> at
> org.apache.spark.util.ShutdownHookManager$.(ShutdownHookManager.scala)
> at
> org.apache.spark.storage.DiskBlockManager.addShutdownHook(DiskBlockManager.scala:147)
> at
> org.apache.spark.storage.DiskBlockManager.(DiskBlockManager.scala:54)
> at org.apache.spark.storage.BlockManager.(BlockManager.scala:75)
> at 
> org.apache.spark.storage.BlockManager.(BlockManager.scala:173)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:345)
> at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
> at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:276)
> at org.apache.spark.SparkContext.(SparkContext.scala:441)
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
> at
> edu.harvard.iq.text.core.spark.SparkControllerTest.testMongoRDD(SparkControllerTest.java:63)
>
> Thanks,
> Ellen
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Exception-initializing-JavaSparkContext-tp24755.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Marcelo Vanzin
Did you look at your application's logs (using the "yarn logs" command?).

That error means your application is failing to create a SparkContext.
So either you have a bug in your code, or there will be some error in
the log pointing at the actual reason for the failure.
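
For reference, the full container logs of a finished application can usually be
pulled with (the application id below is a placeholder):

  yarn logs -applicationId application_1443123456789_0001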

On Tue, Sep 22, 2015 at 5:49 PM, Bryan Jeffrey  wrote:
> Hello.
>
> I have a Spark streaming job running on a cluster managed by Yarn.  The
> spark streaming job starts and receives data from Kafka.  It is processing
> well and then after several seconds I see the following error:
>
> 15/09/22 14:53:49 ERROR yarn.ApplicationMaster: SparkContext did not
> initialize after waiting for 10 ms. Please check earlier log output for
> errors. Failing the application.
> 15/09/22 14:53:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 13, (reason: Timed out waiting for SparkContext.)
>
> The spark process is then (obviously) shut down by Yarn.
>
> What do I need to change to allow Yarn to initialize Spark streaming (vs.
> batch) jobs?
>
> Thank you,
>
> Bryan Jeffrey



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Marcelo Vanzin
But that's not the complete application log. You say the streaming
context is initialized, but can you show that in the logs? There's
something happening that is causing the SparkContext to not be
registered with the YARN backend, and that's why your application is
being killed.

If you can share the complete log or the code, that would clarify things.

On Wed, Sep 23, 2015 at 3:20 PM, Bryan  wrote:
> The error below is from the application logs. The spark streaming context is
> initialized and actively processing data when yarn claims that the context
> is not initialized.


-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hive on spark query error

2015-09-25 Thread Marcelo Vanzin
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen  wrote:
> In spark-defaults.conf the spark.master  is  spark://hostname:7077.  From
> hive-site.xml  
> <property>
>   <name>spark.master</name>
>   <value>hostname</value>
> </property>

That's not a valid value for spark.master (as the error indicates).
You should set it to "spark://hostname:7077", as you have it in
spark-defaults.conf (or perhaps remove the setting from hive-site.xml,
I think hive will honor your spark-defaults.conf).
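
In other words, if you keep the setting in hive-site.xml at all, it should be a
full URL, something like:

  <property>
    <name>spark.master</name>
    <value>spark://hostname:7077</value>
  </property>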

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hive on spark query error

2015-09-25 Thread Marcelo Vanzin
Seems like you have "hive.server2.enable.doAs" enabled; you can either
disable it, or configure hs2 so that the user running the service
("hadoop" in your case) can impersonate others.

See:
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html
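
As a rough sketch, the core-site.xml entries that let "hadoop" impersonate other
users look like this (the wildcard values are permissive placeholders; restrict
them to specific hosts/groups as needed):

  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>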

On Fri, Sep 25, 2015 at 10:33 AM, Garry Chen  wrote:
> 2015-09-25 13:31:16,245 INFO  [stderr-redir-1]: client.SparkClientImpl 
> (SparkClientImpl.java:run(569)) - ERROR: 
> org.apache.hadoop.security.authorize.AuthorizationException: User: hadoop is 
> not allowed to impersonate HIVEAPP

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-jdbc impala with kerberos using yarn-client

2016-08-24 Thread Marcelo Vanzin
I believe the Impala JDBC driver is mostly the same as the Hive
driver, but I could be wrong. In any case, the right place to ask that
question is the Impala groups (see http://impala.apache.org/).

On a side note, it is a little odd that you're trying to read data
from Impala using JDBC, instead of just telling Spark to read it
directly with its Hive support...
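
A minimal sketch of that approach, assuming Spark 1.x and that the Impala table
is visible through the shared Hive metastore (database/table names below are
placeholders):

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.sql.DataFrame;
  import org.apache.spark.sql.hive.HiveContext;

  public class ReadImpalaTable {
    public static void main(String[] args) {
      JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("read-impala-table"));
      // Impala and Hive share the same metastore, so Spark's Hive support can see the table.
      HiveContext hive = new HiveContext(jsc.sc());
      DataFrame df = hive.table("my_db.my_table");
      df.show();
      jsc.stop();
    }
  }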


On Tue, Aug 23, 2016 at 2:12 PM, twisterius  wrote:
> I am trying to use the spark-jdbc package to access an impala table via a
> spark data frame. From my understanding
> (https://issues.apache.org/jira/browse/SPARK-12312) When loading DataFrames
> from JDBC datasource with Kerberos authentication, remote executors
> (yarn-client/cluster etc. modes) fail to establish a connection due to lack
> of Kerberos ticket or ability to generate it. I found a solution to this
> issue by creating an jdbc driver which properly handles kerberos
> authenticatation
> (https://datamountaineer.com/2016/01/15/spark-jdbc-sql-server-kerberos/).
> However I cannot find the source the impala jdbc driver online. Should I
> just use a hive driver to enable kerberos authentication for impala, or is
> there a location where I can find the impala jdbc driver source. Also is the
> ticket listed above SPARK-12312 accurate, or is there an out of the box way
> for me to connect to a kerberized impala using
> sqlContext.load("jdbc",options) without having to rewrite the impala driver?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-jdbc-impala-with-kerberos-using-yarn-client-tp27589.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark launcher handle and listener not giving state

2016-08-29 Thread Marcelo Vanzin
You haven't said which version of Spark you are using. The state API
only works if the underlying Spark version is also 1.6 or later.

On Mon, Aug 29, 2016 at 4:36 PM, ckanth99  wrote:
> Hi All,
>
> I have a web application which will submit spark jobs on Cloudera spark
> cluster using spark launcher library.
>
> It is successfully submitting the spark job to cluster. However it is not
> calling back the listener class methods and also the getState() on returned
> SparkAppHandle never changes from "UNKNOWN" even after job finishes
> execution on cluster.
>
> I am using yarn-cluster mode. Here is my code. Is anything else needs to be
> done or is this a bug?
>
> SparkLauncher launcher = new SparkLauncher()
>     .setSparkHome("sparkhome")
>     .setMaster("yarn-cluster")
>     .setAppResource("spark job jar file")
>     .setMainClass("spark job driver class")
>     .setAppName("appname")
>     .addAppArgs(argsArray)
>     .setVerbose(true)
>     .addSparkArg("--verbose");
> SparkAppHandle handle = launcher.startApplication(new LauncherListener());
> int c = 0;
> while (!handle.getState().isFinal()) {
>   LOG.info(" state is= " + handle.getState());
>   LOG.info(" state is not final yet. counter= " + c++);
>   LOG.info(" sleeping for a second");
>   try {
>     Thread.sleep(1000L);
>   } catch (InterruptedException e) {
>   }
>   if (c == 200) break;
> }
>
> Here are the things I have already tried:
>
> Added listener instance to SparkAppHandle once application is launched.
> Made the current class implement SparkAppHandle.Listener and passed it
> (this) in both ways (while launching, and by setting it on SparkAppHandle)
> Tried to use launcher.launch() method so that at least I can block on the
> resulting Process object by calling process.waitFor() method till spark job
> finishes running on cluster. However in this case, for long running spark
> jobs, corresponding process on this node never returns (though it works fine
> for spark jobs which are finishing in 1 or 2 min)
>
>
>
> Thanks,
> Reddy



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: YARN memory overhead settings

2016-09-06 Thread Marcelo Vanzin
It kinda depends on the application. Certain compression libraries, in
particular, are kinda lax with their use of off-heap buffers, so if
you configure executors to use many cores you might end up with higher
usage than the default configuration. Then there are also things like
PARQUET-118.

In any case, growth should not be unbounded, so you can just increase
the value until your jobs start working (or, if growth doesn't stop,
there might be a memory leak somewhere).

On Tue, Sep 6, 2016 at 9:23 AM, Tim Moran  wrote:
> Hi,
>
> I'm running a spark job on YARN, using 6 executors each with 25 GB of memory
> and spark.yarn.executor.overhead set to 5GB. Despite this, I still seem to
> see YARN killing my executors for exceeding the memory limit.
>
> Reading the docs, it looks like the overhead defaults to around 10% of the
> size of the heap - yet I'm still seeing failures when it's set to 20% of the
> heap size. Is this expected? Are there any particular issues or antipatterns
> in Spark code that could cause the JVM to use an excessive amount of memory
> beyond the heap?
>
> Thanks,
>
> Tim.
>
> This email is confidential, if you are not the intended recipient please
> delete it and notify us immediately by emailing the sender. You should not
> copy it or use it for any purpose nor disclose its contents to any other
> person. Privitar Limited is registered in England with registered number
> 09305666. Registered office Salisbury House, Station Road, Cambridge,
> CB12LA.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Marcelo Vanzin
You're running spark-shell. It already creates a SparkContext for you and
makes it available in a variable called "sc".

If you want to change the config of spark-shell's context, you need to use
command line options. (Or stop the existing context first, although I'm not
sure how well that will work.)

On Tue, Sep 13, 2016 at 10:49 AM, Kevin Burton  wrote:

> I'm rather confused here as to what to do about creating a new
> SparkContext.
>
> Spark 2.0 prevents it... (exception included below)
>
> yet a TON of examples I've seen basically tell you to create a new
> SparkContext as standard practice:
>
> http://spark.apache.org/docs/latest/configuration.html#
> dynamically-loading-spark-properties
>
> val conf = new SparkConf()
>   .setMaster("local[2]")
>   .setAppName("CountingSheep")
> val sc = new SparkContext(conf)
>
>
> I'm specifically running into a problem in that ES hadoop won't work with
> its settings and I think its related to this problme.
>
> Do we have to call sc.stop() first and THEN create a new spark context?
>
> That works,, but I can't find any documentation anywhere telling us the
> right course of action.
>
>
>
> scala> val sc = new SparkContext();
> org.apache.spark.SparkException: Only one SparkContext may be running in
> this JVM (see SPARK-2243). To ignore this error, set 
> spark.driver.allowMultipleContexts
> = true. The currently running SparkContext was created at:
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:823)
> org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
> (:15)
> (:31)
> (:33)
> .(:37)
> .()
> .$print$lzycompute(:7)
> .$print(:6)
> $print()
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
> 62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:497)
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$
> loadAndRunReq$1.apply(IMain.scala:638)
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$
> loadAndRunReq$1.apply(IMain.scala:637)
> scala.reflect.internal.util.ScalaClassLoader$class.
> asContext(ScalaClassLoader.scala:31)
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(
> AbstractFileClassLoader.scala:19)
>   at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$
> 2.apply(SparkContext.scala:2221)
>   at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$
> 2.apply(SparkContext.scala:2217)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(
> SparkContext.scala:2217)
>   at org.apache.spark.SparkContext$.markPartiallyConstructed(
> SparkContext.scala:2290)
>   at org.apache.spark.SparkContext.(SparkContext.scala:89)
>   at org.apache.spark.SparkContext.(SparkContext.scala:121)
>   ... 48 elided
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


-- 
Marcelo


Re: Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Marcelo Vanzin
Use:

spark-submit --jars /path/sqldriver.jar --conf
spark.driver.extraClassPath=sqldriver.jar --conf
spark.executor.extraClassPath=sqldriver.jar

In client mode the driver's classpath needs to point to the full path,
not just the name.


On Wed, Sep 14, 2016 at 5:42 AM, Kevin Tran  wrote:
> Hi Everyone,
>
> I tried in cluster mode on YARN
>  * spark-submit  --jars /path/sqldriver.jar
>  * --driver-class-path
>  * spark-env.sh
> SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/*"
>  * spark-defaults.conf
> spark.driver.extraClassPath
> spark.executor.extraClassPath
>
> None of them works for me !
>
> Does anyone have Spark app work with driver jar on executors before please
> give me your ideas. Thank you.
>
> Cheers,
> Kevin.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why add --driver-class-path jbdc.jar works and --jars not? (1.6.1)

2016-10-05 Thread Marcelo Vanzin
Many (all?) JDBC drivers need to be in the system classpath. --jars
places them in an app-specific class loader, so it doesn't work.

On Wed, Oct 5, 2016 at 3:32 AM, Chanh Le  wrote:
> Hi everyone,
> I just wondering why when I run my program I need to add jdbc.jar into
> —driver-class-path instead treat it like a dependency by —jars.
>
> My program works with these config
> ./bin/spark-submit --packages
> org.apache.spark:spark-streaming-kafka_2.10:1.6.1 --master "local[4]"
> --class com.ants.util.kafka.PersistenceData --driver-class-path
> /Users/giaosudau/Downloads/postgresql-9.3-1102.jdbc41.jar
> /Users/giaosudau/workspace/KafkaJobs/target/scala-2.10/kafkajobs-prod.jar
>
> According by http://stackoverflow.com/a/30947090/523075 and
> http://stackoverflow.com/a/31012955/523075
>
> This is a bug related the the classloader
>
>
> I checked this https://github.com/apache/spark/pull/6900 was merged.
>
> I am using Spark 1.6.1 and by issue tell that already fixed in 1.4.1 and 1.5
>
>
> Regards,
> Chanh



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-18 Thread Marcelo Vanzin
On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov  wrote:
> Does my map task need to wait until Spark job finishes ?

No...

> Or is there any way, my map task finishes after launching Spark job, and I
> can still query and get status of Spark job outside of map task (or failure
> reason, if it has failed) ? (maybe by querying Spark job id ?)

...but if the SparkLauncher handle goes away, then you lose the
ability to track the app's state, unless you talk directly to the
cluster manager.

> I guess also if i want my Spark job to be killed, if corresponding delegator
> map task is killed, that means my map task needs to stay alive, so i still
> have SparkAppHandle reference ?

Correct, unless you talk directly to the cluster manager.
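
For example, with YARN you can poll the resource manager yourself once the
launcher process is gone (the application id below is a placeholder):

  yarn application -list
  yarn application -status application_1443123456789_0001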

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can i get callback notification on Spark job completion ?

2016-10-28 Thread Marcelo Vanzin
If you look at the "startApplication" method it takes listeners as parameters.

On Fri, Oct 28, 2016 at 10:23 AM, Elkhan Dadashov  wrote:
> Hi,
>
> I know that we can use SparkAppHandle (introduced in SparkLauncher version
>>=1.6), and lt the delegator map task stay alive until the Spark job
> finishes. But i wonder, if this can be done via callback notification
> instead of polling.
>
> Can i get callback notification on Spark job completion ?
>
> Similar to Hadoop, get a callback on MapReduce job completion - getting a
> notification instead of polling.
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value. Can be retrieved from notification URL
> both the JOB_ID and JOB_STATUS.
>
> ...
> Configuration conf = this.getConf();
> // Set the callback parameters
> conf.set("job.end.notification.url",
> "https://hadoopi.wordpress.com/api/hadoop/notification/$jobId?status=$jobStatus";);
> ...
> // Submit your job in background
> job.submit();
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value:
>
> https:///api/hadoop/notification/job_1379509275868_0002?status=SUCCEEDED
>
> Reference:
> https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
>
> Thanks.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can i get callback notification on Spark job completion ?

2016-10-28 Thread Marcelo Vanzin
On Fri, Oct 28, 2016 at 11:14 AM, Elkhan Dadashov  wrote:
> But if the map task will finish before the Spark job finishes, that means
> SparkLauncher will go away. if the SparkLauncher handle goes away, then I
> lose the ability to track the app's state, right ?
>
> I'm investigating if there is a way to know Spark job completion (without
> Spark Job History Server) in asynchronous manner.

Correct. As I said in my other reply to you, if you can't use Spark's
API for whatever reason, you have to talk directly to the cluster
managers, and at that point it's out of Spark's hands to help you.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
Sounds like your test was set up incorrectly. The default TTL for
tokens is 7 days. Did you change that in the HDFS config?

The issue definitely exists and people definitely have run into it. So
if you're not hitting it, it's most definitely an issue with your test
configuration.

On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth  wrote:
> Hi,
>
> I ran some tests regarding Spark's Delegation Token renewal mechanism. As I
> see, the concept here is simple: if I give my keytab file and client
> principal to Spark, it starts a token renewal thread, and renews the
> namenode delegation tokens after some time. This works fine.
>
> Then I tried to run a long application (with HDFS operation in the end)
> without providing the keytab/principal to Spark, and I expected it to fail
> after the token expires. It turned out that this is not the case, the
> application finishes successfully without a delegation token renewal by
> Spark.
>
> My question is: how is that possible? Shouldn't a saveAsTextfile() fail
> after the namenode delegation token expired?
>
> Regards,
> Zsolt



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
I think you're a little confused about what "renewal" means here, and
this might be the fault of the documentation (I haven't read it in a
while).

The existing delegation tokens will always be "renewed", in the sense
that Spark (actually Hadoop code invisible to Spark) will talk to the
NN to extend its lifetime. The feature you're talking about is for
creating *new* delegation tokens after the old ones expire and cannot
be renewed anymore (i.e. the max-lifetime configuration).
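
For completeness, the feature that creates those new tokens is the one enabled
by the keytab options; a hedged sketch of the command line (principal, keytab
path, class and jar are placeholders):

  spark-submit --master yarn --deploy-mode cluster \
    --principal me@EXAMPLE.COM \
    --keytab /path/to/me.keytab \
    --class com.example.LongRunningApp app.jar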

On Thu, Nov 3, 2016 at 2:02 PM, Zsolt Tóth  wrote:
> Yes, I did change dfs.namenode.delegation.key.update-interval and
> dfs.namenode.delegation.token.renew-interval to 15 min, the max-lifetime to
> 30min. In this case the application (without Spark having the keytab) did
> not fail after 15 min, only after 30 min. Is it possible that the resource
> manager somehow automatically renews the delegation tokens for my
> application?
>
> 2016-11-03 21:34 GMT+01:00 Marcelo Vanzin :
>>
>> Sounds like your test was set up incorrectly. The default TTL for
>> tokens is 7 days. Did you change that in the HDFS config?
>>
>> The issue definitely exists and people definitely have run into it. So
>> if you're not hitting it, it's most definitely an issue with your test
>> configuration.
>>
>> On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth 
>> wrote:
>> > Hi,
>> >
>> > I ran some tests regarding Spark's Delegation Token renewal mechanism.
>> > As I
>> > see, the concept here is simple: if I give my keytab file and client
>> > principal to Spark, it starts a token renewal thread, and renews the
>> > namenode delegation tokens after some time. This works fine.
>> >
>> > Then I tried to run a long application (with HDFS operation in the end)
>> > without providing the keytab/principal to Spark, and I expected it to
>> > fail
>> > after the token expires. It turned out that this is not the case, the
>> > application finishes successfully without a delegation token renewal by
>> > Spark.
>> >
>> > My question is: how is that possible? Shouldn't a saveAsTextfile() fail
>> > after the namenode delegation token expired?
>> >
>> > Regards,
>> > Zsolt
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth  wrote:
> What is the purpose of the delegation token renewal (the one that is done
> automatically by Hadoop libraries, after 1 day by default)? It seems that it
> always happens (every day) until the token expires, no matter what. I'd
> probably find an answer to that in a basic Hadoop security description.

I'm not sure and I never really got a good answer to that (I had the
same question in the past). My best guess is to limit how long an
attacker can do bad things if he gets hold of a delegation token. But
IMO if an attacker gets a delegation token, that's pretty bad
regardless of how long he can use it...

> I have a feeling that giving the keytab to Spark bypasses the concept behind
> delegation tokens. As I understand, the NN basically says that "your
> application can access hdfs with this delegation token, but only for 7
> days".

I'm not sure why there's a 7 day limit either, but let's assume
there's a good reason. Basically the app, at that point, needs to
prove to the NN it has a valid kerberos credential. Whether that's
from someone typing their password into a terminal, or code using a
keytab, it doesn't really matter. If someone was worried about that
user being malicious they'd disable the user's login in the KDC.

This feature is needed because there are apps that need to keep
running, unattended, for longer than HDFS's max lifetime setting.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Marcelo Vanzin
On Fri, Nov 4, 2016 at 1:57 AM, Zsolt Tóth  wrote:
> This was what confused me in the first place. Why does Spark ask for new
> tokens based on the renew-interval instead of the max-lifetime?

It could be just a harmless bug, since tokens have a "getMaxDate()"
method which I assume returns the token's lifetime, although there's
no documentation. Or it could be that the max lifetime of the token is
not really available to the code. If you want to experiment with the
code, that should be a small change (if getMaxDate() returns the right
thing).

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkLauncer 2.0.1 version working incosistently in yarn-client mode

2016-11-07 Thread Marcelo Vanzin
On Sat, Nov 5, 2016 at 2:54 AM, Elkhan Dadashov  wrote:
> while (appHandle.getState() == null || !appHandle.getState().isFinal()) {
> if (appHandle.getState() != null) {
> log.info("while: Spark job state is : " + appHandle.getState());
> if (appHandle.getAppId() != null) {
> log.info("\t App id: " + appHandle.getAppId() + "\tState: " +
> appHandle.getState());
> }
> }
> }

This is a ridiculously expensive busy loop, even more so if you
comment out the log lines. Use listeners, or at least sleep a little
bit every once in a while. You're probably starving other processes /
threads of cpu.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Correct SparkLauncher usage

2016-11-07 Thread Marcelo Vanzin
On Mon, Nov 7, 2016 at 3:29 PM, Mohammad Tariq  wrote:
> I have been trying to use SparkLauncher.startApplication() to launch a Spark 
> app from within java code, but unable to do so. However, same piece of code 
> is working if I use SparkLauncher.launch().
>
> Here are the corresponding code snippets :
>
> SparkAppHandle handle = new SparkLauncher()
>
> 
> .setSparkHome("/Users/miqbal1/DISTRIBUTED_WORLD/UNPACKED/spark-1.6.1-bin-hadoop2.6")
>
> 
> .setJavaHome("/Library/Java/JavaVirtualMachines/jdk1.8.0_92.jdk/Contents/Home")
>
> 
> .setAppResource("/Users/miqbal1/wc.jar").setMainClass("org.myorg.WC").setMaster("local")
>
> .setConf("spark.dynamicAllocation.enabled", 
> "true").startApplication();System.out.println(handle.getAppId());
>
> System.out.println(handle.getState());
>
> This prints null and UNKNOWN as output.

The information you're printing is not available immediately after you
call "startApplication()". The Spark app is still starting, so it may
take some time for the app ID and other info to be reported back. The
"startApplication()" method allows you to provide listeners you can
use to know when that information is available.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Correct SparkLauncher usage

2016-11-07 Thread Marcelo Vanzin
Then you need to look at your logs to figure out why the child app is not
working. "startApplication" will by default redirect the child's output to
the parent's logs.

On Mon, Nov 7, 2016 at 3:42 PM, Mohammad Tariq  wrote:

> Hi Marcelo,
>
> Thank you for the prompt response. I tried adding listeners as well,
> didn't work either. Looks like it isn't starting the job at all.
>
>
> Tariq, Mohammad
> about.me/mti
>
>
> On Tue, Nov 8, 2016 at 5:06 AM, Marcelo Vanzin 
> wrote:
>
>> On Mon, Nov 7, 2016 at 3:29 PM, Mohammad Tariq 
>> wrote:
>> > I have been trying to use SparkLauncher.startApplication() to launch a
>> Spark app from within java code, but unable to do so. However, same piece
>> of code is working if I use SparkLauncher.launch().
>> >
>> > Here are the corresponding code snippets :
>> >
>> > SparkAppHandle handle = new SparkLauncher()
>> >
>> > .setSparkHome("/Users/miqbal1/DISTRIBUTED_WORLD/UNPACKED/
>> spark-1.6.1-bin-hadoop2.6")
>> >
>> > .setJavaHome("/Library/Java/JavaVirtualMachines/jdk1.8.0_92
>> .jdk/Contents/Home")
>> >
>> > .setAppResource("/Users/miqbal1/wc.jar").setMainClass("org.
>> myorg.WC").setMaster("local")
>> >
>> > .setConf("spark.dynamicAllocation.enabled",
>> "true").startApplication();System.out.println(handle.getAppId());
>> >
>> > System.out.println(handle.getState());
>> >
>> > This prints null and UNKNOWN as output.
>>
>> The information you're printing is not available immediately after you
>> call "startApplication()". The Spark app is still starting, so it may
>> take some time for the app ID and other info to be reported back. The
>> "startApplication()" method allows you to provide listeners you can
>> use to know when that information is available.
>>
>> --
>> Marcelo
>>
>
>


-- 
Marcelo


Re: Correct SparkLauncher usage

2016-11-10 Thread Marcelo Vanzin
On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq  wrote:
>   @Override
>   public void stateChanged(SparkAppHandle handle) {
> System.out.println("Spark App Id [" + handle.getAppId() + "]. State [" + 
> handle.getState() + "]");
> while(!handle.getState().isFinal()) {

You shouldn't loop in an event handler. That's not really how
listeners work. Instead, use the event handler to update some local
state, or signal some thread that's waiting for the state change.

Also be aware that handles currently only work in local and yarn
modes; the state updates haven't been hooked up to standalone mode
(maybe for client mode, but definitely not cluster) nor mesos.
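
As a rough sketch of the listener-plus-signal pattern (app resource, main class
and master below are placeholders):

  import java.util.concurrent.CountDownLatch;
  import org.apache.spark.launcher.SparkAppHandle;
  import org.apache.spark.launcher.SparkLauncher;

  public class LauncherExample {
    public static void main(String[] args) throws Exception {
      final CountDownLatch done = new CountDownLatch(1);

      SparkAppHandle handle = new SparkLauncher()
          .setAppResource("/path/to/app.jar")
          .setMainClass("com.example.MyApp")
          .setMaster("yarn")
          .setDeployMode("cluster")
          .startApplication(new SparkAppHandle.Listener() {
            @Override
            public void stateChanged(SparkAppHandle h) {
              // Just record/signal here; never block or loop inside the callback.
              System.out.println("State: " + h.getState());
              if (h.getState().isFinal()) {
                done.countDown();
              }
            }
            @Override
            public void infoChanged(SparkAppHandle h) {
              System.out.println("App id: " + h.getAppId());
            }
          });

      done.await();  // the main thread waits instead of busy-polling
      System.out.println("Final state: " + handle.getState());
    }
  }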

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Correct SparkLauncher usage

2016-11-10 Thread Marcelo Vanzin
Sorry, it's kinda hard to give any more feedback from just the info you
provided.

I'd start with some working code like this from Spark's own unit tests:
https://github.com/apache/spark/blob/a8ea4da8d04c1ed621a96668118f20739145edd2/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala#L164


On Thu, Nov 10, 2016 at 3:00 PM, Mohammad Tariq  wrote:

> All I want to do is submit a job, and keep on getting states as soon as it
> changes, and come out once the job is over. I'm sorry to be a pest of
> questions. Kind of having a bit of tough time making this work.
>
>
> Tariq, Mohammad
> about.me/mti
>
>
> On Fri, Nov 11, 2016 at 4:27 AM, Mohammad Tariq 
> wrote:
>
>> Yeah, that definitely makes sense. I was just trying to make it work
>> somehow. The problem is that it's not at all calling the listeners, hence
>> i'm unable to do anything. Just wanted to cross check it by looping inside.
>> But I get the point. thank you for that!
>>
>> I'm on YARN(cluster mode).
>>
>>
>> Tariq, Mohammad
>> about.me/mti
>>
>>
>> On Fri, Nov 11, 2016 at 4:19 AM, Marcelo Vanzin 
>> wrote:
>>
>>> On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq 
>>> wrote:
>>> >   @Override
>>> >   public void stateChanged(SparkAppHandle handle) {
>>> > System.out.println("Spark App Id [" + handle.getAppId() + "].
>>> State [" + handle.getState() + "]");
>>> > while(!handle.getState().isFinal()) {
>>>
>>> You shouldn't loop in an event handler. That's not really how
>>> listeners work. Instead, use the event handler to update some local
>>> state, or signal some thread that's waiting for the state change.
>>>
>>> Also be aware that handles currently only work in local and yarn
>>> modes; the state updates haven't been hooked up to standalone mode
>>> (maybe for client mode, but definitely not cluster) nor mesos.
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>


-- 
Marcelo


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-11-15 Thread Marcelo Vanzin
On Tue, Nov 15, 2016 at 5:57 PM, Elkhan Dadashov  wrote:
> This is confusing in the sense that, the client needs to stay alive for
> Spark Job to finish successfully.
>
> Actually the client can die  or finish (in Yarn-cluster mode), and the spark
> job will successfully finish.

That's an internal class, and you're looking at an internal javadoc
that describes how the app handle works. For the app handle to be
updated, the "client" (i.e. the sub process) needs to stay alive. So
the javadoc is correct. It has nothing to do with whether the
application succeeds or not.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Marcelo Vanzin
There's generally an exception in these cases, and you haven't posted
it, so it's hard to tell you what's wrong. The most probable cause,
without the extra information the exception provides, is that you're
using the wrong Hadoop configuration when submitting the job to YARN.

On Mon, Dec 5, 2016 at 4:35 AM, Gerard Casey  wrote:
> Hello all,
>
> I am using Spark with Kerberos authentication.
>
> I can run my code using `spark-shell` fine and I can also use `spark-submit`
> in local mode (e.g. —master local[16]). Both function as expected.
>
> local mode -
>
> spark-submit --class "graphx_sp" --master local[16] --driver-memory 20G
> target/scala-2.10/graphx_sp_2.10-1.0.jar
>
> I am now progressing to run in cluster mode using YARN.
>
> cluster mode with YARN -
>
> spark-submit --class "graphx_sp" --master yarn --deploy-mode cluster
> --executor-memory 13G --total-executor-cores 32
> target/scala-2.10/graphx_sp_2.10-1.0.jar
>
> However, this returns:
>
> diagnostics: User class threw exception:
> org.apache.hadoop.security.AccessControlException: Authentication required
>
> Before I run using spark-shell or on local mode in spark-submit I do the
> following kerberos setup:
>
> kinit -k -t ~/keytab -r 7d `whoami`
>
> Clearly, this setup is not extending to the YARN setup. How do I fix the
> Kerberos issue with YARN in cluster mode? Is this something which must be in
> my /src/main/scala/graphx_sp.scala file?
>
> Many thanks
>
> Geroid



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Marcelo Vanzin
> 16/12/05 18:23:56 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:23:57 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:23:58 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:23:59 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:24:00 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:24:01 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:24:02 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:24:03 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:24:04 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: ACCEPTED)
> 16/12/05 18:24:05 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:05 INFO yarn.Client:
>  client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>  diagnostics: N/A
>  ApplicationMaster host:
>  ApplicationMaster RPC port: 0
>  queue: default
>  start time: 1480962209903
>  final status: UNDEFINED
>  tracking URL: 
> http://login_node1.xcat.cluster:8088/proxy/application_1479877553404_0041/
>  user: me
> 16/12/05 18:24:06 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:07 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:08 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:09 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:10 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:11 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:12 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:13 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:14 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:15 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:16 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:17 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: RUNNING)
> 16/12/05 18:24:18 INFO yarn.Client: Application report for 
> application_1479877553404_0041 (state: FINISHED)
> 16/12/05 18:24:18 INFO yarn.Client:
>  client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>  diagnostics: User class threw exception: 
> org.apache.hadoop.security.AccessControlException: Authentication required
>  ApplicationMaster host:
>  ApplicationMaster RPC port: 0
>  queue: default
>  start time: 1480962209903
>  final status: FAILED
>  tracking URL: 
> http://login_node1.xcat.cluster:8088/proxy/application_1479877553404_0041/
>  user: me
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1479877553404_0041 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1169)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 16/12/05 18:24:18 INFO util.ShutdownHookManager: Shutdown hook ca

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcelo Vanzin
On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  wrote:
> Can anyone point me to a tutorial or a run through of how to use Spark with
> Kerberos? This is proving to be quite confusing. Most search results on the
> topic point to what needs inputted at the point of `sparks submit` and not
> the changes needed in the actual src/main/.scala file

You don't need to write any special code to run Spark with Kerberos.
Just write your application normally, and make sure you're logged in
to the KDC (i.e. "klist" shows a valid TGT) before running your app.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcelo Vanzin
Have you removed all the code dealing with Kerberos that you posted?
You should not be setting those principal / keytab configs.

Literally all you have to do is login with kinit then run spark-submit.

Try with the SparkPi example for instance, instead of your own code.
If that doesn't work, you have a configuration issue somewhere.
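
Something along these lines (the examples jar path varies by Spark version,
e.g. lib/ in 1.x vs examples/jars/ in 2.x):

  kinit me@EXAMPLE.COM      # or: kinit -kt /path/to/me.keytab me@EXAMPLE.COM
  klist                     # verify a valid TGT is listed
  spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    "$SPARK_HOME"/lib/spark-examples*.jar 10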

On Wed, Dec 7, 2016 at 1:09 PM, Gerard Casey  wrote:
> Thanks.
>
> I’ve checked the TGT, principal and key tab. Where to next?!
>
>> On 7 Dec 2016, at 22:03, Marcelo Vanzin  wrote:
>>
>> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  
>> wrote:
>>> Can anyone point me to a tutorial or a run through of how to use Spark with
>>> Kerberos? This is proving to be quite confusing. Most search results on the
>>> topic point to what needs inputted at the point of `sparks submit` and not
>>> the changes needed in the actual src/main/.scala file
>>
>> You don't need to write any special code to run Spark with Kerberos.
>> Just write your application normally, and make sure you're logged in
>> to the KDC (i.e. "klist" shows a valid TGT) before running your app.
>>
>>
>> --
>> Marcelo
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  wrote:
> To be specific, where exactly should spark.authenticate be set to true?

spark.authenticate has nothing to do with kerberos. It's for
authentication between different Spark processes belonging to the same
app.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
Then you probably have a configuration error somewhere. Since you
haven't actually posted the error you're seeing, it's kinda hard to
help any further.

On Thu, Dec 8, 2016 at 11:17 AM, Gerard Casey  wrote:
> Right. I’m confident that is setup correctly.
>
> I can run the SparkPi test script. The main difference between it and my 
> application is that it doesn’t access HDFS.
>
>> On 8 Dec 2016, at 18:43, Marcelo Vanzin  wrote:
>>
>> On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  
>> wrote:
>>> To be specific, where exactly should spark.authenticate be set to true?
>>
>> spark.authenticate has nothing to do with kerberos. It's for
>> authentication between different Spark processes belonging to the same
>> app.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-08 Thread Marcelo Vanzin
You could have posted just the error, which is at the end of my response.

Why are you trying to use WebHDFS? I'm not really sure how
authentication works with that. But generally applications use HDFS
(which uses a different URI scheme), and Spark should work fine with
that.
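
As a sketch (the paths and the existing JavaSparkContext variable are
hypothetical), the read would look like:

  JavaRDD<String> lines = jsc.textFile("hdfs://nameservice1/user/me/input");
  // rather than: jsc.textFile("webhdfs://namenode-host:50070/user/me/input");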


Error:
Authentication required
org.apache.hadoop.security.AccessControlException: Authentication required
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:457)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:113)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:738)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:582)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:612)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:608)
at 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1507)
at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:523)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:140)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)


On Thu, Dec 8, 2016 at 12:29 PM, Gerard Casey  wrote:
> Sure - I wanted to check with admin before sharing. I’ve attached it now, 
> does this help?
>
> Many thanks again,
>
> G
>
>
>
>> On 8 Dec 2016, at 20:18, Marcelo Vanzin  wrote:
>>
>> Then you probably have a configuration error somewhere. Since you
>> haven't actually posted the error you're seeing, it's kinda hard to
>> help any further.
>>
>> On Thu, Dec 8, 2016 at 11:17 AM, Gerard Casey  
>> wrote:
>>> Right. I’m confident that is setup correctly.
>>>
>>> I can run the SparkPi test script. The main difference between it and my 
>>> application is that it doesn’t access HDFS.
>>>
>>>> On 8 Dec 2016, at 18:43, Marcelo Vanzin  wrote:
>>>>
>>>> On Wed, Dec 7, 2016 at 11:54 PM, Gerard Casey  
>>>> wrote:
>>>>> To be specific, where exactly should spark.authenticate be set to true?
>>>>
>>>> spark.authenticate has nothing to do with kerberos. It's for
>>>> authentication between different Spark processes belonging to the same
>>>> app.
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how can I set the log configuration file for spark history server ?

2016-12-09 Thread Marcelo Vanzin
(-dev)

Just configure your log4j.properties in $SPARK_HOME/conf (or set a
custom $SPARK_CONF_DIR for the history server).
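
A minimal log4j.properties sketch that caps the log file size (the file path
and limits are just examples):

  log4j.rootCategory=INFO, rolling
  log4j.appender.rolling=org.apache.log4j.RollingFileAppender
  log4j.appender.rolling.File=/var/log/spark/spark-history-server.log
  log4j.appender.rolling.MaxFileSize=100MB
  log4j.appender.rolling.MaxBackupIndex=10
  log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
  log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n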

On Thu, Dec 8, 2016 at 7:20 PM, John Fang  wrote:
> ./start-history-server.sh
> starting org.apache.spark.deploy.history.HistoryServer, logging to
> /home/admin/koala/data/versions/0/SPARK/2.0.2/spark-2.0.2-bin-hadoop2.6/logs/spark-admin-org.apache.spark.deploy.history.HistoryServer-1-v069166214.sqa.zmf.out
>
> Then the history will print all log to the XXX.sqa.zmf.out, so i can't limit
> the file max size.  I want limit the size of the log file



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Where is yarn-shuffle.jar in maven?

2016-12-13 Thread Marcelo Vanzin
https://mvnrepository.com/artifact/org.apache.spark/spark-network-yarn_2.11/2.0.2

On Mon, Dec 12, 2016 at 9:56 PM, Neal Yin  wrote:
> Hi,
>
> For dynamic allocation feature, I need spark-xxx-yarn-shuffle.jar. In my
> local spark build, I can see it.  But in maven central, I can’t find it. My
> build script pulls all jars from maven central. The only option is to check
> in this jar into git?
>
> Thanks,
>
> -Neal



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is restarting of SparkContext allowed?

2016-12-15 Thread Marcelo Vanzin
(-dev, +user. dev is for Spark development, not for questions about
using Spark.)

You haven't posted code here or the actual error. But you might be
running into SPARK-15754. Or into other issues with yarn-client mode
and "--principal / --keytab" (those have known issues in client mode).

If you have the above fix, you should be able to run the SparkContext
in client mode inside a UGI.doAs() block, after you login the user,
and later stop the context and start a new one. (And don't use
"--principal" / "--keytab" in that case.)


On Thu, Dec 15, 2016 at 1:46 PM, Alexey Klimov  wrote:
> Hello, my question is the continuation of a problem I described  here
> 
> .
>
> I've done some investigation and found out that nameNode.getDelegationToken
> is called during constructing SparkContext even if delegation token is
> already presented in token list of current logged user in object of
> UserGroupInforation class. The problem doesn't occur when waiting time
> before constructing a new context is less than 10 seconds, because rpc
> connection to namenode just isn't resetting because of
> ipc.client.connection.maxidletime property.
>
> As a workaround of this problem I do login from keytab before every
> constructing of SparkContext, which basically just resets token list of
> current logged user (as well as whole user structure) and the problem seems
> to be gone. Still I'm not really sure that it is correct way to deal with
> SparkContext.
>
> Having found a reason of the problem, I've got 2 assumptions now:
> First - SparkContext was designed to be restarted during JVM run and
> behaviour above is just a bug.
> Second - it wasn't and I'm just using SparkContext in a wrong manner.
>
> Since I haven't found any related bug in Jira and any solution on the
> internet (as well as too many users facing this error) I tend to think that
> it is rather a not allowed usage of SparkContext.
>
> Is that correct?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Is-restarting-of-SparkContext-allowed-tp20240.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
That's the Master, whose default port is 8080 (not 4040). The default
port for the app's UI is 4040.

On Mon, Jan 23, 2017 at 11:47 AM, kant kodali  wrote:
> I am not sure why Spark web UI keeps changing its port every time I restart
> a cluster? how can I make it run always on one port? I did make sure there
> is no process running on 4040(spark default web ui port) however it still
> starts at 8080. any ideas?
>
>
> MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
> http://x.x.x.x:8080
>
>
> Thanks!



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
No. Each app has its own UI which runs (starting on) port 4040.

On Mon, Jan 23, 2017 at 12:05 PM, kant kodali  wrote:
> I am using standalone mode so wouldn't be 8080 for my app web ui as well?
> There is nothing running on 4040 in my cluster.
>
> http://spark.apache.org/docs/latest/security.html#standalone-mode-only
>
> On Mon, Jan 23, 2017 at 11:51 AM, Marcelo Vanzin 
> wrote:
>>
>> That's the Master, whose default port is 8080 (not 4040). The default
>> port for the app's UI is 4040.
>>
>> On Mon, Jan 23, 2017 at 11:47 AM, kant kodali  wrote:
>> > I am not sure why Spark web UI keeps changing its port every time I
>> > restart
>> > a cluster? how can I make it run always on one port? I did make sure
>> > there
>> > is no process running on 4040(spark default web ui port) however it
>> > still
>> > starts at 8080. any ideas?
>> >
>> >
>> > MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
>> > http://x.x.x.x:8080
>> >
>> >
>> > Thanks!
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
Depends on what you mean by "job". Which is why I prefer "app", which
is clearer (something you submit using "spark-submit", for example).

But really, I'm not sure what you're asking now.

On Mon, Jan 23, 2017 at 12:15 PM, kant kodali  wrote:
> hmm..I guess in that case my assumption of "app" is wrong. I thought the app
> is a client jar that you submit. no? If so, say I submit multiple jobs then
> I get two UI'S?
>
> On Mon, Jan 23, 2017 at 12:07 PM, Marcelo Vanzin 
> wrote:
>>
>> No. Each app has its own UI which runs (starting on) port 4040.
>>
>> On Mon, Jan 23, 2017 at 12:05 PM, kant kodali  wrote:
>> > I am using standalone mode so wouldn't be 8080 for my app web ui as
>> > well?
>> > There is nothing running on 4040 in my cluster.
>> >
>> > http://spark.apache.org/docs/latest/security.html#standalone-mode-only
>> >
>> > On Mon, Jan 23, 2017 at 11:51 AM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> That's the Master, whose default port is 8080 (not 4040). The default
>> >> port for the app's UI is 4040.
>> >>
>> >> On Mon, Jan 23, 2017 at 11:47 AM, kant kodali 
>> >> wrote:
>> >> > I am not sure why Spark web UI keeps changing its port every time I
>> >> > restart
>> >> > a cluster? how can I make it run always on one port? I did make sure
>> >> > there
>> >> > is no process running on 4040(spark default web ui port) however it
>> >> > still
>> >> > starts at 8080. any ideas?
>> >> >
>> >> >
>> >> > MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
>> >> > http://x.x.x.x:8080
>> >> >
>> >> >
>> >> > Thanks!
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
As I said.

Each app gets its own UI. Look at the logs printed to the output.
The port will depend on whether they're running on the same host at
the same time.

This is irrespective of how they are run.
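
If you want fixed, predictable ports, these are the knobs (values shown are the
defaults):

  # conf/spark-env.sh -- standalone Master web UI
  SPARK_MASTER_WEBUI_PORT=8080

  # conf/spark-defaults.conf -- starting port for each application's UI;
  # a second app on the same host falls back to 4041, 4042, ...
  spark.ui.port 4040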

On Mon, Jan 23, 2017 at 12:40 PM, kant kodali  wrote:
> yes I meant submitting through spark-submit.
>
> so If I do spark-submit A.jar and spark-submit A.jar again. Do I get two
> UI's or one UI'? and which ports do they run on when using the stand alone
> mode?
>
> On Mon, Jan 23, 2017 at 12:19 PM, Marcelo Vanzin 
> wrote:
>>
>> Depends on what you mean by "job". Which is why I prefer "app", which
>> is clearer (something you submit using "spark-submit", for example).
>>
>> But really, I'm not sure what you're asking now.
>>
>> On Mon, Jan 23, 2017 at 12:15 PM, kant kodali  wrote:
>> > hmm..I guess in that case my assumption of "app" is wrong. I thought the
>> > app
>> > is a client jar that you submit. no? If so, say I submit multiple jobs
>> > then
>> > I get two UI'S?
>> >
>> > On Mon, Jan 23, 2017 at 12:07 PM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> No. Each app has its own UI which runs (starting on) port 4040.
>> >>
>> >> On Mon, Jan 23, 2017 at 12:05 PM, kant kodali 
>> >> wrote:
>> >> > I am using standalone mode so wouldn't be 8080 for my app web ui as
>> >> > well?
>> >> > There is nothing running on 4040 in my cluster.
>> >> >
>> >> >
>> >> > http://spark.apache.org/docs/latest/security.html#standalone-mode-only
>> >> >
>> >> > On Mon, Jan 23, 2017 at 11:51 AM, Marcelo Vanzin
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> That's the Master, whose default port is 8080 (not 4040). The
>> >> >> default
>> >> >> port for the app's UI is 4040.
>> >> >>
>> >> >> On Mon, Jan 23, 2017 at 11:47 AM, kant kodali 
>> >> >> wrote:
>> >> >> > I am not sure why Spark web UI keeps changing its port every time
>> >> >> > I
>> >> >> > restart
>> >> >> > a cluster? how can I make it run always on one port? I did make
>> >> >> > sure
>> >> >> > there
>> >> >> > is no process running on 4040(spark default web ui port) however
>> >> >> > it
>> >> >> > still
>> >> >> > starts at 8080. any ideas?
>> >> >> >
>> >> >> >
>> >> >> > MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
>> >> >> > http://x.x.x.x:8080
>> >> >> >
>> >> >> >
>> >> >> > Thanks!
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Jars directory in Spark 2.0

2017-02-01 Thread Marcelo Vanzin
Spark has never shaded dependencies (in the sense of renaming the classes),
with a couple of exceptions (Guava and Jetty). So that behavior is nothing
new. Spark's dependencies themselves have a lot of other dependencies, so
doing that would have limited benefits anyway.
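
If you do shade your own copy of a conflicting dependency, a maven-shade-plugin
sketch looks roughly like this (the Guava relocation is just an example; adjust
the patterns to the libraries you actually conflict on):

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
      <execution>
        <phase>package</phase>
        <goals><goal>shade</goal></goals>
        <configuration>
          <relocations>
            <relocation>
              <pattern>com.google.common</pattern>
              <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
  </plugin>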

On Tue, Jan 31, 2017 at 11:23 PM, Sidney Feiner 
wrote:

> Is this done on purpose? Because it really makes it hard to deploy
> applications. Is there a reason they didn't shade the jars they use to
> begin with?
>
>
>
> *Sidney Feiner*   */*  SW Developer
>
> M: +972.528197720 <+972%2052-819-7720>  */*  Skype: sidney.feiner.startapp
>
>
>
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Tuesday, January 31, 2017 7:26 PM
> *To:* Sidney Feiner 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Jars directory in Spark 2.0
>
>
>
> you basically have to keep your versions of dependencies in line with
> sparks or shade your own dependencies.
>
> you cannot just replace the jars in sparks jars folder. if you wan to
> update them you have to build spark yourself with updated dependencies and
> confirm it compiles, passes tests etc.
>
>
>
> On Tue, Jan 31, 2017 at 3:40 AM, Sidney Feiner 
> wrote:
>
> Hey,
>
> While migrating to Spark 2.X from 1.6, I've had many issues with jars that
> come preloaded with Spark in the "jars/" directory and I had to shade most
> of my packages.
>
> Can I replace the jars in this folder with more up-to-date versions? Are
> those jars used for anything internal in Spark, which would mean I can't
> blindly replace them?
>
>
>
> Thanks :)
>
>
>
>
>
> *Sidney Feiner*   */*  SW Developer
>
> M: +972.528197720  */*  Skype: sidney.feiner.startapp
>
>
>
> [image: StartApp] 
>
>
>
> 
>
>   
>



-- 
Marcelo


Re: SPark - YARN Cluster Mode

2017-02-27 Thread Marcelo Vanzin
>  none of my Config settings

Is it none of the configs or just the queue? You can't set the YARN
queue in cluster mode through code, it has to be set in the command
line. It's a chicken & egg problem (in cluster mode, the YARN app is
created before your code runs).

 --properties-file works the same as setting options in the command
line, so you can use that instead.
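
For example, a sketch of the same job with the YARN settings moved to the
command line (values taken from the snippet quoted below):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue root.Applications \
      --num-executors 50 \
      --executor-memory 22g \
      --executor-cores 4 \
      --conf spark.yarn.executor.memoryOverhead=4096 \
      --conf spark.sql.hive.convertMetastoreParquet=false \
      ayan_test.py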


On Sun, Feb 26, 2017 at 4:52 PM, ayan guha  wrote:
> Hi
>
> I am facing an issue with Cluster Mode, with pyspark
>
> Here is my code:
>
> conf = SparkConf()
> conf.setAppName("Spark Ingestion")
> conf.set("spark.yarn.queue","root.Applications")
> conf.set("spark.executor.instances","50")
> conf.set("spark.executor.memory","22g")
> conf.set("spark.yarn.executor.memoryOverhead","4096")
> conf.set("spark.executor.cores","4")
> conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> sc = SparkContext(conf = conf)
> sqlContext = HiveContext(sc)
>
> r = sc.parallelize(xrange(1,1))
> print r.count()
>
> sc.stop()
>
> The problem is none of my Config settings are passed on to Yarn.
>
> spark-submit --master yarn --deploy-mode cluster ayan_test.py
>
> I tried the same code with deploy-mode=client and all config are passing
> fine.
>
> Am I missing something? Will introducing --properties-file be of any help? Can
> anybody share a working example?
>
> Best
> Ayan
>
> --
> Best Regards,
> Ayan Guha



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark-submit question

2017-02-28 Thread Marcelo Vanzin
Everything after the jar path is passed to the main class as
parameters. So if it's not working you're probably doing something
wrong in your code (that you haven't posted).

On Tue, Feb 28, 2017 at 7:05 AM, Joe Olson  wrote:
> For spark-submit, I know I can submit application level command line
> parameters to my .jar.
>
>
> However, can I prefix them with switches? My command line params are
> processed in my applications using JCommander. I've tried several variations
> of the below with no success.
>
>
> An example of what I am trying to do is below in the --num-decimals
> argument.
>
>
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master spark://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   /path/to/examples.jar \
>   --num-decimals=1000 \
>   --second-argument=Arg2
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark-submit question

2017-02-28 Thread Marcelo Vanzin
You're either running a really old version of Spark where there might
have been issues in that code, or you're actually missing some
backslashes in the command you pasted in your message.
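
In other words, each application argument still needs the trailing backslash
when it sits on its own line, e.g. (trimmed to the relevant lines):

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://207.184.161.138:7077 \
      --deploy-mode cluster \
      /path/to/examples.jar \
      --num-decimals=1000 \
      --second-argument=Arg2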

On Tue, Feb 28, 2017 at 2:05 PM, Joe Olson  wrote:
>> Everything after the jar path is passed to the main class as parameters.
>
> I don't think that is accurate if your application arguments contain double
> dashes. I've tried with several permutations of with and without '\'s and
> newlines.
>
> Just thought I'd ask here before I have to re-configure and re-compile all
> my jars.
>
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master spark://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   /path/to/examples.jar
>   --num-decimals=1000
>   --second-argument=Arg2
>
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "2.1.0",
>   "submissionId" : "driver-20170228155848-0016",
>   "success" : true
> }
> ./test3.sh: line 15: --num-decimals=1000: command not found
> ./test3.sh: line 16: --second-argument=Arg2: command not found
>
>
> 
> From: Marcelo Vanzin 
> Sent: Tuesday, February 28, 2017 12:17:49 PM
> To: Joe Olson
> Cc: user@spark.apache.org
> Subject: Re: spark-submit question
>
> Everything after the jar path is passed to the main class as
> parameters. So if it's not working you're probably doing something
> wrong in your code (that you haven't posted).
>
> On Tue, Feb 28, 2017 at 7:05 AM, Joe Olson  wrote:
>> For spark-submit, I know I can submit application level command line
>> parameters to my .jar.
>>
>>
>> However, can I prefix them with switches? My command line params are
>> processed in my applications using JCommander. I've tried several
>> variations
>> of the below with no success.
>>
>>
>> An example of what I am trying to do is below in the --num-decimals
>> argument.
>>
>>
>> ./bin/spark-submit \
>>   --class org.apache.spark.examples.SparkPi \
>>   --master spark://207.184.161.138:7077 \
>>   --deploy-mode cluster \
>>   --supervise \
>>   --executor-memory 20G \
>>   --total-executor-cores 100 \
>>   /path/to/examples.jar \
>>   --num-decimals=1000 \
>>   --second-argument=Arg2
>>
>>
>
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Monitoring ongoing Spark Job when run in Yarn Cluster mode

2017-03-13 Thread Marcelo Vanzin
It's linked from the YARN RM's Web UI (see the "Application Master"
link for the running application).
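
If you prefer the command line, a quick sketch (host, port and application id
are illustrative):

    # list running applications along with their tracking URLs
    yarn application -list -appStates RUNNING

    # the Spark UI of a running app is served through the RM proxy, e.g.
    # http://<resourcemanager-host>:8088/proxy/application_1489000000000_0001/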

On Mon, Mar 13, 2017 at 6:53 AM, Sourav Mazumder
 wrote:
> Hi,
>
> Is there a way to monitor an ongoing Spark Job when running in Yarn Cluster
> mode ?
>
> In my understanding, in YARN cluster mode the Spark monitoring UI for the
> ongoing job would not be available on port 4040. So is there an alternative?
>
> Regards,
> Sourav



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Problem with Java and Scala interoperability // streaming

2017-04-19 Thread Marcelo Vanzin
Why are you not using JavaStreamingContext if you're writing Java?

On Wed, Apr 19, 2017 at 1:42 PM, kant kodali  wrote:
> Hi All,
>
> I get the following errors whichever way I try, either lambda or generics.
> I am using
> Spark 2.1 and Scala 2.11.8
>
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, () ->
> {return createStreamingContext();}, null, false);
>
> ERROR
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, () ->
> {return createStreamingContext();}, null, false);
>
> multiple non-overriding abstract methods found in interface Function0
>
> Note: Some messages have been simplified; recompile with -Xdiags:verbose to
> get full output
>
> 1 error
>
> :compileJava FAILED
>
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, new
> Function0() {
> @Override
> public StreamingContext apply() {
> return createStreamingContext();
> }
> }, null, false);
>
>
> ERROR
>
> is not abstract and does not override abstract method apply$mcV$sp() in
> Function0
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, new
> Function0() {
> ^
>
> 1 error
>
> :compileJava FAILED
>
>
> Thanks!
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Problem with Java and Scala interoperability // streaming

2017-04-19 Thread Marcelo Vanzin
I see a bunch of getOrCreate methods in that class. They were all
added in SPARK-6752, a long time ago.
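
A minimal Java sketch of that variant, assuming createStreamingContext() from
the earlier snippet is changed to build and return a JavaStreamingContext that
checkpoints to hdfsCheckpointDir:

    import org.apache.spark.api.java.function.Function0;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Function0 here is Spark's Java-friendly interface (a single call() method),
    // so a plain lambda compiles and no Scala Function0 is involved.
    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(
        hdfsCheckpointDir,
        () -> createStreamingContext());
    jssc.start();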

On Wed, Apr 19, 2017 at 1:51 PM, kant kodali  wrote:
> There is no getOrCreate for JavaStreamingContext however I do use
> JavaStreamingContext inside createStreamingContext() from my code in the
> previous email.
>
> On Wed, Apr 19, 2017 at 1:46 PM, Marcelo Vanzin  wrote:
>>
>> Why are you not using JavaStreamingContext if you're writing Java?
>>
>> On Wed, Apr 19, 2017 at 1:42 PM, kant kodali  wrote:
>> > Hi All,
>> >
>> > I get the following errors whichever way I try either lambda or
>> > generics. I
>> > am using
>> > spark 2.1 and scalla 2.11.8
>> >
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > () ->
>> > {return createStreamingContext();}, null, false);
>> >
>> > ERROR
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > () ->
>> > {return createStreamingContext();}, null, false);
>> >
>> > multiple non-overriding abstract methods found in interface Function0
>> >
>> > Note: Some messages have been simplified; recompile with -Xdiags:verbose
>> > to
>> > get full output
>> >
>> > 1 error
>> >
>> > :compileJava FAILED
>> >
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > new
>> > Function0() {
>> > @Override
>> > public StreamingContext apply() {
>> > return createStreamingContext();
>> > }
>> > }, null, false);
>> >
>> >
>> > ERROR
>> >
>> > is not abstract and does not override abstract method apply$mcV$sp() in
>> > Function0
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > new
>> > Function0() {
>> > ^
>> >
>> > 1 error
>> >
>> > :compileJava FAILED
>> >
>> >
>> > Thanks!
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
Remote jars are added to executors' classpaths, but not the driver's.
In YARN cluster mode, they would also be added to the driver's class
path.
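
A sketch of the cluster-mode case, where the remote jar also lands on the
driver's classpath (paths are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --jars hdfs:///libs/extra-lib.jar \
      --class com.example.Main \
      hdfs:///apps/my-app.jar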

On Tue, May 2, 2017 at 8:43 AM, Nan Zhu  wrote:
> Hi, all
>
> For some reason, I tried to pass in an HDFS path to the --jars option in
> spark-submit
>
> According to the document,
> http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management,
> --jars would accept remote path
>
> However, in the implementation,
> https://github.com/apache/spark/blob/c622a87c44e0621e1b3024fdca9b2aa3c508615b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L757,
> it does not look that way
>
> Did I miss anything?
>
> Best,
>
> Nan



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
On Tue, May 2, 2017 at 9:07 AM, Nan Zhu  wrote:
> I have no easy way to pass jar path to those forked Spark
> applications? (except that I download jar from a remote path to a local temp
> dir after resolving some permission issues, etc.?)

Yes, that's the only way currently in client mode.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Shuffle Encryption

2017-05-12 Thread Marcelo Vanzin
http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior

All the options you need to know are there.
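
As a starting point, a minimal spark-defaults.conf sketch of the I/O (shuffle)
encryption settings added for SPARK-5682; double-check the exact names and
defaults against the page above:

    spark.authenticate               true
    spark.io.encryption.enabled      true
    spark.io.encryption.keySizeBits  128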

On Fri, May 12, 2017 at 9:11 AM, Shashi Vishwakarma
 wrote:
> Hi
>
> I was doing research on encrypting spark shuffle data and found that Spark
> 2.1 has got that feature.
>
> https://issues.apache.org/jira/browse/SPARK-5682
>
> Does anyone have more documentation around it? How do I use this
> feature in a real production environment, keeping in mind that I need to
> secure the Spark job?
>
> Thanks
> Shashi



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: scalastyle violation on mvn install but not on mvn package

2017-05-17 Thread Marcelo Vanzin
scalastyle runs on the "verify" phase, which is after package but
before install.
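
If the goal is simply to get install through locally, one option is to skip the
check, assuming the usual skip property of the scalastyle-maven-plugin is in
effect for your pom:

    ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 \
      -DskipTests -Dscalastyle.skip=true clean install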

On Wed, May 17, 2017 at 5:47 PM, yiskylee  wrote:
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> package
> works, but
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> install
> triggers scalastyle violation error.
>
> Is the scalastyle check not used on package but only on install? To install,
> should I turn off "failOnViolation" in the pom?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkAppHandle - get Input and output streams

2017-05-18 Thread Marcelo Vanzin
On Thu, May 18, 2017 at 10:10 AM, Nipun Arora  wrote:
> I wanted to know how to get the the input and output streams from
> SparkAppHandle?

You can't. You can redirect the output, but not directly get the streams.
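
A minimal Java sketch of the redirect approach on SparkLauncher (paths and
class names are illustrative):

    import java.io.File;
    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchWithRedirect {
      public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
            .setAppResource("/path/to/app.jar")        // illustrative
            .setMainClass("com.example.Main")          // illustrative
            .redirectOutput(new File("/tmp/app.out"))  // child stdout -> file
            .redirectError(new File("/tmp/app.err"))   // child stderr -> file
            .startApplication();
        // the handle still exposes state, app id, stop()/kill(), etc.
      }
    }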

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkAppHandle.Listener.infoChanged behaviour

2017-06-04 Thread Marcelo Vanzin
On Sat, Jun 3, 2017 at 7:16 PM, Mohammad Tariq  wrote:
> I am having a bit of difficulty in understanding the exact behaviour of
> SparkAppHandle.Listener.infoChanged(SparkAppHandle handle) method. The
> documentation says :
>
> Callback for changes in any information that is not the handle's state.
>
> What exactly is meant by any information here? Apart from state other pieces
> of information I can see is ID

So, you answered your own question.

If there's ever any new kind of information, it would use the same event.
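
For context, a minimal listener sketch; in practice the application ID is the
main piece of non-state information surfaced today:

    import org.apache.spark.launcher.SparkAppHandle;

    SparkAppHandle.Listener listener = new SparkAppHandle.Listener() {
      @Override
      public void stateChanged(SparkAppHandle handle) {
        System.out.println("state: " + handle.getState());
      }
      @Override
      public void infoChanged(SparkAppHandle handle) {
        // fires when non-state info changes, e.g. the app ID becoming known
        System.out.println("app id: " + handle.getAppId());
      }
    };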

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Marcelo Vanzin
That thread looks like the connection between the Spark process and
jvisualvm. It's expected to show up high in the sampling results if the
app is not doing much else.

On Fri, Jun 23, 2017 at 10:46 AM, Reth RM  wrote:
> Running a Spark job on a local machine, the profiler results indicate that the
> highest time is spent in sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.
> Screenshot of the profiler result can be seen here: https://jpst.it/10i-V
>
> The Spark job (program) is performing IO (the sc.wholeTextFile method of the
> Spark API): it reads files from the local file system and analyses the text to
> obtain tokens.
>
> Any thoughts and suggestions?
>
> Thanks.
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
Spark distributes your application jar for you.

On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh
 wrote:
> hi guys,
>
>
> an uber/fat jar file has been created to run with Spark in CDH YARN client
> mode.
>
> As usual, the job is submitted to the edge node.
>
> does the jar file have to be placed in the same directory where Spark is
> running in the cluster to make it work?
>
> Also, what will happen if, say, out of 9 nodes running Spark, 3 have not got
> the jar file? Will that job fail or will it carry on on the remaining 6 nodes
> that have that jar file?
>
> thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
The YARN backend distributes all files and jars you submit with your
application.
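
In other words, something like the sketch below is enough; neither the
application jar nor the --jars entries have to pre-exist on the worker nodes
(paths are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode client \
      --jars /local/path/dep1.jar,/local/path/dep2.jar \
      --class com.example.Main \
      /local/path/app-fat.jar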

On Mon, Jul 17, 2017 at 10:45 AM, Mich Talebzadeh
 wrote:
> thanks guys.
>
> just to clarify, let us assume I am doing spark-submit as below:
>
> ${SPARK_HOME}/bin/spark-submit \
> --packages ${PACKAGES} \
> --driver-memory 2G \
> --num-executors 2 \
> --executor-memory 2G \
> --executor-cores 2 \
> --master yarn \
> --deploy-mode client \
> --conf "${SCHEDULER}" \
> --conf "${EXTRAJAVAOPTIONS}" \
> --jars ${JARS} \
> --class "${FILE_NAME}" \
> --conf "${SPARKUIPORT}" \
> --conf "${SPARKDRIVERPORT}" \
> --conf "${SPARKFILESERVERPORT}" \
> --conf "${SPARKBLOCKMANAGERPORT}" \
> --conf "${SPARKKRYOSERIALIZERBUFFERMAX}" \
> ${JAR_FILE}
>
> The ${JAR_FILE} is the one. As I understand Spark should distribute that
> ${JAR_FILE} to each container?
>
> Also --jars ${JARS} are the list of normal jar files that need to exist in
> the same directory on each executor node?
>
> cheers,
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 17 July 2017 at 18:18, ayan guha  wrote:
>>
>> Hi Mitch
>>
>> your jar file can be anywhere in the file system, including hdfs.
>>
>> If using yarn, preferably use cluster mode in terms of deployment.
>>
>> Yarn will distribute the jar to each container.
>>
>> Best
>> Ayan
>>
>> On Tue, 18 Jul 2017 at 2:17 am, Marcelo Vanzin 
>> wrote:
>>>
>>> Spark distributes your application jar for you.
>>>
>>> On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh
>>>  wrote:
>>> > hi guys,
>>> >
>>> >
>>> > an uber/fat jar file has been created to run with spark in CDH yarc
>>> > client
>>> > mode.
>>> >
>>> > As usual job is submitted to the edge node.
>>> >
>>> > does the jar file has to be placed in the same directory ewith spark is
>>> > running in the cluster to make it work?
>>> >
>>> > Also what will happen if say out of 9 nodes running spark, 3 have not
>>> > got
>>> > the jar file. will that job fail or it will carry on on the fremaing 6
>>> > nodes
>>> > that have that jar file?
>>> >
>>> > thanks
>>> >
>>> > Dr Mich Talebzadeh
>>> >
>>> >
>>> >
>>> > LinkedIn
>>> >
>>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >
>>> >
>>> >
>>> > http://talebzadehmich.wordpress.com
>>> >
>>> >
>>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> > loss, damage or destruction of data or any other property which may
>>> > arise
>>> > from relying on this email's technical content is explicitly
>>> > disclaimed. The
>>> > author will in no case be liable for any monetary damages arising from
>>> > such
>>> > loss, damage or destruction.
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>> --
>> Best Regards,
>> Ayan Guha
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
Yes.

On Mon, Jul 17, 2017 at 10:47 AM, Mich Talebzadeh
 wrote:
> thanks Marcelo.
>
> are these files distributed through hdfs?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 17 July 2017 at 18:46, Marcelo Vanzin  wrote:
>>
>> The YARN backend distributes all files and jars you submit with your
>> application.
>>
>> On Mon, Jul 17, 2017 at 10:45 AM, Mich Talebzadeh
>>  wrote:
>> > thanks guys.
>> >
>> > just to clarify let us assume i am doing spark-submit as below:
>> >
>> > ${SPARK_HOME}/bin/spark-submit \
>> > --packages ${PACKAGES} \
>> > --driver-memory 2G \
>> > --num-executors 2 \
>> > --executor-memory 2G \
>> > --executor-cores 2 \
>> > --master yarn \
>> > --deploy-mode client \
>> > --conf "${SCHEDULER}" \
>> > --conf "${EXTRAJAVAOPTIONS}" \
>> > --jars ${JARS} \
>> > --class "${FILE_NAME}" \
>> > --conf "${SPARKUIPORT}" \
>> > --conf "${SPARKDRIVERPORT}" \
>> > --conf "${SPARKFILESERVERPORT}" \
>> > --conf "${SPARKBLOCKMANAGERPORT}" \
>> > --conf "${SPARKKRYOSERIALIZERBUFFERMAX}" \
>> > ${JAR_FILE}
>> >
>> > The ${JAR_FILE} is the one. As I understand Spark should distribute that
>> > ${JAR_FILE} to each container?
>> >
>> > Also --jars ${JARS} are the list of normal jar files that need to exist
>> > in
>> > the same directory on each executor node?
>> >
>> > cheers,
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> > arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The
>> > author will in no case be liable for any monetary damages arising from
>> > such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
>> > On 17 July 2017 at 18:18, ayan guha  wrote:
>> >>
>> >> Hi Mitch
>> >>
>> >> your jar file can be anywhere in the file system, including hdfs.
>> >>
>> >> If using yarn, preferably use cluster mode in terms of deployment.
>> >>
>> >> Yarn will distribute the jar to each container.
>> >>
>> >> Best
>> >> Ayan
>> >>
>> >> On Tue, 18 Jul 2017 at 2:17 am, Marcelo Vanzin 
>> >> wrote:
>> >>>
>> >>> Spark distributes your application jar for you.
>> >>>
>> >>> On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh
>> >>>  wrote:
>> >>> > hi guys,
>> >>> >
>> >>> >
>> >>> > an uber/fat jar file has been created to run with spark in CDH yarc
>> >>> > client
>> >>> > mode.
>> >>> >
>> >>> > As usual job is submitted to the edge node.
>> >>> >
>> >>> > does the jar file has to be placed in the same directory ewith spark
>> >>> > is
>> >>> > running in the cluster to make it work?
>> >>> >
>> >>> > Also what will happen if say out of 9 nodes running spark, 3 have
>> >>> > not
>> >>> > got
>> >>> > the jar file. will that job fail or it will carry on on the fremaing
>> >>> > 6
>> >>> > nodes
>> >>> > that have that jar file?
>> >>> >
>> >>> > thanks
>> >>> >
>> >>> > Dr Mich Talebzadeh
>> >>> >
>> >>> >
>> >>> >
>> >>> > LinkedIn
>> >>> >
>> >>> >
>> >>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>> >
>> >>> >
>> >>> >
>> >>> > http://talebzadehmich.wordpress.com
>> >>> >
>> >>> >
>> >>> > Disclaimer: Use it at your own risk. Any and all responsibility for
>> >>> > any
>> >>> > loss, damage or destruction of data or any other property which may
>> >>> > arise
>> >>> > from relying on this email's technical content is explicitly
>> >>> > disclaimed. The
>> >>> > author will in no case be liable for any monetary damages arising
>> >>> > from
>> >>> > such
>> >>> > loss, damage or destruction.
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Marcelo
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>>
>> >> --
>> >> Best Regards,
>> >> Ayan Guha
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark history server running on Mongo

2017-07-18 Thread Marcelo Vanzin
See SPARK-18085. That has many of the same goals re: SHS resource
usage, and it also provides a (currently non-public) API where you could
just create a MongoDB implementation if you want.

On Tue, Jul 18, 2017 at 12:56 AM, Ivan Sadikov  wrote:
> Hello everyone!
>
> I have been working on a Spark history server that uses MongoDB as a datastore
> for processed events, iterating on the idea that the Spree project uses for the
> Spark UI. The project was originally designed to improve on the standalone
> history server with a reduced memory footprint.
>
> Project lives here: https://github.com/lightcopy/history-server
>
> These are just very early days of the project, sort of pre-alpha (some
> features are missing, and metrics in some failed jobs cases are
> questionable). Code is being tested on several 8gb and 2gb logs and aims to
> lower resource usage since we run history server together with several other
> systems.
>
> Would greatly appreciate any feedback on repository (issues/pull
> requests/suggestions/etc.). Thanks a lot!
>
>
> Cheers,
>
> Ivan
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark history server running on Mongo

2017-07-19 Thread Marcelo Vanzin
On Tue, Jul 18, 2017 at 7:21 PM, Ivan Sadikov  wrote:
> Repository that I linked to does not require rebuilding Spark and could be
> used with current distribution, which is preferable in my case.

Fair enough, although that means that you're re-implementing the Spark
UI, which makes that project have to constantly be modified to keep up
with UI changes in Spark (or create its own UI and forget about what
Spark does). Which is what Spree does too.

In the long term I believe having these sort of enhancements in Spark
itself would benefit more people.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
On Wed, Jul 19, 2017 at 11:19 AM, Udit Mehrotra
 wrote:
> spark.network.crypto.saslFallback false
> spark.authenticate   true
>
> This seems to work fine with Spark's internal shuffle service. However,
> when I try it with YARN's external shuffle service, the executors are
> unable to register with the shuffle service, as it still expects SASL
> authentication. Here is the error I get:
>
> Can someone confirm that this is expected behavior? Or provide some
> guidance, on how I can make it work with external shuffle service ?

Yes, that's the expected behavior, since you disabled SASL fallback in
your configuration. If you set it back on, then you can talk to the
old shuffle service.

Or you could upgrade the version of the shuffle service running on
your YARN cluster so that it also supports the new auth mechanism.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Please include the list on your replies, so others can benefit from
the discussion too.

On Wed, Jul 19, 2017 at 11:43 AM, Udit Mehrotra
 wrote:
> Hi Marcelo,
>
> Thanks a lot for confirming that. Can you explain what you mean by upgrading
> the version of the shuffle service? Won't it automatically use the
> corresponding class from Spark 2.2.0 to start the external shuffle service?

That depends on how you deploy your shuffle service. Normally YARN
will have no idea that your application is using a new Spark - it will
still have the old version of the service jar in its classpath.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Well, how did you install the Spark shuffle service on YARN? It's not
part of YARN.

If you really have the Spark 2.2 shuffle service jar deployed in your
YARN service, then perhaps you didn't configure it correctly to use
the new auth mechanism.

On Wed, Jul 19, 2017 at 12:47 PM, Udit Mehrotra
 wrote:
> Sorry about that. Will keep the list in my replies.
>
> So, just to clarify, I am not using an older version of Spark's shuffle
> service. This is a brand new cluster with just Spark 2.2.0 installed
> alongside Hadoop 2.7.3. Could there be anything else I am missing, or anything
> I can try differently?
>
>
> Thanks !
>
>
> On Wed, Jul 19, 2017 at 12:03 PM, Marcelo Vanzin 
> wrote:
>>
>> Please include the list on your replies, so others can benefit from
>> the discussion too.
>>
>> On Wed, Jul 19, 2017 at 11:43 AM, Udit Mehrotra
>>  wrote:
>> > Hi Marcelo,
>> >
>> > Thanks a lot for confirming that. Can you explain what you mean by
>> > upgrading
>> > the version of shuffle service ? Wont it automatically use the
>> > corresponding
>> > class from spark 2.2.0 to start the external shuffle service ?
>>
>> That depends on how you deploy your shuffle service. Normally YARN
>> will have no idea that your application is using a new Spark - it will
>> still have the old version of the service jar in its classpath.
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
 wrote:
> Is there any additional configuration I need for external shuffle besides
> setting the following:
> spark.network.crypto.enabled true
> spark.network.crypto.saslFallback false
> spark.authenticate   true

Have you set these options on the shuffle service configuration too
(which is the YARN xml config file, not spark-defaults.conf)?

If you have there might be an issue, and you should probably file a
bug and include your NM's log file.
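
For reference, a sketch of what that looks like in yarn-site.xml on the
NodeManagers; the property names are the same settings discussed above:

    <!-- yarn-site.xml: the external shuffle service reads its spark.* settings here -->
    <property>
      <name>spark.authenticate</name>
      <value>true</value>
    </property>
    <property>
      <name>spark.network.crypto.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>spark.network.crypto.saslFallback</name>
      <value>false</value>
    </property>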

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Hmm... that's not enough info and logs are intentionally kept silent
to avoid flooding, but if you enable DEBUG level logging for
org.apache.spark.network.crypto in both YARN and the Spark app, that
might provide more info.

On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
 wrote:
> So I added these settings in yarn-site.xml as well. Now I get a completely
> different error, but at least it seems like it is using the crypto library:
>
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
> Reason: Unable to create executor due to Unable to register with external
> shuffle server due to : java.lang.IllegalArgumentException: Authentication
> failed.
> at
> org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
> at
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
> at
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>
> Any clue about this ?
>
>
> On Wed, Jul 19, 2017 at 1:13 PM, Marcelo Vanzin  wrote:
>>
>> On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
>>  wrote:
>> > Is there any additional configuration I need for external shuffle
>> > besides
>> > setting the following:
>> > spark.network.crypto.enabled true
>> > spark.network.crypto.saslFallback false
>> > spark.authenticate   true
>>
>> Have you set these options on the shuffle service configuration too
>> (which is the YARN xml config file, not spark-defaults.conf)?
>>
>> If you have there might be an issue, and you should probably file a
>> bug and include your NM's log file.
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-20 Thread Marcelo Vanzin
Hmm... I tried this with the new shuffle service (I generally have an
old one running) and also see failures. I also noticed some odd things
in your logs that I'm also seeing in mine, but it's better to track
these in a bug instead of e-mail.

Please file a bug and attach your logs there, I'll take a look at this.

On Thu, Jul 20, 2017 at 2:06 PM, Udit Mehrotra
 wrote:
> Hi Marcelo,
>
> I ran with setting DEBUG level logging for 'org.apache.spark.network.crypto'
> for both Spark and Yarn.
>
> However, the DEBUG logs still do not convey anything meaningful. Please find
> it attached. Can you please take a quick look, and let me know if you see
> anything suspicious ?
>
> If not, do you think I should open a JIRA for this ?
>
> Thanks !
>
> On Wed, Jul 19, 2017 at 3:14 PM, Marcelo Vanzin  wrote:
>>
>> Hmm... that's not enough info and logs are intentionally kept silent
>> to avoid flooding, but if you enable DEBUG level logging for
>> org.apache.spark.network.crypto in both YARN and the Spark app, that
>> might provide more info.
>>
>> On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
>>  wrote:
>> > So I added these settings in yarn-site.xml as well. Now I get a
>> > completely
>> > different error, but atleast it seems like it is using the crypto
>> > library:
>> >
>> > ExecutorLostFailure (executor 1 exited caused by one of the running
>> > tasks)
>> > Reason: Unable to create executor due to Unable to register with
>> > external
>> > shuffle server due to : java.lang.IllegalArgumentException:
>> > Authentication
>> > failed.
>> > at
>> >
>> > org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
>> > at
>> >
>> > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
>> > at
>> >
>> > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
>> > at
>> >
>> > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>> >
>> > Any clue about this ?
>> >
>> >
>> > On Wed, Jul 19, 2017 at 1:13 PM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
>> >>  wrote:
>> >> > Is there any additional configuration I need for external shuffle
>> >> > besides
>> >> > setting the following:
>> >> > spark.network.crypto.enabled true
>> >> > spark.network.crypto.saslFallback false
>> >> > spark.authenticate   true
>> >>
>> >> Have you set these options on the shuffle service configuration too
>> >> (which is the YARN xml config file, not spark-defaults.conf)?
>> >>
>> >> If you have there might be an issue, and you should probably file a
>> >> bug and include your NM's log file.
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-20 Thread Marcelo Vanzin
Also, things seem to work with all your settings if you disable use of
the shuffle service (which also means no dynamic allocation), if that
helps you make progress in what you wanted to do.

On Thu, Jul 20, 2017 at 4:25 PM, Marcelo Vanzin  wrote:
> Hmm... I tried this with the new shuffle service (I generally have an
> old one running) and also see failures. I also noticed some odd things
> in your logs that I'm also seeing in mine, but it's better to track
> these in a bug instead of e-mail.
>
> Please file a bug and attach your logs there, I'll take a look at this.
>
> On Thu, Jul 20, 2017 at 2:06 PM, Udit Mehrotra
>  wrote:
>> Hi Marcelo,
>>
>> I ran with setting DEBUG level logging for 'org.apache.spark.network.crypto'
>> for both Spark and Yarn.
>>
>> However, the DEBUG logs still do not convey anything meaningful. Please find
>> it attached. Can you please take a quick look, and let me know if you see
>> anything suspicious ?
>>
>> If not, do you think I should open a JIRA for this ?
>>
>> Thanks !
>>
>> On Wed, Jul 19, 2017 at 3:14 PM, Marcelo Vanzin  wrote:
>>>
>>> Hmm... that's not enough info and logs are intentionally kept silent
>>> to avoid flooding, but if you enable DEBUG level logging for
>>> org.apache.spark.network.crypto in both YARN and the Spark app, that
>>> might provide more info.
>>>
>>> On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
>>>  wrote:
>>> > So I added these settings in yarn-site.xml as well. Now I get a
>>> > completely
>>> > different error, but atleast it seems like it is using the crypto
>>> > library:
>>> >
>>> > ExecutorLostFailure (executor 1 exited caused by one of the running
>>> > tasks)
>>> > Reason: Unable to create executor due to Unable to register with
>>> > external
>>> > shuffle server due to : java.lang.IllegalArgumentException:
>>> > Authentication
>>> > failed.
>>> > at
>>> >
>>> > org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
>>> > at
>>> >
>>> > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
>>> > at
>>> >
>>> > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
>>> > at
>>> >
>>> > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>>> >
>>> > Any clue about this ?
>>> >
>>> >
>>> > On Wed, Jul 19, 2017 at 1:13 PM, Marcelo Vanzin 
>>> > wrote:
>>> >>
>>> >> On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
>>> >>  wrote:
>>> >> > Is there any additional configuration I need for external shuffle
>>> >> > besides
>>> >> > setting the following:
>>> >> > spark.network.crypto.enabled true
>>> >> > spark.network.crypto.saslFallback false
>>> >> > spark.authenticate   true
>>> >>
>>> >> Have you set these options on the shuffle service configuration too
>>> >> (which is the YARN xml config file, not spark-defaults.conf)?
>>> >>
>>> >> If you have there might be an issue, and you should probably file a
>>> >> bug and include your NM's log file.
>>> >>
>>> >> --
>>> >> Marcelo
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>
>>
>
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


