Spark 2.0 preview - How to configure warehouse for Catalyst? always pointing to /user/hive/warehouse

2016-06-17 Thread Andrew Lee
From branch-2.0 (the Spark 2.0.0 preview), I found that no matter how you configure

spark.sql.warehouse.dir

it always falls back to the default path, which is /user/hive/warehouse.
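For reference, here is a minimal sketch of the kind of configuration being attempted (whether through spark-defaults.conf or programmatically; the path and table name below are just examples, not from my actual setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-test")
  .config("spark.sql.warehouse.dir", "hdfs:///tmp/my-warehouse")  // example path
  .enableHiveSupport()
  .getOrCreate()

// The new table directory still ends up under /user/hive/warehouse
spark.sql("CREATE TABLE warehouse_probe (id INT)")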


In the code, I noticed that at line 45 of

./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala


object SimpleAnalyzer extends Analyzer(
  new SessionCatalog(
    new InMemoryCatalog,
    EmptyFunctionRegistry,
    new SimpleCatalystConf(caseSensitiveAnalysis = true)),
  new SimpleCatalystConf(caseSensitiveAnalysis = true))


It always initializes with SimpleCatalystConf, which applies the hardcoded default value defined at line 58 of


./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystConf.scala


case class SimpleCatalystConf(
    caseSensitiveAnalysis: Boolean,
    orderByOrdinal: Boolean = true,
    groupByOrdinal: Boolean = true,
    optimizerMaxIterations: Int = 100,
    optimizerInSetConversionThreshold: Int = 10,
    maxCaseBranchesForCodegen: Int = 20,
    runSQLonFile: Boolean = true,
    warehousePath: String = "/user/hive/warehouse")
  extends CatalystConf


I couldn't find any other way to get around this.


It looks like this was fixed by SPARK-15387 in this commit:

https://github.com/apache/spark/commit/9c817d027713859cac483b4baaaf8b53c040ad93

(Commit: apache/spark@9c817d0, "[SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory".)


Just want to confirm this was the root cause and the PR that fixed it. Thanks.






Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-09 Thread Andrew Lee
In fact, it does require the ojdbc JAR from Oracle, which in turn requires a username and password to download. This was added as part of the test scope for the Oracle Docker integration tests.


I noticed this PR and commit in branch-2.0 via
https://issues.apache.org/jira/browse/SPARK-12941.

In the comment, I'm not sure what is meant by installing the JAR locally for the Spark QA test run. If that is the case, it means someone downloaded the JAR from Oracle and manually added it to the local machine that builds Spark branch-2.0, or to an internal Maven repository that serves this ojdbc JAR.
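If that is the case, the JAR was presumably installed into the local repository with something along these lines (a sketch; the exact file name and version used by the Spark QA machines are not documented in the PR):

mvn install:install-file -Dfile=ojdbc6.jar -DgroupId=com.oracle \
  -DartifactId=ojdbc6 -Dversion=11.2.0.1.0 -Dpackaging=jar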




commit 8afe49141d9b6a603eb3907f32dce802a3d05172

Author: thomastechs 

Date:   Thu Feb 25 22:52:25 2016 -0800


[SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map 
string datatypes to Oracle VARCHAR datatype



## What changes were proposed in this pull request?



This pull request is for the fix of SPARK-12941, creating a data type mapping
to Oracle for the corresponding data type "StringType" from a dataframe. This
PR is for the master branch fix, whereas another PR was already tested with
branch 1.4.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)

This patch was tested using the Oracle docker. Created a new integration
suite for the same. The oracle.jdbc jar was to be downloaded from the maven
repository. Since there was no jdbc jar available in the maven repository, the
jar was downloaded from the oracle site manually and installed locally; thus
tested. So, for the SparkQA test case run, the ojdbc jar might be manually placed
in the local maven repository (com/oracle/ojdbc6/11.2.0.2.0) while the Spark QA
tests run.



Author: thomastechs 



Closes #11306 from thomastechs/master.




Meanwhile, I also noticed that the ojdbc groupId provided by Oracle (official website: https://blogs.oracle.com/dev2dev/entry/how_to_get_oracle_jdbc) is different:




<dependency>
  <groupId>com.oracle.jdbc</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.4</version>
  <scope>test</scope>
</dependency>



as opposed to the one in Spark branch-2.0, in

external/docker-integration-tests/pom.xml




<dependency>
  <groupId>com.oracle</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.1.0</version>
  <scope>test</scope>
</dependency>





The version is out of date and not available from the Oracle Maven repo. The PR was created a while back, so it may simply predate Oracle's Maven repository announcement in that blog.


This is just my inference based on what I see from git and JIRA; however, it does look like pom.xml needs a patch to apply the correct groupId and version for the ojdbc6 driver.
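Concretely, the change I have in mind for external/docker-integration-tests/pom.xml would look roughly like the following, using the coordinates from Oracle's blog (a sketch, not a tested patch):

<dependency>
  <groupId>com.oracle.jdbc</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.4</version>
  <scope>test</scope>
</dependency>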


Thoughts?



(Blog reference: "Get Oracle JDBC drivers and UCP from Oracle Maven Repository (without IDEs)", by Nirmala Sundarappa-Oracle, Feb 15, 2016, blogs.oracle.com)

From: Mich Talebzadeh 
Sent: Tuesday, May 3, 2016 1:04 AM
To: Luciano Resende
Cc: Hien Luu; ☼ R Nair (रविशंकर नायर); user
Subject: Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

which version of Spark are you using?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 May 2016 at 02:13, Luciano Resende wrote:
You might have a settings.xml that is forcing your internal Maven repository to 
be the mirror of external repositories and thus not finding the dependency.

On Mon, May 2, 2016 at 6:11 PM, Hien Luu wrote:
No, I am not. I am considering downloading it manually and placing it in my local repository.

On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) wrote:

Oracle jdbc is not part of Maven repository,  are you keeping a downloaded file 
in your local repo?

Best, RS

On May 2, 2016 8:51 PM, "Hien Luu" wrote:
Hi all,

I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0. It kept getting "Operation timed out" while building the Spark Project Docker Integration Tests module (see the error below).

Has anyone run into this problem before? If so, how did you resolve it?

[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [  2.423 s]

[INFO] Spark Project Test Tags  SUCCESS [  0.712 s]

[INFO] Spark Project Sketch ... SUCCESS [  0.498 s]

[INFO] Spark Project Networking ... SUCCESS [  1.743 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.587 s]

[INFO] Spark Project Unsafe ... SUCCESS [  0.503 s]


RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew Or,
Yes, NodeManager was restarted, I also checked the logs to see if the JARs 
appear in the CLASSPATH.
I have also downloaded the binary distribution and use the JAR 
spark-1.4.1-bin-hadoop2.4/lib/spark-1.4.1-yarn-shuffle.jar without success.
Has anyone successfully enabled spark_shuffle via the documentation at https://spark.apache.org/docs/1.4.1/job-scheduling.html?
I'm testing it on Hadoop 2.4.1.
Any feedback or suggestions are appreciated, thanks.

Date: Fri, 17 Jul 2015 15:35:29 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi all,
Did you forget to restart the node managers after editing yarn-site.xml by any 
chance?
-Andrew
2015-07-17 8:32 GMT-07:00 Andrew Lee alee...@hotmail.com:



I have encountered the same problem after following the document.
Here's my spark-defaults.conf:

spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.schedulerBacklogTimeout 10

and yarn-site.xml configured:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
and deployed the 2 JARs to the NodeManager's classpath /opt/hadoop/share/hadoop/mapreduce/ (I also checked the NodeManager log and the JARs appear in the classpath). I notice that the JAR location is not the same as what the 1.4 documentation says; I found them under network/yarn/target and network/shuffle/target/ after building with -Phadoop-2.4 -Psparkr -Pyarn -Phive -Phive-thriftserver in maven.

spark-network-yarn_2.10-1.4.1.jar
spark-network-shuffle_2.10-1.4.1.jar


and still getting the following exception.
Exception in thread ContainerLauncher #0 java.lang.Error: 
org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.foo.com
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.foo.com
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
Not sure what else I am missing here or doing wrong.
Appreciate any insights or feedback, thanks.

Date: Wed, 8 Jul 2015 09:25:39 +0800
Subject: Re: The auxService:spark_shuffle does not exist
From: zjf...@gmail.com
To: rp...@njit.edu
CC: user@spark.apache.org

Did you enable the dynamic resource allocation ? You can refer to this page for 
how to configure spark shuffle service for yarn.
https://spark.apache.org/docs/1.4.0/job-scheduling.html 
On Tue, Jul 7, 2015 at 10:55 PM, roy rp...@njit.edu wrote:
we tried --master yarn-client with no different result.






-- 
Best Regards

Jeff Zhang
  

  

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew,
Thanks for the advice. I didn't see the log lines in the NodeManager, so apparently something was wrong with the yarn-site.xml configuration.
After digging in more, I realized it was a user error. I'm sharing this so others may avoid the same mistake.
When I reviewed the configurations, I noticed that there was another yarn.nodemanager.aux-services property set in mapred-site.xml. It turns out that mapred-site.xml overrides the yarn.nodemanager.aux-services property in yarn-site.xml; because of this, the spark_shuffle service was never enabled. :(
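In case it helps anyone else, a quick way to spot this kind of override is to grep both files for the property (the conf path below is an example; adjust it to your layout):

grep -B1 -A2 'yarn.nodemanager.aux-services' /etc/hadoop/conf/yarn-site.xml /etc/hadoop/conf/mapred-site.xml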

After deleting the redundant, invalid properties from mapred-site.xml, it started working. I now see the following logs from the NodeManager:

2015-07-21 21:24:44,046 INFO org.apache.spark.network.yarn.YarnShuffleService: 
Initializing YARN shuffle service for Spark
2015-07-21 21:24:44,046 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding 
auxiliary service spark_shuffle, spark_shuffle
2015-07-21 21:24:44,264 INFO org.apache.spark.network.yarn.YarnShuffleService: 
Started YARN shuffle service for Spark on port 7337. Authentication is not 
enabled.

I appreciate all the pointers on where to look. Thanks, problem solved.



Date: Tue, 21 Jul 2015 09:31:50 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi Andrew,
Based on your driver logs, it seems the issue is that the shuffle service is 
actually not running on the NodeManagers, but your application is trying to 
provide a spark_shuffle secret anyway. One way to verify whether the shuffle 
service is actually started is to look at the NodeManager logs for the 
following lines:
Initializing YARN shuffle service for Spark
Started YARN shuffle service for Spark on port X

These should be logged under the INFO level. Also, could you verify whether all 
the executors have this problem, or just a subset? If even one of the NM 
doesn't have the shuffle service, you'll see the stack trace that you ran into. 
It would be good to confirm whether the yarn-site.xml change is actually 
reflected on all NMs if the log statements above are missing.

Let me know if you can get it working. I've run the shuffle service myself on 
the master branch (which will become Spark 1.5.0) recently following the 
instructions and have not encountered any problems.
-Andrew   

RE: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Lee
I have encountered the same problem after following the document.
Here's my spark-defaults.conf:

spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.schedulerBacklogTimeout 10

and yarn-site.xml configured:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
and deployed the 2 JARs to the NodeManager's classpath /opt/hadoop/share/hadoop/mapreduce/ (I also checked the NodeManager log and the JARs appear in the classpath). I notice that the JAR location is not the same as what the 1.4 documentation says; I found them under network/yarn/target and network/shuffle/target/ after building with -Phadoop-2.4 -Psparkr -Pyarn -Phive -Phive-thriftserver in maven.

spark-network-yarn_2.10-1.4.1.jar
spark-network-shuffle_2.10-1.4.1.jar

and still getting the following exception.
Exception in thread ContainerLauncher #0 java.lang.Error: 
org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host 
alee-ci-2058-slave-2.test.altiscale.com
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host 
alee-ci-2058-slave-2.test.altiscale.com
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
Not sure what else I am missing here or doing wrong.
Appreciate any insights or feedback, thanks.

Date: Wed, 8 Jul 2015 09:25:39 +0800
Subject: Re: The auxService:spark_shuffle does not exist
From: zjf...@gmail.com
To: rp...@njit.edu
CC: user@spark.apache.org

Did you enable the dynamic resource allocation ? You can refer to this page for 
how to configure spark shuffle service for yarn.
https://spark.apache.org/docs/1.4.0/job-scheduling.html 
On Tue, Jul 7, 2015 at 10:55 PM, roy rp...@njit.edu wrote:
we tried --master yarn-client with no different result.






-- 
Best Regards

Jeff Zhang
  

RE: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Andrew Lee
Hi Roberto,
I'm not an EMR person, but it looks like the -h option is deploying the necessary datanucleus JARs for you. The requirements for HiveContext are hive-site.xml and the datanucleus JARs; as long as those two are there and Spark is compiled with -Phive, it should work.
spark-shell runs in yarn-client mode. I'm not sure whether your other application is running under the same mode or a different one. Try specifying yarn-client mode and see if you get the same result as spark-shell.
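For example, something along these lines (the class name, JAR names, and paths below are placeholders, not taken from your project):

./bin/spark-submit --master yarn-client \
  --files /etc/hive/conf/hive-site.xml \
  --jars /path/to/datanucleus-api-jdo.jar,/path/to/datanucleus-core.jar,/path/to/datanucleus-rdbms.jar \
  --class com.example.MyStreamingDriver \
  my-driver-assembly.jar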
From: roberto.coluc...@gmail.com
Date: Wed, 10 Jun 2015 14:32:04 +0200
Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
To: user@spark.apache.org

Hi!
I'm struggling with an issue with Spark 1.3.1 running on YARN, running on an 
AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux 
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, 
etc...). I make use of the AWS emr-bootstrap-action install-spark 
(https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with the 
option/version -v1.3.1e so to get the latest Spark for EMR installed and 
available.
I also have a simple Spark Streaming driver in my project. Such driver is part 
of a larger Maven project: in the pom.xml I'm currently using   
[...]
<scala.binary.version>2.10</scala.binary.version>
<scala.version>2.10.4</scala.version>
<java.version>1.7</java.version>
<spark.version>1.3.1</spark.version>
<hadoop.version>2.4.1</hadoop.version>
[...]
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>

In fact, at compile and build time everything works just fine if, in my driver, 
I have:
-
val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)
< some input reading actions >
< some input transformation actions >
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
sqlContext.sql("<an-HiveQL-query>")
ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)

--- 
What happens is that, right after it has been launched, the driver fails with this exception:
15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at myDriver.scala: <line of the sqlContext.sql(query)>
Caused by: <some stuff>
Caused by: javax.jdo.JDOFatalUserException: Class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Thinking about a wrong Hive installation/configuration or libs/classpath 
definition, I SSHed into the cluster and launched a spark-shell. Excluding the 
app configuration and StreamingContext usage/definition, I then carried out all 
the actions listed in the driver implementation, in particular all the 
Hive-related ones and they all went through smoothly!

I also tried to use the optional -h argument 
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
 in the 

RE: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
Hi Marcelo,
Exactly what I need to track, thanks for the JIRA pointer.

 Date: Mon, 20 Apr 2015 14:03:55 -0700
 Subject: Re: GSSException when submitting Spark job in yarn-cluster mode with 
 HiveContext APIs on Kerberos cluster
 From: van...@cloudera.com
 To: alee...@hotmail.com
 CC: user@spark.apache.org
 
 I think you want to take a look at:
 https://issues.apache.org/jira/browse/SPARK-6207
 
 On Mon, Apr 20, 2015 at 1:58 PM, Andrew Lee alee...@hotmail.com wrote:
  Hi All,
 
  Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1
 
  Posting this problem to user group first to see if someone is encountering
  the same problem.
 
  When submitting spark jobs that invokes HiveContext APIs on a Kerberos
  Hadoop + YARN (2.4.1) cluster,
  I'm getting this error.
 
  javax.security.sasl.SaslException: GSS initiate failed [Caused by
  GSSException: No valid credentials provided (Mechanism level: Failed to find
  any Kerberos tgt)]
 
  Apparently, the Kerberos ticket is not on the remote data node nor computing
  node since we don't
  deploy Kerberos tickets, and that is not a good practice either. On the
  other hand, we can't just SSH to every machine and run kinit for that users.
  This is not practical and it is insecure.
 
  The point here is that shouldn't there be a delegation token during the doAs
  to use the token instead of the ticket ?
  I'm trying to understand what is missing in Spark's HiveContext API while a
  normal MapReduce job that invokes Hive APIs will work, but not in Spark SQL.
  Any insights or feedback are appreciated.
 
  Anyone got this running without pre-deploying (pre-initializing) all tickets
  node by node? Is this worth filing a JIRA?
 
 
 
  15/03/25 18:59:08 INFO hive.metastore: Trying to connect to metastore with
  URI thrift://alee-cluster.test.testserver.com:9083
  15/03/25 18:59:08 ERROR transport.TSaslTransport: SASL negotiation failure
  javax.security.sasl.SaslException: GSS initiate failed [Caused by
  GSSException: No valid credentials provided (Mechanism level: Failed to find
  any Kerberos tgt)]
  at
  com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
  at
  org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
  at
  org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
  at
  org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
  at
  org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at
  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
  at
  org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
  at
  org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:336)
  at
  org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:214)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at
  sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at
  sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at
  org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
  at
  org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
  at
  org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
  at
  org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
  at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
  at
  org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
  at
  org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:235)
  at
  org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:231)
  at scala.Option.orElse(Option.scala:257)
  at
  org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scala:231)
  at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229)
  at
  org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:229)
  at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229)
  at
  org.apache.spark.sql.hive.HiveMetastoreCatalog.init(HiveMetastoreCatalog.scala:55)
  at
  org.apache.spark.sql.hive.HiveContext$$anon$2.init(HiveContext.scala:253)
  at
  org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:253)
  at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:253)
  at
  org.apache.spark.sql.hive.HiveContext$$anon$4.init(HiveContext.scala:263

RE: SparkSQL + Tableau Connector

2015-02-17 Thread Andrew Lee
: Concurrency mode is disabled, not creating a 
lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, 
properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query 
org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is 
closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create 
temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*')'

15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query 
org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is 
closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query '
cache table test '
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with 
curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 206.4 KB, free 264.8 MB)
I see no way in Tableau to see the cached table test. I think I am missing a step to associate the temp table generated from Spark SQL with the metastore. Any guidance or insights on what I'm missing here would be appreciated.
Thanks for the assistance.
-Todd

On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee alee...@hotmail.com wrote:



Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the 
logs since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800




I'm using MySQL as the metastore DB with Spark 1.2. I simply copied hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in /etc/spark/, and everything works fine now.
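The spark-env.sh change was essentially just appending the driver JAR to the classpath, roughly like this (the JAR path is an example):

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/share/java/mysql-connector-java.jar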
My setup looks like this.
Tableau -> Spark ThriftServer2 -> HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, it still issues Hive queries to HiveServer2, which runs the MR or Tez engine. Is this expected?
I thought it would at least use the Catalyst engine and talk to the underlying HDFS, like what the HiveContext API does to pull the data into an RDD. Did I misunderstand the purpose of Spark ThriftServer2?


Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi
I used this, though it's using an embedded driver and is not a good approach. It works. You can configure some other metastore type also. I have not tried the metastore URIs.

<configuration>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
  <description>URL for the DB</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<!--
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://x.x.x.x:1</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
-->

</configuration>

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist tsind...@gmail.com wrote:
Hi Arush,
So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it. Perhaps my hive-site.xml is wrong; it currently looks like

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-17 Thread Andrew Lee
Hi All,
Just want to give everyone an update on what worked for me. Thanks for Cheng's comment and other people's help.
So what I misunderstood was --driver-class-path and how it relates to --files. I put /etc/hive/hive-site.xml in both --files and --driver-class-path when I started in yarn-cluster mode:
./bin/spark-submit --verbose --queue research --driver-java-options 
-XX:MaxPermSize=8192M --files /etc/hive/hive-site.xml --driver-class-path 
/etc/hive/hive-site.xml --master yarn --deploy-mode cluster 
The problem here is that --files only looks for local files to distribute onto HDFS. --driver-class-path is what goes on the CLASSPATH at runtime, and as you can see, it was trying to look for /etc/hive/hive-site.xml on the container on the remote nodes, which apparently doesn't exist. For some people it may work fine because they deploy the Hive configuration and JARs across their entire cluster, so every node looks the same. But this wasn't my case in a multi-tenant environment or a restricted, secured cluster. So my parameters look like this when I launch it:

./bin/spark-submit --verbose --queue research --driver-java-options 
-XX:MaxPermSize=8192M --files /etc/hive/hive-site.xml --driver-class-path 
hive-site.xml --master yarn --deploy-mode cluster 
So --driver-class-path here only looks at ./hive-site.xml on the remote container, which was already pre-deployed by --files.
This worked for me, and the HiveContext API can now talk to the Hive metastore, and vice versa. Thanks.
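As a quick sanity check (a sketch; your database names will differ), listing the databases from the driver now shows the ones in the shared metastore instead of just an empty local Derby default:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW DATABASES").collect().foreach(println)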


Date: Thu, 5 Feb 2015 16:59:12 -0800
From: lian.cs@gmail.com
To: linlin200...@gmail.com; huaiyin@gmail.com
CC: user@spark.apache.org
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database


  

  
  

Hi Jenny,

You may try to use --files $SPARK_HOME/conf/hive-site.xml --driver-class-path hive-site.xml when submitting your application. The problem is that when running in cluster mode, the driver is actually running in a random container directory on a random executor node. By using --files, you upload hive-site.xml to the container directory; by using --driver-class-path hive-site.xml, you add the file to the classpath (the path is relative to the container directory).

When running in cluster mode, have you tried to check the tables inside the default database? If my guess is right, this should be an empty default database inside the default Derby metastore created by HiveContext when hive-site.xml is missing.

Best,
Cheng

On 8/12/14 5:38 PM, Jenny Zhao wrote:
  
  



  

  

  

  
Hi Yin,

hive-site.xml was copied to spark/conf and is the same as the one under $HIVE_HOME/conf.

Through the hive cli, I don't see any problem, but for Spark in yarn-cluster mode I am not able to switch to a database other than the default one; for yarn-client mode, it works fine.

Thanks!

Jenny

On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai huaiyin@gmail.com wrote:

Hi Jenny,

Have you copied hive-site.xml to the spark/conf directory? If not, can you put it in conf/ and try again?

Thanks,
Yin

On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao linlin200...@gmail.com wrote:

Thanks Yin!

Here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't experience any problem connecting to the metastore through hive, which uses DB2 as the metastore database.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the 
logs since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800




I'm using MySQL as the metastore DB with Spark 1.2. I simply copied hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in /etc/spark/, and everything works fine now.
My setup looks like this.
Tableau -> Spark ThriftServer2 -> HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, it still issues Hive queries to HiveServer2, which runs the MR or Tez engine. Is this expected?
I thought it would at least use the Catalyst engine and talk to the underlying HDFS, like what the HiveContext API does to pull the data into an RDD. Did I misunderstand the purpose of Spark ThriftServer2?


Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi
I used this, though it's using an embedded driver and is not a good approach. It works. You can configure some other metastore type also. I have not tried the metastore URIs.

<configuration>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
  <description>URL for the DB</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<!--
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://x.x.x.x:1</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
-->

</configuration>

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist tsind...@gmail.com wrote:
Hi Arush,
So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it. Perhaps my hive-site.xml is wrong; it currently looks like this:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
Which leads me to believe it is going to pull from the thriftserver from Horton? I will go look at the docs to see if this is right; it is what Horton says to do. Do you have an example hive-site.xml by chance that works with Spark SQL?
I am using 8.3 of tableau with the SparkSQL Connector.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda ar...@sigmoidanalytics.com 
wrote:
BTW what tableau connector are you using?
On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda ar...@sigmoidanalytics.com 
wrote:
I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in spark-sql, right?
If you are not able to find the same tables through Tableau, then thrift is connecting to a different metastore than your spark-shell.
One way to specify a metastore to thrift is to provide the path to hive-site.xml while starting thrift, using --files hive-site.xml.
Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist tsind...@gmail.com wrote:
Arush,
As for #2 do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist tsind...@gmail.com wrote:
Arush,
Thank you, I will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3; thanks for clarifying it for me.
-Todd
On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda ar...@sigmoidanalytics.com 
wrote:
1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files? NO.

2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in spark sql to expose it via spark sql.

3. Does the thriftserver need to be configured to expose these in some fashion (sort of related to question 2)? You would need to configure thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
I have ThriftServer2 up and running; however, I notice that it relays the query to HiveServer2 when I pass hive-site.xml to it.
I'm not sure if this is the expected behavior, but based on what I have up and running, ThriftServer2 invokes HiveServer2, which results in a MapReduce or Tez query. In this case, you could just connect directly to HiveServer2 if Hive is all you need.
If you are a programmer and want to mash up data from Hive with other tables and data in Spark, then Spark ThriftServer2 seems to be a good integration point for some use cases.
Please correct me if I misunderstood the purpose of Spark ThriftServer2.
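For reference, this is roughly how I bring it up and poke at it (ports and paths below are examples):

./sbin/start-thriftserver.sh --master yarn-client --files /etc/hive/conf/hive-site.xml
./bin/beeline -u jdbc:hive2://localhost:10000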

 Date: Thu, 8 Jan 2015 14:49:00 -0700
 From: sjbru...@uwaterloo.ca
 To: user@spark.apache.org
 Subject: Is the Thrift server right for me?
 
 I'm building a system that collects data using Spark Streaming, does some
 processing with it, then saves the data. I want the data to be queried by
 multiple applications, and it sounds like the Thrift JDBC/ODBC server might
 be the right tool to handle the queries. However,  the documentation for the
 Thrift server
 http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
   
 seems to be written for Hive users who are moving to Spark. I never used
 Hive before I started using Spark, so it is not clear to me how best to use
 this.
 
 I've tried putting data into Hive, then serving it with the Thrift server.
 But I have not been able to update the data in Hive without first shutting
 down the server. This is a problem because new data is always being streamed
 in, and so the data must continuously be updated.
 
 The system I'm building is supposed to replace a system that stores the data
 in MongoDB. The dataset has now grown so large that the database index does
 not fit in memory, which causes major performance problems in MongoDB.
 
 If the Thrift server is the right tool for me, how can I set it up for my
 application? If it is not the right tool, what else can I use?
 
 
 
  

RE: hadoopConfiguration for StreamingContext

2015-02-10 Thread Andrew Lee
It looks like this is related to the underlying Hadoop configuration.
Try to deploy the Hadoop configuration with your job via --files and --driver-class-path, or put it in the default /etc/hadoop/conf core-site.xml.
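In core-site.xml that would look roughly like this (values masked; the property names are the standard s3n ones that also show up in the stack trace below):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>XXXX</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>XXXX</value>
</property>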
If that is not an option (depending on how your Hadoop cluster is set up), then hard-code the values via -Dkey=value to see if it works. The downside is that your credentials are exposed in plaintext in the java commands. You can also define them in spark-defaults.conf via the spark.executor.extraJavaOptions property,
e.g. for s3n:

spark.executor.extraJavaOptions -Dfs.s3n.awsAccessKeyId=X -Dfs.s3n.awsSecretAccessKey=

and for s3:

spark.executor.extraJavaOptions -Dfs.s3.awsAccessKeyId=X -Dfs.s3.awsSecretAccessKey=
Hope this works. Or embed them in the s3n path. Not good security practice 
though.

From: mslimo...@gmail.com
Date: Tue, 10 Feb 2015 10:57:47 -0500
Subject: Re: hadoopConfiguration for StreamingContext
To: ak...@sigmoidanalytics.com
CC: u...@spark.incubator.apache.org

Thanks, Akhil.  I had high hopes for #2, but tried all and no luck.  
I was looking at the source and found something interesting.  The Stack Trace 
(below) directs me to FileInputDStream.scala (line 141).  This is version 
1.1.1, btw.  Line 141 has:
  private def fs: FileSystem = {
if (fs_ == null) fs_ = directoryPath.getFileSystem(new Configuration())
fs_
  }
So it looks to me like it doesn't make any attempt to use a configured 
HadoopConf.
Here is the StackTrace:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3n URL, or 
by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties 
(respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy5.initialize(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:141)
at 
org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
at 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)
...

On Tue, Feb 10, 2015 at 10:28 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Try the following:
1. Set the access key and secret key in the sparkContext:

ssc.sparkContext.hadoopConfiguration.set("AWS_ACCESS_KEY_ID", yourAccessKey)
ssc.sparkContext.hadoopConfiguration.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)

2. Set the access key and secret key in the environment before starting your application:

export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

3. Set the access key and secret key inside the hadoop configuration:

val hadoopConf = ssc.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", yourAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", yourSecretKey)

4. You can also try:

val stream = ssc.textFileStream("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")

Thanks
Best Regards

On Tue, Feb 10, 2015 at 8:27 PM, Marc Limotte mslimo...@gmail.com wrote:
I see that SparkContext has a hadoopConfiguration() method, which can be used like this sample I found:

sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX");

But StreamingContext doesn't have the same thing. I want to use a StreamingContext with s3n: text file input, but can't find a way to set the AWS credentials. I also tried (with no success):

- adding the properties to conf/spark-defaults.conf
- $HADOOP_HOME/conf/hdfs-site.xml
- ENV variables
- embedded as user:password in s3n://user:password@... (w/ url encoding)
- setting the conf as above on a new SparkContext and passing that to the StreamingContext constructor: StreamingContext(sparkContext: SparkContext, batchDuration: Duration)

Can someone point me in
someone point me in 

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
Hi All,
I have tried to pass the properties via SparkContext.setLocalProperty and HiveContext.setConf; both failed. Based on the results (I haven't had a chance to look into the code yet), HiveContext tries to initiate the JDBC connection right away, so I couldn't set other properties dynamically prior to any SQL statement.

The only way to get it to work is to put these properties in hive-site.xml, which did work for me. I'm wondering if there's a better way to dynamically specify these Hive configurations, like --hiveconf, or other ways such as a per-user hive-site.xml?

On a shared cluster, hive-site.xml is shared and cannot be managed in a multi-user mode on the same edge server, especially when it contains a personal password for metastore access. What would be the best way to pass on these 3 properties to spark-shell?

javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
javax.jdo.option.ConnectionURL

According to the HiveContext documentation, hive-site.xml is picked up from the classpath. Is there any way to specify this dynamically for each spark-shell session?
"An instance of the Spark SQL execution engine that integrates with data stored in Hive. Configuration for Hive is read from hive-site.xml on the classpath."
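One workaround I have been considering (a sketch only, not verified): keep a per-user copy of hive-site.xml in a private directory and put that directory on the driver classpath when launching the shell, e.g.

./bin/spark-shell --master yarn-client --driver-class-path /home/myuser/hive-conf

where /home/myuser/hive-conf contains only that user's hive-site.xml.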

Here are the test cases I ran.
Spark 1.2.0

Test Case 1








import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

sc.setLocalProperty("javax.jdo.option.ConnectionUserName", "foo")
sc.setLocalProperty("javax.jdo.option.ConnectionPassword", "xx")
sc.setLocalProperty("javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

import hiveContext._

// Create table and clean up data
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, value STRING)")
// Encounter error, picking up default user 'APP'@'localhost' and creating
// metastore_db in the current local directory, not honoring the JDBC settings
// for the metastore on mysql.

Test Case 2

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionUserName", "foo")
// Encounter error right here; it looks like HiveContext tries to initiate the
// JDBC connection prior to any settings from setConf.
hiveContext.setConf("javax.jdo.option.ConnectionPassword", "xxx")
hiveContext.setConf("javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")




From: huaiyin@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
To: linlin200...@gmail.com
CC: lian.cs@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, because the 
Spark driver runs inside the application master, the hive-conf is not 
accessible by the driver. Can you try to set those confs by using 
hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf in the 
node running the application master.



On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao linlin200...@gmail.com wrote:



Hi Yin,

hive-site.xml was copied to spark/conf and the same as the one under 
$HIVE_HOME/conf. 



through hive cli, I don't see any problem. but for spark on yarn-cluster mode, 
I am not able to switch to a database other than the default one, for 
Yarn-client mode, it works fine.  


Thanks!

Jenny




On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai huaiyin@gmail.com wrote:

Hi Jenny,
Have you copied hive-site.xml to spark/conf directory? If not, can you put it 
in conf/ and try again?





Thanks,
Yin






On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao linlin200...@gmail.com wrote:






Thanks Yin! 

here is my hive-site.xml,  which I copied from $HIVE_HOME/conf, didn't 
experience problem connecting to the metastore through hive. which uses DB2 as 
metastore database. 







<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
 <property>
  <name>hive.hwi.listen.port</name>
  <value></value>
 </property>
 

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
A follow-up on hive-site.xml:

1. If you specify it in spark/conf, then you can NOT also apply it via the --driver-class-path option; otherwise, you will get the following exception when initializing the SparkContext:

org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.

2. If you use --driver-class-path, then you need to unset SPARK_CLASSPATH. However, the flip side is that you will need to provide all the related JARs (hadoop-yarn, hadoop-common, hdfs, etc.) that are part of "hadoop-provided" if you built your JARs with -Phadoop-provided, plus other common libraries that are required.

From: alee...@hotmail.com
To: user@spark.apache.org
CC: lian.cs@gmail.com; linlin200...@gmail.com; huaiyin@gmail.com
Subject: RE: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
Date: Mon, 29 Dec 2014 16:01:26 -0800




Hi All,
I have tried to pass the properties via the SparkContext.setLocalProperty and 
HiveContext.setConf, both failed. Based on the results (haven't get a chance to 
look into the code yet), HiveContext will try to initiate the JDBC connection 
right away, I couldn't set other properties dynamically prior to any SQL 
statement.  
The only way to get it work is to put these properties in hive-site.xml which 
did work for me. I'm wondering if there's a better way to dynamically specify 
these Hive configurations like --hiveconf or other ways such as a user-path 
hive-site.xml?  
On a shared cluster, hive-site.xml is shared and cannot be managed in a 
multiple user mode on the same edge server, especially when it contains 
personal password for metastore access. What will be the best way to pass on 
these 3 properties to spark-shell?
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
javax.jdo.option.ConnectionURL
According to HiveContext document, hive-site.xml is picked up from the 
classpath. Anyway to specify this dynamically for each spark-shell session?
An instance of the Spark SQL execution engine that integrates with data stored 
in Hive. Configuration for Hive is read from hive-site.xml on the classpath.

Here are the test case I ran.
Spark 1.2.0

Test Case 1








import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

sc.setLocalProperty("javax.jdo.option.ConnectionUserName", "foo")
sc.setLocalProperty("javax.jdo.option.ConnectionPassword", "xx")
sc.setLocalProperty("javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

import hiveContext._

// Create table and clean up data
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, value STRING)")
// Encounter error, picking up default user 'APP'@'localhost' and creating
// metastore_db in the current local directory, not honoring the JDBC settings
// for the metastore on mysql.

Test Case 2

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionUserName", "foo")
// Encounter error right here; it looks like HiveContext tries to initiate the
// JDBC connection prior to any settings from setConf.
hiveContext.setConf("javax.jdo.option.ConnectionPassword", "xxx")
hiveContext.setConf("javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")




From: huaiyin@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
To: linlin200...@gmail.com
CC: lian.cs@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, because the 
Spark driver runs inside the application master, the hive-conf is not 
accessible by the driver. Can you try to set those confs by using 
hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf in the 
node running the application master.



On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao linlin200...@gmail.com wrote:



Hi Yin,

hive-site.xml was copied to spark/conf and the same as the one under 
$HIVE_HOME/conf. 



through hive cli, I don't see any problem. but for spark on yarn-cluster mode, 
I am not able to switch to a database other than the default one, for 
Yarn-client mode, it works fine.  


Thanks!

Jenny




On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai huaiyin@gmail.com wrote:

Hi Jenny,
Have you copied hive-site.xml to spark/conf directory? If not, can you put it 
in conf/ and try again?





Thanks,
Yin






On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao linlin200...@gmail.com wrote:






Thanks Yin! 

here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't experience any problem connecting to the metastore through Hive, which uses DB2 as the metastore database.







<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"

RE: Hive From Spark

2014-08-25 Thread Andrew Lee
Hi Du,
I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of 
Spark-2420, and it's already resolved in Spark 1.1.0.It looks like Spark-2420 
will release in Spark 1.2.0 according to the current JIRA status.
I'm tracking branch-1.1 instead of the master and haven't seen the results 
merged. Still seeing guava 14.0.1 so I don't think Spark 2848 has been merged 
yet.
Will be great to have someone to confirm or clarify the expectation.
 From: l...@yahoo-inc.com.INVALID
 To: van...@cloudera.com; alee...@hotmail.com
 CC: user@spark.apache.org
 Subject: Re: Hive From Spark
 Date: Sat, 23 Aug 2014 00:08:47 +
 
 I thought the fix had been pushed to the apache master ref. commit
 [SPARK-2848] Shade Guava in uber-jars By Marcelo Vanzin on 8/20. So my
 previous email was based on own build of the apache master, which turned
 out not working yet.
 
 Marcelo: Please correct me if I got that commit wrong.
 
 Thanks,
 Du
 
 
 
 On 8/22/14, 11:41 AM, Marcelo Vanzin van...@cloudera.com wrote:
 
 SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
 be too risky at this point.
 
 I'm not familiar with spark-sql.
 
 On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee alee...@hotmail.com wrote:
  Hopefully there could be some progress on SPARK-2420. It looks like
 shading
  may be the voted solution among downgrading.
 
  Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
  1.1.2?
 
  By the way, regarding bin/spark-sql? Is this more of a debugging tool
 for
  Spark job integrating with Hive?
  How does people use spark-sql? I'm trying to understand the rationale
 and
  motivation behind this script, any idea?
 
 
  Date: Thu, 21 Aug 2014 16:31:08 -0700
 
  Subject: Re: Hive From Spark
  From: van...@cloudera.com
  To: l...@yahoo-inc.com.invalid
  CC: user@spark.apache.org; u...@spark.incubator.apache.org;
  pwend...@gmail.com
 
 
  Hi Du,
 
  I don't believe the Guava change has made it to the 1.1 branch. The
  Guava doc says hashInt was added in 12.0, so what's probably
  happening is that you have and old version of Guava in your classpath
  before the Spark jars. (Hadoop ships with Guava 11, so that may be the
  source of your problem.)
 
  On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid
 wrote:
   Hi,
  
   This guava dependency conflict problem should have been fixed as of
   yesterday according to
 https://issues.apache.org/jira/browse/SPARK-2420
  
   However, I just got java.lang.NoSuchMethodError:
  
   
 com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/Ha
 shCode;
   by the following code snippet and "mvn3 test" on Mac. I built the
 latest
   version of spark (1.1.0-SNAPSHOT) and installed the jar files to the
   local
   maven repo. From my pom file I explicitly excluded guava from almost
 all
   possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
   hadoop-client. This snippet is abstracted from a larger project. So
 the
   pom.xml includes many dependencies although not all are required by
 this
   snippet. The pom.xml is attached.
  
   Anybody knows what to fix it?
  
   Thanks,
   Du
   ---
  
   package com.myself.test

   import org.scalatest._
   import org.apache.hadoop.io.{NullWritable, BytesWritable}
   import org.apache.spark.{SparkContext, SparkConf}
   import org.apache.spark.SparkContext._

   class MyRecord(name: String) extends Serializable {
     def getWritable(): BytesWritable = {
       new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
     }

     final override def equals(that: Any): Boolean = {
       if (!that.isInstanceOf[MyRecord])
         false
       else {
         val other = that.asInstanceOf[MyRecord]
         this.getWritable == other.getWritable
       }
     }
   }

   class MyRecordTestSuite extends FunSuite {
     // construct an MyRecord by Consumer.schema
     val rec: MyRecord = new MyRecord("James Bond")

     test("generated SequenceFile should be readable from spark") {
       val path = "./testdata/"

       val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with Hive")
       conf.set("spark.driver.host", "localhost")
       val sc = new SparkContext(conf)
       val rdd = sc.makeRDD(Seq(rec))
       rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
         .saveAsSequenceFile(path)

       val bytes = sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable]).first._2
       assert(rec.getWritable() == bytes)

       sc.stop()
       System.clearProperty("spark.driver.port")
     }
   }
  
  
   From: Andrew Lee alee...@hotmail.com
   Reply-To: user@spark.apache.org user@spark.apache.org
   Date: Monday, July 21, 2014 at 10:27 AM
   To: user@spark.apache.org user@spark.apache.org,
   u...@spark.incubator.apache.org u...@spark.incubator.apache.org
  
   Subject: RE: Hive From Spark
  
   Hi All,
  
   Currently, if you are running Spark HiveContext API with Hive 0.12,
 it
   won't
   work due to the following 2 libraries which are not consistent with
 Hive

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Could you enable the HistoryServer and provide the properties and CLASSPATH for the spark-shell? And run 'env' to list your environment variables?

By the way, what do the Spark logs say? Enable debug mode to see what's going on in spark-shell when it tries to interact and initialize HiveContext.
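As a quick check (a sketch using the same HiveContext API discussed in this thread; hive.metastore.warehouse.dir is the standard Hive property), you can also print the warehouse location the running HiveContext resolved:

// Sketch: show the effective warehouse setting as seen by HiveContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SET hive.metastore.warehouse.dir").collect().foreach(println)

If it still shows a local path, the hive-site.xml you intended was not the one picked up.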



 On Jul 31, 2014, at 19:09, chenjie chenjie2...@gmail.com wrote:
 
 Hi, Yin and Andrew, thank you for your reply.
 When I create table in hive cli, it works correctly and the table will be
 found in hdfs. I forgot start hiveserver2 before and I started it today.
 Then I run the command below:
spark-shell --master spark://192.168.40.164:7077  --driver-class-path
 conf/hive-site.xml
 Furthermore, I added the following command:
 hiveContext.hql("SET hive.metastore.warehouse.dir=hdfs://192.168.40.164:8020/user/hive/warehouse")
 But then didn't work for me. I got the same exception as before and found
 the table file in local directory instead of hdfs.
 
 
 Yin Huai-2 wrote
 Another way is to set hive.metastore.warehouse.dir explicitly to the
 HDFS
 dir storing Hive tables by using SET command. For example:
 
  hiveContext.hql("SET hive.metastore.warehouse.dir=hdfs://localhost:54310/user/hive/warehouse")
 
 
 
 
 On Thu, Jul 31, 2014 at 8:05 AM, Andrew Lee lt;
 
 alee526@
 
 gt; wrote:
 
 Hi All,
 
 It has been awhile, but what I did to make it work is to make sure the
 followings:
 
 1. Hive is working when you run Hive CLI and JDBC via Hiveserver2
 
 2. Make sure you have the hive-site.xml from above Hive configuration.
 The
 problem here is that you want the hive-site.xml from the Hive metastore.
 The one for Hive and HCatalog may be different files. Make sure you check
 the xml properties in that file, pick the one that has the warehouse
 property configured and the JDO setup.
 
 3. Make sure hive-site.xml from step 2 is included in $SPARK_HOME/conf,
 and in your runtime CLASSPATH when you run spark-shell
 
 4. Use the history server to check the runtime CLASSPATH and order to
 ensure hive-site.xml is included.
 
 HiveContext should pick up the hive-site.xml and talk to your running
 hive
 service.
 
 Hope these tips help.
 
 On Jul 30, 2014, at 22:47, chenjie lt;
 
 chenjie2001@
 
 gt; wrote:
 
  Hi, Michael. I have the same problem. My warehouse directory is always
 created locally. I copied the default hive-site.xml into the
 $SPARK_HOME/conf directory on each node. After I executed the code
 below,
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.hql("LOAD DATA LOCAL INPATH '/extdisk2/tools/spark/examples/src/main/resources/kv1.txt' INTO TABLE src")
    hiveContext.hql("FROM src SELECT key, value").collect()
 
 I got the exception below:
 java.io.FileNotFoundException: File
 file:/user/hive/warehouse/src/kv1.txt
 does not exist
   at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
   at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
   at
  org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
    at
  org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
    at
  org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
    at
  org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:193)
 
 At last, I found /user/hive/warehouse/src/kv1.txt was created on the
 node
 where I start spark-shell.
 
 The spark that I used is pre-built spark1.0.1 for hadoop2.
 
 Thanks in advance.
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-is-creating-metastore-warehouse-locally-instead-of-in-hdfs-tp10838p1.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi All,
Not sure if anyone has run into this problem, but it exists in Spark 1.0.0 when you specify the location in conf/spark-defaults.conf for
spark.eventLog.dir hdfs:///user/$USER/spark/logs
to use the $USER env variable.
For example, I'm running the command with user 'test'.
In spark-submit, the folder will be created on-the-fly and you will see the 
event logs created on HDFS /user/test/spark/logs/spark-pi-1405097484152
but in spark-shell, the user 'test' folder is not created, and you will see 
this /user/$USER/spark/logs on HDFS. It will try to create 
/user/$USER/spark/logs instead of /user/test/spark/logs.
It looks like spark-shell couldn't pick up the env variable $USER to apply for 
the eventLog directory for the running user 'test'.
Is this considered a bug or bad practice to use spark-shell with Spark's 
HistoryServer?








  

RE: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi Andrew,
Thanks for re-confirming the problem. I thought it only happened with my own build. :)
By the way, we have multiple users using spark-shell to explore their datasets, and we are continuously looking into ways to isolate their job history. In the current situation, we can't really ask them to create their own spark-defaults.conf since this is set to read-only. A workaround is to set it to a shared folder, e.g. /user/spark/logs with permissions 1777. This isn't really ideal since other people can see what other jobs are running on the shared cluster.
It would be nice to have better security here so people aren't exposing their algorithm (which is usually embedded in their job's name) to other users.
Is there a JIRA ticket to keep track of this? Any plan to enhance this part of spark-shell?
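In the meantime, a possible workaround sketch (assuming the job builds its own SparkConf instead of relying on spark-defaults.conf): expand the user name in code before the context is created.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: expand the login user programmatically, since spark-defaults.conf
// treats $USER as a literal string.
val user = sys.props.getOrElse("user.name", "unknown")
val conf = new SparkConf()
  .setAppName("per-user event logs")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", s"hdfs:///user/$user/spark/logs")
val sc = new SparkContext(conf)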


Date: Mon, 28 Jul 2014 13:54:56 -0700
Subject: Re: Issues on spark-shell and spark-submit behave differently on 
spark-defaults.conf parameter spark.eventLog.dir
From: and...@databricks.com
To: user@spark.apache.org

Hi Andrew,
It's definitely not bad practice to use spark-shell with HistoryServer. The 
issue here is not with spark-shell, but the way we pass Spark configs to the 
application. spark-defaults.conf does not currently support embedding 
environment variables, but instead interprets everything as a string literal. 
You will have to manually specify test instead of $USER in the path you 
provide to spark.eventLog.dir.

-Andrew

2014-07-28 12:40 GMT-07:00 Andrew Lee alee...@hotmail.com:




Hi All,
Not sure if anyone has run into this problem, but it exists in Spark 1.0.0 when you specify the location in conf/spark-defaults.conf for

spark.eventLog.dir hdfs:///user/$USER/spark/logs
to use the $USER env variable.

For example, I'm running the command with user 'test'.
In spark-submit, the folder will be created on-the-fly and you will see the 
event logs created on HDFS /user/test/spark/logs/spark-pi-1405097484152

but in spark-shell, the user 'test' folder is not created, and you will see 
this /user/$USER/spark/logs on HDFS. It will try to create 
/user/$USER/spark/logs instead of /user/test/spark/logs.

It looks like spark-shell couldn't pick up the env variable $USER to apply for 
the eventLog directory for the running user 'test'.

Is this considered a bug or bad practice to use spark-shell with Spark's 
HistoryServer?









  

  

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Andrew Lee
Hi Jianshi,
My understanding is 'No', based on how Spark is designed, even with your own log4j.properties in Spark's conf folder.
In YARN mode, the Application Master is running inside the cluster, and all logs are part of the container logs, which are governed by another log4j.properties file from the Hadoop and YARN environment. Spark can't override that unless it can put its own log4j ahead of YARN's in the classpath. So the only way is to log in to the ResourceManager and click on the job itself to read the container logs. (Other people, please correct me if my understanding is wrong.)
You may be thinking: why can't I stream the logs to an external service (e.g. Flume, syslogd) with a different appender in log4j? I don't consider this a good practice since:
1. you need 2 infrastructures to operate the entire cluster.
2. you will need to open up firewall ports between the 2 services to transfer/stream the logs.
3. unpredictable traffic: the YARN cluster may bring down the logging service/infra (DDoS) when someone accidentally changes the logging level from WARN to INFO, or worse, DEBUG.
I was thinking maybe we can suggest that the community enhance the Spark HistoryServer to capture the last failure exception from the container logs in the last failed stage. Not sure if this is a good idea since it may complicate the event model. I'm not sure if the Akka model can support this, or whether some other components in Spark could help capture these exceptions and pass them back to the AM, to eventually be stored somewhere for later troubleshooting. I'm not clear how this path is constructed without reading the source code, so I can't give a better answer.
AL

From: jianshi.hu...@gmail.com
Date: Mon, 28 Jul 2014 13:32:05 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org

Hi Andrew,
Thanks for the reply, I figured out the cause of the issue. Some resource files 
were missing in JARs. A class initialization depends on the resource files so 
it got that exception.


I appended the resource files explicitly to --jars option and it worked fine.
The Caused by... messages were found in the YARN logs actually. I think it might be useful if I could see them from the console which runs spark-submit. Would that be possible?


Jianshi


On Sat, Jul 26, 2014 at 7:08 AM, Andrew Lee alee...@hotmail.com wrote:





Hi Jianshi,
Could you provide which HBase version you're using?
By the way, a quick sanity check on whether the Workers can access HBase?


Were you able to manually write one record to HBase with the serialize 
function? Hardcode and test it ?

From: jianshi.hu...@gmail.com


Date: Fri, 25 Jul 2014 15:12:18 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org



I nailed it down to a union operation, here's my code snippet:
val properties: RDD[((String, String, String), Externalizer[KeyValue])] =
  vertices.map { ve =>
    val (vertices, dsName) = ve

    val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
    val (_, rvalAsc, rvalType) = rval
    println(s"Table name: $dsName, Rval: $rval")

    println(vertices.toDebugString)
    vertices.map { v =>
      val rk = appendHash(boxId(v.id)).getBytes
      val cf = PROP_BYTES
      val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
      val value = Serializer.serialize(v.properties)
      ((new String(rk), new String(cf), new String(cq)),
       Externalizer(put(rk, cf, cq, value)))
    }
  }.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]) and they're transformed to a KeyValue to be inserted in HBase, so I need to do a .reduce(_.union(_)) to combine them into one RDD[(key, value)].
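A minimal standalone sketch of that combining step (toy data and a local master, not the original HBase types):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch: combine a Seq of RDDs into one, the same shape as .reduce(_.union(_)) above.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("union-sketch"))
val rdds: Seq[RDD[(String, Int)]] =
  Seq(sc.parallelize(Seq(("a", 1))), sc.parallelize(Seq(("b", 2))), sc.parallelize(Seq(("c", 3))))
val combined = rdds.reduce(_ union _)   // sc.union(rdds) is equivalent and keeps the lineage flatter
println(combined.sortByKey().collect().toSeq)
sc.stop()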




I cannot see what's wrong in my code.
Jianshi


On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:




I can successfully run my code in local mode using spark-submit (--master 
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.




Any hints what is the problem? Is it a closure serialization problem? How can I 
debug it? Your answers would be very helpful. 

14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Andrew Lee
Hi Jianshi,
Could you provide which HBase version you're using?
By the way, a quick sanity check on whether the Workers can access HBase?
Were you able to manually write one record to HBase with the serialize 
function? Hardcode and test it ?
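A rough sketch of that kind of hardcoded single-record check (assuming the HBase 0.94/0.98-era client API; the table and column family names here are made up, adjust to your setup):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// Sketch: write one hardcoded record to verify connectivity and the serialize path.
val hbaseConf = HBaseConfiguration.create()       // picks up hbase-site.xml from the classpath
val table = new HTable(hbaseConf, "test_table")   // hypothetical table name
val put = new Put(Bytes.toBytes("row-1"))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("hello"))
table.put(put)
table.close()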

From: jianshi.hu...@gmail.com
Date: Fri, 25 Jul 2014 15:12:18 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org

I nailed it down to a union operation, here's my code snippet:
val properties: RDD[((String, String, String), Externalizer[KeyValue])] =
  vertices.map { ve =>
    val (vertices, dsName) = ve

    val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
    val (_, rvalAsc, rvalType) = rval
    println(s"Table name: $dsName, Rval: $rval")

    println(vertices.toDebugString)
    vertices.map { v =>
      val rk = appendHash(boxId(v.id)).getBytes
      val cf = PROP_BYTES
      val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
      val value = Serializer.serialize(v.properties)
      ((new String(rk), new String(cf), new String(cq)),
       Externalizer(put(rk, cf, cq, value)))
    }
  }.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]) and they're transformed to a KeyValue to be inserted in HBase, so I need to do a .reduce(_.union(_)) to combine them into one RDD[(key, value)].


I cannot see what's wrong in my code.
Jianshi


On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:


I can successfully run my code in local mode using spark-submit (--master 
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.


Any hints what is the problem? Is it a closure serialization problem? How can I 
debug it? Your answers would be very helpful. 

14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)



-- 
Jianshi Huang

LinkedIn: jianshi

Twitter: @jshuang
Github  Blog: http://huangjs.github.com/




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/



  

RE: Hive From Spark

2014-07-22 Thread Andrew Lee
Hi Sean,
Thanks for clarifying. I re-read SPARK-2420 and now have a better understanding.
From a user perspective, what would you recommend for building Spark with Hive 0.12 / 0.13+ libraries moving forward, and deploying to a production cluster that runs on an older version of Hadoop (e.g. 2.2 or 2.4)?
My concern is that there's going to be a lag in technology adoption, and since Spark is moving fast, its libraries may always be newer. Protobuf is one good example (hence the shading). From a business point of view, if there is no benefit to upgrading a library, the chance that this will happen with high priority is low, due to stability concerns and the cost of re-running the entire test suite. Just by observation, there are still a lot of people running Hadoop 2.2 instead of 2.4 or 2.5, and releases and upgrades depend on the big distro players such as Cloudera, Hortonworks, etc. Not to mention the process of upgrading itself.
Is there any benefit to using Guava 14 in Spark? I believe there is usually some compelling reason why Spark chose Guava 14; however, I'm not sure if anyone raised that in the conversation, so I don't know whether it is necessary.
Looking forward to seeing Hive on Spark work soon. Please let me know if there's any help or feedback I can provide.
Thanks Sean.


 From: so...@cloudera.com
 Date: Mon, 21 Jul 2014 18:36:10 +0100
 Subject: Re: Hive From Spark
 To: user@spark.apache.org
 
 I haven't seen anyone actively 'unwilling' -- I hope not. See
 discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I
 sketch what a downgrade means. I think it just hasn't gotten a looking
 over.
 
 Contrary to what I thought earlier, the conflict does in fact cause
 problems in theory, and you show it causes a problem in practice. Not
 to mention it causes issues for Hive-on-Spark now.
 
 On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee alee...@hotmail.com wrote:
  Hive and Hadoop are using an older version of guava libraries (11.0.1) where
  Spark Hive is using guava 14.0.1+.
  The community isn't willing to downgrade to 11.0.1 which is the current
  version for Hadoop 2.2 and Hive 0.12.
  

RE: Hive From Spark

2014-07-21 Thread Andrew Lee
Hi All,
Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't work due to the following 2 libraries, which are not consistent with Hive 0.12 and Hadoop. (Hive libs align with Hadoop libs, and as a common practice they should be consistent to be interoperable.)
These are under discussion in the 2 JIRA tickets:
https://issues.apache.org/jira/browse/HIVE-7387
https://issues.apache.org/jira/browse/SPARK-2420
When I ran the command by tweaking the classpath and build for Spark 1.0.1-rc3, I was able to create a table through HiveContext; however, when I fetch the data it breaks due to incompatible API calls in Guava. This is critical since it needs to map the columns to the RDD schema.
Hive and Hadoop are using an older version of the Guava libraries (11.0.1), whereas Spark Hive is using Guava 14.0.1+. The community isn't willing to downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12. Be aware of the protobuf version as well in Hive 0.12 (it uses protobuf 2.4).
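One quick way to see which Guava actually wins on the driver classpath (a minimal sketch to paste into spark-shell, not part of the original message):

// Sketch: locate the jar that provides Guava's hashing classes at runtime.
val guavaJar = classOf[com.google.common.hash.HashFunction]
  .getProtectionDomain.getCodeSource.getLocation
println("Guava loaded from: " + guavaJar)
// If this points at a Hadoop/Hive guava-11.x jar, hashInt(I) will be missing.

The session below shows the failure when the older Guava wins: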
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala> // Queries are expressed in HiveQL
scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
    at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
    at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
    at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
    at org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:60)
    at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:70)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
    at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:280)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:69)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
    at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:420)
    at

RE: SPARK_CLASSPATH Warning

2014-07-11 Thread Andrew Lee
As mentioned, SPARK_CLASSPATH is deprecated in Spark 1.0+.
Try to use --driver-class-path instead:
 ./bin/spark-shell --driver-class-path yourlib.jar:abc.jar:xyz.jar

Don't use a glob (*); specify the JARs one by one, separated by colons.
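If the list is long, a tiny sketch like this (the lib directory path is hypothetical) can build the colon-separated string for you:

// Sketch: expand a lib directory into an explicit colon-separated jar list for --driver-class-path.
val libDir = new java.io.File("/path-to-proprietary-hadoop-lib/lib")
val jarList = Option(libDir.listFiles).getOrElse(Array.empty[java.io.File])
  .filter(_.getName.endsWith(".jar"))
  .map(_.getAbsolutePath)
  .sorted
  .mkString(":")
println(jarList)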

Date: Wed, 9 Jul 2014 13:45:07 -0700
From: kat...@cs.pitt.edu
Subject: SPARK_CLASSPATH Warning
To: user@spark.apache.org

Hello,

I have installed Apache Spark v1.0.0 on a machine with a proprietary Hadoop distribution installed (v2.2.0 without YARN). Because the Hadoop distribution I am using relies on a list of jars, I made the following changes to conf/spark-env.sh:

#!/usr/bin/env bash

export HADOOP_CONF_DIR=/path-to-hadoop-conf/hadoop-conf
export SPARK_LOCAL_IP=impl41
export 
SPARK_CLASSPATH=/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*

...

Also, to make sure that I have everything working I execute the Spark shell as 
follows:

[biadmin@impl41 spark]$ ./bin/spark-shell --jars 
/path-to-proprietary-hadoop-lib/lib/*.jar


14/07/09 13:37:28 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:28 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(biadmin)

14/07/09 13:37:28 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:29 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:29 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:44292

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0)

Type in expressions to have them evaluated.
Type :help for more information.
14/07/09 13:37:36 WARN spark.SparkConf: 
SPARK_CLASSPATH was detected (set to 
'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*').

This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath


14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' 
to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as 
a work-around.
14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' 
to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as 
a work-around.

14/07/09 13:37:36 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:36 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(biadmin)

14/07/09 13:37:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/07/09 13:37:37 INFO Remoting: Starting remoting
14/07/09 13:37:37 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@impl41:46081]

14/07/09 13:37:37 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@impl41:46081]
14/07/09 13:37:37 INFO spark.SparkEnv: Registering MapOutputTracker
14/07/09 13:37:37 INFO spark.SparkEnv: Registering BlockManagerMaster

14/07/09 13:37:37 INFO storage.DiskBlockManager: Created local directory at 
/tmp/spark-local-20140709133737-798b
14/07/09 13:37:37 INFO storage.MemoryStore: MemoryStore started with capacity 
307.2 MB.
14/07/09 13:37:38 INFO network.ConnectionManager: Bound socket to port 16685 
with id = ConnectionManagerId(impl41,16685)

14/07/09 13:37:38 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
14/07/09 13:37:38 INFO storage.BlockManagerInfo: Registering block manager 
impl41:16685 with 307.2 MB RAM
14/07/09 13:37:38 INFO storage.BlockManagerMaster: Registered BlockManager

14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:21938

14/07/09 13:37:38 INFO broadcast.HttpBroadcast: Broadcast server started at 
http://impl41:21938
14/07/09 13:37:38 INFO spark.HttpFileServer: HTTP File server directory is 
/tmp/spark-91e8e040-f2ca-43dd-b574-805033f476c7

14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:52678

14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
14/07/09 13:37:38 INFO ui.SparkUI: Started SparkUI at http://impl41:4040

14/07/09 13:37:39 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
14/07/09 13:37:39 INFO spark.SparkContext: Added JAR 
file:/opt/ibm/biginsights/IHC/lib/adaptive-mr.jar at 
http://impl41:52678/jars/adaptive-mr.jar with timestamp 1404938259526

14/07/09 13:37:39 INFO executor.Executor: Using REPL class URI: 
http://impl41:44292
14/07/09 

RE: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-11 Thread Andrew Lee
Ok, I found it on JIRA SPARK-2390:
https://issues.apache.org/jira/browse/SPARK-2390
So it looks like this is a known issue.

From: alee...@hotmail.com
To: user@spark.apache.org
Subject: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file 
option?
Date: Tue, 8 Jul 2014 15:17:00 -0700




Build: Spark 1.0.0 rc11 (git commit tag: 
2f1dc868e5714882cf40d2633fb66772baf34789)








Hi All,
When I enabled the spark-defaults.conf for the event log, spark-shell broke while spark-submit worked.
I'm trying to create a separate directory per user to keep track of each user's own Spark job event logs, using the env variable $USER in spark-defaults.conf.
Here's the spark-defaults.conf I specified so that the HistoryServer can start picking up these event logs from HDFS. As you can see, I was trying to create a directory for each user so they can store their event logs on a per-user basis. However, when I launch spark-shell, it didn't pick up $USER as the current login user; this does work for spark-submit.
Here are more details.
/opt/spark/ is SPARK_HOME








[test@ ~]$ cat /opt/spark/conf/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.


# Example:
# spark.master            spark://master:7077
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs:///user/$USER/spark/logs/
# spark.serializer        org.apache.spark.serializer.KryoSerializer

and I tried to create a separate config file to override the default one:







[test@ ~]$ SPARK_SUBMIT_OPTS=-XX:MaxPermSize=256m /opt/spark/bin/spark-shell --master yarn --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --properties-file /home/test/spark-defaults.conf

[test@ ~]$ cat /home/test/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master            spark://master:7077
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs:///user/test/spark/logs/















# spark.serializer        org.apache.spark.serializer.KryoSerializer

But it didn't work either; it is still looking at /opt/spark/conf/spark-defaults.conf. According to the documentation (http://spark.apache.org/docs/latest/configuration.html), the precedence is: hardcoded properties in SparkConf > spark-submit / spark-shell options > conf/spark-defaults.conf.
2 problems here:
1. In repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala, the SparkConf instance doesn't look for the user-specified spark-defaults.conf anywhere.
I don't see anywhere that pulls in the file from the --properties-file option; it only uses the default location conf/spark-defaults.conf:

val conf = new SparkConf()
  .setMaster(getMaster())
  .setAppName("Spark shell")
  .setJars(jars)
  .set("spark.repl.class.uri", intp.classServer.uri)
2. The $USER isn't picked up in spark-shell. This may be a separate problem that could be fixed at the same time if spark-shell reuses the way SparkSubmit.scala populates SparkConf.
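For reference, a minimal sketch (not the actual SparkILoop code) of what honoring --properties-file could look like: load the file into the SparkConf the way spark-submit effectively does.

import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Sketch: copy spark.* entries from a user-supplied properties file into a SparkConf.
def loadProperties(conf: SparkConf, path: String): SparkConf = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  props.stringPropertyNames.asScala
    .filter(_.startsWith("spark."))
    .foreach(k => conf.set(k, props.getProperty(k)))
  conf
}

val conf = loadProperties(new SparkConf(), "/home/test/spark-defaults.conf")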











  

spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-08 Thread Andrew Lee
Build: Spark 1.0.0 rc11 (git commit tag: 
2f1dc868e5714882cf40d2633fb66772baf34789)








Hi All,
When I enabled the spark-defaults.conf for the event log, spark-shell broke while spark-submit worked.
I'm trying to create a separate directory per user to keep track of each user's own Spark job event logs, using the env variable $USER in spark-defaults.conf.
Here's the spark-defaults.conf I specified so that the HistoryServer can start picking up these event logs from HDFS. As you can see, I was trying to create a directory for each user so they can store their event logs on a per-user basis. However, when I launch spark-shell, it didn't pick up $USER as the current login user; this does work for spark-submit.
Here are more details.
/opt/spark/ is SPARK_HOME








[test@ ~]$ cat /opt/spark/conf/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.


# Example:
# spark.master            spark://master:7077
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs:///user/$USER/spark/logs/
# spark.serializer        org.apache.spark.serializer.KryoSerializer

and I tried to create a separate config file to override the default one:







[test@ ~]$ SPARK_SUBMIT_OPTS=-XX:MaxPermSize=256m /opt/spark/bin/spark-shell --master yarn --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --properties-file /home/test/spark-defaults.conf

[test@ ~]$ cat /home/test/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master            spark://master:7077
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs:///user/test/spark/logs/















# spark.serializer        org.apache.spark.serializer.KryoSerializer

But it didn't work either; it is still looking at /opt/spark/conf/spark-defaults.conf. According to the documentation (http://spark.apache.org/docs/latest/configuration.html), the precedence is: hardcoded properties in SparkConf > spark-submit / spark-shell options > conf/spark-defaults.conf.
2 problems here:
1. In repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala, the SparkConf instance doesn't look for the user-specified spark-defaults.conf anywhere.
I don't see anywhere that pulls in the file from the --properties-file option; it only uses the default location conf/spark-defaults.conf:

val conf = new SparkConf()
  .setMaster(getMaster())
  .setAppName("Spark shell")
  .setJars(jars)
  .set("spark.repl.class.uri", intp.classServer.uri)
2. The $USER isn't picked up in spark-shell. This may be a separate problem that could be fixed at the same time if spark-shell reuses the way SparkSubmit.scala populates SparkConf.










  

RE: Spark logging strategy on YARN

2014-07-07 Thread Andrew Lee
Hi Kudryavtsev,
Here's what I am doing as a common practice and reference. I don't want to call it best practice since that requires a lot of operating experience and feedback, but from a development and operations standpoint, it is helpful to separate the YARN container logs from the Spark logs.
Event log - Use the HistoryServer to look at the workflow, overall resource usage, etc. for the job.

Spark log - Provides readable info on settings and configuration, and is covered by the event logs. You can customize this in the 'conf' folder with your own log4j.properties file. This won't be picked up by your YARN containers, since your Hadoop may be referring to a different log4j file somewhere else.
Stderr/stdout log - This is actually picked up by the YARN container, and you won't be able to override it unless you override the one in the resource folder (yarn/common/src/main/resources/log4j-spark-container.properties) during the build process and include it in your build (JAR file).
One thing I haven't tried yet is to separate that resource file into a separate JAR and include it in the ext jar options on HDFS to suppress the log. This is more a matter of exploiting the CLASSPATH search behavior to override YARN's log4j settings without building JARs that include the YARN container log4j settings; I don't know if this is a good practice though. Just some ideas to give people flexibility, but probably not a good practice.
Anyone else have ideas or thoughts?








 From: kudryavtsev.konstan...@gmail.com
 Subject: Spark logging strategy on YARN
 Date: Thu, 3 Jul 2014 22:26:48 +0300
 To: user@spark.apache.org
 
 Hi all,
 
 Could you please share your the best practices on writing logs in Spark? I’m 
 running it on YARN, so when I check logs I’m bit confused… 
 Currently, I’m writing System.err.println to put a message in log and access 
 it via YARN history server. But, I don’t like this way… I’d like to use 
 log4j/slf4j and write them to more concrete place… any practices?
 
 Thank you in advance
  

RE: write event logs with YARN

2014-07-02 Thread Andrew Lee
Hi Christophe,
Make sure you have 3 slashes in the hdfs scheme.
e.g.
hdfs:///server_name:9000/user/user_name/spark-events
and in the spark-defaults.conf as well:
spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events

 Date: Thu, 19 Jun 2014 11:18:51 +0200
 From: christophe.pre...@kelkoo.com
 To: user@spark.apache.org
 Subject: write event logs with YARN
 
 Hi,
 
 I am trying to use the new Spark history server in 1.0.0 to view finished 
 applications (launched on YARN), without success so far.
 
 Here are the relevant configuration properties in my spark-defaults.conf:
 
 spark.yarn.historyServer.address=server_name:18080
 spark.ui.killEnabled=false
 spark.eventLog.enabled=true
 spark.eventLog.compress=true
 spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events
 
 And the history server has been launched with the command below:
 
 /opt/spark/sbin/start-history-server.sh 
 hdfs://server_name:9000/user/user_name/spark-events
 
 
 However, the finished application do not appear in the history server UI 
 (though the UI itself works correctly).
 Apparently, the problem is that the APPLICATION_COMPLETE file is not created:
 
 hdfs dfs -stat %n spark-events/application_name-1403166516102/*
 COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
 EVENT_LOG_2
 SPARK_VERSION_1.0.0
 
 Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above 
 directory, the application can now be viewed normally in the history server.
 
 Finally, here is the relevant part of the YARN application log, which seems 
 to imply that
 the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is 
 created:
 
 (...)
 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with 
 SUCCEEDED
 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1397477394591_0798
 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from 
 shutdown hook
 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at 
 http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all 
 executors
 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to 
 shut down
 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
 stopped!
 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted!
 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
 14/06/19 08:29:30 INFO BlockManager: BlockManager stopped
 14/06/19 08:29:30 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
 14/06/19 08:29:30 INFO BlockManagerMaster: BlockManagerMaster stopped
 Exception in thread Thread-44 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1365)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1307)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:384)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:380)
 at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:380)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:324)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
 at org.apache.spark.util.FileLogger.createWriter(FileLogger.scala:117)
 at org.apache.spark.util.FileLogger.newFile(FileLogger.scala:181)
 at 
 org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:129)
 at 
 org.apache.spark.SparkContext$$anonfun$stop$2.apply(SparkContext.scala:989)
 at 
 org.apache.spark.SparkContext$$anonfun$stop$2.apply(SparkContext.scala:989)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:989)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:443)
 14/06/19 08:29:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
 down remote daemon.
 
 
 Am I missing something, or is it a bug?
 
 Thanks,
 Christophe.
 
 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris
 

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-23 Thread Andrew Lee
I checked the source code; it looks like this was re-added based on JIRA SPARK-1588, but I don't know if there's any test case associated with it.









  SPARK-1588.  Restore SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS for YARN.
  Sandy Ryza sa...@cloudera.com
  2014-04-29 12:54:02 -0700
  Commit: 5f48721, github.com/apache/spark/pull/586


From: alee...@hotmail.com
To: user@spark.apache.org
Subject: RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn 
mode
Date: Wed, 18 Jun 2014 11:24:36 -0700




Forgot to mention that I am using spark-submit to submit jobs; a verbose-mode printout looks like this with the SparkPi example. The .sparkStaging directory won't be deleted. My thought is that this is part of the staging and should be cleaned up as well when the SparkContext is terminated.









[test@ spark]$ SPARK_YARN_USER_ENV=spark.yarn.preserve.staging.files=false 
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar 
./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
org.apache.spark.examples.SparkPi --driver-memory 512M --driver-library-path 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --executor-memory 512M 
--executor-cores 1 --queue research --num-executors 2 
examples/target/spark-examples_2.10-1.0.0.jar 
















Using properties file: null
Using properties file: null
Parsed arguments:
  master  yarn
  deployMode  cluster
  executorMemory  512M
  executorCores   1
  totalExecutorCores  null
  propertiesFile  null
  driverMemory512M
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
  driverExtraJavaOptions  null
  supervise   false
  queue   research
  numExecutors2
  files   null
  pyFiles null
  archivesnull
  mainClass   org.apache.spark.examples.SparkPi
  primaryResource 
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
  nameorg.apache.spark.examples.SparkPi
  childArgs   []
  jarsnull
  verbose true


Default properties from null:
  



Using properties file: null
Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--jar
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
--class
org.apache.spark.examples.SparkPi
--name
org.apache.spark.examples.SparkPi
--driver-memory
512M
--queue
research
--num-executors
2
--executor-memory
512M
--executor-cores
1
System properties:
spark.driver.extraLibraryPath - 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
SPARK_SUBMIT - true
spark.app.name - org.apache.spark.examples.SparkPi
Classpath elements:








From: alee...@hotmail.com
To: user@spark.apache.org
Subject: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date: Wed, 18 Jun 2014 11:05:12 -0700




Hi All,
Has anyone run into the same problem? Looking at the source code in the official release (rc11), this property setting defaults to false; however, I'm seeing that the .sparkStaging folder remains on HDFS and fills up the disk pretty fast, since SparkContext deploys the fat JAR file (~115MB) for every job and it is not cleaned up.








yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
  val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx--   - test users          0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx--   - test users          0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx--   - test users          0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx--   - test users          0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx--   - test users          0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx--   - test users          0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx--   - test users          0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx--   - test users          0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx--   - test users          0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx--   - test users          0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx--   - test users          0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
….
drwx--   - test users          0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx--   - test users          0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx--   - test users          0 2014-06-16 15:16

HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Hi All,
Has anyone run into the same problem? Looking at the source code in the official release (rc11), this property setting defaults to false; however, I'm seeing that the .sparkStaging folder remains on HDFS and fills up the disk pretty fast, since SparkContext deploys the fat JAR file (~115MB) for every job and it is not cleaned up.








yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
  val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx--   - test users          0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx--   - test users          0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx--   - test users          0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx--   - test users          0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx--   - test users          0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx--   - test users          0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx--   - test users          0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx--   - test users          0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx--   - test users          0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx--   - test users          0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx--   - test users          0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
….
drwx--   - test users          0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx--   - test users          0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx--   - test users          0 2014-06-16 15:16 .sparkStaging/application_1402001910637_0136
drwx--   - test users          0 2014-06-16 15:46 .sparkStaging/application_1402001910637_0138
drwx--   - test users          0 2014-06-16 23:57 .sparkStaging/application_1402001910637_0157
drwx--   - test users          0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161
Is this something that needs to be explicitly set, e.g.:
SPARK_YARN_USER_ENV=spark.yarn.preserve.staging.files=false

Per http://spark.apache.org/docs/latest/running-on-yarn.html:
spark.yarn.preserve.staging.files (default: false) - Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.

Or is this a bug that is not honoring the default value and is overriding it to true somewhere?
Thanks.
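For what it's worth, a small sketch of stating the flag explicitly when the job builds its own SparkConf (the property name is the one quoted from ApplicationMaster.scala above; the app name is made up):

import org.apache.spark.SparkConf

// Sketch: make the intent explicit; .sparkStaging should be removed on a clean shutdown
// when this stays false.
val conf = new SparkConf()
  .setAppName("staging-cleanup-check")
  .set("spark.yarn.preserve.staging.files", "false")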


  

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Forgot to mention that I am using spark-submit to submit jobs; a verbose-mode printout looks like this with the SparkPi example. The .sparkStaging directory won't be deleted. My thought is that this is part of the staging and should be cleaned up as well when the SparkContext is terminated.









[test@ spark]$ SPARK_YARN_USER_ENV=spark.yarn.preserve.staging.files=false 
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar 
./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
org.apache.spark.examples.SparkPi --driver-memory 512M --driver-library-path 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --executor-memory 512M 
--executor-cores 1 --queue research --num-executors 2 
examples/target/spark-examples_2.10-1.0.0.jar 
















Using properties file: null
Using properties file: null
Parsed arguments:
  master  yarn
  deployMode  cluster
  executorMemory  512M
  executorCores   1
  totalExecutorCores  null
  propertiesFile  null
  driverMemory512M
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
  driverExtraJavaOptions  null
  supervise   false
  queue   research
  numExecutors2
  files   null
  pyFiles null
  archivesnull
  mainClass   org.apache.spark.examples.SparkPi
  primaryResource 
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
  nameorg.apache.spark.examples.SparkPi
  childArgs   []
  jarsnull
  verbose true


Default properties from null:
  



Using properties file: null
Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--jar
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
--class
org.apache.spark.examples.SparkPi
--name
org.apache.spark.examples.SparkPi
--driver-memory
512M
--queue
research
--num-executors
2
--executor-memory
512M
--executor-cores
1
System properties:
spark.driver.extraLibraryPath - 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
SPARK_SUBMIT - true
spark.app.name - org.apache.spark.examples.SparkPi
Classpath elements:








From: alee...@hotmail.com
To: user@spark.apache.org
Subject: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date: Wed, 18 Jun 2014 11:05:12 -0700




Hi All,
Has anyone run into the same problem? Looking at the source code in the official
release (rc11), this property is set to false by default; however, I'm seeing the
.sparkStaging folder remain on HDFS and fill up the disk pretty fast, since
SparkContext deploys the fat JAR file (~115MB) for every job and it is not cleaned up.








yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
  val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx------   - test users          0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx------   - test users          0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx------   - test users          0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx------   - test users          0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx------   - test users          0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx------   - test users          0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx------   - test users          0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx------   - test users          0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx------   - test users          0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx------   - test users          0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx------   - test users          0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
…
drwx------   - test users          0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx------   - test users          0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx------   - test users          0 2014-06-16 15:16 .sparkStaging/application_1402001910637_0136
drwx------   - test users          0 2014-06-16 15:46 .sparkStaging/application_1402001910637_0138
drwx------   - test users          0 2014-06-16 23:57 .sparkStaging/application_1402001910637_0157
drwx------   - test users          0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161
Is this something that needs to be explicitly set, e.g.
SPARK_YARN_USER_ENV=spark.yarn.preserve.staging.files=false ?
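
For staging directories that have already been left behind, a hedged cleanup sketch using the Hadoop FileSystem API (runnable from spark-shell with the Hadoop configuration on the classpath; the delete is commented out on purpose):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List leftover .sparkStaging subdirectories under the user's HDFS home
// directory and optionally remove them.
val fs = FileSystem.get(new Configuration())
val staging = new Path(fs.getHomeDirectory, ".sparkStaging")
if (fs.exists(staging)) {
  fs.listStatus(staging).foreach { status =>
    println(status.getPath)
    // fs.delete(status.getPath, true)   // uncomment to actually delete
  }
}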

Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Does anyone know if:
./bin/spark-shell --master yarn 
is running yarn-cluster or yarn-client by default?
Based on the source code:







./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}
It looks like the answer is yarn-cluster mode.
I want to confirm this with the community, thanks.  
  

RE: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Ah, forgot the --verbose option. Thanks Andrew. That is very helpful. 

Date: Wed, 21 May 2014 11:07:55 -0700
Subject: Re: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster 
mode or yarn-client mode?
From: and...@databricks.com
To: user@spark.apache.org

The answer is actually yarn-client. A quick way to find out:
$ bin/spark-shell --master yarn --verbose
From the system properties you can see spark.master is set to yarn-client. 
From the code, this is because args.deployMode is null, and so it's not equal 
to cluster and so it falls into the second if case you mentioned:

if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}
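
Restated as a small standalone function (illustrative only; the name is not from SparkSubmit), the resolution logic is:

// Compact restatement of the SparkSubmit logic quoted above.
def resolveMaster(master: String, deployMode: String): String = {
  if (deployMode == "cluster" && master.startsWith("yarn")) "yarn-cluster"
  else if (deployMode != "cluster" && master.startsWith("yarn")) "yarn-client"
  else master
}

resolveMaster("yarn", null)       // "yarn-client" (what spark-shell --master yarn gets)
resolveMaster("yarn", "cluster")  // "yarn-cluster"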

2014-05-21 10:57 GMT-07:00 Andrew Lee alee...@hotmail.com:




Does anyone know if:
./bin/spark-shell --master yarn 
is running yarn-cluster or yarn-client by default?

Based on the source code:







./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala


if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}

It looks like the answer is yarn-cluster mode.
I want to confirm this with the community, thanks.  
  


  

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-06 Thread Andrew Lee
Hi Jacob,
I agree, we need to address both driver and workers bidirectionally.
If the subnet is isolated and self-contained, and only limited ports are configured
to access the driver via a dedicated gateway for the user, could you explain your
concern, or what doesn't satisfy the security criteria?
Are you referring to a security certification or regulatory requirement that a
separate subnet with a configurable policy couldn't satisfy?
What I meant is that the subnet includes both the driver and the Workers. See the
following example setup (254 nodes max, for example):

  Hadoop / HDFS = 10.5.5.0/24 (GW 10.5.5.1) on eth0
  Spark driver and Workers bind to 10.10.10.0/24 on eth1, with routing to
  10.5.5.0/24 only on the specific ports for the NameNode and DataNodes.

So the driver and Workers are bound to the same subnet, which is separated from the
others. iptables for 10.10.10.0/24 can allow SSH (port 22) login or port forwarding
onto the Spark driver machine to launch the shell or submit Spark jobs.


Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
To: user@spark.apache.org
From: jeis...@us.ibm.com
Date: Mon, 5 May 2014 12:40:53 -0500


Howdy Andrew,



I agree; the subnet idea is a good one...  unfortunately, it doesn't really 
help to secure the network.



You mentioned that the drivers need to talk to the workers.  I think it is 
slightly broader - all of the workers and the driver/shell need to be 
addressable from/to each other on any dynamic port.



I would check out setting the environment variable SPARK_LOCAL_IP [1].  This 
seems to enable Spark to bind correctly to a private subnet.
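
For a multi-homed box, a small illustrative helper (plain JDK calls, not part of Spark; the interface name is an assumption) can show which address to export as SPARK_LOCAL_IP:

import java.net.NetworkInterface
import scala.collection.JavaConverters._

// List the IPv4 addresses bound to a given interface, e.g. the NIC on the
// private subnet, so the right one can be exported as SPARK_LOCAL_IP.
def addressesOf(iface: String): Seq[String] =
  Option(NetworkInterface.getByName(iface)).toSeq
    .flatMap(_.getInetAddresses.asScala)
    .collect { case a if a.getAddress.length == 4 => a.getHostAddress }

addressesOf("eth1")   // e.g. Seq("10.10.10.5") on an isolated 10.10.10.0/24 subnet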



Jacob



[1]  http://spark.apache.org/docs/latest/configuration.html 



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075



Andrew Lee ---05/04/2014 09:57:08 PM---Hi Jacob, Taking both concerns into 
account, I'm actually thinking about using a separate subnet to



From:   Andrew Lee alee...@hotmail.com

To: user@spark.apache.org user@spark.apache.org

Date:   05/04/2014 09:57 PM

Subject:RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication








Hi Jacob,



Taking both concerns into account, I'm actually thinking about using a separate
subnet to isolate the Spark Workers, but I need to look into how to bind the
process onto the correct interface first. This may require some code change.

A separate subnet isn't limited to a port range, so port exhaustion should
rarely happen, and it won't impact performance.

Opening up all ports between 32768-61000 is effectively the same as having no
firewall. This exposes some security concerns, but we need more information on
whether that is critical or not.

The bottom line is that the driver needs to talk to the Workers. How the user
accesses the driver should be easier to solve, e.g. by launching the Spark (shell)
driver on a specific interface.

Likewise, if you find any interesting solutions, please let me know. I'll share
my solution once I have something up and running. Currently it runs OK with
iptables off, but I still need to figure out how to productionize the security part.



Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication

To: user@spark.apache.org

From: jeis...@us.ibm.com

Date: Fri, 2 May 2014 16:07:50 -0500



Howdy Andrew,



I think I am running into the same issue [1] as you.  It appears that Spark 
opens up dynamic / ephemeral [2] ports for each job on the shell and the 
workers.  As you are finding out, this makes securing and managing the network 
for Spark very difficult.



 Any idea how to restrict the 'Workers' port range?

The port range can be found by running: 
$ sysctl net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = 32768 61000


With that being said, a couple avenues you may try: 

Limit the dynamic ports [3] to a more reasonable number and open all of these 
ports on your firewall; obviously, this might have unintended consequences like 
port exhaustion. 
Secure the network another way like through a private VPN; this may reduce 
Spark's performance.


If you have other workarounds, I am all ears --- please let me know!

Jacob



[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html

[2] http://en.wikipedia.org/wiki/Ephemeral_port

[3] 
http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075



Andrew Lee ---05/02/2014 03:15:42 PM---Hi Yana,  I did. I configured the the 
port in spark-env.sh, the problem is not the driver port which



From: Andrew Lee alee...@hotmail.com

To: user@spark.apache.org user@spark.apache.org

Date: 05/02/2014 03:15 PM

Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication







Hi Yana, 



I did. I configured

RE: run spark0.9.1 on yarn with hadoop CDH4

2014-05-06 Thread Andrew Lee
Please check JAVA_HOME. Usually it should point to /usr/java/default on 
CentOS/Linux.
or FYI: http://stackoverflow.com/questions/1117398/java-home-directory


 Date: Tue, 6 May 2014 00:23:02 -0700
 From: sln-1...@163.com
 To: u...@spark.incubator.apache.org
 Subject: run spark0.9.1 on yarn with hadoop CDH4
 
 Hi all,
  I have made HADOOP_CONF_DIR or YARN_CONF_DIR point to the directory which
 contains the (client-side) configuration files for the Hadoop cluster. 
 The command to launch the YARN Client which I run is like this:
 
 #
 SPARK_JAR=./~/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
 ./bin/spark-class org.apache.spark.deploy.yarn.Client \
   --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
   --class org.apache.spark.examples.SparkPi \
   --args yarn-standalone \
   --num-workers 3 \
   --master-memory 2g \
   --worker-memory 2g \
   --worker-cores 1
 ./bin/spark-class: line 152: /usr/lib/jvm/java-7-sun/bin/java: No such file
 or directory
 ./bin/spark-class: line 152: exec: /usr/lib/jvm/java-7-sun/bin/java: cannot
 execute: No such file or directory
 How can I make it run properly?
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/run-spark0-9-1-on-yarn-with-hadoop-CDH4-tp5426.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
  

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-04 Thread Andrew Lee
Hi Jacob,
Taking both concerns into account, I'm actually thinking about using a separate
subnet to isolate the Spark Workers, but I need to look into how to bind the
process onto the correct interface first. This may require some code change.
A separate subnet isn't limited to a port range, so port exhaustion should
rarely happen, and it won't impact performance.
Opening up all ports between 32768-61000 is effectively the same as having no
firewall. This exposes some security concerns, but we need more information on
whether that is critical or not.
The bottom line is that the driver needs to talk to the Workers. How the user
accesses the driver should be easier to solve, e.g. by launching the Spark (shell)
driver on a specific interface.
Likewise, if you find any interesting solutions, please let me know. I'll share
my solution once I have something up and running. Currently it runs OK with
iptables off, but I still need to figure out how to productionize the security part.
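
As a rough illustration of launching the driver on a specific interface, one knob is the spark.driver.host property; a minimal sketch, with the address borrowed from the example subnet above:

import org.apache.spark.{SparkConf, SparkContext}

// Advertise the driver on the address that belongs to the isolated
// 10.10.10.0/24 subnet so the Workers reach it over the intended interface.
val conf = new SparkConf()
  .setAppName("DriverOnPrivateSubnet")
  .set("spark.driver.host", "10.10.10.5")   // illustrative address on eth1

val sc = new SparkContext(conf)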
Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
To: user@spark.apache.org
From: jeis...@us.ibm.com
Date: Fri, 2 May 2014 16:07:50 -0500


Howdy Andrew,



I think I am running into the same issue [1] as you.  It appears that Spark 
opens up dynamic / ephemeral [2] ports for each job on the shell and the 
workers.  As you are finding out, this makes securing and managing the network 
for Spark very difficult.



 Any idea how to restrict the 'Workers' port range?

The port range can be found by running:

$ sysctl net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = 32768 61000


With that being said, a couple avenues you may try:

Limit the dynamic ports [3] to a more reasonable number and open all of these 
ports on your firewall; obviously, this might have unintended consequences like 
port exhaustion.
Secure the network another way like through a private VPN; this may reduce 
Spark's performance.


If you have other workarounds, I am all ears --- please let me know!

Jacob



[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html

[2] http://en.wikipedia.org/wiki/Ephemeral_port

[3] 
http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075



Andrew Lee ---05/02/2014 03:15:42 PM---Hi Yana,  I did. I configured the the 
port in spark-env.sh, the problem is not the driver port which



From:   Andrew Lee alee...@hotmail.com

To: user@spark.apache.org user@spark.apache.org

Date:   05/02/2014 03:15 PM

Subject:RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication







Hi Yana, 



I did. I configured the port in spark-env.sh; the problem is not the driver
port, which is fixed.

It's the Workers' ports that are dynamic every time they are launched in the
YARN containers. :-(



Any idea how to restrict the 'Workers' port range?



Date: Fri, 2 May 2014 14:49:23 -0400

Subject: Re: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication

From: yana.kadiy...@gmail.com

To: user@spark.apache.org



I think what you want to do is set spark.driver.port to a fixed port.





On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote:
Hi All,



I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.



When I launch spark-shell in yarn-client mode, I notice that the Workers in the
YARN containers are trying to talk to the driver (spark-shell); however, the
firewall is not open, which causes timeouts.

For the Workers, it tries to open listening ports on 54xxx for each Worker. Is
the port random in such a case?
What would be a better way to predict the ports so I can configure the
firewall correctly between the driver (spark-shell) and the Workers? Is there a
range of ports we can specify in the firewall/iptables?
range of ports we can specify in the firewall/iptables?



Any ideas?

  

spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi All,
I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.
When I launch spark-shell in yarn-client mode, I notice that the Workers in the
YARN containers are trying to talk to the driver (spark-shell); however, the
firewall is not open, which causes timeouts.
For the Workers, it tries to open listening ports on 54xxx for each Worker. Is
the port random in such a case? What would be a better way to predict the ports
so I can configure the firewall correctly between the driver (spark-shell) and
the Workers? Is there a range of ports we can specify in the firewall/iptables?
Any ideas?

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi Yana, 
I did. I configured the port in spark-env.sh; the problem is not the driver
port, which is fixed. It's the Workers' ports that are dynamic every time they
are launched in the YARN containers. :-(
Any idea how to restrict the 'Workers' port range?

Date: Fri, 2 May 2014 14:49:23 -0400
Subject: Re: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
From: yana.kadiy...@gmail.com
To: user@spark.apache.org

I think what you want to do is set spark.driver.port to a fixed port.
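
A minimal sketch of that suggestion (the port number is only an example; pick one that is open between the driver host and the YARN nodes):

import org.apache.spark.{SparkConf, SparkContext}

// Pin the driver to a known port so a firewall rule can target it.
val conf = new SparkConf()
  .setAppName("FixedDriverPort")
  .set("spark.driver.port", "51000")   // illustrative port

val sc = new SparkContext(conf)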


On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote:




Hi All,
I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.
When I launch spark-shell in yarn-client mode, I notice that the Workers in the
YARN containers are trying to talk to the driver (spark-shell); however, the
firewall is not open, which causes timeouts.

For the Workers, it tries to open listening ports on 54xxx for each Worker. Is
the port random in such a case? What would be a better way to predict the ports
so I can configure the firewall correctly between the driver (spark-shell) and
the Workers? Is there a range of ports we can specify in the firewall/iptables?

Any ideas?

  

Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi All,
I'm getting the following error when I execute start-master.sh which also 
invokes spark-class at the end.








Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/
You need to build Spark with 'sbt/sbt assembly' before running this program.
After digging into the code, I see the CLASSPATH is hardcoded to match
spark-assembly.*hadoop.*.jar. In bin/spark-class:
if [ ! -f "$FWDIR/RELEASE" ]; then
  # Exit if the user hasn't compiled Spark
  num_jars=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar" | wc -l)
  jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar")
  if [ "$num_jars" -eq "0" ]; then
    echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2
    echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2
    exit 1
  fi
  if [ "$num_jars" -gt "1" ]; then
    echo "Found multiple Spark assembly jars in $FWDIR/assembly/target/scala-$SCALA_VERSION:" >&2
    echo "$jars_list"
    echo "Please remove all but one jar."
    exit 1
  fi
fi
Is there any reason why this is only grabbing spark-assembly.*hadoop.*.jar ? I 
am trying to run Spark that links to my own version of Hadoop under 
/opt/hadoop23/, and I use 'sbt/sbt clean package' to build the package without 
the Hadoop jar. What is the correct way to link to my own Hadoop jar?


  

RE: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi Paul,
I got it sorted out.
The problem is that the dependency JARs are built into the assembly JAR when you run:
sbt/sbt clean assembly
What I did instead is:
sbt/sbt clean package
This only gives you the small JARs. The next step is to update the CLASSPATH in the
bin/compute-classpath.sh script manually, appending all the JARs.
With:
sbt/sbt assembly
we can't introduce our own Hadoop patch, since it will always pull from the Maven
repo, unless we hijack the repository path or do a 'mvn install' locally. This is
more of a hack, I think.
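
If one goes the 'mvn install' route, a hedged sketch of the sbt side (the resolver line is the usual local-Maven-repository idiom; the artifact version is a placeholder, and Spark 0.9's build actually lives in project/SparkBuild.scala rather than a plain build.sbt):

// Hypothetical sbt settings (Scala DSL): resolve a locally installed,
// patched Hadoop artifact instead of the one on Maven Central.
resolvers += "Local Maven Repository" at
  "file://" + Path.userHome.absolutePath + "/.m2/repository"

// Placeholder coordinates for the locally 'mvn install'-ed Hadoop build.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-patched"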


Date: Tue, 25 Mar 2014 15:23:08 -0700
Subject: Re: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar 
files?
From: paulmscho...@gmail.com
To: user@spark.apache.org

Andrew, 
I ran into the same problem and eventually settled on just running the jars
directly with java. Since we use sbt to build our jars, we have all the
dependencies built into the jar itself, so there is no need for random classpaths.


On Tue, Mar 25, 2014 at 1:47 PM, Andrew Lee alee...@hotmail.com wrote:




Hi All,
I'm getting the following error when I execute start-master.sh which also 
invokes spark-class at the end.








Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/

You need to build Spark with 'sbt/sbt assembly' before running this program.


After digging into the code, I see the CLASSPATH is hardcoded with 
spark-assembly.*hadoop.*.jar.

In bin/spark-class :


if [ ! -f "$FWDIR/RELEASE" ]; then
  # Exit if the user hasn't compiled Spark
  num_jars=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar" | wc -l)
  jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar")
  if [ "$num_jars" -eq "0" ]; then
    echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2
    echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2
    exit 1
  fi
  if [ "$num_jars" -gt "1" ]; then
    echo "Found multiple Spark assembly jars in $FWDIR/assembly/target/scala-$SCALA_VERSION:" >&2
    echo "$jars_list"
    echo "Please remove all but one jar."
    exit 1
  fi
fi


Is there any reason why this is only grabbing spark-assembly.*hadoop.*.jar ? I 
am trying to run Spark that links to my own version of Hadoop under 
/opt/hadoop23/, 

and I use 'sbt/sbt clean package' to build the package without the Hadoop jar. 
What is the correct way to link to my own Hadoop jar?