Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
It is not supported in Spark to specify both a principal and a proxy user. You
need to use either --proxy-user or --principal, but not both.

It seems that Spark currently only checks this for the spark-submit arguments, but
ignores the configuration in spark-defaults.conf:

if (proxyUser != null && principal != null) {
  SparkSubmit.printErrorAndExit("Only one of --proxy-user or --principal can be provided.")
}


On Wed, Apr 13, 2016 at 8:52 AM, Luca Rea 
wrote:

> Hi,
> I'm testing Livy server with Hue 3.9 and Spark 1.6.0 inside a kerberized
> cluster (HDP 2.4), when I run the command
>
>
> /usr/java/jdk1.7.0_71//bin/java -Dhdp.version=2.4.0.0-169 -cp
> /usr/hdp/2.4.0.0-169/spark/conf/:/usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-core-3.2.10.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf/:/usr/hdp/2.4.0.0-169/hadoop/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master
> yarn-cluster --conf spark.livy.port=0 --conf spark.livy.callbackUrl=
> http://172.16.24.26:8998/sessions/0/callback --conf
> spark.driver.extraJavaOptions=-Dhdp.version=2.4.0.0-169 --class
> com.cloudera.hue.livy.repl.Main --name Livy --proxy-user luca.rea
> /var/cloudera_hue/apps/spark/java/livy-assembly/target/scala-2.10/livy-assembly-0.2.0-SNAPSHOT.jar
> spark
>
>
> This fails renewing the token  and returns the error below:
>
>
> 16/04/13 09:34:52 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to
> load native-hadoop library for your platform... using builtin-java classes
> where applicable
> 16/04/13 09:34:53 INFO org.apache.hadoop.security.UserGroupInformation:
> Login successful for user spark-pantagr...@contactlab.lan using keytab
> file /etc/security/keytabs/spark.headless.keytab
> 16/04/13 09:34:54 INFO
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl: Timeline service
> address: http://pg-master04.contactlab.lan:8188/ws/v1/timeline/
> 16/04/13 09:34:54 WARN
> org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit
> local reads feature cannot be used because libhadoop cannot be loaded.
> 16/04/13 09:34:55 INFO org.apache.hadoop.hdfs.DFSClient: Created
> HDFS_DELEGATION_TOKEN token 2135943 for luca.rea on ha-hdfs:pgha
> Exception in thread "main"
> org.apache.hadoop.security.AccessControlException: luca.rea tries to renew
> a token with renewer spark
> at
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:481)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:6793)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:635)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1005)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
> at
> org.apache.hadoop.hdfs.DFSClient$Renewer.renew(DFSClient.java:1147)
> at org.apache.hadoop.security.token.Token.renew(Token.java:385)
> at
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:593)
> at
> org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:621)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:721)
> at
> 

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Sun, Rui
In SparkSubmit, there is less work for yarn-client mode than for yarn-cluster mode.
Basically, it prepares some Spark configurations as system properties, for example
information on additional resources required by the application that need to be
distributed to the cluster. These configurations are used later during SparkContext
initialization.

So generally, for yarn-client you may be able to skip spark-submit and launch the
Spark application directly, with some configuration set up before creating the
SparkContext.

Not sure about your error; have you set up YARN_CONF_DIR?
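
For illustration, a minimal sketch of that approach with the Spark 1.6-era API. It assumes HADOOP_CONF_DIR or YARN_CONF_DIR points at the cluster configuration (otherwise you get the 0.0.0.0:8032 retry loop seen below), the assembly jar path is only a placeholder, and depending on the environment more of the settings spark-submit normally prepares may still be needed:

import org.apache.spark.{SparkConf, SparkContext}

object YarnClientWithoutSubmit {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("My App")
      .setMaster("yarn-client")
      // spark-submit normally tells YARN where the Spark assembly lives; when
      // bypassing spark-submit, spark.yarn.jar (placeholder path) must be set explicitly.
      .set("spark.yarn.jar", "hdfs:///apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar")

    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 10).collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}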

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: Thursday, April 14, 2016 5:45 AM
To: Sun, Rui 
Cc: user 
Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?

Julia can pick the env var, and set the system properties or directly fill the 
configurations into a SparkConf, and then create a SparkContext

That's the point - just setting master to "yarn-client" doesn't work, even in
Java/Scala. E.g. the following code in Scala:

val conf = new SparkConf().setAppName("My App").setMaster("yarn-client")
val sc = new SparkContext(conf)
sc.parallelize(1 to 10).collect()
sc.stop()

results in an error:

Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8032

I think for now we can even put Julia aside and concentrate on the following
question: how does submitting an application via `spark-submit` in "yarn-client"
mode differ from setting the same mode directly in `SparkConf`?



On Wed, Apr 13, 2016 at 5:06 AM, Sun, Rui 
> wrote:
Spark configurations specified at the command line for spark-submit should be 
passed to the JVM inside Julia process. You can refer to 
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
 and 
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295
Generally,
spark-submit JVM -> JuliaRunner -> Env var like 
“JULIA_SUBMIT_ARGS” -> julia process -> new JVM with SparkContext
  Julia can pick the env var, and set the system properties or directly fill 
the configurations into a SparkConf, and then create a SparkContext

Yes, you are right, `spark-submit` creates new Python/R process that connects 
back to that same JVM and creates SparkContext in it.
Refer to 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
 and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65


From: Andrei 
[mailto:faithlessfri...@gmail.com]
Sent: Wednesday, April 13, 2016 4:32 AM
To: Sun, Rui >
Cc: user >
Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?

One part is passing the command line options, like “--master”, from the JVM 
launched by spark-submit to the JVM where SparkContext resides

Since I have full control over both - JVM and Julia parts - I can pass whatever 
options to both. But what exactly should be passed? Currently pipeline looks 
like this:

spark-submit JVM -> JuliaRunner -> julia process -> new JVM with SparkContext

I want the last JVM's SparkContext to understand that it should run on
YARN. Obviously, I can't pass the `--master yarn` option to the JVM itself. Instead, I
can pass the system property "spark.master" = "yarn-client", but this results in an
error:

Retrying connect to server: 0.0.0.0/0.0.0.0:8032


So it's definitely not enough. I tried to manually set all the system properties
that `spark-submit` adds to the JVM (including "spark-submit=true",
"spark.submit.deployMode=client", etc.), but it didn't help either. Source code is
always good, but for a stranger like me it's a little bit hard to grasp the control
flow in the SparkSubmit class.


For pySpark & SparkR, when running scripts in client deployment modes 
(standalone client and yarn client), the JVM is the same (py4j/RBackend running 
as a thread in the JVM launched by spark-submit)

Can you elaborate on this? Does it mean that `spark-submit` creates new 
Python/R process that connects back to that same JVM and creates SparkContext 
in it?


On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui 
> wrote:
There is much deployment preparation work handling the different deployment modes
for pyspark and SparkR in SparkSubmit. It is difficult to summarize briefly;
you had better refer to the source code.

Supporting running Julia scripts in SparkSubmit is more than implementing a
‘JuliaRunner’. One part is passing the command line options, like “--master”,
from the JVM launched by spark-submit to the JVM where SparkContext resides, in
the case that the two JVMs are not the same.

Memory needs when using expensive operations like groupBy

2016-04-13 Thread Divya Gehlot
Hi,
I am using Spark 1.5.2 with Scala 2.10, and my Spark job keeps failing with
exit code 143. This happens only in one job, where I am using unionAll and a
groupBy operation on multiple columns.

Please advise me on options to optimize it.
The options I am using now are:

--conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=1024m -XX:PermSize=256m" \
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=1024m -XX:PermSize=256m" \
--conf spark.yarn.executor.memoryOverhead=1024

Need to know the best practices/better ways to optimize code.

Thanks,
Divya


RE: Re: build/sbt gen-idea error

2016-04-13 Thread Yu, Yucai
Reminder: gen-idea has been removed in the master. See:

commit a172e11cba6f917baf5bd6c4f83dc6689932de9a
Author: Luciano Resende 
Date:   Mon Apr 4 16:55:59 2016 -0700

[SPARK-14366] Remove sbt-idea plugin

## What changes were proposed in this pull request?

Remove sbt-idea plugin as importing sbt project provides much better 
support.

Author: Luciano Resende 

Closes #12151 from lresende/SPARK-14366.


From: ImMr.K [mailto:875061...@qq.com]
Sent: Wednesday, April 13, 2016 9:48 PM
To: Ted Yu 
Cc: user 
Subject: Re: build/sbt gen-idea error

Actually, the same error occurred when I ran build/sbt compile or other commands.
After struggling for some time, I remembered that I use a proxy to connect to the
Internet. After setting the proxy for Maven, everything seems OK. Just a reminder
for those who use proxies.

--
Best regards,
Ze Jin



-- Original Message --
From: "Ted Yu";
Sent: Tuesday, April 12, 2016, 11:38 PM
To: "ImMr.K" <875061...@qq.com>;
Cc: "user";
Subject: Re: build/sbt gen-idea error

gen-idea doesn't seem to be a valid command:

[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea

On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K 
<875061...@qq.com> wrote:
Hi,
I have cloned spark and ,
cd spark
build/sbt gen-idea

got the following output:


Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/king/github/spark/project/project
[info] Loading project definition from 
/home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with 
same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project 
resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /home/king/github/spark/project
org.apache.maven.model.building.ModelBuildingException: 1 problem was 
encountered while building the effective model for 
org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
[FATAL] Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:14 from/to central ( 
http://repo.maven.apache.org/maven2): Error transferring file: Connection timed 
out from  
http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 
'parent.relativePath' points at wrong local POM @ line 22, column 11

at 
org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
at 
com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
at com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
at sbt.Load$.loadUnit(Load.scala:446)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
at sbt.BuildLoader.apply(BuildLoader.scala:140)
at sbt.Load$.loadAll(Load.scala:344)
at sbt.Load$.loadURI(Load.scala:299)
at sbt.Load$.load(Load.scala:295)
at sbt.Load$.load(Load.scala:286)
at sbt.Load$.apply(Load.scala:140)
at sbt.Load$.defaultLoad(Load.scala:36)
at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at 

pyspark EOFError after calling map

2016-04-13 Thread Pete Werner
Hi

I am new to spark & pyspark.

I am reading a small csv file (~40k rows) into a dataframe.

from pyspark.sql import Row, functions as F
from pyspark.mllib.linalg import Vectors

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]),
                           features=Vectors.dense(x[1:]))).toDF()

I get some weird error that does not occur every single time, but does
happen pretty regularly

>>> df2.show(1)
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,0.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 1 row

>>> df2.count()
41999

>>> df2.show(1)
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,0.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 1 row

>>> df2.count()
41999

>>> df2.show(1)
Traceback (most recent call last):
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in
manager
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in
worker
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in
main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line
545, in read_int
raise EOFError
EOFError
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

Once that EOFError has been raised, I will not see it again until I do
something that requires interacting with the Spark server.

When I call df2.count() it shows the [Stage xxx] prompt, which is what I
mean by it going to the Spark server.

Anything that triggers that seems to eventually end up giving the EOFError
again when I do something with df2.

It does not seem to happen with df (vs. df2), so it seems like it must be
something happening in the df.map() line.

-- 

Pete Werner
Data Scientist
Freelancer.com

Level 20
680 George Street
Sydney NSW 2000

e: pwer...@freelancer.com
p:  +61 2 8599 2700
w: http://www.freelancer.com


Error at starting httpd after the installation using spark-ec2 script

2016-04-13 Thread Mohed Alibrahim
Dear All,

I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything
was OK, but it failed to start httpd at the end of the installation. I
followed the instructions exactly and repeated the process many times, but
no luck.

-
[timing] rstudio setup:  00h 00m 00s
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...ec.
us-west-2.compute.amazonaws.com
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Connection to ec2-.us-west-2.compute.amazonaws.com
closed.
Shutting down GANGLIA gmetad:  [FAILED]
Starting GANGLIA gmetad:   [  OK  ]
Stopping httpd:[FAILED]
Starting httpd: httpd: Syntax error on line 154 of
/etc/httpd/conf/httpd.conf: Cannot load
/etc/httpd/modules/mod_authz_core.so into server:
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No
such file or directory
   [FAILED]
[timing] ganglia setup:  00h 00m 01s
Connection to ec2-.us-west-2.compute.amazonaws.com closed.
Spark standalone cluster started at http://ec2-...
us-west-2.compute.amazonaws.com:8080
Ganglia started at http://ec2-.
us-west-2.compute.amazonaws.com:5080/ganglia
Done!
--

httpd.conf:

line 154:

LoadModule authz_core_module modules/mod_authz_core.so

If I comment out this line, the error moves to the following lines:

LoadModule unixd_module modules/mod_unixd.so
LoadModule access_compat_module modules/mod_access_compat.so
LoadModule mpm_prefork_module modules/mod_mpm_prefork.so
LoadModule php5_module modules/libphp-5.6.so

---

Any help would be really appreciated.


Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Andrei
>
> Julia can pick the env var, and set the system properties or directly fill
> the configurations into a SparkConf, and then create a SparkContext


That's the point - just setting master to "yarn-client" doesn't work, even
in Java/Scala. E.g. the following code in *Scala*:


val conf = new SparkConf().setAppName("My App").setMaster("yarn-client")
val sc = new SparkContext(conf)
sc.parallelize(1 to 10).collect()
sc.stop()


results in an error:

Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032


I think for now we can even put Julia aside and concentrate on the following
question: how does submitting an application via `spark-submit` in
"yarn-client" mode differ from setting the same mode directly in
`SparkConf`?



On Wed, Apr 13, 2016 at 5:06 AM, Sun, Rui  wrote:

> Spark configurations specified at the command line for spark-submit should
> be passed to the JVM inside Julia process. You can refer to
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
> and
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295
>
> Generally,
>
> spark-submit JVM -> JuliaRunner -> Env var like
> “JULIA_SUBMIT_ARGS” -> julia process -> new JVM with SparkContext
>
>   Julia can pick the env var, and set the system properties or directly
> fill the configurations into a SparkConf, and then create a SparkContext
>
>
>
> Yes, you are right, `spark-submit` creates new Python/R process that
> connects back to that same JVM and creates SparkContext in it.
>
> Refer to
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
> and
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65
>
>
>
>
>
> *From:* Andrei [mailto:faithlessfri...@gmail.com]
> *Sent:* Wednesday, April 13, 2016 4:32 AM
> *To:* Sun, Rui 
> *Cc:* user 
> *Subject:* Re: How does spark-submit handle Python scripts (and how to
> repeat it)?
>
>
>
> One part is passing the command line options, like “--master”, from the
> JVM launched by spark-submit to the JVM where SparkContext resides
>
>
>
> Since I have full control over both - JVM and Julia parts - I can pass
> whatever options to both. But what exactly should be passed? Currently
> pipeline looks like this:
>
>
>
> spark-submit JVM -> JuliaRunner -> julia process -> new JVM with
> SparkContext
>
>
>
>  I want to make the last JVM's SparkContext to understand that it should
> run on YARN. Obviously, I can't pass `--master yarn` option to JVM itself.
> Instead, I can pass system property "spark.master" = "yarn-client", but
> this results in an error:
>
>
>
> Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
>
>
>
>
> So it's definitely not enough. I tried to set manually all system
> properties that `spark-submit` adds to the JVM (including
> "spark-submit=true", "spark.submit.deployMode=client", etc.), but it didn't
> help too. Source code is always good, but for a stranger like me it's a
> little bit hard to grasp control flow in SparkSubmit class.
>
>
>
>
>
> For pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit)
>
>
>
> Can you elaborate on this? Does it mean that `spark-submit` creates new
> Python/R process that connects back to that same JVM and creates
> SparkContext in it?
>
>
>
>
>
> On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui  wrote:
>
> There is much deployment preparation work handling different deployment
> modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize
> it briefly, you had better refer to the source code.
>
>
>
> Supporting running Julia scripts in SparkSubmit is more than implementing
> a ‘JuliaRunner’. One part is passing the command line options, like
> “--master”, from the JVM launched by spark-submit to the JVM where
> SparkContext resides, in the case that the two JVMs are not the same. For
> pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit) , so no need to
> pass the command line options around. However, in your case, Julia
> interpreter launches an in-process JVM for SparkContext, which is a
> separate JVM from the one launched by spark-submit. So you need a way,
> typically an environment variable, like “SPARKR_SUBMIT_ARGS” for SparkR or
> “PYSPARK_SUBMIT_ARGS” for pyspark, to pass command line args to the
> in-process JVM in the Julia interpreter so that SparkConf can pick the
> options (see the sketch after this quoted message).
>
>
>
> *From:* Andrei [mailto:faithlessfri...@gmail.com]
> *Sent:* Tuesday, April 12, 2016 3:48 AM
> *To:* 
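
For illustration, a minimal sketch of the environment-variable handoff described above. The variable name JULIA_SUBMIT_ARGS and its key=value,key=value format are made up here to mirror PYSPARK_SUBMIT_ARGS and SPARKR_SUBMIT_ARGS; this is not an existing Spark API:

import org.apache.spark.{SparkConf, SparkContext}

object EnvConfBridge {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Julia on Spark").setMaster("yarn-client")
    // e.g. JULIA_SUBMIT_ARGS="spark.submit.deployMode=client,spark.executor.memory=2g"
    sys.env.get("JULIA_SUBMIT_ARGS").foreach { raw =>
      raw.split(",").map(_.trim).filter(_.contains("=")).foreach { kv =>
        val Array(key, value) = kv.split("=", 2)
        conf.set(key, value)
      }
    }
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}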

Re: Strange bug: Filter problem with parenthesis

2016-04-13 Thread Michael Armbrust
You need to use `backticks` to reference columns that have non-standard
characters.
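
As a small illustration of that advice (toy data; the id column and its values are made up, and sqlContext is the one predefined in a Spark 1.6 shell), without the backticks the parser treats sum(...) as an aggregate call rather than as a literal column name:

import org.apache.spark.sql.functions._

val df = sqlContext.createDataFrame(Seq(("a", 3), ("a", 4), ("b", 9)))
  .toDF("id", "OpenAccounts")
// The default aggregate column name in 1.6 is "sum(OpenAccounts)".
val agg = df.groupBy("id").agg(sum("OpenAccounts"))
// Backticks make the parenthesized name resolve as a plain column reference.
agg.filter("`sum(OpenAccounts)` > 5").show()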

On Wed, Apr 13, 2016 at 6:56 AM,  wrote:

> Hi,
>
> I am debugging a program, and for some reason, a line calling the
> following is failing:
>
> df.filter("sum(OpenAccounts) > 5").show
>
> It says it cannot find the column *OpenAccounts*, as if it were applying
> the sum() function and looking for a column with that name, which does not
> exist. This works fine if I rename the column to something without
> parentheses.
>
> I can’t reproduce this issue in the Spark shell (1.6.0); any ideas on how I
> can analyze this? This is an aggregation result, with the default column
> names afterwards.
>
> PS: The workaround is to use toDF(cols) and rename all the columns, but I am
> wondering if toDF has any impact on the underlying RDD structure (e.g.
> repartitioning, cache, etc.)
>
> Appreciated,
> Saif
>
>


error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-13 Thread AlexModestov
I get this error.
Does anyone know what it means?

Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1
locations. Most recent failure cause:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1397)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1384)
at
org.apache.spark.sql.execution.TakeOrderedAndProject.collectData(basicOperators.scala:213)
at
org.apache.spark.sql.execution.TakeOrderedAndProject.doExecute(basicOperators.scala:223)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
at
org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.Union.doExecute(basicOperators.scala:144)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
at
org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
at
org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
at
org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at

RE: how does sc.textFile translate regex in the input.

2016-04-13 Thread Yong Zhang
It is described in "Hadoop Definition Guild", chapter 3, FilePattern
https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch03.html#FilePatterns
Yong
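
To illustrate, a small sketch with hypothetical paths, assuming a Spark shell where sc is already defined. The patterns accepted by sc.textFile are Hadoop glob patterns, expanded through FileSystem.globStatus; they are globs rather than general regexes, and each '*' matches within a single path segment, so the example below does not recurse to arbitrary depth:

// Four explicit wildcard levels, one per path segment.
val logs = sc.textFile("hdfs://namenode:8020/file1/*/*/*/*.txt")
// Braces and character ranges from the Hadoop glob syntax also work.
val subset = sc.textFile("hdfs://namenode:8020/file1/2016/{03,04}/[0-3]*/part-*.txt")
println(s"${logs.partitions.length} and ${subset.partitions.length} input partitions")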

From: pradeep1...@gmail.com
Date: Wed, 13 Apr 2016 18:56:58 +
Subject: how does sc.textFile translate regex in the input.
To: user@spark.apache.org

I am trying to understand how Spark's sc.textFile() works. I specifically
have a question about how it translates paths with a regex in them.
For example:
files = sc.textFile("hdfs://:/file1/*/*/*/*.txt")
How does it find all the sub-directories and recurse to all the leaf files?
Is there any documentation on how this happens?
Pradeep

Re: Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread Mich Talebzadeh
Actually, how many tables are involved here?

What is the version of Hive used? Sorry, I have no idea about the Cloudera 5.5.1
spec.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 April 2016 at 19:20, pseudo oduesp  wrote:

> hi guys ,
> i have this error after 5 hours of processing i make lot of joins 14 left
> joins
> with small table :
>
>
>
>  i saw in the spark ui  and console log evrithing ok but when he save
> last join i get this error
>
> Py4JJavaError: An error occurred while calling o115.parquet. _metadata is
> not a Parquet file (too small)
>
> i use 4 containers  26 go each and 8 cores i increase number of partition
> and  i use broadcast join  whithout succes i get log file but he s large 57
> mo i can't share with you .
>
> i use pyspark 1.5.0 on cloudera 5.5.1 and yarn  and i use
> hivecontext  for dealing with data.
>
>
>
>


how does sc.textFile translate regex in the input.

2016-04-13 Thread Pradeep Nayak
I am trying to understand how Spark's sc.textFile() works. I
specifically have a question about how it translates paths with a regex in
them.

For example:

files = sc.textFile("hdfs://:/file1/*/*/*/*.txt")

How does it find all the sub-directories and recurse to all the leaf
files? Is there any documentation on how this happens?

Pradeep


Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread pseudo oduesp
Hi guys,
I get this error after 5 hours of processing. I make a lot of joins (14 left
joins) with small tables.

I watched the Spark UI and the console log and everything looked OK, but when it
saves the last join I get this error:

Py4JJavaError: An error occurred while calling o115.parquet. _metadata is
not a Parquet file (too small)

I use 4 containers of 26 GB each with 8 cores. I increased the number of
partitions and used a broadcast join, without success. I have a log file, but it
is large (57 MB) so I can't share it with you.

I use PySpark 1.5.0 on Cloudera 5.5.1 with YARN, and I use HiveContext for
dealing with the data.


Error starting Spark 1.6.1

2016-04-13 Thread Mohed Alibrahim
Dear All,

I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything
was OK, but it failed to start httpd at the end of the installation. I
followed the instructions exactly and repeated the process many times, but
no luck.

-
[timing] rstudio setup:  00h 00m 00s
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...ec.
us-west-2.compute.amazonaws.com
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Connection to ec2-.us-west-2.compute.amazonaws.com
closed.
Shutting down GANGLIA gmetad:  [FAILED]
Starting GANGLIA gmetad:   [  OK  ]
Stopping httpd:[FAILED]
Starting httpd: httpd: Syntax error on line 154 of
/etc/httpd/conf/httpd.conf: Cannot load
/etc/httpd/modules/mod_authz_core.so into server:
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No
such file or directory
   [FAILED]
[timing] ganglia setup:  00h 00m 01s
Connection to ec2-.us-west-2.compute.amazonaws.com closed.
Spark standalone cluster started at http://ec2-...
us-west-2.compute.amazonaws.com:8080
Ganglia started at http://ec2-.
us-west-2.compute.amazonaws.com:5080/ganglia
Done!
--

httpd.conf:

line 154:

LoadModule authz_core_module modules/mod_authz_core.so

If I comment out this line, the error moves to the following lines:

LoadModule unixd_module modules/mod_unixd.so
LoadModule access_compat_module modules/mod_access_compat.so
LoadModule mpm_prefork_module modules/mod_mpm_prefork.so
LoadModule php5_module modules/libphp-5.6.so

---

Any help would be really appreciated.


Re: Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Hi Yong,

thanks for your response. As I said in my first email, I've tried both a
reference to the classpath resource (env/dev/log4j-executor.properties) and
the file:// protocol. Also, the driver logging is working fine and I'm
using the same kind of reference.

Below the content of my classpath:

[image: Inline image 1]

Plus this is the content of the exploded fat jar assembled with sbt
assembly plugin:

[image: Inline image 2]


This folder is at the root level of the classpath.

Thanks,
-carlos.

On Wed, Apr 13, 2016 at 2:35 PM, Yong Zhang  wrote:

> Is the env/dev/log4j-executor.properties file within your jar file? Is the
> path matching with what you specified as env/dev/log4j-executor.properties?
>
> If you read the log4j document here:
> https://logging.apache.org/log4j/1.2/manual.html
>
> When you specify the log4j.configuration=my_custom.properties, you have 2
> option:
>
> 1) the my_custom.properties has to be in the jar (or in the classpath). In
> your case, since you specify the package path, you need to make sure they
> are matched in your jar file
> 2) use like log4j.configuration=file:///tmp/my_custom.properties. In this
> way, you need to make sure file my_custom.properties exists in /tmp folder
> on ALL of your worker nodes.
>
> Yong
>
> --
> Date: Wed, 13 Apr 2016 14:18:24 -0300
> Subject: Re: Logging in executors
> From: cma...@despegar.com
> To: yuzhih...@gmail.com
> CC: user@spark.apache.org
>
>
> Thanks for your response Ted. You're right, there was a typo. I changed
> it, now I'm executing:
>
> bin/spark-submit --master spark://localhost:7077 --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
> --class
>
> The content of this file is:
>
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, FILE
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
> %c{1}: %m%n
>
> log4j.appender.FILE=org.apache.log4j.RollingFileAppender
> log4j.appender.FILE.File=/tmp/executor.log
> log4j.appender.FILE.ImmediateFlush=true
> log4j.appender.FILE.Threshold=debug
> log4j.appender.FILE.Append=true
> log4j.appender.FILE.MaxFileSize=100MB
> log4j.appender.FILE.MaxBackupIndex=5
> log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
> log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
> %c{1}: %m%n
>
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> log4j.logger.com.despegar.p13n=DEBUG
>
> # SPARK-9183: Settings to avoid annoying messages when looking up
> nonexistent UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
>
>
> Finally, the code on which I'm using logging in the executor is:
>
> def groupAndCount(keys: DStream[(String, List[String])])(handler: 
> ResultHandler) = {
>
>   val result = keys.reduceByKey((prior, current) => {
> (prior ::: current)
>   }).flatMap {
> case (date, keys) =>
>   val rs = keys.groupBy(x => x).map(
>   obs =>{
> val (d,t) = date.split("@") match {
>   case Array(d,t) => (d,t)
> }
> import org.apache.log4j.Logger
> import scala.collection.JavaConverters._
> val logger: Logger = Logger.getRootLogger
> logger.info(s"Metric retrieved $d")
> Metric("PV", d, obs._1, t, obs._2.size)
> }
>   )
>   rs
>   }
>
>   result.foreachRDD((rdd: RDD[Metric], time: Time) => {
> handler(rdd, time)
>   })
>
> }
>
>
> Originally the import and logger object was outside the map function. I'm
> also using the root logger just to see if it's working, but nothing gets
> logged. I've checked that the property is set correctly on the executor
> side through println(System.getProperty("log4j.configuration")) and is OK,
> but still not working.
>
> Thanks again,
> -carlos.
>


Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
These are the components



java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0

hive --version
Hive 2.0.0

spark-shell banner: version 1.6.1
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)

Metastore:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

To me all working OK

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 April 2016 at 18:10, Michael Segel  wrote:

> Mich
>
> Are you building your own releases from the source?
> Which version of Scala?
>
> Again, the builds seem to be ok and working, but I don’t want to hit some
> ‘gotcha’ if I could avoid it.
>
>
> On Apr 13, 2016, at 7:15 AM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I am not sure this helps.
>
> we use Spark 1.6 and Hive 2. I also use JDBC (beeline for Hive)  plus
> Oracle and Sybase. They all work fine.
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 April 2016 at 23:42, Michael Segel 
> wrote:
>
>> Hi,
>> This is probably a silly question on my part…
>>
>> I’m looking at the latest (spark 1.6.1 release) and would like to do a
>> build w Hive and JDBC support.
>>
>> From the documentation, I see two things that make me scratch my head.
>>
>> 1) Scala 2.11
>> "Spark does not yet support its JDBC component for Scala 2.11.”
>>
>> So if we want to use JDBC, don’t use Scala 2.11.x (in this case its
>> 2.11.8)
>>
>> 2) Hive Support
>> "To enable Hive integration for Spark SQL along with its JDBC server and
> CLI, add the -Phive and -Phive-thriftserver profiles to your existing
>> build options. By default Spark will build with Hive 0.13.1 bindings.”
>>
>> So if we’re looking at a later release of Hive… lets say 1.1.x … still
>> use the -Phive and Phive-thriftserver . Is there anything else we should
>> consider?
>>
>> Just asking because I’ve noticed that this part of the documentation
>> hasn’t changed much over the past releases.
>>
>> Thanks in Advance,
>>
>> -Mike
>>
>>
>
>


RE: Logging in executors

2016-04-13 Thread Yong Zhang
Is the env/dev/log4j-executor.properties file within your jar file? Is the path
matching what you specified as env/dev/log4j-executor.properties?

If you read the log4j documentation here:
https://logging.apache.org/log4j/1.2/manual.html

When you specify log4j.configuration=my_custom.properties, you have 2 options:

1) my_custom.properties has to be in the jar (or on the classpath). In your
case, since you specify the package path, you need to make sure they are
matched in your jar file.
2) Use log4j.configuration=file:///tmp/my_custom.properties. In this way, you
need to make sure the file my_custom.properties exists in the /tmp folder on
ALL of your worker nodes.
Yong

Date: Wed, 13 Apr 2016 14:18:24 -0300
Subject: Re: Logging in executors
From: cma...@despegar.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org

Thanks for your response Ted. You're right, there was a typo. I changed it, now 
I'm executing:
bin/spark-submit --master spark://localhost:7077 --conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
 --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
 --class
The content of this file is:
# Set everything to be logged to the console
log4j.rootCategory=INFO, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/executor.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=5
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.com.despegar.p13n=DEBUG

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent
# UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

Finally, the code on which I'm using logging in the executor is:
def groupAndCount(keys: DStream[(String, List[String])])(handler: 
ResultHandler) = {
  
  val result = keys.reduceByKey((prior, current) => {
(prior ::: current)
  }).flatMap {
case (date, keys) =>
  val rs = keys.groupBy(x => x).map(
  obs =>{
val (d,t) = date.split("@") match {
  case Array(d,t) => (d,t)
}
import org.apache.log4j.Logger
import scala.collection.JavaConverters._
val logger: Logger = Logger.getRootLogger
logger.info(s"Metric retrieved $d")
Metric("PV", d, obs._1, t, obs._2.size)
}
  )
  rs
  }

  result.foreachRDD((rdd: RDD[Metric], time: Time) => {
handler(rdd, time)
  })

}
Originally the import and logger object was outside the map function. I'm also 
using the root logger just to see if it's working, but nothing gets logged. 
I've checked that the property is set correctly on the executor side through 
println(System.getProperty("log4j.configuration")) and is OK, but still not 
working.
Thanks again,
-carlos.

Re: Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Thanks for your response Ted. You're right, there was a typo. I changed it,
now I'm executing:

bin/spark-submit --master spark://localhost:7077 --conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
--class

The content of this file is:

# Set everything to be logged to the console
log4j.rootCategory=INFO, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
%c{1}: %m%n

log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/executor.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=5
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
%c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.com.despegar.p13n=DEBUG

# SPARK-9183: Settings to avoid annoying messages when looking up
nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR


Finally, the code on which I'm using logging in the executor is:

def groupAndCount(keys: DStream[(String, List[String])])(handler:
ResultHandler) = {

  val result = keys.reduceByKey((prior, current) => {
(prior ::: current)
  }).flatMap {
case (date, keys) =>
  val rs = keys.groupBy(x => x).map(
  obs =>{
val (d,t) = date.split("@") match {
  case Array(d,t) => (d,t)
}
import org.apache.log4j.Logger
import scala.collection.JavaConverters._
val logger: Logger = Logger.getRootLogger
logger.info(s"Metric retrieved $d")
Metric("PV", d, obs._1, t, obs._2.size)
}
  )
  rs
  }

  result.foreachRDD((rdd: RDD[Metric], time: Time) => {
handler(rdd, time)
  })

}


Originally the import and logger object was outside the map function. I'm
also using the root logger just to see if it's working, but nothing gets
logged. I've checked that the property is set correctly on the executor
side through println(System.getProperty("log4j.configuration")) and is OK,
but still not working.
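
As a side note, a small sketch of one way to run that check from the driver, collecting the property as seen by a few tasks (it assumes the running SparkContext sc; the task count of 4 is arbitrary):

val seen = sc.parallelize(1 to 4, 4).map { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  host + " -> " + System.getProperty("log4j.configuration")
}.collect().distinct
seen.foreach(println)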

Thanks again,
-carlos.


Blog: Better Feature Engineering with Spark, Solr, and Lucene Analyzers

2016-04-13 Thread Steve Rowe
FYI, I wrote functionality that enables Lucene text analysis components to be used
to extract text features via a transformer in spark.ml pipelines.
Non-machine-learning uses are supported too.

See my blog describing the capabilities, which are included in the open-source 
spark-solr project: 


Feedback welcome!

--
Steve Rowe
www.lucidworks.com


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Silly question...

2016-04-13 Thread Michael Segel
Mich

Are you building your own releases from the source? 
Which version of Scala? 

Again, the builds seem to be ok and working, but I don’t want to hit some 
‘gotcha’ if I could avoid it. 


> On Apr 13, 2016, at 7:15 AM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> I am not sure this helps.
> 
> we use Spark 1.6 and Hive 2. I also use JDBC (beeline for Hive)  plus Oracle 
> and Sybase. They all work fine.
> 
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 12 April 2016 at 23:42, Michael Segel  > wrote:
> Hi, 
> This is probably a silly question on my part… 
> 
> I’m looking at the latest (spark 1.6.1 release) and would like to do a build 
> w Hive and JDBC support. 
> 
> From the documentation, I see two things that make me scratch my head.
> 
> 1) Scala 2.11 
> "Spark does not yet support its JDBC component for Scala 2.11.”
> 
> So if we want to use JDBC, don’t use Scala 2.11.x (in this case its 2.11.8)
> 
> 2) Hive Support
> "To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
> add the -Phive and -Phive-thriftserver profiles to your existing build 
> options. By default Spark will build with Hive 0.13.1 bindings.”
> 
> So if we’re looking at a later release of Hive… lets say 1.1.x … still use 
> the -Phive and Phive-thriftserver . Is there anything else we should 
> consider? 
> 
> Just asking because I’ve noticed that this part of the documentation hasn’t 
> changed much over the past releases. 
> 
> Thanks in Advance, 
> 
> -Mike
> 
> 



Re: Streaming WriteAheadLogBasedBlockHandler disallows parellism via StorageLevel replication factor

2016-04-13 Thread Ted Yu
w.r.t. the effective storage level log, here is the JIRA which introduced
it:

[SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled

On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin 
wrote:

> Hi all,
>
> If I am using a Custom Receiver with Storage Level set to StorageLevel.
> MEMORY_ONLY_SER_2 and the WAL enabled I get this Warning in the logs:
>
> 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level 
> replication 2 is unnecessary when write ahead log is enabled, change to 
> replication 1
> 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: User defined storage 
> level StorageLevel(false, true, false, false, 2) is changed to effective 
> storage level StorageLevel(false, true, false, false, 1) when write ahead log 
> is enabled
>
>
> My application is running on 4 Executors with 4 cores each, and 1
> Receiver.  Because the data is not replicated the processing runs on only
> one Executor:
>
> [image: Inline images 1]
>
> Instead of 16 cores processing the Streaming data only 4 are being used.
>
> We cannot reparation the DStream to distribute data to more Executors
> since if you call reparation on an RDD which is only located on one node,
> the new partitions are only created on that node, which doesn't help.  This
> theory that repartitioning doesn't help can be tested with this simple
> example, which tries to go from one partition on a single node to many on
> many nodes.  What you find with when you look at the multiplePartitions RDD
> in the UI is that its 6 partitions are on the same Executor.
>
> scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
> rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at 
> parallelize at :27
>
> scala> rdd.count()
> res0: Long = 6
>
> scala> val singlePartition = 
> rdd.repartition(1).cache.setName("singlePartition")
> singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition 
> MapPartitionsRDD[4] at repartition at :29
>
> scala> singlePartition.count()
> res1: Long = 6
>
> scala> val multiplePartitions = 
> singlePartition.repartition(6).cache.setName("multiplePartitions")
> multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions 
> MapPartitionsRDD[8] at repartition at :31
>
> scala> multiplePartitions.count()
> res2: Long = 6
>
> Am I correct in the use of reparation, that the data does not get shuffled if 
> it is all on one Executor?
>
> Shouldn't I be allowed to set the Receiver replication factor to two when the 
> WAL is enabled so that multiple Executors can work on the Streaming input 
> data?
>
> We will look into creating 4 Receivers so that the data gets distributed
> more evenly.  But won't that "waste" 4 cores in our example, where one
> would do?
>
> Best regards,
> Patrick
>
>
>
>
>


EMR Spark log4j and metrics

2016-04-13 Thread Peter Halliday
I have an existing cluster that I stand up via Docker images and CloudFormation
templates on AWS. We are moving to an EMR and AWS Data Pipeline process, and are
having problems with metrics and log4j. We’ve sent a JSON configuration for
spark-log4j and spark-metrics. The log4j file seems to be basically working
for the master; however, it isn’t working for the driver and executors, and I’m
not sure why. Also, the metrics aren’t working anywhere. We use CloudWatch to log
the metrics, and there seems to be no CloudWatch Sink for Spark on EMR, so we
created one and added it to a jar that’s sent via --jars to spark-submit.

Peter Halliday
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark accessing secured HDFS

2016-04-13 Thread vijikarthi
It looks like the support does not exist (unless someone can say otherwise), and
there is an open JIRA.

https://issues.apache.org/jira/browse/SPARK-12909



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-acessing-secured-HDFS-tp26766p26778.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Logging in executors

2016-04-13 Thread Ted Yu
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j.
configuration=env/dev/log4j-driver.properties"

I think the above may have a typo : you refer to log4j-driver.properties in
both arguments.

FYI

On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas 
wrote:

> Hi guys,
>
> I'm trying to enable logging in the executors but with no luck.
>
> According to the oficial documentation and several blogs, this should be
> done passing the
> "spark.executor.extraJavaOpts=-Dlog4j.configuration=[my-file]" to the
> spark-submit tool. I've tried both sending a reference to a classpath
> resource as using the "file:" protocol but nothing happens. Of course in
> the later case, I've used the --file option in the command line, although
> is not clear where this file is uploaded in the worker machine.
>
> However, I was able to make it work by setting the properties in the
> spark-defaults.conf file pointing to each one of the configurations on the
> machine. This approach has a big drawback though: if I change something in
> the log4j configuration I need to change it in every machine (and I''m not
> sure if restarting is required) which is not what I'm looking for.
>
> The complete command I'm using is as follows:
>
> bin/spark-submit --master spark://localhost:7077 --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
> --class [my-main-class] [my-jar].jar
>
>
> Both files are in the classpath and are reachable -- already tested with
> the driver.
>
> Any comments will be welcomed.
>
> Thanks in advance.
> -carlos.
>
>


Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Hi guys,

I'm trying to enable logging in the executors but with no luck.

According to the official documentation and several blogs, this should be
done by passing
"spark.executor.extraJavaOpts=-Dlog4j.configuration=[my-file]" to the
spark-submit tool. I've tried both sending a reference to a classpath
resource and using the "file:" protocol, but nothing happens. Of course, in
the latter case I've used the --file option on the command line, although
it is not clear where this file is uploaded on the worker machine.

However, I was able to make it work by setting the properties in the
spark-defaults.conf file, pointing to each one of the configurations on the
machine. This approach has a big drawback though: if I change something in
the log4j configuration I need to change it on every machine (and I'm not
sure whether a restart is required), which is not what I'm looking for.

The complete command I'm using is as follows:

bin/spark-submit --master spark://localhost:7077 --conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
--class [my-main-class] [my-jar].jar


Both files are in the classpath and are reachable -- already tested with
the driver.

Any comments will be welcomed.

Thanks in advance.
-carlos.


Streaming WriteAheadLogBasedBlockHandler disallows parellism via StorageLevel replication factor

2016-04-13 Thread Patrick McGloin
Hi all,

If I am using a Custom Receiver with Storage Level set to StorageLevel.
MEMORY_ONLY_SER_2 and the WAL enabled I get this Warning in the logs:

16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level
replication 2 is unnecessary when write ahead log is enabled, change
to replication 1
16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: User defined
storage level StorageLevel(false, true, false, false, 2) is changed to
effective storage level StorageLevel(false, true, false, false, 1)
when write ahead log is enabled


My application is running on 4 Executors with 4 cores each, and 1
Receiver.  Because the data is not replicated the processing runs on only
one Executor:

[image: Inline images 1]

Instead of 16 cores processing the Streaming data, only 4 are being used.

We cannot repartition the DStream to distribute data to more Executors since,
if you call repartition on an RDD which is only located on one node, the new
partitions are only created on that node, which doesn't help.  This theory
that repartitioning doesn't help can be tested with this simple example,
which tries to go from one partition on a single node to many on many
nodes.  What you find when you look at the multiplePartitions RDD in
the UI is that its 6 partitions are on the same Executor.

scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at
parallelize at <console>:27

scala> rdd.count()
res0: Long = 6

scala> val singlePartition = rdd.repartition(1).cache.setName("singlePartition")
singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition
MapPartitionsRDD[4] at repartition at <console>:29

scala> singlePartition.count()
res1: Long = 6

scala> val multiplePartitions =
singlePartition.repartition(6).cache.setName("multiplePartitions")
multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions
MapPartitionsRDD[8] at repartition at <console>:31

scala> multiplePartitions.count()
res2: Long = 6

Am I correct in the use of repartition, that the data does not get
shuffled if it is all on one Executor?

Shouldn't I be allowed to set the Receiver replication factor to two
when the WAL is enabled so that multiple Executors can work on the
Streaming input data?

We will look into creating 4 Receivers so that the data gets distributed
more evenly.  But won't that "waste" 4 cores in our example, where one
would do?
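
A rough sketch of that multi-receiver approach, assuming a custom receiver
class called MyCustomReceiver (a stand-in for the real one), would be:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create several receiver streams and union them so that input blocks land on
// more than one Executor.
val ssc = new StreamingContext(sc, Seconds(10))
val streams = (1 to 4).map(_ =>
  ssc.receiverStream(new MyCustomReceiver(StorageLevel.MEMORY_ONLY_SER)))
val unified = ssc.union(streams)
unified.foreachRDD(rdd => println(s"input partitions: ${rdd.partitions.length}"))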

Best regards,
Patrick


Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
Hi,

I am not sure this helps.

We use Spark 1.6 and Hive 2. I also use JDBC (beeline for Hive) plus
Oracle and Sybase. They all work fine.


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 12 April 2016 at 23:42, Michael Segel  wrote:

> Hi,
> This is probably a silly question on my part…
>
> I’m looking at the latest (spark 1.6.1 release) and would like to do a
> build w Hive and JDBC support.
>
> From the documentation, I see two things that make me scratch my head.
>
> 1) Scala 2.11
> "Spark does not yet support its JDBC component for Scala 2.11.”
>
> So if we want to use JDBC, don’t use Scala 2.11.x (in this case its 2.11.8)
>
> 2) Hive Support
> "To enable Hive integration for Spark SQL along with its JDBC server and
> CLI, add the -Phive and Phive-thriftserver profiles to your existing
> build options. By default Spark will build with Hive 0.13.1 bindings.”
>
> So if we’re looking at a later release of Hive… let’s say 1.1.x… we still use
> the -Phive and -Phive-thriftserver profiles. Is there anything else we should
> consider?
>
> Just asking because I’ve noticed that this part of the documentation
> hasn’t changed much over the past releases.
>
> Thanks in Advance,
>
> -Mike
>
>


Spark fileStream from a partitioned hive dir

2016-04-13 Thread Daniel Haviv
Hi,
We have a hive table which gets data written to it by two partition keys,
day and hour.
We would like to stream the incoming files, but since fileStream can only
listen on one directory, we start a streaming job on the latest partition
and every hour kill it and start a new one on a newer partition (we are
also working on migrating the stream from HDFS to Kafka, but it will take a
while).

I imagine I'm not the first to try this; is there a better way to either
stream multiple dirs or change the streaming source location at runtime (or
any other suggestion)?
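
One possible workaround, sketched here under the assumption that the partition
directories to watch are known up front (the paths below are placeholders), is
to open one fileStream per directory and union them:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical partition paths; in practice these would be derived from the
// current day/hour partition keys.
val ssc = new StreamingContext(sc, Seconds(60))
val dirs = Seq(
  "hdfs:///warehouse/mytable/day=2016-04-13/hour=09",
  "hdfs:///warehouse/mytable/day=2016-04-13/hour=10")
val perDir = dirs.map(dir => ssc.textFileStream(dir))
val allFiles = ssc.union(perDir)
allFiles.foreachRDD(rdd => println(s"new lines: ${rdd.count()}"))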


Thank you.
Daniel


Strange bug: Filter problem with parenthesis

2016-04-13 Thread Saif.A.Ellafi
Hi,

I am debugging a program, and for some reason, a line calling the following is 
failing:

df.filter("sum(OpenAccounts) > 5").show

It says it cannot find the column OpenAccounts, as if it were applying the sum()
function and looking for a column with that name, which does not exist. This
works fine if I rename the column to something without parentheses.

I can't reproduce this issue in the Spark shell (1.6.0); any ideas on how I can
analyze this? This is an aggregation result, with the default column names
afterwards.

PS: A workaround is to use toDF(cols) and rename all columns, but I am wondering
whether toDF has any impact on the underlying RDD structure (e.g. repartitioning,
caching, etc.).

Appreciated,
Saif



Re: build/sbt gen-idea error

2016-04-13 Thread ImMr.K
Actually, the same error occurred when I ran build/sbt compile or other commands.
After struggling with it for some time, I remembered that I use a proxy to connect
to the Internet. Once I set the proxy for Maven, everything seemed OK. Just a
reminder for those who use proxies.


--
Best regards,
Ze Jin


 




------------------ Original message ------------------
From: "Ted Yu";
Date: 2016-04-12 (Tue) 11:38
To: "ImMr.K" <875061...@qq.com>;
Cc: "user";
Subject: Re: build/sbt gen-idea error



gen-idea doesn't seem to be a valid command:
[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea




On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:
Hi,
I have cloned spark and ,
cd spark
build/sbt gen-idea


got the following output:




Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/king/github/spark/project/project
[info] Loading project definition from 
/home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with 
same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project 
resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /home/king/github/spark/project
org.apache.maven.model.building.ModelBuildingException: 1 problem was 
encountered while building the effective model for 
org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
[FATAL] Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:14 from/to central ( 
http://repo.maven.apache.org/maven2): Error transferring file: Connection timed 
out from  
http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 
'parent.relativePath' points at wrong local POM @ line 22, column 11


at 
org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
at 
org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
at 
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
at 
com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
at 
com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
at 
com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at sbt.Load$$anonfun$27.apply(Load.scala:446)
at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
at sbt.Load$.loadUnit(Load.scala:446)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
at 
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
at sbt.BuildLoader.apply(BuildLoader.scala:140)
at sbt.Load$.loadAll(Load.scala:344)
at sbt.Load$.loadURI(Load.scala:299)
at sbt.Load$.load(Load.scala:295)
at sbt.Load$.load(Load.scala:286)
at sbt.Load$.apply(Load.scala:140)
at sbt.Load$.defaultLoad(Load.scala:36)
at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
at 
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at 
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
at 
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at 
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
at sbt.Command$.process(Command.scala:93)
at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
at sbt.State$$anon$1.process(State.scala:184)
at 

Re: Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Matthias Niehoff
The StateSpec class and the mapWithState method are only available in Spark 1.6.x.
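
A minimal build.sbt sketch of the change that implies (the version below is
just one example of a 1.6.x release):

libraryDependencies += "org.apache.spark" %% "spark-core"      % "1.6.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"

With those in place, StateSpec and State are available from the
org.apache.spark.streaming package.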

2016-04-13 11:34 GMT+02:00 Marco Mistroni :

> hi all
>  i am trying to replicate the Streaming Wordcount example described here
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
>
> in my build.sbt I have the following dependencies:
>
> .
> libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.5.2" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming"   % "1.5.2"
> % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.5.2"  %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
> "1.3.0"  % "provided"
> ...
> But compilation fails, mentioning that the classes StateSpec and State are
> not found.
>
> Could someone please point me to the right packages to reference if I want
> to use StateSpec?
>
> kind regards
>  marco
>
>
>


-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
www.more4fi.de

Sitz der Gesellschaft: Solingen | HRB 25917| Amtsgericht Wuppertal
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche
und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige
Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie
bitte sofort den Absender und löschen Sie diese E-Mail und evtl.
beigefügter Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen
evtl. beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist
nicht gestattet


Problem with History Server

2016-04-13 Thread alvarobrandon
Hello:

I'm using the history server to keep track of the applications I run in my
cluster. I'm using Spark with YARN.
When I run an application it finishes correctly, and even YARN says that it
finished. This is the result of the YARN Resource Manager API:

{u'app': [{u'runningContainers': -1, u'allocatedVCores': -1, u'clusterId':
1460540049690, u'amContainerLogs':
u'http://parapide-2.rennes.grid5000.fr:8042/node/containerlogs/container_1460540049690_0001_01_01/abrandon',
u'id': u'*application_1460540049690_0001*', u'preemptedResourceMB': 0,
u'finishedTime': 1460550170085, u'numAMContainerPreempted': 0, u'user':
u'abrandon', u'preemptedResourceVCores': 0, u'startedTime': 1460548211207,
u'elapsedTime': 1958878, u'state': u'FINISHED',
u'numNonAMContainerPreempted': 0, u'progress': 100.0, u'trackingUI':
u'History', u'trackingUrl':
u'http://paranoia-1.rennes.grid5000.fr:8088/proxy/application_1460540049690_0001/A',
u'allocatedMB': -1, u'amHostHttpAddress':
u'parapide-2.rennes.grid5000.fr:8042', u'memorySeconds': 37936274,
u'applicationTags': u'', u'name': u'KMeans', u'queue': u'default',
u'vcoreSeconds': 13651, u'applicationType': u'SPARK', u'diagnostics': u'',
u'finalStatus': u'*SUCCEEDED*'}

However, when I query the Spark UI (screenshot not preserved in the archive),
you can see that for Job ID 2 no tasks have run and I can't get information
about them. Is this some kind of bug?

Thanks for your help as always



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-History-Server-tp26777.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Performance Tuning | Shark 0.9.1 with Spark 1.0.2

2016-04-13 Thread N, Manjunath 3. (EXT - IN/Noida)
Hi,

I am trying to improve query performance. I am not sure how to go about this in
Shark/Spark. Here is my problem.

When I execute a query, it runs two jobs; here is a summary. First FileSink's
runJob is executed, and then mapPartitionsWithIndex.

1.  FileSink always uses only one job; is there a way to parallelize this?
2.  mapPartitionsWithIndex takes 1.2 minutes; is there a way to bring this
time down?

Time      Shuffle Read    Shuffle Write    No of Jobs    Summary
1.3 min   217.4 MB                         1             runJob at FileSinkOperator.scala:157
1.2 min                   219.8 MB         292           mapPartitionsWithIndex at Operator.scala:312

Thanks
Manjunath



Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Gourav Sengupta
Hi,

I stopped working with s3n a long time ago. If you are working with parquet
and writing files, s3a is the only way to avoid failures. Otherwise, why not
just use s3://?
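
A minimal sketch of what switching to s3a can look like from the Spark side,
assuming the hadoop-aws and AWS SDK jars are on the classpath and with a
placeholder bucket name:

// Credentials normally come from the environment or an instance profile; the
// explicit settings here only show where they would go.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
val df = sqlContext.read.parquet("s3a://my-bucket/path/to/parquet/")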


Regards,
Gourav

On Wed, Apr 13, 2016 at 12:17 PM, Steve Loughran 
wrote:

>
> On 12 Apr 2016, at 22:05, Martin Eden  wrote:
>
> Hi everyone,
>
> Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver I
> manage to process approx 1TB of strings inside gzipped parquet in about 50
> mins on a 20 node cluster (8 cores, 60Gb ram). That's about 17MBytes/sec
> per node.
>
> This seems sub optimal.
>
> The processing is very basic, simple fields extraction from the strings
> and a groupBy.
>
> Watching Aaron's talk from the Spark EU Summit:
> https://youtu.be/GzG9RTRTFck?t=863
>
> it seems I am hitting the same issues with suboptimal S3 throughput he
> mentions there.
>
> I tried different numbers of files for the input data set (more smaller
> files vs fewer larger files) combined with various settings
> for fs.s3n.block.size, thinking that might help if each mapper streams
> larger chunks. It didn't! It actually seems that many small files give
> better performance than fewer larger ones (of course with an oversubscribed
>
> Similarly to what Aaron is mentioning with oversubscribed tasks/threads we
> also become CPU bound (reach 100% cpu utilisation).
>
>
> Has anyone seen a similar behaviour? How can we optimise this?
>
> Are the improvements mentioned in Aaron's talk now part of S3n or S3a
> driver or are they just available under DataBricksCloud? How can we benefit
> from those improvements?
>
> Thanks,
> Martin
>
> P.S. Have not tried S3a.
>
>
>
> s3n is getting, deliberately, absolutely no maintenance except for
> critical bug fixes. Every time something minor was done there (like bump up
> a jets3t version), something else breaks.
>
> S3a is getting the work, and is the one you have to be using.
>
> It's also where the performance analysis and enhancement is going on
>
> https://issues.apache.org/jira/browse/HADOOP-11694
>
> a spark-related patch that's gone in is:
> https://issues.apache.org/jira/browse/HADOOP-12810 ; significant
> optimisation of time to enum lists of files in a partition; that could be
> the root cause of a mentioned problem. Otherwise, a recent change is
> HADOOP-12444, "Lazy-seek", where you can call seek() as much as you like,
> but there's no attempt to open or close+reopen an HTTPS connection until the
> next read. That really boosts performance on code doing
> iostream.readFully(position, ... ), because the way readFully() was
> implemented it was doing a
>
> long pos  =getPos()
> seek(newpos)
> read(...)
> seek(oldpos)
>
> Two seeks, even if they were reading in sequential blocks in the files...
> this was hurting ORC reads where there's a lot of absolute reading and
> skipping of sections of the file.
>
> There's also some ongoing work at the spark layer to do more in parallel:
> https://github.com/apache/spark/pull/11242
>
> If you look at a lot of the work, it's coming from people (there:
> Netflix) trying to understand why things are slow, and fixing them.
> Anything other people can do to help here is welcome.
>
> A key first step, even if you don't want to contribute any code back to
> the OSS projects is : test other people's work.
>
> Anyone who can test the SPARK-9926 patch against their datasets should
> apply that PR and test to see if it shows (a) a speedup, (b) no change or (c) a
> dramatic slowdown in performance.
>
> Similarly, the s3a performance work going into hadoop 2.8+ is something you
> can test today, both by grabbing the branch and building spark against it,
> or even going one step up and testing those patches before they get in.
>
> It's during the development time where any performance, functionality,
> reliability problems can be fixed overnight —get testing with your code, to
> know things will work when releases ship. This is particularly important
> against object stores, because none of the Jenkins builds of the projects
> test against S3. If you really care about S3 performance, and want to make
> sure ongoing development is improving your life: you need to get involved.
>
> -Steve
>
>
>
>
>


Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Steve Loughran

On 12 Apr 2016, at 22:05, Martin Eden 
> wrote:

Hi everyone,

Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver I manage 
to process approx 1TB of strings inside gzipped parquet in about 50 mins on a 
20 node cluster (8 cores, 60Gb ram). That's about 17MBytes/sec per node.

This seems sub optimal.

The processing is very basic, simple fields extraction from the strings and a 
groupBy.

Watching Aaron's talk from the Spark EU Summit:
https://youtu.be/GzG9RTRTFck?t=863

it seems I am hitting the same issues with suboptimal S3 throughput he mentions 
there.

I tried different numbers of files for the input data set (more smaller files
vs fewer larger files) combined with various settings for fs.s3n.block.size,
thinking that might help if each mapper streams larger chunks. It didn't! It
actually seems that many small files give better performance than fewer larger
ones (of course with an oversubscribed number of tasks/threads).

Similarly to what Aaron is mentioning with oversubscribed tasks/threads we also 
become CPU bound (reach 100% cpu utilisation).


Has anyone seen a similar behaviour? How can we optimise this?

Are the improvements mentioned in Aaron's talk now part of S3n or S3a driver or 
are they just available under DataBricksCloud? How can we benefit from those 
improvements?

Thanks,
Martin

P.S. Have not tried S3a.


s3n is getting, deliberately, absolutely no maintenance except for critical bug 
fixes. Every time something minor was done there (like bump up a jets3t 
version), something else breaks.

S3a is getting the work, and is the one you have to be using.

It's also where the performance analysis and enhancement is going on

https://issues.apache.org/jira/browse/HADOOP-11694

a spark-related patch that's gone in is: 
https://issues.apache.org/jira/browse/HADOOP-12810 ; significant optimisation 
of time to enum lists of files in a partition; that could be the root cause of 
a mentioned problem. Otherwise, a recent change is HADOOP-12444, "Lazy-seek", 
where you can call seek() as much as you like, but there's no attempt to open or
close+reopen an HTTPS connection until the next read. That really boosts 
performance on code doing iostream.readFully(position, ... ), because the way 
readFully() was implemented it was doing a

long pos  =getPos()
seek(newpos)
read(...)
seek(oldpos)

Two seeks, even if they were reading in sequential blocks in the files... this 
was hurting ORC reads where there's a lot of absolute reading and skipping of 
sections of the file.

There's also some ongoing work at the spark layer to do more in parallel: 
https://github.com/apache/spark/pull/11242

If you look at a lot of the work, it's coming from people (there: Netflix)
trying to understand why things are slow, and fixing them. Anything other
people can do to help here is welcome.

A key first step, even if you don't want to contribute any code back to the OSS 
projects is : test other people's work.

Anyone who can test the SPARK-9926 patch against their datasets should apply
that PR and test to see if it shows (a) a speedup, (b) no change or (c) a dramatic
slowdown in performance.

Similarly, the s3a performance work going into hadoop 2.8+ is something you can
test today, both by grabbing the branch and building spark against it, or even
going one step up and testing those patches before they get in.

It's during the development time where any performance, functionality, 
reliability problems can be fixed overnight —get testing with your code, to 
know things will work when releases ship. This is particularly important 
against object stores, because none of the Jenkins builds of the projects test 
against S3. If you really care about S3 performance, and want to make sure 
ongoing development is improving your life: you need to get involved.

-Steve






RE: ML Random Forest Classifier

2016-04-13 Thread Ashic Mahtab
It looks like all of that is building up to Spark 2.0 (for random forests /
GBTs / etc.). Ah well... thanks for your help. It was interesting digging into
the depths.

Date: Wed, 13 Apr 2016 09:48:32 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

Hi Ashic,
Unfortunately I don't know how to work around that - I suggested this line as 
it looked promising (I had considered it once before deciding to use a 
different algorithm) but I never actually tried it. 
Regards,
James
On 13 April 2016 at 02:29, Ashic Mahtab  wrote:



It looks like the issue is around impurity stats. After converting an rf model 
to old, and back to new (without disk storage or anything), and specifying the 
same num of features, same categorical features map, etc., 
DecisionTreeClassifier::predictRaw throws a null pointer exception here:
  override protected def predictRaw(features: Vector): Vector = {
    Vectors.dense(rootNode.predictImpl(features).impurityStats.stats.clone())
  }
It appears impurityStats is always null (even though impurity does have a 
value). Any known workarounds? It's looking like I'll have to revert to using 
mllib instead :(
-Ashic.
From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 02:20:53 +0100




I managed to get to the map using MetadataUtils (it's a private ml package). 
There are still some issues, around feature names, etc. Trying to pin them down.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 00:50:31 +0100




Hi James,
Following on from the previous email, is there a way to get the
categoricalFeatures of a Spark ML Random Forest? Essentially something I can
pass to
RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures,
numClasses, numFeatures)
I could construct it by hand, but I was hoping for a more automated way of
getting the map. Since the trained model already knows about the value, perhaps
it's possible to grab it for storage?
Thanks,
Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Mon, 11 Apr 2016 23:21:53 +0100




Thanks, James. That looks promising. 

Date: Mon, 11 Apr 2016 10:41:07 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

To add a bit more detail, perhaps something like this might work:

package org.apache.spark.ml

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.mllib.tree.model.{ RandomForestModel => OldRandomForestModel }
import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestModelConverter {

  def fromOld(oldModel: OldRandomForestModel, parent: RandomForestClassifier = null,
      categoricalFeatures: Map[Int, Int], numClasses: Int,
      numFeatures: Int = -1): RandomForestClassificationModel = {
    RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures,
      numClasses, numFeatures)
  }

  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel = {
    newModel.toOld
  }
}
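
A hypothetical usage sketch for such a converter: train and persist the model
with the old mllib API (which supports save/load in 1.6), then convert it to
the new ml model after loading. The path, categoricalFeatures and numClasses
below are placeholders.

import org.apache.spark.mllib.tree.model.RandomForestModel

val oldModel = RandomForestModel.load(sc, "hdfs:///models/rf-old")
val newModel = RandomForestModelConverter.fromOld(
  oldModel, parent = null, categoricalFeatures = Map.empty[Int, Int], numClasses = 2)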

Regards,
James 
On 11 April 2016 at 10:36, James Hammerton  wrote:
There are methods for converting the dataframe based random forest models to 
the old RDD based models and vice versa. Perhaps using these will help given 
that the old models can be saved and loaded?
In order to use them however you will need to write code in the 
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James








On 11 April 2016 at 10:29, Ashic Mahtab  wrote:



Hello,
I'm trying to save a pipeline with a random forest classifier. If I try to save
the pipeline, it complains that the classifier is not Writable, and indeed the
classifier itself doesn't have a write function. There's a pull request that's
been merged that enables this for Spark 2.0 (any dates around when that'll
release?). I am, however, using the Spark Cassandra Connector, which doesn't
seem to be able to create a CqlContext with Spark 2.0 snapshot builds. Seeing
that MLlib's random forest classifier supports storing and loading models, is
there a way to create a Spark ML pipeline in Spark 1.6 with a random forest
classifier that'll allow me to store and load the model? The model takes a
significant amount of time to train, and I really don't want to have to train
it every time my application launches.
Thanks,
Ashic.




 

Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Marco Mistroni
hi all
 i am trying to replicate the Streaming Wordcount example described here

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

in my build.sbt I have the following dependencies:

.
libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.5.2" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming"   % "1.5.2"
% "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.5.2"  %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
"1.3.0"  % "provided"
...
But compilation fails, mentioning that the classes StateSpec and State are
not found.

Could someone please point me to the right packages to reference if I want to
use StateSpec?

kind regards
 marco


Re: spark.driver.extraClassPath and export SPARK_CLASSPATH

2016-04-13 Thread AlexModestov
I wrote spark.driver.extraClassPath '/dir' in "spark-defaults.conf",
and I also tried:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
/.../sparkling-water-1.6.1/bin/pysparkling \
--conf spark.driver.extraClassPath='/.../sqljdbc41.jar'
Nothing works.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraClassPath-and-export-SPARK-CLASSPATH-tp26740p26774.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
Hi Ashic,

Unfortunately I don't know how to work around that - I suggested this line
as it looked promising (I had considered it once before deciding to use a
different algorithm) but I never actually tried it.

Regards,

James

On 13 April 2016 at 02:29, Ashic Mahtab  wrote:

> It looks like the issue is around impurity stats. After converting an rf
> model to old, and back to new (without disk storage or anything), and
> specifying the same num of features, same categorical features map, etc.,
> DecisionTreeClassifier::predictRaw throws a null pointer exception here:
>
>  override protected def predictRaw(features: Vector): Vector = {
> Vectors.dense(rootNode.predictImpl(features).*impurityStats.*
> stats.clone())
>   }
>
> It appears impurityStats is always null (even though impurity does have a
> value). Any known workarounds? It's looking like I'll have to revert to
> using mllib instead :(
>
> -Ashic.
>
> --
> From: as...@live.com
> To: ja...@gluru.co
> CC: user@spark.apache.org
> Subject: RE: ML Random Forest Classifier
> Date: Wed, 13 Apr 2016 02:20:53 +0100
>
>
> I managed to get to the map using MetadataUtils (it's a private ml
> package). There are still some issues, around feature names, etc. Trying to
> pin them down.
>
> --
> From: as...@live.com
> To: ja...@gluru.co
> CC: user@spark.apache.org
> Subject: RE: ML Random Forest Classifier
> Date: Wed, 13 Apr 2016 00:50:31 +0100
>
> Hi James,
> Following on from the previous email, is there a way to get the
> categoricalFeatures of a Spark ML Random Forest? Essentially something I
> can pass to
>
> RandomForestClassificationModel.fromOld(oldModel, parent,
> *categoricalFeatures*, numClasses, numFeatures)
>
> I could construct it by hand, but I was hoping for a more automated way of
> getting the map. Since the trained model already knows about the value,
> perhaps it's possible to grab it for storage?
>
> Thanks,
> Ashic.
>
> --
> From: as...@live.com
> To: ja...@gluru.co
> CC: user@spark.apache.org
> Subject: RE: ML Random Forest Classifier
> Date: Mon, 11 Apr 2016 23:21:53 +0100
>
> Thanks, James. That looks promising.
>
> --
> Date: Mon, 11 Apr 2016 10:41:07 +0100
> Subject: Re: ML Random Forest Classifier
> From: ja...@gluru.co
> To: as...@live.com
> CC: user@spark.apache.org
>
> To add a bit more detail perhaps something like this might work:
>
> package org.apache.spark.ml
>
>
> import org.apache.spark.ml.classification.RandomForestClassificationModel
> import org.apache.spark.ml.classification.DecisionTreeClassificationModel
> import org.apache.spark.ml.classification.LogisticRegressionModel
> import org.apache.spark.mllib.tree.model.{ RandomForestModel =>
> OldRandomForestModel }
> import org.apache.spark.ml.classification.RandomForestClassifier
>
>
> object RandomForestModelConverter {
>
>
>   def fromOld(oldModel: OldRandomForestModel, parent:
> RandomForestClassifier = null,
> categoricalFeatures: Map[Int, Int], numClasses: Int, numFeatures: Int
> = -1): RandomForestClassificationModel = {
> RandomForestClassificationModel.fromOld(oldModel, parent,
> categoricalFeatures, numClasses, numFeatures)
>   }
>
>
>   def toOld(newModel: RandomForestClassificationModel):
> OldRandomForestModel = {
> newModel.toOld
>   }
> }
>
>
> Regards,
>
> James
>
>
> On 11 April 2016 at 10:36, James Hammerton  wrote:
>
> There are methods for converting the dataframe based random forest models
> to the old RDD based models and vice versa. Perhaps using these will help
> given that the old models can be saved and loaded?
>
> In order to use them however you will need to write code in the
> org.apache.spark.ml package.
>
> I've not actually tried doing this myself but it looks as if it might work.
>
> Regards,
>
> James
>
> On 11 April 2016 at 10:29, Ashic Mahtab  wrote:
>
> Hello,
> I'm trying to save a pipeline with a random forest classifier. If I try to
> save the pipeline, it complains that the classifier is not Writable, and
> indeed the classifier itself doesn't have a write function. There's a pull
> request that's been merged that enables this for Spark 2.0 (any dates
> around when that'll release?). I am, however, using the Spark Cassandra
> Connector which doesn't seem to be able to create a CqlContext with spark
> 2.0 snapshot builds. Seeing that ML Lib's random forest classifier supports
> storing and loading models, is there a way to create a Spark ML pipeline in
> Spark 1.6 with a random forest classifier that'll allow me to store and
> load the model? The model takes a significant amount of time to train, and I
> really don't want to have to train it every time my application launches.
>
> Thanks,
> Ashic.
>
>
>
>


RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread ashesh_28
Are you running from Eclipse?
If so, add the *HADOOP_CONF_DIR* path to the classpath.

And then you can access your HDFS directory as below:

import org.apache.spark.{SparkConf, SparkContext}

object sparkExample {
  def main(args: Array[String]) {
    // Path resolves against the default filesystem configured under HADOOP_CONF_DIR
    val logname = "///user/hduser/input/sample.txt"
    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logname, 2)
    val numAs = logData.filter(line => line.contains("hadoop")).count()
    val numBs = logData.filter(line => line.contains("spark")).count()
    println("Lines with Hadoop: %s, Lines with Spark: %s".format(numAs, numBs))
  }
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768p26771.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread Amit Hora
Finally I tried setting the configuration manually using
sc.hadoopConfiguration.set for
dfs.nameservices,
dfs.ha.namenodes.hdpha, and
dfs.namenode.rpc-address.hdpha.n1.

And it worked; I don't know why it was not reading these settings from the files
under HADOOP_CONF_DIR.
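
For reference, a sketch of that manual setup; the nameservice, namenode ids and
host:port values below are placeholders for the real HA configuration:

sc.hadoopConfiguration.set("dfs.nameservices", "hdpha")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hdpha", "n1,n2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n1", "hdp231:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n2", "ambarimaster:8020")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hdpha",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
println(distFile.count())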

-Original Message-
From: "Amit Hora" 
Sent: ‎4/‎13/‎2016 11:41 AM
To: "Jörn Franke" 
Cc: "user@spark.apache.org" 
Subject: RE: Unable to Access files in Hadoop HA enabled from using Spark

There are DNS entries for both of my namenodes.
Ambarimaster is the standby and it resolves to its IP perfectly.
Hdp231 is active and it also resolves to its IP.
Hdpha is my Hadoop HA cluster name,
and hdfs-site.xml has entries for this configuration.


From: Jörn Franke
Sent: ‎4/‎13/‎2016 11:37 AM
To: Amit Singh Hora
Cc: user@spark.apache.org
Subject: Re: Unable to Access files in Hadoop HA enabled from using Spark


Is the host in /etc/hosts ?

> On 13 Apr 2016, at 07:28, Amit Singh Hora  wrote:
> 
> I am trying to access directory in Hadoop from my Spark code on local
> machine.Hadoop is HA enabled .
> 
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val sc=new SparkContext(conf)
> val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
> println(distFile.count())
> but getting error
> 
> java.net.UnknownHostException: hdpha
> As hdpha not resolves to a particular machine it is the name I have chosen
> for my HA Hadoop.I have already copied all hadoop configuration on my local
> machine and have set the env. variable HADOOP_CONF_DIR But still no success.
> 
> Any suggestion will be of a great help
> 
> Note:- Hadoop HA is working properly as i have tried uploading file to
> hadoop and it works
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread Amit Hora
There are DNS entries for both of my namenodes.
Ambarimaster is the standby and it resolves to its IP perfectly.
Hdp231 is active and it also resolves to its IP.
Hdpha is my Hadoop HA cluster name,
and hdfs-site.xml has entries for this configuration.

-Original Message-
From: "Jörn Franke" 
Sent: ‎4/‎13/‎2016 11:37 AM
To: "Amit Singh Hora" 
Cc: "user@spark.apache.org" 
Subject: Re: Unable to Access files in Hadoop HA enabled from using Spark

Is the host in /etc/hosts ?

> On 13 Apr 2016, at 07:28, Amit Singh Hora  wrote:
> 
> I am trying to access directory in Hadoop from my Spark code on local
> machine.Hadoop is HA enabled .
> 
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val sc=new SparkContext(conf)
> val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
> println(distFile.count())
> but getting error
> 
> java.net.UnknownHostException: hdpha
> As hdpha not resolves to a particular machine it is the name I have chosen
> for my HA Hadoop.I have already copied all hadoop configuration on my local
> machine and have set the env. variable HADOOP_CONF_DIR But still no success.
> 
> Any suggestion will be of a great help
> 
> Note:- Hadoop HA is working properly as i have tried uploading file to
> hadoop and it works
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


Re: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread Jörn Franke
Is the host in /etc/hosts ?

> On 13 Apr 2016, at 07:28, Amit Singh Hora  wrote:
> 
> I am trying to access directory in Hadoop from my Spark code on local
> machine.Hadoop is HA enabled .
> 
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val sc=new SparkContext(conf)
> val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
> println(distFile.count())
> but getting error
> 
> java.net.UnknownHostException: hdpha
> As hdpha not resolves to a particular machine it is the name I have chosen
> for my HA Hadoop.I have already copied all hadoop configuration on my local
> machine and have set the env. variable HADOOP_CONF_DIR But still no success.
> 
> Any suggestion will be of a great help
> 
> Note:- Hadoop HA is working properly as i have tried uploading file to
> hadoop and it works
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org