Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-17 Thread akhandeshi
A little more progress... I added a few environment variables; now I get the following error message: InvocationTargetException: Can't get Master Kerberos principal for use as renewer -> [Help 1]
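For reference, "Can't get Master Kerberos principal for use as renewer" usually means the client JVM cannot see the cluster's yarn-site.xml/hdfs-site.xml, so the ResourceManager/NameNode principals are unknown. A minimal sketch, assuming the client configs live under /etc/hadoop/conf (paths and principal values below are placeholders, not taken from this thread):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Make the cluster's client configs visible to the driver JVM.
    val hadoopConf = new Configuration()
    hadoopConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
    hadoopConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))
    hadoopConf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"))
    // These are the properties the "renewer" lookup needs; values are illustrative only.
    // hadoopConf.set("yarn.resourcemanager.principal", "yarn/_HOST@EXAMPLE.COM")
    // hadoopConf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@EXAMPLE.COM")

Alternatively, exporting HADOOP_CONF_DIR to point at the directory holding those files before launching often has the same effect.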

Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread akhandeshi
Rest of the stack trace: [WARNING] java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at

Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread akhandeshi
I am trying to set up my IDE for a Scala Spark application. I want to access HDFS files on a remote Hadoop server that has Kerberos enabled. My understanding is that I should be able to do that from Spark. Here is my code so far: val sparkConf = new SparkConf().setAppName(appName).setMaster(master);
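A minimal sketch of one common way to do this (the principal, keytab path, HDFS URI, app name and master below are placeholders, not taken from this thread): log in through UserGroupInformation before creating the SparkContext, so the HDFS client has Kerberos credentials.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.{SparkConf, SparkContext}

    // Tell the Hadoop client to use Kerberos and log in from a keytab.
    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")

    // Then build the Spark context as in the snippet above.
    val sparkConf = new SparkConf().setAppName("kerberos-test").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("hdfs://namenode.example.com:8020/some/path")
    println(lines.count())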

This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread akhandeshi
I seem to see this for many of my posts... does anyone have a solution?

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread akhandeshi
I couldn't get this working... I have JAVA_HOME set. I have defined SPARK_HOME: Sys.setenv(SPARK_HOME="c:\\DevTools\\spark-1.5.1") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library("SparkR", lib.loc="c:\\DevTools\\spark-1.5.1\\lib") library(SparkR)

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread akhandeshi
It seems it is failing at path <- tempfile(pattern = "backend_port"). I do not see the backend_port directory created...

Loading status

2015-02-02 Thread akhandeshi
I am not sure what the LOADING status means (it is followed by Running). In the application UI, under Executor Summary (columns: ExecutorID, Worker, Cores, Memory, State, Logs), I see: 1 worker-20150202144112-hadoop-w-1.c.fi-mdd-poc.internal-3887416 83971 LOADING stdout stderr 0

ExternalSorter - spilling in-memory map

2015-01-13 Thread akhandeshi
I am using Spark 1.2, and I see a lot of messages like: ExternalSorter: Thread 66 spilling in-memory map of 5.0 MB to disk (13160 times so far). I seem to have a lot of memory; the master UI shows: URL: spark://hadoop-m:7077, Workers: 4, Cores: 64 Total (64 Used), Memory: 328.0 GB Total (327.0 GB Used)
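Frequent small spills usually point at the shuffle memory fraction rather than at total cluster memory: in Spark 1.x the in-memory map used by ExternalSorter is bounded by spark.shuffle.memoryFraction of the executor heap (0.2 by default), not by the 328 GB shown in the master UI. A hedged sketch with illustrative values only:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "16g")          // per-executor heap; illustrative
      .set("spark.shuffle.memoryFraction", "0.4")   // default is 0.2 in Spark 1.2
      .set("spark.storage.memoryFraction", "0.4")   // leave room if RDDs are also cached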

Re: Any ideas why a few tasks would stall

2014-12-04 Thread akhandeshi
This did not work for me, that is, rdd.coalesce(200, forceShuffle). Does anyone have ideas on how to distribute data evenly and co-locate the partitions of interest?
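One approach (not from this thread; names and types below are placeholders): for key/value RDDs, a RangePartitioner samples the keys to build evenly sized partitions and keeps neighbouring keys together, which covers both the "distribute evenly" and the "co-locate partitions of interest" parts of the question.

    import org.apache.spark.RangePartitioner
    import org.apache.spark.rdd.RDD

    // pairs: RDD[(String, Int)] stands in for the data being redistributed.
    def rebalance(pairs: RDD[(String, Int)]): RDD[(String, Int)] = {
      val rp = new RangePartitioner(200, pairs)   // samples keys to build evenly sized ranges
      pairs.partitionBy(rp).persist()             // adjacent keys land in the same partition
    }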

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
Hmm... 33.6 GB is the sum of the memory used by the two RDDs that are cached. You're right: when I put serialized RDDs in the cache, the memory footprint for these RDDs becomes a lot smaller. The serialized memory footprint is shown below: RDD Name | Storage Level | Cached Partitions | Fraction Cached
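For reference, a short sketch of the pattern being described ("rdd" is a placeholder; Kryo is optional): caching with a serialized storage level is what produces the smaller footprint shown above.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // optional; shrinks further
    // ... build the SparkContext with conf, then:
    val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // store serialized bytes in memory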

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
I think the memory calculation is correct; what I didn't account for is the memory used. I am still puzzled as to how I can successfully process the RDD in Spark.

Help understanding - Not enough space to cache rdd

2014-12-02 Thread akhandeshi
I am running in local mode on a Google n1-highmem-16 (16 vCPU, 104 GB memory) machine. I have allocated SPARK_DRIVER_MEMORY=95g. I see Memory: 33.6 GB Used (73.7 GB Total) for the executor. In the log output below, I see that 33.6 GB of blocks are used by the 2 RDDs that I have cached.
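For context, a hedged sketch (Spark 1.x defaults; the fraction value is illustrative): in local mode everything runs inside the driver JVM, so the cache capacity is roughly driver heap * spark.storage.memoryFraction * spark.storage.safetyFraction (0.6 and 0.9 by default), not the full 95 GB. The heap itself has to be set before the JVM starts (e.g. via SPARK_DRIVER_MEMORY as above); the fractions can be tuned in SparkConf.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("local[16]")                        // matches the 16 vCPU machine
      .set("spark.storage.memoryFraction", "0.7")    // default is 0.6; raises cache headroom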

packaging from source gives protobuf compatibility issues.

2014-12-01 Thread akhandeshi
scala> textFile.count() java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CompleteRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; I tried ./make-distribution.sh -Dhadoop.version=2.5.0 and

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-11-17 Thread akhandeshi
Regarding "only option is to split your problem further by increasing parallelism": my understanding is that this means increasing the number of partitions; is that right? That didn't seem to help, because it seems the partitions are not uniformly sized. My observation is that when I increase the number of partitions, it
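One technique that sometimes helps here (not from this thread; the key/value types and the reduce function are placeholders): if a few hot keys keep one partition huge no matter how many partitions exist, "salting" the key spreads those records before combining.

    import scala.util.Random

    // pairRdd: RDD[(String, Long)] stands in for the skewed key/value data.
    val SALT = 16
    val partial = pairRdd
      .map { case (k, v) => ((k, Random.nextInt(SALT)), v) }   // spread each hot key over SALT buckets
      .reduceByKey(_ + _)                                      // combine within each salted key
    val result = partial
      .map { case ((k, _), v) => (k, v) }                      // drop the salt
      .reduceByKey(_ + _)                                      // final combine per real key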

SparkSubmitDriverBootstrapper and JVM parameters

2014-11-06 Thread akhandeshi
When I execute /usr/local/spark-1.1.0/bin/spark-submit local[32] for my app, I see two processes get spun off. One is /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper and

Re: CANNOT FIND ADDRESS

2014-11-03 Thread akhandeshi
No luck :( Still observing the same behavior!

OOM - Requested array size exceeds VM limit

2014-11-03 Thread akhandeshi
I am running in local (client) mode. My VM has 16 CPUs / 108 GB RAM. My configuration is as follows: spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+UseCompressedOops -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC -XX:MaxPermSize=1024m spark.daemon.memory=20g
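For reference (a sketch under assumptions, not a fix confirmed in this thread): "Requested array size exceeds VM limit" means a single array, usually a byte[] from one oversized partition or serialization buffer, exceeded the JVM's roughly 2 GB array cap. Splitting the data into more, smaller partitions keeps each buffer under that cap; the 512 below is illustrative and "rdd" is a placeholder.

    // "rdd" stands in for whichever RDD produces the oversized buffer.
    val finer = rdd.repartition(512)        // more, smaller partitions
    // or, for key/value data that is about to be shuffled:
    // rdd.reduceByKey(_ + _, 512)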

Re: CANNOT FIND ADDRESS

2014-10-31 Thread akhandeshi
Thanks for the pointers! I did try them, but it didn't seem to help... In my latest attempt I am running spark-submit with local, but I see the same message in the Spark app UI (4040): localhost CANNOT FIND ADDRESS. In the logs, I see a lot of spilling of the in-memory map to disk; I don't understand why that is the case.

CANNOT FIND ADDRESS

2014-10-29 Thread akhandeshi
The Spark application UI shows CANNOT FIND ADDRESS for one of the executors. Aggregated Metrics by Executor columns: Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input | Shuffle Read | Shuffle

Spark Performance

2014-10-29 Thread akhandeshi
I am relatively new to Spark. I am using the Spark Java API to process data. I am having trouble processing a data set that I don't think is significantly large: it joins datasets that are around 3-4 GB each (around 12 GB in total). The workflow is: x=RDD1.KeyBy(x).partitionBy(new
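The original code is in the Java API; below is a Scala sketch of the same co-partitioned join pattern, with placeholder names (rdd1, rdd2, keyOf) and an illustrative partition count. When both sides share the same partitioner, the join does not need to re-shuffle either of them.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark < 1.3)

    // rdd1, rdd2 and keyOf stand in for the real datasets and key function.
    val part   = new HashPartitioner(64)                             // roughly total cores; illustrative
    val left   = rdd1.keyBy(r => keyOf(r)).partitionBy(part).cache() // shuffle once, then reuse
    val right  = rdd2.keyBy(r => keyOf(r)).partitionBy(part).cache()
    val joined = left.join(right)                                    // co-partitioned: no extra shuffle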

Re: CANNOT FIND ADDRESS

2014-10-29 Thread akhandeshi
Thanks... hmm, it seems to be a timeout issue, perhaps? Not sure what is causing it or how to debug it. I see the following error message... 14/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9 akka.pattern.AskTimeoutException: Timed out at
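If it is the Akka ask timing out (often a symptom of long GC pauses on the driver or executors), one stop-gap is to raise the relevant Spark 1.x timeouts while the memory pressure is investigated. A hedged sketch; the values are illustrative, not recommendations from this thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.akka.askTimeout", "120")                    // seconds; default is 30
      .set("spark.core.connection.ack.wait.timeout", "120")   // seconds; default is 60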

Spark-submit job Killed

2014-10-28 Thread akhandeshi
I recently started seeing a new problem where spark-submit is terminated with a "Killed" message but no error message indicating what happened. I have enabled logging in the Spark configuration. Has anyone seen this, or does anyone know how to troubleshoot it?

Re: Spark-submit job Killed

2014-10-28 Thread akhandeshi
I did have it as rdd.saveAsTextFile(...); and now I have it as: Log.info("RDD Counts" + rdd.persist(StorageLevel.MEMORY_AND_DISK_SER()).count());