Clarification on Spark-Hadoop Configuration

2015-10-01 Thread Vinoth Sankar
Hi, I'm new to Spark. For my application I need to overwrite Hadoop configurations (I can't change the configurations in Hadoop itself as it might affect my regular HDFS), so that NameNode IPs get automatically resolved. What are the ways to do so? I tried giving "spark.hadoop.dfs.ha.namenodes.nn",

automatic start of streaming job on failure on YARN

2015-10-01 Thread Jeetendra Gangele
We have a streaming application running on YARN and we would like to ensure that it is up and running 24/7. Is there a way to tell YARN to automatically restart a specific application on failure? There is the property yarn.resourcemanager.am.max-attempts, which defaults to 2; setting it to a bigger value

How to connect HadoopHA from spark

2015-10-01 Thread Vinoth Sankar
Hi, how do I connect to Hadoop HA from Spark? I tried overwriting Hadoop configurations from SparkConf, but I'm still getting UnknownHostException with the following trace: java.lang.IllegalArgumentException: java.net.UnknownHostException: ABC at

Re: How to connect HadoopHA from spark

2015-10-01 Thread Adam McElwee
Do you have all of the required HDFS HA config options in your override? I think these are the minimum required for HA: dfs.nameservices, dfs.ha.namenodes.{nameservice ID}, dfs.namenode.rpc-address.{nameservice ID}.{name node ID}. On Thu, Oct 1, 2015 at 7:22 AM, Vinoth Sankar
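
For reference, a minimal Scala sketch of passing those properties through SparkConf with the spark.hadoop. prefix (Spark copies any spark.hadoop.* entry into the Hadoop Configuration it uses for HDFS). The nameservice ID, NameNode hosts and the failover proxy provider line are placeholders/assumptions, not taken from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "nameservice1" and the NameNode hosts below are placeholders.
val conf = new SparkConf()
  .setAppName("hdfs-ha-example")
  .set("spark.hadoop.dfs.nameservices", "nameservice1")
  .set("spark.hadoop.dfs.ha.namenodes.nameservice1", "nn1,nn2")
  .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn1", "namenode1.example.com:8020")
  .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn2", "namenode2.example.com:8020")
  .set("spark.hadoop.dfs.client.failover.proxy.provider.nameservice1",
       "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val sc = new SparkContext(conf)
// Reads resolve against the HA nameservice rather than a single NameNode host.
val lines = sc.textFile("hdfs://nameservice1/path/to/data")
```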

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-10-01 Thread Christophe Schmitz
I am using standalone deployment with Spark 1.4.1. When I submit the job, I get no error at the submission terminal. Then I check the web UI, and I can find the driver section which has my driver submission, with this error: java.io.FileNotFoundException ... which points to the full path of my jar as

Re: Clarification on Spark-Hadoop Configuration

2015-10-01 Thread Sabarish Sasidharan
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh. Regards, Sab On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote: > Hi, > > I'm new to Spark. For my application I need to overwrite Hadoop > configurations (Can't change Configurations in Hadoop as it might affect

Re: Kafka Direct Stream

2015-10-01 Thread Adrian Tanase
On top of that you could make the topic part of the key (e.g. keyBy in .transform or manually emitting a tuple) and use one of the .xxxByKey operators for the processing. If you have a stable, domain specific list of topics (e.g. 3-5 named topics) and the processing is really different, I

Re: Kafka Direct Stream

2015-10-01 Thread Cody Koeninger
You can get the topic for a given partition from the offset range. You can either filter using that, or just have a single RDD and match on topic when doing mapPartitions or foreachPartition (which I think is a better idea).
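
A hedged Scala sketch of this suggestion against the Spark 1.x Kafka direct API; ssc, kafkaParams and the topic names are assumed to exist and are only illustrative:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// One stream over both topics; the topic is recovered per partition from the offset ranges.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicA", "topicB"))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // The i-th RDD partition corresponds to the i-th offset range.
    val topic = offsetRanges(TaskContext.get.partitionId).topic
    topic match {
      case "topicA" => iter.foreach { case (k, v) => println(s"A: $v") }      // placeholder processing
      case _        => iter.foreach { case (k, v) => println(s"other: $v") }  // placeholder processing
    }
  }
}
```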

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Cody Koeninger
Did you check your Kafka broker logs to see what was going on during that time? The direct stream will handle normal leader loss / rebalance by retrying tasks. But the exception you got indicates that something with Kafka was wrong, such that offsets were being re-used, i.e. your job already

Re: Deploying spark-streaming application on production

2015-10-01 Thread Jeetendra Gangele
Yes. Also I think I need to enable checkpointing and, rather than building up the lineage DAG, store the RDD data in HDFS. On 23 September 2015 at 01:04, Adrian Tanase wrote: > btw I re-read the docs and I want to clarify that reliable receiver + WAL > gives you at

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Adrian Tanase
This also happened to me in extreme recovery scenarios - e.g. killing 4 machines out of a 7-machine cluster. I'd put my money on recovering from an out-of-sync replica, although I haven't done extensive testing around it. -adrian From: Cody Koeninger Date: Thursday, October 1, 2015 at 5:18 PM To:

Re: How to connect HadoopHA from spark

2015-10-01 Thread Ted Yu
Have you set up HADOOP_CONF_DIR in spark-env.sh correctly? Cheers On Thu, Oct 1, 2015 at 5:22 AM, Vinoth Sankar wrote: > Hi, > > How do I connect to Hadoop HA from Spark? I tried overwriting Hadoop > configurations from SparkConf. But still I'm getting UnknownHostException >

Decision Tree Model

2015-10-01 Thread hishamm
Hi, I am using Spark 1.4.0, Python and decision trees to perform machine learning classification. I test it by creating the predictions and zipping them with the test data, as follows: predictions = tree_model.predict(test_data.map(lambda a: a.features)) labels = test_data.map(lambda a:
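
For comparison, a Scala sketch of the same evaluate-by-zipping pattern with MLlib (the thread is PySpark, but the structure is identical); treeModel and testData are assumed to exist:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def evaluate(treeModel: DecisionTreeModel, testData: RDD[LabeledPoint]): Double = {
  val predictions = treeModel.predict(testData.map(_.features))
  // zip is safe here because predictions is derived from testData without reshuffling,
  // so both RDDs keep the same partitioning and element counts per partition.
  val labelsAndPredictions = testData.map(_.label).zip(predictions)
  labelsAndPredictions.filter { case (label, pred) => label != pred }.count().toDouble / testData.count()
}
```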

Re: automatic start of streaming job on failure on YARN

2015-10-01 Thread Adrian Tanase
This happens automatically as long as you submit with cluster mode instead of client mode (e.g. ./spark-submit --master yarn-cluster ...). The property you mention would help right after that, although you will need to set it to a large value (e.g. 1000?) - as there is no "infinite" support.

Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Marcelo Vanzin
How are you running the actual application? I find it slightly odd that you're setting PYSPARK_SUBMIT_ARGS directly; that's supposed to be an internal env variable used by Spark. You'd normally pass those parameters in the spark-submit (or pyspark) command line. On Thu, Oct 1, 2015 at 8:56 AM,

Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Ted Yu
In your second command, have you tried changing the comma to a colon? Cheers On Thu, Oct 1, 2015 at 8:56 AM, YaoPau wrote: > I'm trying to add multiple SerDe jars to my pyspark session. > > I got the first one working by changing my PYSPARK_SUBMIT_ARGS to: > > "--master

Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread YaoPau
I'm trying to add multiple SerDe jars to my pyspark session. I got the first one working by changing my PYSPARK_SUBMIT_ARGS to: "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar" But when I tried to add a

Getting spark application driver ID programmatically

2015-10-01 Thread Snehal Nagmote
Hi, I have a use case where we need to automate start/stop of a Spark Streaming application. To stop the Spark job, we need the driver/application ID of the job. For example: /app/spark-master/bin/spark-class org.apache.spark.deploy.Client kill spark://10.65.169.242:7077 $driver_id. I am thinking to
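
For what it's worth, the application ID (though not the standalone "driver-..." ID that the Client kill command takes) can be read from a running SparkContext; a minimal sketch, assuming sc is the application's SparkContext:

```scala
// The cluster manager's application ID, e.g. "app-2015..." on standalone or "application_..." on YARN.
// Note this is not the standalone driver-XXX ID used by org.apache.spark.deploy.Client kill.
val appId: String = sc.applicationId
println(s"Running as $appId")
```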

OOM error in Spark worker

2015-10-01 Thread varun sharma
My workers are going OOM over time. I am running a streaming job in Spark 1.4.0. Here is the heap dump of the workers: 16,802 instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by "sun.misc.Launcher$AppClassLoader @ 0xdff94088" occupy 488,249,688 (95.80%) bytes. These

How to access lost executor log file

2015-10-01 Thread Lan Jiang
Hi there, when running a Spark job on YARN, 2 executors somehow got lost during the execution. The message on the history server GUI is "CANNOT find address". Two extra executors were launched by YARN and eventually finished the job. Usually I go to the "Executors" tab on the UI to check the

Accumulator of rows?

2015-10-01 Thread Saif.A.Ellafi
Hi all, I need to repeat a couple of rows from a DataFrame n times each. To do so, I plan to create a new DataFrame, but I am unable to find a way to accumulate Rows somewhere. As this might get huge, I can't accumulate into a mutable Array, I think. Thanks, Saif
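
One possible sketch of avoiding local accumulation altogether: expand each row on the executors with flatMap over the underlying RDD[Row] and rebuild a DataFrame (Spark 1.x APIs; the fixed repeat count n is illustrative, and could instead come from a column of the row):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

def repeatRows(sqlContext: SQLContext, df: DataFrame, n: Int): DataFrame = {
  // Stays distributed: rows are duplicated on the executors, never collected to the driver.
  val repeated = df.rdd.flatMap(row => Seq.fill(n)(row))
  sqlContext.createDataFrame(repeated, df.schema)
}
```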

Re: How to access lost executor log file

2015-10-01 Thread Lan Jiang
Ted, thanks for your reply. First of all, after sending email to the mailing list, I used "yarn logs -applicationId" to retrieve the aggregated log successfully. I found the exceptions I was looking for. Now as to your suggestion, when I go to the YARN RM UI, I can only see the "Tracking URL" in

Re: How to access lost executor log file

2015-10-01 Thread Ted Yu
Can you go to the YARN RM UI to find all the attempts for this Spark job? The two lost executors should be found there. On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang wrote: > Hi, there > > When running a Spark job on YARN, 2 executors somehow got lost during the > execution. The

Re: Problem understanding spark word count execution

2015-10-01 Thread Kartik Mathur
Thanks Nicolae. So in my case all executors are sending results back to the driver, and "shuffle is just sending out the textFile to distribute the partitions" - could you please elaborate on this? What exactly is in this file? On Wed, Sep 30, 2015 at 9:57 PM, Nicolae Marasoiu <

"java.io.IOException: Filesystem closed" on executors

2015-10-01 Thread Lan Jiang
Hi there, here is the problem I ran into when executing a Spark job (Spark 1.3). The Spark job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0 library. Then it does some filter/map transformations, repartitions to 1 partition and then writes to HDFS. It creates 2 stages. The total

Re: Kafka Direct Stream

2015-10-01 Thread Nicolae Marasoiu
Hi, if you just need processing per topic, why not create N different Kafka direct streams? When creating a Kafka direct stream you pass a list of topics - just give one. Then the reusable part of your computations should be extractable as transformations/functions and reused between the
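
A hedged sketch of the one-direct-stream-per-topic idea; ssc, kafkaParams and the topic names are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Seq("topicA", "topicB", "topicC")

// One direct stream per topic, keyed by topic name.
val perTopicStreams = topics.map { topic =>
  topic -> KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
}.toMap

// The reusable part of the computation becomes an ordinary function applied per stream.
def parseAndClean(in: DStream[(String, String)]): DStream[String] = in.map(_._2.trim)

parseAndClean(perTopicStreams("topicA")).print()          // topic-specific sinks/actions go here
parseAndClean(perTopicStreams("topicB")).count().print()
```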

spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Sourabh Chandak
Hi, I am writing a Spark Streaming job using the direct stream method for Kafka and wanted to handle the case of checkpoint failure, when we'll have to reprocess the entire data from the start. By default, for every new checkpoint it tries to load everything from each partition and that takes a lot

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Cody Koeninger
That depends on your job, your cluster resources, the number of seconds per batch... You'll need to do some empirical work to figure out how many messages per batch a given executor can handle. Divide that by the number of seconds per batch. On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak
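
Once that empirical number is known, the setting itself is just a SparkConf property; a sketch with purely illustrative numbers:

```scala
import org.apache.spark.SparkConf

// Illustrative only: suppose testing shows ~50,000 messages per partition per 10-second batch
// is sustainable, i.e. 5,000 messages/sec/partition.
val conf = new SparkConf()
  .setAppName("kafka-direct-rate-limited")
  .set("spark.streaming.kafka.maxRatePerPartition", "5000")  // max messages per second, per Kafka partition
```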

python version in spark-submit

2015-10-01 Thread roy
Hi, we have Python 2.6 (default) on the cluster and we have also installed Python 2.7. I was looking for a way to set the Python version in spark-submit. Does anyone know how to do this? Thanks

Re: python version in spark-submit

2015-10-01 Thread Ted Yu
PYSPARK_PYTHON determines what the workers use. PYSPARK_DRIVER_PYTHON is for the driver. See the comment at the beginning of bin/pyspark. FYI On Thu, Oct 1, 2015 at 1:56 PM, roy wrote: > Hi, > > We have python2.6 (default) on cluster and also we have installed > python2.7. > > I

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Hi Pala, can you add the full stack trace of the exception? For now, can you use create temporary function to work around the issue? Thanks, Yin On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia < mchett...@rocketfuelinc.com.invalid> wrote: > +user list > > On Tue, Sep 29, 2015 at 3:43 PM, Pala

Shuffle Write v/s Shuffle Read

2015-10-01 Thread Kartik Mathur
Hi, I am trying to better understand shuffle in Spark. Based on my understanding thus far, *Shuffle Write*: writes the output of an intermediate stage to local disk if memory is not sufficient. For example, if each worker has 200 MB memory for intermediate results and the results are 300 MB, then

Java REST custom receiver

2015-10-01 Thread Pavol Loffay
Hello, is it possible to implement a custom receiver [1] which will receive messages from REST calls? As REST classes in Java (JAX-RS) are defined declaratively and instantiated by the application server, I'm not sure if it is possible. I have tried to implement a custom receiver which is injected into REST
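
For illustration, a rough Scala sketch of a custom receiver that embeds its own tiny HTTP endpoint (via the JDK's com.sun.net.httpserver) instead of relying on a server-managed JAX-RS class; the port and path are placeholders:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class RestReceiver(port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  @transient private var server: com.sun.net.httpserver.HttpServer = _

  override def onStart(): Unit = {
    // Start a lightweight HTTP server on the executor that hosts this receiver.
    server = com.sun.net.httpserver.HttpServer.create(new java.net.InetSocketAddress(port), 0)
    server.createContext("/ingest", new com.sun.net.httpserver.HttpHandler {
      override def handle(exchange: com.sun.net.httpserver.HttpExchange): Unit = {
        val body = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
        store(body)                           // hand the message to Spark Streaming
        exchange.sendResponseHeaders(200, -1) // acknowledge with an empty body
        exchange.close()
      }
    })
    server.start()                            // non-blocking; onStart must return quickly
  }

  override def onStop(): Unit = if (server != null) server.stop(0)
}
```

It would then be wired in with something like ssc.receiverStream(new RestReceiver(8085)); keep in mind that the receiver can be scheduled on any executor, so clients must be able to reach whichever host it lands on.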

Re: Problem understanding spark word count execution

2015-10-01 Thread Kartik Mathur
Hi Nicolae, thanks for the reply. To further clarify things - sc.textFile is reading from HDFS; now shouldn't the file be read in a way such that EACH executor works only on the locally available part of the file? In this case it's a ~4.64 GB file and the block size is 256 MB, so approx 19 partitions

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Pala M Muthaia
Thanks for getting back, Yin. I have copied the stack below. The associated query is just this: "hc.sql("select murmurhash3('abc') from dual")". The UDF murmurhash3 is already available in our Hive metastore. Regarding temporary functions, can I create a temp function from existing Hive UDF code,

SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-01 Thread haridass saisriram
Hi, I am trying to find a simple example to read a data file on HDFS. The file has the following format: a, b, c, yyyy, mm a1,b1,c1,2015,09 a2,b2,c2,2014,08 I would like to read this file and store it in HDFS partitioned by year and month, something like /path/to/hdfs/yyyy/mm. I want to
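
A sketch of one way to do this with Spark 1.4+ DataFrames; the paths, the case class and the column names are assumptions, and the output is written as Parquet rather than the original text format:

```scala
import org.apache.spark.sql.SQLContext

case class Record(a: String, b: String, c: String, year: String, month: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.textFile("hdfs:///path/to/input.csv")
  .map(_.split(","))
  .filter(_.length == 5)
  .map(p => Record(p(0).trim, p(1).trim, p(2).trim, p(3).trim, p(4).trim))
  .toDF()

// partitionBy lays the output out as /path/to/output/year=2015/month=09/...,
// which is close to the /path/to/hdfs/yyyy/mm layout asked for.
df.write.partitionBy("year", "month").parquet("hdfs:///path/to/output")
```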

Re: Problem understanding spark word count execution

2015-10-01 Thread Nicolae Marasoiu
Hi, so you say "sc.textFile -> flatMap -> map". My understanding is like this: first, a number of partitions is determined, say p of them; you can give a hint on this. Then the nodes which will load the p partitions are chosen, that is n nodes (where n <= p). Relatively at the same time or not, the n nodes

Re: Call Site - Spark Context

2015-10-01 Thread Sandip Mehta
Thanks Robin. Regards SM > On 01-Oct-2015, at 3:15 pm, Robin East wrote: > > From the comments in the code: > > When called inside a class in the spark package, returns the name of the user > code class (outside the spark package) that called into Spark, as well as >

Spark cluster - use machine name in WorkerID, not IP address

2015-10-01 Thread markluk
I'm running a standalone Spark cluster of 1 master and 2 slaves. My slaves file under conf/ lists the fully qualified domain names of the 2 slave machines. When I look at the Spark web page (on :8080), I see my 2 workers, but the worker ID uses the IP address, like

Re: How to Set Retry Policy in Spark

2015-10-01 Thread Renxia Wang
Additional Info: I am running Spark on YARN. 2015-10-01 15:42 GMT-07:00 Renxia Wang : > Hi guys, > > I know there is a way to set the number of retry of failed tasks, using > spark.task.maxFailures. what is the default policy for the failed tasks > retry? Is it exponential

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-01 Thread Utkarsh Sengar
Not sure what you mean by that; I shared the data which I see in the Spark UI. Can you point me to a location where I can get precisely the data you need? When I run the job in fine-grained mode, I see tons of tasks created and destroyed under a Mesos "framework". I have about 80k Spark tasks which

Re: Java REST custom receiver

2015-10-01 Thread Silvio Fiorito
When you say "receive messages" you mean acting as a REST endpoint, right? If so, it might be better to use the JMS (or Kafka) option for a few reasons: the receiver will be deployed to any of the available executors, so your REST clients will need to be made aware of the IP where the receiver is

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-01 Thread Utkarsh Sengar
Bumping it up; it's not really a blocking issue. But fine-grained mode eats up an uncertain amount of resources in Mesos and launches tons of tasks, so I would prefer using the coarse-grained mode if only it didn't run out of memory. Thanks, -Utkarsh On Mon, Sep 28, 2015 at 2:24 PM, Utkarsh Sengar

Re: Standalone Scala Project

2015-10-01 Thread Robineast
I've eyeballed the sbt file and it looks OK to me. Try sbt clean package - that should sort it out. If not, please supply the full code you are running. - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co.

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-01 Thread Tim Chen
Hi Utkarsh, I replied earlier asking what your task assignment looks like with fine vs. coarse grain mode. Tim On Thu, Oct 1, 2015 at 4:05 PM, Utkarsh Sengar wrote: > Bumping it up, its not really a blocking issue. > But fine grain mode eats up uncertain number of

Call Site - Spark Context

2015-10-01 Thread Sandip Mehta
Hi, I wanted to understand: what is the purpose of the call site in SparkContext? Regards SM

How to save DataFrame as a Table in Hbase?

2015-10-01 Thread unk1102
Hi, has anybody tried to save a DataFrame into HBase? I have processed data in a DataFrame which I need to store in HBase so that my web UI can access it. Please guide. Thanks in advance.
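
One common hand-rolled sketch (not the only option - connectors such as hbase-spark or Phoenix are alternatives): write each partition through the plain HBase 1.x client API. The table name, column family and the columns used below are placeholders:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.DataFrame

def saveToHBase(df: DataFrame): Unit = {
  df.rdd.foreachPartition { rows =>
    val conf = HBaseConfiguration.create()            // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("my_table"))
    try {
      rows.foreach { row =>
        // Row key and columns are illustrative; adapt to the actual schema.
        val put = new Put(Bytes.toBytes(row.getAs[String]("id")))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(row.getAs[String]("value")))
        table.put(put)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}
```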

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Yes. You can use create temporary function to create a function based on a Hive UDF ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction ). Regarding the error, I think the problem is that starting from Spark 1.4, we have two separate
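
A sketch of that workaround using the HiveContext (hc) from the thread; the jar path and UDF class name are hypothetical:

```scala
// Register the existing Hive UDF class for this session only.
hc.sql("ADD JAR /path/to/udfs.jar")
hc.sql("CREATE TEMPORARY FUNCTION murmurhash3 AS 'com.example.hive.udf.MurmurHash3'")
hc.sql("SELECT murmurhash3('abc') FROM dual").show()
```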

How to Set Retry Policy in Spark

2015-10-01 Thread Renxia Wang
Hi guys, I know there is a way to set the number of retries for failed tasks, using spark.task.maxFailures. What is the default policy for failed task retries? Is it exponential backoff? My tasks sometimes fail because of socket connection timeout/reset; even with retries, some of the tasks will

Re: Worker node timeout exception

2015-10-01 Thread Mark Luk
Here is the log file from the worker node: 15/09/30 23:49:37 INFO Worker: Executor app-20150930233113-/8 finished with state EXITED message Command exited with code 1 exitStatus 1 15/09/30 23:49:37 INFO Worker: Asked to launch executor app-20150930233113-/9 for PythonPi 15/09/30 23:49:37

Re: What is the best way to submit multiple tasks?

2015-10-01 Thread Shixiong Zhu
Right, you can use SparkContext and SQLContext in multiple threads. They are thread safe. Best Regards, Shixiong Zhu 2015-10-01 4:57 GMT+08:00 : > Hi all, > > I have a process where I do some calculations on each one of the columns > of a dataframe. >
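
A minimal sketch of driving several concurrent jobs from one application with Scala Futures; df and the column names are placeholders:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val columns = Seq("colA", "colB", "colC")

// Each Future submits its own Spark job; the shared SparkContext/SQLContext handle the concurrency.
val futures = columns.map { col =>
  Future {
    df.groupBy(col).count().collect()
  }
}

val results = futures.map(f => Await.result(f, 30.minutes))
```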

Re: Spark Streaming Standalone 1.5 - Stage cancelled because SparkContext was shut down

2015-10-01 Thread Shixiong Zhu
Do you have the log? It looks like some exception in your code made the SparkContext stop. Best Regards, Shixiong Zhu 2015-09-30 17:30 GMT+08:00 tranan : > Hello All, > > I have several Spark Streaming applications running on Standalone mode in > Spark 1.5. Spark is currently

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-10-01 Thread Hemant Bhanawat
As I understand it, you don't need a merge of your historical data RDD with your RDD_inc; what you need is a merge of the computation results of your historical RDD with RDD_inc, and so on. IMO, you should consider having an external row store to hold your computations. I say this because you need

Re: Worker node timeout exception

2015-10-01 Thread Shixiong Zhu
Do you have the log file? It may be because of wrong settings. Best Regards, Shixiong Zhu 2015-10-01 7:32 GMT+08:00 markluk : > I setup a new Spark cluster. My worker node is dying with the following > exception. > > Caused by: java.util.concurrent.TimeoutException: Futures

RE: Problem understanding spark word count execution

2015-10-01 Thread java8964
I am not sure about the original explanation of shuffle write. In the word count example, the shuffle is needed, as Spark has to group by the word (reduceByKey is more accurate here). Imagine that you have 2 mappers to read the data; then each mapper will generate the (word, count) tuple output in
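
For reference, the canonical word count, annotated with where the shuffle boundary falls under this explanation; the paths are placeholders:

```scala
val counts = sc.textFile("hdfs:///path/to/input")   // stage 1: read + flatMap + map (+ map-side combine)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                               // shuffle boundary; stage 2 aggregates per key
counts.saveAsTextFile("hdfs:///path/to/output")
```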

Re: Spark streaming job filling a lot of data in local spark nodes

2015-10-01 Thread swetha kasireddy
We have limited disk space. So, can we use spark.cleaner.ttl to clean up the files? Or is there any setting that can clean up old temp files? On Mon, Sep 28, 2015 at 7:02 PM, Shixiong Zhu wrote: > These files are created by shuffle and just some temp files. They are not >

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.0 users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1

Re: How to access lost executor log file

2015-10-01 Thread Ted Yu
Looks like the Spark history server should take the lost executors into account by analyzing the output from the 'yarn logs -applicationId' command. Cheers On Thu, Oct 1, 2015 at 11:46 AM, Lan Jiang wrote: > Ted, > > Thanks for your reply. > > First of all, after sending email to

spark-submit --packages using different resolver

2015-10-01 Thread Jerry Lam
Hi Spark users and developers, I'm trying to use spark-submit --packages against a private S3 repository. With sbt, I'm using fm-sbt-s3-resolver with proper AWS S3 credentials. I wonder how I can add this resolver to spark-submit so that --packages can resolve dependencies from the private repo?

calling persist would cause java.util.NoSuchElementException: key not found:

2015-10-01 Thread Eyad Sibai
Hi, I am trying to call .persist() on a DataFrame, but once I execute the next line I get java.util.NoSuchElementException: key not found: …. I tried persisting to disk as well - same thing. I am using pyspark with Python 3 and Spark 1.5. Thanks! EYAD SIBAI Risk Engineer iZettle ®