Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Igor Berman
You've asked for total cores to be 2, plus 1 for the driver (since you are running in cluster mode, the driver runs on one of the slaves). Change total cores to 3*2 and change submit mode to client - you'll have full utilization. (Btw, it's not advisable to use all cores of a slave... since there are OS processes and
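
A minimal spark-submit sketch of that suggestion (assuming a standalone cluster of 3 workers with 2 cores each; the master URL, class and jar names are placeholders, not from the thread):

    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --total-executor-cores 6 \
      --class com.example.MyApp \
      my-app.jar

In client mode the driver runs on the submitting machine, so it no longer takes a core away from a worker; --total-executor-cores (equivalently spark.cores.max) caps the cores the application grabs across the cluster.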

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-20 Thread Igor Berman
Try to assemble log4j.xml or log4j.properties into your jar... you'll probably get what you want. However, pay attention that when you move to a multi-node cluster there will be differences. On 20 November 2015 at 05:10, Afshartous, Nick wrote: > > < log4j.properties file
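
As a sketch of "assembling log4j.properties into the jar": the file goes under src/main/resources so the assembly picks it up on the classpath (the levels below are illustrative, not from the thread):

    # src/main/resources/log4j.properties
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    # quiet one chatty package, as an example
    log4j.logger.org.apache.spark.storage=ERROR

On a multi-node cluster, classpath ordering and the cluster's own log4j defaults can change which file wins - presumably the difference warned about above; an alternative is shipping the file with --files and pointing -Dlog4j.configuration at it via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.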

how to use sc.hadoopConfiguration from pyspark

2015-11-20 Thread Tamas Szuromi
Hello, I've just wanted to use sc._jsc.hadoopConfiguration().set('key','value') in pyspark 1.5.2 but I got a "set method not exists" error. Does anyone know a workaround to set some HDFS-related properties like dfs.blocksize? Thanks in advance! Tamas
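
One workaround (my suggestion, not confirmed in the thread) is to avoid the _jsc call entirely: any spark.hadoop.* entries in the Spark configuration are copied into the Hadoop Configuration the context builds, so HDFS properties can be passed at launch time. The block size value below is a 128 MB placeholder:

    pyspark --conf spark.hadoop.dfs.blocksize=134217728
    # or, for a submitted job (my_job.py is a placeholder):
    spark-submit --conf spark.hadoop.dfs.blocksize=134217728 my_job.py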

Re: Save GraphX to disk

2015-11-20 Thread Ashish Rawat
Hi Todd, Could you please provide an example of doing this? Mazerunner seems to do something similar with Neo4j, but it goes via HDFS and updates only the graph properties. Is there a direct way to do this with Neo4j or Titan? Regards, Ashish From: SLiZn Liu

Data in one partition after reduceByKey

2015-11-20 Thread Patrick McGloin
Hi, I have a Spark application which contains the following segment: val repartitioned = rdd.repartition(16) val filtered: RDD[(MyKey, myData)] = MyUtils.filter(repartitioned, startDate, endDate) val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2)) val reduced:
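
For reference, reduceByKey accepts an explicit partition count (or a Partitioner), and partition sizes can be inspected directly; a sketch continuing the snippet above ("combine" is a placeholder reduce function, not from the thread):

    def combine(a: myData, b: myData): myData = ???   // placeholder

    val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(combine _, 16)

    // check how many records actually land in each partition
    reduced.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
      .collect()
      .foreach { case (i, n) => println(s"partition $i: $n records") }

If most records share one key (or keys whose hashCodes collide), they will still collect in a single partition regardless of the requested count; in that case a custom Partitioner over a better-distributed key is the usual fix.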

Drop multiple columns in the DataFrame API

2015-11-20 Thread BenFradet
Hi everyone, I was wondering if there is a better way to drop multiple columns from a DataFrame, or why there is no drop(cols: Column*) method in the DataFrame API. Indeed, I tend to write code like this: val filteredDF = df.drop("colA") .drop("colB") .drop("colC") //etc which is a
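
A common way to avoid the repeated calls, given that no varargs drop exists in this API version, is to fold over the column names (a sketch; the column names are placeholders):

    val colsToDrop = Seq("colA", "colB", "colC")
    val filteredDF = colsToDrop.foldLeft(df)((acc, c) => acc.drop(c))

    // or invert it and select only the columns you want to keep
    val keptDF = df.select(df.columns.filterNot(colsToDrop.contains).map(df.col): _*)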

Re: getting different results from same line of code repeated

2015-11-20 Thread Walrus theCat
I'm running into all kinds of problems with Spark 1.5.1 -- does anyone have a version that's working smoothly for them? On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler wrote: > I didn't expect that to fail. I would call it a bug for sure, since it's > practically useless

Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread kali.tumm...@gmail.com
Hi All, If write ahead logs are enabled in Spark Streaming, does all the received data get written to the HDFS path, or does it only write the metadata? How does cleanup work - does the HDFS path get bigger and bigger every day, and do I need to write a cleanup job to delete data from the write ahead logs
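
For reference, as I understand it the write-ahead log is enabled with a single flag and lives under the streaming checkpoint directory; for receiver-based sources the received blocks themselves (not just metadata) are logged, and old segments are removed automatically once the batches referencing them have completed. A minimal sketch (app name, batch interval and path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("WALExample")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // the write-ahead log files are stored under this checkpoint directory
    ssc.checkpoint("hdfs:///user/example/checkpoints")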

Re: getting different results from same line of code repeated

2015-11-20 Thread Ted Yu
Mind trying the 1.5.2 release? Thanks On Fri, Nov 20, 2015 at 10:56 AM, Walrus theCat wrote: > I'm running into all kinds of problems with Spark 1.5.1 -- does anyone > have a version that's working smoothly for them? > > On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
I think my problem persists whether I use Kafka or sockets. Or am I wrong? How would you use Kafka here? On Fri, Nov 20, 2015 at 7:12 PM, Christian wrote: > Have you considered using Kafka? > > On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa wrote: > >>

Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Hello. I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to 1.5.2. Has anyone seen this issue? I'm invoking the following: new HiveContext(sc) // sc is a Spark Context I am seeing the following error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
The 1.5.2 Spark was compiled using the following options: mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver clean package Regards, Bryan Jeffrey On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey wrote: > Hello. > > I'm seeing an error

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Cody Koeninger
You're confused about which parts of your code are running on the driver vs the executor, which is why you're getting serialization errors. Read http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Fri, Nov 20, 2015 at 1:07 PM, Saiph
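
The pattern that guide section describes boils down to creating non-serializable resources inside foreachPartition, so they are built on the executors instead of being captured from the driver; a condensed sketch (dstream, ConnectionPool and send are placeholders for whatever client is being used):

    dstream.foreachRDD { rdd =>
      // this part of the closure runs on the driver
      rdd.foreachPartition { records =>
        // this part runs on an executor: create or borrow the connection here,
        // so nothing non-serializable is shipped from the driver
        val connection = ConnectionPool.getConnection()
        records.foreach(record => connection.send(record))
        ConnectionPool.returnConnection(connection)
      }
    }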

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Nevermind. I had a library dependency that still had the old Spark version. On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey wrote: > The 1.5.2 Spark was compiled using the following options: mvn > -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive >

RE: Spark Expand Cluster

2015-11-20 Thread Bozeman, Christopher
Dan, Even though you may be adding more nodes to the cluster, the Spark application has to request additional executors in order to use the added resources. Or the Spark application can use Dynamic Resource Allocation
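
The relevant knobs, as a sketch (the numbers, class and jar names are placeholders; dynamic allocation on YARN also requires the external shuffle service to be configured on the NodeManagers):

    # static: ask for more executors explicitly
    spark-submit --num-executors 10 --executor-cores 2 --executor-memory 4g \
      --class com.example.MyApp my-app.jar

    # or dynamic allocation
    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      --class com.example.MyApp my-app.jar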

RE: Yarn Spark on EMR

2015-11-20 Thread Bozeman, Christopher
Suraj, The Spark History Server is running on 18080 (http://spark.apache.org/docs/latest/monitoring.html), which is not going to give you a real-time view of a running Spark application. Given this is Spark on YARN, you will need to view the Spark UI from the Application Master URL which can

Correlation between 2 consecutive RDDs in DStream

2015-11-20 Thread anshu shukla
1 - Is there any way to either make pairs of RDDs from a DStream (DStream ---> DStream of pairs), so that I can use the already defined correlation function in Spark? *Aim is to find the auto-correlation value in Spark. (As far as I know, Spark Streaming does not support this.)* -- Thanks &
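
One way to approximate this (my sketch, not something the thread settled on) is to remember the previous batch's RDD on the driver inside foreachRDD and feed both into MLlib's correlation; it assumes a hypothetical values: DStream[Double] whose consecutive batches contain the same number of elements, which Statistics.corr requires:

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    var previous: RDD[Double] = null              // driver-side state across batches

    values.foreachRDD { current =>
      current.cache()
      if (previous != null && !previous.isEmpty() && !current.isEmpty()) {
        val corr = Statistics.corr(previous, current)   // Pearson by default
        println(s"lag-1 correlation: $corr")
      }
      previous = current
    }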

question about combining small input splits

2015-11-20 Thread nezih
Hey everyone, I have a Hive table that has a lot of small parquet files and I am creating a data frame out of it to do some processing, but since I have a large number of splits/files my job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive
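
One workaround for the task explosion (an assumption on my part; there may also be input-format-level settings) is to coalesce right after the read - coalesce merges partitions without a shuffle, so downstream stages run with far fewer tasks. A sketch with placeholder names:

    val df = sqlContext.table("my_small_files_table")   // or sqlContext.read.parquet(path)
    val fewerTasks = df.coalesce(64)
    fewerTasks.groupBy("some_col").count().show()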

FW: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-11-20 Thread Mich Talebzadeh
From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 20 November 2015 21:14 To: u...@hive.apache.org Subject: starting spark-shell throws /tmp/hive on HDFS should be writable error Hi, Has this been resolved? I don't think this has anything to do with /tmp/hive directory permission

How to kill the spark job using Java API.

2015-11-20 Thread Hokam Singh Chauhan
Hi, I have been running a Spark job on a standalone Spark cluster. I want to kill the Spark job using a Java API. I have the Spark job name and the Spark job id. The REST POST call for killing the job is not working. If anyone has explored this, please help me out. -- Thanks and Regards, Hokam Singh
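
For reference, the standalone master's REST kill endpoint only applies to drivers that were submitted in cluster mode through the REST gateway (port 6066 by default); the host and driver id below are placeholders:

    curl -X POST http://spark-master:6066/v1/submissions/kill/driver-20151120123456-0001

    # equivalently, via the bundled client class:
    spark-class org.apache.spark.deploy.Client kill spark://spark-master:7077 driver-20151120123456-0001

Drivers launched in client mode are ordinary JVM processes on the submitting machine and have to be stopped there instead.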

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Ted Yu
Interesting, SPARK-3090 installs a shutdown hook for stopping the SparkContext. FYI On Fri, Nov 20, 2015 at 7:12 PM, Stéphane Verlet wrote: > I solved the first issue by adding a shutdown hook in my code. The > shutdown hook get call when you exit your script (ctrl-C ,

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
I tried adding a shutdown hook to my code but it didn't help. Still the same issue. On Fri, Nov 20, 2015 at 7:08 PM, Ted Yu wrote: > Which Spark release are you using ? > > Can you pastebin the stack trace of the process running on your machine ? > > Thanks > > On Nov 20, 2015,

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Spark 1.4.1 On Friday, November 20, 2015, Ted Yu wrote: > Which Spark release are you using ? > > Can you pastebin the stack trace of the process running on your machine ? > > Thanks > > On Nov 20, 2015, at 6:46 PM, Vikram Kone

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I am not sure; I think it has to do with the signal sent to the process and how the JVM handles it. Ctrl-C sends a SIGINT vs a TERM signal for the kill command. On Fri, Nov 20, 2015 at 8:21 PM, Vikram Kone wrote: > Thanks for the info Stephane. > Why does CTRL-C in the

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread varun sharma
I do this in my stop script to kill the application: kill -s SIGTERM `pgrep -f StreamingApp` To stop it forcefully: pkill -9 -f "StreamingApp" StreamingApp is the name of the class which I submitted. I also have a shutdown hook thread to stop it gracefully. sys.ShutdownHookThread { logInfo("Gracefully
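
A sketch of how such a hook typically looks - not the poster's full code, and it assumes ssc is the StreamingContext and the enclosing class mixes in Spark's Logging trait for logInfo:

    sys.ShutdownHookThread {
      logInfo("Gracefully stopping the streaming application")
      // finish processing the data already received, then stop the SparkContext too
      ssc.stop(stopSparkContext = true, stopGracefully = true)
      logInfo("Streaming application stopped")
    }

Since 1.4 there is also a spark.streaming.stopGracefullyOnShutdown flag that makes the StreamingContext register an equivalent hook itself.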

Error in Saving the MLlib models

2015-11-20 Thread hokam chauhan
Hi, I am exploring MLlib. I have taken the MLlib examples and tried to train an SVM model. I am getting an exception when saving the trained model. When I run the code in local mode it works fine, but when I run the MLlib example in standalone cluster mode it fails to save the model.
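
One common reason for exactly this local-works/cluster-fails split (a guess, since the stack trace is not shown) is saving to a path that only exists on the submitting machine; in cluster mode the save path has to be on a filesystem every node can reach. A minimal save/load sketch with placeholder paths (trainingData: RDD[LabeledPoint] is assumed):

    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}

    val model = SVMWithSGD.train(trainingData, numIterations = 100)
    model.save(sc, "hdfs:///user/example/models/svm")      // shared filesystem, not a local path
    val sameModel = SVMModel.load(sc, "hdfs:///user/example/models/svm")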

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I solved the first issue by adding a shutdown hook in my code. The shutdown hook gets called when you exit your script (Ctrl-C, kill … but not kill -9). val shutdownHook = scala.sys.addShutdownHook { try { sparkContext.stop() // Make sure to kill any other threads or thread pools you may be

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Thanks for the info Stephane. Why does Ctrl-C in the terminal running spark-submit kill the app in the Spark master correctly without any explicit shutdown hooks in the code? Can you explain why we need to add the shutdown hook to kill it when launched via a shell script? For the second issue, I'm not

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Ted Yu
Which Spark release are you using ? Can you pastebin the stack trace of the process running on your machine ? Thanks > On Nov 20, 2015, at 6:46 PM, Vikram Kone wrote: > > Hi, > I'm seeing a strange problem. I have a spark cluster in standalone mode. I > submit spark