Re: Re: Spark streaming doesn't print output when working with standalone master

2015-02-20 Thread Akhil Das
local[3] spawns 3 threads on 1 core :) Thanks Best Regards On Fri, Feb 20, 2015 at 12:50 PM, bit1...@163.com bit1...@163.com wrote: Thanks Akhil, you are right. I checked and found that I have only 1 core allocated to the program. I am running on a virtual machine, and only allocated one

Regarding shuffle data file format

2015-02-20 Thread twinkle sachdeva
Hi, What file format is used to write files during shuffle write? Does it depend on the Spark shuffle manager or the output format? Is it possible to change the file format used for shuffle, irrespective of the output format of the file? Thanks, Twinkle

Re: RDD Partition number

2015-02-20 Thread Alessandro Lulli
Hi All, Thanks for your answers. I have one more detail to point out. It is clear now how the partition number is defined for an HDFS file. However, suppose I have my dataset replicated on all the machines in the same absolute path, and each machine has, for instance, an ext3 filesystem. If I load

Re: Spark on Mesos: Multiple Users with iPython Notebooks

2015-02-20 Thread Iulian Dragoș
On Thu, Feb 19, 2015 at 2:49 PM, John Omernik j...@omernik.com wrote: I am running Spark on Mesos and it works quite well. I have three users, all who setup iPython notebooks to instantiate a spark instance to work with on the notebooks. I love it so far. Since I am auto instantiating (I

Re: Spark Performance on Yarn

2015-02-20 Thread Sean Owen
None of this really points to the problem. These indicate that workers died but not why. I'd first go locate executor logs that reveal more about what's happening. It sounds like a hard-er type of failure, like JVM crash or running out of file handles, or GC thrashing. On Fri, Feb 20, 2015 at

Accumulator in SparkUI for streaming

2015-02-20 Thread Tim Smith
On Spark 1.2: I am trying to capture # records read from a kafka topic: val inRecords = ssc.sparkContext.accumulator(0, "InRecords") .. kInStreams.foreach( k => { k.foreachRDD ( rdd => inRecords += rdd.count().toInt ) inRecords.value Question
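
A minimal sketch of the accumulator pattern this thread describes, against the 1.2-era streaming API; a socket stream stands in for the Kafka streams and all names are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("AccumulatorCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Named accumulator; its value is only reliably readable on the driver.
    val inRecords = ssc.sparkContext.accumulator(0L, "InRecords")

    // Any input DStream works here; socketTextStream stands in for the Kafka streams.
    val stream = ssc.socketTextStream("localhost", 9999)
    stream.foreachRDD { rdd =>
      inRecords += rdd.count()   // runs on the driver, once per batch
      println(s"InRecords so far: ${inRecords.value}")
    }

    ssc.start()
    ssc.awaitTermination()

Whether such a named accumulator shows up in the Streaming tab of the UI is the open question of the thread; the sketch only shows how the count is captured.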

DataFrame: Enable zipWithUniqueId

2015-02-20 Thread Dima Zhiyanov
Hello Question regarding the new DataFrame API introduced here https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html I oftentimes use the zipWithUniqueId method of the SchemaRDD (as an RDD) to replace string keys with more efficient long keys.
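
A possible workaround sketch for the 1.3 DataFrame API (not an official DataFrame method): drop down to the underlying RDD[Row], call zipWithUniqueId there, and rebuild a DataFrame with the extra column. Here df and sqlContext are assumed to already exist, and the column name "uid" is illustrative:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // DataFrame -> RDD[Row] -> zipWithUniqueId -> DataFrame with an extra long column.
    val withId = df.rdd.zipWithUniqueId().map { case (row, id) =>
      Row.fromSeq(row.toSeq :+ id)
    }
    val newSchema = StructType(df.schema.fields :+ StructField("uid", LongType, nullable = false))
    val dfWithId = sqlContext.createDataFrame(withId, newSchema)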

Re: Streaming Linear Regression

2015-02-20 Thread Emre Sevinc
Hello Baris, Giving your complete source code (if not very long, or maybe via https://gist.github.com/) could be more helpful. Also telling which Spark version you use, on which file system, and how you run your application, together with any log / output info it produces, might make

Re: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Sean Owen
Although I don't know if it's related, the Class.forName() method of loading drivers is very old. You should be using DataSource and javax.sql; this has been the usual practice since about Java 1.4. Why do you say a different driver is being loaded? That's not the error here. Try instantiating
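
A hedged sketch of the javax.sql.DataSource approach Sean mentions, written against the Postgres driver's simple (non-pooled) DataSource; the host, database, credentials, table name and the rdd variable are all placeholders, and a pooled DataSource from a pooling library would be wired in the same way:

    import javax.sql.DataSource
    import org.postgresql.ds.PGSimpleDataSource

    def makeDataSource(): DataSource = {
      val ds = new PGSimpleDataSource()
      ds.setServerName("db-host")
      ds.setDatabaseName("mydb")
      ds.setUser("user")
      ds.setPassword("secret")
      ds
    }

    // Build the DataSource inside foreachPartition so nothing non-serializable
    // has to be shipped from the driver to the executors.
    rdd.foreachPartition { rows =>
      val conn = makeDataSource().getConnection
      val stmt = conn.prepareStatement("INSERT INTO events(payload) VALUES (?)")
      try {
        rows.foreach { r =>
          stmt.setString(1, r.toString)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }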

Can you add Big Industries to the Powered by Spark page?

2015-02-20 Thread Emre Sevinc
Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL: http://www.bigindustries.be/ Spark Components: Spark Streaming Use Case: Big Content Platform Summary:

Re: Re: Spark streaming doesn't print output when working with standalone master

2015-02-20 Thread bit1...@163.com
Thanks Akhil. From: Akhil Das Date: 2015-02-20 16:29 To: bit1...@163.com CC: user Subject: Re: Re: Spark streaming doesn't print output when working with standalone master local[3] spawns 3 threads on 1 core :) Thanks Best Regards On Fri, Feb 20, 2015 at 12:50 PM, bit1...@163.com

Setting the number of executors in standalone mode

2015-02-20 Thread Yiannis Gkoufas
Hi there, I try to increase the number of executors per worker in the standalone mode and I have failed to achieve that. I followed a bit the instructions of this thread: http://stackoverflow.com/questions/26645293/spark-configuration-memory-instance-cores and did that: spark.executor.memory 1g

Where to look for potential causes for Akka timeout errors in a Spark Streaming Application?

2015-02-20 Thread Emre Sevinc
Hello, We are building a Spark Streaming application that listens to a directory on HDFS, and uses the SolrJ library to send newly detected files to a Solr server. When we put 10,000 files into the directory it is listening to, it starts to process them by sending the files to our Solr server, but

GSOC2015

2015-02-20 Thread magellane a
Hi Since we're approaching the GSOC2015 application process I have some questions: 1) Will your organization be a part of GSOC2015 and what are the projects that you will be interested in? 2) Since I'm not a contributor to apache spark, what are some starter tasks I can work on to gain facility

Re: storing MatrixFactorizationModel (pyspark)

2015-02-20 Thread Antony Mayi
well, I understand the math (having two vectors) but the Python MatrixFactorizationModel object seems to be just a wrapper around the Java class, so I'm not sure how to extract the two RDDs? Thx, Antony. On Thursday, 19 February 2015, 16:32, Ilya Ganelin ilgan...@gmail.com wrote: Yep. the

Re: Why is RDD lookup slow?

2015-02-20 Thread shahab
Thank you all. Just changing the RDD to a Map structure saved me approx. 1 second. Yes, I will check out IndexedRDD to see if it has better performance. best, /Shahab On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz brk...@gmail.com wrote: If your dataset is large, there is a Spark Package called
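
A sketch of the "RDD to Map" change described above, assuming the pair RDD is small enough to collect to the driver; sc is an existing SparkContext and the data is illustrative:

    import org.apache.spark.SparkContext._

    // Collect the pair RDD once, then do local constant-time lookups instead of
    // calling RDD.lookup (which launches a job) for every query.
    val pairRdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val lookupMap: Map[Int, String] = pairRdd.collectAsMap().toMap
    println(lookupMap.get(2))   // Some(b), without touching the cluster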

Re: Spark job fails on cluster but works fine on a single machine

2015-02-20 Thread Pavel Velikhov
I definitely delete the file on the right HDFS, I only have one HDFS instance. The problem seems to be in the CassandraRDD - reading always fails in some way when run on the cluster, but single-machine reads are okay. On Feb 20, 2015, at 4:20 AM, Ilya Ganelin ilgan...@gmail.com wrote: The

Re: Streaming Linear Regression

2015-02-20 Thread Emre Sevinc
Baris, I've tried the following piece of code: https://gist.github.com/emres/10c509c1d69264fe6fdb and built it using sbt package and then submitted it via spark-submit --class org.apache.spark.examples.mllib.StreamingLinearRegression --master local[4]

Re: Spark on Mesos: Multiple Users with iPython Notebooks

2015-02-20 Thread John Omernik
Awesome! This is exactly what I'd need. Unfortunately, I am not a programmer of any talent or skill, but how could I assist with this JIRA? From a User perspective, this is really the next step for my org taking our Mesos cluster to user land with Spark. I don't want to be pushy, but is there any

what does Submitting ... missing tasks from Stage mean?

2015-02-20 Thread shahab
Hi, Probably this is a silly question, but I couldn't find any clear documentation explaining why one should see "Submitting ... missing tasks from Stage ..." in the logs. Especially in my case, where I do not have any failure in job execution, I wonder why this should happen? Does it have any relation to

Re: Where to look for potential causes for Akka timeout errors in a Spark Streaming Application?

2015-02-20 Thread Todd Nist
Hi Emre, Have you tried adjusting these: .set("spark.akka.frameSize", "500").set("spark.akka.askTimeout", "30").set("spark.core.connection.ack.wait.timeout", "600") -Todd On Fri, Feb 20, 2015 at 8:14 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, We are building a Spark Streaming application that

Re: Spark Streaming and message ordering

2015-02-20 Thread Cody Koeninger
For a given batch, for a given partition, the messages will be processed in order by the executor that is running that partition. That's because messages for the given offset range are pulled by the executor, not pushed from some other receiver. If you have speculative execution, yes, another

How Spark and Flink are shaping the future of Hadoop?

2015-02-20 Thread Slim Baltagi
Hi 1. To get a taste of my talk at the 2015 Hadoop Summit, please find below a few links to a similar talk that I gave at the Chicago Hadoop Users Group on 'Transitioning Compute Models: Apache MapReduce to Spark' on February 12, 2015 in front of 185 attendees: - Video Recording:

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Xiangrui Meng
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS performance should be improved in 1.3.0. -Xiangrui On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi Ilya, thanks for your insight, this was the right clue. I had default parallelism already

randomSplit instead of a huge map reduce ?

2015-02-20 Thread shlomib
Hi, I am new to Spark and I think I missed something very basic. I have the following use case (I use Java and run Spark locally on my laptop): I have a JavaRDD<String[]> - The RDD contains around 72,000 arrays of strings (String[]) - Each array contains 80 words (on average). What I want to

Re: output worker stdout to one place

2015-02-20 Thread Anny Chen
Thanks Marcelo! I will try to change the log4j.properties On Fri, Feb 20, 2015 at 11:37 AM, Marcelo Vanzin van...@cloudera.com wrote: Hi Anny, You could play with creating your own log4j.properties that will write the output somewhere else (e.g. to some remote mount, or remote syslog).

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread Sourigna Phetsarath
Correction, should be HADOOP_CONF_DIR=/etc/hive/conf spark-shell --driver-class-path '/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' On Fri, Feb 20, 2015

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread Sourigna Phetsarath
Correction, should be HADOOP_CONF_DIR=/etc/hive/conf --driver-class-path '/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' On Fri, Feb 20, 2015 at 3:43 PM,

Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
A single vector of size 10^7 won't hit that bound. How many clusters did you set? The broadcast variable size is 10^7 * k and you can calculate the amount of memory it needs. Try to reduce the number of tasks and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 7:20 PM, lihu

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Ilya Ganelin
No problem, Antony. MLlib is tricky! I'd love to chat with you about your use case - sounds like we're working on similar problems/scales. On Fri, Feb 20, 2015 at 1:55 PM Xiangrui Meng men...@gmail.com wrote: Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS performance

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread chirag lakhani
That worked perfectly...thanks so much! On Fri, Feb 20, 2015 at 3:49 PM, Sourigna Phetsarath gna.phetsar...@teamaol.com wrote: Correction, should be HADOOP_CONF_DIR=/etc/hive/conf spark-shell --driver-class-path '/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I appreciate your clear explanation. Let me try again. It's the best way to confirm I understand. spark.executor.memory + spark.yarn.executor.memoryOverhead = the amount of memory for which YARN will create a JVM container; spark.executor.memory = the memory I can actually use in my JVM application = part of

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread Sourigna Phetsarath
Try it without --master yarn-cluster if you are trying to run a spark-shell. :) On Fri, Feb 20, 2015 at 3:18 PM, chirag lakhani chirag.lakh...@gmail.com wrote: I tried spark-shell --master yarn-cluster --driver-class-path

Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Oh no worries at all. If you want, I'd be glad to make updates and PR for anything I find, eh?! On Fri, Feb 20, 2015 at 12:18 Michael Armbrust mich...@databricks.com wrote: Yeah, sorry. The programming guide has not been updated for 1.3. I'm hoping to get to that this weekend / next week.

Spark performance tuning

2015-02-20 Thread java8964
Hi, I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I setup a standalone box, with 24 cores and 64G memory. We have one SQL in mind to test. Here is the basically setup on this one box for the SQL we are trying to run: 1) Dataset 1, 6.6G AVRO file with snappy

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread chirag lakhani
Thanks! I am able to log in to Spark now but I am still getting the same error: scala> sqlContext.sql("FROM analytics.trainingdatafinal SELECT *").collect().foreach(println) 15/02/20 14:40:22 INFO ParseDriver: Parsing command: FROM analytics.trainingdatafinal SELECT * 15/02/20 14:40:22 INFO

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread Sourigna Phetsarath
Also, you might want to add the hadoop configs: HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf --driver-class-path '/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
That's all correct. -Sandy On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu 2dot7kel...@gmail.com wrote: Hi Sandy, I appreciate your clear explanation. Let me try again. It's the best way to confirm I understand. spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory that

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Antony Mayi
Hi Ilya, thanks for your insight, this was the right clue. I had default parallelism already set but it was quite low (hundreds) and moreover the number of partitions of the input RDD was low as well so the chunks were really too big. Increased parallelism and repartitioning seems to be

Re: No executors allocated on yarn with latest master branch

2015-02-20 Thread Sandy Ryza
Are you using the capacity scheduler or fifo scheduler without multi resource scheduling by any chance? On Thu, Feb 12, 2015 at 1:51 PM, Anders Arpteg arp...@spotify.com wrote: The nm logs only seems to contain similar to the following. Nothing else in the same time range. Any help?

Re: what does Submitting ... missing tasks from Stage mean?

2015-02-20 Thread Imran Rashid
Yeah, this is just the totally normal message when Spark executes something. The first time something is run, all of its tasks are missing. I would not worry about cases when all tasks aren't missing if you're new to Spark; it's probably an advanced concept that you don't care about. (and would

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Mingyu Kim
I didn’t get any response. It’d be really appreciated if anyone using a special OutputCommitter for S3 can comment on this! Thanks, Mingyu From: Mingyu Kim m...@palantir.com Date: Monday, February 16, 2015 at 1:15 AM To: user@spark.apache.org

Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions. I'm experimenting with different values for spark memoryOverhead and explicitly giving the executors more memory, but still have not found the golden medium to get it to finish in a proper time frame. Is my cluster massively undersized at 5 boxes, 8 GB / 2 CPU each? Trying to

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Josh Rosen
We (Databricks) use our own DirectOutputCommitter implementation, which is a couple tens of lines of Scala code. The class would almost entirely be a no-op except we took some care to properly handle the _SUCCESS file. On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim m...@palantir.com wrote: I
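
The class Josh describes is not public, but a rough sketch of the idea (tasks write directly to the final S3 location, so commit and abort become no-ops and the slow rename-from-temporary step disappears) might look like the following; handling of the _SUCCESS marker is only noted in a comment:

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    // Sketch only: every method is a no-op because task output already sits at
    // its final destination.
    class DirectOutputCommitter extends OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}
      override def abortTask(taskContext: TaskAttemptContext): Unit = {}
      // A production version would also write the _SUCCESS marker in commitJob,
      // which is the part Josh says needed some care.
    }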

Re: randomSplit instead of a huge map reduce ?

2015-02-20 Thread Ashish Rangole
Is there a check you can put in place to not create pairs that aren't in your set of 20M pairs? Additionally, once you have your arrays converted to pairs you can do aggregateByKey with each pair being the key. On Feb 20, 2015 1:57 PM, shlomib shl...@summerhq.com wrote: Hi, I am new to Spark
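
A hedged sketch of that suggestion; arrays: RDD[Array[String]], allowedPairs: Set[(String, String)] and sc are assumptions standing in for the poster's data:

    import org.apache.spark.SparkContext._

    // Broadcast the allowed pairs so the filter is applied while pairs are generated,
    // then count per pair with aggregateByKey instead of a full group-by.
    val allowed = sc.broadcast(allowedPairs)

    val pairCounts = arrays
      .flatMap { words =>
        for {
          i <- words.indices
          j <- (i + 1) until words.length
          pair = (words(i), words(j))
          if allowed.value.contains(pair)
        } yield (pair, 1)
      }
      .aggregateByKey(0)(_ + _, _ + _)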

About FlumeUtils.createStream

2015-02-20 Thread bit1...@163.com
Hi, In the Spark Streaming application, I write the code FlumeUtils.createStream(ssc, "localhost", ...), which means Spark will listen on the port and wait for the Flume sink to write to it. My question is: when I submit the application to the Spark Standalone cluster, will the port be opened only
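
For reference, a sketch of the push-based receiver being discussed; the port number and storage level are illustrative and ssc is an existing StreamingContext. The receiver runs on a single executor, which is exactly why the question of where the port gets opened matters:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.flume.FlumeUtils

    // The Flume sink must be configured to push to whichever worker host ends up
    // running this receiver.
    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 41414, StorageLevel.MEMORY_AND_DISK_SER_2)
    flumeStream.map(event => new String(event.event.getBody.array())).print()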

Force RDD evaluation

2015-02-20 Thread pnpritchard
Is there a technique for forcing the evaluation of an RDD? I have used actions to do so but even the most basic count has a non-negligible cost (even on a cached RDD, repeated calls to count take time). My use case is for logging the execution time of the major components in my application. At
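
One low-cost option (a sketch, not the only technique): run a foreachPartition that consumes every element and discards it, which forces each partition to be computed and cached without aggregating anything back on the driver. someRdd stands in for the RDD under test:

    import org.apache.spark.rdd.RDD

    def forceEvaluation[T](rdd: RDD[T]): Unit = {
      // Touch every element so lazy pipelines are fully executed; nothing is returned.
      rdd.foreachPartition(iter => iter.foreach(_ => ()))
    }

    // Usage: materialize the cached RDD once, then time the later stages separately.
    val cached = someRdd.cache()
    forceEvaluation(cached)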

Shuffle Spill

2015-02-20 Thread Thomas Gerber
Hello, I have a few tasks in a stage with lots of tasks that have a large amount of shuffle spill. I scouted the web to understand shuffle spill, and I did not find any simple explanation of the spill mechanism. What I put together is: 1. the shuffle spill can happen when the shuffle is

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Mohammed Guller
Sean, I know that Class.forName is not required since Java 1.4 :-) It was just a desperate attempt to make sure that the Postgres driver is getting loaded. Since Class.forName("org.postgresql.Driver") is not throwing an exception, I assume that the driver is available in the classpath. Is that

Re: Spark Streaming and message ordering

2015-02-20 Thread Jörn Franke
You may also want to consider whether your use case really needs a very strict order, because configuring Spark to support such a strict order means rendering most of its benefits useless (failure handling, parallelism etc.). Usually, in a distributed setting you can order events, but this also means that

Re: Spark Streaming and message ordering

2015-02-20 Thread Neelesh
Thanks Jorn. Indeed, we do not need global ordering, since our data is partitioned well. We do not need ordering based on wallclock time, that would require waiting indefinitely. All we need is the execution of batches (not job submission) to happen in the same order they are generated, which

Re: Spark Streaming and message ordering

2015-02-20 Thread Neelesh
Thanks for the detailed response Cody. Our use case is to do some external lookups (cached and all) for every event, match the event against the looked up data, decide whether to write an entry in mysql and write it in the order in which the events arrived within a kafka partition. We don't need

Re: Spark Streaming and message ordering

2015-02-20 Thread Cody Koeninger
There is typically some slack between when a batch finishes executing and when the next batch is scheduled. You should be able to arrange your batch sizes / cluster resources to ensure that. If there isn't slack, your overall delay is going to keep increasing indefinitely. If you're inserting

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2015-02-20 Thread jwm
Has anyone found a solution to this? I was able to reproduce it here http://stackoverflow.com/questions/28576439/getting-nosuchmethoderror-when-setting-up-spark-graphx-graph but I'm unable to resolve it. -- View this message in context:

Saving Spark RDD to Avro with spark.api.python.Converter

2015-02-20 Thread daria
Hi! I am trying to persist RDD in Avro format with spark API. I wonder if someone has any experience or suggestions. My converter with example can be viewed here https://github.com/daria-sukhareva/spark/commit/2ba7b213572d6ce2056cfc2536b701ae689c7f98 and relevant question here

Re: Spark Performance on Yarn

2015-02-20 Thread lbierman
A bit more context on this issue, from the container logs on the executor. Given my cluster specs above, what would be appropriate parameters to pass in: --num-executors --num-cores --executor-memory I had tried it with --executor-memory 2500MB 2015-02-20 06:50:09,056 WARN

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster. -Sandy On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: None of this really points to the problem. These indicate

Re: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Sean Owen
Have a look at spark.yarn.user.classpath.first and spark.files.userClassPathFirst for a possible way to give your copy of the libs precedence. On Fri, Feb 20, 2015 at 5:20 PM, Mohammed Guller moham...@glassbeam.com wrote: Sean, I know that Class.forName is not required since Java 1.4 :-) It was

Re: Setting the number of executors in standalone mode

2015-02-20 Thread Yiannis Gkoufas
Hi Mohammed, thanks a lot for the reply. Ok, so from what I understand I cannot control the number of executors per worker in standalone cluster mode. Is that correct? BR On 20 February 2015 at 17:46, Mohammed Guller moham...@glassbeam.com wrote: SPARK_WORKER_MEMORY=8g Will allocate 8GB

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote: A

RE: Setting the number of executors in standalone mode

2015-02-20 Thread Mohammed Guller
SPARK_WORKER_MEMORY=8g Will allocate 8GB memory to Spark on each worker node. Nothing to do with # of executors. Mohammed From: Yiannis Gkoufas [mailto:johngou...@gmail.com] Sent: Friday, February 20, 2015 4:55 AM To: user@spark.apache.org Subject: Setting the number of executors in standalone

PySpark Cassandra forked

2015-02-20 Thread Rumph, Frens Jan
Hi all, Wanted to let you know I've forked PySpark Cassandra on https://github.com/TargetHolding/pyspark-cassandra. Unfortunately the original code didn't work for me and I couldn't figure out how it could work. But it inspired me! So I rewrote the majority of the project. The rewrite implements

Stopping a Custom Receiver

2015-02-20 Thread pnpritchard
Hi, I have a use case for creating a DStream from a single file. I have created a custom receiver that reads the file, calls 'store' with the contents, then calls 'stop'. However, I'm second guessing if this is the correct approach due to the spark logs I see. I always see these logs, and the
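
A sketch of a single-file receiver along the lines described (illustrative, not the poster's actual code): it reads the file on a background thread in onStart, stores the lines, then asks the receiver to stop, after which Spark may log receiver restart attempts, which appears to be the behaviour being asked about:

    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SingleFileReceiver(path: String)
      extends Receiver[String](StorageLevel.MEMORY_ONLY) {

      override def onStart(): Unit = {
        // One-shot read on its own thread so onStart returns immediately.
        new Thread("single-file-receiver") {
          override def run(): Unit = {
            val source = Source.fromFile(path)
            try {
              source.getLines().foreach(line => store(line))
            } finally {
              source.close()
            }
            stop("Finished reading " + path)
          }
        }.start()
      }

      override def onStop(): Unit = {}
    }

    // Usage sketch: val lines = ssc.receiverStream(new SingleFileReceiver("/tmp/input.txt"))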

RE: Setting the number of executors in standalone mode

2015-02-20 Thread Mohammed Guller
AFAIK, in standalone mode, each Spark application gets one executor on each worker. You could run multiple workers on a machine though. Mohammed From: Yiannis Gkoufas [mailto:johngou...@gmail.com] Sent: Friday, February 20, 2015 9:48 AM To: Mohammed Guller Cc: user@spark.apache.org Subject:

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Mohammed Guller
SPARK_CLASSPATH has been deprecated since 1.0. In any case, I tried it and it didn't work since it appends to the classpath. I need something that prepends to the classpath. Mohammed -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Friday, February 20, 2015 10:08 AM

Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Quickly reviewing the latest SQL Programming Guide https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md (in github) I had a couple of quick questions: 1) Do we need to instantiate the SparkContext as per // sc is an existing SparkContext. val sqlContext = new
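
For reference, a minimal sketch of the instantiation the first question refers to, as it appears in the 1.3-era docs; the app name and master are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("SqlContextExample").setMaster("local[2]"))
    // sc is an existing SparkContext.
    val sqlContext = new SQLContext(sc)
    // In Spark 1.3 the implicit RDD-to-DataFrame conversions come from:
    import sqlContext.implicits._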

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Mohammed Guller
It looks like spark.files.userClassPathFirst gives precedence to user libraries only on the worker nodes. Is there something similar to achieve the same behavior on the master? BTW, I am running Spark in stand-alone mode. Mohammed -Original Message- From: Sean Owen

Re: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Sean Owen
Hm, others can correct me if I'm wrong, but is this what SPARK_CLASSPATH is for? On Fri, Feb 20, 2015 at 6:04 PM, Mohammed Guller moham...@glassbeam.com wrote: It looks like spark.files.userClassPathFirst gives precedence to user libraries only on the worker nodes. Is there something similar

Re: Setting the number of executors in standalone mode

2015-02-20 Thread Kelvin Chu
Hi, Currently, there is only one executor per worker. There is a JIRA ticket to relax this: https://issues.apache.org/jira/browse/SPARK-1706 But if you want to use more cores, maybe you can try increasing SPARK_WORKER_INSTANCES; it increases the number of workers per machine. Take a look here:
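
A hedged spark-env.sh sketch of that approach, with illustrative values; each worker instance then supplies its own executor to a given application:

    # conf/spark-env.sh on each worker machine (values are illustrative)
    SPARK_WORKER_INSTANCES=2   # run two worker daemons per machine
    SPARK_WORKER_CORES=4       # cores available to each worker instance
    SPARK_WORKER_MEMORY=4g     # memory available to each worker instance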

Re: output worker stdout to one place

2015-02-20 Thread Marcelo Vanzin
Hi Anny, You could play with creating your own log4j.properties that will write the output somewhere else (e.g. to some remote mount, or remote syslog). Sorry, but I don't have an example handy. Alternatively, if you can use Yarn, it will collect all logs after the job is finished and make them

output worker stdout to one place

2015-02-20 Thread anny9699
Hi, I am wondering if there's some way that could lead some of the worker stdout to one place instead of in each worker's stdout. For example, I have the following code: RDD.foreach{ line => try { do something } catch { case e: Exception => println(line) } } Every time I want to check what's causing

Use Spark Streaming for Batch?

2015-02-20 Thread craigv
We have a sophisticated Spark Streaming application that we have been using successfully in production for over a year to process a time series of events. Our application makes novel use of updateStateByKey() for state management. We now have the need to perform exactly the same processing on

using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread chirag lakhani
I am trying to access a hive table using spark sql but I am having trouble. I followed the instructions in a cloudera community board which stated 1) Import hive jars into the class path export SPARK_CLASSPATH=$(find /data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/ -name

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my JVM application? If it is not, what is the correct relationship? Any other variables or config parameters

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Hi Kelvin, spark.executor.memory controls the size of the executor heaps. spark.yarn.executor.memoryOverhead is the amount of memory to request from YARN beyond the heap size. This accounts for the fact that JVMs use some non-heap memory. The Spark heap is divided into
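
A worked example of that relationship, with illustrative numbers (check the default of spark.yarn.executor.memoryOverhead for your Spark version rather than relying on these figures):

    spark.executor.memory              = 4096 MB   (executor JVM heap, i.e. -Xmx)
    spark.yarn.executor.memoryOverhead =  512 MB   (room for the JVM's non-heap usage)
    --------------------------------------------
    YARN container request             = 4608 MB   (what the ResourceManager must grant)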

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread Sourigna Phetsarath
Chirag, This worked for us: spark-submit --master yarn-cluster --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' ... Let me know, if you have any issues. On Fri, Feb 20, 2015 at 2:43

Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Michael Armbrust
Yeah, sorry. The programming guide has not been updated for 1.3. I'm hoping to get to that this weekend / next week. On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote: Quickly reviewing the latest SQL Programming Guide

Re: using hivecontext with sparksql on cdh 5.3

2015-02-20 Thread chirag lakhani
I tried spark-shell --master yarn-cluster --driver-class-path '/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*' and I get the following error Error: Cluster