Re: Question about Spark best practice when counting records.

2015-02-27 Thread Kostas Sakellis
Hey Darin, Record count metrics are coming in Spark 1.3. Can you wait until it is released? Or do you need a solution in older versions of Spark? Kostas On Friday, February 27, 2015, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: I have a fairly large Spark job where I'm essentially
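A minimal sketch of the pre-1.3 workaround (the paths and the per-record transformation are illustrative, not from the thread): count records with an accumulator inside a transformation, then read the total back on the driver after an action has run.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("record-count"))

    // Long accumulator, incremented on the executors as each record is processed.
    val recordCount = sc.accumulator(0L, "records")

    val processed = sc.textFile("hdfs:///input/path").map { line =>
      recordCount += 1L
      line.toUpperCase // stand-in for the real per-record work
    }

    // Accumulator values are only meaningful after an action has completed.
    processed.saveAsTextFile("hdfs:///output/path")
    println(s"Processed ${recordCount.value} records")

One caveat with this approach: accumulator updates made inside transformations are not exactly-once if tasks are retried, so under failures the count should be treated as approximate.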

Re: textFile partitions

2015-02-09 Thread Kostas Sakellis
The partitions parameter to textFile is the minPartitions, so there will be at least that level of parallelism. Spark delegates to Hadoop to create the splits for that file (yes, even for a text file on local disk and not HDFS). You can take a look at the code in FileInputFormat - but briefly it will
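For example (the path and numbers are illustrative), you can request a minimum and then check how many partitions the Hadoop splits actually produced:

    // In spark-shell, where sc is predefined.
    // Ask for at least 8 partitions; FileInputFormat computes the actual
    // splits, so the result can be higher than the minimum requested.
    val rdd = sc.textFile("/data/big-file.txt", minPartitions = 8)

    println(s"Actual partitions: ${rdd.partitions.length}")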

Re: spark driver behind firewall

2015-02-05 Thread Kostas Sakellis
Yes, the driver has to be able to accept incoming connections. All the executors connect back to the driver, sending heartbeats, map statuses, and metrics. It is critical, and I don't know of a way around it. You could look into using something like the https://github.com/spark-jobserver/spark-jobserver
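If the firewall can be configured rather than bypassed, one sketch of a workaround (the hostname and port numbers are placeholders, not from the thread; these properties are the Spark 1.x port-configuration options) is to pin the driver's normally random listening ports to fixed values and open just those:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fixed-driver-ports")
      // Address the executors use to connect back to the driver.
      .set("spark.driver.host", "driver.example.com")
      // Pin the normally random ports so firewall rules can allow them.
      .set("spark.driver.port", "51000")
      .set("spark.blockManager.port", "51001")
      .set("spark.fileserver.port", "51002")

    val sc = new SparkContext(conf)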

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
Which Spark Job Server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job Server be used for profiling Spark jobs?

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
, 2015 at 9:03 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com wrote: Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep

Re: How many stages in my application?

2015-02-05 Thread Kostas Sakellis
Yes, there is no way right now to know how many stages a job will generate automatically. Like Mark said, RDD#toDebugString will give you some info about the RDD DAG, and from that you can determine, based on the dependency types (wide vs. narrow), whether there is a stage boundary. On Thu, Feb 5, 2015
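For instance (a minimal sketch with a made-up input path), the shuffle introduced by reduceByKey shows up as a new indentation level in the toDebugString output, and each such level marks a stage boundary:

    // In spark-shell, where sc is predefined.
    val counts = sc.textFile("/data/words.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // wide (shuffle) dependency => stage boundary

    // Indentation levels in this output correspond to stage boundaries.
    println(counts.toDebugString)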

Re: Whether standalone spark support kerberos?

2015-02-05 Thread Kostas Sakellis
Standalone mode does not support talking to a Kerberized HDFS. If you want to talk to a Kerberized (secure) HDFS cluster, I suggest you use Spark on YARN. On Wed, Feb 4, 2015 at 2:29 AM, Jander g jande...@gmail.com wrote: Hope someone helps me. Thanks. On Wed, Feb 4, 2015 at 6:14 PM, Jander g
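A sketch of that path (the principal, keytab, class, and jar names are placeholders): authenticate to Kerberos first, then submit through YARN, which obtains and ships the HDFS delegation tokens for the job:

    # Obtain a Kerberos ticket for the submitting user.
    kinit -kt /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM

    # Submit through YARN; standalone mode cannot authenticate to secure HDFS.
    spark-submit \
      --master yarn-cluster \
      --class com.example.MyApp \
      my-app.jar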

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread Kostas Sakellis
Kundan, So I think your configuration here is incorrect; we need to adjust the memory and the number of executors. Your cluster setup is 5 nodes, each with 16GB RAM and 8 cores. The number of executors should be the total number of nodes in your cluster - in your case, 5. As for --executor-cores, it should
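A sketch of that sizing for the cluster above (the exact numbers are illustrative, not from the thread; leave headroom on each node for the OS and the YARN daemons):

    # One executor per node; reserve ~2 cores and a few GB per node for
    # the OS, YARN daemons, and container memory overhead.
    spark-submit \
      --master yarn-cluster \
      --num-executors 5 \
      --executor-cores 6 \
      --executor-memory 10g \
      --class com.example.MyApp \
      my-app.jar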

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Kostas Sakellis
Hey, if you are interested in more details, there is also a thread about this issue here: http://apache-spark-developers-list.1001551.n3.nabble.com/Eliminate-copy-while-sending-data-any-Akka-experts-here-td7127.html Kostas On Tue, Sep 9, 2014 at 3:01 PM, jbeynon jbey...@gmail.com wrote: Thanks