Re: Connection pooling in spark jobs

2015-04-03 Thread Charles Feduke
across jobs On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com wrote: How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running

Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Charles Feduke
You could also try setting your `nofile` value in /etc/security/limits.conf for `soft` to some ridiculously high value if you haven't done so already. On Fri, Apr 3, 2015 at 2:09 AM Akhil Das ak...@sigmoidanalytics.com wrote: Did you try these? - Disable shuffle : spark.shuffle.spill=false -
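
For reference, a minimal /etc/security/limits.conf entry along those lines might look like the following (the user name and the limit value are placeholders to adjust for your environment):

    sparkuser  soft  nofile  65536
    sparkuser  hard  nofile  65536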

Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Charles Feduke
As Akhil says Ubuntu is a good choice if you're starting from near scratch. Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and other big data tools so you can get a cluster running with very little effort. Keep in mind Cloudera is a for-profit corporation so they are also

Re: Connection pooling in spark jobs

2015-04-02 Thread Charles Feduke
How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single
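
For reference, the usual lightweight alternative to a pooling library is one connection per partition, opened and closed inside foreachPartition rather than per record or per job. A minimal sketch, assuming a plain JDBC driver on the executor classpath (the URL, credentials, and insert statement are hypothetical):

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}

    object PerPartitionConnectionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("per-partition-conn").setMaster("local[*]"))
        val values = sc.parallelize(1 to 1000)
        values.foreachPartition { iter =>
          // one connection per partition, not one per record and not one shared pool per job
          val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "secret")
          val stmt = conn.prepareStatement("INSERT INTO metrics (value) VALUES (?)")
          try {
            iter.foreach { v =>
              stmt.setInt(1, v)
              stmt.executeUpdate()
            }
          } finally {
            stmt.close()
            conn.close()
          }
        }
        sc.stop()
      }
    }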

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Charles Feduke
Assuming you are on Linux, what is your /etc/security/limits.conf set for nofile/soft (number of open file handles)? On Fri, Mar 20, 2015 at 3:29 PM Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I try to run a simple sort by on 1.2.1. And it always give me below two errors: 1,

Re: Writing Spark Streaming Programs

2015-03-19 Thread Charles Feduke
Scala is the language used to write Spark so there's never a situation in which features introduced in a newer version of Spark cannot be taken advantage of if you write your code in Scala. (This is mostly true of Java, but it may take a little more legwork if a Java-friendly adapter isn't available

Re: Spark History server default conf values

2015-03-10 Thread Charles Feduke
What I found from a quick search of the Spark source code (from my local snapshot on January 25, 2015): // Interval between each check for event log updates private val UPDATE_INTERVAL_MS = conf.getInt("spark.history.fs.updateInterval", conf.getInt("spark.history.updateInterval", 10)) * 1000
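
For anyone wanting to change that interval, the setting is normally passed to the history server via SPARK_HISTORY_OPTS; a sketch, assuming Spark 1.x where the value is read as whole seconds (per the getInt(...) * 1000 above):

    # conf/spark-env.sh on the machine running the history server
    export SPARK_HISTORY_OPTS="-Dspark.history.fs.updateInterval=30"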

Re: Spark on EC2

2015-02-24 Thread Charles Feduke
This should help you understand the cost of running a Spark cluster for a short period of time: http://www.ec2instances.info/ If you run an instance for even 1 second of a single hour you are charged for that complete hour. So before you shut down your miniature cluster make sure you really are

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Charles Feduke
file that resides on HDFS, so that it will be available to my driver program wherever that program runs. -- Emre On Mon, Feb 16, 2015 at 4:41 PM, Charles Feduke charles.fed...@gmail.com wrote: I haven't actually tried mixing non-Spark settings into the Spark properties. Instead I package

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
I cannot comment about the correctness of Python code. I will assume your caper_kv is keyed on something that uniquely identifies all the rows that make up the person's record so your group by key makes sense, as does the map. (I will also assume all of the rows that comprise a single person's
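
In Spark's Scala API the shape being described is roughly the following; a minimal sketch in which the key and row contents are made-up stand-ins for the caper_kv data:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD implicits (needed through Spark 1.2)

    object GroupRecordsByPerson {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("group-by-person").setMaster("local[*]"))
        // (personId, rowText) pairs, keyed on something that uniquely identifies a person
        val caperKv = sc.parallelize(Seq(("p1", "row a"), ("p1", "row b"), ("p2", "row c")))
        val perPerson = caperKv
          .groupByKey()                                           // all rows for one person together
          .map { case (id, rows) => (id, rows.mkString(" | ")) }  // assemble the complete record
        perPerson.collect().foreach(println)
        sc.stop()
      }
    }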

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
there's much to be gained in moving the data from MySQL to Spark first. I have yet to find any non-trivial examples of ETL logic on the web ... it seems like it's mostly word count map-reduce replacements. On 02/16/2015 01:32 PM, Charles Feduke wrote: I cannot comment about the correctness

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Charles Feduke
I haven't actually tried mixing non-Spark settings into the Spark properties. Instead I package my properties into the jar and use the Typesafe Config[1] - v1.2.1 - library (along with Ficus[2] - Scala specific) to get at my properties: Properties file: src/main/resources/integration.conf (below
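
A minimal sketch of that approach, assuming only the Typesafe Config library (the resource file and keys shown are placeholders):

    import com.typesafe.config.ConfigFactory

    object AppSettings {
      // Loads application.conf / reference.conf from the classpath by default; an
      // alternate classpath resource such as integration.conf can be selected with
      // -Dconfig.resource=integration.conf on the JVM command line.
      private val config = ConfigFactory.load()

      val dbUrl: String  = config.getString("myapp.db.url")   // hypothetical keys
      val batchSize: Int = config.getInt("myapp.batch.size")
    }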

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
A central location, such as NFS? If they are temporary for the purpose of further job processing you'll want to keep them local to the node in the cluster, i.e., in /tmp. If they are centralized you won't be able to take advantage of data locality and the central file store will become a

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
that route, since that's the performance advantage Spark has over vanilla Hadoop. On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein tjkl...@gmail.com wrote: Thanks for the info. The file system in use is a Lustre file system. Best, Tassilo On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke charles.fed

Re: Parsing CSV files in Spark

2015-02-06 Thread Charles Feduke
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a merged CSV (instead of the various part-n files). You might find the following links useful: - This article is about combining the part files and outputting a header as the first line in the merged results:
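
The simplest variant of the merge-with-header idea - only safe when the result fits on the driver - looks roughly like this; a minimal sketch with a made-up record type (the linked article covers merging the part files properly):

    import java.io.PrintWriter
    import org.apache.spark.{SparkConf, SparkContext}

    object MergedCsvSketch {
      case class Person(name: String, age: Int) // hypothetical record type

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("merged-csv").setMaster("local[*]"))
        val people = sc.parallelize(Seq(Person("alice", 34), Person("bob", 29)))
        // render each record as one CSV line, collect the (small) result to the driver,
        // and write a single file with a header as the first line
        val lines = people.map(p => s"${p.name},${p.age}").collect()
        val out = new PrintWriter("people.csv")
        try {
          out.println("name,age")
          lines.foreach(line => out.println(line))
        } finally out.close()
        sc.stop()
      }
    }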

Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Charles Feduke
Good questions, some of which I'd like to know the answer to. Is it okay to update a NoSQL DB with aggregated counts per batch interval or is it generally stored in hdfs? This depends on how you are going to use the aggregate data. 1. Is there a lot of data? If so, and you are going to use

Re: How do I set spark.local.dirs?

2015-02-06 Thread Charles Feduke
Did you restart the slaves so they would read the settings? You don't need to start/stop the EC2 cluster, just the slaves. From the master node: $SPARK_HOME/sbin/stop-slaves.sh $SPARK_HOME/sbin/start-slaves.sh ($SPARK_HOME is probably /root/spark) On Fri Feb 06 2015 at 10:31:18 AM Joe Wass

Re: spark on ec2

2015-02-05 Thread Charles Feduke
I don't see anything that says you must explicitly restart them to load the new settings, but usually there is some sort of signal trapped [or brute force full restart] to get a configuration reload for most daemons. I'd take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on

Re: How to design a long live spark application

2015-02-05 Thread Charles Feduke
If you want to design something like Spark shell have a look at: http://zeppelin-project.org/ It's open source and may already do what you need. If not, its source code will be helpful in answering your questions about how to integrate with long-running jobs. On Thu Feb 05 2015 at

Re: Writing RDD to a csv file

2015-02-03 Thread Charles Feduke
In case anyone needs to merge all of their part-n files (small result set only) into a single *.csv file or needs to generically flatten case classes, tuples, etc., into comma separated values: http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/ On Tue Feb 03 2015 at 8:23:59 AM

Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
You'll still need to: import org.apache.spark.SparkContext._ Importing org.apache.spark._ does _not_ recurse into sub-objects or sub-packages, it only brings in whatever is at the level of the package or object imported. SparkContext._ has some implicits, one of them for adding groupByKey to an
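
For completeness, a minimal example of the import in context (the data is throwaway):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // brings in the implicit that adds groupByKey

    object GroupByKeyImportExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("group-by-key").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
        pairs.groupByKey().collect().foreach(println) // compiles only with the import above (through Spark 1.2)
        sc.stop()
      }
    }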

Re: Serialized task result size exceeded

2015-01-30 Thread Charles Feduke
Are you using the default Java object serialization, or have you tried Kryo yet? If you haven't tried Kryo please do and let me know how much it impacts the serialization size. (I know it's more efficient; I'm curious to know how much more efficient, and I'm being lazy - I don't have ~6K 500MB
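
For reference, switching to Kryo is a two-line change on the SparkConf; a minimal sketch (the registered classes are placeholders - register whatever types dominate your data):

    import org.apache.spark.{SparkConf, SparkContext}

    object KryoConfigSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("kryo-example")
          .setMaster("local[*]")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .registerKryoClasses(Array(classOf[Array[Double]], classOf[Array[String]])) // Spark 1.2+
        val sc = new SparkContext(conf)
        sc.stop()
      }
    }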

Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
Define not working. Not compiling? If so you need: import org.apache.spark.SparkContext._ On Fri Jan 30 2015 at 3:21:45 PM Amit Behera amit.bd...@gmail.com wrote: hi all, my sbt file is like this: name := "Spark" version := "1.0" scalaVersion := "2.10.4" libraryDependencies +=

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
I deal with problems like this so often across Java applications with large dependency trees. Add the shell function at the following link to your shell on the machine where your Spark Streaming is installed: https://gist.github.com/cfeduke/fe63b12ab07f87e76b38 Then run in the directory where

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
as bash. Nick On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke charles.fed...@gmail.com wrote: It was only hanging when I specified the path with ~; I never tried relative. It hung while waiting for ssh to be ready on all hosts. I let it sit for about 10 minutes then I found

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
that for Spark 1.2.0 https://issues.apache.org/jira/browse/SPARK-4137. Maybe there’s some case that we missed? Nick On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke charles.fed...@gmail.com wrote: Absolute path means no ~ and also verify that you have the path to the file correct. For some reason

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
: This is what I get: ./bigcontent-1.0-SNAPSHOT.jar:org/apache/http/impl/conn/SchemeRegistryFactory.class (probably because I'm using a self-contained JAR). In other words, I'm still stuck. -- Emre On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke charles.fed...@gmail.com wrote: I

Re: Spark and S3 server side encryption

2015-01-28 Thread Charles Feduke
I have been trying to work around a similar problem with my Typesafe config *.conf files seemingly not appearing on the executors. (Though now that I think about it, it's not because the files are absent in the JAR, but because the -Dconf.resource system property I pass to the master obviously

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
1.2.0 binary (pre-built for Hadoop 2.4 and later). Or maybe I'm totally wrong, and the problem / fix is something completely different? -- Emre On Wed, Jan 28, 2015 at 4:58 PM, Charles Feduke charles.fed...@gmail.com wrote: It looks like you're shading in the Apache HTTP commons library

Re: spark 1.2 ec2 launch script hang

2015-01-27 Thread Charles Feduke
Absolute path means no ~ and also verify that you have the path to the file correct. For some reason the Python code does not validate that the file exists and will hang (this is the same reason why ~ hangs). On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick pzybr...@gmail.com wrote: Try using an

Re: HW imbalance

2015-01-26 Thread Charles Feduke
You should look at using Mesos. This should abstract away the individual hosts into a pool of resources and make the different physical specifications manageable. I haven't tried configuring Spark Standalone mode to have different specs on different machines but based on spark-env.sh.template: #
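
For reference, the per-machine knobs in spark-env.sh.template that cover this are SPARK_WORKER_CORES and SPARK_WORKER_MEMORY; the values below are illustrative and would be set differently in conf/spark-env.sh on each differently sized worker:

    # conf/spark-env.sh on a larger worker
    SPARK_WORKER_CORES=16
    SPARK_WORKER_MEMORY=48g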

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Charles Feduke
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015
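
For comparison, a typical spark-ec2 launch invocation looks something like this (the key pair, region, slave count, and cluster name are placeholders):

    ./spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem \
      --region=us-west-1 --spark-version=1.2.0 --slaves=2 launch test-cluster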

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
.) Because of the sub-range bucketing and cluster distribution you shouldn't run into OOM errors, assuming you provision sufficient worker nodes in the cluster. On Sun Jan 25 2015 at 9:39:56 AM Charles Feduke charles.fed...@gmail.com wrote: I'm facing a similar problem except my data is already

Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Charles Feduke
I think you want to instead use `.saveAsSequenceFile` to save an RDD to someplace like HDFS or NFS if you are attempting to interoperate with another system, such as Hadoop. `.persist` is for keeping the contents of an RDD around so future uses of that particular RDD don't need to recalculate its
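
To make the distinction concrete, a minimal sketch (the output path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // implicits for saveAsSequenceFile on pair RDDs
    import org.apache.spark.storage.StorageLevel

    object PersistVsSaveSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-vs-save").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))

        // persist: keep the RDD around for reuse within this application only
        pairs.persist(StorageLevel.DISK_ONLY)

        // saveAsSequenceFile: write the data out so other systems (e.g. Hadoop jobs) can read it
        pairs.saveAsSequenceFile("/tmp/pairs-seq") // hypothetical path; could be an hdfs:// URI
        sc.stop()
      }
    }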

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
I'm facing a similar problem except my data is already pre-sharded in PostgreSQL. I'm going to attempt to solve it like this: - Submit the shard names (database names) across the Spark cluster as a text file and partition it so workers get 0 or more - hopefully 1 - shard name. In this case you
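
A minimal sketch of that shard-name-distribution idea, with hypothetical shard names, JDBC URL, and query (in practice you would also page through sub-ranges of each shard's key space rather than pull a whole shard at once):

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}

    object ShardedJdbcReadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sharded-jdbc"))
        val shards = Seq("shard_01", "shard_02", "shard_03") // one database name per shard

        // one partition per shard so each worker handles whole shards
        val rows = sc.parallelize(shards, shards.size).flatMap { shard =>
          val conn = DriverManager.getConnection(s"jdbc:postgresql://db-host/$shard", "user", "secret")
          try {
            val rs = conn.createStatement().executeQuery("SELECT id, total FROM orders")
            val buf = scala.collection.mutable.ArrayBuffer.empty[(Long, Double)]
            while (rs.next()) buf += ((rs.getLong("id"), rs.getDouble("total")))
            buf
          } finally conn.close()
        }
        println(rows.count())
        sc.stop()
      }
    }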

JDBC sharded solution

2015-01-24 Thread Charles Feduke
I'm trying to figure out the best approach to getting sharded data from PostgreSQL into Spark. Our production PGSQL cluster has 12 shards with TiB of data on each shard. (I won't be accessing all of the data on a shard at once, but I don't think it's feasible to use Sqoop to copy tables whose data