Re: spark-shell running out of memory even with 6GB?

2017-01-09 Thread Kevin Burton
…of jobs, the logged SparkEvents of which stick around in order for the UI to render. There are some options under `spark.ui.retained*` to limit that if it's a problem. On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton <bur...@spinn3r.com> wrote: We've had…
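A hedged sketch of how those retention limits could be applied; the property names are the standard `spark.ui.retained*` settings mentioned above, but the values are examples only, not recommendations:

```scala
import org.apache.spark.SparkConf

// Illustrative values only: cap how many completed jobs/stages the UI keeps
// in driver memory (the defaults retain considerably more).
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
```

The same settings can also be passed to spark-shell on the command line via `--conf`.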

spark-shell running out of memory even with 6GB?

2017-01-09 Thread Kevin Burton
We've had various OOM issues with Spark and have been trying to track them down one by one. Now we have one in spark-shell, which is super surprising. We currently allocate 6GB to the spark shell, as confirmed via 'ps'. Why the heck would the *shell* need that much memory? I'm going to try to give…
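A quick way to sanity-check, from inside the shell itself, how much heap the driver JVM actually received (independent of what 'ps' reports for the process); this is a plain JVM call, not a Spark API:

```scala
// Max heap the driver JVM will use, reported in GB.
val maxHeapGb = Runtime.getRuntime.maxMemory.toDouble / (1024 * 1024 * 1024)
println(f"driver max heap: $maxHeapGb%.2f GB")
```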

OutOfMemoryError while running job...

2016-12-06 Thread Kevin Burton
I am trying to run a Spark job which reads from ElasticSearch and should write its output back to a separate ElasticSearch index. Unfortunately I keep getting `java.lang.OutOfMemoryError: Java heap space` exceptions. I've tried running it with: --conf spark.memory.offHeap.enabled=true --conf…
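For context, a hedged sketch of the off-heap settings: `spark.memory.offHeap.enabled` only has an effect when `spark.memory.offHeap.size` is also given. The size below is an example value, not a recommendation:

```scala
import org.apache.spark.SparkConf

// Illustrative sketch: off-heap memory needs both the flag and an explicit
// size (in bytes); turning on the flag without a size has no effect.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString) // 2 GB, example only
```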

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-19 Thread Kevin Burton
…a few questions on this. Does that only not work with write.json()? I just wonder if write.text, csv or another API does not work as well, and whether it is a JSON-specific issue. Also, does that work with small data? I want to make sure this happens only on large data. Thanks!

take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Kevin Burton
I'm seeing some weird behavior and wanted some feedback. I have a fairly large, multi-hour job that operates over about 5TB of data. It builds it out into a ranked category index of about 25000 categories sorted by rank, descending. I want to write this to a file, but it's not actually writing…
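A minimal sketch of the two calls being compared (the input source and column names are hypothetical): take() only pulls a handful of rows back to the driver, while write.json() runs the full job on the executors, so the two exercise very different code paths on a 5TB input:

```scala
import spark.implicits._

val ranked = spark.read.parquet("/path/to/input")   // hypothetical input
  .groupBy("category").count()
  .orderBy($"count".desc)

ranked.take(10).foreach(println)                        // cheap: small sample on the driver
ranked.write.mode("overwrite").json("/path/to/output")  // full distributed write
```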

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
…use --conf to set this on the command line if using the shell. On Tue, Sep 13, 2016, 19:22 Kevin Burton <bur...@spinn3r.com> wrote: The problem is that without a new spark context, with a custom conf, elasticsearch-hadoop is refusing to read in se…

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
…On 13 September 2016 at 18:57, Sean Owen <so...@cloudera.com> wrote: …

Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
I'm rather confused here as to what to do about creating a new SparkContext. Spark 2.0 prevents it (exception included below), yet a TON of examples I've seen basically tell you to create a new SparkContext as standard practice: …
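A hedged sketch of the usual Spark 2.x pattern: rather than constructing a second SparkContext, build (or reuse) a SparkSession and pull the context from it. The `es.nodes` entry is a hypothetical example of a custom setting; as the reply above notes, settings that must exist before the shell's context is created generally have to go on the command line via `--conf`:

```scala
import org.apache.spark.sql.SparkSession

// In a standalone app, extra configuration is attached when the session is built.
val spark = SparkSession.builder()
  .appName("example")
  .config("es.nodes", "localhost:9200")   // hypothetical custom setting
  .getOrCreate()

val sc = spark.sparkContext
```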

"Too many elements to create a power set" on Elasticsearch

2016-09-11 Thread Kevin Burton
1.6.1 and 1.6.2 don't work on our Elasticsearch setup because we use daily indexes. We get the error: "Too many elements to create a power set". It works on SINGLE indexes, but if I specify content_* then I get this error. I don't see this documented anywhere. Is this a known issue? Is there…

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
…Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote: Ah, might actually. I'll have to mess around with that. On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote: Would `topByKey` help? https://github.c…

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
…scala#L42 Best, Karl. On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote: I'm trying to figure out a way to group by and return the top 100 records in that group. Something like: SELECT TOP(…

Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
I'm trying to figure out a way to group by and return the top 100 records in that group. Something like: SELECT TOP(100, user_id) FROM posts GROUP BY user_id; But I can't really figure out the best way to do this... There are FIRST and LAST aggregate functions, but these only return one column.
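One common approach is a window function (a sketch; `posts` and the column names are hypothetical): number the rows within each user_id group by descending score, then keep the first 100:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows inside each user_id partition, then filter to the top 100.
val w = Window.partitionBy("user_id").orderBy(col("score").desc)

val top100 = posts
  .withColumn("rn", row_number().over(w))
  .where(col("rn") <= 100)
  .drop("rn")
```

The `topByKey` helper mentioned in the replies is an RDD-level alternative from MLlib's pairRDD extensions.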

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
…wrote: Which Scala version is Spark built against? I'd guess it's 2.10 since you're using spark-1.6, and you're using the 2.11 jar for es-hadoop. On Thu, 2 Jun 2016 at 15:50 Kevin Burton <bur...@spinn3r.com> wrote: Thanks. …

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
…perhaps not with recent Spark versions). On Thu, 2 Jun 2016 at 15:34 Kevin Burton <bur...@spinn3r.com> wrote: I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's not super easy. I wish there was an easier way to get this…

Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
I'm trying to get Spark 1.6.1 to work with Elasticsearch 2.3.2... needless to say it's not super easy. I wish there was an easier way to get this stuff to work. Last time I tried to use Spark more, I was having similar problems with classpath setup and Cassandra. Seems like a huge opportunity to make this easier…
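The classpath mismatch the replies point at is usually a Scala binary-version conflict. A build.sbt sketch (versions and artifact names are assumptions, adjust to the actual distribution): the es-hadoop artifact has to use the same Scala version the Spark build was compiled against, which is 2.10 for the stock Spark 1.6 downloads:

```scala
scalaVersion := "2.10.6"

// %% appends the Scala binary version, keeping Spark and es-hadoop in agreement.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"          % "1.6.1" % "provided",
  "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2"
)
```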

Best way to bring up Spark with Cassandra (and Elasticsearch) in production.

2016-02-14 Thread Kevin Burton
Afternoon. About 6 months ago I tried (and failed) to get Spark and Cassandra working together in production due to dependency hell. I'm going to give it another try! Here's my general strategy. I'm going to create a maven module for my code... with spark dependencies. Then I'm going to get

Does spark *always* fork its workers?

2015-02-18 Thread Kevin Burton
I want to map over a Cassandra table in Spark, but my code that executes needs a shutdown() call to return any threads, release file handles, etc. Will Spark always execute my mappers as a forked process? And if so, how do I handle threads preventing the JVM from terminating? It would be nice if…

saveAsObjectFile is actually saveAsSequenceFile

2015-01-13 Thread Kevin Burton
This is interesting. I’m using ObjectInputStream to try to read a file written with saveAsObjectFile… but it’s not working. The documentation says: “Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().” … but…

saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Kevin Burton
This is almost funny. I want to dump a computation to the filesystem. It’s just the result of a Spark SQL call reading the data from Cassandra. The problem is that it looks like it’s just calling toString() which is useless. The example is below. I assume this is just a (bad) bug.
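A sketch of the usual workaround, assuming the DataFrame-era API (with the 1.2-era SchemaRDD the same map applies directly, since it is an RDD[Row]): saveAsTextFile just calls toString on each element, so the rows have to be formatted explicitly first. The query is hypothetical:

```scala
val result = sqlContext.sql("SELECT id, score FROM my_table")

result.rdd
  .map(row => row.mkString("\t"))   // Row.mkString joins the column values with a separator
  .saveAsTextFile("/path/to/output")
```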

Re: saveAsObjectFile is actually saveAsSequenceFile

2015-01-13 Thread Kevin Burton
the objectFile javadoc says. It is expecting a SequenceFile with NullWritable keys and BytesWritable values containing the serialized values. This looks correct to me. On Tue, Jan 13, 2015 at 8:39 AM, Kevin Burton bur...@spinn3r.com wrote: This is interesting. I’m using ObjectInputStream to try
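As the reply says, the data is a SequenceFile of serialized blocks, so it is read back through SparkContext.objectFile rather than a raw ObjectInputStream. A minimal sketch; the path and element type are illustrative:

```scala
// Read an RDD previously written with rdd.saveAsObjectFile(...)
val restored = sc.objectFile[(String, Int)]("/path/to/object-file")
println(restored.count())
```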

quickly counting the number of rows in a partition?

2015-01-12 Thread Kevin Burton
Is there a way to compute the total number of records in each RDD partition? So say I had 4 partitions, I’d want to have: partition 0: 100 records, partition 1: 104 records, partition 2: 90 records, partition 3: 140 records.
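A sketch of one way to do this (`rdd` stands in for any existing RDD): count the records per partition without pulling the data itself back to the driver:

```scala
// Emit one (partitionIndex, count) pair per partition, then collect the small result.
val countsPerPartition = rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
  .sortBy(_._1)

countsPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n records") }
```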

status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Kevin Burton
I’m curious what the status is of implementing Hive analytics functions in Spark. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Many of these seem missing. I’m assuming they’re not implemented yet? Is there an ETA on them? Or am I the first to bring this…

Rank for SQL and ORDER BY?

2015-01-09 Thread Kevin Burton
I’m trying to do a simple graph sort in Spark, which I mostly have working. The one problem I have now is that I need to order the items and then assign a rank position. So the top item should have rank 0, the next one should have rank 1, etc. Hive and Pig support this with the RANK operator. I…
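A sketch of one RDD-level way to get a rank without the SQL RANK operator (the RDD and tuple shape are hypothetical): sort by score descending, then attach a 0-based position with zipWithIndex:

```scala
// 'scores' is assumed to be an RDD[(String, Double)] of (item, score).
val ranked = scores
  .sortBy(_._2, ascending = false)
  .zipWithIndex()                                        // ((item, score), rank)
  .map { case ((item, score), rank) => (rank, item, score) }
```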

Re: spark ignoring all memory settings and defaulting to 512MB?

2015-01-01 Thread Kevin Burton
the system properties with -D too if you need to do so directly. You don't have to change your app. Executor memory does not have to be set this way but you could. On Jan 1, 2015 6:36 AM, Kevin Burton bur...@spinn3r.com wrote: This is really weird and I’m surprised no one has found this issue yet

limit vs sample for indexing a small amount of data quickly?

2014-12-31 Thread Kevin Burton
Is there a limit function which just returns the first N records? Sample is nice, but I’m trying to do this so it’s super fast and just to test the functionality of an algorithm. With sample I’d have to compute the % that would yield 1000 results first…
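A small sketch (`bigRdd` is a stand-in): take(N) returns the first N elements without scanning the whole dataset, which is usually enough for a quick functional test; wrap the result back into an RDD if the downstream code expects one:

```scala
val firstThousand = bigRdd.take(1000)          // Array of the first 1000 records
val smallRdd = sc.parallelize(firstThousand)   // small RDD for testing the algorithm
```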

spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Kevin Burton
This is really weird and I’m surprised no one has found this issue yet. I’ve spent about an hour or more trying to debug this :-( My spark install is ignoring ALL my memory settings. And of course my job is running out of memory. The default is 512MB so pretty darn small. The worker and
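For reference, a hedged sketch of setting executor memory programmatically (the value is an example only). Driver memory generally has to be in place before the driver JVM starts, i.e. via spark-defaults.conf or the launch command, not from inside the app:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Executor memory must be set before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.memory", "4g")   // example value only

val sc = new SparkContext(conf)
```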

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Kevin Burton
to clean it up but I don’t know where to begin. On Wed, Dec 31, 2014 at 10:35 PM, Kevin Burton bur...@spinn3r.com wrote: This is really weird and I’m surprised no one has found this issue yet. I’ve spent about an hour or more trying to debug this :-( My spark install is ignoring ALL my memory

init / shutdown for complex map job?

2014-12-27 Thread Kevin Burton
I have a job where I want to map over all data in a Cassandra database. I’m then selectively sending things to my own external system (ActiveMQ) if the item matches criteria. The problem is that I need to do some init and shutdown. Basically on init I need to create ActiveMQ connections and on…
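A sketch of the usual per-partition init/shutdown pattern. The ActiveMQ pieces (ActiveMqClient.open, conn.send, matches) are hypothetical stand-ins for any client that needs explicit setup and teardown:

```scala
rdd.foreachPartition { records =>
  val conn = ActiveMqClient.open()               // init once per partition (hypothetical helper)
  try {
    records.filter(matches).foreach(conn.send)   // per-record work
  } finally {
    conn.close()                                 // shutdown: release threads / file handles
  }
}
```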

Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
We have all our data in Cassandra so I’d prefer to not have to bring up Hadoop/HDFS, as that’s just another thing that can break. But I’m reading that Spark requires a shared filesystem like HDFS or S3… Can I use Tachyon or something simple for a shared filesystem?

Re: Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
Hadoop. Have you seen this: https://github.com/datastax/spark-cassandra-connector Harold On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton bur...@spinn3r.com wrote: We have all our data in Cassandra so I’d prefer to not have to bring up Hadoop/HDFS as that’s just another thing that can break
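A hedged sketch using the spark-cassandra-connector linked above: reads and writes go straight to Cassandra, with no HDFS in the path. The keyspace, table, and column names are hypothetical, and `spark.cassandra.connection.host` must be set in the Spark conf:

```scala
import com.datastax.spark.connector._

val rows = sc.cassandraTable("my_keyspace", "posts")                                 // read
rows.map(r => (r.getString("author"), 1L))
    .reduceByKey(_ + _)
    .saveToCassandra("my_keyspace", "post_counts", SomeColumns("author", "count"))   // write
```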

embedded spark for unit testing..

2014-11-09 Thread Kevin Burton
What’s the best way to embed Spark to run local mode in unit tests? Some of our jobs are mildly complex and I want to keep verifying that they work, including during schema changes / migration. I think for some of this I would just run local mode, read from a few text files via resources, and…
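A minimal local-mode sketch for a test; in practice a shared trait or a before/after hook in the test framework would own the lifecycle. The fixture path is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
val sc = new SparkContext(conf)
try {
  // Read a small fixture from test resources and assert on the result.
  val lines = sc.textFile(getClass.getResource("/fixture.txt").getPath)
  assert(lines.count() > 0)
} finally {
  sc.stop()   // always stop the context so later tests can create their own
}
```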

Debian package for spark?

2014-11-08 Thread Kevin Burton
Are there Debian packages for Spark? If not, I plan on making one… I threw one together in about 20 minutes, as they are somewhat easy with Maven and jdeb. But of course there are other things I need to install, like Cassandra support and an init script. So I figured I’d ask here first. If not, we…

Re: Debian package for spark?

2014-11-08 Thread Kevin Burton
On Sat, Nov 8, 2014 at 11:19 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Yep, there is one, have a look here: http://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages On 8 Nov 2014 19:48, Kevin Burton bur...@spinn3r.com wrote: Are there debian packages…

Re: Debian package for spark?

2014-11-08 Thread Kevin Burton
package: Control file descriptor keys are invalid [Version]. The following keys are mandatory [Package, Version, Section, Priority, Architecture, Maintainer, Description]. Please check your pom.xml/build.xml and your control file. - [Help 1] On Sat, Nov 8, 2014 at 11:24 AM, Kevin Burton bur

Re: Debian package for spark?

2014-11-08 Thread Kevin Burton
OK… here’s my version: https://github.com/spinn3r/spark-deb It’s just two files really, so if the standard Spark packages get fixed I’ll just switch to them. Doesn’t look like there’s an init script, and the conf isn’t in /etc … On Sat, Nov 8, 2014 at 12:06 PM, Kevin Burton bur...@spinn3r.com…

Re: Debian package for spark?

2014-11-08 Thread Kevin Burton
Another note for the official debs: ‘spark’ is a bad package name because of confusion with the SPARK programming language, which is based on Ada. There are packages already named ‘spark’ for that, so I put mine as ‘apache-spark’. On Sat, Nov 8, 2014 at 12:21 PM, Kevin Burton bur...@spinn3r.com wrote: OK…

Re: Debian package for spark?

2014-11-08 Thread Kevin Burton
…On Sat, Nov 8, 2014 at 1:17 PM, Kevin Burton bur...@spinn3r.com wrote: Another note for the official debs: ‘spark’ is a bad package name because of confusion with the SPARK programming language, which is based on Ada. There are packages already named ‘spark’ for that, so I put mine as ‘apache-spark’…