f jobs, the logged SparkEvents of which stick around in order
> for the UI to render. There are some options under `spark.ui.retained*` to
> limit that if it's a problem.
>
>
> On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> We've had
We've had various OOM issues with spark and have been trying to track them
down one by one.
Now we have one in spark-shell which is super surprising.
We currently allocate 6GB to the spark shell, as confirmed via 'ps'.
Why the heck would the *shell* need that much memory?
I'm going to try to give
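For reference, the shell's heap is the driver heap; a hedged one-liner for
capping it (2g here is an arbitrary example value) is to pass
--driver-memory when launching:

    spark-shell --driver-memory 2g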
I am trying to run a Spark job which reads from ElasticSearch and should
write its output back to a separate ElasticSearch index. Unfortunately I
keep getting `java.lang.OutOfMemoryError: Java heap space` exceptions. I've
tried running it with: --conf spark.memory.offHeap.enabled=true --conf
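For reference, a hedged example of enabling off-heap memory:
spark.memory.offHeap.enabled generally needs spark.memory.offHeap.size set
to a non-zero value as well (the size, class and jar below are
placeholders):

    spark-submit \
      --conf spark.memory.offHeap.enabled=true \
      --conf spark.memory.offHeap.size=4g \
      --class com.example.EsJob my-es-job.jar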
e few questions on this.
>
> Does that fail only with write.json()? I just wonder whether write.text,
> csv or another API fails as well, or whether it is a JSON-specific issue.
>
> Also, does that work with small data? I want to make sure whether this
> happens only on large data.
>
> Thanks!
I'm seeing some weird behavior and wanted some feedback.
I have a fairly large, multi-hour job that operates over about 5TB of data.
It builds out a ranked category index of about 25000 categories,
sorted by rank, descending.
I want to write this to a file but it's not actually writing
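As a point of reference, a minimal spark-shell sketch of the sort-and-write
step (toy data stands in for the real 25000-category index; the save is an
action, so nothing hits disk until it runs):

    // Toy stand-in for the ranked category index.
    val ranked = sc.parallelize(Seq(("sports", 0.9), ("news", 0.7), ("misc", 0.1)))
    ranked.sortBy(_._2, ascending = false)                // rank, descending
          .map { case (cat, rank) => s"$cat\t$rank" }
          .saveAsTextFile("/tmp/ranked-categories")       // action: triggers the write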
rally use --conf to set this on the command line if using
> the shell.
>
>
> On Tue, Sep 13, 2016, 19:22 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> The problem is that without a new spark context, with a custom conf,
>> elasticsearch-hadoop is refusing to read in se
> On 13 September 2016 at 18:57, Sean Owen <so...@cloudera.com> wrote:
>
I'm rather confused here as to what to do about creating a new SparkContext.
Spark 2.0 prevents it... (exception included below)
yet a TON of examples I've seen basically tell you to create a new
SparkContext as standard practice:
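For what it's worth, a minimal sketch of the pattern that works under Spark
2.0: go through SparkSession and reuse the existing context via
getOrCreate() instead of constructing a second SparkContext (the app name
and the es-hadoop setting below are placeholder values):

    import org.apache.spark.sql.SparkSession

    // getOrCreate() returns the already-running session/context (e.g. the
    // one spark-shell created) rather than failing with "Only one
    // SparkContext may be running in this JVM".
    val spark = SparkSession.builder()
      .appName("es-job")
      .config("es.nodes", "localhost")   // example setting; value is a placeholder
      .getOrCreate()
    val sc = spark.sparkContext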
1.6.1 and 1.6.2 don't work on our Elasticsearch setup because we use daily
indexes.
We get the error:
"Too many elements to create a power set"
It works on SINGLE indexes.. but if I specify content_* then I get this
error.
I don't see this documented anywhere. Is this a known issue?
Is there
, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> Ah.. might actually. I'll have to mess around with that.
>
> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:
>
>> Would `topByKey` help?
>>
>> https://github.c
scala#L42
>
> Best,
> Karl
>
> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm trying to figure out a way to group by and return the top 100 records
>> in that group.
>>
>> Something like:
>>
>> SELECT TOP(
I'm trying to figure out a way to group by and return the top 100 records
in that group.
Something like:
SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
But I can't really figure out the best way to do this...
There is a FIRST and LAST aggregate function but this only returns one
column.
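For reference, one common way to express top-N per group is a window
function with row_number(); a hedged sketch (Spark 2.x DataFrame API, with
a toy posts DataFrame standing in for the real table):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val spark = SparkSession.builder().appName("top-n").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the posts table.
    val posts = Seq((1, "p1", 10), (1, "p2", 7), (2, "p3", 3))
      .toDF("user_id", "post_id", "score")

    // Number posts within each user_id by descending score, keep the top 100.
    val w = Window.partitionBy($"user_id").orderBy($"score".desc)
    posts.withColumn("rn", row_number().over(w))
         .where($"rn" <= 100)
         .show()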
com>
wrote:
> Which Scala version is Spark built against? I'd guess it's 2.10 since
> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
>
>
> On Thu, 2 Jun 2016 at 15:50 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Thanks.
>>
haps
> not with recent Spark versions).
>
>
>
> On Thu, 2 Jun 2016 at 15:34 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's
>> not super easy.
>>
>> I wish there was an easier way to get this
I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's
not super easy.
I wish there was an easier way to get this stuff to work... The last time I
tried to use Spark I was having similar problems with classpath setup
and Cassandra.
Seems like a huge opportunity to make this easier.
Afternoon.
About 6 months ago I tried (and failed) to get Spark and Cassandra working
together in production due to dependency hell.
I'm going to give it another try!
Here's my general strategy.
I'm going to create a maven module for my code... with spark dependencies.
Then I'm going to get
I want to map over a Cassandra table in Spark but my code that executes
needs a shutdown() call to return any threads, release file handles, etc.
Will Spark always execute my mappers as a forked process? And if so, how do
I handle threads preventing the JVM from terminating?
It would be nice if
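One common shape for this (sketch only; the Resource class and helpers
below are hypothetical stand-ins for the real init/shutdown calls) is to do
the setup and teardown once per partition inside mapPartitions. Note that
Spark runs tasks as threads inside the executor JVM, not as forked
processes, so anything started must be shut down explicitly:

    // Hypothetical stand-ins for the real setup, work, and shutdown.
    class Resource { def shutdown(): Unit = () }
    def initResource(): Resource = new Resource
    def process(r: Resource, row: String): String = row

    val rdd = sc.parallelize(Seq("a", "b", "c"))    // stand-in for the real input
    val results = rdd.mapPartitions { rows =>
      val resource = initResource()                 // once per partition, on the executor
      try {
        // .toList forces the work to happen before the finally block runs
        rows.map(row => process(resource, row)).toList.iterator
      } finally {
        resource.shutdown()                         // release handles, stop threads
      }
    }
    results.collect()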
This is interesting.
I’m using ObjectInputStream to try to read a file written as
saveAsObjectFile… but it’s not working.
The documentation says:
“Write the elements of the dataset in a simple format using Java
serialization, which can then be loaded using SparkContext.objectFile().”
… but
This is almost funny.
I want to dump a computation to the filesystem. It’s just the result of a
Spark SQL call reading the data from Cassandra.
The problem is that it looks like it’s just calling toString() which is
useless.
The example is below.
I assume this is just a (bad) bug.
the objectFile javadoc says. It is expecting a
SequenceFile with NullWritable keys and BytesWritable values containing the
serialized values. This looks correct to me.
On Tue, Jan 13, 2015 at 8:39 AM, Kevin Burton bur...@spinn3r.com wrote:
This is interesting.
I’m using ObjectInputStream to try
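For what it's worth, the round trip the docs describe goes through
SparkContext.objectFile rather than a raw ObjectInputStream; a minimal
spark-shell sketch:

    // saveAsObjectFile writes a SequenceFile of serialized batches, so read
    // it back with sc.objectFile instead of java.io.ObjectInputStream.
    val nums = sc.parallelize(1 to 100)
    nums.saveAsObjectFile("/tmp/nums-obj")
    val back = sc.objectFile[Int]("/tmp/nums-obj")
    back.count()   // 100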
Is there a way to compute the total number of records in each RDD partition?
So say I had 4 partitions... I’d want to have:
partition 0: 100 records
partition 1: 104 records
partition 2: 90 records
partition 3: 140 records
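A minimal sketch of getting exactly that kind of listing (spark-shell, toy
data) with mapPartitionsWithIndex:

    // Count records per partition; yields (partitionIndex, count) pairs.
    val rdd = sc.parallelize(1 to 434, 4)           // toy data, 4 partitions
    val counts = rdd.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, it.size))
    }.collect()
    counts.foreach { case (idx, n) => println(s"partition $idx: $n records") }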
Kevin
I’m curious what the status is of implementing Hive analytics functions in
Spark.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Many of these seem missing. I’m assuming they’re not implemented yet?
Is there an ETA on them?
or am I the first to bring this
I’m trying to do a simple graph sort in Spark, which I mostly have working.
The one problem I have now is that I need to order them and then assign a
rank position.
So the top item should have rank 0, the next one should have rank 1, etc.
Hive and Pig support this with the RANK operator.
I
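For reference, a hedged sketch of the usual RDD-level workaround from that
era: sort, then zipWithIndex so the index becomes the rank (toy data below):

    // Sort by score descending; zipWithIndex then assigns ranks 0, 1, 2, ...
    val scored = sc.parallelize(Seq(("a", 42.0), ("b", 99.0), ("c", 7.0)))
    val ranked = scored.sortBy(_._2, ascending = false)
                       .zipWithIndex()               // ((item, score), rank)
                       .map { case ((item, score), rank) => (rank, item, score) }
    ranked.collect().foreach(println)   // (0,b,99.0), (1,a,42.0), (2,c,7.0)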
the system properties with -D
too if you need to do so directly. You don't have to change your app.
Executor memory does not have to be set this way but you could.
On Jan 1, 2015 6:36 AM, Kevin Burton bur...@spinn3r.com wrote:
This is really weird and I’m surprised no one has found this issue yet
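For reference, a couple of hedged examples of what is being described (the
memory values, class and jar names are placeholders):

    # First-class flags on spark-submit:
    spark-submit --driver-memory 2g --executor-memory 4g \
      --class com.example.MyJob my-job.jar

    # Or as an explicit conf key:
    spark-submit --conf spark.executor.memory=4g \
      --class com.example.MyJob my-job.jar

Spark configuration can also be injected as plain JVM system properties
(e.g. -Dspark.executor.memory=4g on the driver JVM), since SparkConf picks
up spark.* system properties.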
Is there a limit function which just returns the first N records?
Sample is nice but I’m trying to do this so it’s super fast and just to
test the functionality of an algorithm.
With sample I’d have to compute the % that would yield 1000 results first…
Kevin
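A minimal sketch of the usual answer (spark-shell, toy data): take(n)
returns the first N records without any sampling:

    val rdd = sc.parallelize(1 to 1000000)
    val first1000 = rdd.take(1000)   // Array of the first 1000 elements
    first1000.length                 // 1000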
This is really weird and I’m surprised no one has found this issue yet.
I’ve spent about an hour or more trying to debug this :-(
My spark install is ignoring ALL my memory settings. And of course my job
is running out of memory.
The default is 512MB so pretty darn small.
The worker and
to clean it up but I don’t know where to begin.
On Wed, Dec 31, 2014 at 10:35 PM, Kevin Burton bur...@spinn3r.com wrote:
This is really weird and I’m surprised no one has found this issue yet.
I’ve spent about an hour or more trying to debug this :-(
My spark install is ignoring ALL my memory
I have a job where I want to map over all data in a Cassandra database.
I’m then selectively sending things to my own external system (ActiveMQ) if
the item matches criteria.
The problem is that I need to do some init and shutdown. Basically on init
I need to create ActiveMQ connections and on
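A hedged sketch of the usual shape for this, with the connection handled
once per partition via foreachPartition (the MqConnection class and
openConnection helper are hypothetical stand-ins for the real ActiveMQ
calls):

    // Hypothetical stand-ins for the real ActiveMQ client.
    class MqConnection { def send(msg: String): Unit = (); def close(): Unit = () }
    def openConnection(): MqConnection = new MqConnection

    val items = sc.parallelize(Seq("keep-1", "drop", "keep-2"))  // stand-in for the Cassandra RDD
    items.filter(_.startsWith("keep")).foreachPartition { it =>
      val conn = openConnection()      // init once per partition, on the executor
      try it.foreach(conn.send)        // send only the matching items
      finally conn.close()             // shut down so non-daemon threads don't pin the JVM
    }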
We have all our data in Cassandra so I’d prefer to not have to bring up
Hadoop/HDFS as that’s just another thing that can break.
But I’m reading that spark requires a shared filesystem like HDFS or S3…
Can I use Tachyon for this or something simpler for a shared filesystem?
Hadoop. Have you seen
this:
https://github.com/datastax/spark-cassandra-connector
Harold
On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton bur...@spinn3r.com wrote:
We have all our data in Cassandra so I’d prefer to not have to bring up
Hadoop/HDFS as that’s just another thing that can break
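For reference, a hedged sketch of reading straight from Cassandra with that
connector, no HDFS involved (the host, keyspace and table names are
placeholders):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-read")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)

    // cassandraTable comes from the connector's implicits on SparkContext.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(rows.count())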
What’s the best way to embed Spark to run in local mode in unit tests?
Some of our jobs are mildly complex and I want to keep verifying that they
work including during schema changes / migration.
I think for some of this I would just run local mode, read from a few text
files via resources, and
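A minimal sketch of that setup (plain code here; wiring it into a
particular test framework is left out, and the fixture data is a toy
stand-in):

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode: "local[2]" runs everything in-process with two worker
    // threads, so no cluster is needed for the test.
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(Seq("a", "b", "c"))  // or sc.textFile on a small test resource
      assert(data.count() == 3)
    } finally {
      sc.stop()   // always stop, or the next test can't create a context
    }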
Are there debian packages for spark?
If not I plan on making one… I threw one together in about 20 minutes as
they are somewhat easy with maven and jdeb. But of course there are other
things I need to install like Cassandra support and an init script.
So I figured I’d ask here first.
If not we
On Sat, Nov 8, 2014 at 11:19 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
Yep there is one have a look here
http://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages
On 8 Nov 2014 19:48, Kevin Burton bur...@spinn3r.com wrote:
Are there debian packages
package: Control file descriptor keys are invalid
[Version]. The following keys are mandatory [Package, Version, Section,
Priority, Architecture, Maintainer, Description]. Please check your
pom.xml/build.xml and your control file. - [Help 1]
On Sat, Nov 8, 2014 at 11:24 AM, Kevin Burton bur
OK… here’s my version.
https://github.com/spinn3r/spark-deb
It’s just two files really. So if the standard spark packages get fixed
I’ll just switch to them.
Doesn’t look like there’s an init script and the conf isn’t in /etc …
On Sat, Nov 8, 2014 at 12:06 PM, Kevin Burton bur...@spinn3r.com
Another note for the official debs. ‘spark’ is a bad package name because
of confusion with the SPARK programming language, which is based on Ada.
There are packages for this already named ‘spark’
so I put mine as ‘apache-spark’
On Sat, Nov 8, 2014 at 12:21 PM, Kevin Burton bur...@spinn3r.com wrote:
OK
.
On Sat, Nov 8, 2014 at 1:17 PM, Kevin Burton bur...@spinn3r.com wrote:
Another note for the official debs. ‘spark’ is a bad package name
because of confusion with the spark programming lang based on ada.
There are packages for this already named ‘spark’
so I put mine as ‘apache-spark