It works similarly to reduceByKey.
In DataFrames (and thus in 1.5 in general) this is not possible, correct?
> On 11.08.2016, at 05:42, Holden Karau wrote:
>
> Hi Luis,
>
> You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you
> can do groupBy followed by a reduce on the
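The groupBy-followed-by-reduce that Holden describes can be sketched like this. This is a sketch against the Spark 1.6 Dataset API (in 2.0 these methods became groupByKey/reduceGroups); the Sale record and its fields are made-up names for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, purely for illustration
case class Sale(shop: String, amount: Long)

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("group-reduce-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val sales = sqlContext.createDataset(Seq(Sale("a", 1L), Sale("a", 2L), Sale("b", 5L)))

// Group on a key function, then reduce the grouped values pairwise --
// the Dataset analogue of reduceByKey on RDDs
val totals = sales.groupBy(_.shop)
  .reduce((x, y) => Sale(x.shop, x.amount + y.amount))
  .collect()
  .toMap

sc.stop()
```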
That's to be expected - the application UI is not started by the master, but by
the driver. So the UI will run on the machine that submits the job.
> On 26.07.2016, at 15:49, Jestin Ma wrote:
>
> I did netstat -apn | grep 4040 on machine 6, and I see
>
> tcp
normal join. This should be faster than joining and then subtracting.
> Anyway, thanks for the hint of the transformWith method!
>
> On 27 June 2016 at 14:32, Marius Soutier <mps@gmail.com> wrote:
> `transformWith` accepts another stream
Can't you use `transform` instead of `foreachRDD`?
> On 15.06.2016, at 15:18, Matthias Niehoff
> wrote:
>
> Hi,
>
> I want to subtract 2 DStreams (based on the same input stream) to get all
> elements that exist in the original stream, but not in the
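The subtraction under discussion can be sketched with `transformWith` plus `RDD.subtract`. This is a sketch; `queueStream` merely stands in for the real input so the example is self-contained:

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("subtract-sketch"))
val ssc = new StreamingContext(sc, Seconds(1))

// Two streams derived from the same input; queueStream is a stand-in source
val streamA = ssc.queueStream(mutable.Queue(sc.parallelize(Seq("a", "b", "c"))))
val streamB = ssc.queueStream(mutable.Queue(sc.parallelize(Seq("b"))))

// Per batch: keep the elements of streamA that are absent from streamB
val difference = streamA.transformWith(streamB,
  (a: RDD[String], b: RDD[String]) => a.subtract(b))
difference.print()

// The same subtraction at the RDD level, runnable without starting the context
val oneBatch = sc.parallelize(Seq("a", "b", "c"))
  .subtract(sc.parallelize(Seq("b")))
  .collect()
  .sorted

ssc.stop(stopSparkContext = true, stopGracefully = false)
```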
> On 04.03.2016, at 22:39, Cody Koeninger wrote:
>
> The only other valid use of messageHandler that I can think of is
> catching serialization problems on a per-message basis. But with the
> new Kafka consumer library, that doesn't seem feasible anyway, and
> could be
Found an issue for this:
https://issues.apache.org/jira/browse/SPARK-10251
> On 09.09.2015, at 18:00, Marius Soutier <mps@gmail.com> wrote:
>
> Hi all,
>
> as indicated in the title, I’m using Kryo wi
Hi all,
as indicated in the title, I’m using Kryo with a custom Kryo serializer, but as
soon as I enable `spark.kryo.registrationRequired`, my Spark Streaming job
fails to start with this exception:
Class is not registered: scala.collection.immutable.Range
When I register it, it continues
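A sketch of the registration that resolves that exception, assuming a programmatic SparkConf (`registerKryoClasses` is available from Spark 1.2 on):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // fail fast on any class that was not registered explicitly
  .set("spark.kryo.registrationRequired", "true")
  // register the class the exception complains about
  .registerKryoClasses(Array(classOf[scala.collection.immutable.Range]))

// registerKryoClasses records the class names under spark.kryo.classesToRegister
val registered = conf.get("spark.kryo.classesToRegister")
```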
Same problem here...
On 20.04.2015, at 09:59, Zsolt Tóth toth.zsolt@gmail.com wrote:
Hi all,
it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on
the mirror sites. Am I missing something?
Regards,
Zsolt
The processing speed displayed in the UI doesn’t seem to take everything into
account. I also had a low processing time but had to increase batch duration
from 30 seconds to 1 minute because waiting batches kept increasing. Now it
runs fine.
On 17.04.2015, at 13:30, González Salgado, Miquel
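The batch duration mentioned above is fixed when the StreamingContext is constructed, so changing it means recreating the context; a minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Minutes, StreamingContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("batch-duration-sketch"))

// The batch duration is set once, at StreamingContext construction time;
// moving from Seconds(30) to Minutes(1) means recreating the context.
val ssc = new StreamingContext(sc, Minutes(1))
val batchMillis = Minutes(1).milliseconds

ssc.stop(stopSparkContext = true, stopGracefully = false)
```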
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically.
From the source code comments:
// SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the
// application finishes.
On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote:
Does
That’s true, spill dirs don’t get cleaned up when something goes wrong. We are
restarting long-running jobs once in a while for cleanups and have
spark.cleaner.ttl set to a lower value than the default.
On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote:
Right, I
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=<seconds>"
On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Does anybody have an answer for
Hi there,
I'm using Spark Streaming 1.2.1 with actorStreams. Initially, all goes well.
15/03/30 15:37:00 INFO spark.storage.MemoryStore: Block broadcast_1 stored as
values in memory (estimated size 3.2 KB, free 1589.8 MB)
15/03/30 15:37:00 INFO spark.storage.BlockManagerInfo: Added
1. I don't think textFile is capable of unpacking a .gz file. You need to use
hadoopFile or newAPIHadoop file for this.
Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is
compute splits on gz files, so if you have a single file, you'll have a single
partition.
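That non-splittability is easy to verify; a self-contained sketch that generates a small .gz file on the fly:

```scala
import java.io.{File, FileOutputStream}
import java.util.zip.GZIPOutputStream
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("gz-sketch"))

// Write a small gzipped text file to a temp location
val file = File.createTempFile("sample", ".gz")
val out = new GZIPOutputStream(new FileOutputStream(file))
out.write("line1\nline2\nline3\n".getBytes("UTF-8"))
out.close()

// textFile reads .gz transparently, but gzip is not splittable,
// so the whole file lands in exactly one partition
val lines = sc.textFile(file.getAbsolutePath)
val partitionCount = lines.partitions.length

// Redistribute after reading when one partition is too coarse
val total = lines.repartition(4).count()

sc.stop()
```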
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
TD
On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier <mps@gmail.com> wrote:
Forgot
Hi,
I’ve written a Spark Streaming job that inserts into a Parquet table, using
stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added
checkpointing; everything works fine when starting from scratch. When starting
from a checkpoint however, the job doesn’t work and produces the
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile(“”)).
On 11.03.2015, at 18:35, Marius Soutier mps@gmail.com wrote:
Hi,
I’ve written a Spark Streaming job that inserts into a Parquet table, using
stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added
-n on your machine.
On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier <mps@gmail.com> wrote:
Hi Sameer,
I’m still using Spark 1.1.1, I think the default is hash shuffle. No external
shuffle service.
We are processing gzipped JSON files, the partitions
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it also increase job speed
by avoiding network transfers? Assuming I’m doing joins or other shuffle
operations.
Thanks
for computations?
yes they can.
On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier mps@gmail.com wrote:
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it also increase job
speed by avoiding network transfers
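A sketch of the `_2` storage levels under discussion; the comments reflect the answer above (replicas are primarily for fault tolerance, but can serve computations):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))

// _2 levels keep two copies of each cached partition on different nodes.
// Replicas are primarily for fault tolerance, but the scheduler may run
// tasks against either copy; shuffles still move data over the network.
val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY_2)
val total = rdd.sum()

sc.stop()
```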
the external shuffle service enabled (so that the Worker
JVM or NodeManager can still serve the map spill files after an Executor
crashes)?
How many partitions are in your RDDs before and after the problematic shuffle
operation?
On Monday, February 23, 2015, Marius Soutier mps
Hi guys,
I keep running into a strange problem where my jobs start to fail with the
dreaded "Resubmitted (resubmitted due to lost executor)" because of having too
many temp files from previous runs.
Both /var/run and /spill have enough disk space left, but after a given amount
of jobs have
-Original Message-
From: Marius Soutier [mailto:mps@gmail.com]
Sent: Monday, February 09, 2015 2:19 AM
To: user
Subject: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER
Hi there,
I’m trying to improve performance on a job that has GC troubles and takes
longer to compute simply because it has to recompute failed tasks. After
deferring object creation as much as possible, I’m now trying to improve memory
usage with StorageLevel.MEMORY_AND_DISK_SER and a custom
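A sketch of that storage level combined with Kryo (the custom part trailing off in the original mail is elided; any registrator class name would be hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Serialized caching pairs well with Kryo to reduce GC pressure
val sc = new SparkContext(new SparkConf()
  .setMaster("local[2]").setAppName("ser-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

// MEMORY_AND_DISK_SER keeps partitions as serialized byte buffers in memory
// (fewer live objects => less GC) and spills to disk under memory pressure
val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_SER)
val total = rdd.sum()

sc.stop()
```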
)
at scala.Option.foreach(Option.scala:236)
On 15.12.2014, at 22:36, Marius Soutier mps@gmail.com wrote:
Ok, maybe these test versions will help me then. I’ll check it out.
On 15.12.2014, at 22:33, Michael Armbrust mich...@databricks.com wrote:
Using a single SparkContext should not cause this problem
Hi,
is there an easy way to “migrate” parquet files or indicate optional values in
sql statements? I added a couple of new fields that I also use in a
schemaRDD.sql() which obviously fails for input files that don’t have the new
fields.
Thanks
- Marius
Hi,
I’m seeing strange, random errors when running unit tests for my Spark jobs. In
this particular case I’m using Spark SQL to read and write Parquet files, and
one error that I keep running into is this one:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in
stage
of our
unit testing.
On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier mps@gmail.com wrote:
Possible, yes, although I’m trying everything I can to prevent it, i.e. fork
in Test := true and isolated. Can you confirm that reusing a single
SparkContext for multiple tests poses a problem as well
You can also insert into existing tables via .insertInto(tableName, overwrite).
You just have to import sqlContext._
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello,
I'm writing a process that ingests json files and saves them a parquet files.
The process is as such:
Default value is infinite, so you need to enable it. Personally I’ve setup a
couple of cron jobs to clean up /tmp and /var/run/spark.
On 06.11.2014, at 08:15, Romi Kuntsman r...@totango.com wrote:
Hello,
Spark has an internal cleanup mechanism
(defined by spark.cleaner.ttl, see
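Enabling the TTL is a one-liner on the SparkConf; the 86400-second value here is a made-up example:

```scala
import org.apache.spark.SparkConf

// Unset means infinite retention; a TTL makes Spark periodically drop
// metadata and shuffle data older than this many seconds.
val conf = new SparkConf().set("spark.cleaner.ttl", "86400")
```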
I did some simple experiments with Impala and Spark, and Impala came out ahead.
But it’s also less flexible: it couldn’t handle irregular schemas, didn’t
support JSON, and so on.
On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote:
I agree. My personal experience with Spark
Just a wild guess, but I had to exclude “javax.servlet.servlet-api” from my
Hadoop dependencies to run a SparkContext.
In your build.sbt:
"org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"),
"org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")
Are these /vols formatted? You typically need to format and define a mount
point in /mnt for attached EBS volumes.
I’m not using the ec2 script, so I don’t know what is installed, but there’s
usually an HDFS info service running on port 50070. After changing
hdfs-site.xml, you have to restart
So, apparently `wholeTextFiles` runs the job again, passing null as argument
list, which in turn blows up my argument parsing mechanics. I never thought I
had to check for null again in a pure Scala environment ;)
On 26.10.2014, at 11:57, Marius Soutier mps@gmail.com wrote:
I tried
From: Marius Soutier [mps@gmail.com]
Sent: Friday, October 24, 2014 6:35 AM
To: user@spark.apache.org
Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since
Spark 1.1.0
Hi,
I’m running a job whose simple task it is to find files
Hi,
I’m running a job whose simple task it is to find files that cannot be read
(sometimes our gz files are corrupted).
With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an
exception:
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
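A sketch of that find-unreadable-files approach (file names are generated on the fly; a deliberately corrupt .gz with wrong magic bytes stands in for the broken inputs):

```scala
import java.io.{File, FileOutputStream}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("bad-gz-sketch"))

// One readable file and one corrupt .gz (wrong magic bytes)
val good = File.createTempFile("ok", ".txt")
new FileOutputStream(good) { write("fine\n".getBytes("UTF-8")); close() }
val bad = File.createTempFile("broken", ".gz")
new FileOutputStream(bad) { write(Array[Byte](0x1f, 0x42, 0x00)); close() }

// Probe each file with a tiny action; a corrupt archive fails the task,
// which surfaces as an exception on the driver
val unreadable = Seq(good, bad).map(_.getAbsolutePath).filter { path =>
  try { sc.textFile(path).take(1); false }
  catch { case _: Exception => true }
}

sc.stop()
```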
Hi guys,
another question: what’s the approach to working with column-oriented data,
i.e. data with more than 1000 columns. Using Parquet for this should be fine,
but how well does SparkSQL handle the big amount of columns? Is there a limit?
Should we use standard Spark instead?
Thanks for
Hi there,
we have a small Spark cluster running and are processing around 40 GB of
Gzip-compressed JSON data per day. I have written a couple of word count-like
Scala jobs that essentially pull in all the data, do some joins, group bys and
aggregations. A job takes around 40 minutes to
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not
that...
On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com wrote:
What version of Spark are you running? Some recent changes to how PySpark
works relative to Scala Spark may explain things.
Didn’t seem to help:
conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12")
sc = SparkContext(appName='app_name', conf=conf)
but still taking as much time
On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Total guess without knowing
Yeah we’re using Python 2.7.3.
On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote:
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:
Wild guess maybe, but do you decode the json records in Python ? it could be
much slower as the
Can’t install that on our cluster, but I can try locally. Is there a pre-built
binary available?
On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote:
In the master, you can easily profile you job, find the bottlenecks,
see https://github.com/apache/spark/pull/2556
Could you try
Hello,
sc.textFile and so on support wildcards in their path, but apparently
sqlc.parquetFile() does not. I always receive “File
/file/to/path/*/input.parquet does not exist”. Is this normal or a bug? Is
there a workaround?
Thanks
- Marius
Thank you, that works!
On 24.09.2014, at 19:01, Michael Armbrust mich...@databricks.com wrote:
This behavior is inherited from the parquet input format that we use. You
could list the files manually and pass them as a comma separated list.
On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier
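Michael’s workaround can be sketched with the Hadoop FileSystem API (Spark 1.x `parquetFile`; the glob path is the one from the original question and will not exist on your machine):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("glob-sketch"))
val sqlContext = new SQLContext(sc)

// Expand the glob ourselves via the Hadoop FileSystem API...
val fs = FileSystem.get(sc.hadoopConfiguration)
val matches = Option(fs.globStatus(new Path("/file/to/path/*/input.parquet")))
  .getOrElse(Array.empty[FileStatus])
  .map(_.getPath.toString)

// ...and hand parquetFile a comma-separated list of concrete paths
if (matches.nonEmpty) {
  val data = sqlContext.parquetFile(matches.mkString(","))
}

sc.stop()
```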
Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However the problem remains, how do I get that
data back to a dashboard. So I guess I’ll have to use a database after all.
You can batch up data & store it into parquet partitions as well. Query it
using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.
--
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote:
Hi there,
I’m pretty new to Spark
Hi there,
I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote
Scalding jobs - one-off, read data from HDFS, count words, write counts back to
HDFS.
Now I want to display these counts in a dashboard. Since Spark allows to cache
RDDs in-memory and you have to