Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
it works similarly to reduceByKey. > > On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote: > In DataFrames (and thus in 1.5 in general) this is not possible, correct? > >> On 11.08.2016, at 05:42, Holden Karau

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
In DataFrames (and thus in 1.5 in general) this is not possible, correct? > On 11.08.2016, at 05:42, Holden Karau wrote: > > Hi Luis, > > You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you > can do groupBy followed by a reduce on the
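A minimal sketch of the groupBy-followed-by-aggregate pattern suggested above, using Spark 2.0's SparkSession (on 1.6 the same groupBy/agg works from a SQLContext); the column names and sample data are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder().appName("reduceByKeyDF").getOrCreate()
    import spark.implicits._

    // illustrative key/value data
    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    // DataFrame counterpart of rdd.reduceByKey(_ + _):
    // groupBy + agg performs partial (map-side) aggregation, unlike RDD groupByKey
    val reduced = df.groupBy("key").agg(sum("value").as("value"))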

Re: Spark Web UI port 4040 not working

2016-07-27 Thread Marius Soutier
That's to be expected - the application UI is not started by the master, but by the driver. So the UI will run on the machine that submits the job. > On 26.07.2016, at 15:49, Jestin Ma wrote: > > I did netstat -apn | grep 4040 on machine 6, and I see > > tcp
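Since the application UI is served by the driver, it is reachable at <driver-host>:4040 while the application runs; the port can be changed through spark.ui.port. A minimal sketch (the port value is only an example):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-app")
      // the driver, not the master, hosts this UI; default port is 4040
      .set("spark.ui.port", "4050")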

Re: Subtract two DStreams

2016-06-28 Thread Marius Soutier
normal join. This should be faster than joining and subtracting then. > Anyway, thanks for the hint of the transformWith method! > > On 27 June 2016 at 14:32, Marius Soutier <mps@gmail.com> wrote: > `transformWith` accepts another stream
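A minimal sketch of the transformWith approach discussed here, assuming two hypothetical DStreams of the same element type:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // per batch, keep only the elements of `original` that do not appear in `other`
    def subtractStreams[T: ClassTag](original: DStream[T], other: DStream[T]): DStream[T] =
      original.transformWith(other, (left: RDD[T], right: RDD[T]) => left.subtract(right))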

Re: Subtract two DStreams

2016-06-27 Thread Marius Soutier
Can't you use `transform` instead of `foreachRDD`? > On 15.06.2016, at 15:18, Matthias Niehoff > wrote: > > Hi, > > i want to subtract 2 DStreams (based on the same Input Stream) to get all > elements that exist in the original stream, but not in the

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Marius Soutier
> On 04.03.2016, at 22:39, Cody Koeninger wrote: > > The only other valid use of messageHandler that I can think of is > catching serialization problems on a per-message basis. But with the > new Kafka consumer library, that doesn't seem feasible anyway, and > could be
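A hedged sketch of that per-message use with the Spark 1.x direct stream API: wrapping the handler body in a Try so a single undecodable record is dropped instead of failing the task. The topic, broker list, offsets, and ssc are placeholders or assumed to exist:

    import scala.util.Try
    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // placeholder
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)     // placeholder offsets

    // messageHandler runs once per record, so decode problems can be caught per message
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Option[String]](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => Try(mmd.message()).toOption)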

Re: spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-10 Thread Marius Soutier
Found an issue for this: https://issues.apache.org/jira/browse/SPARK-10251 > On 09.09.2015, at 18:00, Marius Soutier <mps@gmail.com> wrote: > Hi all, > > as indicated in the title, I’m using Kryo wi

spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-09 Thread Marius Soutier
Hi all, as indicated in the title, I’m using Kryo with a custom Kryo serializer, but as soon as I enable `spark.kryo.registrationRequired`, my Spark Streaming job fails to start with this exception: Class is not registered: scala.collection.immutable.Range When I register it, it continues
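A minimal sketch of registering the classes Kryo complains about, assuming a standard SparkConf; the second entry is only illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // with registrationRequired, every class that gets serialized must be listed here
      .registerKryoClasses(Array(
        classOf[scala.collection.immutable.Range],
        classOf[Array[String]] // illustrative; add your own case classes as well
      ))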

Re: Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Marius Soutier
Same problem here... On 20.04.2015, at 09:59, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi all, it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on the mirror sites. Am I missing something? Regards, Zsolt

Re: Streaming problems running 24x7

2015-04-20 Thread Marius Soutier
The processing speed displayed in the UI doesn’t seem to take everything into account. I also had a low processing time but had to increase batch duration from 30 seconds to 1 minute because waiting batches kept increasing. Now it runs fine. On 17.04.2015, at 13:30, González Salgado, Miquel
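A minimal sketch of raising the batch interval as described, assuming an otherwise standard StreamingContext setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-job")
    // batch interval raised from 30 seconds to 1 minute so waiting batches stop piling up
    val ssc = new StreamingContext(conf, Minutes(1))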

Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Marius Soutier
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. From the source code comments: // SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the // application finishes. On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote: Does

Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Marius Soutier
That’s true, spill dirs don’t get cleaned up when something goes wrong. We are restarting long running jobs once in a while for cleanups and have spark.cleaner.ttl set to a lower value than the default. On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote: Right, I

Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-13 Thread Marius Soutier
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=<seconds>" On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Does anybody have an answer for

actorStream woes

2015-03-30 Thread Marius Soutier
Hi there, I'm using Spark Streaming 1.2.1 with actorStreams. Initially, all goes well. 15/03/30 15:37:00 INFO spark.storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.2 KB, free 1589.8 MB) 15/03/30 15:37:00 INFO spark.storage.BlockManagerInfo: Added

Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
> 1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoopFile for this. Sorry, that’s incorrect: textFile works fine on .gz files. What it can’t do is compute splits on gz files, so if you have a single file, you'll have a single partition.
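A minimal sketch: textFile reads the .gz directly, but because gzip is not splittable each file becomes a single partition, so repartition before heavy processing (path and partition count are placeholders):

    // gzip files are readable but not splittable: one file -> one partition
    val raw = sc.textFile("hdfs:///data/big-file.json.gz")

    // spread the data across the cluster before expensive transformations
    val distributed = raw.repartition(64)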

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-12 Thread Marius Soutier
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations TD On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier mps@gmail.com wrote: Forgot
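The linked section of the streaming guide recommends creating the SQLContext lazily from the RDD's SparkContext inside foreachRDD, so it can be re-instantiated after recovering from a checkpoint. A hedged sketch of that pattern (stream and the actual insert logic are assumed):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // lazily instantiated singleton, re-created after checkpoint recovery
    object SQLContextSingleton {
      @transient private var instance: SQLContext = _
      def getInstance(sc: SparkContext): SQLContext = synchronized {
        if (instance == null) instance = new SQLContext(sc)
        instance
      }
    }

    stream.foreachRDD { rdd =>
      // derive the context from the RDD instead of capturing one in the checkpointed closure
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._
      // ... convert the RDD and insert into the Parquet-backed table here
    }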

Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Hi, I’ve written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added checkpointing; everything works fine when starting from scratch. When starting from a checkpoint, however, the job doesn’t work and produces the

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")). On 11.03.2015, at 18:35, Marius Soutier mps@gmail.com wrote: Hi, I’ve written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added

Re: Executor lost with too many temp files

2015-02-26 Thread Marius Soutier
-n on your machine. On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote: Hi Sameer, I’m still using Spark 1.1.1, I think the default is hash shuffle. No external shuffle service. We are processing gzipped JSON files, the partitions
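Besides raising the open-file limit (ulimit -n), Spark 1.1's hash shuffle can be told to consolidate its intermediate files, which keeps the number of open file handles down. This setting is not mentioned in the thread, only a possible mitigation (sketch):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // the hash shuffle otherwise writes one file per map/reduce pair
      .set("spark.shuffle.consolidateFiles", "true")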

Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers? Assuming I’m doing joins or other shuffle operations. Thanks

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
for computations? yes they can. On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier mps@gmail.com wrote: Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers
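A minimal sketch of the replicated storage level under discussion; since replicated blocks can serve computations, tasks may run node-local on either copy:

    import org.apache.spark.storage.StorageLevel

    val data = sc.parallelize(1 to 1000000)
    // _2 keeps two copies of every block: fault tolerance plus more node-local scheduling options
    val cached = data.persist(StorageLevel.MEMORY_ONLY_2)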

Re: Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
the external shuffle service enabled (so that the Worker JVM or NodeManager can still serve the map spill files after an Executor crashes)? How many partitions are in your RDDs before and after the problematic shuffle operation? On Monday, February 23, 2015, Marius Soutier mps

Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
Hi guys, I keep running into a strange problem where my jobs start to fail with the dreaded “Resubmitted (resubmitted due to lost executor)” because of having too many temp files from previous runs. Both /var/run and /spill have enough disk space left, but after a given amount of jobs have

Re: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-10 Thread Marius Soutier
-Original Message- From: Marius Soutier [mailto:mps@gmail.com] Sent: Monday, February 09, 2015 2:19 AM To: user Subject: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER Hi there, I'm trying to improve performance on a job that has GC troubles and takes longer

Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-09 Thread Marius Soutier
Hi there, I’m trying to improve performance on a job that has GC troubles and takes longer to compute simply because it has to recompute failed tasks. After deferring object creation as much as possible, I’m now trying to improve memory usage with StorageLevel.MEMORY_AND_DISK_SER and a custom
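A minimal sketch of the setup being described, combining serialized caching with a Kryo registrator; the registrator class name is hypothetical and someRdd is assumed:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // hypothetical class

    // store partitions as serialized bytes in memory, spilling to disk when full;
    // trades CPU for a smaller, more GC-friendly heap
    val cached = someRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)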

Re: Intermittent test failures

2014-12-17 Thread Marius Soutier
) at scala.Option.foreach(Option.scala:236) On 15.12.2014, at 22:36, Marius Soutier mps@gmail.com wrote: Ok, maybe these test versions will help me then. I’ll check it out. On 15.12.2014, at 22:33, Michael Armbrust mich...@databricks.com wrote: Using a single SparkContext should not cause this problem

Migrating Parquet inputs

2014-12-15 Thread Marius Soutier
Hi, is there an easy way to “migrate” parquet files or indicate optional values in sql statements? I added a couple of new fields that I also use in a schemaRDD.sql() which obviously fails for input files that don’t have the new fields. Thanks - Marius

Intermittent test failures

2014-12-15 Thread Marius Soutier
Hi, I’m seeing strange, random errors when running unit tests for my Spark jobs. In this particular case I’m using Spark SQL to read and write Parquet files, and one error that I keep running into is this one: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
of our unit testing. On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier mps@gmail.com wrote: Possible, yes, although I’m trying everything I can to prevent it, i.e. fork in Test := true and isolated. Can you confirm that reusing a single SparkContext for multiple tests poses a problem as well
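A minimal build.sbt sketch of the isolation mentioned above (fork in Test := true); the parallelExecution line is an additional common precaution, not taken from the thread:

    // run tests in a forked JVM so SparkContexts don't leak between suites
    fork in Test := true

    // optional: avoid several suites creating SparkContexts concurrently
    parallelExecution in Test := false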

Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote: Hello, I'm writing a process that ingests json files and saves them as parquet files. The process is as such:
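A minimal sketch of that call against the Spark 1.1 API; paths and the table name are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._ // brings the implicit conversions (and insertInto) into scope

    // register an existing Parquet-backed table once
    sqlContext.parquetFile("/data/existing.parquet").registerTempTable("events")

    // append new data instead of rewriting everything
    sqlContext.parquetFile("/data/new.parquet").insertInto("events", overwrite = false)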

Re: Snappy temp files not cleaned up

2014-11-06 Thread Marius Soutier
Default value is infinite, so you need to enable it. Personally I’ve setup a couple of cron jobs to clean up /tmp and /var/run/spark. On 06.11.2014, at 08:15, Romi Kuntsman r...@totango.com wrote: Hello, Spark has an internal cleanup mechanism (defined by spark.cleaner.ttl, see
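A minimal sketch of enabling the periodic cleaner mentioned above; the TTL value is only an example:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // metadata and shuffle data older than this many seconds become eligible for cleanup
      .set("spark.cleaner.ttl", "86400") // example: one day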

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. But it’s also less flexible, couldn’t handle irregular schemas, didn't support Json, and so on. On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote: I agree. My personal experience with Spark

Re: Submitting Spark application through code

2014-11-02 Thread Marius Soutier
Just a wild guess, but I had to exclude "javax.servlet:servlet-api" from my Hadoop dependencies to run a SparkContext. In your build.sbt: "org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"), "org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")
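A build.sbt sketch of the exclusion; the Hadoop version is a placeholder:

    val hadoopVersion = "2.4.0" // placeholder; use your cluster's version

    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-common" % hadoopVersion exclude("javax.servlet", "servlet-api"),
      "org.apache.hadoop" % "hadoop-hdfs"   % hadoopVersion exclude("javax.servlet", "servlet-api")
    )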

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-11-01 Thread Marius Soutier
Are these /vols formatted? You typically need to format and define a mount point in /mnt for attached EBS volumes. I’m not using the ec2 script, so I don’t know what is installed, but there’s usually an HDFS info service running on port 50070. After changing hdfs-site.xml, you have to restart

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-27 Thread Marius Soutier
So, apparently `wholeTextFiles` runs the job again, passing null as argument list, which in turn blows up my argument parsing mechanics. I never thought I had to check for null again in a pure Scala environment ;) On 26.10.2014, at 11:57, Marius Soutier mps@gmail.com wrote: I tried

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-26 Thread Marius Soutier
From: Marius Soutier [mps@gmail.com] Sent: Friday, October 24, 2014 6:35 AM To: user@spark.apache.org Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0 Hi, I’m running a job whose simple task it is to find files

scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-24 Thread Marius Soutier
Hi, I’m running a job whose simple task it is to find files that cannot be read (sometimes our gz files are corrupted). With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an exception: scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
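A hedged sketch of one way to run this kind of check (not necessarily the original job's approach): read each file separately and record the ones whose read fails; the input path is a placeholder and sc is assumed:

    import scala.util.Try
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path("/data/incoming")).map(_.getPath.toString)

    // force a full read of each gz file; corrupted archives surface as a failed count
    val unreadable = files.filter(file => Try(sc.textFile(file).count()).isFailure)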

SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
Hi guys, another question: what’s the approach to working with column-oriented data, i.e. data with more than 1000 columns. Using Parquet for this should be fine, but how well does SparkSQL handle the big amount of columns? Is there a limit? Should we use standard Spark instead? Thanks for

Python vs Scala performance

2014-10-22 Thread Marius Soutier
Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word count-like Scala jobs that essentially pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes to

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not that... On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com wrote: What version of Spark are you running? Some recent changes to how PySpark works relative to Scala Spark may explain things.

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", 12); sc = SparkContext(appName="app_name", conf=conf) but still taking as much time On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote: Total guess without knowing

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Wild guess maybe, but do you decode the json records in Python ? it could be much slower as the

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote: In the master, you can easily profile you job, find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try

parquetFile and wildcards

2014-09-24 Thread Marius Soutier
Hello, sc.textFile and so on support wildcards in their path, but apparently sqlc.parquetFile() does not. I always receive “File /file/to/path/*/input.parquet does not exist”. Is this normal or a bug? Is there a workaround? Thanks - Marius

Re: parquetFile and wildcards

2014-09-24 Thread Marius Soutier
Thank you, that works! On 24.09.2014, at 19:01, Michael Armbrust mich...@databricks.com wrote: This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier
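A minimal sketch of that workaround: expand the glob yourself and hand parquetFile a comma-separated list (the path is a placeholder; sc and sqlContext are assumed):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    // expand the wildcard manually, since parquetFile does not glob on its own
    val matched = fs.globStatus(new Path("/file/to/path/*/input.parquet")).map(_.getPath.toString)

    val parquet = sqlContext.parquetFile(matched.mkString(","))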

Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for some strange SQL parser errors). However the problem remains: how do I get that data back to a dashboard? So I guess I’ll have to use a database after all. > You can batch up data, store into parquet partitions as
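A hedged sketch of the batching approach quoted above, for the Spark 1.1 API: each run writes its results to its own Parquet directory, and a separate SparkSQL context (the shell or the 1.1 JDBC server) serves dashboard queries. counts stands in for whatever SchemaRDD the job produces; column names and paths are placeholders:

    // in the batch job: persist this run's results as Parquet
    counts.saveAsParquetFile("/data/dashboard/counts/run-001")

    // in a long-running serving context (SparkSQL shell or JDBC server):
    val served = sqlContext.parquetFile("/data/dashboard/counts/run-001")
    served.registerTempTable("counts")
    val top = sqlContext.sql("SELECT word, cnt FROM counts ORDER BY cnt DESC LIMIT 20").collect()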

Re: Serving data

2014-09-15 Thread Marius Soutier
another SparkSQL shell, JDBC driver in SparkSQL is part 1.1 i believe. -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote: Hi there, I’m pretty new to Spark

Re: Serving data

2014-09-15 Thread Marius Soutier
partitions as well. query it using another SparkSQL shell, JDBC driver in SparkSQL is part 1.1 i believe. -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote: Hi

Serving data

2014-09-12 Thread Marius Soutier
Hi there, I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote Scalding jobs - one-off, read data from HDFS, count words, write counts back to HDFS. Now I want to display these counts in a dashboard. Since Spark allows caching RDDs in memory and you have to