Re: counters in spark

2015-04-13 Thread Grandl Robert
Guys, do you have any thoughts on this? Thanks, Robert On Sunday, April 12, 2015 5:35 PM, Grandl Robert rgra...@yahoo.com.INVALID wrote: Hi guys, I was trying to figure out some counters in Spark related to the amount of CPU or memory used (in some metric) by a

Re: Multiple Kafka Recievers

2015-04-13 Thread Cody Koeninger
As far as I know, createStream doesn't let you specify where receivers are run. createDirectStream in 1.3 doesn't use long-running receivers, so it is likely to give you a more even distribution of consumers across your workers. On Mon, Apr 13, 2015 at 11:31 AM, Laeeq Ahmed
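
A minimal sketch of the 1.3 direct-stream call Cody is describing; the broker list and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
    // Hypothetical brokers/topic. No long-running receivers: each Kafka partition
    // becomes a Spark partition, so consumption spreads across the workers.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))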

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Patrick Wendell
Hey Jonathan, Are you referring to disk space used for storing persisted RDDs? For that, Spark does not bound the amount of data persisted to disk. It's a similar story to how Spark's shuffle disk output works (Hadoop and other frameworks make this assumption as well for their shuffle

Re: feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Xiangrui Meng
Correct. Prediction doesn't touch that code path. -Xiangrui On Mon, Apr 13, 2015 at 9:58 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, In the GeneralizedLinearAlgorithm, which Logistic Regression relies on, it says if useFeatureScaling is enabled, we will standardize the training
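
As far as I understand the 1.x code path, the scaler there divides each feature by its column standard deviation without centering, and the weights are mapped back after training, so prediction in the original space needs no scaling:

    x'_i = x_i / sigma_i              (scaling; no mean subtraction, preserving sparsity)
    w_orig_i = w_scaled_i / sigma_i   (back-transformation after training)
    =>  w_scaled . x' = sum_i w_scaled_i * (x_i / sigma_i) = w_orig . x

so predict() can use the stored original-space weights directly.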

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Jonathan Coveney
Nothing so complicated... we are seeing mesos kill off our executors immediately. When I reroute logging to an NFS directory we have available, the executors survive fine. As such I am wondering if the spark workers are getting killed by mesos for exceeding their disk quota (which atm is 0). This

Re: Opening many Parquet files = slow

2015-04-13 Thread Eric Eijkelenboom
Hi guys, Does anyone know how to stop Spark from opening all Parquet files before starting a job? This is quite a showstopper for me, since I have 5000 Parquet files on S3. Recap of what I tried: 1. Disable schema merging with: sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" ->
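
For reference, a cleaned-up sketch of that call against the 1.3 data source API (the S3 path is a placeholder; whether this actually avoids touching every footer is exactly what the thread is asking):

    // Hypothetical path; mergeSchema=false asks Parquet not to merge schemas across files.
    val df = sqlContext.load("parquet",
      Map("mergeSchema" -> "false", "path" -> "s3n://my-bucket/parquet-dir"))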

Re: Spark TeraSort source request

2015-04-13 Thread Ewan Higgs
Tom, According to Github's public activity log, Reynold Xin (in CC) deleted his sort-benchmark branch yesterday. I didn't have a local copy aside from the Daytona Partitioner (attached). Reynold, is it possible to reinstate your branch? -Ewan On 13/04/15 16:41, Tom Hubregtsen wrote: Thank

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Guillaume Pitel
That's why I think it's the OOM killer. There are several cases of memory overuse/errors: 1 - The application tries to allocate more than the heap limit and GC cannot free more memory => OutOfMemoryError: Java heap space exception from the JVM. 2 - The JVM is configured with a max heap size larger

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
I'm not 100% sure of spark's implementation, but in the MR frameworks it would have a much larger shuffle write size because that node is dealing with a lot more data and as a result has a lot more to shuffle. 2015-04-13 13:20 GMT-04:00 java8964 java8...@hotmail.com: If it is really due to data

Re: Multipart upload to S3 fails with Bad Digest Exceptions

2015-04-13 Thread Eugen Cepoi
I think I found where the problem comes from. I am writing lzo compressed thrift records using elephant-bird, my guess is that perhaps one side is computing the checksum based on the uncompressed data and the other on the compressed data, thus getting a mismatch. When writing the data as strings

Re: Spark support for Hadoop Formats (Avro)

2015-04-13 Thread Michael Armbrust
The problem is likely that the underlying avro library is reusing objects for speed. You probably need to explicitly copy the values out of the reused record before the collect. On Sat, Apr 11, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: The read seems to be successful as the
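
A sketch of the copy-before-collect idea Michael describes, assuming the input was read with the avro-mapred AvroInputFormat; the path and field names are illustrative:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable

    // Hypothetical path. The wrapper's datum() is reused between records,
    // so pull out immutable values before any collect().
    val avroRdd = sc.hadoopFile("/data/input.avro",
      classOf[AvroInputFormat[GenericRecord]],
      classOf[AvroWrapper[GenericRecord]], classOf[NullWritable])

    val copied = avroRdd.map { case (wrapper, _) =>
      val r = wrapper.datum()
      (r.get("id").toString, r.get("name").toString)  // copy to fresh Strings; fields are illustrative
    }
    copied.collect()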

Re: Need some guidance

2015-04-13 Thread Dean Wampler
That appears to work, with a few changes to get the types correct: input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition
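
Spelled out against the sample data from the original question (a shell-style sketch; distinct() removes duplicate (key, value) pairs first, so the combiner only has to count):

    val input = sc.parallelize(Seq(
      (1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
      (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")))

    val uniquePerKey = input.distinct().combineByKey(
      (s: String) => 1,                        // first distinct value seen for a key
      (agg: Int, s: String) => agg + 1,        // another distinct value for the key
      (agg1: Int, agg2: Int) => agg1 + agg2)   // merge partial counts across partitions

    uniquePerKey.collect()  // Array((1,3), (2,2), (3,1)), order may vary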

Re: Need some guidance

2015-04-13 Thread Victor Tso-Guillen
How about this? input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2).collect() On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler deanwamp...@gmail.com wrote: The problem with using collect is that it will fail for large data sets, as

Rack locality

2015-04-13 Thread rcharaya
I want to use the Rack locality feature of Apache Spark in my application. Is YARN the only resource manager which supports it as of now? After going through the source code, I found that the default implementation of the getRackForHost() method returns None in TaskSchedulerImpl, which (I suppose) would be

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-13 Thread Michael Armbrust
Here is the stack trace. The first part shows the log when the session is started in Tableau. It is using the init sql option on the data connection to create the TEMPORARY table myNodeTable. Ah, I see. Thanks for providing the error. The problem here is that temporary tables do not exist in

Re: How to load avro file into spark not on Hadoop in pyspark?

2015-04-13 Thread sa
Note that I am running pyspark in local mode (I do not have a hadoop cluster connected) as I want to be able to work with the avro file outside of hadoop.

Re: Rack locality

2015-04-13 Thread Sandy Ryza
Hi Riya, As far as I know, that is correct, unless Mesos fine-grained mode handles this in some mysterious way. -Sandy On Mon, Apr 13, 2015 at 2:09 PM, rcharaya riya.char...@gmail.com wrote: I want to use Rack locality feature of Apache Spark in my application. Is YARN the only resource

Re: Task result in Spark Worker Node

2015-04-13 Thread Imran Rashid
On the worker side, it all happens in Executor. The task result is computed here: https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210 then it's serialized along with some other goodies, and finally sent

Re: Registering classes with KryoSerializer

2015-04-13 Thread Imran Rashid
Those funny class names come from scala's specialization -- it's compiling a different version of OpenHashMap for each primitive you stick in the type parameter. Here's a super simple example: ➜ ~ more Foo.scala class Foo[@specialized X] ➜ ~ scalac Foo.scala ➜ ~ ls

Spark Worker IP Error

2015-04-13 Thread DStrip
I tried to start the Spark Worker using the registered IP but this error occurred: 15/04/13 21:35:59 INFO Worker: Registered signal handlers for [TERM, HUP, INT] Exception in thread main java.net.UnknownHostException: 10.240.92.75/: Name or service not known at

Registering classes with KryoSerializer

2015-04-13 Thread Arun Lists
Hi, I am trying to register classes with KryoSerializer. This has worked with other programs. Usually the error messages are helpful in indicating which classes need to be registered. But with my current program, I get the following cryptic error message: Caused by:

Re: How to get rdd count() without double evaluation of the RDD?

2015-04-13 Thread Imran Rashid
Yes, it sounds like a good use of an accumulator to me: val counts = sc.accumulator(0L); rdd.map { x => counts += 1; x }.saveAsObjectFile("file2") On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Sean, yes I know that I can use persist() to persist
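
A slightly fuller sketch of the same idea; note an accumulator's value is only reliable after the action completes, and re-executed tasks can overcount:

    val rdd = sc.parallelize(1 to 1000)  // stand-in for the real RDD
    val counts = sc.accumulator(0L)
    val result = rdd.map { x =>
      counts += 1L  // piggyback the count on the existing pass
      x
    }
    result.saveAsObjectFile("file2")     // the single action; no second pass for count()
    println(s"count = ${counts.value}")  // read only after the action finishes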

Re: org.apache.spark.ml.recommendation.ALS

2015-04-13 Thread Jay Katukuri
Hi Xiangrui, Here is the class: object ALSNew { def main(args: Array[String]) { val conf = new SparkConf() .setAppName("TrainingDataPurchase") .set("spark.executor.memory", "4g") conf.set("spark.shuffle.memoryFraction", "0.65") // default is 0.2

sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn] Merging 'META-INF/NOTICE' with strategy 'rename' [warn] Merging 'META-INF/LICENSE.txt' with strategy 'rename' [warn] Merging 'META-INF/LICENSE' with
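
One common way past duplicate META-INF entries is an explicit merge strategy in build.sbt; a hedged sketch (this is the sbt-assembly 0.12+ syntax, older plugin versions spell it mergeStrategy in assembly):

    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop conflicting metadata
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)  // fall back to the plugin's default for everything else
    }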

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
Thanks Vadim, I can certainly consume data from a Kinesis stream when running locally. I'm currently in the process of extending my work to a proper cluster (i.e. using a spark-submit job via uber jar). Feel free to add me to gmail chat and maybe we can help each other. On Mon, Apr 13, 2015 at

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Vadim Bichutskiy
I don't believe the Kinesis ASL dependency should be marked "provided". I used mergeStrategy successfully to produce an uber jar. FYI, I've been trying to consume data out of Kinesis with Spark with no success :( Would be curious to know if you got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike
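
In build.sbt terms, a sketch of what "not provided" means here (versions should match your Spark build; 1.3.0 is assumed):

    // Bundled into the assembly; spark-core itself stays "provided"
    // because the cluster supplies it at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
    )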

How to access postgresql on Spark SQL

2015-04-13 Thread doovsaid
Hi all, Who knows how to access PostgreSQL from Spark SQL? Do I need to add the postgresql dependency in build.sbt and set the class path for it? Thanks. Regards, Yi
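
A hedged sketch of one common route via the 1.3 data sources API; the connection details and driver version are placeholders, and note that in 1.3 the JDBC driver typically has to be visible on the driver and executor classpath (e.g. via --driver-class-path), not just bundled:

    // build.sbt (illustrative driver version):
    // libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1201-jdbc41"

    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",  // placeholder
      "dbtable" -> "public.mytable"))                                             // placeholder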

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Vadim Bichutskiy
Thanks Mike. I was having trouble on EC2. On Apr 13, 2015, at 10:25 PM, Mike Trienis mike.trie...@orcsol.com wrote: Thanks Vadim, I can certainly consume data from a Kinesis stream when running locally. I'm currently in the process of extending my work to a proper cluster (i.e. using a

Re: [GraphX] aggregateMessages with active set

2015-04-13 Thread James
Hello, Many thanks for your reply. From the code I found that the reason my program scans all the edges is because the EdgeDirection I passed in is EdgeDirection.Either. However, I still have the problem that the time consumed by each iteration does not decrease over time. Thus I have two

How can I add my custom Rule to spark sql?

2015-04-13 Thread Andy Zhao
Hi guys, I want to add my custom Rules (whatever the rule is) when the sql Logical Plan is being analysed. Is there a way to do that in the spark application code? Thanks

Re: Understanding Spark Memory distribution

2015-04-13 Thread Imran Rashid
broadcast variables count towards spark.storage.memoryFraction, so they use the same pool of memory as cached RDDs. That being said, I'm really not sure why you are running into problems; it seems like you have plenty of memory available. Most likely it's got nothing to do with broadcast

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-04-13 Thread sachin Singh
Hi Linlin, Have you found a solution for this issue? If yes, what needed correcting? I am also getting the same error when submitting a spark job in cluster mode, as under - 2015-04-14 18:16:43 DEBUG Transaction - Transaction rolled back in 0 ms 2015-04-14

Fwd: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string
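
If the surprise is about column matching: as far as I know, unionAll in 1.3 resolves columns by position rather than by name, so bdf would need reordering before a union with adf. A sketch:

    // Align bdf to adf's positional schema (Int, String) first;
    // select order, not column name, is what unionAll goes by.
    val aligned = bdf.select(bdf("c").as("a"), bdf("a").as("b"))
    val unioned = adf.unionAll(aligned)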

Re: [Spark1.3] UDF registration issue

2015-04-13 Thread Takeshi Yamamuro
Hi, It's a syntax error in Spark 1.3. The next release of Spark supports that kind of UDF call on DataFrames. See the link below. https://issues.apache.org/jira/browse/SPARK-6379 On Sat, Apr 11, 2015 at 3:30 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, I'm running into some trouble

Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zhan Zhang
Hi Zork, From the exception, it is still caused by hdp.version not being propagated correctly. Can you check whether there is any typo? [root@c6402 conf]# more java-opts -Dhdp.version=2.2.0.0-2041 [root@c6402 conf]# more spark-defaults.conf spark.driver.extraJavaOptions

RE: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread java8964
If it is really due to data skew, wouldn't the hanging task have a much bigger shuffle write size in this case? Here, the shuffle write size for that task is 0, and the rest of this task's IO is not much larger than that of the quickly finished tasks; is that normal? I am also interested in this case, as

Re: Need some guidance

2015-04-13 Thread Dean Wampler
The problem with using collect is that it will fail for large data sets, as you'll attempt to copy the entire RDD to the memory of your driver program. The following works (Scala syntax, but similar to Python): scala> val i1 = input.distinct.groupByKey scala> i1.foreach(println)

Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread James King
Any idea what this means, many thanks == logs/spark-.-org.apache.spark.deploy.worker.Worker-1-09.out.1 == 15/04/13 07:07:22 INFO Worker: Starting Spark worker 09:39910 with 4 cores, 6.6 GB RAM 15/04/13 07:07:22 INFO Worker: Running Spark version 1.3.0 15/04/13 07:07:22 INFO

Manning looking for a co-author for the GraphX in Action book

2015-04-13 Thread Reynold Xin
Hi all, Manning (the publisher) is looking for a co-author for the GraphX in Action book. The book currently has one author (Michael Malak), but they are looking for a co-author to work closely with Michael to improve the writing and make it more consumable. Early access page for the book:

RE: Kryo exception : Encountered unregistered class ID: 13994

2015-04-13 Thread mehdisinger
Hello, Thank you for your answer. I'm already registering my classes as you're suggesting... Regards From: tsingfu [via Apache Spark User List] [mailto:ml-node+s1001560n22468...@n3.nabble.com] Sent: Monday, April 13, 2015 03:48 To: Mehdi Singer Subject: Re: Kryo exception : Encountered

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-13 Thread ๏̯͡๏
You need to do a few more things or you will eventually run into these issues: val conf = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
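
A sketch of a fuller Kryo setup along those lines; registrationRequired is optional but turns missing registrations into explicit errors that name the offending class (the record class here is a stand-in):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, name: String)  // stand-in for your own classes

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "64")     // 1.x property name
      .set("spark.kryo.registrationRequired", "true")  // fail fast on unregistered classes
      .registerKryoClasses(Array(classOf[MyRecord], classOf[Array[MyRecord]]))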

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-13 Thread Marius Soutier
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=<seconds>" On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Does anybody have an answer for

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Guillaume Pitel
Very likely to be this: http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2 Your worker ran out of memory => maybe you're asking for too much memory for the JVM, or something else is running on the worker. Guillaume Any idea what this means, many thanks ==

Re: regarding ZipWithIndex

2015-04-13 Thread Jeetendra Gangele
How about using mapToPair and exchanging the two? Will it be efficient? Below is the code; will it be efficient to convert like this? JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(new PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {
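
For comparison, the same index-then-swap step is a one-liner in Scala (a sketch using the thread's matchRdd; efficiency-wise the swap is only a narrow map over the already-indexed RDD):

    // (record, index) -> (index, record). zipWithIndex itself triggers one job
    // to compute partition sizes; the swap adds only a cheap map.
    val byIndex = matchRdd.zipWithIndex().map(_.swap)  // RDD[(Long, VendorRecord)]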

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-13 Thread Guillaume Pitel
Does it also clean up spark local dirs? I thought it was only cleaning $SPARK_HOME/work/ Guillaume I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=<seconds>" On 11.04.2015, at

ExceptionDriver-Memory while running Spark job on Yarn-cluster

2015-04-13 Thread sachin Singh
Hi, When I am submitting a spark job as --master yarn-cluster with the below command/options, I am getting a driver memory error: spark-submit --jars ./libs/mysql-connector-java-5.1.17.jar,./libs/log4j-1.2.17.jar --files datasource.properties,log4j.properties --master yarn-cluster --num-executors 1

Re: MLlib : Gradient Boosted Trees classification confidence

2015-04-13 Thread mike
Thank you Peter. I just want to be sure: even if I use the classification setting, the GBT uses regression trees and not classification trees? I know the difference between the two (theoretically) is only in the loss and impurity functions. Thus, in case it uses classification trees, doing what you

Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zork Sail
Hi Zhan, Alas setting: -Dhdp.version=2.2.0.0-2041 Does not help. Still get the same error: 15/04/13 09:53:59 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1428918838408

Reading files from http server

2015-04-13 Thread Peter Rudenko
Hi, I want to play with the Criteo 1 TB dataset. Files are located on Azure storage. Here's a command to download them: curl -O http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_{`seq -s ',' 0 23`}.gz Is there any way to read files through the http protocol with spark without
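
There is no built-in http:// filesystem as far as I know, but one workaround is to open the URLs inside the tasks. A rough sketch with no retry handling, using the paths from the curl command above; each day_N.gz is very large, so a real job would want retries and checkpointing:

    import java.net.URL
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    val base = "http://azuremlsampleexperiments.blob.core.windows.net/criteo"
    // One partition per file; each task streams and decompresses its own day_N.gz.
    val lines = sc.parallelize(0 to 23, 24).flatMap { day =>
      val in = new GZIPInputStream(new URL(s"$base/day_$day.gz").openStream())
      Source.fromInputStream(in).getLines()
    }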

How to use multiple app jar files?

2015-04-13 Thread Michael Weir
My app works fine with the single, uber jar file containing my app and all its dependencies. However, it takes about 20 minutes to copy the 65MB jar file up to the node on the cluster, so my code, compile, test cycle has become a code, compile, coooopy, test cycle. I'd like to have a

Re: Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-13 Thread Cheng Lian
Thanks Yijie! Also cc the user list. Cheng On 4/13/15 9:19 AM, Yijie Shen wrote: I opened a new Parquet JIRA ticket here: https://issues.apache.org/jira/browse/PARQUET-251 Yijie On April 12, 2015 at 11:48:57 PM, Cheng Lian (lian.cs@gmail.com) wrote:

Re: Problem getting program to run on 15TB input

2015-04-13 Thread Daniel Mahler
Sometimes a large number of partitions leads to memory problems. Something like val rdd1 = sc.textFile("file1").coalesce(500) ... val rdd2 = sc.textFile("file2").coalesce(500) ... may help. On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra arun.lut...@gmail.com wrote: Everything works smoothly if I

Re: How to use multiple app jar files?

2015-04-13 Thread ๏̯͡๏
I faced the exact same issue. The way I solved it was: 1. Copy the entire project. 2. Delete all the source, keeping only the dependencies in pom.xml. This will create a fat jar with deps only, without source. 3. Keep the original project as is and build it; this will create a JAR (no deps, by default). Now

Re: Packaging Java + Python library

2015-04-13 Thread prabeesh k
Refer this post http://blog.prabeeshk.com/blog/2015/04/07/self-contained-pyspark-application/ On 13 April 2015 at 17:41, Punya Biswal pbis...@palantir.com wrote: Dear Spark users, My team is working on a small library that builds on PySpark and is organized like PySpark as well -- it has a

Re: Spark TeraSort source request

2015-04-13 Thread Tom Hubregtsen
Thank you for your response Ewan. I quickly looked yesterday and it was there, but today at work I tried to open it again to start working on it, and it appears to have been removed. Is this correct? Thanks, Tom On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote: Hi all. The code is

Packaging Java + Python library

2015-04-13 Thread Punya Biswal
Dear Spark users, My team is working on a small library that builds on PySpark and is organized like PySpark as well -- it has a JVM component (that runs in the Spark driver and executor) and a Python component (that runs in the PySpark driver and executor processes). What's a good approach

Sqoop parquet file not working in spark

2015-04-13 Thread bipin
Hi, I imported a table from mssql server with Sqoop 1.4.5 in parquet format. But when I try to load it from the Spark shell, it throws an error like: scala> val df1 = sqlContext.load("/home/bipin/Customer2") scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel

Re: ExceptionDriver-Memory while running Spark job on Yarn-cluster

2015-04-13 Thread ๏̯͡๏
Try this: ./bin/spark-submit -v --master yarn-cluster --jars ./libs/mysql-connector-java-5.1.17.jar,./libs/log4j-1.2.17.jar --files datasource.properties,log4j.properties --num-executors 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=1G" --executor-memory 2g --executor-cores 1

Need some guidance

2015-04-13 Thread Marco Shaw
**Learning the ropes** I'm trying to grasp the concept of using the pipeline in pySpark... Simplified example: list=[(1,'alpha'),(1,'beta'),(1,'foo'),(1,'alpha'),(2,'alpha'),(2,'alpha'),(2,'bar'),(3,'foo')] Desired outcome: [(1,3),(2,2),(3,1)] Basically for each key, I want the number of unique values. I've

Multipart upload to S3 fails with Bad Digest Exceptions

2015-04-13 Thread Eugen Cepoi
Hi, I am not sure my problem is relevant to spark, but perhaps someone else had the same error. When I try to write files that need multipart upload to S3 from a job on EMR I always get this error: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread ๏̯͡๏
You mean there is a tuple in either RDD that has itemID = 0 or null? And what is a catch-all? Does that imply it is a good idea to run a filter on each RDD first? We do not do this using Pig on M/R. Is it required in the Spark world? On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
I can promise you that this is also a problem in the pig world :) Not sure why it's not a problem for this data set, though... Are you sure that the two are doing the exact same code? You should inspect your source data. Make a histogram for each and see what the data distribution looks like. If
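
A quick way to eyeball the key distribution on both sides of the join, sketched with the RDD names from the original post:

    // Top 20 most frequent join keys on each side; one huge count
    // (often a null/0 catch-all key) is the classic skew signature.
    def topKeys[T](rdd: org.apache.spark.rdd.RDD[(Long, T)]): Array[(Long, Long)] =
      rdd.map(_._1 -> 1L).reduceByKey(_ + _)
         .map(_.swap).sortByKey(ascending = false).take(20).map(_.swap)

    topKeys(lstgItem).foreach(println)
    topKeys(viEvents).foreach(println)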

feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Jianguo Li
Hi, In the GeneralizedLinearAlgorithm, which Logistic Regression relies on, it says if useFeatureScaling is enabled, we will standardize the training features and train the model in the scaled space, then transform the coefficients from the scaled space back to the original space. My

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Tim Chen
Linux OOM throws SIGTERM, but if I remember correctly JVM handles heap memory limits differently and throws OutOfMemoryError and eventually sends SIGINT. Not sure what happened but the worker simply received a SIGTERM signal, so perhaps the daemon was terminated by someone or a parent process.

Spark Streaming Kafka Consumer, Confluent Platform, Avro StorageLevel

2015-04-13 Thread Nicolas Phung
Hello, I'm trying to use a Spark Streaming (1.2.0-cdh5.3.2) consumer (via the spark-streaming-kafka lib of the same version) with Kafka's Confluent Platform 1.0. I managed to make a Producer that produces my data, and can check it via the avro-console-consumer: ./bin/kafka-avro-console-consumer

Help in transforming the RDD

2015-04-13 Thread Jeetendra Gangele
Hi All, I have a JavaPairRDD<Long, String> where each Long key has 4 String values associated with it. I want to fire an HBase query for look-up of each String part of the RDD. This look-up will give a result of around 7K integers, so for each key I will have 7K values. Now my input RDD always

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
My guess would be data skew. Do you know if there is some item id that is a catch-all? Can it be null? Item id 0? Lots of data sets have this sort of value and it always kills joins. 2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: Code: val viEventsWithListings: RDD[(Long,

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-13 Thread Saiph Kappa
Whether I use 1 or 2 machines, the results are the same... Here are the results I got using 1 and 2 receivers with 2 machines: 2 machines, 1 receiver: sbt/sbt "run-main Benchmark 1 machine1 1000" 2>&1 | grep -i "Total delay\|record" 15/04/13 16:41:34 INFO JobScheduler: Total delay: 0.156 s

Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread ๏̯͡๏
Code: val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] = lstgItem.join(viEvents).map { case (itemId, (listing, viDetail)) => val viSummary = new VISummary viSummary.leafCategoryId = listing.getLeafCategId().toInt viSummary.itemSiteId =

Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Jonathan Coveney
I'm surprised that I haven't been able to find this via Google... What is the setting that requests some amount of disk space for the executors? Maybe I'm misunderstanding how this is configured... Thanks for any help!

What's the cleanest way to make spark aware of my custom scheduler?

2015-04-13 Thread Jonathan Coveney
I need to have my own scheduler to point to a proprietary remote execution framework. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152 I'm looking at where it decides on the backend and it doesn't look like there is a hook. Of course I can