Problem reading Parquet from 1.2 to 1.3

2015-06-03 Thread Don Drake
As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I noticed that Spark is behaving differently when reading Parquet directories that contain a .metadata directory. It seems that in Spark 1.2.x it would just ignore the .metadata directory, but now that I'm using Spark 1.3, reading these

Managing spark processes via supervisord

2015-06-03 Thread Mike Trienis
Hi All, I am curious to know if anyone has successfully deployed a spark cluster using supervisord? - http://supervisord.org/ Currently I am using the cluster launch scripts, which are working great; however, every time I reboot my VM or development environment I need to re-launch the

Re: Managing spark processes via supervisord

2015-06-03 Thread Igor Berman
assuming you are talking about a standalone cluster. IMHO, with workers you won't get any problems and it's straightforward since they are usually foreground processes; with the master it's a bit more complicated, as ./sbin/start-master.sh goes to the background, which is not good for supervisord, but anyway I think
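
A minimal supervisord sketch of what is described above: run the master and a worker through spark-class so they stay in the foreground (the install path, master URL and log locations below are made-up examples, not a tested config):

    [program:spark-master]
    command=/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/spark/master.log
    redirect_stderr=true

    [program:spark-worker]
    command=/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-host:7077
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/spark/worker.log
    redirect_stderr=true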

Python Image Library and Spark

2015-06-03 Thread Justin Spargur
Hi all, I'm playing around with manipulating images via Python and want to utilize Spark for scalability. That said, I'm just learning Spark and my Python is a bit rusty (been doing PHP coding for the last few years). I think I have most of the process figured out. However, the script fails

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-03 Thread Marcelo Vanzin
(bcc: user@spark, cc: cdh-user@cloudera) If you're using CDH, Spark SQL is currently unsupported and mostly untested, so I'd recommend against using it in CDH. You could try an upstream version of Spark instead. On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote: As part of

Re: Python Image Library and Spark

2015-06-03 Thread ayan guha
Try with a larger number of partitions in parallelize. On 4 Jun 2015 06:28, Justin Spargur jmspar...@gmail.com wrote: Hi all, I'm playing around with manipulating images via Python and want to utilize Spark for scalability. That said, I'm just learning Spark and my Python is a bit rusty

[ANNOUNCE] YARN support in Spark EC2

2015-06-03 Thread Shivaram Venkataraman
Hi all, We recently merged support for launching YARN clusters using Spark EC2 scripts as a part of https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass in hadoop-major-version as yarn to the spark-ec2 script and this will set up Hadoop 2.4 HDFS, YARN and Spark built for YARN

Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
Hello User group, I have an RDD of LabeledPoint composed of sparse vectors as shown below. In the next step, I am standardizing the columns with the Standard Scaler. The data has 2450 columns and ~110M rows. It took 1.5hrs to complete the standardization with 10 nodes and 80 executors. The

Re: data localisation in spark

2015-06-03 Thread Sandy Ryza
Tasks are scheduled on executors based on data locality. Things work as you would expect in the example you brought up. Through dynamic allocation, the number of executors can change throughout the lifetime of an application. 10 executors (or 5 executors with 2 cores each) are not needed for a

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-03 Thread Yin Huai
Hi Doug, Actually, sqlContext.table does not support database names in either Spark 1.3 or Spark 1.4. We will support it in a future version. Thanks, Yin On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog doug.sparku...@dugos.com wrote: Hi, sqlContext.table(“db.tbl”) isn’t working for me, I get a

Re: ALS Rating Object

2015-06-03 Thread Joseph Bradley
Hi Yasemin, If you can convert your user IDs to Integers in pre-processing (if you have < a couple billion users), that would work. Otherwise... In Spark 1.3: You may need to modify ALS to use Long instead of Int. In Spark 1.4: spark.ml.recommendation.ALS (in the Pipeline API) exposes ALS.train
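
A rough Scala sketch of the pre-processing option, assuming an existing SparkContext sc and an input ratingsRaw: RDD[(String, Int, Double)] of (userId, productId, score) tuples, with the distinct user count fitting in an Int:

    import org.apache.spark.mllib.recommendation.Rating

    // Assign each distinct user ID a surrogate Int key
    val userIds = ratingsRaw.map(_._1).distinct().zipWithIndex()
      .mapValues(_.toInt).collectAsMap()
    val bUserIds = sc.broadcast(userIds)

    val ratings = ratingsRaw.map { case (user, product, score) =>
      Rating(bUserIds.value(user), product, score)
    }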

RE: Make HTTP requests from within Spark

2015-06-03 Thread Mohammed Guller
The short answer is yes. How you do it depends on a number of factors. Assuming you want to build an RDD from the responses and then analyze the responses using Spark core (not Spark Streaming), here is one simple way to do it: 1) Implement a class or function that connects to a web service and
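
A hedged Scala sketch of that recipe; ids, the endpoint URL and parseResponse are placeholders for illustration, and real code would add timeouts, retries and connection reuse:

    import scala.io.Source

    def fetch(url: String): String = Source.fromURL(url).mkString  // naive blocking GET

    val urls = sc.parallelize(ids.map(id => s"http://api.example.com/items/$id"), 200)
    val responses = urls.map(fetch)             // RDD of raw response bodies
    val results = responses.map(parseResponse)  // analyse with normal Spark transformations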

RandomForest - subsamplingRate parameter

2015-06-03 Thread Andrew Leverentz
When training a RandomForest model, the Strategy class (in mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping to use this to cut down on processing time for large datasets (more than 2MM rows and 9K predictors), but I've found that the runtime stays approximately
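
For reference, a sketch of how subsamplingRate is set through Strategy (MLlib 1.3-era API as I read it; data is an assumed RDD[LabeledPoint]):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.configuration.Strategy

    val strategy = Strategy.defaultStrategy("Classification")
    strategy.subsamplingRate = 0.1   // each tree is trained on roughly 10% of the rows
    val model = RandomForest.trainClassifier(data, strategy, numTrees = 100,
      featureSubsetStrategy = "sqrt", seed = 42)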

How to pass system properties in spark ?

2015-06-03 Thread Ashwin Shankar
Hi, I'm trying to use property substitution in my log4j.properties, so that I can choose where to write Spark logs at runtime. The problem is that a system property passed to spark-shell doesn't seem to be getting propagated to log4j. *Here is log4j.properties (partial) with a parameter
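
One way this is usually wired up, sketched here with a made-up property name (spark.app.logdir) and paths:

    # at submit time: pass the property to the driver and executor JVMs
    spark-submit \
      --driver-java-options "-Dspark.app.logdir=/var/log/myapp" \
      --conf "spark.executor.extraJavaOptions=-Dspark.app.logdir=/var/log/myapp" \
      ...

    # log4j.properties (partial): substitute the system property
    log4j.appender.file.File=${spark.app.logdir}/spark.log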

Re: Spark Client

2015-06-03 Thread Akhil Das
Did you try this? Create an sbt project like: // Create your context val sconf = new SparkConf().setAppName("Sigmoid").setMaster("spark://sigmoid:7077") val sc = new SparkContext(sconf) // Do some computations sc.parallelize(1 to 1).take(10).foreach(println) //Now return the exit status

run spark submit on cloudera cluster

2015-06-03 Thread Pa Rö
Hi, I want to run my Spark app on a cluster; I use the Cloudera Live single-node VM. How must I build the job for the spark-submit script? And must I upload the job to HDFS for spark-submit? best regards paul

Re: Application is always in process when I check out logs of completed application

2015-06-03 Thread ayan guha
Have you done sc.stop() ? :) On 3 Jun 2015 14:05, amghost zhengweita...@outlook.com wrote: I run spark application in spark standalone cluster with client deploy mode. I want to check out the logs of my finished application, but I always get a page telling me Application history not found -

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
As far as I know, Spark doesn't support multiple outputs. On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote: Why do you need to do that if filter and content of the resulting rdd are exactly same? You may as well declare them as 1 RDD. On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏)

MetaException(message:java.security.AccessControlException: Permission denied

2015-06-03 Thread patcharee
Hi, I was running a Spark job to insert overwrite a Hive table and got Permission denied. My question is why the Spark job did the insert as user 'hive', not as me, the user who ran the job? How can I fix the problem? val hiveContext = new HiveContext(sc) import hiveContext.implicits._

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
You need to look into your executor/worker logs to see what's going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no wrote: Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
I checked RDD#randomSplit; it is much more like multiple one-to-one transformations rather than a one-to-multiple transformation. I wrote some sample code as follows, and it would generate 3 stages. Although we can use cache here to make it better, if Spark could support multiple outputs, only 2 stages

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Jeff Zhang
node down or container preempted ? You need to check the executor log / node manager log for more info. On Wed, Jun 3, 2015 at 2:31 PM, patcharee patcharee.thong...@uni.no wrote: Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best,

Re: DataFrames coming in SparkR in Apache Spark 1.4.0

2015-06-03 Thread Emaasit
You can build Spark from the 1.4 release branch yourself: https://github.com/apache/spark/tree/branch-1.4 - Daniel Emaasit, Ph.D. Research Assistant Transportation Research Center (TRC) University of Nevada, Las Vegas Las Vegas, NV 89154-4015 Cell: 615-649-2489 www.danielemaasit.com --

Error: Building Spark 1.4.0 from Github-1.4 release branch

2015-06-03 Thread Emaasit
I ran into errors while trying to build Spark from the 1.4 release branch: https://github.com/apache/spark/tree/branch-1.4. Any help will be much appreciated. Here is the log file. (F.Y.I, I installed all the dependencies like Java 7, Maven 3.2.5) C:\Program Files\Apache Software

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
Hi again, Below is the log from executor FetchFailed(BlockManagerId(4, compute-10-0.local, 38594), shuffleId=0, mapId=117, reduceId=117, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to compute-10-0.local/10.10.255.241:38594 at

Spark Client

2015-06-03 Thread pavan kumar Kolamuri
Hi guys, I am new to Spark. I am using SparkSubmit to submit Spark jobs. But for my use case I don't want it to exit with System.exit. Is there any other Spark client, more API-friendly than SparkSubmit, which doesn't exit with System.exit? Please correct me if I am missing

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Sean Owen
In the sense here, Spark actually does have operations that make multiple RDDs like randomSplit. However there is not an equivalent of the partition operation which gives the elements that matched and did not match at once. On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote: As far
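
The usual workaround discussed in this thread, sketched with a placeholder predicate pred over the pair RDD from the original post:

    val cached = rawQtSession.cache()        // pay the scan once, not twice
    val matched    = cached.filter { case (_, v) => pred(v) }
    val notMatched = cached.filter { case (_, v) => !pred(v) }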

Re: Scripting with groovy

2015-06-03 Thread Akhil Das
I think when you do an ssc.stop it will stop your entire application; and by 'update a transformation function' do you mean modifying the driver program? In that case even if you checkpoint your application, it won't be able to recover from its previous state. A simpler approach would be to add certain

Re: Spark Client

2015-06-03 Thread pavan kumar Kolamuri
Hi Akhil, sorry, I may not be conveying the question properly. Actually we are looking to launch a Spark job from a long-running workflow manager, which invokes the Spark client via SparkSubmit. Unfortunately the client, upon successful completion of the application, exits with a System.exit(0) or

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
This is log I can get 15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (2/3) for 4 outstanding blocks after 5000 ms 15/06/02 16:37:36 INFO client.TransportClientFactory: Found inactive connection to compute-10-3.local/10.10.255.238:33671, creating a new one. 15/06/02

Re: Spark Client

2015-06-03 Thread Akhil Das
Run it as a standalone application. Create an sbt project and do sbt run? Thanks Best Regards On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri pavan.kolam...@gmail.com wrote: Hi guys , i am new to spark . I am using sparksubmit to submit spark jobs. But for my use case i don't want it

Make HTTP requests from within Spark

2015-06-03 Thread kasparfischer
Hi everybody, I'm new to Spark, apologies if my question is very basic. I have a need to send millions of requests to a web service and analyse and store the responses in an RDD. I can easily express the analysing part using Spark's filter/map/etc. primitives but I don't know how to make the

Re: Spark 1.4.0-rc3: Actor not found

2015-06-03 Thread Anders Arpteg
Tried on some other data sources as well, and it actually works for some parquet sources. Potentially some specific problems with that first parquet source that I tried with, and not a Spark 1.4 problem. I'll get back with more info if I find any new information. Thanks, Anders On Tue, Jun 2,

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
Which version of spark? Looks like you are hitting this one https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Wed, Jun 3, 2015 at 1:06 PM, patcharee patcharee.thong...@uni.no wrote: This is log I can get 15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying

Re: Spark 1.4.0 build Error on Windows

2015-06-03 Thread Daniel Emaasit
I ran into errors while trying to build Spark from the 1.4 release branch: https://github.com/apache/spark/tree/branch-1.4. Any help will be much appreciated. Here is the log file from my Windows 8.1 PC. (F.Y.I, I installed all the dependencies like Java 7, Maven 3.2.5 and set the environment

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread ayan guha
Why do you need to do that if filter and content of the resulting rdd are exactly same? You may as well declare them as 1 RDD. On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I want to do this val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId !=

ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Saisai Shao
I think you could check the YARN NodeManager log or other Spark executor logs to see the details. What you listed above of the exception stacks is just the phenomenon, not the cause. Normally there are some situations which will lead to executor loss: 1. Killed by YARN because of memory

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Dmitry was concerned about the “serialization cost” NOT the “memory footprint” – hence option a) is still viable since a Broadcast is performed only ONCE for the lifetime of the Driver instance From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, June 3, 2015 2:44 PM To: Evo Eftimov Cc:

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When Spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or the row-wise structure that a regular Spark RDD has. The section Spark SQL and DataFrames

Re: Application is always in process when I check out logs of completed application

2015-06-03 Thread Richard Marscher
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed in 1.3.1 and up. https://issues.apache.org/jira/browse/SPARK-6036 The event logs are only moved from in-progress to complete after the Master tries to build the history page for the logs. The

Re: Spark Client

2015-06-03 Thread Richard Marscher
I think the short answer to the question is, no, there is no alternate API that will not use the System.exit calls. You can craft a workaround like is being suggested in this thread. For comparison, we are doing programmatic submission of applications in a long-running client application. To get

Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-03 Thread lonikar
When Spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or the row-wise structure that a regular Spark RDD has. The section Spark SQL and DataFrames

Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread allonsy
Hi everybody, is there anything in Spark sharing the philosophy of Storm's field grouping? I'd like to manage data partitioning across the workers by sending tuples sharing the same key to the very same worker in the cluster, but I did not find any method to do that. Suggestions? :) -- View

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Hmmm, Spark Streaming app code doesn't execute in the linear fashion assumed in your previous code snippet. To achieve your objectives you should do something like the following; in terms of your second objective (saving the initialization and serialization of the params) you can: a) broadcast
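
A sketch of the two options being weighed in this thread; Param, createParam, process, ssc and stream stand in for the poster's own types and streaming context:

    // a) broadcast once from the driver; shipped a single time per executor
    val bParam = ssc.sparkContext.broadcast(createParam())
    stream.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach(rec => process(rec, bParam.value)))
    }

    // b) build the object on the executors, once per partition, never serialized
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { it =>
        val param = createParam()
        it.foreach(rec => process(rec, param))
      }
    }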

ALS Rating Object

2015-06-03 Thread Yasemin Kaya
Hi, I want to use Spark's ALS in my project. I have user IDs like 30011397223227125563254, and the Rating object that ALS wants takes an Integer as the user ID, so the ID field does not fit into a 32-bit Integer. How can I solve that? Thanks. Best, yasemin -- hiç ender hiç

Re: in GraphX,program with Pregel runs slower and slower after several iterations

2015-06-03 Thread Cheuk Lam
I think you're exactly right. I once had 100 iterations in a single Pregel call, and got into the lineage problem right there. I had to modify the Pregel function and checkpoint both the graph and the newVerts RDD there to cut off the lineage. If you draw out the dependency graph among the g,
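
A generic sketch of that lineage-cutting pattern outside of Pregel (step(), initialRdd and the checkpoint directory are placeholders; the Pregel-internal change described above is not shown):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    var current = initialRdd
    for (i <- 1 to 100) {
      current = step(current).cache()
      if (i % 20 == 0) {
        current.checkpoint()   // truncate the lineage every 20 iterations
        current.count()        // force materialization so the checkpoint happens now
      }
    }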

Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread dgoldenberg
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically the takeaway is that all objects passed into the code processing RDDs must be serializable. So if I've got a few objects that I'd rather initialize once and deinitialize once outside of the logic processing the RDDs, I'd

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Ted Yu
Considering memory footprint of param as mentioned by Dmitry, option b seems better. Cheers On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov evo.efti...@isecc.com wrote: Hmmm a spark streaming app code doesn't execute in the linear fashion assumed in your previous code snippet - to achieve your

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Great. You should monitor vital performance / job clogging stats of the Spark Streaming Runtime not “kafka topics” -- anything specific you were thinking of? On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote: Makes sense especially if you have a cloud with “infinite”

Re: Spark 1.4.0 build Error on Windows

2015-06-03 Thread pawan kumar
I got the same error message when using maven 3.3 . On Jun 3, 2015 8:58 AM, Ted Yu yuzhih...@gmail.com wrote: I used the same command on Linux but didn't reproduce the error. Can you include -X switch on your command line ? Also consider upgrading maven to 3.3.x Cheers On Wed, Jun 3,

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
So Evo, option b is to singleton the Param, as in your modified snippet, i.e. instantiate it once per RDD. But if I understand correctly the a) option is broadcast, meaning instantiation happens in the Driver once before any transformations and actions, correct? That's where my serialization costs

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
If we have a hand-off between the older consumer and the newer consumer, I wonder if we need to manually manage the offsets in Kafka so as not to miss some messages as the hand-off is happening. Or if we let the new consumer run for a bit then let the old consumer know the 'new guy is in town'

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Would it be possible to implement Spark autoscaling somewhat along these lines? -- 1. If we sense that a new machine is needed, by watching the data load in Kafka topic(s), then 2. Provision a new machine via a Provisioner interface (e.g. talk to AWS and get a machine); 3. Create a shadow/mirror

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
Makes sense especially if you have a cloud with “infinite” resources / nodes which allows you to double, triple etc in the background/parallel the resources of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20

Re: Spark 1.4.0 build Error on Windows

2015-06-03 Thread Ted Yu
I used the same command on Linux but didn't reproduce the error. Can you include -X switch on your command line ? Also consider upgrading maven to 3.3.x Cheers On Wed, Jun 3, 2015 at 2:36 AM, Daniel Emaasit daniel.emaa...@gmail.com wrote: I run into errors while trying to build Spark from

Re: Example Page Java Function2

2015-06-03 Thread Sean Owen
Yes, I think you're right. Since this is a change to the ASF hosted site, I can make this change to the .md / .html directly rather than go through the usual PR. On Wed, Jun 3, 2015 at 6:23 PM, linkstar350 . tweicomepan...@gmail.com wrote: Hi, I'm Taira. I notice that this example page may be

Example Page Java Function2

2015-06-03 Thread linkstar350 .
Hi, I'm Taira. I notice that this example page may be a mistake. https://spark.apache.org/examples.html Word Count (Java) JavaRDD<String> textFile = spark.textFile("hdfs://..."); JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() { public

StreamingListener, anyone?

2015-06-03 Thread dgoldenberg
Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming listener, like so: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(params.getBatchDurationMillis())); jssc.addStreamingListener(new JobListener(jssc)); where
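
For context, a Scala sketch of the shape such a listener takes (the poster's JobListener is their own class; this one just logs batch timing):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class TimingListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"batch ${info.batchTime}: scheduling=${info.schedulingDelay.getOrElse(-1L)} ms, " +
          s"processing=${info.processingDelay.getOrElse(-1L)} ms")
      }
    }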

Re: Broadcast variables can be rebroadcast?

2015-06-03 Thread NB
I am pasting some of the exchanges I had on this topic via the mailing list directly so it may help someone else too. (Don't know why those responses don't show up here). --- Thanks Imran. It does help clarify. I believe I had it right all along then but was

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
I think what we'd want to do is track the ingestion rate in the consumer(s) via Spark's aggregation functions and such. If we're at a critical level (load too high / load too low) then we issue a request into our Provisioning Component to add/remove machines. Once it comes back with an OK, each

Re: Behavior of the spark.streaming.kafka.maxRatePerPartition config param?

2015-06-03 Thread Cody Koeninger
The default of 0 means no limit. Each batch will grab as much as is available, i.e. a range of offsets spanning from the end of the previous batch to the highest available offsets on the leader. If you set spark.streaming.kafka.maxRatePerPartition > 0, the number you set is the maximum number of
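
A sketch of where that limit is configured (the value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-ingest")
      // cap each Kafka partition at 1000 records per second
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")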

Adding new Spark workers on AWS EC2 - access error

2015-06-03 Thread barmaley
I have an existing operating Spark cluster that was launched with the spark-ec2 script. I'm trying to add a new slave by following the instructions: stop the cluster; on the AWS console, "launch more like this" on one of the slaves; start the cluster. Although the new instance is added to the same security

Re: Make HTTP requests from within Spark

2015-06-03 Thread Pat McDonough
Try something like the following. Create a function to make the HTTP call, e.g. using org.apache.commons.httpclient.HttpClient as in below. def getUrlAsString(url: String): String = { val client = new org.apache.http.impl.client.DefaultHttpClient() val request = new

Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Night Wolf
Hi all, Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode with Hive support. *Build command;* ./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4 -Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided -Dhadoop.version=2.5.1-mapr-1501

Re: ALS Rating Object

2015-06-03 Thread Yasemin Kaya
Hi Joseph, I thought about converting the IDs, but there will be a birthday problem. The probability of a Hash Collision http://preshing.com/20110504/hash-collision-probabilities/ is important for me because of the number of users. I don't know how I can modify ALS to use Integer. yasemin 2015-06-04 2:28

Re: Example Page Java Function2

2015-06-03 Thread linkstar350 .
Thank you. I confirmed the page. 2015-06-04 1:35 GMT+09:00 Sean Owen so...@cloudera.com: Yes, I think you're right. Since this is a change to the ASF hosted site, I can make this change to the .md / .html directly rather than go through the usual PR. On Wed, Jun 3, 2015 at 6:23 PM,

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
The fit part is very slow, transform not at all. The number of partitions was 210 vs number of executors 80. Spark 1.4 sounds great but as my company is using Qubole we are dependent upon them to upgrade from version 1.3.1. Until that happens, can you think of any other reasons as to why it

TreeReduce Functionality in Spark

2015-06-03 Thread raggy
I am trying to understand what the treeReduce function for an RDD does, and how it is different from the normal reduce function. My current understanding is that treeReduce tries to split up the reduce into multiple steps. We do a partial reduce on different nodes, and then a final reduce is done
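
A tiny contrast of the two calls, given an existing SparkContext sc (numbers are illustrative):

    val nums = sc.parallelize(1 to 10000, 400)
    val flat = nums.reduce(_ + _)                 // every partition result returns straight to the driver
    val tree = nums.treeReduce(_ + _, depth = 2)  // partial results are combined on executors first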

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle but a very small one, and transform doesn't do a shuffle. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the # of executors you have. In Spark 1.4's new ML pipeline
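
A sketch of that suggestion; the partition count and the labeled input (RDD[LabeledPoint]) are assumptions:

    import org.apache.spark.mllib.feature.StandardScaler

    val features = labeled.map(_.features).repartition(320).cache()
    features.count()   // materialize the input before timing fit
    val scalerModel = new StandardScaler(withMean = false, withStd = true).fit(features)  // withMean = false keeps sparse vectors sparse
    val scaled = labeled.map(p => (p.label, scalerModel.transform(p.features)))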

Re: Make HTTP requests from within Spark

2015-06-03 Thread William Briggs
Hi Kaspar, This is definitely doable, but in my opinion, it's important to remember that, at its core, Spark is based around a functional programming paradigm - you're taking input sets of data and, by applying various transformations, you end up with a dataset that represents your answer.

Re: Spark Cluster Benchmarking Frameworks

2015-06-03 Thread Zhen Jia
Hi Jonathan, Maybe you can try BigDataBench: http://prof.ict.ac.cn/BigDataBench/ . It provides lots of workloads, including both Hadoop and Spark based workloads. Zhen Jia hodgesz wrote Hi Spark Experts, I am curious what people are using to benchmark

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Can you do count() before fit to force materialize the RDD? I think something before fit is slow. On Wednesday, June 3, 2015, Piero Cinquegrana pcinquegr...@marketshare.com wrote: The fit part is very slow, transform not at all. The number of partitions was 210 vs number of executors 80.

Re: Spark Client

2015-06-03 Thread pavan kumar Kolamuri
Thanks Akhil, Richard, Oleg for your quick responses. @Oleg we have actually tried the same thing, but unfortunately when we throw an exception the Akka framework catches all exceptions, thinks the job failed, and reruns the Spark jobs infinitely. Since in OneForOneStrategy in Akka, the max no of

Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-03 Thread Doug Balog
Hi, sqlContext.table(“db.tbl”) isn’t working for me, I get a NoSuchTableException. But I can access the table via sqlContext.sql(“select * from db.tbl”) So I know it has the table info from the metastore. Anyone else see this ? I’ll keep digging. I compiled via make-distribution -Pyarn

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys on a specific node and sends new tuples with that key to that node. Matei On Jun 3, 2015, at 6:31 AM, allonsy luke1...@gmail.com wrote:
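
A small Scala sketch of both routes (pairs is a placeholder RDD[(String, Int)]):

    import org.apache.spark.HashPartitioner

    // byKey operations group all values for a key onto one task during the shuffle
    val counts = pairs.reduceByKey(_ + _)

    // or pin an explicit partitioner so later stages reuse the same placement without re-shuffling
    val partitioned = pairs.partitionBy(new HashPartitioner(64)).cache()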