word count aggregation

2014-12-29 Thread Hoai-Thu Vuong
Dear Spark users, I've got a program streaming a folder: when a new file is created in this folder, I count the words that appear in that document and update the counts (I used StatefulNetworkWordCount to do it). And it works like a charm. However, I would like to know the difference in the top 10 words as of now

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Suhas Shekar
Hello Akhil, I changed my Kafka dependency to 2.10 (which is the version of Kafka that was on 10.0.1.232). I am getting a slightly different error, but at the same place as the previous error (pasted below). FYI, when I make these changes to the pom file, I do mvn clean package then cp the new

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Akhil Das
Add this jar to the dependencies: http://mvnrepository.com/artifact/com.yammer.metrics/metrics-core/2.2.0 Thanks Best Regards On Mon, Dec 29, 2014 at 1:31 PM, Suhas Shekar suhsheka...@gmail.com wrote: Hello Akhil, I changed my Kafka dependency to 2.10 (which is the version of Kafka that was on

Re: word count aggregation

2014-12-29 Thread Akhil Das
You can use reduceByKeyAndWindow for that. Here's a pretty clean example https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala Thanks Best Regards On Mon, Dec 29, 2014 at 1:30 PM, Hoai-Thu Vuong thuv...@gmail.com wrote:
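
For reference, a minimal Scala sketch of the windowed variant, assuming a folder-watching stream like the original question (the folder path, the 10-minute window, and the 10-second slide are illustrative, not from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._            // pair-RDD implicits (Spark 1.2 and earlier)
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(new SparkConf().setAppName("WindowedWordCount"), Seconds(10))
    val words = ssc.textFileStream("/path/to/folder").flatMap(_.split(" "))   // hypothetical folder

    // word counts over the last 10 minutes, recomputed every 10 seconds
    val windowed = words.map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(600), Seconds(10))

    // print the current top 10 words for each batch of the windowed stream
    windowed.foreachRDD { rdd =>
      rdd.map { case (w, c) => (c, w) }.sortByKey(ascending = false).take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()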

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Suhas Shekar
I'm very close! So I added that and then I added this: http://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/0.8.2-beta and it seems as though the stream is working, as it says Stream 0 received 1 or 2 blocks as I enter messages on my Kafka producer. However, the Receiver seems to

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Akhil Das
If you want to stop the streaming after 10 seconds, then use ssc.awaitTermination(10000). Make sure you push some data to Kafka for the streaming to consume within the 10 seconds. Thanks Best Regards On Mon, Dec 29, 2014 at 1:53 PM, Suhas Shekar suhsheka...@gmail.com wrote: I'm very close! So
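
For clarity, awaitTermination with a timeout only bounds how long the driver blocks; it does not by itself stop the streaming context. A minimal Scala sketch of the pattern (the Java jssc call is analogous; the 2-second batch interval is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("BoundedRun"), Seconds(2))
    // ... set up the Kafka input stream and transformations here ...
    ssc.start()
    ssc.awaitTermination(10000)  // block the driver for at most 10,000 ms (10 seconds)
    ssc.stop()                   // then stop the streaming context explicitly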

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Suhas Shekar
Hmmm... so I added 10000 (10,000 ms) to jssc.awaitTermination, however it does not stop. When I am not pushing in any data it gives me this: 14/12/29 08:35:12 INFO ReceiverTracker: Stream 0 received 0 blocks 14/12/29 08:35:12 INFO JobScheduler: Added jobs for time 1419860112000 ms 14/12/29 08:35:14

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Akhil Das
Now, add these lines to get rid of those logs: import org.apache.log4j.Logger import org.apache.log4j.Level Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Mon, Dec 29, 2014 at 2:09 PM, Suhas Shekar
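
Written out with the string literals, that suppression snippet is (a minimal sketch; place it before the StreamingContext is created):

    import org.apache.log4j.{Level, Logger}

    // turn off the chatty INFO logging from Spark and Akka
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)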

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Akhil Das
How many cores are you allocated/seeing in the web UI? (That usually runs on 8080; for Cloudera I think it's 18080.) Most likely the job is being allocated 1 core (it should be >= 2 cores) and that's why the count is never happening. Thanks Best Regards On Mon, Dec 29, 2014 at 2:22 PM, Suhas Shekar

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Akhil Das
You don't submit it like that :/ You use local[*] when you run the job in local mode, whereas here you are running it in standalone cluster mode. You can try either of these: 1. /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.10.12/lib/spark/bin/spark-submit --class SimpleApp --master
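
The distinction being drawn is roughly the following two submission styles (the class name, jar, and master URL are placeholders, not the poster's actual values):

    # local mode: local[2] just means "run with 2 local threads"
    spark-submit --class SimpleApp --master "local[2]" simple-app.jar

    # standalone cluster: point --master at the standalone master URL instead
    spark-submit --class SimpleApp --master spark://master-host:7077 simple-app.jar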

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Suhas Shekar
I thought I was running it in local mode as http://spark.apache.org/docs/1.1.1/submitting-applications.html says that if I don't include --deploy-mode cluster then it will run as local mode? I tried both of the scripts above and they gave the same result as the script I was running before. Also,

Re: Spark 1.2.0 Yarn not published

2014-12-29 Thread Aniket Bhatnagar
Thanks Ted. Adding a dependency on spark-network-yarn would allow resolution of YarnShuffleService, which from the docs suggests that it runs on the Node Manager and perhaps isn't useful while submitting jobs programmatically. I think what I need is a dependency on the spark-yarn module so that classes like

Re: recent join/iterator fix

2014-12-29 Thread Sean Owen
It wasn't so much the cogroup that was optimized here, but what is done to the result of cogroup. Yes, it was a matter of not materializing the entire result of a flatMap-like function after the cogroup, since this will accept just an Iterator (actually, TraversableOnce). I'd say that wherever

Re: Dynamic Allocation in Spark 1.2.0

2014-12-29 Thread Anders Arpteg
Thanks Tsuyoshi and Shixiong for the info. Awesome to have more documentation about the feature! I was afraid that the node manager needed reconfiguration (and a restart). Any idea how many resources the shuffle service will take on the node manager? In a multi-tenant Hadoop cluster environment, it

Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Mickalas
I see that there is already a request to add wildcard support to the SQLContext.parquetFile function https://issues.apache.org/jira/browse/SPARK-3928. What seems like a useful thing for our use case is to associate the directory structure with certain columns in the table, but it does not seem

Re: Using TF-IDF from MLlib

2014-12-29 Thread Sean Owen
Given (label, terms) you can just transform the values to a TF vector, then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for? On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote: I found the TF-IDF
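
A minimal sketch of that flow (variable names and the toy data are illustrative; `sc` is an existing SparkContext):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // (label, tokenized document) pairs
    val labeledTerms: RDD[(Double, Seq[String])] = sc.parallelize(Seq(
      (1.0, Seq("spark", "streaming", "kafka")),
      (0.0, Seq("hadoop", "yarn", "shuffle"))))

    val tf = new HashingTF().transform(labeledTerms.map(_._2))   // term-frequency vectors
    tf.cache()                                                   // IDF reads tf twice (fit, then transform)
    val tfidf = new IDF().fit(tf).transform(tf)                  // rescale by inverse document frequency
    val points = labeledTerms.map(_._1).zip(tfidf)
      .map { case (label, vector) => LabeledPoint(label, vector) }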

How to set up spark sql on ec2

2014-12-29 Thread critikaled
How do I make the spark ec2 script install Hive and Spark SQL on EC2? When I run the spark ec2 script, go to bin, run ./spark-sql, and execute a query, I'm getting connection refused on master:9000. What else has to be configured for this?

What are all the Hadoop Major Versions in spark-ec2 script?

2014-12-29 Thread critikaled
So what should the value of --hadoop-major-version be for the following Hadoop versions? Hadoop 1.x is 1; what about CDH4, Hadoop 2.3, Hadoop 2.4, MapR 3.x, MapR 4.x?

Clustering text data with MLlib

2014-12-29 Thread jatinpreet
Hi, I wish to cluster a set of textual documents into an undefined number of classes. The clustering algorithm provided in MLlib, i.e. K-means, requires me to give a pre-defined number of classes. Is there any algorithm which is intelligent enough to identify how many classes should be made based on

Re: action progress in ipython notebook?

2014-12-29 Thread Aniket Bhatnagar
Hi Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote: The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible to

Re: Clustering text data with MLlib

2014-12-29 Thread Sean Owen
You can try several values of k, apply some evaluation metric to the clustering, and then use that to decide what k is best, or at least pretty good. If it's a completely unsupervised problem, the metrics you can use tend to be some function of the inter-cluster and intra-cluster distances (good
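
One concrete version of this for MLlib's KMeans is to sweep k and compare the within-cluster sum of squared errors via computeCost, then pick the "elbow" where the improvement levels off (a sketch, assuming `data` is an RDD[Vector] of document features such as TF-IDF):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def sweepK(data: RDD[Vector], candidates: Seq[Int]): Seq[(Int, Double)] =
      candidates.map { k =>
        val model = KMeans.train(data, k, 20)   // 20 iterations per candidate k
        (k, model.computeCost(data))            // within-cluster sum of squared errors
      }

    // example: sweepK(data, Seq(2, 5, 10, 20, 50)).foreach(println)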

Spark Configurations

2014-12-29 Thread Chirag Aggarwal
Hi, It seems that spark-defaults.conf is not read by spark-sql. Is it used only by spark-shell? Thanks, Chirag

Re: Spark Configurations

2014-12-29 Thread Akhil Das
I believe if you use spark-shell or spark-submit, then it will pick up the conf from spark-defaults.conf. If you are running an independent application then you can set all those confs while creating the SparkContext. Thanks Best Regards On Mon, Dec 29, 2014 at 5:40 PM, Chirag Aggarwal
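
For an independent application, the same settings that would live in spark-defaults.conf can be put on the SparkConf before the context is created (the property values here are only examples):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://master-host:7077")   // placeholder master URL
      .set("spark.executor.memory", "2g")      // anything you would otherwise put in spark-defaults.conf
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)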

DecisionTree Algorithm used in Spark MLLib

2014-12-29 Thread Anoop Shiralige
Hi All, I am trying to do a comparison by building the model locally using R and on a cluster using Spark. There is some difference in the results. Any idea what the internal implementation of Decision Tree in Spark MLlib is (ID3, C4.5, C5.0, or CART)? Thanks, AnoopShiralige

python: module pyspark.daemon not found

2014-12-29 Thread Naveen Kumar Pokala
14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException ( Error from python worker:

Re: Using TF-IDF from MLlib

2014-12-29 Thread andy petrella
Here is what I did for this case: https://github.com/andypetrella/tf-idf On Mon, Dec 29, 2014 at 11:31, Sean Owen so...@cloudera.com wrote: Given (label, terms) you can just transform the values to a TF vector, then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a

Re: Can't submit the SparkPi example to local Yarn 2.6.0 installed byambari 1.7.0

2014-12-29 Thread guxiaobo1982
/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 --queue thequeue lib/spark-examples-1.2.0-hadoop2.6.0.jar 10 Got the same error with the above command; I think I missed the jar

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
It wasn't so much the cogroup that was optimized here, but what is done to the result of cogroup. Right. Yes, it was a matter of not materializing the entire result of a flatMap-like function after the cogroup, since this will accept just an Iterator (actually, TraversableOnce). Yeah...I

Spark profiler

2014-12-29 Thread rapelly kartheek
Hi, I want to find the time taken to replicate an RDD in a Spark cluster, along with the computation time on the replicated RDD. Can someone please suggest a suitable Spark profiler? Thank you

Re: recent join/iterator fix

2014-12-29 Thread Sean Owen
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Yeah...I was trying to poke around, are the Iterables that Spark passes into cogroup already materialized (e.g. the bug was making a copy of an already-in-memory list) or are the Iterables streaming? The result

Re: recent join/iterator fix

2014-12-29 Thread Shixiong Zhu
The Iterable from cogroup is CompactBuffer, which is already materialized. It's not a lazy Iterable. So for now Spark cannot handle skewed data where some key has too many values to fit into memory.

Re: action progress in ipython notebook?

2014-12-29 Thread Josh Rosen
It's accessed through the `statusTracker` field on SparkContext. *Scala*: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker *Java*: https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkStatusTracker.html Don't create new
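
A small sketch of polling it from the driver while a job is running in another thread (method names as in the SparkStatusTracker docs linked above; `sc` is an existing SparkContext):

    val tracker = sc.statusTracker
    for (jobId <- tracker.getActiveJobIds;
         jobInfo <- tracker.getJobInfo(jobId);
         stageId <- jobInfo.stageIds;
         stageInfo <- tracker.getStageInfo(stageId)) {
      println(s"stage $stageId: ${stageInfo.numCompletedTasks}/${stageInfo.numTasks} tasks complete")
    }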

Re: action progress in ipython notebook?

2014-12-29 Thread Aniket Bhatnagar
Thanks Josh. Looks promising. I will give it a try. Thanks, Aniket On Mon, Dec 29, 2014, 9:55 PM Josh Rosen rosenvi...@gmail.com wrote: It's accessed through the `statusTracker` field on SparkContext. *Scala*:

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
Hi Shixiong, The Iterable from cogroup is CompactBuffer, which is already materialized. It's not a lazy Iterable. So for now Spark cannot handle skewed data where some key has too many values to fit into memory. Cool, thanks for the confirmation. - Stephen

Re: ReliableDeliverySupervisor: Association with remote system

2014-12-29 Thread SamyaMaiti
Resolved. I changed to the Apache Hadoop 2.4.0 / Apache Spark 1.2.0 combination and all works fine. It must be because the 1.2.0 version of Spark was compiled with Hadoop 2.4.0.

Re: Clustering text data with MLlib

2014-12-29 Thread Suneel Marthi
Here's the Streaming KMeans from Spark 1.2: http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1 Streaming KMeans still needs an initial 'k' to be specified; it then progresses to come up with an optimal 'k', IIRC. From: Sean Owen so...@cloudera.com To: jatinpreet

Re: EC2 VPC script

2014-12-29 Thread Eduardo Cusa
I'm running the master branch. Finally I could make it work by changing all occurrences of the *public_dns_name* property to *private_ip_address* in the spark_ec2.py script. My VPC instances always have a null value in the *public_dns_name* property. Now my script only works for VPC instances. Regards

Re: Spark profiler

2014-12-29 Thread Boromir Widas
It would be very helpful if there were any such tool, but the distributed nature may be difficult to capture. I had been trying to run a task where merging the accumulators was taking an inordinately long time, and this was not reflected in the standalone cluster's web UI. What I think would be useful is

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Sandy Ryza
Hi Mukesh, Based on your spark-submit command, it looks like you're only running with 2 executors on YARN. Also, how many cores does each machine have? -Sandy On Mon, Dec 29, 2014 at 4:36 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hello Experts, I'm bench-marking Spark on YARN (

Re: Clustering text data with MLlib

2014-12-29 Thread Paco Nathan
Jatin, One approach for determining K would be to sample the data set and run PCA. Then evaluate how many of the resulting eigenvalue/eigenvector pairs to use before you reach diminishing returns on cumulative error. That number provides a reasonably good value for K to use in KMeans. With

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Mukesh Jha
Sorry Sandy, The command is just for reference but I can confirm that there are 4 executors and a driver as shown in the Spark UI page. Each of these machines is an 8 core box with ~15G of RAM. On Mon, Dec 29, 2014 at 11:23 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Mukesh, Based on

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Mukesh Jha
And this is with spark version 1.2.0. On Mon, Dec 29, 2014 at 11:43 PM, Mukesh Jha me.mukesh@gmail.com wrote: Sorry Sandy, The command is just for reference but I can confirm that there are 4 executors and a driver as shown in the spark UI page. Each of these machines is a 8 core box

Re: action progress in ipython notebook?

2014-12-29 Thread Nicholas Chammas
On Mon, Dec 29, 2014 at 12:00 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty despite my calling cache(). Just a small note on this, and perhaps you already know: Calling cache() is not enough to cache something and
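
In other words, cache() only marks the RDD; it appears on the Storage page only after some action has actually computed it. A tiny sketch:

    val data = sc.textFile("/some/path")   // placeholder path
    data.cache()                           // only marks the RDD for caching; nothing is computed yet
    data.count()                           // the first action materializes and caches the partitions
    // only after an action like this will the RDD show up under the web UI's Storage tab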

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Sandy Ryza
Are you setting --num-executors to 8? On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com wrote: Sorry Sandy, The command is just for reference but I can confirm that there are 4 executors and a driver as shown in the spark UI page. Each of these machines is a 8 core box

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Sandy Ryza
*oops, I mean are you setting --executor-cores to 8 On Mon, Dec 29, 2014 at 10:15 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Are you setting --num-executors to 8? On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com wrote: Sorry Sandy, The command is just for reference

How to pass options to KeyConverter using PySpark

2014-12-29 Thread Brett Meyer
I'm running PySpark on YARN, and I'm reading in SequenceFiles for which I have a custom KeyConverter class. My KeyConverter needs to have some configuration options passed to it, but I am unable to find a way to get the options to that class without modifying the Spark source. Is there a

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Mukesh Jha
Nope, I am setting 5 executors with 2 cores each. Below is the command that I'm using to submit in YARN mode. This starts up 5 executor nodes and a driver as per the Spark application master UI. spark-submit --master yarn-cluster --num-executors 5 --driver-memory 1024m --executor-memory 1024m

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Sandy Ryza
When running in standalone mode, each executor will be able to use all 8 cores on the box. When running on YARN, each executor will only have access to 2 cores. So the comparison doesn't seem fair, no? -Sandy On Mon, Dec 29, 2014 at 10:22 AM, Mukesh Jha me.mukesh@gmail.com wrote: Nope, I
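
For an apples-to-apples comparison, the YARN submission would give each executor the full box, roughly like this (memory values and the class/jar names are placeholders):

    spark-submit --master yarn-cluster \
      --num-executors 5 \
      --executor-cores 8 \
      --executor-memory 8g \
      --driver-memory 1024m \
      --class com.example.StreamingApp streaming-app.jar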

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-29 Thread Mukesh Jha
Makes sense. I've also tried it in standalone mode where all 3 workers and the driver were running on the same 8 core box, and the results were similar. Anyway, I will share the results in YARN mode with 8 core YARN containers. On Mon, Dec 29, 2014 at 11:58 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

Re: action progress in ipython notebook?

2014-12-29 Thread Matei Zaharia
Hey Eric, sounds like you are running into several issues, but thanks for reporting them. Just to comment on a few of these: I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty despite my calling cache(). This is expected until you compute the RDDs the first time

Re: trying to understand yarn-client mode

2014-12-29 Thread Fernando Otero
were you able to fix the issue? I'm facing a similar issue when trying to use yarn client from spark-jobserver

Clean up app folders in worker nodes

2014-12-29 Thread hutashan
Hello All, I need to clean up the app folders (including the downloaded app jars) under the Spark work folder. I have tried setting the below configuration in spark-env, but it is not working as expected. SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=10 I cannot clear
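
For reference, the standalone worker cleanup also has a TTL controlling how old an application directory must be before it is removed; a spark-env.sh sketch (the interval and TTL values, in seconds, are only illustrative):

    # conf/spark-env.sh on each standalone worker
    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=86400"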

Re: Clean up app folders in worker nodes

2014-12-29 Thread Michael Quinlan
I'm also interested in the solution to this. Thanks, Mike On Mon, Dec 29, 2014 at 12:01 PM, hutashan [via Apache Spark User List] ml-node+s1001560n20889...@n3.nabble.com wrote: Hello All, I need to clean up app folder(include app downloaded jar) in spark under work folder. I have tried

Submit spark jobs inside web application

2014-12-29 Thread Corey Nolet
I want to have a SparkContext inside of a web application running in Jetty that I can use to submit jobs to a cluster of Spark executors. I am running YARN. Ultimately, I would love it if I could just use something like SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp

Re: Failed to read chunk exception

2014-12-29 Thread rajnish
I am facing the same issue in the spark-1.1.0 version: 14/12/29 20:44:31 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.1 (TID 1373, X.X.X.X , ANY, 2185 bytes) 14/12/29 20:44:31 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 3.0 (TID 1367, iX.X.X.X): java.io.IOException: failed to

Re: action progress in ipython notebook?

2014-12-29 Thread Eric Friedman
On Dec 29, 2014, at 1:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Eric, sounds like you are running into several issues, but thanks for reporting them. Just to comment on a few of these: I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty despite

Re: python: module pyspark.daemon not found

2014-12-29 Thread Eric Friedman
Was your spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It prevents them from being used on PYTHONPATH. You can rejar with Java 6 for better results. Eric Friedman On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com

Building Spark 1.2 jmx and jmxtools issue?

2014-12-29 Thread John Omernik
I am trying to do the sbt assembly for spark 1.2 sbt/sbt -Pmapr4 -Phive -Phive-thriftserver assembly and I am getting the errors below. Any thoughts? Thanks in advance! [warn] :: [warn] :: FAILED DOWNLOADS:: [warn] :: ^ see

How to tell if RDD no longer has any children

2014-12-29 Thread Corey Nolet
Let's say I have an RDD which gets cached and has two children which do something with it: val rdd1 = ...cache() rdd1.saveAsSequenceFile() rdd1.groupBy()..saveAsSequenceFile() If I were to submit both calls to saveAsSequenceFile() in threads to take advantage of concurrency (where
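
A sketch of kicking off the two actions concurrently from the driver using plain Scala Futures (the paths, grouping key, and the mkString step are placeholders; `sc` is an existing SparkContext):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.SparkContext._   // pair-RDD / sequence-file implicits on Spark 1.2

    val rdd1 = sc.textFile("/input/path").map(line => (line.take(1), line)).cache()

    // both actions run concurrently and share the cached rdd1
    val f1 = Future { rdd1.saveAsSequenceFile("/out/raw") }
    val f2 = Future { rdd1.groupByKey().mapValues(_.mkString(",")).saveAsSequenceFile("/out/grouped") }
    Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)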

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-29 Thread Suhas Shekar
Got it to work...thanks a lot for the help! I started a new cluster where Spark has Yarn as a dependency. I ran it with the script with local[2] and it worked (this same script did not work with Spark in standalone mode). A follow up question...I have seen this question posted around the internet

Re: Building Spark 1.2 jmx and jmxtools issue?

2014-12-29 Thread Ted Yu
I got the same error when specifying -Pmapr4. For the following command: sbt/sbt -Pyarn -Phive -Phive-thriftserver assembly I got: how can getCommonSuperclass() do its job if different class symbols get the same bytecode-level internal name: org/apache/spark/sql/catalyst/dsl/package$ScalaUdfBuilder

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
Hi All, I have tried to pass the properties via the SparkContext.setLocalProperty and HiveContext.setConf, both failed. Based on the results (haven't get a chance to look into the code yet), HiveContext will try to initiate the JDBC connection right away, I couldn't set other properties

Re: Using TF-IDF from MLlib

2014-12-29 Thread Xiangrui Meng
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui On Mon, Dec 29, 2014 at 5:22 AM, andy petrella

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
A follow-up on the hive-site.xml: if you 1. specify it in spark/conf, then you can NOT also apply it via the --driver-class-path option; otherwise, you will get the following exception when initializing the SparkContext: org.apache.spark.SparkException: Found both spark.driver.extraClassPath

Re: Compile error from Spark 1.2.0

2014-12-29 Thread Michael Armbrust
Yeah, this looks like a regression in the API due to the addition of arbitrary decimal support. Can you open a JIRA? On Sun, Dec 28, 2014 at 12:23 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Zigen, Looks like they missed it. Thanks Best Regards On Sat, Dec 27, 2014 at 12:43 PM,

Re: weights not changed with different reg param

2014-12-29 Thread Xiangrui Meng
Could you post your code? It sounds like a bug. One thing to check is whether you set regType, which is None by default. -Xiangrui On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan thomas.k...@manage.com wrote: Hi there We are on mllib 1.1.1, and trying different regularization parameters. We

Re: MLLib beginner question

2014-12-29 Thread Xiangrui Meng
b0c1, did you apply model.predict to a DStream? Maybe it would help understand your question better if you can post your code. -Xiangrui On Tue, Dec 23, 2014 at 11:54 AM, boci boci.b...@gmail.com wrote: Xiangrui: Hi, I want to using this with streaming and with job too. I using kafka

Re: SVDPlusPlus Recommender in MLLib

2014-12-29 Thread Xiangrui Meng
There is an SVD++ implementation in GraphX. It would be nice if you can compare its performance vs. Mahout. -Xiangrui On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani prafulla.w...@gmail.com wrote: hi , Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is implemented in

Spark Streaming: HiveContext within Custom Actor

2014-12-29 Thread sranga
Hi, Could Spark SQL be used from within a custom actor that acts as a receiver for a streaming application? If yes, what is the recommended way of passing the SparkContext to the actor? Thanks for your help. - Ranga

Re: trying to understand yarn-client mode

2014-12-29 Thread firemonk9
I was able to fix it by adding the jars (in the Spark distribution) to the classpath. In my sbt file I changed the scope to provided. Let me know if you need more details.

Shuffle write increases in spark 1.2

2014-12-29 Thread Kevin Jung
Hi all, The size of shuffle write shown in the Spark web UI is much different when I execute the same Spark job on the same input data (100GB) in Spark 1.1 and Spark 1.2. At the same sortBy stage, the size of shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2. I set spark.shuffle.manager

Cached RDD

2014-12-29 Thread Corey Nolet
If I have 2 RDDs which depend on the same RDD like the following: val rdd1 = ... val rdd2 = rdd1.groupBy()... val rdd3 = rdd1.groupBy()... If I don't cache rdd1, will its lineage be calculated twice (once for rdd2 and once for rdd3)?
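
Yes. Without caching, each action on a descendant re-runs rdd1's lineage from the source. A minimal sketch of avoiding that (the path and grouping keys are placeholders; `sc` is an existing SparkContext):

    val rdd1 = sc.textFile("/input/path")   // placeholder source
    rdd1.cache()                            // without this, each count() below recomputes rdd1 from the file
    val rdd2 = rdd1.groupBy(_.length)
    val rdd3 = rdd1.groupBy(_.isEmpty)
    rdd2.count()   // reads the file, computes rdd1, and caches its partitions
    rdd3.count()   // reuses the cached rdd1 partitions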

Re: A question about using insert into in rdd foreach in spark 1.2

2014-12-29 Thread Michael Armbrust
-dev +user In general you cannot create new RDDs inside closures that run on the executors (which is what sql inside of a foreach is doing). I think what you want here is something like: sqlContext.parquetFile(Data\\Test\\Parquet\\2).registerTempTable(temp2) sql(SELECT col1, col2 FROM
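
With the string literals restored, the suggested shape is roughly the following (the table name and columns come from the snippet; the rest of the query and the printing are assumptions, since the original message is truncated):

    // `sqlContext` is an org.apache.spark.sql.SQLContext built over the existing SparkContext
    sqlContext.parquetFile("Data\\Test\\Parquet\\2").registerTempTable("temp2")
    val result = sqlContext.sql("SELECT col1, col2 FROM temp2")
    result.collect().foreach(println)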

Re: Spark Configurations

2014-12-29 Thread Chirag Aggarwal
Is there a way to get these set by default in the spark-sql shell? Thanks, Chirag From: Akhil Das ak...@sigmoidanalytics.com Date: Monday, 29 December 2014 5:53 PM To: Chirag Aggarwal chirag.aggar...@guavus.com Cc:

Re: Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Michael Armbrust
You can't do this now without writing a bunch of custom logic (see here for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ) I would like to make this easier as part of improvements to the datasources api that we are

Re: How to set up spark sql on ec2

2014-12-29 Thread Michael Armbrust
I would expect this to work. Can you run the standard spark-shell? On Mon, Dec 29, 2014 at 2:34 AM, critikaled isasmani@gmail.com wrote: How to make the spark ec2 script to install hive and spark sql on ec2 when I run the spark ec2 script and go to bin and run ./spark-sql and execute