Re: createDirectStream and Stats

2015-06-19 Thread Cody Koeninger
When you say your old version was k = createStream ..., were you manually creating multiple receivers? Because otherwise you're only using one receiver on one executor... If that's the case, I'd try the direct stream without the repartitioning. On Fri, Jun 19, 2015 at 6:43 PM, Tim Smith
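For context, the receiver-based pattern being asked about — several createStream receivers unioned into one DStream — looks roughly like this. A minimal sketch, not code from the thread; the ZooKeeper quorum, group id, topic name, and receiver count are all illustrative:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Sketch: N receiver-based Kafka streams, unioned into one DStream.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams)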

Re: Missing values support in Mllib yet?

2015-06-19 Thread DB Tsai
Not really yet. But at work, we do GBDT missing-value imputation, so I have an interest in porting it to MLlib if I have enough time. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Jun 19, 2015 at 1:23

Re: Assigning number of workers in spark streaming

2015-06-19 Thread anshu shukla
Thanks a lot! But in client mode, can we assign the number of workers/nodes as a flag parameter to the spark-submit command? And by default, how will it distribute the load across the nodes? # Run on a Spark Standalone cluster in client deploy mode ./bin/spark-submit \ --class
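A complete version of that command might look like the following; on standalone, the closest thing to "number of workers" is capping the total cores the app may take. The master URL, class name, and resource numbers are illustrative:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --executor-memory 4G \
      --total-executor-cores 16 \
      /path/to/examples.jar

By default a standalone application takes all available cores and the master spreads its executors across the nodes.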

Re: RE: Spark or Storm

2015-06-19 Thread Tathagata Das
I agree with Cody. It's pretty hard for any framework to provide built-in support for that, since the semantics completely depend on what data store you want to use it with. Providing interfaces does help a little, but even with those interfaces, the user still has to do most of the heavy lifting;

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
Yes, please tell us what operation you are using. TD On Fri, Jun 19, 2015 at 11:42 AM, Cody Koeninger c...@koeninger.org wrote: Is there any more info you can provide / relevant code? On Fri, Jun 19, 2015 at 1:23 PM, Tim Smith secs...@gmail.com wrote: Update on performance of the new API:

Re: Serial batching with Spark Streaming

2015-06-19 Thread Michal Čizmazia
Thanks Tathagata! I will use foreachRDD()/foreachPartition() instead of transform() then. Does the default scheduler initiate the execution of batch X+1 after batch X even if tasks for batch X need to be retried due to failures? If not, please could you suggest workarounds
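The suggested output pattern looks roughly like this — a sketch only, with a hypothetical writer object standing in for whatever sink is used:

    // Sketch: side effects go in an output operation, one connection
    // per partition, rather than in transform().
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val writer = MyOutputWriter.open()      // hypothetical writer
        records.foreach(rec => writer.write(rec))
        writer.close()
      }
    }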

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
Essentially, I went from: k = createStream ... val dataout = k.map(x => myFunc(x._2, someParams)) dataout.foreachRDD(rdd => rdd.foreachPartition(rec => { myOutputFunc.write(rec) })) To: kIn = createDirectStream ... k = kIn.repartition(numberOfExecutors) // since #kafka partitions < #spark-executors

Un-persist RDD in a loop

2015-06-19 Thread afarahat
Hello; I am trying to get the optimal number of factors in ALS. To that end, I am scanning various values and evaluating the RSE. Do I need to un-persist the RDDs between loops, or will the resources (memory) get automatically freed and re-assigned between iterations? for i in range(5):
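Cached RDDs that fall out of scope are eventually cleaned up automatically, but releasing them explicitly keeps memory usage predictable during a scan like this. A Scala sketch of the same loop (the thread's code is PySpark; ratings is assumed to be an RDD[Rating], and computeError is a hypothetical evaluation helper):

    import org.apache.spark.mllib.recommendation.ALS

    ratings.cache()
    for (rank <- Seq(8, 12, 16, 20, 24)) {
      val model = ALS.train(ratings, rank, 10, 0.01)
      println(s"rank=$rank error=${computeError(model, ratings)}")  // hypothetical helper
      // Release the factor RDDs before the next iteration.
      model.userFeatures.unpersist()
      model.productFeatures.unpersist()
    }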

Re: Assigning number of workers in spark streaming

2015-06-19 Thread Tathagata Das
All the basic parameters apply to both client and cluster mode. The only difference between client and cluster mode is that the driver will be run in the cluster, and there are some additional parameters to configure that. Other params are common. Isn't it clear from the docs? On Fri, Jun 19,

Re: Serial batching with Spark Streaming

2015-06-19 Thread Tathagata Das
I see what the problem is. You are adding a sleep in the transform operation. The transform function is called at the time of preparing the Spark jobs for a batch. It should not run any time-consuming operation like an RDD action or a sleep. Since this operation needs to run every batch

Re: Assigning number of workers in spark streaming

2015-06-19 Thread Tathagata Das
Depends on what cluster manager you are using. It's all pretty well documented in the online documentation. http://spark.apache.org/docs/latest/submitting-applications.html On Fri, Jun 19, 2015 at 2:29 PM, anshu shukla anshushuk...@gmail.com wrote: Hey, [For Client Mode] 1- Is there any

PySpark on YARN port out of range

2015-06-19 Thread John Meehan
Has anyone encountered this “port out of range” error when launching PySpark jobs on YARN? It is sporadic (e.g. 2/3 jobs get this error). LOG: 15/06/19 11:49:44 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 39.0 (TID 211) on executor xxx.xxx.xxx.com:

Re: NaiveBayes for MLPipeline is absent

2015-06-19 Thread Xiangrui Meng
Hi Justin, We plan to add it in 1.5, along with some other estimators. We are now preparing a list of JIRAs, but feel free to create a JIRA for this and submit a PR:) Best, Xiangrui On Thu, Jun 18, 2015 at 6:35 PM, Justin Yip yipjus...@prediction.io wrote: Hello, Currently, there is no

Missing values support in Mllib yet?

2015-06-19 Thread Arun Luthra
Hi, Is there any support for handling missing values in mllib yet, especially for decision trees where this is a natural feature? Arun

Assigning number of workers in spark streaming

2015-06-19 Thread anshu shukla
Hey, [For Client Mode] 1- Is there any way to assign the number of workers from a cluster that should be used for a particular application? 2- If not, then how does the Spark scheduler decide the scheduling of different applications inside one full logic? Say my logic has {inputStream

Re: About Jobs UI in yarn-client mode

2015-06-19 Thread Andrew Or
Did you make sure that the YARN IP is not an internal address? If it still doesn't work, then it seems like an issue on the YARN side... 2015-06-19 8:48 GMT-07:00 Sea 261810...@qq.com: Hi, all: I run spark on yarn, I want to see the Jobs UI http://ip:4040/, but it redirects to http://

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
I don't think there were any enhancements that would change this behavior. On Fri, Jun 19, 2015 at 6:16 PM, Tim Smith secs...@gmail.com wrote: On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das t...@databricks.com wrote: Also, can you find from the spark UI the break up of the stages in each batch's

Re: createDirectStream and Stats

2015-06-19 Thread Cody Koeninger
If that's the case, you're still only using as many read executors as there are kafka partitions. I'd remove the repartition. If you weren't doing any shuffles in the old job, and are doing a shuffle in the new job, it's not really comparable. On Fri, Jun 19, 2015 at 8:16 PM, Tim Smith

Re: [ERROR] Insufficient Space

2015-06-19 Thread Vadim Bichutskiy
Awesome. -Vadim On Fri, Jun 19, 2015 at 8:30 PM, Kelly, Jonathan jonat...@amazon.com wrote: Yep, I'm on the EMR team at Amazon, and I was at the Spark Summit. ;-) So of course I'm biased toward EMR, even over EC2. I'm not sure if there's a way to resize an EC2 Spark cluster, or at least

Re: [ERROR] Insufficient Space

2015-06-19 Thread Ruslan Dautkhanov
Vadim, you could edit /etc/fstab, then issue mount -o remount to give more shared memory online. I didn't know Spark uses shared memory. Hope this helps. On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Hello Spark Experts, I've been running a standalone Spark
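Assuming the shared-memory mount in question is the usual tmpfs at /dev/shm, the online resize might look like this (the size is illustrative):

    mount -o remount,size=8G /dev/shm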

Re: What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Andrew Or
Hi Elkhan, Spark submit depends on several things: the launcher jar (1.3.0+ only), the spark-core jar, and the spark-yarn jar (in your case). Why do you want to put it in HDFS though? AFAIK you can't execute scripts directly from HDFS; you need to copy them to a local file system first. I don't

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
I did try without repartition, initially, but that was even more horrible, because instead of the allocated 100 executors, only 30 (which is the number of kafka partitions) would have to do the work. MyFunc is a CPU-bound task, so adding more memory per executor wouldn't help, and I saw that each

Re: createDirectStream and Stats

2015-06-19 Thread Cody Koeninger
So were you repartitioning with the original job as well? On Fri, Jun 19, 2015 at 9:36 PM, Tim Smith secs...@gmail.com wrote: I did try without repartition, initially, but that was even more horrible because instead of the allocated 100 executors, only 30 (which is the number of kafka

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
Also, can you find from the spark UI the break up of the stages in each batch's jobs, and find which stage is taking more time after a while? On Fri, Jun 19, 2015 at 4:51 PM, Cody Koeninger c...@koeninger.org wrote: when you say your old version was k = createStream . were you

Re: [ERROR] Insufficient Space

2015-06-19 Thread Vadim Bichutskiy
Thanks Jonathan. I should totally move to EMR. Spark on EMR was announced at Spark Summit! There's no easy way to resize the cluster on EC2. You basically have to destroy it and launch a new one. Right? -Vadim On Fri, Jun 19, 2015 at 3:41 PM, Kelly, Jonathan jonat...@amazon.com wrote:

Re: PySpark on YARN port out of range

2015-06-19 Thread Andrew Or
Hm, one thing to see is whether the same port appears many times (1315905645). The way pyspark works today is that the JVM reads the port from the stdout of the python process. If there is some interference in output from the python side (e.g. any print statements, exception messages), then the

Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Hi Raghav, If you want to make changes to Spark and run your application with it, you may follow these steps. 1. git clone g...@github.com:apache/spark 2. cd spark; build/mvn clean package -DskipTests [...] 3. make local changes 4. build/mvn package -DskipTests [...] (no need to clean again

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Mohammed Guller
Hi Matthew, It looks fine to me. I have built a similar service that allows a user to submit a query from a browser and returns the result in JSON format. Another alternative is to leave a Spark shell or one of the notebooks (Spark Notebook, Zeppelin, etc.) session open and run queries from

Re: Spark on Yarn - How to configure

2015-06-19 Thread Andrew Or
Hi Ashish, For Spark on YARN, you actually only need the Spark files on one machine - the submission client. This machine could even live outside of the cluster. Then all you need to do is point YARN_CONF_DIR to the directory containing your hadoop configuration files (e.g. yarn-site.xml) on that
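Putting that together, the submission-client setup might look like this; the paths and class name are illustrative:

    export YARN_CONF_DIR=/etc/hadoop/conf   # directory holding yarn-site.xml etc.
    ./bin/spark-submit \
      --master yarn-client \
      --class com.example.MyApp \
      /path/to/my-app.jar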

Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Hi Raghav, I'm assuming you're using standalone mode. When using the Spark EC2 scripts you need to make sure that every machine has the most updated jars. Once you have built on one of the nodes, you must rsync the Spark directory to the rest of the nodes (see /root/spark-ec2/copy-dir). That

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das t...@databricks.com wrote: Also, can you find from the spark UI the break up of the stages in each batch's jobs, and find which stage is taking more time after a while? Sure, will try to debug/troubleshoot. Are there enhancements to this specific

Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Raghav Shankar
Thanks Andrew! Is this all I have to do when using the spark ec2 script to set up a spark cluster? It seems to be getting an assembly jar that is not from my project (perhaps from a maven repo). Is there a way to make the ec2 script use the assembly jar that I created? Thanks, Raghav On Friday,

Re: What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Elkhan Dadashov
Thanks Andrew. We cannot include Spark in our Java project due to dependency issues. Spark will not be exposed to clients. What we want to do is put the Spark tarball (in the worst case) into HDFS, so that our Java app, which runs in local mode, can launch the spark-submit script with the user's Python files.

RE: Build spark application into uber jar

2015-06-19 Thread prajod.vettiyattil
"...but when I run the application locally, it complains that spark related stuff is missing" I use the uber jar option. What do you mean by “locally”? In the Spark scala shell? In the From: bit1...@163.com [mailto:bit1...@163.com] Sent: 19 June 2015 08:11 To: user Subject: Build spark

RE: Re: Build spark application into uber jar

2015-06-19 Thread prajod.vettiyattil
Hi, When running inside the Eclipse IDE, I use another maven target to build; that is the default maven target. For building the uber jar, I use the assembly jar target. So use two maven build targets in the same pom file to solve this issue. In maven you can have multiple build targets, and each

Re: Header in each output files.

2015-06-19 Thread rahulkumar-aws
Just check this Stack Overflow link, it may help: http://stackoverflow.com/questions/26157456/add-a-header-before-text-file-on-save-in-spark - Software Developer Sigmoid (SigmoidAnalytics), India --

Re: RE: Spark or Storm

2015-06-19 Thread Enno Shioji
Tbh I find the doc around this a bit confusing. If it says end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional), I think most people will interpret it as: as long as you use a storage system which has atomicity (like MySQL/Postgres etc.), a successful

Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-06-19 Thread rahulkumar-aws
It looks like your Spark-Hive jars are not compatible with Spark; compile the Spark source with the Hive 13 flag: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package. It will solve your problem. - Software Developer Sigmoid (SigmoidAnalytics), India

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-19 Thread Akhil Das
One workaround would be to remove/move the files from the input directory once you have processed them. Thanks Best Regards On Fri, Jun 19, 2015 at 5:48 AM, Haopu Wang hw...@qilinsoft.com wrote: Akhil, From my test, I can see the files in the last batch will always be reprocessed upon

Re: RE: Build spark application into uber jar

2015-06-19 Thread bit1...@163.com
Thanks. I guess what you mean by maven build target is a maven profile. I added two profiles, one is LocalRun, the other is ClusterRun, for the spark-related artifact scope. That way, I don't have to change the pom file but just select a profile. <profile> <id>LocalRun</id> <properties>
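A minimal sketch of that two-profile arrangement — the property name and scope values are illustrative, with the spark dependency then declaring <scope>${spark.scope}</scope>:

    <profiles>
      <profile>
        <id>LocalRun</id>
        <properties><spark.scope>compile</spark.scope></properties>
      </profile>
      <profile>
        <id>ClusterRun</id>
        <properties><spark.scope>provided</spark.scope></properties>
      </profile>
    </profiles>

The profile is then selected at build time, e.g. mvn package -PClusterRun.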

Re: RE: Build spark application into uber jar

2015-06-19 Thread bit1...@163.com
Sure, thanks Prajod for the detailed steps! bit1...@163.com From: prajod.vettiyat...@wipro.com Date: 2015-06-19 16:56 To: bit1...@163.com; ak...@sigmoidanalytics.com CC: user@spark.apache.org Subject: RE: RE: Build spark application into uber jar Multiple maven profiles may be the ideal way.

[ERROR] Insufficient Space

2015-06-19 Thread Vadim Bichutskiy
Hello Spark Experts, I've been running a standalone Spark cluster on EC2 for a few months now, and today I get this error: IOError: [Errno 28] No space left on device Spark assembly has been built with Hive, including Datanucleus jars on classpath OpenJDK 64-Bit Server VM warning: Insufficient

Re: RE: Spark or Storm

2015-06-19 Thread Ashish Soni
My understanding of exactly-once semantics is that it is handled by the framework itself, but it is not very clear from the documentation. I believe the documentation needs to be updated with a simple example so that it is clear to the end user. This is very critical to decide when someone is

ERROR in withColumn method

2015-06-19 Thread Animesh Baranawal
I am trying to perform some insert column operations in a dataframe. Following is the code I used: val df = sqlContext.read.json("examples/src/main/resources/people.json") df.show() { works correctly } df.withColumn("age", df.col("name")) { works correctly } df.withColumn("age", df.col("name")).show()

RE: RE: Spark or Storm

2015-06-19 Thread Haopu Wang
My question is not directly related: about the exactly-once semantics, the document (copied below) says spark streaming gives exactly-once semantics, but actually from my test results, with checkpointing enabled, the application always re-processes the files in the last batch after a graceful restart.

Re: RE: Spark or Storm

2015-06-19 Thread bit1...@163.com
I think your observation is correct; you have to take care of the replayed data at your end, e.g., each message has a unique id or something else. I say "I think" in the above sentence because I am not sure, and I also have a related question: I am wondering how direct stream + kafka is

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread ayan guha
I think you can get spark 1.4 pre-built with hadoop 2.6 (as that is what HDP 2.2 provides) and just start using it. On Fri, Jun 19, 2015 at 10:28 PM, Ashish Soni asoni.le...@gmail.com wrote: I do not know where to start, as Spark 1.2 comes bundled with HDP2.2, but i want to use 1.4 and i do not know

SparkR - issue when starting the sparkR shell

2015-06-19 Thread Kulkarni, Vikram
Hello, I am seeing this issue when starting the sparkR shell. Please note that I have R version 2.14.1. [root@vertica4 bin]# sparkR R version 2.14.1 (2011-12-22) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread ayan guha
What problem are you facing? Are you trying to build it yourself or getting the pre-built version? On Fri, Jun 19, 2015 at 10:22 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi, Is any one able to install Spark 1.4 on HDP 2.2? Please let me know how can i do the same? Ashish -- Best

Re: Latency between the RDD in Streaming

2015-06-19 Thread anshu shukla
How can I know how much time a particular RDD has remained in the pipeline? On Fri, Jun 19, 2015 at 7:59 AM, Tathagata Das t...@databricks.com wrote: Why do you need to uniquely identify the message? All you need is the time when the message was inserted by the receiver, and

Re: RE: Spark or Storm

2015-06-19 Thread Tathagata Das
If the current documentation is confusing, we can definitely improve the documentation. However, I do not understand why the term transactional is confusing. If your output operation has to add 5, then the user has to implement the following mechanism: 1. If the unique id of the batch of data is already present
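A sketch of that mechanism, keying on the batch time plus partition id; the transactional store object and its methods here are hypothetical stand-ins for whatever the downstream system provides:

    import org.apache.spark.TaskContext

    stream.foreachRDD { (rdd, batchTime) =>
      rdd.foreachPartition { records =>
        val uniqueId = s"$batchTime-${TaskContext.get.partitionId}"
        // Hypothetical store: commit the data and the id atomically,
        // and skip the write if this id has already been committed.
        store.transaction {
          if (!store.seen(uniqueId)) {
            records.foreach(store.write)
            store.markSeen(uniqueId)
          }
        }
      }
    }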

Re: Best way to randomly distribute elements

2015-06-19 Thread abellet
Thanks a lot for the suggestions! On 18/06/2015 15:02, Himanshu Mehra [via Apache Spark User List] wrote: Hi abellet, You can try RDD.randomSplit(weights array), where the weights array is the array of weights you want to put in the consecutive partitions. Example
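For reference, that call might look like the following (the weights and seed are illustrative; weights are normalized if they do not sum to 1):

    // Four roughly equal random parts of the RDD.
    val parts = rdd.randomSplit(Array(0.25, 0.25, 0.25, 0.25), seed = 42L)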

Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Ashish Soni
Hi, is anyone able to install Spark 1.4 on HDP 2.2? Please let me know how I can do the same. Ashish

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Ashish Soni
I do not know where to start, as Spark 1.2 comes bundled with HDP 2.2, but I want to use 1.4 and I do not know how to update it to 1.4. Ashish On Fri, Jun 19, 2015 at 8:26 AM, ayan guha guha.a...@gmail.com wrote: what problem are you facing? are you trying to build it yourself or getting the pre-built

Re: Build spark application into uber jar

2015-06-19 Thread Akhil Das
This is how I used to build an assembly jar with sbt. Your build.sbt file would look like this: import AssemblyKeys._ assemblySettings name := "FirstScala" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" libraryDependencies +=

Re: RE: Build spark application into uber jar

2015-06-19 Thread bit1...@163.com
Thank you for the reply. "Run the application locally" means that I run the application in my IDE with the master as local[*]. When the spark stuff is marked as provided, I can't run it because the spark stuff is missing. So, how do you work around this? Thanks! bit1...@163.com From:

Spark group by sub coulumn

2015-06-19 Thread Suraj Shetiya
Hi, I wanted to obtain a grouped-by frame from a dataframe. A snippet of the column on which I need to perform the groupby is below: df.select("To").show() To ArrayBuffer(vance... ArrayBuffer(vance... ArrayBuffer(rober... ArrayBuffer(richa... ArrayBuffer(guill... ArrayBuffer(m..pr...

Re: how to change /tmp folder for spark ut use sbt

2015-06-19 Thread Akhil Das
You can try setting these properties: .set("spark.local.dir", "/mnt/spark/") .set("java.io.tmpdir", "/mnt/spark/") Thanks Best Regards On Fri, Jun 19, 2015 at 8:28 AM, yuemeng (A) yueme...@huawei.com wrote: hi, all, if i want to change the /tmp folder to any other folder for spark ut use

RE: RE: Build spark application into uber jar

2015-06-19 Thread prajod.vettiyattil
Multiple maven profiles may be the ideal way. You can also do this with: 1. The default build command “mvn compile”, for local builds (use this to build with Eclipse’s “Run As - Maven build” option when you right-click on the pom.xml file) 2. Add maven build options to the same build

Re: N kafka topics vs N spark Streaming

2015-06-19 Thread Akhil Das
Like this? val add_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("add").toSet) val delete_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("delete").toSet) val
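Alternatively, if all four topics share the same processing, a single direct stream can subscribe to them together, so one job handles everything. A sketch using the topic names from the question; whether to merge depends on whether per-topic handling differs:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One direct stream over all four topics.
    val allTopics = Set("add", "delete", "update", "merge")
    val all_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, allTopics)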

Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Matthew Johnson
Hi all, I have been struggling with Cassandra’s lack of ad hoc query support (I know this is an anti-pattern of Cassandra, but sometimes management come over and ask me to run stuff, and it’s impossible to explain that it will take me a while when it would take about 10 seconds in MySQL), so I

N kafka topics vs N spark Streaming

2015-06-19 Thread Manohar753
Hi everybody, I have four kafka topics, each for a separate operation (Add, Delete, Update, Merge), so spark will also have four consumed streams. How can I run my spark job here? Should I run four spark jobs separately? Is there any way to bundle all streams into a single jar and run them as a single job?

Re: RE: Spark or Storm

2015-06-19 Thread Enno Shioji
Fair enough, on second thought, just saying that it should be idempotent is indeed more confusing. I guess the crux of the confusion comes from the fact that people tend to assume the work you described (store batch id and skip etc.) is handled by the framework, perhaps partly because Storm

Re: Spark-sql versus Impala versus Hive

2015-06-19 Thread Sanjay Subramanian
Hi guys, I am using CDH 5.3.3 and that comes with Hive 0.13.1 and Spark 1.2. So to answer your question, it's not Tez (that, I believe, comes with HortonWorks). This Hive query was run with hive defaults. I used additional hive params right now to improve the timings: SET mapreduce.job.reduces=16; SET

Cassandra - Spark 1.3 - reading data from cassandra table with PYSpark

2015-06-19 Thread Koen Vantomme
Hello, I'm trying to read data from a table stored in cassandra with pyspark. I found the scala code to loop through the table: cassandra_rdd.toArray.foreach(println) How can this be translated into PySpark? code snippet: sc_cass = CassandraSparkContext(conf=conf) cassandra_rdd =

Re: RE: Spark or Storm

2015-06-19 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics — "semantics of output operations" section. Is this really not clear? As for the general tone of "why doesn't the framework do it for you"... in my opinion, this is essential complexity for delivery

Re: RE: Spark or Storm

2015-06-19 Thread Cody Koeninger
auto.offset.reset only applies when there are no starting offsets (either from a checkpoint, or from you providing them explicitly) On Fri, Jun 19, 2015 at 6:10 AM, bit1...@163.com bit1...@163.com wrote: I think your observation is correct, you have to take care of these replayed data at your
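For reference, that setting rides along in the kafkaParams map passed to createDirectStream; the broker address below is illustrative:

    // Only consulted when no starting offsets are available
    // (no checkpoint, none provided explicitly):
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset" -> "smallest")   // default is "largest"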

About Jobs UI in yarn-client mode

2015-06-19 Thread Sea
Hi, all: I run spark on yarn, I want to see the Jobs UI http://ip:4040/, but it redirects to http://${yarn.ip}/proxy/application_1428110196022_924324/, which cannot be found. Why? Can anyone help?

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Doug Balog
If you run Hadoop in secure mode and want to talk to Hive 0.14, it won’t work, see SPARK-5111 I have a patched version of 1.3.1 that I’ve been using. I haven’t had the time to get 1.4.0 working. Cheers, Doug On Jun 19, 2015, at 8:39 AM, ayan guha guha.a...@gmail.com wrote: I think you

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-19 Thread Rogers Jeffrey
Thanks. Setting the driver memory property worked for K=1000. But when I increased K to 1500 I get the following error: 15/06/19 09:38:44 INFO ContextCleaner: Cleaned accumulator 7 15/06/19 09:38:44 INFO BlockManagerInfo: Removed broadcast_34_piece0 on 172.31.3.51:45157 in memory (size: 1568.0

What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Elkhan Dadashov
Hi all, If I want to ship the spark-submit script to HDFS and then call it from the HDFS location to start a Spark job, which other files/folders/jars need to be transferred into HDFS with the spark-submit script? Due to some dependency issues, we can't include Spark in our Java application, so instead we

What is needed to integrate Spark with Pandas and scikit-learn?

2015-06-19 Thread YaoPau
I'm running Spark on YARN and will be upgrading to 1.3 soon. For the integration, will I need to install Pandas and scikit-learn on every node in my cluster, or is the integration just something that takes place on the edge node after a collect in yarn-client mode?

Spark Streaming 1.3.0 ERROR LiveListenerBus

2015-06-19 Thread Evo Eftimov
Spark Streaming 1.3.0 on YARN during Job Execution keeps generating the following error while the application is running: ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException etc etc Caused by: java.io.IOException: Filesystem closed

Re: SparkR - issue when starting the sparkR shell

2015-06-19 Thread Davies Liu
Yes, right now we have only tested SparkR with R 3.x. On Fri, Jun 19, 2015 at 5:53 AM, Kulkarni, Vikram vikram.kulka...@hp.com wrote: Hello, I am seeing this issue when starting the sparkR shell. Please note that I have R version 2.14.1. [root@vertica4 bin]# sparkR R version 2.14.1

SparkSQL: leftOuterJoin is VERY slow!

2015-06-19 Thread Piero Cinquegrana
Hello, I have two DataFrames: tv and sessions. I need to convert these DataFrames into RDDs because I need to use the groupByKey function. The reduceByKey function would not work here as I am not doing any aggregations; I am grouping using a (K, V) pair. See the snippets of code below. The

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Todd Nist
You can get HDP with at least 1.3.1 from Horton: http://hortonworks.com/hadoop-tutorial/using-apache-spark-technical-preview-with-hdp-2-2/ For your convenience, from the docs: wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.2.4.4/hdp.repo -O /etc/yum.repos.d/HDP-TP.repo

Re: Cassandra - Spark 1.3 - reading data from cassandra table with PYSpark

2015-06-19 Thread Davies Liu
On Fri, Jun 19, 2015 at 7:33 AM, Koen Vantomme koen.vanto...@gmail.com wrote: Hello, I'm trying to read data from a table stored in cassandra with pyspark. I found the scala code to loop through the table : cassandra_rdd.toArray.foreach(println) How can this be translated into PySpark ?

Spark FP-Growth algorithm for frequent sequential patterns

2015-06-19 Thread ping yan
Hi, I have a use case where I'd like to mine frequent sequential patterns (consider the clickpath scenario). Transaction A - B doesn't equal transaction B - A. From what I understand about FP-growth in general and the MLlib implementation of it, the orders are not preserved. Can anyone provide

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
Update on performance of the new API: the new code using the createDirectStream API ran overnight and when I checked the app state in the morning, there were massive scheduling delays :( Not sure why and haven't investigated a whole lot. For now, switched back to the createStream API build of my

Re: ERROR in withColumn method

2015-06-19 Thread Davies Liu
This is an known issue: https://issues.apache.org/jira/browse/SPARK-8461?filter=-1 Will be fixed soon by https://github.com/apache/spark/pull/6898 On Fri, Jun 19, 2015 at 5:50 AM, Animesh Baranawal animeshbarana...@gmail.com wrote: I am trying to perform some insert column operations in

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-19 Thread DB Tsai
Hi Wei, I don't think ML is meant for single-node computation, and the algorithms in ML are designed for the pipeline framework. In short, the lasso regression in ML is a new algorithm implemented from scratch; it's faster and converges to the same solution as R's glmnet, but with scalability.

Spark on Yarn - How to configure

2015-06-19 Thread Ashish Soni
Can someone please let me know what all I need to configure to have Spark run using Yarn? There is a lot of documentation, but none of it says how and what all files need to be changed. Let's say I have 4 nodes for Spark - SparkMaster, SparkSlave1, SparkSlave2, SparkSlave3. Now on which node

Re: SparkSQL: leftOuterJoin is VERY slow!

2015-06-19 Thread Michael Armbrust
Broadcast outer joins are on my short list for 1.5. On Fri, Jun 19, 2015 at 10:48 AM, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Hello, I have two RDDs: tv and sessions. I need to convert these DataFrames into RDDs because I need to use the groupByKey function. The reduceByKey

Re: [ERROR] Insufficient Space

2015-06-19 Thread Kelly, Jonathan
Would you be able to use Spark on EMR rather than on EC2? EMR clusters allow easy resizing of the cluster, and EMR also now supports Spark 1.3.1 as of EMR AMI 3.8.0. See http://aws.amazon.com/emr/spark ~ Jonathan From: Vadim Bichutskiy

Re: Spark group by sub coulumn

2015-06-19 Thread Michael Armbrust
You are probably looking to do .select(explode($"To"), ...) first, which will produce a new row for each value in the input array. On Fri, Jun 19, 2015 at 12:02 AM, Suraj Shetiya surajshet...@gmail.com wrote: Hi, I wanted to obtain a grouped by frame from a dataframe. A snippet of the column
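A sketch of that pattern, using the "To" column from the question; the alias and the count aggregate are just illustrative:

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    // One output row per array element, then group on the element.
    val exploded = df.select(explode($"To").as("recipient"))
    exploded.groupBy("recipient").count().show()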

RE: SparkSQL: leftOuterJoin is VERY slow!

2015-06-19 Thread Piero Cinquegrana
Any tips on how to implement a broadcast left outer join using Scala? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, June 19, 2015 12:40 PM To: Piero Cinquegrana Cc: user@spark.apache.org Subject: Re: SparkSQL: leftOuterJoin is VERY slow! Broadcast outer joins are on my
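Until built-in support lands, a manual version is to broadcast the smaller side as a map and look each left-side key up in it. A sketch only; the RDD names and key/value types are hypothetical, and it assumes the right side fits in driver/executor memory:

    // Left outer join without a shuffle: broadcast the small (right) side.
    val rightMap = sc.broadcast(smallRdd.collectAsMap())
    val joined = bigRdd.map { case (k, v) =>
      (k, (v, rightMap.value.get(k)))   // None where no match, like leftOuterJoin
    }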

Re: Spark on EMR

2015-06-19 Thread Bozeman, Christopher
You can use Spark 1.4 on EMR AMI 3.8.0 if you install Spark as a 3rd party application using the bootstrap action directly without the native Spark inclusion with 1.3.1. See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark Refer to

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-19 Thread Nitin kak
Any other suggestions, guys? On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak nitinkak...@gmail.com wrote: With Sentry, only the hive user has permission for read/write/execute on the subdirectories of the warehouse. All the users get translated to hive when interacting with hiveserver2. But I think

Re: Serial batching with Spark Streaming

2015-06-19 Thread Michal Čizmazia
Binh, thank you very much for your comment and code. Please could you outline an example use of your stream? I am a newbie to Spark. Thanks again! On 18 June 2015 at 14:29, Binh Nguyen Van binhn...@gmail.com wrote: I haven’t tried with 1.4 but I tried with 1.3 a while ago and I could not get

Difference between Lasso regression in MLlib package and ML package

2015-06-19 Thread Wei Zhou
Hi Spark experts, I see lasso regression / elastic net implementations under both MLlib and ML; does anyone know what the difference between the two implementations is? At the Spark Summit, one of the keynote speakers mentioned that ML is meant for single-node computation; could anyone elaborate on this?

Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-19 Thread Ravi Mody
Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly large datasets (1+ billion input records). As I grow my dataset I often run into issues with a lot of failed stages and dropped executors, ultimately leading to the whole application failing. The errors are like

Re: createDirectStream and Stats

2015-06-19 Thread Cody Koeninger
Is there any more info you can provide / relevant code? On Fri, Jun 19, 2015 at 1:23 PM, Tim Smith secs...@gmail.com wrote: Update on performance of the new API: the new code using the createDirectStream API ran overnight and when I checked the app state in the morning, there were massive