Re: Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
Fair Scheduler; the YARN queue has the entire cluster resource as maxResource, so preemption does not come into the picture during the test case, and all the Spark jobs got the requested resources. The concurrent jobs with different Spark contexts run fine, so suspecting resource contention does not seem correct.

Re: Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Ted Yu
Is it possible to perform the tests using Spark 1.6.0 ? Thanks On Thu, Feb 18, 2016 at 9:51 PM, Prabhu Joseph wrote: > Hi All, > >When running concurrent Spark Jobs on YARN (Spark-1.5.2) which share a > single Spark Context, the jobs take more time to complete

RE: Spark JDBC connection - data writing success or failure cases

2016-02-18 Thread Mich Talebzadeh
Sorry, where is the source of the data? Are you writing to an Oracle table or reading from one? In general, JDBC messages will tell you about the connection failing halfway through, or any other message received, say from Oracle via JDBC. What batch size are you using for this transaction? HTH Dr Mich

Re: Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Jörn Franke
How did you configure YARN queues? What scheduler? Preemption ? > On 19 Feb 2016, at 06:51, Prabhu Joseph wrote: > > Hi All, > >When running concurrent Spark Jobs on YARN (Spark-1.5.2) which share a > single Spark Context, the jobs take more time to complete

Concurrency does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
Hi All, When running concurrent Spark Jobs on YARN (Spark-1.5.2) which share a single Spark Context, the jobs take more time to complete compared with when they run with different Spark Contexts. The Spark jobs are submitted on different threads. Test Case: A. 3 spark jobs submitted

Logistic Regression using ML Pipeline

2016-02-18 Thread Arunkumar Pillai
Hi, I'm trying to build a logistic regression using the ML Pipeline: val lr = new LogisticRegression() lr.setFitIntercept(true) lr.setMaxIter(100) val model = lr.fit(data) println(model.summary) I'm getting coefficients but am not able to get the predicted and probability values.
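
For reference, a minimal sketch of how the fitted model is usually applied to get per-row predictions and probabilities: transform() appends "rawPrediction", "probability" and "prediction" columns (the default Spark ML column names), assuming `data` is a DataFrame with "label" and "features" columns.

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setFitIntercept(true)
      .setMaxIter(100)
    val model = lr.fit(data)

    // transform() scores each row and appends the prediction columns
    val scored = model.transform(data)
    scored.select("features", "probability", "prediction").show(5)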

Re: cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-18 Thread Felix Cheung
Doesn't DESeqDataSetFromMatrix work with data.frame only? It wouldn't work with Spark's DataFrame - try collect(countMat) and others to convert them into data.frame? _ From: roni Sent: Thursday, February 18, 2016 4:55 PM Subject: cannot

Re: Using sbt assembly

2016-02-18 Thread Brian London
You need to add the plugin to your plugins.sbt file not your build.sbt file. Also, I don't see a 0.13.9 version on Github. 0.14.2 is current. On Thu, Feb 18, 2016 at 9:50 PM Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hello, > > I am trying to use sbt assembly to generate a
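
A minimal sketch of what project/plugins.sbt could look like, assuming the 0.14.2 plugin version mentioned above:

    // project/plugins.sbt -- the assembly plugin is registered here, not in build.sbt
    resolvers += Resolver.url(
      "bintray-sbt-plugins",
      url("http://dl.bintray.com/sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.2")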

Using sbt assembly

2016-02-18 Thread Arko Provo Mukherjee
Hello, I am trying to use sbt assembly to generate a fat JAR. Here is my \project\assembly.sbt file: resolvers += Resolver.url("bintray-sbt-plugins", url("http://dl.bintray.com/sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns) addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.9")

Re: Spark History Server NOT showing Jobs with Hortonworks

2016-02-18 Thread Divya Gehlot
Hi Sutanu , When you run your spark shell you would see below lines in your console 16/02/18 21:43:53 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041 16/02/18 21:43:53 INFO Utils: Successfully started service 'SparkUI' on port 4041. 16/02/18 21:43:54 INFO SparkUI: Started

Spark JDBC connection - data writing success or failure cases

2016-02-18 Thread Divya Gehlot
Hi, I have a Spark job which connects to an RDBMS (in my case Oracle). How can we check that the complete data write was successful? Can I use commit in case of success or rollback in case of failure? Thanks, Divya
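
One way to detect success or failure on the Spark side is to catch the exception thrown by the JDBC writer; a sketch, assuming a DataFrame `df` and placeholder Oracle connection details. Note that Spark writes partition by partition, so a mid-job failure can leave partial data; there is no single commit/rollback over the whole write.

    import java.util.Properties
    import org.apache.spark.sql.SaveMode
    import scala.util.{Failure, Success, Try}

    val props = new Properties()
    props.setProperty("user", "scott")        // placeholder credentials
    props.setProperty("password", "tiger")

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"  // placeholder URL

    // df.write.jdbc throws on failure, so wrap it to react to the outcome
    Try(df.write.mode(SaveMode.Append).jdbc(url, "TARGET_TABLE", props)) match {
      case Success(_) => println("write completed")
      case Failure(e) => println(s"write failed: ${e.getMessage}")
    }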

StreamingKMeans does not update cluster centroid locations

2016-02-18 Thread ramach1776
I have a streaming application wherein I train the model using a receiver input stream in 4 sec batches: val stream = ssc.receiverStream(receiver) // receiver gets new data every batch model.trainOn(stream.map(Vectors.parse)) If I use model.latestModel.clusterCenters.foreach(println) the value of

Re: subtractByKey increases RDD size in memory - any ideas?

2016-02-18 Thread Andrew Ehrlich
There could be clues in the different RDD subclasses; rdd1 is ParallelCollectionRDD but rdd3 is SubtractedRDD. On Thu, Feb 18, 2016 at 1:37 PM, DaPsul wrote: > (copy from > > http://stackoverflow.com/questions/35467128/spark-subtractbykey-increases-rdd-cached-memory-size > ) > >

Re: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-18 Thread Andrew Ehrlich
Use the scala method .split(",") to split the string into a collection of strings, and try using .replaceAll() on the field with the "?" to remove it. On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh wrote: > Hi, > > What is the equivalent of this Hive statement in Spark >
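
Since Spark 1.5 there is also a direct regexp_replace function in org.apache.spark.sql.functions; a sketch, assuming a DataFrame `df` with a hypothetical string column "amount":

    import org.apache.spark.sql.functions.{col, regexp_replace}

    // strip every character that is not a digit or a dot, mirroring the Hive expression
    val cleaned = df.withColumn("amount_clean", regexp_replace(col("amount"), "[^\\d.]", ""))
    cleaned.show()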

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-18 Thread Koert Kuipers
looking at the cached rdd i see a similar story: with useLegacyMode = true the cached rdd is spread out across 10 executors, but with useLegacyMode = false the data for the cached rdd sits on only 3 executors (the rest all show 0s). my cached RDD is a key-value RDD that got partitioned (hash

spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-18 Thread Koert Kuipers
hello all, we are just testing a semi-realtime application (it should return results in less than 20 seconds from cached RDDs) on spark 1.6.0. before this it used to run on spark 1.5.1 in spark 1.6.0 the performance is similar to 1.5.1 if i set spark.memory.useLegacyMode = true, however if i

Re: UDAF support for DataFrames in Spark 1.5.0?

2016-02-18 Thread Ted Yu
Richard: Please see SPARK-9664 Use sqlContext.udf to register UDAFs Cheers On Thu, Feb 18, 2016 at 3:18 PM, Kabeer Ahmed wrote: > I use Spark 1.5 with CDH5.5 distribution and I see that support is present > for UDAF. From the link: >
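
For reference, a minimal sketch of a UDAF in the 1.5 API (a simple sum over a double column), registered through sqlContext.udf as SPARK-9664 describes:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class SimpleSum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      def evaluate(buffer: Row): Double = buffer.getDouble(0)
    }

    // register and then use it from SQL or the DataFrame API
    sqlContext.udf.register("simple_sum", new SimpleSum)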

Re: Streaming with broadcast joins

2016-02-18 Thread Srikanth
Code with SQL broadcast hint. This worked and I was able to see that broadcastjoin was performed. val testDF = sqlContext.read.format("com.databricks.spark.csv") .schema(schema).load("file:///shared/data/test-data.txt") val lines = ssc.socketTextStream("DevNode", )

Re: UDAF support for DataFrames in Spark 1.5.0?

2016-02-18 Thread Kabeer Ahmed
I use Spark 1.5 with CDH5.5 distribution and I see that support is present for UDAF. From the link: https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html, I read that this is an experimental feature. So it makes sense not

RE: Spark History Server NOT showing Jobs with Hortonworks

2016-02-18 Thread Mich Talebzadeh
Is port 4040 used on your host? It should be the default. Example:
netstat -plten|grep 4040
tcp 0 0 :::4040 :::* LISTEN 1009 42748209 22778/java
ps -ef|grep 22778
hduser 22778 22770 0 08:34 pts/1 00:01:18 /usr/java/latest/bin/java

RE: Spark History Server NOT showing Jobs with Hortonworks

2016-02-18 Thread Sutanu Das
Hi Mich, Community - Do I need to specify it in the properties file in my spark-submit ? From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: Thursday, February 18, 2016 4:28 PM To: Sutanu Das; user@spark.apache.org Subject: RE: Spark History Server NOT showing Jobs with Hortonworks The

RE: JDBC based access to RDD

2016-02-18 Thread Mich Talebzadeh
Can you please clarify your point. Do you mean using JDBC to get data from other databases into Spark? val s = HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "table", "user" -> _username, "password" -> _password)) HTH Dr Mich Talebzadeh LinkedIn
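
The same read in the DataFrameReader style (the load(...) form above is the older API); a sketch with placeholder connection details:

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // placeholder URL
      "dbtable"  -> "SCHEMA.TABLE",
      "user"     -> "scott",
      "password" -> "tiger"
    )).load()

    jdbcDF.printSchema()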

JDBC based access to RDD

2016-02-18 Thread Shyam Sarkar
Is there any good code example for JDBC based access to RDD ? Thanks.

RE: Spark History Server NOT showing Jobs with Hortonworks

2016-02-18 Thread Mich Talebzadeh
The jobs are normally shown under :4040/jobs/ in a normal set up not using any vendor's flavour. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr V8Pw

Spark History Server NOT showing Jobs with Hortonworks

2016-02-18 Thread Sutanu Das
Hi Community, Challenged with Spark issues with Hortonworks (HDP 2.3.2_Spark 1.4.1) - The Spark History Server is NOT showing the Spark Running Jobs in Local Mode The local-host:4040/app/v1 is ALSO not working How can I look at my local Spark job? # Generated by Apache Ambari. Fri Feb 5

Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-18 Thread Mich Talebzadeh
Hi, What is the equivalent of this Hive statement in Spark: select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]',''); which returns
+------------+----------+
|    _c0     |   _c1    |
+------------+----------+
| ?2,500.00  | 2500.00  |
+------------+----------+
Basically I want to get rid of

subtractByKey increases RDD size in memory - any ideas?

2016-02-18 Thread DaPsul
(copy from http://stackoverflow.com/questions/35467128/spark-subtractbykey-increases-rdd-cached-memory-size) I've found a very strange behavior for RDDs (spark 1.6.0 with scala 2.11): When I use subtractByKey on an RDD the resulting RDD should be of equal or smaller size. What I get is an RDD

Re: Yarn client mode: Setting environment variables

2016-02-18 Thread Lin Zhao
Thanks for the reply. I also found that sparkConf.setExecutorEnv works for yarn. From: Saisai Shao > Date: Wednesday, February 17, 2016 at 6:02 PM To: Lin Zhao > Cc:
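
For reference, a sketch of setting an executor-side environment variable through SparkConf (the variable name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("yarn-client-env-example")
      .setExecutorEnv("MY_SERVICE_URL", "http://example.internal:8080")  // hypothetical variable

    val sc = new SparkContext(conf)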

Re: Importing csv files into Hive ORC target table

2016-02-18 Thread Alex Dzhagriev
Hi Mich, Try to use a regexp to parse your string instead of the split. Thanks, Alex. On Thu, Feb 18, 2016 at 6:35 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > > > thanks, > > > > I have an issue here. > > define rdd to read the CSV file > > scala> var csv =

Lazy executors

2016-02-18 Thread Bemaze
I'm running a regular Spark job with a cluster of 50 core instances and 100 executors. The work they are doing appears to be fairly evenly distributed, however often I will see one or two of the executors appear to be doing no work. They are listed as having tasks active and often those become

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Koert Kuipers
although it is not a bad idea to write data out partitioned, and then use a merge join when reading it back in, this currently isn't even easily doable with rdds because when you read an rdd from disk the partitioning info is lost. re-introducing a partitioner at that point causes a shuffle

Re: SPARK-9559

2016-02-18 Thread Igor Berman
What are you trying to solve? Killing the worker JVM is like killing the YARN node manager... why would you do this? Usually the worker JVM is an "agent" on each worker machine which opens executors for each application, so it doesn't work hard or have a big memory footprint. Yes, it can fail, but it is rather a corner

Access to broadcasted variable

2016-02-18 Thread jeff saremi
I'd like to know whether the broadcasted object gets serialized when accessed by the executor during the execution of a task. I know that it gets serialized when shipped from the driver to the worker; this question is about what happens inside the worker when the executor JVMs access it. thanks Jeff

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Rishi Mishra
Michael, Is there any specific reason why DataFrames does not have partitioners like RDDs ? This will be very useful if one is writing custom datasources , which keeps data in partitions. While storing data one can pre-partition the data at Spark level rather than at the datasource. Regards,

Re: SparkConf does not work for spark.driver.memory

2016-02-18 Thread Marcelo Vanzin
On Thu, Feb 18, 2016 at 10:26 AM, wgtmac wrote: > In the code, I did following: > val sc = new SparkContext(new > SparkConf().setAppName("test").set("spark.driver.memory", "4g")) You can't set the driver memory like this, in any deploy mode. When that code runs, the driver is

SparkConf does not work for spark.driver.memory

2016-02-18 Thread wgtmac
Hi I'm using spark 1.5.1. But I encountered a problem using SparkConf to set spark.driver.memory in yarn-cluster mode. Example 1: In the code, I did following: val sc = new SparkContext(new SparkConf().setAppName("test").set("spark.driver.memory", "4g")) And used following command to submit

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread ramach1776
bq. streamingContext.remember("duration") did not help Can you give a bit more detail on the above ? Did you mean the job encountered OOME later on ? Which Spark release are you using ? tried these 2 global settings (and restarted the app) after enabling cache for stream1

Re: Streaming with broadcast joins

2016-02-18 Thread Sebastian Piu
Can you paste the code where you use sc.broadcast ? On Thu, Feb 18, 2016 at 5:32 PM Srikanth wrote: > Sebastian, > > I was able to broadcast using sql broadcast hint. Question is how to > prevent this broadcast for each RDD. > Is there a way where it can be broadcast once

Importing csv files into Hive ORC target table

2016-02-18 Thread Mich Talebzadeh
> thanks, > > I have an issue here. > > define rdd to read the CSV file > > scala> var csv = sc.textFile("/data/stg/table2") > csv: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at textFile at > :27 > > I then get rid of the header > > scala> val csv2 =
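
For reference, the usual way to drop a header row from a text-file RDD (a sketch of what the truncated csv2 definition presumably does, not necessarily the original author's exact code):

    val csv = sc.textFile("/data/stg/table2")

    // drop the first line of the first partition only, i.e. the header row
    val csv2 = csv.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }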

Re: Streaming with broadcast joins

2016-02-18 Thread Srikanth
Sebastian, I was able to broadcast using sql broadcast hint. Question is how to prevent this broadcast for each RDD. Is there a way where it can be broadcast once and used locally for each RDD? Right now every batch the metadata file is read and the DF is broadcasted. I tried sc.broadcast and

Re: equalTo isin not working as expected with a constructed column with DataFrames

2016-02-18 Thread Mehdi Ben Haj Abbes
Hi, I forgot to mention that I'm using the 1.5.1 version. Regards, On Thu, Feb 18, 2016 at 4:20 PM, Mehdi Ben Haj Abbes wrote: > Hi folks, > > I have DataFrame with let's say this schema : > -dealId, > -ptf, > -ts > from it I derive another dataframe (lets call it df) to

Re: SPARK REST API on YARN

2016-02-18 Thread Ricardo Paiva
You can use the YARN proxy: http://<rm-host>:8088/proxy/<app-id>/api/v1/applications/<app-id>/executors I have a Scala application that monitors the number of executors of some Spark streamings, and I had a similar problem, where I iterate over the running jobs and get the number of executors: val states = EnumSet.of(

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I have now... So far I think the issues I've had are not related to this, but I wanted to be sure in case it should be something that needs to be patched. I've had some jobs run successfully but this warning appears in the logs. Regards, James On 18 February 2016 at 12:23, Ted Yu

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I'm fairly new to Spark. The documentation suggests using the spark-ec2 script to launch clusters in AWS, hence I used it. Would EMR offer any advantage? Regards, James On 18 February 2016 at 14:04, Gourav Sengupta wrote: > Hi, > > Just out of sheet curiosity why

UDAF support for DataFrames in Spark 1.5.0?

2016-02-18 Thread Richard Cobbe
I'm working on an application using DataFrames (Scala API) in Spark 1.5.0, and we need to define and use several custom aggregators. I'm having trouble figuring out how to do this, however. First, which version of Spark did UDAF support land in? Has it in fact landed at all?

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread Cody Koeninger
If by smaller block interval you mean the value in seconds passed to the streaming context constructor, no. You'll still get everything from the starting offset until now in the first batch. On Thu, Feb 18, 2016 at 10:02 AM, praveen S wrote: > Sorry.. Rephrasing : > Can

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Sorry.. Rephrasing : Can this issue be resolved by having a smaller block interval? Regards, Praveen On 18 Feb 2016 21:30, "praveen S" wrote: > Can having a smaller block interval only resolve this? > > Regards, > Praveen > On 18 Feb 2016 21:13, "Cody Koeninger"

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Can having a smaller block interval only resolve this? Regards, Praveen On 18 Feb 2016 21:13, "Cody Koeninger" wrote: > Backpressure won't help you with the first batch, you'd need > spark.streaming.kafka.maxRatePerPartition > for that > > On Thu, Feb 18, 2016 at 9:40 AM,

Re: Is stddev not a supported aggregation function in SparkSQL WindowSpec?

2016-02-18 Thread Mich Talebzadeh
On 18/02/2016 11:47, Mich Talebzadeh wrote: > It is available in Hive as well > > You can of course write your own standard deviation function > > For example stddev for column amount_sold can be expressed as > >

SPARK REST API on YARN

2016-02-18 Thread alvarobrandon
Hello: I wanted to access the REST API (http://spark.apache.org/docs/latest/monitoring.html#rest-api) of Spark to monitor my jobs. However I'm running my Spark Apps over YARN. When I try to make a request to http://localhost:4040/api/v1 as the documentation says I don't get any response. My

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread Cody Koeninger
Backpressure won't help you with the first batch, you'd need spark.streaming.kafka.maxRatePerPartition for that On Thu, Feb 18, 2016 at 9:40 AM, praveen S wrote: > Have a look at > > spark.streaming.backpressure.enabled > Property > > Regards, > Praveen > On 18 Feb 2016
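
A sketch of the relevant settings (the values are illustrative only):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-direct-stream")
      // caps every batch, including the first, at maxRatePerPartition * batchSeconds records per partition
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // backpressure then tunes the rate of subsequent batches automatically
      .set("spark.streaming.backpressure.enabled", "true")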

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Have a look at spark.streaming.backpressure.enabled Property Regards, Praveen On 18 Feb 2016 00:13, "Abhishek Anand" wrote: > I have a spark streaming application running in production. I am trying to > find a solution for a particular use case when my application has

SPARK-9559

2016-02-18 Thread Ashish Soni
Hi All, Just wanted to know if there is any workaround or resolution for the below issue in standalone mode: https://issues.apache.org/jira/browse/SPARK-9559 Ashish

Re: How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Игорь Ляхов
Xiangrui, thnx for your answer! Could you clarify some details? What do you mean "I can trigger training jobs in different threads on the driver"? I have 4-machine cluster (It will grow in future), and I wish use them in parallel for training and predicting. Do you have any example? It will be

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-18 Thread diplomatic Guru
Hi Jorge, Thanks for the example. I managed to get the job to run but the results are appalling. The best I could get it: Test Mean Squared Error: 684.3709679595169 Learned regression tree model: DecisionTreeModel regressor of depth 30 with 6905 nodes I tried tweaking maxDepth and maxBins but I

Re: Why no computations run on workers/slaves in cluster mode?

2016-02-18 Thread Gourav Sengupta
Hi, Have you registered an application in the standalone cluster? This can also happen if the data path that you are giving SPARK to access is only visible in one system and not another. For example if I provide the data path as "/abcd/*" and that path is available in only one system and not

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
Hi Ted/Teng, Just read the content in the email, which is very different from what the facts are: Just to add another point, spark-ec2 is nice to keep and improve because it allows users to use any version of Spark (nightly builds, for example). EMR does not allow you to do that without manual

Re: Why no computations run on workers/slaves in cluster mode?

2016-02-18 Thread Xiangrui Meng
Did the test program finish and did you see any error messages from the console? -Xiangrui On Wed, Feb 17, 2016, 1:49 PM Junjie Qian wrote: > Hi all, > > I am new to Spark, and have one problem that, no computations run on > workers/slave_servers in the standalone

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
Hi Teng, Are you using VPC in EMR? It seems quite curious that you can lock in traffic at the gateway, subnet, and security group (using a private setting with NAT) and still feel insecure. I will be really interested to know what your feelings are based on. I bet the Amazon guys will also find it very

Re: How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Xiangrui Meng
If you have a big cluster, you can trigger training jobs in different threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui On Thu, Feb 18, 2016, 4:28 AM Igor L. wrote: > Good day, Spark team! > I have to solve regression problem for different restricitons.
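
A sketch of what "training jobs in different threads on the driver" can look like, using Scala futures; trainForCriterion and matchesCriterion are hypothetical helpers standing in for whatever MLlib training call and filtering logic are used per restriction:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // one Spark job per criterion, submitted concurrently from the driver
    val futureModels = criteria.map { criterion =>
      Future {
        val subset = trainingData.filter(row => matchesCriterion(row, criterion))  // hypothetical helper
        criterion -> trainForCriterion(subset)                                     // hypothetical helper
      }
    }

    val models = Await.result(Future.sequence(futureModels), 1.hour)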

Re: Is this likely to cause any problems?

2016-02-18 Thread Ted Yu
Please see the last 3 posts on this thread: http://search-hadoop.com/m/q3RTtTorTf2o3UGK1=Re+spark+ec2+vs+EMR FYI On Thu, Feb 18, 2016 at 6:25 AM, Teng Qiu wrote: > EMR is great, but I'm curiosity how are you dealing with security settings > with EMR, only whitelisting some

Re: Is this likely to cause any problems?

2016-02-18 Thread Teng Qiu
EMR is great, but I'm curious how you are dealing with security settings with EMR; only whitelisting some IP range with a security group setting is really too weak. Are there really many production systems using EMR? For me, I feel using EMR means everyone in my IP range (for some ISP it may

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread krishna ramachandran
I tried these 2 global settings (and restarted the app) after enabling cache for stream1 conf.set("spark.streaming.unpersist", "true") streamingContext.remember(Seconds(batchDuration * 4)) batch duration is 4 sec Using spark-1.4.1. The application runs for about 4-5 hrs then see out of memory

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
Hi, Just out of sheer curiosity, why are you not using EMR to start your SPARK cluster? Regards, Gourav On Thu, Feb 18, 2016 at 12:23 PM, Ted Yu wrote: > Have you seen this ? > > HADOOP-10988 > > Cheers > > On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton

Re: Error when executing Spark application on YARN

2016-02-18 Thread alvarobrandon
Found the solution. I was pointing to the wrong hadoop conf directory. I feel so stupid :P -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-executing-Spark-application-on-YARN-tp26248p26266.html Sent from the Apache Spark User List mailing list

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread Ted Yu
bq. streamingContext.remember("duration") did not help Can you give a bit more detail on the above ? Did you mean the job encountered OOME later on ? Which Spark release are you using ? Cheers On Wed, Feb 17, 2016 at 6:03 PM, ramach1776 wrote: > We have a streaming

Re: Is this likely to cause any problems?

2016-02-18 Thread Ted Yu
Have you seen this ? HADOOP-10988 Cheers On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton wrote: > HI, > > I am seeing warnings like this in the logs when I run Spark jobs: > > OpenJDK 64-Bit Server VM warning: You have loaded library >

How do I stream in Parquet files using fileStream() and ParquetInputFormat?

2016-02-18 Thread Rory Byrne
Hi, I'm trying to understand how to stream Parquet files into Spark using StreamingContext.fileStream[Key, Value, Format](). I am struggling to understand a) what should be passed as Key and Value (assuming ParquetInputFormat - is this the correct format?), and b) how - if at all - to configure

Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
HI, I am seeing warnings like this in the logs when I run Spark jobs: OpenJDK 64-Bit Server VM warning: You have loaded library /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you

Re: Reading CSV file using pyspark

2016-02-18 Thread Gourav Sengupta
Hi Devesh, you have to start your SPARK Shell using the packages. The command is mentioned below (you can use pyspark instead of spark-shell), anyways all the required commands for this is mentioned here https://github.com/databricks/spark-csv and I prefer using the 2.11 version instead of 2.10

explaination for parent.slideDuration in ReducedWindowedDStream

2016-02-18 Thread Sachin Aggarwal
While reading code I came across parent.slideDuration in the ReducedWindowedDStream class: val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration, currentTime) (the source comment here contains an ASCII diagram of the previous and current windows)

Re: Error when executing Spark application on YARN

2016-02-18 Thread alvarobrandon
1. It happens to all the classes inside the jar package. 2. I didn't do any changes. - I have three nodes: one master and two slaves in the conf/slaves file - In spark-env.sh I just set the HADOOP_CONF_DIR parameter - In spark-defaults.conf I didn't change anything 3. The

Re: Reading CSV file using pyspark

2016-02-18 Thread Teng Qiu
download a right version of this jar http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 (or 2.11), and append it to SPARK_CLASSPATH 2016-02-18 11:05 GMT+01:00 Devesh Raj Singh : > Hi, > > I want to read CSV file in pyspark > > I am running pyspark on pycharm

Re: Is stddev not a supported aggregation function in SparkSQL WindowSpec?

2016-02-18 Thread rok
There is a stddev function since 1.6: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev If you are using spark < 1.6 you can write your own more or less easily. On Wed, Feb 17, 2016 at 5:06 PM, mayx [via Apache Spark User List] <
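
For Spark < 1.6, one way to "write your own" is to build the population standard deviation sqrt(E[x^2] - E[x]^2) from avg aggregates; a sketch over a hypothetical numeric column "amount_sold":

    import org.apache.spark.sql.functions.{avg, col, pow, sqrt}

    val stats = df.agg(
      sqrt(avg(col("amount_sold") * col("amount_sold")) - pow(avg(col("amount_sold")), 2))
        .alias("stddev_pop_amount_sold")
    )
    stats.show()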

How do I stream in Parquet files using fileStream() and ParquetInputFormat

2016-02-18 Thread roryofbyrne
Hi, I'm trying to understand how to stream Parquet files into Spark using StreamingContext.fileStream[Key, Value, Format](). I am struggling to understand a) what should be passed as Key and Value (assuming ParquetInputFormat - is this the correct format?), and b) how - if at all - to

Reading CSV file using pyspark

2016-02-18 Thread Devesh Raj Singh
Hi, I want to read CSV file in pyspark I am running pyspark on pycharm I am trying to load a csv using pyspark import os import sys os.environ['SPARK_HOME']="/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6" sys.path.append("/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6/python/") # Now

How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Igor L.
Good day, Spark team! I have to solve regression problem for different restricitons. There is a bunch of criteria and rules for them, I have to build model and make predictions for each, combine all and save. So, now my solution looks like: criteria2Rules: List[(String, Set[String])]