Re: Spark job uses only one Worker

2016-01-07 Thread Annabel Melongo
Michael, I don't know what your environment is, but if it's Cloudera you should be able to see the link to your master in Hue. Thanks On Thursday, January 7, 2016 5:03 PM, Michael Pisula wrote: I had tried several parameters, including

Re: Date Time Regression as Feature

2016-01-07 Thread Annabel Melongo
Or he can also transform the whole date into a string On Thursday, January 7, 2016 2:25 PM, Sujit Pal wrote: Hi Jorge, Maybe extract things like dd, mm, day of week, time of day from the datetime string and use them as features? -sujit On Thu, Jan 7, 2016 at

Re: [Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Dibyendu Bhattacharya
Right... if you are using the GitHub version, just modify the ReceiverLauncher and add that. I will fix it for Spark 1.6 and release a new version in spark-packages for Spark 1.6. Dibyendu On Thu, Jan 7, 2016 at 4:14 PM, Ted Yu wrote: > I cloned

Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread satish chandra j
Hi All, I am currently using Spark 1.4.0 and have a requirement to add a column with sequential numbering to an existing DataFrame. I understand the window function "rowNumber" serves my purpose, hence I have the below import statement to include it: import org.apache.spark.sql.expressions.Window
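A minimal sketch of what that usage might look like in Spark 1.4 (assuming the existing DataFrame is `df` and `id` is an ordering column; both names are hypothetical, and window functions at this version generally require a HiveContext-backed DataFrame):

```scala
// Sketch only: df and the ordering column "id" are hypothetical.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val w = Window.orderBy("id")
val numbered = df.withColumn("seq_no", rowNumber().over(w))
```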

Re: Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread Jacek Laskowski
Ok, enuf! :) Leaving the room for now as I'm like a copycat :) https://en.wiktionary.org/wiki/enuf Pozdrawiam, Jacek Jacek Laskowski | https://medium.com/@jaceklaskowski/ Mastering Apache Spark ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ Follow me at

Re: [Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Dibyendu Bhattacharya
Some discussion is there in https://github.com/dibbhatt/kafka-spark-consumer and some is mentioned in https://issues.apache.org/jira/browse/SPARK-11045 Let me know if those answer your question. In short, Direct Stream is a good choice if you need exactly-once semantics and message ordering, but

Re: Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread Ted Yu
Please take a look at the following for a sample of how rowNumber is used: https://github.com/apache/spark/pull/9050 BTW, 1.4.0 is an old release; please consider upgrading. On Thu, Jan 7, 2016 at 3:04 AM, satish chandra j wrote: > HI All, > Currently using Spark 1.4.0

How HiveContext can read subdirectories

2016-01-07 Thread Arkadiusz Bicz
Hi, Can Spark, using HiveContext external tables, read sub-directories? Example: import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql._ import sqlContext.implicits._ // prepare data and create subdirectories with parquet val df = Seq("id1" -> 1, "id2" -> 4, "id3"->
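A hedged sketch of the settings commonly suggested for recursive reads through a HiveContext external table; whether they take effect depends on the Spark/Hive versions involved, and `my_external_table` is a hypothetical table pointing at the parent directory:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Hadoop/Hive properties often cited for recursive directory input
// (may or may not apply to your Spark build).
hiveContext.setConf("hive.mapred.supports.subdirectories", "true")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

hiveContext.sql("SELECT count(*) FROM my_external_table").show()
```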

[Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Walid LEZZAR
Hi, We have been using spark streaming for a little while now. Until now, we were running our spark streaming jobs in spark 1.5.1 and it was working well. Yesterday, we upgraded to spark 1.6.0 without any changes in the code. But our streaming jobs are not working any more. We are getting an

Spark shell throws java.lang.RuntimeException

2016-01-07 Thread will
Hi, I wanted to try the 1.6.0 version of Spark, but when I run it on my local machine, it throws this exception: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. The thing is, this problem also happened to me in the 1.5.1

Re: Why is this job running since one hour?

2016-01-07 Thread Umesh Kacha
Hi, thanks for the response. Each job processes around 5 GB of skewed data, does a group by on multiple fields, does aggregation, then coalesce(1), and saves a CSV file in gzip format. I think coalesce is causing the problem, but the data is not that huge; I don't understand why it keeps on running for an

Re: 101 question on external metastore

2016-01-07 Thread Deenar Toraskar
I sorted this out. There were 2 different versions of Derby, and ensuring the metastore and Spark used the same version of Derby made the problem go away. Deenar On 6 January 2016 at 02:55, Yana Kadiyska wrote: > Deenar, I have not resolved this issue. Why do you think

Re: spark ui security

2016-01-07 Thread Ted Yu
According to https://spark.apache.org/docs/latest/security.html#web-ui , web UI is covered. FYI On Thu, Jan 7, 2016 at 6:35 AM, Kostiantyn Kudriavtsev < kudryavtsev.konstan...@gmail.com> wrote: > hi community, > > do I understand correctly that spark.ui.filters property sets up filters > only
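For reference, that filter mechanism is wired up roughly as below; `com.example.MyAuthFilter` is a hypothetical javax.servlet.Filter you would provide on the driver classpath, and the parameter string is illustrative:

```scala
import org.apache.spark.SparkConf

// Sketch only: MyAuthFilter and its parameters are placeholders.
val conf = new SparkConf()
  .set("spark.ui.filters", "com.example.MyAuthFilter")
  .set("spark.com.example.MyAuthFilter.params", "user=admin,password=secret")
```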

Re: adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Prem Sure
Did you try the --jars property in spark-submit? If your jar is huge, you can pre-load the jar on all executors in a commonly available directory to avoid network IO. On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion wrote: > I'm trying to add jars before running a query

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
First, extract year, month, day, and time from the datetime. Then decide which variables can be treated as categorical features, such as year/month/day, and encode them to boolean form using OneHotEncoder. Finally, use VectorAssembler to assemble the encoded output vector and the other raw
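A sketch of that pipeline in the spark.ml API, assuming the datetime has already been split into numeric index columns `month` and `hour` plus a numeric `rawValue` column (all column names hypothetical):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

// Encode categorical components, then assemble everything into one feature vector.
val monthEncoder = new OneHotEncoder().setInputCol("month").setOutputCol("monthVec")
val hourEncoder  = new OneHotEncoder().setInputCol("hour").setOutputCol("hourVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("monthVec", "hourVec", "rawValue"))
  .setOutputCol("features")

val features = assembler.transform(hourEncoder.transform(monthEncoder.transform(df)))
```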

Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Prem Sure
You may need to add a createDataFrame (for Python, inferSchema) call before registerTempTable. Thanks, Prem On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup < henrik.baast...@netscout.com> wrote: > Hi All, > > I have a small Hadoop cluster where I have stored a lot of data in parquet >

Re: Question in rdd caching in memory using persist

2016-01-07 Thread Prem Sure
Are you running standalone in local mode or cluster mode? Executor and driver existence differ based on the setup type. A snapshot of your environment's UI would help to say. On Thu, Jan 7, 2016 at 11:51 AM, wrote: > Hi, > > > > After I called rdd.persist(*MEMORY_ONLY_SER*), I

Spark streaming routing

2016-01-07 Thread Lin Zhao
I have a need to route the dstream through the streaming pipeline by some key, such that data with the same key always goes through the same executor. There doesn't seem to be a way to do manual routing with Spark Streaming. The closest I can come up with is: stream.foreachRDD {rdd =>
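A sketch of the closest approximation described here: repartition each batch by key with a HashPartitioner so records with the same key land in the same partition (as noted downthread, this does not pin a key to a particular executor across batches or failures; keyOf and numPartitions are placeholders):

```scala
import org.apache.spark.HashPartitioner

val routed = stream.transform { rdd =>
  rdd.map(record => (keyOf(record), record))          // keyOf is a hypothetical key extractor
     .partitionBy(new HashPartitioner(numPartitions)) // co-locate same-key records per batch
}
routed.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { case (k, v) => () } // process (k, v) here
  }
}
```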

Re: Spark streaming routing

2016-01-07 Thread Lin Zhao
Thanks for the reply, Tathagata. Our pipeline has a rather fat state, and that's why we have custom failure handling that kills all executors and goes back to a certain point in time in the past. On a separate but related note, I noticed that in a chained map job, the entire pipeline runs on the

Re: Large scale ranked recommendation

2016-01-07 Thread xenocyon
(following up a rather old thread:) Hi Christopher, I understand how you might use nearest neighbors for item-item recommendations, but how do you use it for top N items per user? Thanks! Apu -- View this message in context:

Re: Predictive Modelling in sparkR

2016-01-07 Thread Chandan Verma
Hi Yanbo, I was able to successfully perform logistic regression on my data and also performed the cross validation, and it all worked fine. Thanks Sent from my Sony Xperia™ smartphone Yanbo Liang wrote >Hi Chandan, > > >Do you mean to run your own LR algorithm based on SparkR? >

Re: [Spark-SQL] Custom aggregate function for GrouppedData

2016-01-07 Thread Abhishek Gayakwad
Thanks, Michael, for replying. Aggregator/UDAF is exactly what I am looking for, but we are still on 1.4 and it's going to take time to get to 1.6. On Wed, Jan 6, 2016 at 10:32 AM, Michael Armbrust wrote: > In Spark 1.6 GroupedDataset >

Re: Newbie question

2016-01-07 Thread Deepak Sharma
Yes, you can do it unless the method is marked static/final. Most of the methods in SparkContext are marked static, so you definitely can't override those; otherwise, overriding would usually work. Thanks Deepak On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman wrote: >

Re: Newbie question

2016-01-07 Thread censj
You can try it. > On January 8, 2016, at 14:44, yuliya Feldman wrote: > > invoked

Recommendations using Spark

2016-01-07 Thread anjali gautam
Hi, Can anybody please guide me on how we can generate recommendations for a user using Spark? Regards, Anjali Gautam

Re: Recommendations using Spark

2016-01-07 Thread dEEPU
Use the Spark MLlib KMeans algorithm to generate recommendations. On Jan 8, 2016 12:41 PM, anjali gautam wrote: Hi, Can anybody please guide me on how we can generate recommendations for a user using Spark? Regards, Anjali Gautam

Re: Date Time Regression as Feature

2016-01-07 Thread dEEPU
Maybe you want to convert the date to a duration, in the form of a number of hours/days, and then do the calculation on it. On Jan 8, 2016 12:39 AM, Jorge Machado wrote: Hello all, I'm new to machine learning. I'm trying to predict some electric usage with a decision tree. The

Newbie question

2016-01-07 Thread yuliya Feldman
Hello, I am new to Spark and have what is most likely a basic question - can I override a method from SparkContext? Thanks

Re: Newbie question

2016-01-07 Thread yuliya Feldman
For example, to add some functionality there. I understand I can have an extended SparkContext as an implicit class to add new methods that can be invoked on SparkContext, but I want to see if I can override an existing one. From: censj To: yuliya Feldman
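A minimal sketch of subclassing, assuming the target method (textFile is used purely as an illustration) is neither final nor private; whether this is advisable is a separate question:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

class MySparkContext(conf: SparkConf) extends SparkContext(conf) {
  // Add behaviour, then delegate to the original implementation.
  override def textFile(path: String, minPartitions: Int): RDD[String] = {
    println(s"textFile called for $path")
    super.textFile(path, minPartitions)
  }
}
```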

RE: How to split a huge rdd and broadcast it by turns?

2016-01-07 Thread LINChen
Hi kdmxen, You want to delete the broadcast variables on the executors to avoid executor-lost failures, right? Have you tried using the unpersist method? Like this: itemSplitBroadcast.destroy(true); => itemSplitBroadcast.unpersist(true); LIN Chen Date: Thu, 7 Jan 2016 22:01:27 +0800 Subject:
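A short sketch of the suggested change (itemSplitBroadcast is the original poster's broadcast variable):

```scala
// unpersist(blocking = true) removes the serialized copies from the executors but keeps
// the broadcast usable (it is re-sent if referenced again); destroy makes it unusable.
itemSplitBroadcast.unpersist(blocking = true)
```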

Re: Newbie question

2016-01-07 Thread yuliya Feldman
Thank you From: Deepak Sharma To: yuliya Feldman Cc: "user@spark.apache.org" Sent: Thursday, January 7, 2016 10:41 PM Subject: Re: Newbie question Yes , you can do it unless the method is marked

RE: Recommendations using Spark

2016-01-07 Thread Singh, Abhijeet
The question itself is very vague. You might want to use this slide as a starting point http://www.slideshare.net/CasertaConcepts/analytics-week-recommendations-on-spark. From: anjali gautam [mailto:anjali.gauta...@gmail.com] Sent: Friday, January 08, 2016 12:42 PM To: user@spark.apache.org

Spark Context not getting initialized in local mode

2016-01-07 Thread Rahul Kumar
Hi all, I am trying to start Solr with a custom plugin which uses the Spark library. I am trying to initialize a SparkContext in local mode. I have made a fat jar for this plugin using maven-shade and put it in the lib directory. *While starting Solr, it is not able to initialize the SparkContext.* It says
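For context, a minimal sketch of creating a local-mode SparkContext inside such a plugin (the app name is arbitrary; this assumes the shaded jar does not introduce classpath conflicts, which are a common cause of this kind of failure):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("solr-spark-plugin")   // hypothetical name
  .setMaster("local[*]")             // local mode, all available cores
val sc = new SparkContext(conf)
```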

Re: Newbie question

2016-01-07 Thread dEEPU
If the method is not final or static then you can. On Jan 8, 2016 12:07 PM, yuliya Feldman wrote: Hello, I am new to Spark and have what is most likely a basic question - can I override a method from SparkContext? Thanks

Re: Recommendations using Spark

2016-01-07 Thread Stephen Boesch
Alternating least squares takes an RDD of (user/product/ratings) tuples and the resulting Model provides predict(user, product) or predictProducts methods among others.
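A sketch of that MLlib flow; the input path, parameters, and IDs are illustrative:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" lines into Rating objects.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda
val score = model.predict(1, 42)               // predicted rating for user 1, product 42
val topTen = model.recommendProducts(1, 10)    // top-10 products for user 1
```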

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
Hi, we made the change because the partitioning discovery logic was too flexible and it introduced problems that were very confusing to users. To make your case work, we have introduced a new data source option called basePath. You can use DataFrame df =
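The truncated snippet appears to be showing the DataFrameReader call; a sketch of how the basePath option is typically passed in Spark 1.6 (the paths are placeholders):

```scala
val df = sqlContext.read
  .option("basePath", "hdfs:///warehouse/my_table")   // root of the partitioned table
  .parquet("hdfs:///warehouse/my_table/entity=xyz")   // the specific partition to load
```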

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Umesh Kacha
Hi Yin, thanks much your answer solved my problem. Really appreciate it! Regards On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai wrote: > Hi, we made the change because the partitioning discovery logic was too > flexible and it introduced problems that were very confusing to

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
No problem! Glad it helped! On Thu, Jan 7, 2016 at 12:05 PM, Umesh Kacha wrote: > Hi Yin, thanks much your answer solved my problem. Really appreciate it! > > Regards > > > On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai wrote: > >> Hi, we made the change

Date Time Regression as Feature

2016-01-07 Thread Jorge Machado
Hello all, I'm new to machine learning. I'm trying to predict some electric usage with a decision tree. The data is: 2015-12-10-10:00, 1200 2015-12-11-10:00, 1150 My question is: What is the best way to turn date and time into a feature in my Vector? Something like this: Vector (1200,

"impossible to get artifacts " error when using sbt to build 1.6.0 for scala 2.11

2016-01-07 Thread Lin Zhao
I tried to build 1.6.0 for yarn and scala 2.11, but have an error. Any help is appreciated. [warn] Strategy 'first' was applied to 2 files [info] Assembly up to date: /Users/lin/git/spark/network/yarn/target/scala-2.11/spark-network-yarn-1.6.0-hadoop2.7.1.jar java.lang.IllegalStateException:

Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Henrik Baastrup
Hi All, I have a small Hadoop cluster where I have stored a lot of data in parquet files. I have installed a Spark master service on one of the nodes and now would like to query my parquet files from a Spark client. When I run the following program from the spark-shell on the Spark Master node
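For context, a hedged sketch of the kind of remote read being attempted (host names and paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("remote-parquet-read")
  .setMaster("spark://spark-master-host:7077")   // standalone master on the cluster
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Fully qualified HDFS URI so the remote NameNode is resolved from the client.
val df = sqlContext.read.parquet("hdfs://namenode-host:8020/data/my_table")
df.registerTempTable("my_table")
sqlContext.sql("SELECT count(*) FROM my_table").show()
```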

Re: Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
Hi, I start the cluster using the spark-ec2 scripts, so the cluster is in stand-alone mode. Here is how I submit my job: spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar Cheers, Michael On 07.01.2016 22:41,

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
Read about *--total-executor-cores*. Not sure why you specify port 6066 for the master... usually it's 7077. Verify in the master UI (usually port 8080) how many cores are there (this depends on other configs, but usually workers connect to the master with all their cores). On 7 January 2016 at 23:46, Michael Pisula
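For reference, --total-executor-cores on spark-submit corresponds to the spark.cores.max setting in standalone mode; a sketch with an illustrative value of 4:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder master URL
  .set("spark.cores.max", "4")             // cap total cores, spread across workers
```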

Re: spark ui security

2016-01-07 Thread Kostiantyn Kudriavtsev
Can I do it without Kerberos and Hadoop? Ideally using filters, as for the job UI. On Jan 7, 2016, at 1:22 PM, Prem Sure wrote: > you can refer more on https://searchcode.com/codesearch/view/97658783/ >

Re: Spark streaming routing

2016-01-07 Thread Tathagata Das
You cannot guarantee that each key will forever be on the same executor. That is a flawed approach to designing an application if you have to ensure fault-tolerance toward executor failures. On Thu, Jan 7, 2016 at 9:34 AM, Lin Zhao wrote: > I have a need to route the

Re: spark ui security

2016-01-07 Thread Ted Yu
Without kerberos you don't have true security. Cheers On Thu, Jan 7, 2016 at 1:56 PM, Kostiantyn Kudriavtsev < kudryavtsev.konstan...@gmail.com> wrote: > can I do it without kerberos and hadoop? > ideally using filters as for job UI > > On Jan 7, 2016, at 1:22 PM, Prem Sure

RE: Question in rdd caching in memory using persist

2016-01-07 Thread seemanto.barua
Attached is the screenshot of the Storage tab details for the cached RDD. The host highlighted (at the end of the list) is the driver machine. -regards Seemanto Barua From: Barua, Seemanto (US) Sent: Thursday, January 07, 2016 12:43 PM To:

Re: Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
I had tried several parameters, including --total-executor-cores, no effect. As for the port, I tried 7077, but if I remember correctly I got some kind of error that suggested to try 6066, with which it worked just fine (apart from this issue here). Each worker has two cores. I also tried

Re: spark ui security

2016-01-07 Thread Kostiantyn Kudriavtsev
I know, but I only need to hide/protect the web UI, at least with the servlet/filter API. On Jan 7, 2016, at 4:59 PM, Ted Yu wrote: > Without kerberos you don't have true security. > > Cheers > > On Thu, Jan 7, 2016 at 1:56 PM, Kostiantyn Kudriavtsev >

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
Do you see in the master UI that the workers connected to the master, and that before you run your app there are 2 available cores per worker in the master UI? I understand that there are 2 cores on each worker - the question is whether they got registered with the master. Regarding the port, it's very strange; please

Re: Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
All the workers were connected, I even saw the job being processed on different workers, so that was working fine. I will fire up the cluster again tomorrow and post the results of connecting to 7077 and using --total-executor-cores 4. Thanks for the help On 07.01.2016 23:10, Igor Berman wrote:

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
Share how you submit your job and what cluster you use (YARN, standalone). On 7 January 2016 at 23:24, Michael Pisula wrote: > Hi there, > > I ran a simple Batch Application on a Spark Cluster on EC2. Despite having > 3 > Worker Nodes, I could not get the application processed on

adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Ophir Etzion
I'm trying to add jars before running a query using Hive on Spark on CDH 5.4.3. I've tried applying the patch in https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the patch is for a different Hive version) but still haven't succeeded. Did anyone manage to do ADD JAR successfully

Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
Hi there, I ran a simple batch application on a Spark cluster on EC2. Despite having 3 worker nodes, I could not get the application processed on more than one node, regardless of whether I submitted the application in cluster or client mode. I also tried manually increasing the number of partitions in

SparkContext SyntaxError: invalid syntax

2016-01-07 Thread weineran
Hello, When I try to submit a python job using spark-submit (using --master yarn --deploy-mode cluster), I get the following error: Traceback (most recent call last): File "loss_rate_by_probe.py", line 15, in ? from pyspark import SparkContext File