Re: How to handle categorical variables in Spark MLlib?

2015-12-25 Thread Yanbo Liang
Hi Hokam, You can use OneHotEncoder to encode category variables into a feature vector; Spark ML provides this transformer. To weight an individual category, there is no existing method, but you can implement a UDF that multiplies a specified element of the vector by a factor. Yanbo
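A minimal Scala sketch of the approach described above; the column names, the vector index, and the weight factor are illustrative placeholders, not a confirmed recipe:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.{col, lit, udf}

    // index the string column, then one-hot encode it
    val indexed = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
      .fit(df).transform(df)
    val encoded = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
      .transform(indexed)

    // UDF that multiplies one position of the encoded vector by a weight
    val weightCategory = udf { (v: Vector, i: Int, w: Double) =>
      val arr = v.toArray
      arr(i) = arr(i) * w
      Vectors.dense(arr)
    }
    val weighted = encoded.withColumn("weightedVec",
      weightCategory(col("categoryVec"), lit(2), lit(5.0)))  // weight index 2 by 5.0 (illustrative)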

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
I think SparkContext is thread-safe; you can concurrently submit jobs from different threads, so the problem you hit might not be related to this. Can you reproduce this issue every time you concurrently submit jobs, or does it happen occasionally? BTW, I guess you're using an old version of

Re: spark rdd grouping

2015-12-25 Thread Shushant Arora
Hi I have created a jira for this feature https://issues.apache.org/jira/browse/SPARK-12524 Please vote for this feature if it's necessary. I would like to implement this feature. Thanks Shushant On Wed, Dec 2, 2015 at 1:14 PM, Rajat Kumar wrote: > What if I don't have

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
MapOutputTracker is used to track the map output data, which will be used by the shuffle fetcher to fetch the shuffle blocks. I'm not sure it is related to hardware resources; did you see other exceptions besides this one? This Akka failure may be related to other issues. If you think system resource

REST Api not working in spark

2015-12-25 Thread aman solanki
Hi All, I am getting the error below while fetching job details of an application via the REST API that Spark provides. Can anybody help me understand why I am getting this error? Thanks in advance..!!! FYI, my application is a streaming application. Error 503 Service Unavailable HTTP ERROR

Re: Spark Streaming - print accumulators value every period as logs

2015-12-25 Thread Ali Gouta
Something like stream.foreachRDD(rdd => rdd.collect.foreach(_ => println(accum.value))) should answer your question; you get the value printed in each batch interval. Ali Gouta On 25 Dec 2015 04:22, "Roberto Coluccio" wrote: > Hello, > > I have a batch and a streaming driver
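A minimal sketch of printing an accumulator once per batch on the driver; stream, ssc, and isError are placeholders introduced for illustration:

    // long accumulator updated on the executors, read on the driver each batch
    val errorCount = ssc.sparkContext.accumulator(0L, "errorCount")

    stream.foreachRDD { rdd =>
      rdd.foreach { record =>
        if (isError(record)) errorCount += 1L   // isError is a hypothetical helper
      }
      // runs on the driver after the action above, so the value reflects this batch
      println(s"errorCount so far: ${errorCount.value}")
    }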

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread donhoff_h
Hi, Saisai Shao Many thanks for your reply. I used Spark v1.3. Unfortunately I can not change to another version. As to the frequency, yes, every time I ran a few jobs simultaneously (usually above 10 jobs), this would appear. Is this related to the CPUs or memory? I ran those jobs on a

Re: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread fightf...@163.com
Emm...I think you can do a df.map and store each column value into your list. fightf...@163.com From: zml张明磊 Sent: 2015-12-25 15:33 To: user@spark.apache.org Cc: dev-subscr...@spark.apache.org Subject: How can I get the column data based on specific column name and then stored these data in array

Re: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Eran Witkon
If you drop the other columns (or map to a new df with only that column) and call collect, I think you will get what you want. On Fri, 25 Dec 2015 at 10:26 fightf...@163.com wrote: > Emm...I think you can do a df.map and store each column value into your list. > >
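A minimal sketch of that approach; "name" is a placeholder column and is assumed to be a string column:

    // select only the wanted column, map each Row to its value, collect to the driver
    val names: Array[String] = df.select("name")
      .map(row => row.getString(0))   // in Spark 1.x, DataFrame.map returns an RDD
      .collect()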

Re: Struggling time by data

2015-12-25 Thread Xingchi Wang
map { case (x, y) => val s = x.split("_"); (s(0), (s(1), y)) }.groupByKey().filter { case (_, (a, b)) => abs(a._1 - b._1) < 30 min } does it work for you? 2015-12-25 16:53 GMT+08:00 Yasemin Kaya : > hi, > > I have struggled with this data for a couple of days, and I can't find a solution. Could you >

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread donhoff_h
Hi, there is no other exception besides this one. I guess it is related to hardware resources just because the exception appears only when running more than 10 jobs simultaneously. But since I am not sure of the root cause, I can not request more hardware resources from my company. This is what

Struggling time by data

2015-12-25 Thread Yasemin Kaya
hi, I have struggled with this data for a couple of days, and I can't find a solution. Could you help me? DATA: (userid1_time, url) (userid1_time2, url2) I want to group urls that fall within 30 minutes. RESULT: If time2-time1 < 30 min then (user1, [url1, url2]) Best, yasemin -- hiç ender hiç

Re: Retrieving the PCA parameters in pyspark

2015-12-25 Thread Yanbo Liang
Hi Rohit, This is a known bug, but you can get these parameters if you use the Scala version. Yanbo 2015-12-03 0:36 GMT+08:00 Rohit Girdhar : > Hi > > I'm using PCA through the python interface for spark, as per the > instructions on this page: >
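A small Scala sketch of retrieving the PCA parameters via MLlib; the data here is made up purely for illustration:

    import org.apache.spark.mllib.feature.PCA
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))

    val pcaModel = new PCA(2).fit(data)
    println(pcaModel.pc)                     // principal components as a local matrix
    val projected = pcaModel.transform(data) // original vectors projected onto k components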

Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Ted Yu
The error was due to the blank field being defined twice in the schema. On Tue, Dec 22, 2015 at 12:03 AM, Divya Gehlot wrote: > Hi, > I am a newbie to Apache Spark, using the CDH 5.5 Quick Start VM with Spark > 1.5.0. > I am working on a custom schema and getting an error > > import
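A minimal sketch of a custom schema where every field name is unique, assuming the spark-csv data source; the names, types, and path are placeholders:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Seq(
      StructField("id",    IntegerType, nullable = true),
      StructField("name",  StringType,  nullable = true),
      StructField("price", DoubleType,  nullable = true)))  // each field name appears exactly once

    val df = sqlContext.read
      .format("com.databricks.spark.csv")   // assumes the spark-csv package is on the classpath
      .option("header", "true")
      .schema(customSchema)
      .load("/path/to/file.csv")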

Re: Problem using limit clause in spark sql

2015-12-25 Thread manasdebashiskar
It can be easily done using an RDD. rdd.zipWithIndex followed by partitionBy(YourCustomPartitioner) should give you your items. Here YourCustomPartitioner will know how to pick sample items from each partition. If you want to stick to DataFrames, you can always repartition the data after you apply the limit.
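A rough sketch of that idea, assuming the goal is to route rows by their global index; the partitioner logic is only illustrative, not a prescribed implementation:

    import org.apache.spark.Partitioner

    // hypothetical partitioner that assigns index ranges to partitions
    class IndexRangePartitioner(override val numPartitions: Int, totalCount: Long) extends Partitioner {
      override def getPartition(key: Any): Int = {
        val idx = key.asInstanceOf[Long]
        math.min((idx * numPartitions / totalCount).toInt, numPartitions - 1)
      }
    }

    val total = rdd.count()
    val byIndex = rdd.zipWithIndex().map { case (value, idx) => (idx, value) }  // index becomes the key
    val partitioned = byIndex.partitionBy(new IndexRangePartitioner(4, total))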

Re: fishing for help!

2015-12-25 Thread Chris Fregly
Note that with AWS, you can use Placement Groups and EC2 instances with Enhanced Networking to lower network latency and increase network

Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Eugene Morozov
Hello, I'm basically stuck as I have no idea where to look. The following simple code, given that my DataSource is working, gives me an exception. DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource"); df.cache(); df.printSchema(); <-- prints the schema perfectly fine!

Re: Memory allocation for Broadcast values

2015-12-25 Thread Chris Fregly
Note that starting with Spark 1.6, memory can be dynamically allocated by the Spark execution engine based on workload heuristics. You can still set a low watermark for the spark.storage.memoryFraction (RDD cache), but the rest can be dynamic. Here are some relevant slides from a recent
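A minimal sketch of the Spark 1.6 unified-memory settings related to what's described above; the values are illustrative assumptions, not tuning advice:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Spark 1.6+ unified memory: fraction of the heap shared by execution and storage
      .set("spark.memory.fraction", "0.75")
      // low watermark: portion of that region protected for cached (storage) blocks
      .set("spark.memory.storageFraction", "0.5")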

Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Chris Fregly
What are you trying to do, exactly? On Tue, Dec 22, 2015 at 3:03 AM, Divya Gehlot wrote: > Hi, > I am a newbie to Apache Spark, using the CDH 5.5 Quick Start VM with Spark > 1.5.0. > I am working on a custom schema and getting an error > > import

java.sql.SQLException: Unsupported type -101

2015-12-25 Thread Madabhattula Rajesh Kumar
Hi I'm not able to read an "Oracle Table - TIMESTAMP(6) WITH TIME ZONE datatype" column using Spark SQL. I'm getting the exception below. Please let me know how to resolve this issue. Exception :- Exception in thread "main" java.sql.SQLException: Unsupported type -101 at
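One possible workaround (an assumption, not a confirmed fix): cast the unsupported column to a plain TIMESTAMP inside the dbtable subquery so the JDBC driver reports a supported type. The URL, driver, table, and column names are placeholders:

    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection string
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("dbtable",
        "(select id, cast(event_ts as timestamp) as event_ts from my_table) t")  // cast on the Oracle side
      .load()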

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
So it looks like you're increasing the number of trees by 5x and you're seeing an 8x increase in runtime, correct? Did you analyze the Spark cluster resources to monitor the memory usage, spillage, disk I/O, etc? You may need more Workers. On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov <

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-25 Thread Chris Fregly
And which version of Spark/Spark Streaming are you using? Are you explicitly setting spark.streaming.concurrentJobs to something larger than the default of 1? If so, please try setting that back to 1 and see if the problem still exists. This is a dangerous parameter to modify from the

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Igor Berman
sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 supported) or use Dmitriy's solution. select() defines your projection when you've specified entire query On 25 December 2015 at 15:42, Василец Дмитрий wrote: > hello > you can try to use

Re: Fat jar can't find jdbc

2015-12-25 Thread Chris Fregly
JDBC drivers need to be on the system classpath. Try passing --jars /path/to/local/mysql-connector.jar when you submit the job. This will also copy the jars to each of the worker nodes and should set you straight. On Tue, Dec 22, 2015 at 11:42 AM, Igor Berman wrote: >

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Eugene Morozov
Ted, Igor, Oh my... thanks a lot to both of you! Igor was absolutely right, but I missed that I have to use sqlContext =( Everything's perfect. Thank you. -- Be well! Jean Morozov On Fri, Dec 25, 2015 at 8:31 PM, Ted Yu wrote: > DataFrame uses different syntax from SQL

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
Any help is highly appreciated, I am completely stuck here. From: Vivek Meghanathan (WT01 - NEP) Sent: Thursday, December 24, 2015 7:50 PM To: Bryan; user@spark.apache.org Subject: RE: Spark Streaming + Kafka + scala job message read issue We are using the

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
I assume by "The same code perfectly works through Zeppelin 0.5.5" that you're using the %sql interpreter with your regular SQL SELECT statement, correct? If so, the Zeppelin interpreter is converting the that follows %sql to sqlContext.sql() per the following code:

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Eugene Morozov
Thanks for the comments, although the issue is not in the limit() predicate. It's something with Spark being unable to resolve the expression. I can do something like this, and it works as it's supposed to: df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5); But I think the old-fashioned SQL style

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Василец Дмитрий
hello you can try to use df.limit(5).show() just a trick :) On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov wrote: > Hello, I'm basically stuck as I have no idea where to look. > > The following simple code, given that my DataSource is working, gives me an > exception. > >

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-25 Thread Chris Fregly
Configuring JDBC drivers with Spark is a bit tricky as the JDBC driver needs to be on the Java System Classpath per this troubleshooting section in the Spark SQL programming guide. Here

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Ted Yu
DataFrame uses different syntax from a SQL query. I searched the unit tests but didn't find any in the form of df.select("select ...") Looks like you should use sqlContext as other people suggested. On Fri, Dec 25, 2015 at 8:29 AM, Eugene Morozov wrote: > Thanks for the

Re: Struggling time by data

2015-12-25 Thread Yasemin Kaya
It is ok, but I actually want to categorize the urls by session. DATA: (sorted by time) (userid1_time, url1) (userid1_time2, url2) (userid1_time3, url3) (userid1_time4, url4) RESULT: url1 is already added to session1; time2-time1 < 30 min, so url2 goes to session1
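A rough sketch of 30-minute sessionization per user, assuming the key has the form "userid_epochMillis" and that groupByKey is acceptable for the per-user data volume:

    val thirtyMin = 30L * 60 * 1000

    // data: RDD[(String, String)] of (userid_time, url)
    val sessions = data
      .map { case (key, url) =>
        val Array(user, time) = key.split("_")
        (user, (time.toLong, url))
      }
      .groupByKey()
      .mapValues { events =>
        val sorted = events.toSeq.sortBy(_._1)
        // start a new session whenever the gap to the previous event exceeds 30 minutes
        sorted.foldLeft(List.empty[List[(Long, String)]]) {
          case (Nil, e)                                                   => List(List(e))
          case (current :: done, e) if e._1 - current.head._1 < thirtyMin => (e :: current) :: done
          case (acc, e)                                                   => List(e) :: acc
        }.map(_.reverse.map(_._2)).reverse   // sessions of urls, in chronological order
      }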

Re: Getting estimates and standard error using ml.LinearRegression

2015-12-25 Thread Chris Fregly
Try http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator if you're using spark.ml.LinearRegression (which it appears you are) or http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.evaluation.RegressionMetrics
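A small sketch of RegressionMetrics against an ml.LinearRegression model; model and testData are placeholders, and note this yields fit metrics (RMSE, R2) rather than per-coefficient standard errors:

    import org.apache.spark.mllib.evaluation.RegressionMetrics

    val predictionAndLabel = model.transform(testData)
      .select("prediction", "label")
      .map(row => (row.getDouble(0), row.getDouble(1)))

    val metrics = new RegressionMetrics(predictionAndLabel)
    println(s"RMSE = ${metrics.rootMeanSquaredError}")
    println(s"R2   = ${metrics.r2}")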

ClassNotFoundException when executing spark jobs in standalone/cluster mode on Spark 1.5.2

2015-12-25 Thread Saiph Kappa
Hi, I'm submitting a spark job like this: ~/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --class Benchmark --master spark://machine1:6066 --deploy-mode cluster --jars target/scala-2.10/benchmark-app_2.10-0.1-SNAPSHOT.jar /home/user/bench/target/scala-2.10/benchmark-app_2.10-0.1-SNAPSHOT.jar

Re: Spark SQL UDF with Struct input parameters

2015-12-25 Thread Deenar Toraskar
I have found that this does not even work with a struct as an input parameter def testUDF(expectedExposures: (Float, Float)) = { (expectedExposures._1 * expectedExposures._2 / expectedExposures._1) } sqlContext.udf.register("testUDF", testUDF _) sqlContext.sql("select

number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi, I'm having trouble working out how to set the number of executors when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors. However, if I start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client",

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-25 Thread Benjamin Kim
Hi Chris, I did what you did. It works for me now! Thanks for your help. Have a Merry Christmas! Cheers, Ben > On Dec 25, 2015, at 6:41 AM, Chris Fregly wrote: > > Configuring JDBC drivers with Spark is a bit tricky as the JDBC driver needs > to be on the Java System

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
Hi Bryan, PhuDuc, All 8 jobs are consuming 8 different IN topics. 8 different Scala jobs are running, and each topic map mentioned below has only 1 thread. In this case the group should not be a problem, right? Here is the complete flow: Spring MVC sends messages to Kafka, spark

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Vivek, https://spark.apache.org/docs/1.5.2/streaming-kafka-integration.html The map is of topic to the number of threads used to consume it. Is numThreads below equal to the number of partitions in your topic? Regards, Bryan Jeffrey Sent from Outlook Mail for Windows 10 phone From:
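A minimal sketch of the receiver-based API being discussed; the ZooKeeper quorum, group id, topic name, and thread count are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numThreads = 2                                  // ideally no more than the topic's partition count
    val topicMap = Map("my_in_topic" -> numThreads)     // topic -> number of consumer threads
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", topicMap)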

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread PhuDuc Nguyen
Vivek, Did you say you have 8 spark jobs that are consuming from the same topic and all jobs are using the same consumer group name? If so, each job would get a subset of the messages from that Kafka topic, i.e. each job would get 1 out of 8 messages from that topic. Is that your intent? regards, Duc

why one of Stage is into Skipped section instead of Completed

2015-12-25 Thread Prem Spark
What does the Skipped stage below mean? Can anyone help clarify? I was expecting 3 stages to succeed, but only 2 of them completed while one was skipped. Status: SUCCEEDED Completed Stages: 2 Skipped Stages: 1 Scala REPL Code Used: accounts is a basic RDD that contains

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
Oh, and it's worth noting that - starting with Spark 1.6 - you'll be able to just do the following: SELECT * FROM json.`/path/to/json/file` (note the back ticks) instead of calling registerTempTable() for the sole purpose of using SQL. https://issues.apache.org/jira/browse/SPARK-11197 On Fri,

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Agreed. I did not see that they were using the same group name. Sent from Outlook Mail for Windows 10 phone From: PhuDuc Nguyen Sent: Friday, December 25, 2015 3:35 PM To: vivek.meghanat...@wipro.com Cc: user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Alexander Ratnikov
Definitely the biggest difference is the maxDepth of the trees. With values smaller than or equal to 5, the time goes down to milliseconds. The number of trees affects the performance, but not that much. I tried to profile the app and I see decent time spent in serialization. I'm wondering if Spark isn't

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
Ah, so with that much serialization happening, you might actually need *fewer* workers! :) In the next couple of releases of Spark ML, we should see better scoring/predicting functionality using a single node for exactly this reason. To get there, we need model.save/load support (PMML?),

Spark SQL UDF with Struct input parameters

2015-12-25 Thread Deenar Toraskar
Hi I am trying to define a UDF that can take an array of tuples as input def effectiveExpectedExposure(expectedExposures: Seq[(Seq[Float], Seq[Float])]) = expectedExposures.map(x => x._1 * x._2).sum / expectedExposures.map(x => x._1).sum sqlContext.udf.register("expectedPositiveExposure",
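A hedged workaround sketch (an assumption based on the thread, not a confirmed behaviour of the UDF API): pass two parallel arrays instead of an array of structs, which plain Scala UDF registration handles directly. The function name, column names, and table name are placeholders:

    // register a UDF over two array<float> columns rather than array<struct<...>>
    sqlContext.udf.register("effectiveExpectedExposure",
      (exposures: Seq[Float], times: Seq[Float]) =>
        exposures.zip(times).map { case (e, t) => e * t }.sum / exposures.sum)

    sqlContext.sql(
      "select effectiveExpectedExposure(exposures, times) from exposuresTable")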

Re: number of executors in sparkR.init()

2015-12-25 Thread Felix Cheung
The equivalent of spark-submit --num-executors should be spark.executor.instances when used in SparkConf: http://spark.apache.org/docs/latest/running-on-yarn.html Could you try setting that with sparkR.init()? _ From: Franc Carter Sent:

Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Thanks, that works. Cheers. On 26 December 2015 at 16:53, Felix Cheung wrote: > The equivalent of spark-submit --num-executors should be > spark.executor.instances > when used in SparkConf: > http://spark.apache.org/docs/latest/running-on-yarn.html > > Could you try