Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread obaidul karim
foreachRDD does not return any value. It can be used just to send results to another place/context, like a db, file, etc. I could use that, but it seems like the overhead of having another hop. I wanted to make it simple and light. On Tuesday, 31 May 2016, nguyen duc tuan wrote: > How

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread nguyen duc tuan
How about using foreachRDD? I think this is much better than your trick. 2016-05-31 12:32 GMT+07:00 obaidul karim : > Hi Guys, > > In the end, I am using the below. > The trick is using "native python map" along with "spark streaming > transform". > May not be an elegant way,
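
A minimal PySpark sketch of the foreachRDD route suggested here, assuming the thread's getFeatures helper and modelRF; save_partition is a hypothetical sink writer:

def process(rdd):
    # build feature vectors, then predict on the whole RDD from the driver side
    feats = rdd.map(lambda txt: [float(v) for v in getFeatures(txt).split(',')])
    modelRF.predict(feats).zip(rdd).foreachPartition(save_partition)

texts.foreachRDD(process)  # returns nothing; results leave the pipeline via the sink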

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread obaidul karim
Hi Guys, In the end, I am using the below. The trick is using "native python map" along with "spark streaming transform". It may not be an elegant way, however it works :). def predictScore(texts, modelRF): predictions = texts.map( lambda txt : (txt , getFeatures(txt)) ).\ map(lambda (txt,
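
A minimal sketch of the transform-plus-map pattern described here (the original code is truncated above; getFeatures and modelRF are the thread's own names):

def predict_batch(rdd):
    feats = rdd.map(lambda txt: [float(v) for v in getFeatures(txt).split(',')])
    # predicting on a whole RDD keeps the MLlib model call on the driver side
    return modelRF.predict(feats).zip(rdd)

scored = texts.transform(predict_batch)  # DStream of (prediction, text) pairs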

Re: Spark + Kafka processing trouble

2016-05-30 Thread Mich Talebzadeh
How are you getting your data from the database? Are you using JDBC? Can you actually call the database first (assuming the same data), put it in a temp table in Spark and cache it for the duration of the window length, and use the data from the cached table? Dr Mich Talebzadeh LinkedIn *
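
A hedged sketch of that suggestion in PySpark 1.x, with placeholder connection details: pull the table once over JDBC, cache it, and join it against each micro-batch.

ref = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://dbhost:5432/mydb",  # hypothetical connection string
    dbtable="reference_table",
    user="spark", password="...").load()
ref.cache()
ref.registerTempTable("reference_cached")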

Re: equivalent between join sql and data frame

2016-05-30 Thread Mich Talebzadeh
One is SQL and the other one is its equivalent in functional programming. val s = HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID") val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC") val t =

Re: Spark Streaming heap space out of memory

2016-05-30 Thread Shahbaz
Hi Christian, can you please try whether 30 seconds works for your case? I think your batches are getting queued up. Regards, Shahbaz On Tuesday 31 May 2016, Dancuart, Christian wrote: > While it has heap space, batches run well below 15 seconds. > > > > Once it starts to
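
A minimal sketch of the suggestion in PySpark, assuming it is the batch interval being widened to 30 seconds (Shahbaz's figure, not a verified fix):

from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 30)  # 30-second batch interval, per the suggestion above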

Re: equivalent between join sql and data frame

2016-05-30 Thread Takeshi Yamamuro
Hi, They are the same. If you want to check the equality, you can use DataFrame#explain. // maropu On Tue, May 31, 2016 at 12:26 PM, pseudo oduesp wrote: > Hi guys, > is it a similar thing to do: > > sqlContext.sql("select * from t1 join t2 on condition") and > >
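
A small sketch of that check in PySpark, with hypothetical table and column names; if the two physical plans match, the SQL and DataFrame forms are equivalent:

sqlContext.sql("select * from t1 join t2 on t1.id = t2.id").explain()
df1.join(df2, df1.id == df2.id, "inner").explain()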

equivalent between join sql and data frame

2016-05-30 Thread pseudo oduesp
Hi guys, is it a similar thing to do: sqlContext.sql("select * from t1 join t2 on condition") and df1.join(df2, condition, 'inner')?? ps: df1.registerTempTable('t1') ps: df2.registerTempTable('t2') thanks

Re: Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
On Tue, May 31, 2016 at 3:14 PM, Darren Govoni wrote: > Well that could be the problem. A SQL database is essentially a big > synchronizer. If you have a lot of spark tasks all bottlenecking on a single > database socket (is the database clustered or colocated with spark

Re: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni
Well that could be the problem. A SQL database is essentially a big synchronizer. If you have a lot of spark tasks all bottlenecking on a single database socket (is the database clustered or colocated with spark workers?) then you will have blocked threads on the database server. Sent

Re: Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
On Tue, May 31, 2016 at 1:56 PM, Darren Govoni wrote: > So you are calling a SQL query (to a single database) within a spark > operation distributed across your workers? Yes, but currently with very small sets of data (1-10,000) and on a single (dev) machine right now.

RE: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni
So you are calling a SQL query (to a single database) within a spark operation distributed across your workers?  Sent from my Verizon Wireless 4G LTE smartphone Original message From: Malcolm Lockyer Date: 05/30/2016 9:45 PM (GMT-05:00)

Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
Hopefully this is not off topic for this list, but I am hoping to reach some people who have used Kafka + Spark before. We are new to Spark and are setting up our first production environment and hitting a speed issue that may be configuration related - and we have little experience in configuring

Re: Can we use existing R model in Spark

2016-05-30 Thread Sun Rui
I mean train the random forest model (not using R) and use it for prediction, all using Spark ML. > On May 30, 2016, at 20:15, Neha Mehta wrote: > > Thanks Sujeet.. will try it out. > > Hi Sun, > > Can you please tell me what you meant by "Maybe you can try using

Re: GraphX Java API

2016-05-30 Thread Chris Fregly
btw, GraphX in Action is one of the better books out on Spark. Michael did a great job with this one. He even breaks down snippets of Scala for newbies to understand the seemingly-arbitrary syntax. I learned quite a bit about not only Spark, but also Scala. And of course, we shouldn't forget

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
I think we are going to move to a model in which the computation stack will be separate from the storage stack; moreover, something like Hive that provides the means for persistent storage (well, HDFS is the one that stores all the data) will have an in-memory type capability much like what Oracle

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
And you have MapR supporting Apache Drill. So these are all alternatives to Spark, and it's not necessarily an either/or scenario. You can have both. > On May 30, 2016, at 12:49 PM, Mich Talebzadeh > wrote: > > yep Hortonworks supports Tez for one reason or other

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
I do not think that in-memory itself will make things faster in all cases, especially if you use Tez with ORC or Parquet. Especially for ad hoc queries on large datasets (independently of whether they fit in-memory or not) this will have a significant impact. This is an experience I have also with the

RE: Spark Streaming heap space out of memory

2016-05-30 Thread Dancuart, Christian
While it has heap space, batches run well below 15 seconds. Once it starts to run out of space, processing time takes about 1.5 minutes. Scheduling delay is around 4 minutes and total delay around 5.5 minutes. I usually shut it down at that point. The number of stages (and pending stages) does

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Ovidiu-Cristian MARCU
Spark in relation to Tez can be like a Flink runner for Apache Beam? The use case of Tez however may be interesting (but current implementation only YARN-based?) Spark is efficient (or faster) for a number of reasons, including its ‘in-memory’ execution (from my understanding and experiments).

Re: Does Spark support updates or deletes on underlying Hive tables

2016-05-30 Thread Mich Talebzadeh
Hi, Remember that acidity and transactional support were added to Hive 0.14 onward because of the advent of ORC tables. Now Spark does not support transactions because quote "there is a piece in the execution side that needs to send heartbeats to Hive metastore saying a transaction is still alive".

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
Yep, Hortonworks supports Tez for one reason or other, which I am hopefully going to test as the query engine for Hive, though I think Spark will be faster because of its in-memory support. Also, if you are independent then you are better off dealing with Spark and Hive without the need to support

Re: Secondary Indexing?

2016-05-30 Thread Mich Talebzadeh
your point on "At the same time… if you are dealing with a large enough set of data… you will have I/O. Both in terms of networking and Physical. This is true of both Spark and in-memory RDBMs. .." Well an IMDB will not start flushing to disk when it gets full, thus doing PIO. It won't be able

Re: Spark Streaming heap space out of memory

2016-05-30 Thread Shahbaz
Hi Christian, - What is the processing time of each of your batches? Is it exceeding 15 seconds? - How many jobs are queued? - Can you take a heap dump and see which objects are occupying the heap? Regards, Shahbaz On Tue, May 31, 2016 at 12:21 AM, christian.dancu...@rbc.com <

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, Most people use vendor releases because they need to have the support. Hortonworks is the vendor who has the most skin in the game when it comes to Tez. If memory serves, Tez isn’t going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez? HTH

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Michael Segel
Going from memory… Derby is/was Cloudscape which IBM acquired from Informix who bought the company way back when. (Since IBM released it under Apache licensing, Sun Microsystems took it and created JavaDB…) I believe that there is a networking function so that you can either bring it up in

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-30 Thread Nirav Patel
Put is a type of Mutation, so I'm not sure what you mean by "if I use Mutation". Anyway, I registered all 3 classes with kryo. kryo.register(classOf[org.apache.hadoop.hbase.client.Put]) kryo.register(classOf[ImmutableBytesWritable]) kryo.register(classOf[Mutable]) It still fails with the same

Spark Streaming heap space out of memory

2016-05-30 Thread christian.dancu...@rbc.com
Hi All, We have a Spark Streaming v1.4/Java 8 application that slows down and eventually runs out of heap space. The less driver memory, the faster it happens. Appended is our spark configuration and a snapshot of the heap taken using jmap on the driver process. The RDDInfo, $colon$colon and

Does Spark support updates or deletes on underlying Hive tables

2016-05-30 Thread Ashok Kumar
Hi, I can do inserts from Spark on Hive tables. How about updates or deletes? They are failing when I tried. Thanking

Re: Secondary Indexing?

2016-05-30 Thread Gourav Sengupta
Hi, Have you tried using partitioning and the Parquet format? It works super fast in Spark. Regards, Gourav On Mon, May 30, 2016 at 5:08 PM, Michael Segel wrote: > I’m not sure where to post this since it’s a bit of a philosophical > question in terms of design and
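
A minimal sketch of the partitioning-plus-Parquet suggestion, assuming a hypothetical event_date column to partition on; predicates on that column then prune partitions instead of scanning everything:

df.write.partitionBy("event_date").parquet("hdfs:///warehouse/events")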

Re: Secondary Indexing?

2016-05-30 Thread Mich Talebzadeh
Just a thought. Well, in Spark RDDs are immutable, which is an advantage compared to a conventional IMDB like Oracle TimesTen, meaning concurrency is not an issue for certain indexes. The overriding optimisation (as there is no Physical IO) has to be reducing memory footprint and CPU demands and

Re: can not use udf in hivethriftserver2

2016-05-30 Thread lalit sharma
Can you try adding the jar to the SPARK_CLASSPATH env variable? On Mon, May 30, 2016 at 9:55 PM, 喜之郎 <251922...@qq.com> wrote: > HI all, I have a problem when using hiveserver2 and beeline. > when I use CLI mode, the udf works well. > But when I begin to use hiveserver2 and beeline, the udf can not

Window Operation on Dstream Fails

2016-05-30 Thread vinay453
Hello, I am using Spark version 1.6.0 and trying to run a window operation on DStreams. Window_TwoMin = 4*60*1000 Slide_OneMin = 2*60*1000 census = ssc.textFileStream("./census_stream/").filter(lambda a: a.startswith('-') == False).map(lambda b: b.split("\t")) .map(lambda c:
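
One thing worth checking: in PySpark, DStream window and slide durations are given in seconds, not milliseconds. A minimal sketch under that assumption (the original pipeline is truncated above):

window_secs = 2 * 60  # two-minute window, in seconds
slide_secs = 1 * 60   # one-minute slide
windowed = census.window(window_secs, slide_secs)
windowed.count().pprint()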

can not use udf in hivethriftserver2

2016-05-30 Thread 喜之郎
Hi all, I have a problem when using hiveserver2 and beeline. When I use CLI mode, the udf works well. But when I begin to use hiveserver2 and beeline, the udf does not work. My Spark version is 1.5.1. I tried 2 methods, first: ## add jar /home/hadoop/dmp-udf-0.0.1-SNAPSHOT.jar; create temporary

Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since it’s a bit of a philosophical question in terms of design and vision for Spark. If we look at SparkSQL and performance… where does secondary indexing fit in? The reason this is a bit awkward is that if you view Spark as querying RDDs which are temporary,

Re: GraphX Java API

2016-05-30 Thread Michael Malak
Yes, it is possible to use GraphX from Java but it requires 10x the amount of code and involves using obscure typing and pre-defined lambda prototype facilities. I give an example of it in my book, the source code for which can be downloaded for free from 

Re: Bug of PolynomialExpansion ?

2016-05-30 Thread Sean Owen
The 2-degree expansion of (x,y,z) is, in this implementation: (x, x^2, y, xy, y^2, z, xz, yz, z^2) Given your input is (1,0,1), the output (1,1,0,0,0,1,1,0,1) is right. On Mon, May 30, 2016 at 12:37 AM, Jeff Zhang wrote: > I use PolynomialExpansion to convert one vector to
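
A small sketch reproducing that ordering with pyspark.ml (the Vectors import path assumes the Spark 1.x API):

from pyspark.ml.feature import PolynomialExpansion
from pyspark.mllib.linalg import Vectors  # pyspark.ml.linalg in Spark 2.x

df = sqlContext.createDataFrame([(Vectors.dense([1.0, 0.0, 1.0]),)], ["features"])
px = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")
px.transform(df).select("expanded").show(truncate=False)
# expected: [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]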

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Yes, you are right. 2016-05-30 2:34 GMT-07:00 Abhishek Anand : > > Thanks Yanbo. > > So, you mean that if I have a variable which is of type double but I want > to treat it like String in my model I just have to cast those columns into > string and simply run the glm

Re: GraphX Java API

2016-05-30 Thread Sean Owen
No, you can call any Scala API in Java. It is somewhat less convenient if the method was not written with Java in mind but does work. On Mon, May 30, 2016, 00:32 Takeshi Yamamuro wrote: > These package are used only for Scala. > > On Mon, May 30, 2016 at 2:23 PM, Kumar,

Re: FAILED_TO_UNCOMPRESS Error - Spark 1.3.1

2016-05-30 Thread Takeshi Yamamuro
Hi, This is a known issue. You need to check a related JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4105 // maropu On Mon, May 30, 2016 at 7:51 PM, Prashant Singh Thakur < prashant.tha...@impetus.co.in> wrote: > Hi, > > > > We are trying to use Spark Data Frames for our use case

Re: Can we use existing R model in Spark

2016-05-30 Thread Neha Mehta
Thanks Sujeet.. will try it out. Hi Sun, Can you please tell me what you meant by "Maybe you can try using the existing random forest model"? Did you mean creating the model again using Spark MLlib? Thanks, Neha > From: sujeet jog > Date: Mon, May 30, 2016 at 4:52

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread nguyen duc tuan
How about this ? def extract_feature(rf_model, x): text = getFeatures(x).split(',') fea = [float(i) for i in text] prediction = rf_model.predict(fea) return (prediction, x) output = texts.map(lambda x: extract_feature(rf_model, x)) 2016-05-30 14:17 GMT+07:00 obaidul karim :

Re: Can we use existing R model in Spark

2016-05-30 Thread sujeet jog
Try to invoke an R script from Spark using the rdd pipe method, get the work done, and receive the model back in an RDD. For ex: rdd.pipe("") On Mon, May 30, 2016 at 3:57 PM, Sun Rui wrote: > Unfortunately no. Spark does not support loading external models (for >
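
A minimal sketch of the pipe idea; score_rf.R is a hypothetical script that reads CSV lines on stdin and writes one prediction per line on stdout, and features_rdd is an assumed RDD of numeric rows:

scored = features_rdd.map(lambda row: ",".join(str(v) for v in row)) \
                     .pipe("Rscript score_rf.R")
scored.take(5)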

Re: JDBC Cluster

2016-05-30 Thread Mich Talebzadeh
When you start the master it starts the application master; it does not start slaves/workers! You need to start slaves with start-slaves.sh. Slaves will look at the file $SPARK_HOME/conf/slaves to get a list of nodes on which to start slaves. Then it will start slaves/workers on each node. You can see all this in spark

FAILED_TO_UNCOMPRESS Error - Spark 1.3.1

2016-05-30 Thread Prashant Singh Thakur
Hi, We are trying to use Spark Data Frames for our use case where we are getting this exception. The parameters used are listed below. Kindly suggest if we are missing something. The version used is Spark 1.3.1. The JIRA is still showing this issue as Open

DAG of Spark Sort application spanning two jobs

2016-05-30 Thread alvarobrandon
I've written a very simple Sort scala program with Spark. object Sort { def main(args: Array[String]): Unit = { if (args.length < 2) { System.err.println("Usage: Sort " + " []") System.exit(1) } val conf = new

Re: Can we use existing R model in Spark

2016-05-30 Thread Sun Rui
Unfortunately no. Spark does not support loading external models (for example, PMML) for now. Maybe you can try using the existing random forest model in Spark. > On May 30, 2016, at 18:21, Neha Mehta wrote: > > Hi, > > I have an existing random forest model created
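
A minimal sketch of the retrain-in-Spark alternative using MLlib's random forest; labeled_points is a hypothetical RDD of LabeledPoint built from the same training data used in R:

from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(labeled_points, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=100, seed=42)
predictions = model.predict(test_features_rdd)  # test_features_rdd is also assumed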

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Gerard Maas
Michael, Mitch, Silvio, Thanks! The own directory is the issue. We are running the Spark Notebook, which uses the same dir per server (i.e. for all notebooks). So this issue prevents us from running 2 notebooks using HiveContext. I'll look into a proper Hive installation and I'm glad to know that

Can we use existing R model in Spark

2016-05-30 Thread Neha Mehta
Hi, I have an existing random forest model created using R. I want to use that to predict values on Spark. Is it possible to do the same? If yes, then how? Thanks & Regards, Neha

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Abhishek Anand
Thanks Yanbo. So, you mean that if I have a variable which is of type double but I want to treat it like String in my model I just have to cast those columns into string and simply run the glm model. String columns will be directly one-hot encoded by the glm provided by sparkR ? Just wanted to

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi, In SparkR glm, categorical features (columns of type string) will be one-hot encoded automatically. So pre-processing like `as.factor` is not necessary; you can directly feed your data to the model training. Thanks Yanbo 2016-05-30 2:06 GMT-07:00 Abhishek Anand :

Re: Launch Spark shell using different python version

2016-05-30 Thread Eike von Seggern
Hi Stuti 2016-03-15 10:08 GMT+01:00 Stuti Awasthi : > Thanks Prabhu, > > I tried starting in local mode but it is still picking up Python 2.6 only. I have > exported “DEFAULT_PYTHON” in my session variable and also included it in PATH. > > > > Export: > > export

Re: List of questions about spark

2016-05-30 Thread Ian
No, the limit is given by your setup. If you use Spark on a YARN cluster, then the number of concurrent jobs is really limited to the resources allocated to each job and how the YARN queues are set up. For instance, if you use the FIFO scheduler (default), then it can be the case that the first

Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Abhishek Anand
Hi , I want to run glm variant of sparkR for my data that is there in a csv file. I see that the glm function in sparkR takes a spark dataframe as input. Now, when I read a file from csv and create a spark dataframe, how could I take care of the factor variables/columns in my data ? Do I need

Re: Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-30 Thread obaidul karim
Hi, Does anybody have any idea on the below? -Obaid On Friday, 27 May 2016, obaidul karim wrote: > Hi Guys, > > This is my first mail to the spark users mailing list. > > I need help on a Dstream operation. > > In fact, I am using a MLlib randomforest model to predict using spark >

RE: Query related to spark cluster

2016-05-30 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi All, @Deepak: Thanks for your suggestion, we are using Mesos to handle the spark cluster. @Jorn: the reason we chose Postgres-XL was because of its geo-spatial support, as we store location data. We were looking at how to quickly put things together and what the right approach is. Our original thinking was

Re: Query related to spark cluster

2016-05-30 Thread Deepak Sharma
Hi Saurabh, You can have a Hadoop cluster running YARN as the scheduler. Configure Spark to run with the same YARN setup. Then you need R on only 1 node, and connect to the cluster using SparkR. Thanks Deepak On Mon, May 30, 2016 at 12:12 PM, Jörn Franke wrote: > > Well if

RE: Query related to spark cluster

2016-05-30 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi Jörn, Thanks for the suggestion. My current cluster setup is mentioned in the attached snapshot. Apart from Postgres-XL, do you see any problem over there? Regards, Saurabh From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: Monday, May 30, 2016 12:12 PM To: Kumar, Saurabh 5. (Nokia - IN/Bangalore)

Re: Query related to spark cluster

2016-05-30 Thread Jörn Franke
Well, if you require R then you need to install it (including all additional packages) on each node. I am not sure why you store the data in Postgres. Storing it in Parquet or ORC in HDFS is sufficient (sorted on relevant columns), and you use the SparkR libraries to access them. > On 30 May
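
A minimal sketch of that suggestion: write the Spark output as Parquet on HDFS and read it back from SparkR rather than round-tripping through Postgres (names and paths are placeholders):

processed_df.write.mode("overwrite").parquet("hdfs:///analytics/results")
# the SparkR side can then read the same path with read.df(sqlContext, path, source = "parquet")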

Query related to spark cluster

2016-05-30 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi Team, I am using Apache Spark to build a scalable analytic engine. My setup is as follows. The flow of processing is as follows: Raw Files > Store to HDFS > Process by Spark and Store to Postgres-XL Database > R processes data from Postgres-XL in distributed mode. I have a 6-node cluster

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-30 Thread sjk
org.apache.hadoop.hbase.client.{Mutation, Put} org.apache.hadoop.hbase.io.ImmutableBytesWritable If you use Mutation, register the above classes too. > On May 30, 2016, at 08:11, Nirav Patel wrote: > > Sure, let me try that. But from the looks of it it seems kryo >