foreachRDD does not return any value. It can be used just to send results to
another place/context, like a DB, a file, etc.
I could use that, but it seems like the overhead of having another hop.
I wanted to keep it simple and light.
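For completeness, a minimal sketch of that foreachRDD pattern (the sink logic
and helper names below are illustrative, not from this thread; it assumes
`predictions` is the scored DStream):

def save_partition(records):
    # open one connection per partition (DB, file, etc.) and write the records
    for rec in records:
        pass  # e.g. write rec to the external store

def save_rdd(rdd):
    rdd.foreachPartition(save_partition)

# foreachRDD returns nothing; it only pushes each micro-batch out to the sink
predictions.foreachRDD(save_rdd)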
On Tuesday, 31 May 2016, nguyen duc tuan wrote:
> How
How about using foreachRDD ? I think this is much better than your trick.
2016-05-31 12:32 GMT+07:00 obaidul karim :
> Hi Guys,
>
> In the end, I am using below.
> The trick is using "native python map" along with "spark streaming
> transform".
> It may not be an elegant way,
Hi Guys,
In the end, I am using below.
The trick is using "native python map" along with "spark streaming
transform".
It may not be an elegant way, but it works :).
def predictScore(texts, modelRF):
    predictions = texts.map(lambda txt: (txt, getFeatures(txt))).\
        map(lambda (txt,
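For reference, a minimal sketch of one way the transform + native map
combination can be wired (assumptions: modelRF is a pyspark.mllib
RandomForestModel and getFeatures returns a numeric feature vector; this is a
sketch, not necessarily the exact code behind the truncated snippet above):

def predict_batch(rdd):
    # transform's function runs once per micro-batch on the driver, so the
    # RDD-based modelRF.predict(...) can be used here
    feats = rdd.map(lambda txt: getFeatures(txt))
    preds = modelRF.predict(feats)
    return rdd.zip(preds)

scored = texts.transform(predict_batch)
scored.pprint()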
How are you getting your data from the database? Are you using JDBC?
Could you call the database first (assuming the same data), put it in a
temp table in Spark, and cache it for the duration of the window length,
then use the data from the cached table?
Dr Mich Talebzadeh
One is SQL and the other one is its equivalent in functional programming:
val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
val c = HiveContext.table("channels").select("CHANNEL_ID", "CHANNEL_DESC")
val t =
Hi Christian, can you please try whether 30 seconds works for your case. I think
your batches are getting queued up. Regards, Shahbaz
On Tuesday 31 May 2016, Dancuart, Christian
wrote:
> While it has heap space, batches run well below 15 seconds.
>
>
>
> Once it starts to
Hi,
They are the same.
To check the equality, you can use DataFrame#explain.
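For instance, a small sketch of that check (table and column names here are
illustrative):

df1.registerTempTable("t1")
df2.registerTempTable("t2")

via_sql = sqlContext.sql("select * from t1 join t2 on t1.id = t2.id")
via_api = df1.join(df2, df1.id == df2.id, "inner")

# if the two physical plans printed below match, the two forms are equivalent
via_sql.explain()
via_api.explain()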
// maropu
On Tue, May 31, 2016 at 12:26 PM, pseudo oduesp
wrote:
> hi guys ,
> is it a similar thing to do:
>
> sqlContext.sql("select * from t1 join t2 on condition") and
>
>
hi guys,
is it a similar thing to do:
sqlContext.sql("select * from t1 join t2 on condition") and
df1.join(df2, condition, 'inner') ??
ps: df1.registerTempTable('t1')
ps: df2.registerTempTable('t2')
thanks
On Tue, May 31, 2016 at 3:14 PM, Darren Govoni wrote:
> Well that could be the problem. A SQL database is essentially a big
> synchronizer. If you have a lot of spark tasks all bottlenecking on a single
> database socket (is the database clustered or colocated with spark
Well that could be the problem. A SQL database is essentially a big synchronizer.
If you have a lot of spark tasks all bottlenecking on a single database socket
(is the database clustered or colocated with the spark workers?) then you will
have blocked threads on the database server.
Sent
On Tue, May 31, 2016 at 1:56 PM, Darren Govoni wrote:
> So you are calling a SQL query (to a single database) within a spark
> operation distributed across your workers?
Yes, but currently with very small sets of data (1-10,000) and on a
single (dev) machine right now.
So you are calling a SQL query (to a single database) within a spark operation
distributed across your workers?
Sent from my Verizon Wireless 4G LTE smartphone
Original message
From: Malcolm Lockyer
Date: 05/30/2016 9:45 PM (GMT-05:00)
Hopefully this is not off topic for this list, but I am hoping to
reach some people who have used Kafka + Spark before.
We are new to Spark and are setting up our first production
environment and hitting a speed issue that may be configuration related
- and we have little experience in configuring
I mean training the random forest model (not using R) and using it for prediction
together, using Spark ML.
> On May 30, 2016, at 20:15, Neha Mehta wrote:
>
> Thanks Sujeet.. will try it out.
>
> Hi Sun,
>
> Can you please tell me what did you mean by "Maybe you can try using
btw, GraphX in Action is one of the better books out on Spark.
Michael did a great job with this one. He even breaks down snippets of
Scala for newbies to understand the seemingly-arbitrary syntax. I learned
quite a bit about not only Spark, but also Scala.
And of course, we shouldn't forget
I think we are going to move to a model where the computation stack will be
separate from the storage stack, and moreover something like Hive that provides
the means for persistent storage (well, HDFS is the one that stores all the
data) will have an in-memory type capability much like what Oracle
And you have MapR supporting Apache Drill.
So these are all alternatives to Spark, and it's not necessarily an either/or
scenario. You can have both.
> On May 30, 2016, at 12:49 PM, Mich Talebzadeh
> wrote:
>
> yep Hortonworks supports Tez for one reason or other
I do not think that in-memory itself will make things faster in all cases,
especially if you use Tez with ORC or Parquet.
Especially for ad hoc queries on large datasets (whether they fit
in memory or not) this will have a significant impact. This is an experience I
have also with the
While it has heap space, batches run well below 15 seconds.
Once it starts to run out of space, processing time takes about 1.5 minutes.
Scheduling delay is around 4 minutes and total delay around 5.5 minutes. I
usually shut it down at that point.
The number of stages (and pending stages) does
Spark in relation to Tez can be like a Flink runner for Apache Beam? The use
case of Tez, however, may be interesting (but is the current implementation only
YARN-based?)
Spark is efficient (or faster) for a number of reasons, including its
‘in-memory’ execution (from my understanding and experiments).
Hi,
Remember that ACID and transactional support were added to Hive from 0.14
onward with the advent of ORC tables.
Now Spark does not support transactions because, quote, "there is a piece in
the execution side that needs to send heartbeats to Hive metastore saying a
transaction is still alive".
Yep, Hortonworks supports Tez for one reason or another, which I am hopefully
going to test as the query engine for Hive. Though I think Spark will be faster
because of its in-memory support.
Also, if you are independent then you are better off dealing with Spark and Hive
without the need to support
your point on
"At the same time… if you are dealing with a large enough set of data… you
will have I/O. Both in terms of networking and Physical. This is true of
both Spark and in-memory RDBMs. .."
Well an IMDB will not start flushing to disk when it gets full, thus doing
PIO. It won't be able
Hi Christian,
- What is the processing time of each of your batches? Is it exceeding 15
seconds?
- How many jobs are queued?
- Can you take a heap dump and see which objects are occupying the heap?
Regards,
Shahbaz
On Tue, May 31, 2016 at 12:21 AM, christian.dancu...@rbc.com <
Mich,
Most people use vendor releases because they need to have the support.
Hortonworks is the vendor who has the most skin in the game when it comes to
Tez.
If memory serves, Tez isn’t going to be M/R but a local execution engine? Then
LLAP is the in-memory piece to speed up Tez?
HTH
Going from memory… Derby is/was Cloudscape which IBM acquired from Informix who
bought the company way back when. (Since IBM released it under Apache
licensing, Sun Microsystems took it and created JavaDB…)
I believe that there is a networking function so that you can either bring it
up in
Put is a type of Mutation, so I'm not sure what you mean by "if I use Mutation".
Anyway, I registered all 3 classes with kryo.
kryo.register(classOf[org.apache.hadoop.hbase.client.Put])
kryo.register(classOf[ImmutableBytesWritable])
kryo.register(classOf[Mutable])
It still fails with the same
Hi All,
We have a spark streaming v1.4/java 8 application that slows down and
eventually runs out of heap space. The less driver memory, the faster it
happens.
Appended is our spark configuration and a snapshot of the heap taken
using jmap on the driver process. The RDDInfo, $colon$colon and
Hi,
I can do inserts from Spark on Hive tables. How about updates or deletes? They
are failing when I try them.
Thanking
Hi,
have you tried using partitioning and the Parquet format? It works super fast
in Spark.
Regards,
Gourav
On Mon, May 30, 2016 at 5:08 PM, Michael Segel
wrote:
> I’m not sure where to post this since its a bit of a philosophical
> question in terms of design and
Just a thought
Well, in Spark RDDs are immutable, which is an advantage compared to a
conventional IMDB like Oracle TimesTen, meaning concurrency is not an issue
for certain indexes.
The overriding optimisation (as there is no physical IO) has to be reducing
memory footprint and CPU demands and
Can you try adding the jar to the SPARK_CLASSPATH env variable?
On Mon, May 30, 2016 at 9:55 PM, 喜之郎 <251922...@qq.com> wrote:
> HI all, I have a problem when using hiveserver2 and beeline.
> when I use CLI mode, the udf works well.
> But when I begin to use hiveserver2 and beeline, the udf can not
Hello,
I am using 1.6.0 version of Spark and trying to run window operation on
DStreams.
Window_TwoMin = 4*60*1000
Slide_OneMin = 2*60*1000
census = ssc.textFileStream("./census_stream/") \
    .filter(lambda a: a.startswith('-') == False) \
    .map(lambda b: b.split("\t")) \
    .map(lambda c:
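As a side note, a minimal sketch of a two-minute window sliding every minute
(in the PySpark Streaming API the durations are given in seconds, not
milliseconds; the count step is only illustrative):

window_two_min = 2 * 60   # seconds
slide_one_min = 1 * 60    # seconds

windowed = census.window(window_two_min, slide_one_min)
windowed.count().pprint()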
Hi all, I have a problem when using hiveserver2 and beeline.
When I use CLI mode, the udf works well.
But when I begin to use hiveserver2 and beeline, the udf does not work.
My Spark version is 1.5.1.
I tried 2 methods, first:
##
add jar /home/hadoop/dmp-udf-0.0.1-SNAPSHOT.jar;
create temporary
I’m not sure where to post this since it's a bit of a philosophical question in
terms of design and vision for Spark.
If we look at SparkSQL and performance… where does Secondary indexing fit in?
The reason this is a bit awkward is that if you view Spark as querying RDDs
which are temporary,
Yes, it is possible to use GraphX from Java but it requires 10x the amount of
code and involves using obscure typing and pre-defined lambda prototype
facilities. I give an example of it in my book, the source code for which can
be downloaded for free from
The 2-degree expansion of (x,y,z) is, in this implementation:
(x, x^2, y, xy, y^2, z, xz, yz, z^2)
Given your input is (1,0,1), the output (1,1,0,0,0,1,1,0,1) is right.
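A small sketch reproducing that with the pyspark.ml API (the sqlContext and
column names are assumptions):

from pyspark.ml.feature import PolynomialExpansion
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([(Vectors.dense([1.0, 0.0, 1.0]),)], ["features"])
expander = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")

# expected expansion of (1, 0, 1): [1, 1, 0, 0, 0, 1, 1, 0, 1]
expander.transform(df).select("expanded").show(truncate=False)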
On Mon, May 30, 2016 at 12:37 AM, Jeff Zhang wrote:
> I use PolynomialExpansion to convert one vector to
Yes, you are right.
2016-05-30 2:34 GMT-07:00 Abhishek Anand :
>
> Thanks Yanbo.
>
> So, you mean that if I have a variable which is of type double but I want
> to treat it like String in my model I just have to cast those columns into
> string and simply run the glm
No, you can call any Scala API in Java. It is somewhat less convenient if
the method was not written with Java in mind but does work.
On Mon, May 30, 2016, 00:32 Takeshi Yamamuro wrote:
> These package are used only for Scala.
>
> On Mon, May 30, 2016 at 2:23 PM, Kumar,
Hi,
This is a known issue.
You need to check a related JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-4105
// maropu
On Mon, May 30, 2016 at 7:51 PM, Prashant Singh Thakur <
prashant.tha...@impetus.co.in> wrote:
> Hi,
>
>
>
> We are trying to use Spark Data Frames for our use case
Thanks Sujeet.. will try it out.
Hi Sun,
Can you please tell me what you meant by "Maybe you can try using the
existing random forest model"? Did you mean creating the model again using
Spark MLlib?
Thanks,
Neha
> From: sujeet jog
> Date: Mon, May 30, 2016 at 4:52
How about this ?
def extract_feature(rf_model, x):
    text = getFeatures(x).split(',')
    fea = [float(i) for i in text]
    prediction = rf_model.predict(fea)
    return (prediction, x)

output = texts.map(lambda x: extract_feature(rf_model, x))
2016-05-30 14:17 GMT+07:00 obaidul karim :
Try to invoke an R script from Spark using the rdd pipe method, get the work
done, and receive the model back in an RDD.
For example:
rdd.pipe("")
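A minimal sketch of that idea ("predict.R" is purely an illustrative name; the
script would read records from stdin and write one result per line to stdout,
and `texts` is assumed to be the input RDD):

# each partition's records are streamed to the external process via stdin,
# and the lines it writes to stdout come back as an RDD of strings
scored = texts.pipe("Rscript predict.R")
print(scored.take(5))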
On Mon, May 30, 2016 at 3:57 PM, Sun Rui wrote:
> Unfortunately no. Spark does not support loading external modes (for
>
When you start the master it starts the application master. It does not start
slaves/workers!
You need to start the slaves with start-slaves.sh.
The script will look at the file $SPARK_HOME/conf/slaves to get a list of nodes
on which to start slaves; then it will start slaves/workers on each node. You
can see all this in spark
Hi,
We are trying to use Spark Data Frames for our use case where we are getting
this exception.
The parameters used are listed below. Kindly suggest if we are missing
something.
Version used is Spark 1.3.1
Jira is still showing this issue as Open
I've written a very simple Sort scala program with Spark.
object Sort {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: Sort " +
        " []")
      System.exit(1)
    }
    val conf = new
Unfortunately no. Spark does not support loading external models (for example,
PMML) for now.
Maybe you can try using the existing random forest model in Spark.
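If you go that route, a minimal sketch of training MLlib's built-in random
forest (the training RDD and parameter values are placeholders):

from pyspark.mllib.tree import RandomForest

# training_data: an RDD[LabeledPoint] built from your own features and labels
model = RandomForest.trainClassifier(
    training_data,
    numClasses=2,
    categoricalFeaturesInfo={},
    numTrees=50,
    featureSubsetStrategy="auto",
    impurity="gini",
    maxDepth=5)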
> On May 30, 2016, at 18:21, Neha Mehta wrote:
>
> Hi,
>
> I have an existing random forest model created
Michael, Mitch, Silvio,
Thanks!
The "own directory" is the issue. We are running the Spark Notebook, which
uses the same dir per server (i.e. for all notebooks). So this issue
prevents us from running 2 notebooks using HiveContext.
I'll look into a proper Hive installation, and I'm glad to know that
Hi,
I have an existing random forest model created using R. I want to use that
to predict values on Spark. Is it possible to do the same? If yes, then how?
Thanks & Regards,
Neha
Thanks Yanbo.
So, you mean that if I have a variable which is of type double but I want
to treat it like a String in my model, I just have to cast those columns into
string and simply run the glm model? String columns will be directly
one-hot encoded by the glm provided by SparkR?
Just wanted to
Hi Abhi,
In SparkR glm, categorical features (columns of type string) will be one-hot
encoded automatically.
So pre-processing like `as.factor` is not necessary; you can directly feed
your data to the model training.
Thanks
Yanbo
2016-05-30 2:06 GMT-07:00 Abhishek Anand :
Hi Stuti
2016-03-15 10:08 GMT+01:00 Stuti Awasthi :
> Thanks Prabhu,
>
> I tried starting in local mode but still picking Python 2.6 only. I have
> exported “DEFAULT_PYTHON” in my session variable and also included in PATH.
>
>
>
> Export:
>
> export
No, the limit is given by your setup. If you use Spark on a YARN cluster,
then the number of concurrent jobs is really limited to the resources
allocated to each job and how the YARN queues are set up. For instance, if
you use the FIFO scheduler (default), then it can be the case that the first
Hi ,
I want to run the glm variant of SparkR for my data that is in a csv file.
I see that the glm function in SparkR takes a Spark dataframe as input.
Now, when I read a file from csv and create a Spark dataframe, how could I
take care of the factor variables/columns in my data?
Do I need
Hi,
Does anybody have any idea on the below?
-Obaid
On Friday, 27 May 2016, obaidul karim wrote:
> Hi Guys,
>
> This is my first mail to spark users mailing list.
>
> I need help on Dstream operation.
>
> In fact, I am using a MLlib randomforest model to predict using spark
>
Hi All,
@Deepak: Thanks for your suggestion, we are using Mesos to handle spark cluster.
@Jorn: the reason we chose Postgres-XL was because of its geo-spatial support, as
we store location data.
We were looking at how to quickly make things better and what the right approach
is. Our original thinking was
Hi Saurabh
You can have a Hadoop cluster running YARN as the scheduler.
Configure Spark to run with the same YARN setup.
Then you need R only on 1 node, and you connect to the cluster using
SparkR.
Thanks
Deepak
On Mon, May 30, 2016 at 12:12 PM, Jörn Franke wrote:
>
> Well if
Hi Jörn,
Thanks for the suggestion.
My current cluster setup is shown in the attached snapshot. Apart from
Postgres-XL, do you see any problem over there?
Regards,
Saurabh
From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Monday, May 30, 2016 12:12 PM
To: Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Well if you require R then you need to install it (including all additional
packages) on each node. I am not sure why you store the data in Postgres.
Storing it in Parquet or ORC in HDFS (sorted on relevant columns) is sufficient,
and you can use the SparkR libraries to access them.
> On 30 May
Hi Team,
I am using Apache spark to build scalable Analytic engine. My setup is as
follows.
Flow of processing is as follows:
Raw Files > Store to HDFS > Process by Spark and Store to Postgres-XL Database >
R processes data from Postgres-XL to process in distributed mode.
I have 6 nodes cluster
org.apache.hadoop.hbase.client.{Mutation, Put}
org.apache.hadoop.hbase.io.ImmutableBytesWritable
If you used Mutation, register the above classes too.
> On May 30, 2016, at 08:11, Nirav Patel wrote:
>
> Sure let me can try that. But from looks of it it seems kryo
>