Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread murat migdisoglu
...mode using a dedicated cluster for each job. Is there a way to see how much compute time each job takes via Spark APIs, metrics, etc.? In case it makes a difference, I’m using AWS EMR - I’d ultimately like to be able to say this job costs $X since it took Y minutes on Z instance type...

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread Jörn Franke
I’m running Spark jobs in cluster mode using a dedicated cluster for each job. Is there a way to see how much compute time each job takes via Spark APIs, metrics, etc.? In case it makes a difference, I’m using AWS EMR - I’d ultimately like to be able to say this job costs $X...

Cluster-mode job compute-time/cost metrics

2023-12-11 Thread Jack Wells
Hello Spark experts - I’m running Spark jobs in cluster mode using a dedicated cluster for each job. Is there a way to see how much compute time each job takes via Spark APIs, metrics, etc.? In case it makes a difference, I’m using AWS EMR - I’d ultimately like to be able to say this job costs $X

[sparklyR] broadcast table for temporary table -> can you compute statistics for temporary table?

2022-11-23 Thread Joris Billen
...an Oracle table) there will not be any statistics. Is there any way to compute the stats for a temporary table so that Spark will know whether it needs to auto-broadcast? Thanks!

Separating storage from compute layer with Spark and data warehouses offering ML capabilities

2020-11-29 Thread Mich Talebzadeh
...data or results of models). This is where Spark comes into play. It can connect to multiple sources with JDBC connections, can combine data from these sources within Spark itself, and can provide in-memory enrichment and computation at the compute layer. Additionally, and perhaps more importantly, you can scale up and...

WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

2020-09-27 Thread xorz57
I am running Apache Spark Core using Scala 2.12.12 on IntelliJ IDEA 2020.2 with Docker 2.3.0.5. I am running Windows 10 build 2004. Can somebody explain to me why I am receiving this...

WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

2020-06-03 Thread YuqingWan
I installed Spark and ran it, and then I get the error "WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped". Something like:...

Re: Compute the Hash of each row in new column

2020-03-02 Thread Chetan Khatri
...(ds.columns.map(col) ++ ds.columns.map(column => md5(col(column)).as(s"$column hash")): _*).show(false) Enrico On 02.03.20 at 11:10, Chetan Khatri wrote: Thanks Enrico, I want to compute the hash of all the column values in the row...

Re: Compute the Hash of each row in new column

2020-03-02 Thread Enrico Minack
Well, then apply md5 on all columns: ds.select(ds.columns.map(col) ++ ds.columns.map(column => md5(col(column)).as(s"$column hash")): _*).show(false) Enrico On 02.03.20 at 11:10, Chetan Khatri wrote: Thanks Enrico, I want to compute the hash of all the column values in the row...
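
A minimal, self-contained sketch of the approach above, assuming Spark 2.x Scala; the sample data and column names are illustrative, and a cast to string is added so md5 also works on non-string columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, md5}

val spark = SparkSession.builder().appName("row-hashes").master("local[*]").getOrCreate()
import spark.implicits._

// Toy dataset standing in for the poster's Dataset ds.
val ds = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Keep every original column and add one md5 hash column per source column.
ds.select(ds.columns.map(col) ++ ds.columns.map(c => md5(col(c).cast("string")).as(s"$c hash")): _*)
  .show(false)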

Re: Compute the Hash of each row in new column

2020-03-02 Thread Chetan Khatri
Thanks Enrico, I want to compute the hash of all the column values in the row. On Fri, Feb 28, 2020 at 7:28 PM Enrico Minack wrote: This computes the md5 hash of a given column id of Dataset ds: ds.withColumn("id hash", md5($"id")).show(false)...

Re: Compute the Hash of each row in new column

2020-02-28 Thread Enrico Minack
...sha, sha1, sha2 and hash: https://spark.apache.org/docs/2.4.5/api/sql/index.html Enrico On 28.02.20 at 13:56, Chetan Khatri wrote: Hi Spark Users, How can I compute the hash of each row and store it in a new column of a DataFrame? Could someone help me...

Re: Compute the Hash of each row in new column

2020-02-28 Thread Riccardo Ferrari
Hi Chetan, Would the sql function `hash` do the trick for your use-case? Best, On Fri, Feb 28, 2020 at 1:56 PM Chetan Khatri wrote: Hi Spark Users, How can I compute the hash of each row and store it in a new column of a DataFrame? Could someone help me. Thanks
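
A short sketch of Riccardo's suggestion, assuming an existing DataFrame df; functions.hash accepts any number of columns and returns a single 32-bit hash per row:

import org.apache.spark.sql.functions.{col, hash}

// One hash value per row, computed over all columns of df (df is assumed to exist).
val withRowHash = df.withColumn("row_hash", hash(df.columns.map(col): _*))
withRowHash.show(false)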

Compute the Hash of each row in new column

2020-02-28 Thread Chetan Khatri
Hi Spark Users, How can I compute the hash of each row and store it in a new column of a DataFrame? Could someone help me. Thanks

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Jerry Vinokurov
...slight variations (+- 15 seconds) will not cause issues. --gautham From: Patrick McCarthy [mailto:pmccar...@dstillery.com] Sent: Wednesday, July 17, 2019 12:39 PM...

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Patrick McCarthy
...retrieval, like ElasticSearch or ScyllaDB? Spark is a compute framework rather than a serving backend; I don't think it's designed with retrieval SLAs in mind and you may find those SLAs difficult to maintain. On Wed, Jul 17, 2019 at 3:14 PM Gautham Acharya wrote: Thanks for the reply, Bobby...

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
...easily perform this computation in Spark? --gautham

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Bobby Evans
...a m5.12xlarge EC2 instance), you could fit your problem in main memory and perform your computation with thread-based parallelism. This might get your result relatively quickly. For a dedicated application with well-constrained memory and compute requirements, it might not be a bad option...

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Steven Stetzler
. This might get your result relatively quickly. For a dedicated application with well constrained memory and compute requirements, it might not be a bad option to do everything on one machine as well. Accessing an external database and distributing work over a large number of computers can add overhead

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Gautham Acharya
Ping? I would really appreciate advice on this! Thank you! From: Gautham Acharya Sent: Tuesday, July 9, 2019 4:22 PM To: user@spark.apache.org Subject: [Beginner] Run compute on large matrices and return the result in seconds? This is my first email to this mailing list, so I apologize if I

[Beginner] Run compute on large matrices and return the result in seconds?

2019-07-09 Thread Gautham Acharya
This is my first email to this mailing list, so I apologize if I made any errors. My team is going to be building an application and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices. The requirements are as follows:...

Compute /Storage Calculation

2018-07-19 Thread Deepu Raj
Hi Team - Is there any good calculator/Excel sheet to estimate compute and storage requirements for new Spark jobs to be developed? Capacity planning based on job, data type, etc. Thanks, Deepu Raj

Re: Performance of Spark when the compute and storage are separated

2018-04-15 Thread Mark Hamstra
...performance can be quite acceptable. On Sun, Apr 15, 2018 at 1:46 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Thanks Mark, I guess this may be broadened to the concept of separating compute from storage. Your point on...

Re: Performance of Spark when the compute and storage are separated

2018-04-15 Thread Mich Talebzadeh
Thanks Mark, I guess this may be broadened to the concept of separating compute from storage. Your point on "... can kind of disappear after the data is first read from the storage layer" reminds me of performing logical IOs as opposed to physical IOs. But again, as you correctly p...

Re: Performance of Spark when the compute and storage are separated

2018-04-15 Thread vincent gromakowski
...being local as opposed to Spark running on compute nodes? Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread Mich Talebzadeh
Thanks Vincent. You mean 20 times improvement with data being local as opposed to Spark running on compute nodes? Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread vincent gromakowski
Not with Hadoop but with Cassandra; I have seen a 20x data locality improvement on partitioned, optimized Spark jobs. On Sat, Apr 14, 2018 at 21:17, Mich Talebzadeh wrote: Hi, This is a sort of your-mileage-varies type question. In a classic Hadoop cluster,...

Performance of Spark when the compute and storage are separated

2018-04-14 Thread Mich Talebzadeh
Hi, This is a sort of your mileage varies type question. In a classic Hadoop cluster, one has data locality when each node includes the Spark libraries and HDFS data. this helps certain queries like interactive BI. However running Spark over remote storage say Isilon scaled out NAS instead of

spark sql get result time larger than compute Duration

2018-03-11 Thread wkhapy_1
Get result: 1.67s <http://apache-spark-user-list.1001560.n3.nabble.com/file/t3966/wpLV3.png>, compute cost: 0.2s <http://apache-spark-user-list.1001560.n3.nabble.com/file/t3966/Kl0VG.png>. Below is the SQL: select event_date, dim, concat_ws('|', collect_list(result)) result from (sele...

[Structured Streaming] How to compute the difference between two rows of a streaming dataframe?

2017-09-29 Thread 张万新
Hi, I want to compute the difference between two rows in a streaming dataframe, is there a feasible API? May be some function like the window function *lag *in normal dataframe, but it seems that this function is unavailable in streaming dataframe. Thanks.

MLlib to Compute boundaries of a rectangle given random points on its Surface

2016-12-06 Thread Pradeep Gaddam
Hello, Can someone please let me know if it is possible to construct a surface (for example, a rectangle) given random points on its surface using Spark MLlib? Thanks, Pradeep Gaddam

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Sean Owen
...)); } } ); And *'model'* is a Random Forest model instance trained with a multiclass regression or classification dataset. The current implementation of Logistic Regression supports only binary classification. But Linear...

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Md. Rezaul Karim
...a Random Forest model instance trained with a multiclass regression or classification dataset. The current implementation of Logistic Regression supports only binary classification. But Linear Regression supports/works on datasets having multiple classes. I was wondering if it's possible to compute...

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Sean Owen
...} System.out.println("precision: " + (double) (count * 100) / predictions.count()); Now, I would like to compute other evaluation metrics like *Recall* and *F1-score* etc. How could I do that? Regards,...

How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Md. Rezaul Karim
...List()) { count++; } System.out.println("precision: " + (double) (count * 100) / predictions.count()); Now, I would like to compute other evaluation metrics like *Recall* and *F1-score* etc. How could I do that? Regards,...

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread Marco Mistroni
...layer like HDFS. So I don't know what the connection is between RDDs and blocks (I know that for every batch one RDD is produced). What is a block in this context? Is it a disk block? If so, what is its default size? And finally, why does the following erro...

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
...know what the connection is between RDDs and blocks (I know that for every batch one RDD is produced). What is a block in this context? Is it a disk block? If so, what is its default size? And finally, why does the following error happen so often? java.lang.Exception: Could not compute...

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
...? Is it a disk block? If so, what is its default size? And finally, why does the following error happen so often? java.lang.Exception: Could not compute split, block input-0-1480539568000 not found On Thu, Dec 1, 2016 at 12:42 AM, kant kodali <kanth...@gmail.com> wrote: I also use t...

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
Gson gson = new Gson(); stringIntegerJavaPairRDD.collect().forEach((Tuple2<String, Integer> KV) -> { String status = KV._1(); Integer count = KV._2(); map.put(status, count); }); NSQReceiver.send(producer, "output_777", gson.toJson(map).getBytes()); } }); Thanks, kant On Wed, Nov 30, 2016 at 2:11 PM, Marco Mistroni <mmistr...@gmail.com> wrote: Could you paste a reproducible code snippet? Kr On 30 Nov 2016 9:08 pm, "kant kodali" <kanth...@gmail.com> wrote: I have a lot of these exceptions happening: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found Any ideas what this could be?

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread kant kodali
...; map.put(status, count); }); NSQReceiver.send(producer, "output_777", gson.toJson(map).getBytes()); } }); Thanks, kant On Wed, Nov 30, 2016 at 2:11 PM, Marco Mistroni <mmistr...@gmail.com> wrote: Could you paste a reproducible code snippet? Kr On 30 Nov 2016 9:08 pm, "kant kodali" <kanth...@gmail.com> wrote: I have a lot of these exceptions happening: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found Any ideas what this could be?

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread kant kodali
...Bytes()); } }); Thanks, kant On Wed, Nov 30, 2016 at 2:11 PM, Marco Mistroni <mmistr...@gmail.com> wrote: Could you paste a reproducible code snippet? Kr On 30 Nov 2016 9:08 pm, "kant kodali" <kanth...@gmail.com> wrote: I have a lot of these exceptions happening: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found Any ideas what this could be?

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread Marco Mistroni
Could you paste a reproducible code snippet? Kr On 30 Nov 2016 9:08 pm, "kant kodali" <kanth...@gmail.com> wrote: I have a lot of these exceptions happening: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found Any ideas what this could be?

java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread kant kodali
I have lot of these exceptions happening java.lang.Exception: Could not compute split, block input-0-1480539568000 not found Any ideas what this could be?

How to compute a net (difference) given a bi-directional stream of numbers using spark streaming?

2016-08-24 Thread kant kodali
Hi Guys, I am new to spark but I am wondering how do I compute the difference given a bidirectional stream of numbers using spark streaming? To put it more concrete say Bank A is sending money to Bank B and Bank B is sending money to Bank A throughout the day such that at any given time we want

Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
...Is there an update on the JIRA ticket above, and can I use something to compute RowSimilarity in Spark 1.6.0 on my dataset? I will be thankful for any other ideas on this too. - Manoj [0] https://issues.apache.org/jira/browse/SPARK-4823 [1] https://github.com/apache/spark/pull/62...

Re: Compute pairwise distance

2016-07-07 Thread Manoj Awasthi
...M and columns in the tens (20 to be specific). DIMSUM doesn't help for obvious reasons (transposing the matrix is infeasible), as already discussed in the JIRA. Is there an update on the JIRA ticket above, and can I use something to compute RowSimilarity in Spark 1.6.0 on my dataset? I will be thankful for an...

Re: Multiple compute nodes in standalone mode

2016-06-23 Thread Ted Yu
...my computer can access the data I am working on. Right now, I run Spark on a single node, and it works beautifully. My question is, is it possible to run Spark using multiple compute nodes (in standalone mode, I don't have HDFS/Hadoop installed)? If so, what do I...

Multiple compute nodes in standalone mode

2016-06-23 Thread avendaon
Hi all, I have a cluster that has multiple nodes, and the data partition is unified, therefore all the nodes in my cluster can access the data I am working on. Right now, I run Spark on a single node, and it works beautifully. My question is, is it possible to run Spark using multiple compute...

Compute the global rank of the column

2016-05-31 Thread Dai, Kevin
Hi, All I want to compute the rank of some column in a table. Currently, I use the window function to do it. However all data will be in one partition. Is there better solution to do it? Regards, Kevin.

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
...If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien <kient...@gmail.com> wrote: Hi all, Is there...

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
...recent proposal for it, but it has gotten pushed back to at least 0.10.1. If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien <kient...@gmail.com> wrote:...

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
...wrote: Hi all, Is there any way to re-compute using the Spark Streaming - Kafka Direct Approach from a specific time? In some cases, I want to re-compute again from a specific time (e.g. beginning of day); is that possible? -- Thanks, Kien

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
...<kient...@gmail.com> wrote: Hi all, Is there any way to re-compute using the Spark Streaming - Kafka Direct Approach from a specific time? In some cases, I want to re-compute again from a specific time (e.g. beginning of day); is that possible?

Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Hi all, Is there any way to re-compute using Spark Streaming - Kafka Direct Approach from specific time? In some cases, I want to re-compute again from specific time (e.g beginning of day)? is that possible? -- Thanks Kien

Re: Compute

2016-04-27 Thread Karl Higley
...signatures by building (Signature, (Int, Vector)) tuples, grouping by signature, and then iterating pairwise over the resulting lists of points to compute the distances between them. The points still have to be shuffled over the network, but at least the shuffle doesn't cr...

Re: Compute

2016-04-27 Thread nguyen duc tuan
...(Signature, (Int, Vector)) tuples, grouping by signature, and then iterating pairwise over the resulting lists of points to compute the distances between them. The points still have to be shuffled over the network, but at least the shuffle doesn't create multiple copies of each p...

Re: Compute

2016-04-27 Thread Karl Higley
One idea is to avoid materializing the pairs of points before computing the distances between them. You could do that using the LSH signatures by building (Signature, (Int, Vector)) tuples, grouping by signature, and then iterating pairwise over the resulting lists of points to compute
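
A rough RDD sketch of the group-by-signature idea above, under the assumption of placeholder types; the signature scheme and the distance function are whatever the LSH implementation provides:

import org.apache.spark.rdd.RDD

// points: (signature, (vector id, vector)); distance: any metric over vectors.
def pairwiseWithinBuckets(
    points: RDD[(String, (Int, Array[Double]))],
    distance: (Array[Double], Array[Double]) => Double): RDD[((Int, Int), Double)] =
  points.groupByKey().flatMap { case (_, bucket) =>
    val pts = bucket.toArray
    // Iterate pairwise within a bucket only, so pairs are never materialized globally.
    for {
      i <- pts.indices.iterator
      j <- (i + 1) until pts.length
    } yield ((pts(i)._1, pts(j)._1), distance(pts(i)._2, pts(j)._2))
  }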

Compute

2016-04-27 Thread nguyen duc tuan
Hi all, Currently, I'm working on implementing LSH on spark. The problem leads to follow problem. I have an RDD[(Int, Int)] stores all pairs of ids of vectors need to compute distance and an other RDD[(Int, Vector)] stores all vectors with their ids. Can anyone suggest an efficiency way

Spark Streaming:Could not compute split

2016-02-02 Thread aafri
failure: Task 2 in stage 11032.0 failed 4 times, most recent failure: Lost task 2.3 in stage 11032.0 (TID 16137, hadoop225.localdomain): java.lang.Exception: Could not compute split, block input-0-1454324106400 not found at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51

[Spark Streaming] "Could not compute split, block input-0-1452563923800 not found” when trying to recover from checkpoint data

2016-01-13 Thread Collin Shi
Hi I was doing a simple updateByKey transformation and print on the data received from socket, and spark version is 1.4.0. The first submit went all right, but after I kill (CTRL + C) the job and submit again. Apparently spark was trying to recover from the checkpoint data , but then the

Re: Compute Real-time Visualizations using spark streaming

2015-10-11 Thread Akhil Das
...you can push the data to the web-socket which will power the dashboards. Thanks, Best Regards On Fri, Oct 2, 2015 at 5:17 PM, Sureshv <suresh.ku...@transerainc.com> wrote: Hi, I am new to Spark and I would like to know how to compute (dynamically) real-time visualizations usi...

Compute Real-time Visualizations using spark streaming

2015-10-02 Thread Sureshv
Hi, I am new to Spark and I would like know how to compute (dynamically) real-time visualizations using Spark streaming (Kafka). Use case : We have Real-time analytics dashboard (reports and dashboard), user can define report (visualization) with certain parameters like, refresh period, choose

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Great. So, provided that *model.theta* represents the log-probabilities (and hence the result of *brzPi + brzTheta * testData.toBreeze* is a big number too), how can I get back the *non*-log-probabilities which - apparently - are bounded between *0.0 and 1.0*? *// Adamantios* On Tue, Sep 1,...

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Sean Owen
...I can see the probabilities are NOT normalized; the denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator, I refer to the probability of feature X). So, for a given lambda, how do I compute the denominator? FYI: https://github.com/apache/spark/blob/v1.5.0/mllib/src/main/scala/o...

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Sean Owen
The log probabilities are unlikely to be very large, though the probabilities may be very small. The direct answer is to exponentiate brzPi + brzTheta * testData.toBreeze -- apply exp(x). I have forgotten whether the probabilities are normalized already though. If not you'll have to normalize to
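
A plain-Scala sketch of that conversion, assuming arrays rather than Breeze vectors; pi stands for the per-class log priors, theta for the per-class log conditional probabilities, and x for the feature vector:

// Turn per-class log-scores pi(k) + theta(k) . x into normalized probabilities in [0, 1].
def classProbabilities(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Array[Double] = {
  val logScores = pi.indices.map(k => pi(k) + theta(k).zip(x).map { case (t, xi) => t * xi }.sum)
  val maxLog = logScores.max                        // log-sum-exp trick for numerical stability
  val unnormalized = logScores.map(s => math.exp(s - maxLog))
  val total = unnormalized.sum
  unnormalized.map(_ / total).toArray
}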

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Thanks Sean. As far as I can see, the probabilities are NOT normalized; the denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator, I refer to the probability of feature X). So, for a given lambda, how do I compute the denominator? FYI: https://github.com/apache/spark/blob/v1.5.0/mllib/src...

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Sean Owen
(Pedantic: it's the log-probabilities.) On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang wrote: Actually, brzPi + brzTheta * testData.toBreeze is the probabilities of the input Vector on each class; however, it's a Breeze Vector. Pay attention that the index of this Vector needs...

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Yanbo Liang
Actually, brzPi + brzTheta * testData.toBreeze gives the probabilities of the input Vector on each class; however, it's a Breeze Vector. Pay attention that the index of this Vector needs to map to the corresponding label index. On 2015-08-28 at 20:38 GMT+08:00, Adamantios Corais wrote:...

How to compute the probability of each class in Naive Bayes

2015-08-28 Thread Adamantios Corais
Hi, I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark but I have to use Spark 1.1.0. Any help is

Porting a multit-hreaded compute intensive job to spark

2015-08-27 Thread Utkarsh Sengar
I am working on code which uses executor service to parallelize tasks (think machine learning computations done over small dataset over and over again). My goal is to execute some code as fast as possible, multiple times and store the result somewhere (total executions will be on the order of 100M

Re: Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-25 Thread Akhil Das
minute windows, we have issues with Could not compute split, block —— not found. This is being run on a YARN cluster and it seems like the executors are getting killed even though they should have plenty of memory. Also, it seems like no computation actually takes place until the end of the window

Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-19 Thread jlg
...(there are a lot of repeated keys across this time frame, and we want to combine them all -- we do this using reduceByKeyAndWindow). But even when trying to do 5-minute windows, we have issues with "Could not compute split, block —— not found". This is being run on a YARN cluster and it seems like...
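
For reference, a minimal sketch of the reduceByKeyAndWindow call described above, assuming an existing StreamingContext and a DStream of (key, count) pairs named pairs:

import org.apache.spark.streaming.Seconds

// Sum counts for each key over a 5-minute window, sliding every minute.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Seconds(300),
  Seconds(60))
windowedCounts.print()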

COMPUTE STATS on hive table - NoSuchTableException

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi I am trying to compute stats on a lookup table from spark which resides in hive. I am invoking spark API as follows. It gives me NoSuchTableException. Table is double verified and subsequent statement “sqlContext.sql(“select * from cpatext.lkup”)” picks up the table correctly. I am

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
...this fraction to execute arbitrary user code (default 0.2). I am not mentioning the storage and shuffle safety fractions for simplicity. My question is, which memory fraction is Spark using to compute and transform RDDs that are not going to be persisted? For example: lines = sc.textFile("i am a big...
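
For context, a sketch of the legacy (pre-1.6) fractions the question refers to, shown with their default values; this assumes the Spark 1.x static memory model and is not a setting the thread itself prescribes:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.6")   // cached / persisted RDD blocks
  .set("spark.shuffle.memoryFraction", "0.2")   // shuffle and aggregation buffers
// The remaining ~0.2 is left for arbitrary user objects; RDDs that are not persisted are
// streamed through tasks as iterators rather than stored in the storage fraction.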

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-21 Thread wdbaruni
...the storage and shuffle safety fractions for simplicity. My question is, which memory fraction is Spark using to compute and transform RDDs that are not going to be persisted? For example: lines = sc.textFile("i am a big file.txt") count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1...

Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Lokesh Kumar Padhnavis
...lokeshkumar <lok...@dataken.net> wrote: Hi forum, I have downloaded the latest Spark version 1.4.0 and started using it. But I couldn't find the compute-classpath.sh file in bin/, which I was using in previous versions to provide third-party libraries to my application. Can anyone please let me...

Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Marcelo Vanzin
...1.4.0 and started using it. But I couldn't find the compute-classpath.sh file in bin/, which I was using in previous versions to provide third-party libraries to my application. Can anyone please let me know where I can provide a CLASSPATH with my third-party libs in 1.4.0? Thanks, Lokesh

Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread lokeshkumar
Hi forum, I have downloaded the latest Spark version 1.4.0 and started using it. But I couldn't find the compute-classpath.sh file in bin/, which I was using in previous versions to provide third-party libraries to my application. Can anyone please let me know where I can provide a CLASSPATH with my...

Re: Compute Median in Spark Dataframe

2015-06-22 Thread Deenar Toraskar
...Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles and therefore a median. Does anyone have experience...

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Holden Karau
...o.girar...@lateral-thoughts.com wrote: Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles and therefore a median. Does...

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Deenar Toraskar
...("select percentile(key, 0.5) from table").show() On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within...

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
...) sqlContext.sql("select percentile(key, 0.5) from table").show() On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within...

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
...o.girar...@lateral-thoughts.com wrote: Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles and therefore a median. Does...

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
...Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles...

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
...to compute a median on a column using Spark's DataFrame. I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles and therefore a median. Does anyone have experience with this? Regards, Olivier.

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
...= sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("table") sqlContext.sql("select percentile(key, 0.5) from table").show() On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: Hi everyone, Is there any way to compute a median...
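
A cleaned-up, self-contained version of the snippet above, assuming Spark 1.x with a HiveContext (percentile is a Hive UDAF); the temp table is renamed to kv here to avoid using a keyword as the table name:

import org.apache.spark.sql.hive.HiveContext

case class KeyValue(key: Int, value: String)

val sqlContext = new HiveContext(sc)   // sc: an existing SparkContext
import sqlContext.implicits._

val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF()
df.registerTempTable("kv")

// percentile(col, 0.5) returns the median of the column.
sqlContext.sql("select percentile(key, 0.5) from kv").show()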

Re: Compute pairwise distance

2015-04-30 Thread Driesprong, Fokko
...further improvements: 1. Create an RDD of your dataset. 2. Do a cross join to generate pairs. 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best, Ayan On 30 Apr 2015 06:11, Driesprong, Fokko <fo...@driesprong.frl> wrote: Dear Sparkers, I am working...

Compute pairwise distance

2015-04-29 Thread Driesprong, Fokko
...approach this? I first thought about broadcasting the original points to all the workers, and then computing the distances across the different workers, although this requires all the points to be distributed across all the machines. But this feels rather brute-force; what do you guys think? I don't...

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought; please suggest any further improvements: 1. Create an RDD of your dataset. 2. Do a cross join to generate pairs. 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best, Ayan On 30 Apr 2015 06:11, Driesprong, Fokko <fo...@driesprong.frl>...
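
A compact sketch of steps 1-3 above using RDD.cartesian for the cross join; the point RDD and the distance function are assumed inputs:

import org.apache.spark.rdd.RDD

def pairwiseDistances(
    points: RDD[(Long, Array[Double])],
    distance: (Array[Double], Array[Double]) => Double): RDD[((Long, Long), Double)] =
  points.cartesian(points)
    .filter { case ((i, _), (j, _)) => i < j }                // keep each unordered pair once
    .map { case ((i, a), (j, b)) => ((i, j), distance(a, b)) }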

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
...wrote: This is my first thought; please suggest any further improvements: 1. Create an RDD of your dataset. 2. Do a cross join to generate pairs. 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best, Ayan On 30 Apr 2015 06:11, Driesprong, Fokko fo...

Re: Spark Streaming: JavaDStream compute method NPE

2015-04-28 Thread Himanshu Mehra
Hi Puneith, Please provide the code if you can; it will be helpful. Thank you,

Re: executor failed, cannot find compute-classpath.sh

2015-04-16 Thread TimMalt
as driver-20150416112240-0007 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150416112240-0007 is ERROR Exception from cluster was: org.apache.spark.SparkException: Process List(/usr/local/spark-1.3.0-bin-hadoop2.4/bin/compute-classpath.sh) exited

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-13 Thread Saiph Kappa
size 3.0 KB, free 267.2 MB) 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID 140) java.lang.Exception: Could not compute split, block input-0-1427473262420 not found at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Saiph Kappa
(estimated size 3.0 KB, free 267.2 MB) 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID 140) java.lang.Exception: Could not compute split, block input-0-1427473262420 not found at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Tathagata Das
(3104) called with curMem=49003, maxMem=280248975 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 3.0 KB, free 267.2 MB) 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID 140) java.lang.Exception: Could not compute split

Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Saiph Kappa
:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID 140) java.lang.Exception: Could not compute split, block input-0-1427473262420 not found at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Tathagata Das
15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 3.0 KB, free 267.2 MB) 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID 140) java.lang.Exception: Could not compute split, block input-0-1427473262420 not found

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Mukesh Jha
scheduler.TaskSetManager: Starting task 38.1 in stage 451.0 (TID 22520, chsnmphbase23.usdc2.cloud.com, RACK_LOCAL, 1288 bytes) 15/02/25 05:32:43 WARN scheduler.TaskSetManager: Lost task 32.1 in stage 451.0 (TID 22511, chsnmphbase19.usdc2.cloud.com): java.lang.Exception: Could not compute split

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Mukesh Jha
scheduler.TaskSetManager: Lost task 32.1 in stage 451.0 (TID 22511, chsnmphbase19.usdc2.cloud.com): java.lang.Exception: Could not compute split, block input-3-1424842351600 not found at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51
