Hi Tarun,
You can use Ganglia for monitoring the entire cluster, and if you want some
more custom functionality, like sending emails etc., then you can go with
Nagios.
Thanks
Best Regards
On Tue, Oct 14, 2014 at 3:31 AM, Tarun Garg bigdat...@live.com wrote:
Hi All,
I am doing a POC and
Just make sure you have the same version of spark-streaming-kafka
http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.10
jar and Spark in your classpath.
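A minimal build.sbt sketch of keeping those versions aligned (the 1.1.0 version
string is illustrative and should match the Spark you actually run against):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.1.0"
)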
Thanks
Best Regards
On Tue, Oct 14, 2014 at 9:02 AM, Gary Zhao garyz...@gmail.com wrote:
Hello
I'm trying to
Hi,
Sorry to insist, but I really feel like the problem described below is a bug in
Spark.
Can anybody confirm if it is a bug, or a (configuration?) problem on my side?
Thanks,
Christophe.
On 10/10/2014 18:24, Christophe Préaud wrote:
Hi,
After updating from spark-1.0.0 to spark-1.1.0, my
Thank you, Michael.
In Spark SQL DataType, we have a lot of types, for example, ByteType,
ShortType, StringType, etc.
These types are used to form the table schema. As for me, StringType is
enough; why do we need the others?
Hao
Update:
This syntax is mainly for avoiding retyping column names.
Let's take the example in my previous post, where *a* is a table of 15
columns and *b* has 5 columns; after a join, I have a table of (15 + 5 - 1 (key
in b)) = 19 columns and register the table in sqlContext.
I don't want to actually
Hi,
Are you sure that the id/key that you used can access S3? You can try to
use the same id/key through the Python boto package to test it.
I ask because I have almost the same situation as yours, but I can access S3.
Best
Hi,
I met the same problem before, and according to Matei Zaharia:
The issue is that you're using SQLContext instead of HiveContext.
SQLContext implements a smaller subset of the SQL language and so you're
getting a SQL parse error because it doesn't support the syntax you have.
Look at how
Hi Everybody,
I'm looking for information on possible Thrift JDBC/ODBC clients and Thrift
JDBC/ODBC servers. For example, how to connect Tableau or other BI/analytics
tools to Spark using ODBC/JDBC servers as mentioned on the Spark SQL page.
Thanks and Regards,
Neeraj Garg
Thank you, Gen.
I will give hiveContext a try. =)
Try the following:
1. Set the access key and secret key in the sparkContext:
sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey)
sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)
2. Set the access key and secret key in the environment before starting
your application:
export
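Another route you sometimes see (a sketch, not part of the original reply; the
key values are placeholders) is to set the credentials on the SparkContext's
Hadoop configuration using the standard s3n property names:

// sc is an existing SparkContext
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", yourAccessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", yourSecretKey)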
Hi,
keep in mind that you're going to have a bad time if your secret key
contains a /
This is due to an old and stupid Hadoop bug:
https://issues.apache.org/jira/browse/HADOOP-3733
Best way is to regenerate the key so it does not include a /
/Raf
Akhil Das wrote:
Try the following:
1. Set the
I created https://issues.apache.org/jira/browse/SPARK-3947
On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust mich...@databricks.com
wrote:
It's not on the roadmap for 1.2. I'd suggest opening a JIRA.
On Mon, Oct 13, 2014 at 4:28 AM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
Hi All,
I've downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop 2.4.
Now, I want to test two features of Spark:
1. YARN deployment : As per my understanding, I need to modify
spark-defaults.conf file with the settings mentioned at URL
Hi Spark users/experts,
In the Spark source code (Master.scala, Worker.scala), when registering the
worker with the master, I see the usage of *persistenceEngine*. When we don't
specify spark.deploy.recoveryMode explicitly, what is the default value
used? This recovery mode is used to persist and
It seems that the second problem is a dependency issue,
and it works exactly like the first one.
*This is the complete code:*
JavaSchemaRDD schemas = ctx.jsonRDD(arg0);
schemas.insertInto("test", true);
JavaSchemaRDD teenagers = ctx.hql("SELECT a, b FROM test");
List<String>
Hello,
I have already posted a message with the exact same problem, and proposed a
patch (the subject is "Application failure in yarn-cluster mode").
Can you test it, and see if it works for you?
I would be glad too if someone can confirm that it is a bug in Spark 1.1.0.
Regards,
Christophe.
On
Sorry, a mistake by me.
The above code generates a result exactly like the one seen from Hive.
Now my question is: can a Hive table be used with the insertInto function?
Why do I keep getting 111,NULL instead of 111,222?
Hi all,
I have two nodes, one as master (*host1*) and the other as
worker (*host2*). I am using the standalone mode.
After starting the master on host1, I run
$ export MASTER=spark://host1:7077
$ bin/run-example SparkPi 10
on host2, but I get this:
14/10/14 21:54:23 WARN TaskSchedulerImpl:
I'm trying to get some workaround for this issue.
Thanks and Regards,
Neeraj Garg
From: H4ml3t [via Apache Spark User List]
[mailto:ml-node+s1001560n16379...@n3.nabble.com]
Sent: Tuesday, October 14, 2014 6:53 PM
To: Neeraj Garg02
Subject: Re: Running Spark/YARN on AWS EMR - Issues finding file
Denny Lee wrote an awesome article on how to connect Tableau to Spark
SQL recently:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
On 10/14/14 6:10 PM, Neeraj Garg02 wrote:
Hi Everybody,
I’m looking for information on possible Thrift JDBC/ODBC clients and
Thrift JDBC/ODBC
On 10/14/14 7:31 PM, Neeraj Garg02 wrote:
Hi All,
I’ve downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop
2.4.
Now, I want to test two features of Spark:
1. *YARN deployment*: As per my understanding, I need to modify the
“spark-defaults.conf” file with the settings mentioned
Thanks Michael. We are running 1.1 and I believe that is the latest release?
I am getting this error when I tried doing what you suggested:
org.apache.spark.sql.parquet.ParquetTypesConverter$ (ParquetTypes.scala)
java.lang.RuntimeException: [2.1] failure: ``UNCACHE'' expected but
identifier CREATE
I realized my mistake of not using hiveContext. So that error is gone but now
I am getting this error:
==
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException:
Unknown field info: binary
Hi Michael,
That worked for me. At least I’m now further than I was. Thanks for the tip!
-Terry
From: Michael Armbrust mich...@databricks.commailto:mich...@databricks.com
Date: Monday, October 13, 2014 at 5:05 PM
To: Terry Siu terry@smartfocus.commailto:terry@smartfocus.com
Cc:
I'm following the MLlib example for TF-IDF and ran into a problem due to my
lack of knowledge of Scala and Spark. Any help would be greatly
appreciated.
Following the MLlib example I could do something like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import
Hi,
In my application, I am successfully using foreachPartition to write large
amounts of data into a Cassandra database.
What is the recommended practice if the application wants to know that the
tasks have completed for all partitions?
Thanks,
Salman
Thanks for the pointers.
I verified that the access key-id/secret used are valid. However, the
secret may contain / at times. The issues I am facing are as follows:
- The EC2 instances are set up with an IAM role () and don't have a static
key-id/secret
- All of the EC2 instances have
Are you using spark streaming?
On Oct 14, 2014, at 10:35 AM, Salman Haq sal...@revmetrix.com wrote:
Hi,
In my application, I am successfully using foreachPartition to write large
amounts of data into a Cassandra database.
What is the recommended practice if the application wants to know
Thanks for your response. It is not about infrastructure, because I am using EC2
machines and Amazon CloudWatch can provide EC2 nodes' CPU usage and memory usage
details, but I need to send notifications in situations like processing delay,
total delay, maximum rate being low, etc.
Tarun
Date: Tue, 14
So the only way that I could make this work was to build a fat jar file as
suggested earlier. To me (and I am no expert) it seems like this is a
bug. Everything was working for me prior to our upgrade to Spark 1.1 on
Hadoop 2.2 but now it seems to not... i.e. packaging my jars locally then
It's much more efficient to store and compute on numeric types than string
types.
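A small illustration (a sketch; in Spark 1.1 these types are reachable via
org.apache.spark.sql._): declaring a column as a narrow numeric type instead of
a string lets the engine store and compare it more cheaply.

import org.apache.spark.sql._

// explicit schema using a numeric type where a string would be wasteful
val schema = StructType(Seq(
  StructField("age", ByteType, nullable = true),
  StructField("name", StringType, nullable = true)))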
On Tue, Oct 14, 2014 at 1:25 AM, invkrh inv...@gmail.com wrote:
Thank you, Michael.
In Spark SQL DataType, we have a lot of types, for example, ByteType,
ShortType, StringType, etc.
These types are used to
Hi, I have a Spark 1.1.0 YARN installation. I am using spark-submit to run a
simple application.
From the console output, I have 769 partitions, and after task 768 in stage 0
(count) finished,
it hangs. I used jstack to dump the stack and it shows it is waiting ...
Any suggestion what might go
Hi,
If I remember well, Spark cannot use the IAM role credentials to access
S3. It first uses the id/key in the environment. If that is null in the
environment, it uses the value in the file core-site.xml. So, an IAM role is not
useful for Spark. The same problem happens if you want to use distcp
On Tue, Oct 14, 2014 at 12:42 PM, Sean McNamara sean.mcnam...@webtrends.com
wrote:
Are you using spark streaming?
No, not at this time.
Thanks for the input.
Yes, I did use the temporary access credentials provided by the IAM role
(also detailed in the link you provided). The session token needs to be
specified and I was looking for a way to set that in the header (which
doesn't seem possible).
Looks like a static key/secret is
Yes, for that you can tweak Nagios a bit or you can write a custom
monitoring application which will check the processing delay, etc.
Thanks
Best Regards
On Tue, Oct 14, 2014 at 10:16 PM, Tarun Garg bigdat...@live.com wrote:
Thanks for your response, it is not about infrastructure because I am
I am planning to write a custom monitoring application; for that I am analysing
org.apache.spark.streaming.scheduler.StreamingListener. Is there another Spark
Streaming API which can give me insight into the cluster, like total
processing time, delay, etc.?
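For reference, the hook I am looking at is roughly the following (a sketch; the
delays come back as Option[Long] in milliseconds):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // the delays are None until the batch has actually been processed
    println(s"processing delay: ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"total delay: ${info.totalDelay.getOrElse(-1L)} ms")
  }
}
// ssc.addStreamingListener(new DelayListener())   // ssc is the StreamingContext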
Tarun
Date: Tue, 14 Oct 2014 23:19:47 +0530
Thanks Abraham Jacob, Tobias Pfeiffer, Akhil Das-2, and Sean Owen for your
helpful comments. Cockpit error on my part in just putting the .scala file
as an argument rather than redirecting stdin from it.
One related question. Could I specify the
com.amazonaws.services.s3.AmazonS3Client implementation for the
fs.s3.impl parameter? Let me try that and update this thread with my
findings.
On Tue, Oct 14, 2014 at 10:48 AM, Ranga sra...@gmail.com wrote:
Thanks for the input.
Yes, I did use the
Hi All,
I have a Graph with millions of edges. Edges are represented by
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]]
= MappedRDD[4]. I have two questions:
1) How can I fetch all the nodes with a given edge label (all edges with a
given property)?
2) Is it possible to create
I have many tables of the same schema; they are partitioned by time. For example,
one id could be in many of those tables. I would like to find the aggregation of
such ids. Originally these tables are located on HDFS as files. Once a table's
schemaRDD is loaded, I cacheTable on them. Each table is around 30m -
Have you solved this issue? I'm also wondering how to stream an existing file.
It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the
paths you add, the way other tools do. For some reason it doesn't
pick up the classpath information from yarn-site.xml either, it seems, when
running on YARN. I'm having to manually add every single
Hi again. Just want to check in again to see if anyone could advise on how
to implement a mutable, growing graph with GraphX?
We're building a graph that is growing over time. It adds more vertices and
edges every iteration of our algorithm.
It doesn't look like there is an obvious way to add a
Hi,
I am trying to implement simple sentiment analysis of Twitter streams in
Spark/Scala. I am getting an exception and it appears when I combine
SparkContext with StreamingContext in the same program. When I read the
positive and negative words using only SparkContext.textFile (without
creating
After creating a CoordinateMatrix from my RDD[MatrixEntry]...
1. How can I get/query the value at coordinate (i, j)?
2. How can I set/update the value at coordinate (i, j)?
3. How can I get all the values on a specific row i, ideally as a vector?
thanks!
Hi guys,
I am new to Spark. When I run Spark KMeans
(org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works
great. However, when using a large dataset with 1.5 million vectors, it just
hangs there at some reduceByKey/collectAsMap stages (the attached image shows
the corresponding UI).
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote:
hi again. just want to check in again to see if anyone could advise on how
to implement a mutable, growing graph with graphx?
we're building a graph that is growing over time. it adds more vertices and
edges every iteration of
Hi,
I'm learning about Apache Spark Streaming and I'm doing some tests.
Now,
I have a modified version of the app NetworkWordCount that performs a
reduceByKeyAndWindow with a window of 10 seconds in intervals of 5 seconds.
I'm also using the function to measure the rate of records/second like
Hi guys,
An interesting thing: for the input dataset which has 1.5 million vectors,
if I set KMeans's k_value = 100 or k_value = 50, it hangs as mentioned
above. However, if I decrease k_value to 10, the same error still appears in
the log but the application finishes successfully, without
Thanks Ankur. IndexedRDD sounds super helpful!
a related question, what is the best way to update the values of existing
vertices and edges?
On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave ankurd...@gmail.com wrote:
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote:
hi again.
You cannot recover the document from the TF-IDF vector, because
HashingTF is not reversible. You can assign each document a unique ID,
and join back the result after training. HashingTF can transform an
individual record:
val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val
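A sketch of the keep-an-ID-and-join-back idea (the names below are illustrative,
not from the original reply):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// keep the document id next to its term sequence so the TF vector stays joinable
val docs: RDD[(String, Seq[String])] =
  sc.parallelize(Seq(("doc-1", Seq("spark", "mllib")), ("doc-2", Seq("tf", "idf"))))
val tf = new HashingTF()
val tfVectors: RDD[(String, Vector)] = docs.mapValues(terms => tf.transform(terms))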
Hello,
CoordinateMatrix is in its infancy, and right now is only a placeholder.
To get/set the value at (i,j), you should map the entries RDD using the
usual RDD map operation, and change the relevant entries.
To get the values on a specific row, you can call toIndexedRowMatrix(),
which returns
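Roughly (a sketch; mat is an existing CoordinateMatrix and i, j are Long indices):

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// 1) read the value at (i, j); an empty result means the entry is implicitly zero
val valueAt = mat.entries.filter(e => e.i == i && e.j == j).map(_.value).collect()

// 2) "update" (i, j) by mapping the entries into a new matrix
val updated = new CoordinateMatrix(
  mat.entries.map(e => if (e.i == i && e.j == j) MatrixEntry(i, j, 42.0) else e))

// 3) get row i as a vector via toIndexedRowMatrix()
val rowI = mat.toIndexedRowMatrix().rows.filter(_.index == i).map(_.vector).collect()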
What is the feature dimension? I saw you used 100 partitions. How many
cores does your cluster have? -Xiangrui
On Tue, Oct 14, 2014 at 1:51 PM, Ray ray-w...@outlook.com wrote:
Hi guys,
An interesting thing, for the input dataset which has 1.5 million vectors,
if set the KMeans's k_value = 100
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote:
a related question, what is the best way to update the values of existing
vertices and edges?
Many of the Graph methods deal with updating the existing values in bulk,
including mapVertices, mapEdges, mapTriplets,
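For instance (a sketch, assuming a graph with String vertex attributes):

import org.apache.spark.graphx.Graph

// graph: Graph[String, String] built elsewhere
// bulk-transform every vertex attribute; mapEdges / mapTriplets do the same for edges
val upperCased = graph.mapVertices { (id, attr: String) => attr.toUpperCase }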
great, thanks!
On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave ankurd...@gmail.com wrote:
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com
wrote:
a related question, what is the best way to update the values of existing
vertices and edges?
Many of the Graph methods deal
thanks reza!
On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh r...@databricks.com wrote:
Hello,
CoordinateMatrix is in its infancy, and right now is only a placeholder.
To get/set the value at (i,j), you should map the entries rdd using the
usual rdd map operation, and change the relevant
Hi,
I'm a rookie in Spark, but I hope someone can help me out. I'm writing an app
that I'm submitting to my spark-master that has a worker on a separate
node. It uses spark-cassandra-connector, and since it depends on guava-v16
and it conflicts with the default spark-1.1.0-assembly's guava-v14.1 I
As per EMR documentation:
http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
Access AWS Resources Using IAM Roles
If you've launched your cluster with an IAM role, applications running on the
EC2 instances of that cluster can use the IAM role to obtain
Thanks Rishi. That is exactly what I am trying to do now :)
On Tue, Oct 14, 2014 at 2:41 PM, Rishi Pidva rpi...@pivotal.io wrote:
As per EMR documentation:
http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
Access AWS Resources Using IAM Roles
If
Hi Xiangrui,
The input dataset has 1.5 million sparse vectors. Each sparse vector has a
dimension (cardinality) of 9153 and has fewer than 15 nonzero elements.
Yes, if I set num-executors = 200, from the Hadoop cluster scheduler, I can
see the application got 201 vCores. From the Spark UI, I can
Hey all, I'm trying a very basic Spark SQL job, and apologies as I'm new to
a lot of this, but I'm getting this failure:
Exception in thread main java.lang.NoSuchMethodError:
org.apache.spark.sql.SchemaRDD.take(I)[Lorg/apache/spark/sql/catalyst/expressions/Row;
I've tried a variety of uber-jar
Hi folks,
When I am predicting binary 1/0 responses with LogisticRegressionWithSGD, it
returns a LogisticRegressionModel. In Spark 1.0.X I was using the
clearThreshold method on the model to get the raw predicted probabilities
when I ran the predict() method...
It appears now that rather than
Hi Ray,
The reduceByKey / collectAsMap does a lot of calculations. Therefore it can
take a very long time if:
1) The parameter number of runs is set very high
2) k is set high (you have observed this already)
3) data is not properly repartitioned
It seems that it is hanging, but there is a lot
Wow...I just tried LogisticRegressionWithLBFGS, and using clearThreshold()
DOES IN FACT work. It appears that the LogisticRegressionWithSGD returns a
model whose method is broken!!
On Tue, Oct 14, 2014 at 3:14 PM, Aris arisofala...@gmail.com wrote:
Hi folks,
When I am predicting Binary 1/0
Hi,
Is there a rule engine based on Spark? I'd like to allow the business user to
define their rules in a language, and the execution of the rules should be
done in Spark.
Thanks,
Ali
LBFGS is better. If your data is easily separable, LR might return
values very close or equal to either 0.0 or 1.0. It is rare but it may
happen. -Xiangrui
On Tue, Oct 14, 2014 at 3:18 PM, Aris arisofala...@gmail.com wrote:
Wow...I just tried LogisticRegressionWithLBFGS, and using
Yes.. I solved this problem using fileStream instead of textFile.
StreamingContext scc = new StreamingContext(conf, new Duration(1));
ClassTag<LongWritable> k = ClassTag$.MODULE$.apply(LongWritable.class);
ClassTag<Text> v = ClassTag$.MODULE$.apply(Text.class);
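The rough shape in Scala (a sketch, not my exact Java code; newFilesOnly = false
is what makes files that already exist in the directory get picked up):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// directory path is illustrative
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///path/to/existing/dir",
    (path: Path) => true,        // accept every file
    newFilesOnly = false
  ).map { case (_, text) => text.toString }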
Hi,
With respect to the first issue, one possible way is to filter the graph
via 'graph.subgraph(epred = e => e.attr == edgeLabel)', but I am still
curious if we can index RDDs.
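Spelled out a bit more (a sketch; edgeLabel is whatever String label you are after):

// keep only the edges carrying the label, then collect the distinct endpoint ids
val labelled = graph.subgraph(epred = triplet => triplet.attr == edgeLabel)
val nodeIds = labelled.edges.flatMap(e => Seq(e.srcId, e.dstId)).distinct()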
Warm Regards
Soumitra
On Tue, Oct 14, 2014 at 2:46 PM, Soumitra Johri
soumitra.siddha...@gmail.com wrote:
Hi
Looks like it may be related to
https://issues.apache.org/jira/browse/SPARK-3807.
I will build from branch 1.1 to see if the issue is resolved.
Chen
On Tue, Oct 14, 2014 at 10:33 AM, Chen Song chen.song...@gmail.com wrote:
Sorry for bringing this up again, as I have no clue what could have
Hi Burak,
In KMeans, I used k_value = 100, num_iteration = 2, and num_run = 1.
In the current test, I increased num-executors to 200. In the storage info 2
(as shown below), 11 executors are used (I think the data is kind of
balanced) and the others have zero memory usage.
We are developing something similar on top of Streaming. Could you detail
some of the rule functionality you are looking for?
We are developing a DSL for data processing on top of streaming as well as
static data, enabled on Spark.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
Hi All,
I am trying to understand what is going on in my simple WordCount Spark
Streaming application. Here is the setup -
I have a Kafka producer that is streaming words (lines of text). On the
flip side, I have a Spark Streaming application that uses the high-level
Kafka/Spark connector to
I saw a similar bottleneck in the reduceByKey operation. Maybe we can
implement treeReduceByKey to reduce the pressure on the single executor
reducing that particular key.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn:
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50.
It worked fine. Ray, the error log you posted is after cluster
termination, which is not the root cause. Could you search your log
and find the real cause? On the executor tab screenshot, I saw only
200MB is used. Did you cache
Yeah... it totally should be... There is nothing fancy in there -
import org.apache.spark.api.java.function.Function2;
public class ReduceWords implements Function2<Integer, Integer, Integer> {
private static final long serialVersionUID = -6076139388549335886L;
public Integer call(Integer
I am running a couple of functions on an RDD which require access to data
on the file system known to the context. If I create a class with a context
as a member variable, I get a serialization error.
So I am running my function on some slave and I want to read in data from a
Path defined by a
The results are no different -
import org.apache.spark.api.java.function.Function2;
import java.io.Serializable;
public class ReduceWords implements Serializable,
Function2<Integer, Integer, Integer> {
private static final long serialVersionUID = -6076139388549335886L;
public Integer
Hi,
I found that writing output back to S3 using rdd.saveAsTextFile() is extremely
slow, much slower than reading from S3. Is there a way to make it faster?
The RDD has 150 partitions, so parallelism is enough, I assume.
Thanks a lot!
Anny
Hi,
I have an RDD containing vehicle number, timestamp, position.
I want to get the equivalent of a lag function over my RDD, to be able to create
the track segment of each vehicle.
Any help?
PS: I have tried reduceByKey and then splitting the list of positions into
tuples. For me it runs out of memory
ahhh never mind… I didn’t notice that a spark-assembly jar file gets produced
after compiling the whole spark suite… So no more manual editing of the jar
file of the AMI for now!
Christos
On Oct 10, 2014, at 12:15 AM, Christos Kozanitis wrote:
Hi
I have written a few extensions for
Hi Xiangrui,
Thanks for the guidance. I read the log carefully and found the root cause.
KMeans, by default, uses k-means|| (a parallel variant of k-means++) as the
initialization mode. According to the log file, the 70-minute hanging is
actually the computing time of the k-means|| initialization, as pasted below:
14/10/14 14:48:18 INFO
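Side note: if the k-means|| initialization dominates the runtime, one knob to
try (a sketch; parameter values are illustrative, not a verified fix) is random
initialization with a single run:

import org.apache.spark.mllib.clustering.KMeans

// data: RDD[Vector] holding the 1.5M sparse vectors
val model = new KMeans()
  .setK(100)
  .setMaxIterations(2)
  .setRuns(1)
  .setInitializationMode(KMeans.RANDOM)   // skip the k-means|| initialization pass
  .run(data.cache())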
My code is as follows:
*documents.flatMap { case words => words.map(w => (w, 1)) }.reduceByKey(_ +
_).collect()*
In the driver's log, reduceByKey() is finished, but collect() seems to be
always running; it just can't finish.
In addition, there are about 200,000,000 words that need to be collected. Is
it
Hello Terry,
How many columns does pqt_rdt_snappy have?
Thanks,
Yin
On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu terry@smartfocus.com
wrote:
Hi Michael,
That worked for me. At least I’m now further than I was. Thanks for the
tip!
-Terry
From: Michael Armbrust
Hi Randy,
collect essentially transfers all the data to the driver node. You definitely
wouldn’t want to collect 200 million words. It is a pretty large number and you
can run out of memory on your driver with that much data.
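Typical alternatives (a sketch; wordCounts stands in for the reduced RDD):

// look at a small sample on the driver instead of collecting everything
wordCounts.take(20).foreach(println)

// or write the full result out in parallel; the path is illustrative
wordCounts.saveAsTextFile("hdfs:///tmp/word_counts")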
--
Reynold Xin
On October 14, 2014 at 9:26:13 PM, randylu
Hi all, how can I tell if my Kryo serializer is actually working? I have a
class which extends Serializable and I have included the following imports:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
I also have this class included:
MyRegistrator extends
Hi,
I am still new to Spark. Sorry if similar questions have been asked here before.
I am trying to read a Hive table; then run a query and save the result into
a Hive partitioned Parquet table.
For example, I was able to run the following in Hive:
INSERT INTO TABLE target_table PARTITION
I'm stumped with how to take 1 RDD that has lines like:
4,01012009,00:00,1289,4
5,01012009,00:00,1326,4
6,01012009,00:00,1497,7
and produce a new RDD with just the 4th field from each line (1289, 1326,
1497)
I don't want to apply a conditional, I just want to grab that one field from
each
Hi,
You can merge them into one table by:
sqlContext.table("table_1").unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")
Or load them in one call by:
Thanks rxin,
I still have a doubt about collect().
The number of words before reduceByKey() is about 200 million, and after
reduceByKey() it decreases to 18 million.
Memory for the driver is initialized to 15GB; then I print out
runtime.freeMemory() before reduceByKey(), and it indicates 13GB of free memory.
I
Hello all. Does anyone else have any suggestions? Even understanding what
this error is from would help a lot.
On Oct 11, 2014 12:56 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi Akhil - I tried your suggestions and tried varying my partition sizes.
Reducing the number of partitions led to
If memory were not enough, an OutOfMemory exception should occur, but there is
nothing in the driver's log.
rdd.map(lambda line: int(line.split(',')[3]))
On Tue, Oct 14, 2014 at 6:58 PM, Chop thomrog...@att.net wrote:
I'm stumped with how to take 1 RDD that has lines like:
4,01012009,00:00,1289,4
5,01012009,00:00,1326,4
6,01012009,00:00,1497,7
and produce a new RDD with just the 4th field
I used k-means||, which is the default. And it took less than 1
minute to finish. 50 iterations took less than 25 minutes on a cluster
of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it
yarn-client? -Xiangrui
On Tue, Oct 14, 2014 at 6:03 PM, Ray ray-w...@outlook.com wrote:
Hi
Perhaps it's just me, but the lag function isn't familiar to me...
But have you tried configuring Spark appropriately?
http://spark.apache.org/docs/latest/configuration.html
On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com
wrote:
Hi,
I have an RDD containing Vehicle Number