Hi,
Thank you for this confirmation.
Coalescing is what we do now. It creates, however, very big partitions.
Guillaume
Hey,
I am not 100% sure, but from my understanding accumulators are per partition
(so per task, as it's the same) and are sent back to the driver with the task
result and merged.
.setMaster("local"): set it to local[2] or local[*], so the receiver does not take the only available thread.
Thanks
Best Regards
On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski bar...@scalaric.com
wrote:
hi,
I'm trying to run simple kafka spark streaming example over spark-shell:
sc.stop
import org.apache.spark.SparkConf
import
I am wondering how the direct stream API ensures end-to-end exactly-once semantics.
I think there are two things involved:
1. From the Spark Streaming end, the driver will replay the offset range when
it's down and restarted, which means that the new tasks will process some
already-processed data.
2.
I was thinking exactly the same. I'm going to try it. It doesn't really
matter if I lose an executor: its sketch will be lost, but the work will then be
re-executed somewhere else.
And anyway, it's an approximate data structure, and what matters are
ratios, not exact values.
I mostly need to take care
Hello,
In the context of a machine learning algorithm, I need to be able to
randomly distribute the elements of a large RDD across partitions (i.e.,
essentially assign each element to a random partition). How could I achieve
this? I have tried to call repartition() with the current number of
How about generating the key using some one-way hashing like MD5?
On Thu, Jun 18, 2015 at 9:59 PM, Guillaume Pitel guillaume.pi...@exensa.com
wrote:
I think you can randomly reshuffle your elements just by emitting a random
key (mapping a PairRDD's key triggers a reshuffle, IIRC):
yourrdd.map{ x => (rand(), x)}
Hey,
I am not 100% sure, but from my understanding accumulators are per partition
(so per task, as it's the same) and are sent back to the driver with the task
result and merged. When a task needs to be run n times (multiple RDDs
depend on this one, some partition loss later in the chain, etc.) then
Thank you, Sandy! I'll investigate use of the extraClassPath variable. Both
options are helpful.
Thanks,
Matt
On Jun 17, 2015, at 8:01 PM, Sandy Ryza
sandy.r...@cloudera.com wrote:
Hi Matt,
If you place your jars on HDFS in a public location, YARN will cache
I believe it is available here:
https://cloud.google.com/hadoop/google-cloud-storage-connector
2015-06-18 15:31 GMT+02:00 Klaus Schaefers klaus.schaef...@ligatus.com:
Hi,
is there a kind adapter to use GoogleCloudStorage with Spark?
Cheers,
Klaus
--
--
Klaus Schaefers
Senior
BTW, I suggest this instead of using thread locals, as I am not sure in which
situations Spark will reuse them or not. For example, if an error happens
inside a thread, will Spark then create a new one, or is the error caught
inside the thread, preventing it from stopping? So in short, does Spark guarantee
I think you can randomly reshuffle your elements just by emitting a
random key (mapping a PairRDD's key triggers a reshuffle, IIRC):
yourrdd.map{ x => (rand(), x)}
There is obviously a risk that rand() will give the same sequence of numbers
in each partition, so you may need to use
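A minimal sketch of that approach, seeding the RNG from the partition index so that partitions do not all emit the same key sequence (the rdd contents and partition count are placeholders):

import scala.util.Random
import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(1 to 1000, 8)

// Seed per partition, key each element randomly, hash-partition on the
// random key, then drop the key.
val shuffled = rdd
  .mapPartitionsWithIndex { (idx, iter) =>
    val rng = new Random(31 * idx + 17)
    iter.map(x => (rng.nextInt(), x))
  }
  .partitionBy(new HashPartitioner(rdd.partitions.length))
  .values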
Why not something like this: your mobile app pushes data to your web server,
which pushes the data to Kafka or Cassandra or any other database, and you have a
Spark Streaming job running all the time, operating on the incoming data and
pushing the calculated values back. This way, you don't have to start a
Spark
Hi,
We switched from ParallelGC to CMS, and the symptom is gone.
On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this
setting can be seen in web ui's environment tab. But, it still eats memory,
i.e.
I am having trouble using a UDF on a column of Vectors in PySpark which can
be illustrated here:
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
FeatureRow =
Hi A. Bellet,
You can try RDD.randomSplit(weights array), where the weights array is the
array of weights you want to put in the consecutive splits.
For example, RDD.randomSplit(Array(0.7, 0.3)) will create two RDDs,
containing 70% of the data in one and 30% in the other, randomly selecting the
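For reference, a minimal sketch of the call (data is a placeholder RDD; the seed just makes the split reproducible):

// randomSplit returns an Array[RDD[T]], one RDD per weight.
val Array(bigger, smaller) = data.randomSplit(Array(0.7, 0.3), seed = 42L)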
Yeah, that's the problem. There is probably some perfect number of partitions
that provides the best balance between partition size and memory and merge
overhead. Though it's not an ideal solution :(
There could be another way, but very hacky... for example, if you store one
sketch in a singleton per
Hi,
I'm trying to figure out the smartest way to implement a global
count-min sketch on accumulators. For now, we are doing that with RDDs.
It works well, but with one sketch per partition, merging takes too long.
As you probably know, a count-min sketch is a big mutable array of arrays
of
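A sketch of what a global sketch-as-accumulator could look like with the Spark 1.x AccumulatorParam API, treating the count-min sketch as a depth x width grid of counters merged element-wise (the dimensions are placeholders; note the caveat above about tasks that may run more than once):

import org.apache.spark.AccumulatorParam

object CMSParam extends AccumulatorParam[Array[Array[Long]]] {
  def zero(init: Array[Array[Long]]): Array[Array[Long]] =
    Array.fill(init.length, init.head.length)(0L)
  // Merging two count-min sketches of the same shape is element-wise addition.
  def addInPlace(a: Array[Array[Long]], b: Array[Array[Long]]): Array[Array[Long]] = {
    for (i <- a.indices; j <- a(i).indices) a(i)(j) += b(i)(j)
    a
  }
}

val sketch = sc.accumulator(Array.fill(4, 1024)(0L))(CMSParam)
// inside tasks: sketch += sketchBuiltForThisPartition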
Hi,
is there a kind adapter to use GoogleCloudStorage with Spark?
Cheers,
Klaus
--
--
Klaus Schaefers
Senior Optimization Manager
Ligatus GmbH
Hohenstaufenring 30-32
D-50674 Köln
Tel.: +49 (0) 221 / 56939 -784
Fax: +49 (0) 221 / 56 939 - 599
E-Mail: klaus.schaef...@ligatus.com
Web:
hi,
I'm trying to run simple kafka spark streaming example over spark-shell:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import
2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com:
I was thinking exactly the same. I'm going to try it. It doesn't really
matter if I lose an executor: its sketch will be lost, but the work will then be
re-executed somewhere else.
I mean that between the action that will update the
Hi,
I am writing a PySpark streaming program. I have the training data set to
compute the regression model. I want to use the stream data set to
test the model. So I join the RDD with the stream RDD, but I got an
exception. Following are my source code and the exception I got. Any
help is
Hi,
updateStateByKey: if you can briefly describe the issue you are facing with
this, that would be great.
Regarding not keeping the whole dataset in memory, you can tweak the remember
parameter so that it checkpoints at the appropriate time.
Thanks
Twinkle
On Thursday, June 18, 2015, Nipun Arora
It would be the 40%, although it's probably better to think of it as
shuffle vs. data cache, with the remainder going to tasks. As the comments for
the shuffle memory fraction configuration clarify, it takes
memory at the expense of the storage/data cache fraction:
Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share?
Best
Ayman
On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:
Nick is right. I too have implemented this way and it works just fine. In my
case, there can be even
Hi All,
I appreciate the help :)
Here is a sample code where I am trying to keep the data of the previous
RDD and the current RDD in a foreachRDD in Spark Streaming.
I do not know if the code below technically works, as I cannot compile it,
but I am trying to, in a way, keep a historical reference
Also, I'm not sure how threading helps here, because Spark assigns a partition to
each core. On each core there may be multiple threads if you are using
Intel hyper-threading, but I would let Spark handle the threading.
On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:
We
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
dgemm-based calculations.
On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:
Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share?
Best
Ayman
On Jun
Also, in my experiments, it's much faster to do blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:
https://issues.apache.org/jira/browse/SPARK-4823
On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:
Also not sure how
BTW, the user list will be a better place for this thread.
On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai yh...@databricks.com wrote:
Is it the full stack trace?
On Thu, Jun 18, 2015 at 6:39 AM, Sea 261810...@qq.com wrote:
Hi, all:
I want to run Spark SQL on YARN (yarn-client), but ... I already
That general description is accurate, but not really a specific issue of
the direct stream. It applies to anything consuming from Kafka (or, as
Matei already said, any streaming system really). You can't have exactly-once
semantics unless you know something more about how you're storing
results.
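The usual sketch of this with the direct stream is to read the offset ranges off each batch's RDD and store them atomically with the results (stream is a placeholder for a DStream created with createDirectStream):

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Save the results and these offset ranges in one transaction, or make
  // the writes idempotent, so a replayed offset range can be detected
  // and skipped.
}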
You can do something like this:
ObjectListing objectListing;
do {
    objectListing = s3Client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        // inspect objectSummary.getKey() here and filter as needed
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
Hi,
To make the jar files part of the jar which you would like to use, you
should create an uber jar. Please refer to the following:
https://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html
This is a known issue. See
https://issues.apache.org/jira/browse/SPARK-7902 -Xiangrui
On Thu, Jun 18, 2015 at 6:41 AM, calstad colin.als...@gmail.com wrote:
I am having trouble using a UDF on a column of Vectors in PySpark which can
be illustrated here:
from pyspark import SparkContext
from
@Twinkle - what did you mean by "Regarding not keeping the whole dataset in
memory, you can tweak the remember parameter so that it checkpoints at the
appropriate time"?
On Thu, Jun 18, 2015 at 11:40 AM, Nipun Arora nipunarora2...@gmail.com
wrote:
Hi All,
I appreciate the help :)
Here is a
I am reading JSON data that has a different schema for every record. That
is, for a given field that would have a null value, it's simply absent from
that record (and therefore from its schema).
I would like to use the DataFrame API to select specific fields from this
data, and for fields that are
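One way this can work in Spark 1.x, as a rough sketch: read.json infers a union schema across records, so a field absent from a record comes back as null when selected (the field names here are placeholders):

val df = sqlContext.read.json(sc.parallelize(Seq(
  """{"a": 1, "b": "x"}""",
  """{"a": 2}"""
)))
df.select("a", "b").show()  // b is null for the second record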
Hi All.
I have a partitioned table in Hive. The use case is to drop one of the
partitions before inserting new data every time the Spark process runs. I
am using the HiveContext to read and write (dynamic partitions) and also to
alter the table to drop the partition before the insert. Everything runs
Hi,
I have the following piece of code, where I am trying to transform a Spark
stream and add the min and max of each RDD to it. However, I get an error saying
the max call does not exist, at run time (it compiles properly). I am using
Spark 1.4.
I have added the question to stackoverflow as well:
Hi,
We use pants to build a Python project into an executable Python file (pex).
But we cannot run the pex file until we add all the necessary library paths to
PYTHONPATH and use pip to install the necessary packages,
especially for $SPARK_HOME/python/lib/pyspark.zip.
Since we have already added the unzipped
Also, could you give a screenshot of the streaming UI? Even better, could
you run it on Spark 1.4, which has a new streaming UI, and then use that for
debugging/screenshots?
TD
On Thu, Jun 18, 2015 at 3:05 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Which version of Spark? And what is your
Why do you need to uniquely identify the message? All you need is the time
when the message was inserted by the receiver and when it is processed,
isn't it?
On Thu, Jun 18, 2015 at 2:28 PM, anshu shukla anshushuk...@gmail.com
wrote:
Thanks a lot, but I have already tried the second way
Hello,
Currently, there is no NaiveBayes implementation for the ML pipeline. I couldn't
find the JIRA ticket related to it either (or maybe I missed it).
Is there a plan to implement it? If no one has the bandwidth, I can work on
it.
Thanks.
Justin
Thanks for the super-fast response, TD :)
I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera,
are you listening? :D
On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Are you using Spark 1.3.x ? That explains. This issue has been fixed in
Hi, all:
I want to run Spark SQL on YARN (yarn-client), but ... I already set
spark.yarn.jar and spark.jars in conf/spark-defaults.conf.
./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 game.txt
Exception in thread "main" java.lang.NoClassDefFoundError:
Hey Nathan,
I like the first idea better. Let's see what others think. I'd be happy to
review your PR afterwards!
Best,
Burak
On Thu, Jun 18, 2015 at 9:53 PM, Nathan McCarthy
nathan.mccar...@quantium.com.au wrote:
Hey,
Spark Submit adds Maven Central and Spark bintray to the ChainResolver
Since upgrading to Spark 1.4, I'm getting a
scala.reflect.internal.MissingRequirementError when creating a DataFrame
from an RDD. The error references a case class in the application (the
RDD's type parameter), which has been verified to be present.
Items of note:
1) This is running on AWS EMR
Running L1 and picking the non-zero coefficients gives a good estimate of
interesting features as well...
On Jun 17, 2015 4:51 PM, Xiangrui Meng men...@gmail.com wrote:
We don't have it in MLlib. The closest would be the ChiSqSelector,
which works for categorical data. -Xiangrui
On Thu, Jun 11,
Are you using Spark 1.3.x ? That explains. This issue has been fixed in
Spark 1.4.0. Bonus you get a fancy new streaming UI with more awesome
stats. :)
On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith secs...@gmail.com wrote:
Hi,
I just switched from createStream to the createDirectStream API for
Are you writing to an existing Hive ORC table?
On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian lian.cs@gmail.com wrote:
Thanks for reporting this. Would you mind helping to create a JIRA for this?
On 6/16/15 2:25 AM, patcharee wrote:
I found if I move the partitioned columns in schemaString
More details on the direct API of Spark 1.3 are at the Databricks blog:
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Note the use of checkpoints to persist the Kafka offsets in Spark Streaming
itself, and not in ZooKeeper.
Also this
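For reference, a minimal sketch of setting up the direct stream, assuming an existing StreamingContext ssc (the broker list and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))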
If you are writing to an existing Hive table, our insert-into operator
follows Hive's requirement, which is:
the dynamic partition columns must be specified last among the columns in
the SELECT statement and in the same order in which they appear in the
PARTITION() clause.
You can find
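A minimal sketch of that ordering (the table and column names are hypothetical; dt is the dynamic partition column, so it comes last in the SELECT, matching the PARTITION clause):

hiveContext.sql("""
  INSERT INTO TABLE events PARTITION (dt)
  SELECT user_id, action, dt FROM staging_events
""")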
Hello,
I am not sure what is wrong...
But, in my case, I followed the instruction from
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html.
It worked fine with SQuirreL SQL Client
(http://squirrel-sql.sourceforge.net/), and SQL Workbench J
Thanks for reporting. Filed as:
https://issues.apache.org/jira/browse/SPARK-8470
On Thu, Jun 18, 2015 at 5:35 PM, Adam Lewandowski
adam.lewandow...@gmail.com wrote:
Since upgrading to Spark 1.4, I'm getting a
scala.reflect.internal.MissingRequirementError when creating a DataFrame
from an
I saw another report so I filed it already:
Filed as: https://issues.apache.org/jira/browse/SPARK-8470
On Thu, Jun 18, 2015 at 4:07 PM, Chad Urso McDaniel cha...@gmail.com
wrote:
We're using the normal command line:
---
bin/spark-submit --properties-file ./spark-submit.conf --class
Hi, Spark users,
I have a Spark Streaming application that is a Maven project, and I would like to
build it into an uber jar and run it in the cluster.
I have found two options to build the uber jar; either of them has its
shortcomings, so I would like to ask how you guys do it.
Thanks.
1. Use the maven shade
Hi,
I encountered errors fitting a model using a CrossValidator. The training
set contained a feature which was initially a String with many unique
values. I used a StringIndexer to transform this feature column into label
indices. Fitting a model with a regular pipeline worked fine, but I ran
Akhil,
From my test, I can see that the files in the last batch will always be
reprocessed upon restarting from a checkpoint, even for a graceful shutdown.
I think usually a file is expected to be processed only once. Maybe
this is a bug in fileStream? Or do you know any approach to work around
it?
Sorry Du,
Repartition means coalesce(shuffle = true), as per [1]. They are the same
operation. Coalescing with shuffle = false means you are specifying the max
number of partitions after the coalesce (if there are fewer partitions, you
will end up with the lesser amount).
[1]
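This is visible in the RDD source in Spark 1.x, where (paraphrased slightly) repartition simply delegates to coalesce with shuffling enabled:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  coalesce(numPartitions, shuffle = true)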
I'm confused about this. The comment on the function seems to indicate
that there is absolutely no shuffle or network IO, but it also states that
it assigns an even number of parent partitions to each final partition
group. I'm having trouble seeing how this can be guaranteed without some
data
Hi all,
If I want to change the /tmp folder to any other folder for the Spark unit
tests using sbt, how can I do that?
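One way to do it, as a sketch: fork the test JVM in sbt and point java.io.tmpdir elsewhere (the path is a placeholder); javaOptions only takes effect in a forked JVM:

// In build.sbt
fork in Test := true
javaOptions in Test += "-Djava.io.tmpdir=/path/to/other/tmp"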
Hey,
Spark Submit adds Maven Central and Spark bintray to the ChainResolver before it
adds any external resolvers.
https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821
When running on a cluster without internet access, this means the
Hi,
I just switched from createStream to the createDirectStream API for
Kafka, and while things otherwise seem happy, the first thing I noticed is
that stream/receiver stats are gone from the Spark UI :( Those stats were
very handy for keeping an eye on the health of the app.
What's the best way to
Hi Tathagata,
When you say "please mark spark-core and spark-streaming as provided
dependencies", what do you mean?
I have installed the pre-built Spark 1.4 for Hadoop 2.6 from the Spark
downloads. In my Maven pom.xml, I am using version 1.4 as described.
Please let me know how I can fix that.
Thanks
Nipun
On
Hi everyone,
I'm looking into switching raw RDD operations to DataFrame operations. When
I used JavaPairRDD.join(), I had the option to specify the number of
partitions with which to do the join. However, I don't see an equivalent
option in DataFrame.join(). Is there a way to specify the
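As a workaround sketch: DataFrame joins size their shuffle from the spark.sql.shuffle.partitions setting rather than a per-join argument (dfA, dfB, the column name, and the value are placeholders):

sqlContext.setConf("spark.sql.shuffle.partitions", "400")
val joined = dfA.join(dfB, dfA("id") === dfB("id"))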
Glad to hear that. :)
On Thu, Jun 18, 2015 at 6:25 AM, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
We switched from ParallelGC to CMS, and the symptom is gone.
On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
I set spark.shuffle.io.preferDirectBufs to false in
I just published results of my findings here:
https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/
I haven’t tried with 1.4, but I tried with 1.3 a while ago and I could not
get the serialized behavior by using the default scheduler when there is a
failure and retry,
so I created a customized stream like this.
class EachSeqRDD[T: ClassTag] (
parent: DStream[T], eachSeqFunc: (RDD[T], Time) => Unit
Is there any fixed way to find among RDDs in stream processing systems,
in the distributed set-up?
--
Thanks Regards,
Anshu Shukla
Thanks all for the help.
It turned out that using the numpy matrix multiplication made a huge difference
in performance. I suspect that numpy already uses BLAS-optimized code.
Here is the Python code:
# This is where I load and directly test the predictions
myModel =
This seems to be a bug; could you file a JIRA for it?
RDDs should be serializable for a Streaming job.
On Thu, Jun 18, 2015 at 4:25 AM, Groupme grou...@gmail.com wrote:
Hi,
I am writing a PySpark streaming program. I have the training data set to compute
the regression model. I want to use the stream
This is not an independent programmatic way of running a Spark job on a YARN
cluster.
The example I created simply demonstrates how to wire up the classpath so
that spark-submit can be called programmatically. For my use case, I wanted
to hold open a connection so I could send tasks to the executors
Hi All,
I am trying to run KMeans clustering on a large data set with 12,000 points
and 80,000 dimensions. I have a Spark cluster in EC2 standalone mode
with 8 workers running on 2 slaves with 160 GB RAM and 40 vCPUs.
My Code is as Follows:
def convert_into_sparse_vector(A):
Yup, numpy calls into BLAS for matrix multiply.
Sent from my iPad
On 18 Jun 2015, at 8:54 PM, Ayman Farahat ayman.fara...@yahoo.com wrote:
Thanks all for the help.
It turned out that using the numpy matrix multiplication made a huge
difference in performance. I suspect that numpy already
I think you may be including a different version of Spark Streaming in your
assembly. Please mark spark-core and spark-streaming as provided
dependencies. Any installation of Spark will automatically provide Spark in
the classpath, so you do not have to bundle it.
On Thu, Jun 18, 2015 at 8:44 AM,
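For reference, a minimal sketch of marking the Spark artifacts as provided, shown in sbt syntax (a Maven pom would set <scope>provided</scope> on the same artifacts):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.0" % "provided"
)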
Tathagata, thanks for your response. You are right! Everything seems
to work as expected.
Could you please help me understand why the time for processing of all
jobs for a batch is always less than 4 seconds?
Please see my playground code below.
The last modified time of the input (lines) RDD dump
ChiSqSelector takes an RDD of labeled points, where the label is the
target. See
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L120
On Wed, Jun 17, 2015 at 10:22 PM, Ruslan Dautkhanov
dautkha...@gmail.com wrote:
Thank you
It's not clear what you are asking. Find what among the RDDs?
On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla anshushuk...@gmail.com
wrote:
Is there any fixed way to find among RDDs in stream processing systems,
in the distributed set-up?
--
Thanks Regards,
Anshu Shukla
Tathagata,
Could you please confirm that batches are not processed in parallel
during retries in Spark 1.4? See Binh's email copied below. Any
pointers for workarounds, if necessary?
Thanks!
On 18 June 2015 at 14:29, Binh Nguyen Van binhn...@gmail.com wrote:
I haven’t tried with 1.4 but I
Sorry, I missed the LATENCY word... For a large streaming query, how do I
find the time taken by a particular RDD to travel from the initial
DStream to the final/last DStream?
Help please!
On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das t...@databricks.com wrote:
It's not clear what you are
Got it. Thanks!
--
Ruslan Dautkhanov
On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng men...@gmail.com wrote:
ChiSqSelector calls an RDD of labeled points, where the label is the
target. See
Thanks a lot, but I have already tried the second way. The problem with that
is how to identify a particular RDD from source to sink (as we can
do by passing a msg ID in Storm). For that I just updated the RDD and added
a msgID (as a static variable), but while dumping them to file some of the
Hi,
I'm running Spark Standalone on a single node with 16 cores. Master and 4
workers are running.
I'm trying to submit two applications via spark-submit and am getting the
following error when submitting the second one: Initial job has not
accepted any resources; check your cluster UI to ensure
I just realized that --conf needs to be one key-value pair per line. And
somehow I needed
--conf spark.cores.max=2 \
However, when it was
--conf spark.deploy.defaultCores=2 \
then one job would take up all 16 cores on the box.
What's the actual model here?
We've got 10 apps
Couple of ways.
1. Easy but approximate way: find the scheduling delay and processing time using
the StreamingListener interface, and then calculate end-to-end delay = 0.5 *
batch interval + scheduling delay + processing time. The 0.5 * batch
interval is the approximate average batching delay across all the records
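A sketch of option 1 with the StreamingListener interface (the batch interval value is a placeholder):

import org.apache.spark.streaming.scheduler._

class DelayListener(batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // end-to-end delay ~= 0.5 * batch interval + scheduling delay + processing time
    val approxDelay = 0.5 * batchIntervalMs +
      info.schedulingDelay.getOrElse(0L) +
      info.processingDelay.getOrElse(0L)
    println(s"Batch ${info.batchTime}: approx end-to-end delay = $approxDelay ms")
  }
}
// ssc.addStreamingListener(new DelayListener(1000))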
I would also love to see a more recent version of Spark SQL. There have
been a lot of performance improvements between 1.2 and 1.4 :)
On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez snu...@hortonworks.com wrote:
Interesting. What were the Hive settings? Specifically it would be
useful to know
How are you adding com.rr.data.Visit to Spark? With --jars? It is
possible we are using the wrong classloader. Could you open a JIRA?
On Thu, Jun 18, 2015 at 2:56 PM, Chad Urso McDaniel cha...@gmail.com
wrote:
We are seeing class exceptions when converting to a DataFrame.
Anyone out there
I got the same problem with rdd.repartition() in my streaming app, which
generated a few huge partitions and many tiny partitions. The resulting high
data skew makes the processing time of a batch unpredictable, and it often
exceeds the batch interval. I eventually solved the problem by using
We are seeing class exceptions when converting to a DataFrame.
Anyone out there with some suggestions on what is going on?
Our original intention was to use a HiveContext to write ORC, and we saw the
error there and have narrowed it down.
This is an example of our code:
---
def
Interesting. What were the Hive settings? Specifically, it would be useful to
know if this was Hive on Tez.
- Steve
From: Sanjay Subramanian
Reply-To: Sanjay Subramanian
Date: Thursday, June 18, 2015 at 11:08
To: user@spark.apache.org
Subject: Spark-sql versus Impala
We're using the normal command line:
---
bin/spark-submit --properties-file ./spark-submit.conf --class
com.rr.data.visits.VisitSequencerRunner
./mvt-master-SNAPSHOT-jar-with-dependencies.jar
---
Our jar contains both com.rr.data.visits.orc.OrcReadWrite (which you can
see in the stack trace) and
Doesn't repartition call coalesce(shuffle=true)?
On Jun 18, 2015 6:53 PM, Du Li l...@yahoo-inc.com.invalid wrote:
I got the same problem with rdd.repartition() in my streaming app, which
generated a few huge partitions and many tiny partitions. The resulting
high data skew makes the processing
You can specify the jars of your application to be included with spark-submit
with the --jars switch.
Otherwise, are you sure that your newly compiled Spark jar assembly is in
assembly/target/scala-2.10/?
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to
store the cluster centers. That is ~600MB. If there are 10 partitions,
you might need 6GB on the driver to collect updates from workers. I
guess the driver died. Did you specify driver memory with
spark-submit? -Xiangrui
On
repartition() means coalesce(shuffle=false)
On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com wrote:
Doesn't repartition call coalesce(shuffle=true)?
On Jun 18, 2015 6:53 PM, Du Li l...@yahoo-inc.com.invalid wrote:
I got the same problem with rdd.repartition() in my
I am submitting the application from a python notebook. I am launching
pyspark as follows:
SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com
SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g
SPARK_DAEMON_JAVA_OPTS=-XX:MaxPermSize=30g -Xms30g -Xmx30g IPYTHON=1
What is GraphX:
- It can be viewed as a kind of distributed, parallel graph database
- It can be viewed as a graph data structure (Data Structures 101 from
your CS course)
- It features some off-the-shelf algos for graph processing and
navigation (algos and data
Firstly, apologies for the header of my email containing some junk; I
believe it's due to a copy-and-paste error on a smartphone.
Thanks for your response. I will indeed make the PR you suggest, though
glancing at the code I realize it's not just a case of making these public,
since the types are
Which version of Spark? And what is your data source? For some reason, your
processing delay is exceeding the batch duration. And it's strange that you
are not seeing any scheduling delay.
Thanks
Best Regards
On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang chyfan...@gmail.com wrote:
Hi,
I have a
Hi,
I was wondering if it is possible to use MLlib functions inside SparkR, as
outlined at the Spark Summit East 2015 Warmup meetup:
http://www.meetup.com/Spark-NYC/events/220850389/
Are there available examples?
Thank you!
Elena