Hi Makoto,
I don't remember writing that, but thanks for bringing this issue up!
There are two important settings to check: 1) driver memory (you can
see it from the executor tab), 2) number of partitions (try to use a
small number of partitions). I put two PRs to fix the problem:
1) use broadcast
Then it may be a new issue. Do you mind creating a JIRA to track this
issue? It would be great if you can help locate the line in
BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui
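For reference, a minimal sketch of those two checks (not the actual patch; the file name and partition count are illustrative assumptions):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  // 1) driver memory is easiest to raise at submit time:
  //    spark-submit --driver-memory 4g ...
  val sc = new SparkContext(new SparkConf().setAppName("MetricsCheck"))

  // 2) keep the number of partitions small before computing the metrics
  val scoreAndLabels = sc.textFile("scores.txt")
    .map(_.split(","))
    .map(a => (a(0).toDouble, a(1).toDouble))
    .coalesce(8)

  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  println(metrics.areaUnderROC())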
On Tue, Jul 15, 2014 at 10:56 PM, crater cq...@ucmerced.edu wrote:
I don't really have my code,
Hi all,
I am encountering the following error:
INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
space left on device [duplicate 4]
For each slave, df -h looks roughly like this, which makes the above error
surprising.
Filesystem      Size  Used Avail Use% Mounted on
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote:
Hi all,
I am encountering the following error:
INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
df -i # on a slave
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/xvda1      524288 277701  246587   53% /
tmpfs          1917974      1 1917973    1% /dev/shm
On Tue, Jul 15, 2014 at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:
Check the number of inodes
Hi Chris,
I've encountered this error when running Spark’s ALS methods too. In my case,
it was because I set spark.local.dir improperly, and every time there was a
shuffle, it would spill many GB of data onto the local drive. What fixed it
was setting it to use the /mnt directory, where a
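A minimal sketch of that fix, assuming /mnt is the large local volume on your workers (adjust the path to your own layout):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("ALSJob")                    // illustrative app name
    .set("spark.local.dir", "/mnt/spark")    // where shuffle data gets spilled
  val sc = new SparkContext(conf)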
My query is just a simple query that uses the Spark SQL DSL:
tagCollection.join(selectedVideos).where('videoId === 'id)
On Tue, Jul 15, 2014 at 6:03 PM, Yin Huai huaiyin@gmail.com wrote:
Hi Jao,
Seems the SQL analyzer cannot resolve the references in the Join
condition. What is your
Thanks for the quick responses!
I used your final -Dspark.local.dir suggestion, but I see this during the
initialization of the application:
14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
/vol/spark-local-20140716065608-7b2a
I would have expected something in
Hi Xiangrui,
(2014/07/16 15:05), Xiangrui Meng wrote:
I don't remember writing that, but thanks for bringing this issue up!
There are two important settings to check: 1) driver memory (you can
see it from the executor tab), 2) number of partitions (try to use a
small number of partitions). I put
That makes sense. Thanks everyone for the explanations!
Mingyu
From: Matei Zaharia matei.zaha...@gmail.com
Reply-To: user@spark.apache.org user@spark.apache.org
Date: Tuesday, July 15, 2014 at 3:00 PM
To: user@spark.apache.org user@spark.apache.org
Subject: Re: How does Spark speculation
Hello @ the mailing list,
We are thinking of using Spark in one of our projects on a Hadoop cluster.
During evaluation, several questions remain, which are stated below.
Preconditions
Let's assume Apache Spark is deployed on a hadoop cluster using YARN.
Furthermore a spark execution is running. How
Hi Chris,
Could you also try `df -i` on the master node? How many
blocks/partitions did you set?
In the current implementation, ALS doesn't clean the shuffle data
because the operations are chained together. But it shouldn't run out
of disk space on the MovieLens dataset, which is small.
Thanks for your reply. The SparkContext is configured as below:
sparkConf.setAppName("WikipediaPageRank")
sparkConf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrator", classOf[PRKryoRegistrator].getName)
val inputFile = args(0)
Hi Matthias,
Answers inline.
-Sandy
On Wed, Jul 16, 2014 at 12:21 AM, Matthias Kricke
matthias.kri...@mgm-tp.com wrote:
Hello @ the mailing list,
We are thinking of using Spark in one of our projects on a Hadoop cluster.
During evaluation, several questions remain, which are stated below.
Thanks, your answers totally cover all my questions ☺
Von: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Gesendet: Mittwoch, 16. Juli 2014 09:41
An: user@spark.apache.org
Betreff: Re: How does Apache Spark handles system failure when deployed in YARN?
Hi Matthias,
Answers inline.
-Sandy
On Wed,
Does anyone here have a way to do Spark Streaming with external timing
for windows? Right now, it relies on the wall clock of the driver to
determine the amount of time that each batch read lasts.
We have a Kafka and HDFS ingress into our Spark Streaming pipeline
where the events are annotated
Hi Xiangrui,
Here is the result on the master node:
$ df -i
Filesystem        Inodes   IUsed     IFree IUse% Mounted on
/dev/xvda1        524288  273997    250291   53% /
tmpfs            1917974       1   1917973    1% /dev/shm
/dev/xvdv      524288000      30 524287970    1% /vol
I
Hi Sargun,
There have been few discussions on the list recently about the topic. The
short answer is that this is not supported at the moment.
This is a particularly good thread as it discusses the current state and
limitations:
Thanks.
Really. Comparing the stage data of the two jobs, ‘core7-exec3’ spends about
12.5 minutes more than ‘core2-exec12’ on GC.
From: Nishkam Ravi [mailto:nr...@cloudera.com]
Sent: Wednesday, July 16, 2014 5:28 PM
To: user@spark.apache.org
Subject: Re: executor-cores vs.
Hi Team,
Now I've changed my code to read the configuration from the hbase-site.xml
file (this file is on the classpath). When I run this program using mvn
exec:java
-Dexec.mainClass=com.cisco.ana.accessavailability.AccessAvailability, it
works fine. But when I run this program from spark-submit
Hi,
I'm running a Java program using Spark Streaming 1.0.0 on Cloudera 4.4.0
quickstart virtual machine, with hadoop-client 2.0.0-mr1-cdh4.4.0, which is
the one corresponding to my Hadoop distribution, and that works with other
mapreduce programs, and with the maven property
Yes, but what I show can be done in one Spark job.
On Wed, Jul 16, 2014 at 5:01 AM, Wei Tan w...@us.ibm.com wrote:
Thanks Sean. In Oozie you can use fork-join; however, when using Oozie to drive
Spark jobs, the jobs will not be able to share RDDs (am I right? I think multiple
jobs submitted by Oozie will
Hi everyone!
I'm really new to Spark and I'm trying to figure out which would be the
proper way to do the following:
1.- Read a file header (a single line)
2.- Build a configuration object from it
3.- Use that object in a function that will be called by map()
I thought about using filter()
You can rdd.take(1) to get just the header line.
I think someone mentioned before that this is a good use case for
having a tail method on RDDs too, to skip the header for subsequent
processing. But you can ignore it with a filter, or logic in your map
method.
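A small sketch of both suggestions, assuming a plain text file whose first line is the header:

  val lines = sc.textFile("data.csv")        // hypothetical input path
  val header = lines.take(1)(0)              // just the header line
  val rows = lines.filter(_ != header)       // skip it for subsequent processing
    .map(_.split(","))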
On Wed, Jul 16, 2014 at 11:01 AM,
Thank you! This is what I needed. I've read that first() should work as well.
It's a pity that the taken element cannot be removed from
the RDD though.
Thanks again!
On 16 July 2014 12:09, Sean Owen so...@cloudera.com wrote:
You can rdd.take(1) to get just the header line.
I
Server IPC version 7 cannot communicate with client version 4 means
your client is Hadoop 1.x and your cluster is Hadoop 2.x. The default
Spark distribution is built for Hadoop 1.x. You would have to make
your own build (or, use the artifacts distributed for CDH4.6 maybe?
they are certainly built
Hi,
I am a newbie to Spark SQL and I would like to know how to read all the
columns from a file in Spark SQL. I have referred to the programming guide
here:
http://people.apache.org/~tdas/spark-1.0-docs/sql-programming-guide.html
The example says:
val people =
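The guide's example continues roughly as follows (reconstructed from the linked 1.0 guide, so treat the class and path as that example's, not yours); reading every column is then just a SELECT *:

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext._   // implicit conversion so the RDD can be registered as a table

  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
  people.registerAsTable("people")

  // "SELECT *" returns every column of the table
  sqlContext.sql("SELECT * FROM people").collect().foreach(println)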
Thanks Matei.
On Tue, Jul 15, 2014 at 11:47 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Yup, as mentioned in the FAQ, we are aware of multiple deployments running
jobs on over 1000 nodes. Some of our proof of concepts involved people
running a 2000-node job on EC2.
I wouldn't confuse
Hello all,
Can anyone offer any insight on the below?
Both are legal Spark, but the first one works and the latter does not.
They both work on a local machine but in a standalone cluster the one
with countByValue fails.
Thanks!
Ognen
On 7/15/14, 2:23 PM, Ognen Duzlevski wrote:
Hello,
I
Hi,
I’m trying to run the Spark (1.0.0) shell on EMR and encountering a classpath
issue.
I suspect I’m missing something gloriously obvious, but so far it is eluding
me.
I launch the EMR Cluster (using the aws cli) with:
aws emr create-cluster --name "Test Cluster" \
--ami-version
Hi All,
I'm trying to do a simple record matching between 2 files and wrote
the following code -
import org.apache.spark.sql.SQLContext;
import org.apache.spark.rdd.RDD
object SqlTest {
  case class Test(fld1:String, fld2:String, fld3:String, fld4:String,
fld4:String, fld5:Double,
Check your executor logs for the output, or if your data is not big, collect it
in the driver and print it.
On Jul 16, 2014, at 9:21 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Hi All,
I'm trying to do a simple record matching between 2 files and wrote following
Hi,
I think same issue is happening with the constructor of the
PartitionPruningRDD class. It hasn't been fixed in version 1.0.1 Should this
be reported to JIRA?
Hi Soumya,
Data is very small, 500+ lines in each file.
I removed the last 2 lines and placed matched.collect().foreach(println) at
the end. Still no luck. It's been more than 5 minutes and the execution is
still running.
I checked the logs; nothing in stdout. In stderr I don't see anything going
wrong, all
When you submit your job, it should appear on the Spark UI. Same with the
REPL. Make sure your job is submitted to the cluster properly.
On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Hi Soumya,
Data is very small, 500+ lines in each file.
Yes it is appearing on the Spark UI, and remains there with state as
RUNNING till I press Ctrl+C in the terminal to kill the execution.
Barring the statements to create the spark context, if I copy-paste the
lines of my code into the spark shell, it runs perfectly, giving the desired output.
~Sarath
On
Can you try submitting a very simple job to the cluster?
On Jul 16, 2014, at 10:25 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Yes it is appearing on the Spark UI, and remains there with state as
RUNNING till I press Ctrl+C in the terminal to kill the execution.
I think what you might be looking for is the ability to programmatically
specify the schema, which is coming in 1.1.
Here's the JIRA: SPARK-2179
https://issues.apache.org/jira/browse/SPARK-2179
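A hedged sketch of what that JIRA adds (the API may still change before 1.1 ships): build a schema at runtime and apply it to an RDD of Rows. Column names and paths here are hypothetical.

  import org.apache.spark.sql._

  val sqlContext = new SQLContext(sc)

  val schemaString = "fld1 fld2 fld3"
  val schema = StructType(
    schemaString.split(" ").map(name => StructField(name, StringType, nullable = true)))

  val rowRDD = sc.textFile("data.txt").map(_.split(",")).map(p => Row(p(0), p(1), p(2)))
  val schemaRDD = sqlContext.applySchema(rowRDD, schema)
  schemaRDD.registerTempTable("records")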
On Wed, Jul 16, 2014 at 8:24 AM, pandees waran pande...@gmail.com wrote:
Hi,
I am a newbie to Spark
Yes, but if both tagCollection and selectedVideos have a column named id
then Spark SQL does not know which one you are referring to in the where
clause. Here's an example with aliases:
val x = testData2.as('x)
val y = testData2.as('y)
val join = x.join(y, Inner, Some("x.a".attr ===
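The truncated line above presumably ends like this, and applied to the original query it would look roughly as follows (the column references for tagCollection/selectedVideos are an assumption):

  val x = testData2.as('x)
  val y = testData2.as('y)
  val join = x.join(y, Inner, Some("x.a".attr === "y.a".attr))

  // for the tagCollection / selectedVideos case:
  val result = tagCollection.as('t)
    .join(selectedVideos.as('v), Inner, Some("t.videoId".attr === "v.id".attr))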
Yes Soumya, I did it.
First I tried with the example available in the documentation (example
using people table and finding teenagers). After successfully running it, I
moved on to this one, which is the starting point of a bigger requirement for
which I'm evaluating Spark SQL.
On Wed, Jul 16, 2014
What if you just run something like:
sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Yes Soumya, I did it.
First I tried with the example available in the documentation
Hi Michael,
Tried it. It's correctly printing the line counts of both the files. Here's
what I tried -
Code:
package test
object Test4 {
  case class Test(fld1: String,
    fld2: String,
    fld3: String,
    fld4: String,
    fld5: String,
    fld6: Double,
    fld7:
Hi Xiangrui,
I accidentally did not send df -i for the master node. Here it is at the
moment of failure:
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/xvda1      524288 280938  243350   54% /
tmpfs          3845409      1 3845408    1% /dev/shm
/dev/xvdb
Hi TD,
I defined the case class outside the main method and was able to compile
the code successfully. But I'm getting a runtime error when trying to process
some json files from Kafka. Here is the code I ran to compile:
import java.util.Properties
import kafka.producer._
import
Should I take it from the lack of replies that the --ebs-vol-size feature
doesn't work?
-Ben
please add
From: Ben Horner [via Apache Spark User List]
ml-node+s1001560n9934...@n3.nabble.com
Date: Wednesday, July 16, 2014 at 8:47 AM
To: Ben Horner ben.hor...@atigeo.com
Subject: Re: Trouble with spark-ec2 script:
Hi there,
I am looking for a GBM MLlib implementation. Does anyone know if there is a
plan to roll it out soon?
Thanks!
Pedro
Hi Burak,
Thank you for your pointer, it really helped. I do have some
follow-up questions though.
After looking at the Big Data Benchmark page
https://amplab.cs.berkeley.edu/benchmark/ (Section Run this benchmark
yourself), I was expecting the following combination of files:
Sets:
Hi Ben,
It worked for me, but only when using the default region. Using
--region=us-west-2 resulted in errors about security groups.
Chris
On Wed, Jul 16, 2014 at 8:53 AM, Ben Horner ben.hor...@atigeo.com wrote:
please add
From: Ben Horner [via Apache Spark User List] [hidden email]
so I need to reconfigure my SparkContext this way:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)
And start a new cluster
Hi Pedro,
Yes, although they will probably not be included in the next release (since
the code freeze is ~2 weeks away), GBM (and other ensembles of decision
trees) are currently under active development. We're hoping they'll make
it into the subsequent release.
-Ameet
On Wed, Jul 16, 2014 at
Hello community,
tried to run storm app on yarn, using cloudera hadoop and spark distro
(from http://archive.cloudera.com/cdh5/cdh/5)
hadoop version: hadoop-2.3.0-cdh5.0.3.tar.gz
spark version: spark-0.9.0-cdh5.0.3.tar.gz
DEFAULT_YARN_APPLICATION_CLASSPATH is part of hadoop-api-yarn jar ...
Hi Srinivas,
Seems the query you used is val results = sqlContext.sql("select type from
table1"). However, table1 does not have a field called type. The schema of
table1 is defined as the class definition of your case class Record (i.e. ID,
name, score, and school are fields of your table1). Can you
Hi Ameet, that's great news!
Thanks,
Pedro
On Wed, Jul 16, 2014 at 9:33 AM, Ameet Talwalkar atalwal...@gmail.com
wrote:
Hi Pedro,
Yes, although they will probably not be included in the next release
(since the code freeze is ~2 weeks away), GBM (and other ensembles of
decision trees) are
Somewhere in here, you are not actually running vs Hadoop 2 binaries.
Your cluster is certainly Hadoop 2, but your client is not using the
Hadoop libs you think it is (or your compiled binary is linking
against Hadoop 1, which is the default for Spark -- did you change
it?)
On Wed, Jul 16, 2014
Hi,
My application has multiple dstreams on the same inputstream:
dstream1 // 1 second window
dstream2 // 2 second window
dstream3 // 5 minute window
I want to write logic that deals with all three windows (e.g. when the 1
second window differs from the 2 second window by some delta ...)
I've
Andrew,
Are you running on a CM-managed cluster? I just checked, and there is a
bug here (fixed in 1.0), but it's avoided by having
yarn.application.classpath defined in your yarn-site.xml.
-Sandy
On Wed, Jul 16, 2014 at 10:02 AM, Sean Owen so...@cloudera.com wrote:
Somewhere in here, you
I'm joining several kafka dstreams using the join operation but you have
the limitation that the duration of the batch has to be the same, i.e. 1 second
window for all dstreams... so it would not work for you.
2014-07-16 18:08 GMT+01:00 Walrus theCat walrusthe...@gmail.com:
Hi,
My application has
OK, if you're sure your binary has Hadoop 2 and/or your classpath has
Hadoop 2, that's not it. I'd look at Sandy's suggestion then.
On Wed, Jul 16, 2014 at 6:11 PM, Andrew Milkowski amgm2...@gmail.com wrote:
thanks Sean! so what I did is in project/SparkBuild.scala I made it compile
with
thanks Sandy, no CM-managed cluster, straight from the cloudera tar (
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.3.tar.gz)
trying your suggestion immediately! thanks so much for taking the time..
On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Andrew,
Are
Hi Tathagata,
I have tried the repartition method. The reduce stage first had 2 executors
and then it had around 85 executors. I specified repartition(300), and each
executor was given 2 cores when I submitted the job. This
shows repartition works to increase the number of executors. However,
Hi all,
I just installed a mesos 0.19 cluster. I am failing to execute basic SparkQL
operations on text files with Spark 1.0.1 with the spark-shell.
I have one Mesos master without zookeeper and 4 mesos slaves.
All nodes are running JDK 1.7.51 and Scala 2.10.4.
The spark package is
Yeah -- I tried the .union operation and it didn't work for that reason.
Surely there has to be a way to do this, as I imagine this is a commonly
desired goal in streaming applications?
On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
I'm joining
hum... maybe consuming all streams at the same time with an actor that
would act as a new DStream source... but this is just a random idea... I
don't really know if that would be a good idea or even possible.
2014-07-16 18:30 GMT+01:00 Walrus theCat walrusthe...@gmail.com:
Yeah -- I tried the
Or, if not, is there a way to do this in terms of a single dstream? Keep
in mind that dstream1, dstream2, and dstream3 have already had
transformations applied. I tried creating the dstreams by calling .window
on the first one, but that ends up with me having ... 3 dstreams... which
is the same
Hey, at least it's something (thanks!) ... not sure what I'm going to do if
I can't find a solution (other than not using Spark), as I really need these
capabilities. Anyone got anything else?
On Wed, Jul 16, 2014 at 10:34 AM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
hum...
Sandy, perfect! You saved me tons of time! I added this in yarn-site.xml and the job
ran to completion.
Can you do me (us) a favor and push the newest, patched spark/hadoop to cdh5
(tars) if possible?
And thanks again for this (huge time saver)
On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza
For others, to solve the problem in this topic, add the following in yarn-site.xml:
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
Hi Rajesh,
I saw : Warning: Local jar /home/rajesh/hbase-0.96.1.1-hadoop2/lib/hbase
-client-0.96.1.1-hadoop2.jar, does not exist, skipping.
in your log.
I believe this jar contains the HBaseConfiguration. I'm not sure what went
wrong in your case but can you try without spaces in --jars
i.e.
While reading the Spark Streaming API, I'm confused by the 3
different durations:
StreamingContext(conf: SparkConf
http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
, batchDuration: Duration
Thanks for sharing your experience. I got the same experience -- multiple
moderate JVMs beat a single huge JVM.
Besides the minor JVM starting overhead, is it always better to have
multiple JVMs rather than a single one?
Best regards,
Wei
-
Wei Tan, PhD
Is the class that is not found in the wikipediapagerank jar?
TD
On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wh.s...@gmail.com wrote:
Thanks for your reply. The SparkContext is configured as below:
sparkConf.setAppName("WikipediaPageRank")
sparkConf.set("spark.serializer",
Hi Tom,
Actually I was mistaken, sorry about that. Indeed on the website, the keys for
the datasets you mention are not showing up. However,
they are still accessible through the spark-shell, which means that they are
there.
So in order to answer your questions:
- Are the tiny and 1node sets
Mostly true. The execution of two equivalent logical plans will be exactly
the same, independent of the dialect. Resolution can be slightly different
as SQLContext defaults to case sensitive and HiveContext defaults to case
insensitive.
One other very technical detail: The actual planning done
Now I see the answer to this.
Spark slaves start on random ports and tell the master where they are;
then the master acknowledges them.
(worker logs)
Starting Spark worker :43282
(master logs)
Registering worker on :43282 with 8 cores, 16.5 GB RAM
Thus, the port is random because
The only other thing to keep in mind is that window duration and slide
duration have to be multiples of batch duration, IDK if you made that fully
clear
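A small sketch of that constraint, assuming a 2-second batch interval (app name, source and port are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("WindowDemo")
  val ssc = new StreamingContext(conf, Seconds(2))      // batch duration
  val lines = ssc.socketTextStream("localhost", 9999)
  // window and slide durations must both be multiples of the 2s batch duration
  val windowed = lines.window(Seconds(10), Seconds(4))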
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
what's causing this to break.
One interesting point that we just discovered is that if we run the driver
and the slave (worker) on the same host it runs, but if we run the driver
on a separate host it does not run.
Note that runnning a simple map+reduce job on the same hdfs files with the
same installation works fine:
Did you call collect() on the totalLength? Otherwise nothing has actually
executed.
Oh, I'm sorry... reduce is also an operation
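A small sketch of the point being made: transformations are lazy, and only an action such as reduce, count or collect forces execution.

  val lengths = sc.textFile("data.txt").map(_.length)   // nothing runs yet
  val totalLength = lengths.reduce(_ + _)               // reduce is an action, so this executes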
On Wed, Jul 16, 2014 at 3:37 PM, Michael Armbrust mich...@databricks.com
wrote:
Note that runnning a simple map+reduce job on the same hdfs files with the
same installation works fine:
Did you call collect() on the totalLength? Otherwise
Hi All, I am new to Spark. I have written a program to read data from a local big file,
sort it using Spark SQL and then filter based on some validation rules. I have
tested this program with a file of 23,860,746 lines, and it took 39 secs (2
cores and Xmx as 6gb). But when I want to serialize it to a local file,
One way to do that is currently possible is given here
http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAMwrk0=b38dewysliwyc6hmze8tty8innbw6ixatnd1ue2-...@mail.gmail.com%3E
On Wed, Jul 16, 2014 at 1:16 AM, Gerard Maas gerard.m...@gmail.com wrote:
Hi Sargun,
There have
Yeah. I have been wondering how to check this in the general case, across
all deployment modes, but that's a hard problem. Last week I realized that
even if we can do it just for local, we can get the biggest bang for the
buck.
TD
On Tue, Jul 15, 2014 at 9:31 PM, Tobias Pfeiffer t...@preferred.jp
I hope it all works :)
On Wed, Jul 16, 2014 at 9:08 AM, gorenuru goren...@gmail.com wrote:
Hi and thank you for your reply.
Looks like it's possible. It looks like a hack to me because we are
specifying the batch duration when creating the context. This means that if we
specify a batch
I think I know what the problem is. Spark Streaming is constantly doing
garbage cleanup, throwing away data that it no longer needs based on the
operations in the DStream. Here the DStream operations are not aware of the
Spark SQL queries that happen asynchronously to Spark Streaming. So data
is
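The rest of that reply is cut off above; one possible remedy (an assumption on my part, not necessarily what TD goes on to suggest) is to tell the StreamingContext to remember batch data for longer than the DStream operations alone require, so the asynchronous SQL queries can still see it:

  import org.apache.spark.streaming.Minutes

  ssc.remember(Minutes(5))   // keep each batch's RDDs around for 5 minutes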
Hi Michael,
Thanks for your reply. Yes, the reduce triggered the actual execution, I got
a total length (totalLength: 95068762, for the record).
Dear List,
The version of pyspark on master has a lot of nice new features, e.g.
SequenceFile reading, pickle i/o, etc:
https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353
I downloaded the recent 1.0.1 release and was surprised to see the
distribution did not include these
Hmm, it could be some weirdness with classloaders / Mesos / Spark SQL?
I'm curious if you would hit an error if there were no lambda functions
involved. Perhaps if you load the data using jsonFile or parquetFile.
Either way, I'd file a JIRA. Thanks!
On Jul 16, 2014 6:48 PM, Svend
Hi Ron,
I just checked and this bug is fixed in recent releases of Spark.
-Sandy
On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen ches...@alpinenow.com wrote:
Ron,
Which distribution and Version of Hadoop are you using ?
I just looked at CDH5 ( hadoop-mapreduce-client-core-
You should expect master to compile and run: patches aren't merged unless
they build and pass tests on Jenkins.
You shouldn't expect new features to be added to stable code in maintenance
releases (e.g. 1.0.1).
AFAIK, we're still on track with Spark 1.1.0 development, which means that
it should
Thanks to both for the comments and the debugging suggestion; I will try to
use it.
Regarding your comment, yes, I do agree the current solution was not efficient,
but to use the saveToCassandra method I need an RDD, thus the parallelize
method. I finally got directed by Piotr to use the
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too
few SNAP tasks - only 2 or 3 when I had specified 46. The job failed,
perhaps for unrelated reasons, with some odd exceptions in the log (at the
end of this message). But I really don't want to force data movement
Yeah, we try to have a regular 3 month release cycle; see
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the current
window.
Matei
On Jul 16, 2014, at 4:21 PM, Mark Hamstra m...@clearstorydata.com wrote:
You should expect master to compile and run: patches aren't merged
Hi Ravi,
I have seen a similar issue before. You can try to set
fs.hdfs.impl.disable.cache to true in your hadoop configuration. For
example, suppose your hadoop configuration object is hadoopConf; you can use
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)
Let me know if that helps.
Hi all,
I am currently using Spark Streaming to conduct real-time data analytics.
We receive data from Kafka. We want to generate output files that contain
results that are based on the data we receive from a specific time
interval.
I have several questions on Spark Streaming's timestamp:
1)
Hi,
It seems that spark-ec2 script deploys Tachyon module along with other
setup.
I am trying to use .persist(OFF_HEAP) for RDD persistence, but on worker I
see this error
--
Failed to connect (2) to master localhost/127.0.0.1:19998 :
java.net.ConnectException: Connection refused
--
From
Have you taken a look at DStream.transformWith( ... )? That allows you to
apply arbitrary transformations between RDDs (of the same timestamp) of two
different streams.
So you can do something like this:
2s-window-stream.transformWith(1s-window-stream, (rdd1: RDD[...], rdd2:
RDD[...]) => {
...
//
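A hedged completion of that sketch (stream names and element type are illustrative; lines is the base DStream, e.g. from ssc.socketTextStream, with a 1-second batch duration):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.Seconds

  val oneSecStream = lines.window(Seconds(1))
  val twoSecStream = lines.window(Seconds(2))

  val diff = twoSecStream.transformWith(oneSecStream,
    (rdd1: RDD[String], rdd2: RDD[String]) => {
      // arbitrary logic over the two windows' RDDs for the same batch time,
      // e.g. what is in the 2s window but not the 1s window
      rdd1.subtract(rdd2)
    })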
Hi,
I want to use Spark with HBase and I'm confused about how to ingest my data
using HBase's HFileOutputFormat. It recommends calling
configureIncrementalLoad which does the following:
- Inspects the table to configure a total order partitioner
- Uploads the partitions file to the cluster
Answers inline.
On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am currently using Spark Streaming to conduct a real-time data
analytics. We receive data from Kafka. We want to generate output files
that contain results that are based on the data we
Can anyone explain to me what the difference is between the kmeans in MLlib and the
kmeans in examples/src/main/python/kmeans.py?
Best Regards
...
Amin Mohebbi
PhD candidate in Software Engineering
at university of Malaysia
H/P : +60 18
You should try cleaning and then building. We have recently hit a bug in
the scala compiler that sometimes causes non-clean builds to fail.
On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Yeah, we try to have a regular 3 month release cycle; see