Hello Shivram
Thanks for your reply.
Here is a simple input data set. The data is in a file called
/sparkdev/datafiles/covariance.txt
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
10,10
The output I would like to see is the total of each column. It can be done with
reduce, but I wanted to test lapply.
Take a look at this gist
https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8
That worked for me.
On Wed, Aug 6, 2014 at 7:32 PM, Sathish Kumaran Vairavelu
vsathishkuma...@gmail.com wrote:
Mohit, this doesn't seem to be working. Can you please provide more
details? When I use it from pyspark
Okay, going back to your original question: it wasn't clear what reduce
function you are trying to implement. Going by the 2nd example using the
window() operation, followed by a count+filter (using SQL), I am guessing
you are trying to maintain a count of all the active states in the
Thanks for the quick answer!
NumPy arrays can only support basic types, so we cannot use them during
collect() by default.
sure, but if you knew that a numpy array went in on one end, you could safely
use it on the other end, no? Perhaps it would require an extension of the RDD
class and
Hi Pranay,
If this data format is to be assumed, then I believe the issue starts at
these lines:
lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt")
totals <- lapply(lines, function(lines)
After the first line, `lines` becomes an RDD of strings, each of which
is a line of the form "1,1".
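For illustration, here is a minimal, self-contained Scala sketch of the intended column-total computation (the SparkR fix is analogous: split each line before summing; everything here beyond the file path is our own naming):

import org.apache.spark.{SparkConf, SparkContext}

object ColumnTotals {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ColumnTotals").setMaster("local"))
    // Each element is a string like "1,1"; parse it into a pair of doubles first.
    val pairs = sc.textFile("/sparkdev/datafiles/covariance.txt").map { line =>
      val Array(a, b) = line.split(",")
      (a.toDouble, b.toDouble)
    }
    // Sum the two columns element-wise in a single reduce.
    val (col1, col2) = pairs.reduce { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }
    println(s"totals: $col1, $col2") // expected: 55.0, 55.0 for the data above
    sc.stop()
  }
}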
I followed the example in
examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala.
In this file, Params is defined as follows:
case class Params(
    input: String = null,
    minPartitions: Int = 0,
    numFeatures: Int = -1,
    lambda: Double = 1.0)
In the main
OK, I think I've figured it out.
It seems to be a bug which has been reported at:
https://issues.apache.org/jira/browse/SPARK-2823 and
https://github.com/apache/spark/pull/1763.
As it says: if the user sets “spark.default.parallelism” and the value is
different from the EdgeRDD partition
Hi,
I am following the code in
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
For setting the parameters and parsing the command-line options, I am just
reusing that code. Params is defined as follows.
case class Params(
input: String = null,
It is used in data loading:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76
On Thu, Aug 7, 2014 at 12:47 AM, SK skrishna...@gmail.com wrote:
I followed the example in
I hope this has been resolved. Were you connected to the right ZooKeeper? Did
Kafka and HBase share the same ZooKeeper and port? If not, did you set the
right config for the HBase job? -- Khanderao
On Wed, Jul 2, 2014 at 4:12 PM, JiajiaJing jj.jing0...@gmail.com wrote:
Hi,
I am trying to write a program
You can download and compile Spark against your existing Hadoop version.
Here's a quick start
https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types
You can also read a bit here
http://docs.sigmoidanalytics.com/index.php/Installing_Spark_andSetting_Up_Your_Cluster
( the
Alan/TD,
We are facing the problem in a project going to production.
Was there any progress on this? Are we able to confirm that this is a
bug/limitation in the current streaming code? Or is there anything wrong in
user scope?
Regards,
Rohit
*Founder CEO, **Tuplejump, Inc.*
Hi,
When I try to use HiveContext in Spark shell on AWS, I got the error
java.lang.IllegalAccessError: tried to access method
com.google.common.collect.MapMaker.makeComputingMap(Lcom/google/common/base/Function;)Ljava/util/concurrent/ConcurrentMap.
I followed the steps below to compile and install
Hey Zhun,
Thanks for the detailed problem description. Please see my comments inline
below.
On Thu, Aug 7, 2014 at 6:18 PM, Zhun Shen shenzhunal...@gmail.com wrote:
Caused by: java.lang.IllegalAccessError: tried to access method
This was a completely misleading error message.
The problem was due to a log message getting dumped to stdout. These messages
were accumulating on the workers, and hence there was no space left on the
device after some time.
When I re-tested with spark-0.9.1, the saveAsTextFile API threw no space
Thanks for your response. I am now able to compile my code.
Hi there,
I'm interested in whether it is possible to get the same behavior as the reduce
function from the MapReduce framework: for each key K, get the list of
associated values, List[V].
There is a function reduceByKey, but it works only with separate values from
the list. Is there any way to get the list? Because I have to
Our lab needs to do some simulations of online social networks. We need to
handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue
and the corresponding eigenvector. Matlab can be used, but it is time-consuming.
Is Spark effective for linear algebra calculations and transformations?
These two posts should be good for setting up a Spark+HBase environment and
using the results of an HBase table scan as an RDD.
Settings:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
Some samples:
http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
You may use groupByKey in this case.
On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com wrote:
Hi there,
I'm interested in whether it is possible to get the same behavior as the reduce
function from the MapReduce framework: for each key K, get the list of associated
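For illustration, a minimal Scala sketch of getting the full list of values per key (the data here is made up; assumes an existing SparkContext sc):

// groupByKey gathers every value for a key, like the reduce side of MapReduce.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey() // RDD[(String, Iterable[Int])] in Spark 1.0
grouped.collect().foreach { case (k, vs) =>
  println(k + " -> " + vs.toList) // e.g. a -> List(1, 2)
}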
Dear all,
I am using Spark 0.9.2 in Standalone mode, with Hive and HDFS in CDH 5.1.0,
and 6 worker nodes, each with 96GB of memory and 32 cores.
I am using the Shark shell to execute queries on Spark.
I have a raw_table (of size 3TB with replication 3) which is partitioned
by year, month, and day. I am running
I'm also getting this - Ryan, we both seem to be running into this issue
with elasticsearch-hadoop :)
I tried spark.files.userClassPathFirst true on the command line and that
doesn't work.
If I put that line in spark/conf/spark-defaults it works, but now I'm
getting:
java.lang.NoClassDefFoundError:
Forgot to include user@
Another email from Amit indicated that there is 1 region in his table.
This wouldn't give you the benefit TableInputFormat is expected to deliver.
Please split your table into multiple regions.
See http://hbase.apache.org/book.html#d3593e6847 and related links.
Cheers
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about
performance
(http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
that reduceByKey will be more efficient than groupByKey; he mentioned that
groupByKey copies all data over the network.
sparkuser2345 wrote
I'm using Spark 1.0.0.
The same works when
- Using Spark 0.9.1.
- Saving to and reading from local file system (Spark 1.0.0)
- Saving to and reading from HDFS (Spark 1.0.0)
I haven't seen people write directly to a SQL database, mainly because it's
difficult to deal with failure. What if the network breaks halfway through the
process? Should we drop all the data in the database and restart from the
beginning? If the process is appending data to the database, then things
become even more complex.
Hi,
Following the document:
# Cloudera CDH 4.2.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
I compiled Spark 1.0.2 with this command:
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package
However, I got two errors:
[INFO] Compiling 14 Scala
The point is that in many cases the operation passed to reduceByKey aggregates
data into a much smaller size, say + and * for integers. String concatenation
doesn’t actually “shrink” data; thus, in your case, rdd.reduceByKey(_ ++ _) and
rdd.groupByKey suffer from similar performance issues. In general,
Maybe a little off topic, but would you mind sharing your motivation for saving
the RDD into a SQL DB?
If you’re just trying to do further transformations/queries with SQL for
convenience, then you may just use Spark SQL directly within your Spark
application without saving the data into a DB:
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁 noty...@gmail.com wrote:
What if the network breaks halfway through the process? Should we drop all the
data in the database and restart from the beginning?
The best way to deal with this -- which, unfortunately, is not commonly
supported -- is with a two-phase commit that can
In Spark Streaming, is there a way to write output to different paths based
on the partition key? The saveAsTextFiles method will write output in the
same directory.
For example, if the partition key has an hour/day column and I want to
separate the DStream output into different directories by
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com wrote:
Maybe a little off topic, but would you mind sharing your motivation for
saving the RDD into a SQL DB?
Many possible reasons (Vida, please chime in with yours!):
- You have an existing database you want to load new
Hello,
I am trying to build Apache Spark version 1.0.1 on Ubuntu 12.04 LTS. After
unzipping the file and running sbt/sbt assembly I get the following error :
rasika@rasikap:~/spark-1.0.1$ sbt/sbt package
Error occurred during initialization of VM
Could not reserve enough space for object heap
Hello all,
I am not sure what is going on – I am getting a NotSerializableException, and
initially I thought it was due to not registering one of my classes with Kryo,
but that doesn’t seem to be the case. I am essentially eliminating duplicates
in a Spark Streaming application by using a “window”
Hi,
Could you try running spark-shell with the flag --driver-memory 2g, or more if
you have more RAM available, and try again?
Thanks,
Burak
- Original Message -
From: AlexanderRiggers alexander.rigg...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 7:37:40
Hi,
I'm trying a simple thing: creating an RDD from a text file (~3GB) located in
GlusterFS, which is mounted by all Spark cluster machines, and calling
rdd.count(); but Spark never manages to complete the job, giving messages
like the following: WARN TaskSchedulerImpl: Initial job has not accepted
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a combiner in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
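For example, a small sketch of the difference (the word-count data is made up; assumes an existing SparkContext sc):

val words = sc.parallelize(Seq("a", "b", "a", "c", "a"))
// reduceByKey combines locally per partition before the shuffle (combiner-style),
// so only one (word, partialCount) pair per word leaves each partition.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
// Same result, but every single (w, 1) pair is shuffled across the network.
val countsSlow = words.map(w => (w, 1)).groupByKey().mapValues(_.sum)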
On Thu,
There are two problems that might be happening:
- You're requesting more resources than the master has available, so
your executors are not starting. Given your explanation this doesn't
seem to be the case.
- The executors are starting, but are having problems connecting back
to the driver. In
Right, Spark is more likely to act as an OLAP system; I believe no one will use
Spark as an OLTP system, so there is always some question about how to share
data between these two platforms efficiently.
More importantly, most enterprise BI tools rely on an RDBMS, or at
least a JDBC/ODBC interface.
darkjh wrote
But in my experience, when reading directly from
s3n, Spark creates only 1 input partition per file, regardless of the file
size. This may lead to some performance problems if you have big files.
This is actually not true; Spark uses the underlying Hadoop input formats to
read the
It could be because of the variable enableOpStat. Since it's defined
outside foreachRDD, referring to it inside the rdd.foreach is probably
causing the whole streaming context to be included in the closure. Scala
funkiness. Try this, see if it works:
msgCount.join(ddCount).foreachRDD((rdd:
The problem boils down to how to write an RDD in that way. You could use
the HDFS FileSystem API to write each partition directly.
pairRDD.groupByKey().foreachPartition { iterator =>
  iterator.foreach { case (key, values) =>
    // Open an output stream to destination file
    // base-path/key/whatever
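For completeness, a fuller sketch of that approach, assuming String keys and values and a hypothetical HDFS base path (error handling omitted):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

pairRDD.groupByKey().foreachPartition { iterator =>
  val conf = new Configuration() // picks up the cluster's Hadoop config on the workers
  iterator.foreach { case (key, values) =>
    // One file per key under the (hypothetical) base path.
    val path = new Path("hdfs://namenode:8020/base-path/" + key + "/part-0")
    val fs = FileSystem.get(path.toUri, conf)
    val out = fs.create(path, true) // overwrite if present
    values.foreach(v => out.write((v + "\n").getBytes("UTF-8")))
    out.close()
  }
}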
For future reference in this thread, a better set of examples than the
MetricAggregatorHBase on the JIRA to look at is here:
https://github.com/tmalaska/SparkOnHBase
On Thu, Aug 7, 2014 at 1:41 AM, Khanderao Kand khanderao.k...@gmail.com
wrote:
I hope this has been resolved. Were you
I wanted to post for validation, to understand if there is a more efficient way
to achieve my goal. I'm currently performing this flow for two distinct
calculations executing in parallel:
1) Sum key/value pairs by using a simple witnessed count (apply 1 in a
mapToPair() and then groupByKey())
2)
Another possible reason behind this may be that there are two versions of
Akka present in the classpath, which are interfering with each other. This
could happen through many scenarios:
1. Launching the Spark application with Scala brings in Akka from Scala, which
interferes with Spark's Akka.
2.
(-incubator, +user)
If your matrix is symmetric (and real I presume), and if my linear
algebra isn't too rusty, then its SVD is its eigendecomposition. The
SingularValueDecomposition object you get back has U and V, both of
which have columns that are the eigenvectors.
There are a few SVDs in
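For illustration, a minimal MLlib sketch (the 2x2 matrix here is a made-up stand-in for your data; assumes an existing SparkContext sc):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows of a small symmetric matrix; its eigenvalues are 3 and 1.
val rows = sc.parallelize(Seq(
  Vectors.dense(2.0, 1.0),
  Vectors.dense(1.0, 2.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)
// svd.s holds the singular values (the eigenvalue magnitudes for a symmetric
// matrix); the columns of svd.V are the corresponding eigenvectors.
println(svd.s)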
It's not running out of memory on the driver though, right? The
executors may need more memory, or you could use more executors.
--executor-memory would let you increase from the default of 512MB.
On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz bya...@stanford.edu wrote:
Hi,
Could you try running
As a follow-up, I commented out that entire block of code and I am still getting
the exception. It may be related to what you are suggesting, so are there any
best practices I can follow to audit other parts of the code?
Thanks,
Mahesh
From: Padmanabhan, Mahesh Padmanabhan
(-incubator, +user)
It's not Spark running out of memory, but SBT, so those env variables
have no effect. They're options to Spark at runtime anyway, not
compile time, and you're intending to compile I take it.
SBT is a memory hog, and Spark is a big build. You will probably need
to give it more
Ashish Rangole wrote
Specify a folder instead of a file name for the input and output, as in:
Output:
s3n://your-bucket-name/your-data-folder
Input: (when consuming the above output)
s3n://your-bucket-name/your-data-folder/*
Unfortunately no luck:
Exception in thread "main"
Can you enable the Java flag -Dsun.io.serialization.extendedDebugInfo=true
for the driver in your driver startup script? That should give an indication of
the sequence of object references that leads to the StreamingContext being
included in the closure.
TD
On Thu, Aug 7, 2014 at 10:23 AM,
Can you try with -Pyarn instead of -Pyarn-alpha?
I'm pretty sure CDH4 ships with the newer Yarn API.
On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu linkpatrick...@live.com wrote:
Hi,
Following the document:
# Cloudera CDH 4.2.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests
Then this may be a bug. Do you mind sharing the dataset so that we can
use it to reproduce the problem? -Xiangrui
On Thu, Aug 7, 2014 at 1:20 AM, SK skrishna...@gmail.com wrote:
Spark 1.0.1
thanks
Did you cache the table? There are a couple of ways of caching a table in
Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide
On Thu, Aug 7, 2014 at 6:51 AM, vinay.kash...@socialinfra.net wrote:
Dear all,
I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0.
6 worker
That won't be it, since you can see from the directory listing that
there are no data files under test -- only _ files and dirs. The
output looks like it was written, or partially written at least, but
didn't finish, in that the part-* files were never moved to the target
dir. I don't know why,
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
The use case I was thinking of was outputting calculations made in Spark
into a SQL database for the presentation layer to access. So in other
words, having a Spark backend in Java that writes to a SQL database and
then having a Rails front-end that can display the data nicely.
On Thu, Aug 7,
@Miles, the latest SVD implementation in MLlib is partially distributed.
Matrix-vector multiplication is computed among all workers, but the right
singular vectors are all stored on the driver. If your symmetric matrix is
n x n and you want the first k eigenvalues, you will need to fit n x k
I am not sure if it is a typo or not, but how are you using
groupByKey to get the summed_values? Assuming you meant reduceByKey(),
these workflows seem pretty efficient.
TD
On Thu, Aug 7, 2014 at 10:18 AM, Dan H. dch.ema...@gmail.com wrote:
I wanted to post for validation to understand
Hi All,
I'm having a bit of trouble with nested data structures in pyspark with
saveAsParquetFile. I'm running master (as of yesterday) with this pull
request added: https://github.com/apache/spark/pull/1802.
*# these all work*
sqlCtx.jsonRDD(sc.parallelize(['{record:
Vida,
What kind of database are you trying to write to?
For example, I found that for loading into Redshift, by far the easiest
thing to do was to save my output from Spark as a CSV to S3, and then load
it from there into Redshift. This is not as slow as you think, because Spark
can write the
If you just want to find the top eigenvalue / eigenvector you can do
something like the Lanczos method. There is a description of a MapReduce
based algorithm in Section 4.2 of [1]
[1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
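If you just want a quick prototype in Spark, plain power iteration (a simpler relative of Lanczos, and not the HEigen algorithm from the paper) is easy to distribute. A sketch under these assumptions: the matrix is symmetric and stored as an RDD of (rowIndex, rowArray) pairs, an n-dimensional dense vector fits in driver memory (fine at 5000 x 5000), and Breeze is on the classpath, as it is for MLlib:

import breeze.linalg.{DenseVector, norm}
import org.apache.spark.rdd.RDD

def powerIteration(rows: RDD[(Int, Array[Double])], n: Int, iters: Int)
    : (Double, DenseVector[Double]) = {
  var v = DenseVector.fill(n)(1.0 / math.sqrt(n)) // unit start vector
  var lambda = 0.0
  for (_ <- 1 to iters) {
    val bv = rows.context.broadcast(v.toArray)
    // One distributed matrix-vector multiply per iteration.
    val wPairs = rows.map { case (i, row) =>
      var s = 0.0
      var j = 0
      while (j < row.length) { s += row(j) * bv.value(j); j += 1 }
      (i, s)
    }.collect()
    val w = DenseVector.zeros[Double](n)
    wPairs.foreach { case (i, s) => w(i) = s }
    lambda = v dot w // Rayleigh quotient estimate of the eigenvalue (||v|| = 1)
    v = w / norm(w)
  }
  (lambda, v)
}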
On Thu, Aug 7, 2014 at 10:54 AM, Li Pu
Thanks for your answers. The dataset is only 400MB, so I shouldn't run out of
memory. I restructured my code now, because I forgot to cache my dataset, and I
set the number of iterations down to 2, but I still get kicked out of Spark.
Did I cache the data wrong (sorry, not an expert)?
scala> import
Yep, this command given in the Spark docs is correct:
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package
and while I also would hope that this works, it doesn't compile:
mvn -Pyarn -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package
I believe later 4.x includes
Isn't sqoop export meant for that?
http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
On Aug 7, 2014 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Vida,
What kind of database are you trying to write to?
For example, I found that for loading into
Does this help? I can’t figure out anything new from this extra information.
Thanks,
Mahesh
2014-08-07 12:27:00,170 [spark-akka.actor.default-dispatcher-4] ERROR
akka.actor.OneForOneStrategy - org.apache.spark.streaming.StreamingContext
- field (class
I think Cloudera only started adding Spark to CDH4 starting with 4.6,
so maybe that's the minimum if you want to try out Spark on CDH4.
On Thu, Aug 7, 2014 at 11:22 AM, Sean Owen so...@cloudera.com wrote:
Yep, this command given in the Spark docs is correct:
mvn -Pyarn-alpha
There is one more configuration option, called spark.closure.serializer, that
can be used to specify the serializer for closures.
Maybe in the class you have the StreamingContext as a field, so when Spark
tries to serialize the whole class, it uses the spark.closure.serializer to
serialize even the
That's a good idea - to write to files first and then load. Thanks.
On Thu, Aug 7, 2014 at 11:26 AM, Flavio Pompermaier pomperma...@okkam.it
wrote:
Isn't sqoop export meant for that?
http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
On Aug 7, 2014 7:59 PM,
From the extended info, I see that you have a function called
createStreamingContext() in your code. Somehow that is getting referenced
in the foreach function. Is the whole foreachRDD code inside the
createStreamingContext() function? Did you try marking the ssc field as
transient?
Here is a
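A minimal sketch of the pattern being described, with hypothetical names (assume stream is a DStream[Int]): get the SparkContext from the RDD inside foreachRDD instead of touching the ssc field, so the StreamingContext is never pulled into the task closure.

// BAD (sketch): referencing `ssc` (a field of the enclosing class) inside the
// closure serializes the whole enclosing object, StreamingContext included:
//   stream.foreachRDD { rdd => rdd.foreach { x => ssc.sparkContext /* ... */ } }
// GOOD: the foreachRDD body runs on the driver, so rdd.context is safe to use.
stream.foreachRDD { rdd =>
  val sc = rdd.context // the SparkContext, obtained without touching ssc
  val threshold = sc.broadcast(42)
  rdd.foreach { x => if (x > threshold.value) println(x) }
}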
Yes, thanks, I did in fact mean reduceByKey(), thus allowing the convenience
method to process the summation by key.
Thanks for your feedback!
DH
Thanks TD, Amit.
I think I figured out where the problem is through the process of commenting
out individual lines of code one at a time :(
Can either of you help me find the right solution? I tried creating the
SparkContext outside the foreachRDD but that didn’t help.
I have an object (let’s
I have a few questions regarding a collaborative filtering model, and was
hoping for some recommendations (no pun intended...)
*Setup*
I have a csv file with user/movie/ratings named unimaginatively
'movies.csv'. Here are the contents:
0,0,5
0,1,5
0,2,0
0,3,0
1,0,5
1,3,0
2,1,4
2,2,0
3,0,0
On Thu, Aug 7, 2014 at 9:06 PM, Jay Hutfles jayhutf...@gmail.com wrote:
0,0,5
0,1,5
0,2,0
0,3,0
1,0,5
1,3,0
2,1,4
2,2,0
3,0,0
3,1,0
3,2,5
3,3,4
4,0,0
4,1,0
4,2,5
val rank = 10
This is likely the problem: your rank is actually larger than the
number of users or items. The error
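For reference, a minimal sketch of training on the data above with a small rank (the file name and column format are from the post; the parameter values are guesses; assumes an existing SparkContext sc):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("movies.csv").map { line =>
  val Array(u, m, r) = line.split(",")
  Rating(u.toInt, m.toInt, r.toDouble)
}
val rank = 2 // must stay well below the number of users/items here
val model = ALS.train(ratings, rank, 10, 0.01) // (data, rank, iterations, lambda)
println(model.predict(0, 2))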
Hi Jay,
I've had the same problem you've been having in Question 1 with a synthetic
dataset. I thought I wasn't producing the dataset well enough. This seems to
be a bug. I will open a JIRA for it.
Instead of using:
ratings.map { case Rating(u, m, r) => {
val pred = model.predict(u, m)
(r
It looks like your Java heap space is too low: -Xmx512m. It's only using 0.5GB
of RAM; try bumping this up.
Hi Brad,
It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908
to track it. It will be fixed soon.
Thanks,
Yin
On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Hi All,
I'm having a bit of trouble with nested data structures in pyspark
Depending on what you mean by "save", you might be able to use the Twitter
Storehaus package to do this. There was a nice talk about this at a Spark
meetup -- Stores, Monoids and Dependency Injection - Abstractions for Spark
Streaming Jobs. Video here:
Thanks Yin!
best,
-Brad
On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai yh...@databricks.com wrote:
Hi Brad,
It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908
to track it. It will be fixed soon.
Thanks,
Yin
On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller
Actually, the issue is that if the values of a field are always null (or the
field is missing), we cannot figure out the data type, so we use NullType (an
internal data type). Right now, we have a step that converts the data type
from NullType to StringType. This logic in the master has a bug.
We
The PR is https://github.com/apache/spark/pull/1840.
On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai yh...@databricks.com wrote:
Actually, the issue is if values of a field are always null (or this field
is missing), we cannot figure out the data type. So, we use NullType (it is
an internal data
Well, I don't see the RDD in the foreachRDD being passed into A.func1(),
so I am not sure what the purpose of the function is. Assuming that you do want
to pass that RDD into that function, and also want to have access to the
SparkContext, you can just pass on the RDD and then access the
Slap my head moment – using rdd.context solved it!
Thanks TD,
Mahesh
From: Tathagata Das
tathagata.das1...@gmail.com
Date: Thursday, August 7, 2014 at 3:06 PM
To: Mahesh Padmanabhan
LOL! Glad it solved it.
TD
On Thu, Aug 7, 2014 at 2:23 PM, Padmanabhan, Mahesh (contractor)
mahesh.padmanab...@twc-contractor.com wrote:
Slap my head moment – using rdd.context solved it!
Thanks TD,
Mahesh
From: Tathagata Das tathagata.das1...@gmail.com
Date: Thursday, August 7, 2014
Not all memory can be used for Java heap space, so maybe it does run out.
Could you try repartitioning the data? To my knowledge you shouldn't be
thrown out as long as a single partition fits into memory, even if the whole
dataset does not.
To do that, exchange
val train = parsedData.cache()
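For concreteness, one way to write that (the partition count is an arbitrary starting point to tune):

// More, smaller partitions so each one fits into the executor heap.
val train = parsedData.repartition(64).cache()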
I'm running into a problem with executors failing, and it's not clear what's
causing it. Any suggestions on how to diagnose and fix it would be
appreciated.
There are a variety of errors in the logs, and I don't see a consistent
triggering error. I've tried varying the number of executors per
What is the environment? YARN, Mesos, or Standalone?
It would be more helpful if you could show more logs.
On Wed, Aug 6, 2014 at 7:25 PM, Avishek Saha avishek.s...@gmail.com wrote:
Hi,
I get a lot of executor lost errors for saveAsTextFile with PySpark
and Hadoop 2.4.
For small
Hi
I wish to migrate from Shark to the spark-sql shell, but I am facing some
difficulties in setting it up.
I cloned the branch-1.0-jdbc to test out the spark-sql shell, but I am
unable to run it after building the source.
I've tried two methods for building (with Hadoop 1.0.4) - sbt/sbt
Hello Zongheng
In fact, the problem is in lapplyPartition.
lapply gives output as
1,1
2,2
3,3
...
10,10
However lapplyPartition gives output as
55, NA
55, NA
Why is the lapply output horizontal while the lapplyPartition output is vertical?
Here is my code
library(SparkR)
sc <- sparkR.init("local")
lines <-
Similarly, I am seeing tasks moved to the completed section which
apparently haven't finished all elements... (succeeded/total 1)... is this
related?
On Thu, Aug 7, 2014 at 12:06 AM, Rok Roskar rokros...@gmail.com wrote:
sure, but if you knew that a numpy array went in on one end, you could safely
use it on the other end, no? Perhaps it would require an extension of the RDD
class and overriding the collect() method.
Could you give a short
Hi,
Thank you for your help. With the new code I am getting the following error
in the driver. What is going wrong here?
14/08/07 13:22:28 ERROR JobScheduler: Error running job streaming job
1407450148000 ms.0
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4528] at apply
at
Are you running on a cluster but giving a local path in ssc.checkpoint(...)?
TD
On Thu, Aug 7, 2014 at 3:24 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
Thank you for your help. With the new code I am getting the following error
in the driver. What is going wrong here?
14/08/07 13:22:28
That is correct. I do ssc.checkpoint(checkpoint). Why is the checkpoint
required?
Hi Khanderao and TD,
Thank you very much for your reply and the new example. I have resolved the
problem. The ZooKeeper port I used wasn't right; the default port is not the
one that I was supposed to use. So I set
hbase.zookeeper.property.clientPort to the correct port and everything
worked.
That is required for driver fault-tolerance, as well as for some
transformations like updateStateByKey that persist information across
batches. It must be an HDFS directory when running on a cluster.
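For example (the HDFS path here is hypothetical):

// On a cluster, the checkpoint directory must be on shared storage such as HDFS.
ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")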
TD
On Thu, Aug 7, 2014 at 4:25 PM, salemi alireza.sal...@udo.edu wrote:
That is correct. I do
I can't figure out how to use Spark Streaming to find the max of a 5-second
batch of data and keep updating the max every 5 seconds. How would I do
this?
You can do the following:
var globalMax = ...
dstreamOfNumericalType.foreachRDD(rdd => {
  globalMax = math.max(rdd.max(), globalMax)
})
globalMax will keep getting updated after every batch
TD
On Thu, Aug 7, 2014 at 5:31 PM, bumble123 tc1...@att.com wrote:
I can't figure out how to use
My problem was that I didn’t know how to add it. For what it’s worth, it was
solved by editing spark-env.sh.
Thanks anyway!
Baoqiang Cao
Blog: http://baoqiang.org
Email: bqcaom...@gmail.com
On Aug 7, 2014, at 3:27 PM, maddenpj madde...@gmail.com wrote:
It looks like your Java heap
What is the definition of regParam and what is the range of values it is
allowed to take?
thanks