When I use reduceByKeyAndWindow with func and invFunc (in PySpark), the size
of the window keeps growing. I am appending the code that reproduces this
issue. It prints out the count() of the DStream, which goes up by 10 elements
every batch.
Is this a bug in the Python version of the Scala API, or is this
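For reference, a minimal Scala sketch of the windowed-count pattern being described (pairStream and the checkpoint directory are assumptions; the inverse function requires checkpointing to be enabled):

import org.apache.spark.streaming.Seconds

ssc.checkpoint("/tmp/spark-checkpoints") // required when an invFunc is used

val windowedCounts = pairStream.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // func: fold values entering the window
  (a: Int, b: Int) => a - b, // invFunc: remove values leaving the window
  Seconds(30),               // window length
  Seconds(10))               // slide interval
windowedCounts.count().print()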
It could be stuck on a GC pause. Can you check a bit more in the executor
logs and see what's going on? Also, from the driver UI you can see at which
stage it is stuck, etc.
Thanks
Best Regards
On Sun, Aug 16, 2015 at 11:45 PM, unk1102 umesh.ka...@gmail.com wrote:
Hi I have
You need to debug further and figure out the bottleneck. Why are you doing
a collect? If the dataset is too huge, that will most likely hang the driver
machine. It would be good if you could paste the sample code; without that
it's really hard to understand the flow of your program.
Thanks
Best Regards
Or can I generally create a new RDD from a transformation and enrich its
partitions with some metadata, so that I could copy the OffsetRanges into my
new RDD in the DStream?
On Mon, Aug 17, 2015 at 1:08 PM, Petr Novak oss.mli...@gmail.com wrote:
Hi all,
I need to transform KafkaRDD into a new stream of
Yeah, lots of libraries need to be changed to compile in order to run the
examples in IntelliJ.
Thanks,
Xiaohe
On Mon, Aug 17, 2015 at 10:01 AM, Jeff Zhang zjf...@gmail.com wrote:
Check the examples module's dependencies (right-click examples and click Open
Module Settings); by default
That looks like a Scala version mismatch.
Thanks
Best Regards
On Fri, Aug 14, 2015 at 9:04 PM, saif.a.ell...@wellsfargo.com wrote:
Hi All,
I have a working program in which I create two big Tuple2s out of the data.
This seems to work in local mode, but when I switch over to cluster standalone
mode,
In Spark Streaming you can simply check whether your RDD contains any
records, and if records are there you can save them using a
FileOutputStream:
dstream.foreachRDD { rdd =>
  val count = rdd.count()
  if (count > 0) {
    // SAVE YOUR STUFF
  }
}
This will not create unnecessary files of 0 bytes.
On Mon,
What does this mean in .setMaster("local[2]")?
Is this applicable only for standalone mode?
Can I do this in a cluster setup, e.g.:
.setMaster("hostname:port[2]")?
Is it the number of threads per worker node?
Currently, Spark Streaming creates a new directory for every batch and
stores the data in it (whether it has anything or not). There is no direct
append call as of now, but you can achieve this either with
FileUtil.copyMerge
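A sketch of the copyMerge approach (directory and file names are placeholders; the seven-argument signature is the Hadoop 2.x one):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// merge every part file under one batch directory into a single destination file
FileUtil.copyMerge(
  fs, new Path("/out/batch-1439800000000"),
  fs, new Path("/out/merged/data.txt"),
  false, // keep the source directory
  conf,
  null)  // no separator string between the merged files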
How to create a ClassTag in Java? Also, the constructor of
DirectKafkaInputDStream takes a Function1, not a Function, but
KafkaUtils.createDirectStream allows a function.
I have the below as an overridden DirectKafkaInputDStream:
public class CustomDirectKafkaInputDstream extends
DirectKafkaInputDStream<byte[],
s3n underneath uses the Hadoop API, so I guess it would partition according
to your Hadoop configuration (128 MB per partition by default).
Thanks
Best Regards
On Mon, Aug 17, 2015 at 2:29 PM, matd matd...@gmail.com wrote:
Hello,
I would like to understand how the work is parallelized across
Have a look at Mesos.
Thanks
Best Regards
On Sat, Aug 15, 2015 at 1:03 PM, Vikram Kone vikramk...@gmail.com wrote:
Hi,
We are planning to install Spark in standalone mode on a Cassandra cluster.
The problem is, since Cassandra has a no-SPOF architecture, i.e. any node can
become the master for
Hi Praveen,
On Mon, Aug 17, 2015 at 12:34 PM, praveen S mylogi...@gmail.com wrote:
What does this mean in .setMaster("local[2]")?
Local mode (executors in the same JVM) with 2 executor threads.
Is this applicable only for standalone mode?
It is not applicable for standalone mode, only for
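For reference, a minimal sketch of the two master URL forms (host name and port are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// local mode: driver and executors in one JVM; "[2]" means two worker threads
val localConf = new SparkConf().setAppName("demo").setMaster("local[2]")

// standalone cluster: point at the master's URL instead; no thread suffix here
val clusterConf = new SparkConf().setAppName("demo").setMaster("spark://hostname:7077")

val sc = new SparkContext(localConf)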
Hi all,
I need to transform a KafkaRDD into a new stream of deserialized case
classes. I want to use the new stream to save it to a file and to perform
additional transformations on it.
To save it I want to use the offsets in the filenames, hence I need the
OffsetRanges in the transformed RDD. But KafkaRDD is
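A common pattern here, as a sketch (it relies on KafkaRDD implementing HasOffsetRanges; deserialize is a hypothetical decoder into the case classes):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  // grab the offset ranges before any shuffle breaks the 1:1 partition mapping
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map { case (key, bytes) =>
  deserialize(bytes) // hypothetical: bytes -> case class
}.foreachRDD { rdd =>
  // offsetRanges for this batch can now go into the output filenames
}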
Hello,
I would like to understand how the work is parallelized across a Spark
cluster (and what is left to the driver) when I read several files from a
single folder in S3: s3n://bucket_xyz/some_folder_having_many_files_in_it/
How are files (or file parts) mapped to partitions?
Thanks
Mathieu
Are you triggering an action within the while loop? How are you loading the
data from JDBC? You need to make sure the job has enough partitions to run in
parallel to increase the performance.
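As a sketch, a partitioned JDBC read in the DataFrame API looks like this (URL, table, column, bounds, and credentials are all placeholders):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")

// split the read into 16 parallel partitions on a numeric column
val df = sqlContext.read.jdbc(
  "jdbc:postgresql://dbhost:5432/mydb", "mytable",
  "id",      // partition column
  1L,        // lower bound of id
  1000000L,  // upper bound of id
  16,        // number of partitions
  props)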
Thanks
Best Regards
On Sat, Aug 15, 2015 at 2:41 AM, saif.a.ell...@wellsfargo.com wrote:
Hello all,
I
Have you tried adding the path to the hbase-protocol jar to
spark.driver.extraClassPath and spark.executor.extraClassPath?
Cheers
On Mon, Aug 17, 2015 at 7:51 PM, stark_summer stark_sum...@qq.com wrote:
spark version: 1.4.1
java version: 1.7
hadoop version:
Hadoop 2.3.0-cdh5.1.0
submit spark job to
I have a 4-node cluster and have been playing around with the num-executors,
executor-memory and executor-cores parameters.
I set the following:
--executor-memory=10G
--num-executors=1
--executor-cores=8
But when I run the job, I see that each worker is running one executor
which has 2 cores.
Has anyone experienced issues running Spark 1.4.1 on Mac OS X Yosemite?
I've been running a standalone 1.3.1 fine, but it failed when trying to run
1.4.1 (I also tried 1.4.0).
I've tried both the pre-built packages as well as compiling from source,
both with the same results (I can successfully
The code below is taken from the Spark website and generates the error
detailed
Hi, using Spark 1.3 and trying some sample code:
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
//
I had success earlier today on OS X Yosemite 10.10.4 building Spark 1.4.1
using these instructions
http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
(using
`$ sbt/sbt clean assembly`, with the additional step of downloading the
proper sbt-launch.jar (0.13.7) from
Approach 1:
Submit the spark job adding the below:
--conf
spark.driver.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar
--conf
spark.executor.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar
such as:
spark version: 1.4.1
java version: 1.7
hadoop version:
Hadoop 2.3.0-cdh5.1.0
I submit a spark job to a yarn cluster that reads hbase data; after the job
runs, the error below appears:
15/08/17 19:28:33 ERROR yarn.ApplicationMaster: User class threw exception:
org.apache.hadoop.hbase.DoNotRetryIOException:
I'd recommend using the built-in save and load, which will be better for
cross-version compatibility. You should be able to call
myModel.save(path), and load it back with
MatrixFactorizationModel.load(path).
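A minimal sketch (note the MLlib 1.3+ signatures also take the SparkContext; the path is a placeholder):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

myModel.save(sc, "hdfs:///models/als")
val restored = MatrixFactorizationModel.load(sc, "hdfs:///models/als")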
On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa madawa...@cse.mrt.ac.lk
wrote:
Hi All,
Yes, they are both set. Just recompiled and still no success; silent
failure.
Which versions of Java and Scala are you using?
On 17 August 2015 at 19:59, Charlie Hack charles.t.h...@gmail.com wrote:
I had success earlier today on OSX Yosemite 10.10.4 building Spark 1.4.1
using these
Look at the definitions of the java-specific KafkaUtils.createDirectStream
methods (the ones that take a JavaStreamingContext)
On Mon, Aug 17, 2015 at 5:13 AM, Shushant Arora shushantaror...@gmail.com
wrote:
How to create a ClassTag in Java? Also, the constructor
of DirectKafkaInputDStream takes
Hi All,
Thank you very much for the detailed explanation.
I have a scenario like this:
I have an RDD of ticket records and another RDD of booking records. For each
ticket record, I need to check whether any link exists in the booking table.
val ticketCachedRdd = ticketRdd.cache
ticketRdd.foreach{
I have been using GraphX in production on 1.3 and 1.4 with no issues.
What's the exception you see, and what are you trying to do?
On Aug 17, 2015 10:49 AM, dizzy5112 dave.zee...@gmail.com wrote:
Hi, using Spark 1.3 and trying some sample code:
when I run:
all works well, but with
it falls
This statement is from Spark's website itself.
Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)
www.KnowBigData.com
Phone: +1-253-397-1945 (Office)
https://linkedin.com/company/knowbigdata
Hi All,
I have an issue when I try to serialize a MatrixFactorizationModel object
as a Java object in a Java application. When I deserialize the object, I
get the following exception:
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.OneToOneDependency cannot be found by
The error could be because of the missing brackets after the word cache:
ticketRdd.cache()
On Aug 17, 2015, at 7:26 AM, Priya Ch learnings.chitt...@gmail.com wrote:
Hi All,
Thank you very much for the detailed explanation.
I have scenario like this-
I have rdd of ticket records
Hi,
I can't access
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf.
Could someone help try to see if it is available and reply with it? Thanks!
I got 404 when trying to access the link.
On Aug 17, 2015, at 5:31 AM, Todd bit1...@163.com wrote:
Hi,
I can't access
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf.
Could someone help try to see if it is available and reply with it? Thanks!
an extra “,” is at the end
--
Nan Zhu
http://codingcat.me
On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote:
I got 404 when trying to access the link.
On Aug 17, 2015, at 5:31 AM, Todd bit1...@163.com
wrote:
Hi,
I can't access
Hi,
I'm running Spark on Amazon EMR (Spark 1.4.1, Hadoop 2.6.0). I'm seeing
the exception below when encountering file names that contain colons. Any
idea how to get around this?
scala> val files = sc.textFile("s3a://redactedbucketname/*")
2015-08-18 04:38:34,567 INFO [main]
Thanks Nan.
That is why I always put an extra space between a URL and punctuation in my
comments/emails.
On Mon, Aug 17, 2015 at 6:31 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
an extra “,” is at the end
--
Nan Zhu
http://codingcat.me
On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote:
Try doing a count on both lookups to force the caching to occur before the join.
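For example, a sketch (table and column names are placeholders):

// materialize the cached lookup before the join
val lkup = sqlContext.table("table2").cache()
lkup.count() // the action forces the cache to be populated

val joined = bigDf.join(lkup, bigDf("id") === lkup("id"), "left_outer")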
On 8/17/15, 12:39 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote:
Thanks for your help.
I tried to cache the lookup tables and left outer join with the big table (DF).
The join does not seem to be using
With the right FTP client JAR on your classpath (I forget which), you can use
ftp:// as a source for a Hadoop FS operation. You may even be able to use it
as an input for some Spark (non-streaming) jobs directly.
On 14 Aug 2015, at 14:11, Varadhan, Jawahar
Looks like it is because of SPARK-5063:
RDD transformations and actions can only be invoked by the driver, not
inside of other transformations; for example, rdd1.map(x =>
rdd2.values.count() * x) is invalid because the values transformation and
count action cannot be performed inside of the rdd1.map transformation.
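The usual fix, as a sketch, is to run the inner action on the driver first and capture only the resulting local value in the closure:

val factor = rdd2.values.count()       // action runs on the driver
val scaled = rdd1.map(x => factor * x) // only the local value is shipped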
This will also depend on the file format you are using.
A word of advice: you would be much better off with the s3a file system.
As I found out recently the hard way, s3n has some issues with reading
through entire files even when looking for headers.
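If you do switch, a sketch of pointing a job at s3a (the credential values are placeholders; the keys are the standard Hadoop 2.6+ s3a properties):

sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
val data = sc.textFile("s3a://bucket/folder/")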
On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das
val numStreams = 4
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
In Java, in a for loop, you would create four streams using
KafkaUtils.createStream() so that each receiver runs in a different
thread.
For more information please visit
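A sketch of the usual follow-up step, unioning the receiver streams into one DStream (assuming the kafkaStreams collection above and a StreamingContext ssc):

val unifiedStream = ssc.union(kafkaStreams)
unifiedStream.print() // all four receivers feed a single downstream pipeline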
*Firstly, so sorry for my poor English.*
I was reading the source code of Apache Spark 1.4.1 and I really got stuck
at the logic of the RangePartitioner.rangeBounds method. The code is shown
below.
So can anyone please explain to me:
1. What is the 3.0 * for in the code line val
Hi all,
when running the Spark cluster in standalone mode I am able to create the
Spark context from Java via the following code snippet:
SparkConf conf = new SparkConf()
    .setAppName("MySparkApp")
    .setMaster("spark://SPARK_MASTER:7077")
    .setJars(jars);
JavaSparkContext sc = new
Did you resolve this issue?
Hi,
I'm wondering how to achieve, say, a Monte Carlo simulation in SparkR
without using the low-level RDD functions that were made private in 1.4, such
as parallelize and map. Something like
parallelize(sc, 1:1000).map(
  ### R code that does my computation
)
where the code is the same on every
I have an RDD queried from a scan of a data source. Sometimes the RDD has
rows and at other times it has none. I would like to register this RDD as
a temporary table in a SQL context. I suspect this will work in Scala, but
in PySpark some code assumes that the RDD has rows in it, which are used
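One way around this on the Scala side, as a sketch (assuming an RDD[Row] named rowRdd that may be empty), is to supply the schema explicitly so nothing has to be inferred from the rows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
val df = sqlContext.createDataFrame(rowRdd, schema) // works even when rowRdd is empty
df.registerTempTable("my_table")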
You were building against 1.4.x, right?
In the master branch, switch-to-scala-2.11.sh is gone. There is a scala-2.11
profile.
FYI
On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com wrote:
I am building spark with the following options - most notably the
**scala-2.11**:
.
In 1.4 it is change-scala-version.sh 2.11
But the problem was it is a -Dscala-2.11, not a -P. I misread the docs.
2015-08-17 14:17 GMT-07:00 Ted Yu yuzhih...@gmail.com:
You were building against 1.4.x, right?
In the master branch, switch-to-scala-2.11.sh is gone. There is a scala-2.11
I am comparing the Spark log line by line between the hanging case (big
dataset) and the non-hanging case (small dataset).
In the hanging case, Spark's log looks identical to the non-hanging case for
reading the first block of data from HDFS.
But after that, starting from line 438 in the
Howdy folks!
I’m interested in hearing about what people think of spark-ec2
(http://spark.apache.org/docs/latest/ec2-scripts.html) outside of the
formal JIRA process. Your answers will all be anonymous and public.
If the embedded form below doesn’t work for you, you can use this link to
get the
Hi, I have around 2000 Hive source partitions to process, inserting data into
the same table but different partitions. For example, I have the following query:
hiveContext.sql("insert into table myTable
partition(mypartition=somepartition) bla bla")
If I call the above query in the Spark driver program, it runs fine
Thanks for your help.
I tried to cache the lookup tables and left outer join with the big table (DF).
The join does not seem to be using a broadcast join; it still goes with a hash
partition join, shuffling the big table. Here is the scenario:
…
table1 as big_df
left outer join
table2 as lkup
on
Hi Nick,
I forgot to mention in the survey that Ganglia is never installed properly,
for some reason.
I get this exception every time I launch the cluster:
Starting httpd: httpd: Syntax error on line 154 of
/etc/httpd/conf/httpd.conf: Cannot load
/etc/httpd/modules/mod_authz_core.so into