Hi Pro,
I want to merge elements in a Spark RDD when two elements satisfy a certain
condition.
Suppose there is an RDD[Seq[Int]], where some of the Seq[Int] in this RDD contain
overlapping elements. The task is to merge all overlapping Seq[Int] in this
RDD and store the result into a new RDD.
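A hedged driver-side sketch, assuming the distinct sequences fit in driver memory (at scale this becomes a connected-components problem, e.g. via GraphX):

def mergeOverlapping(sets: Seq[Set[Int]]): List[Set[Int]] =
  sets.foldLeft(List.empty[Set[Int]]) { (acc, s) =>
    val (overlapping, rest) = acc.partition(_.exists(s.contains)) // sets sharing an element with s
    overlapping.foldLeft(s)(_ ++ _) :: rest                       // union them all into one set
  }
val merged = sc.parallelize(mergeOverlapping(rdd.map(_.toSet).collect()))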
For
The workaround was to wrap the map returned by the Spark libraries in a HashMap
and then broadcast it.
Could anyone please let me know whether there is an open issue for this?
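A hedged sketch of the workaround, assuming the problematic value is the java.util.Map wrapper returned through the Java API:

// copy the non-serializable wrapper into a plain, serializable HashMap
val safeMap = new java.util.HashMap[String, java.lang.Long](wrappedMap) // wrappedMap: the map Spark returned
val bc = sc.broadcast(safeMap)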
Hello spark users!
I found lots of strange messages in driver log. Here it is:
2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
ERROR
SerializableMapWrapper was added in
https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new
JIRA and linking it to that one?
On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar lok...@dataken.net wrote:
The workaround was to wrap the map returned by spark libraries into HashMap
Hi Vikas and Simone,
thanks for the replies.
Yeah I understand this would be easier with 1.2 but this is completely out
of my control. I really have to work with 1.0.0.
About Simone's approach, during the imports I get:
import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat,
Hi Michael,
About this new data source API, what type of data sources would it support?
Does it have to be RDBMS necessarily?
Cheers
On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust mich...@databricks.com
wrote:
You probably don't need to create a new kind of SchemaRDD. Instead I'd
Hi Judy,
Thank you for your response.
When I try to compile using Maven (mvn -Dhadoop.version=1.2.1 -DskipTests
clean package) I get an error: "Error: Could not find or load main class ."
I have maven 3.0.4.
And when I run command sbt package I get the same exception as earlier.
I have done the
I am using Cassandra-Spark connector to pull data from Cassandra, process
it and write it back to Cassandra.
Now I am getting the following exception, apparently from Kryo
serialisation. Does anyone know what the reason is and how it can be solved?
I also tried to register
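For reference, a hedged sketch of how Kryo registration is usually wired up (the registered class is a placeholder):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyCassandraRow]) // hypothetical class to register
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")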
Hello,
I have two weird effects when working with spark-shell:
1. This code executed in spark-shell causes the exception below, yet it
works perfectly when submitted with spark-submit:
import org.apache.hadoop.hbase.{HConstants, HBaseConfiguration}
import
Hi,
I am integrating Kafka and Spark using spark-streaming. I have created a topic
as a Kafka producer:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 1 --partitions 1 --topic test
I am publishing messages in kafka and trying to read them using
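A hedged sketch of the consuming side using the receiver-based API (group id and batch interval are placeholders; requires the spark-streaming-kafka artifact):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
val messages = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("test" -> 1))
messages.map(_._2).print() // print just the message payloads
ssc.start()
ssc.awaitTermination()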
It says:
14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory
A quick guess would be that you are giving the wrong master URL
(spark://192.168.88.130:7077). Open the
Hi,
The spark master is working, and I have given the same url in the code:
[inline image]
The warning is gone, and the new log is:
---
Time: 141742785 ms
---
INFO
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra
Connector version 1.1.0 final fixed the Guava back-compat issue. Maybe taking
out the Guava exclusions might help?
Date: Mon, 1 Dec 2014 10:48:25 +0100
Subject: Kryo exception for CassandraSQLRow
From: shahab.mok...@gmail.com
I see you have no worker machines to execute the job
You haven't configured your spark cluster properly.
Quick fix to get it running would be run it on local mode, for that change
this line
JavaStreamingContext jssc = new JavaStreamingContext(spark://
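In Scala terms, a hedged sketch of the equivalent local-mode change (app name and batch interval are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext("local[2]", "MyApp", Seconds(1)) // "local[2]": two local threads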
Don't set `spark.akka.frameSize` to 1. The max value of
`spark.akka.frameSize` is 2047. The unit is MB.
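For example (the value is illustrative):

spark-shell --conf spark.akka.frameSize=128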
Best Regards,
Shixiong Zhu
2014-12-01 0:51 GMT+08:00 Yanbo yanboha...@gmail.com:
Try to use spark-shell --conf spark.akka.frameSize=1
On Dec 1, 2014, at 12:25 AM, Brian Dolan
Hi all,
I need some advice on whether Spark is the right tool for my zoo. My requirements
share commonalities with "big data", workflow coordination, and "reactive" event-driven
data processing (as in, for example, Haskell Arrows), which doesn't make
it any easier to decide on a tool set.
NB: I
Not quite sure which kind of geo processing you're doing: is it raster or vector? More
info would be appreciated so I can help you further.
Meanwhile I can try to give some hints; for instance, have you considered
GeoMesa (http://www.geomesa.org/2014/08/05/spark/)?
Since you need a WMS (or alike), did you
I have an RDD that serves as a feature look-up table downstream in my
analysis. I create it using zipWithIndex(), and because I suppose that
the elements of the RDD could end up in a different order if it is
regenerated at any point, I cache it to try and ensure that the (feature --
index)
Thank you for mentioning GeoTrellis. I hadn't heard of it before. We have
many custom tools and steps; I'll check whether our tools fit in. The end result
is actually a 3D map for native OpenGL-based rendering on iOS / Android [1].
I’m using GeoPackage which is basically SQLite with R-Tree and
… Sorry, I forgot to mention why I'm basically bound to SQLite. The workflow
involves more data processing than I mentioned. There are several tools in the
chain which either rely on SQLite as an exchange format, or processing steps like data
cleaning that are done orders of magnitude faster / or
Hi, I have a problem; it is easy in Scala code, but I cannot take the top
N from an RDD as an RDD.
Given student score records, the task is to take the top 10 ages, and then take the top 10
records from each age, so the result is 100 records.
The Scala code is here, but how can I do it on an RDD, for RDD.take returns
an Array,
Indeed. However, I guess the important load and stress is in the processing
of the 3D data (DEM or alike) into geometries/shades/whatever.
Hence you can use spark (geotrellis can be tricky for 3D, poke @lossyrob
for more info) to perform these operations then keep an RDD of only the
resulting
I think the robust thing to do is sort the RDD, and then zipWithIndex.
Even if the RDD is recomputed, the ordering and thus assignment of IDs
should be the same.
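A hedged sketch (sortBy is available from Spark 1.1; the sort key is a placeholder):

val indexed = rdd.sortBy(x => x).zipWithIndex() // stable (value, index) pairs across recomputation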
On Mon, Dec 1, 2014 at 2:36 PM, rok rokros...@gmail.com wrote:
I have an RDD that serves as a feature look-up table downstream in my
Hi,
My incoming messages have a timestamp as one field, and I have to perform
aggregation over 3-minute time slices.
Message sample
Item ID Item Type timeStamp
1 X 1-12-2014:12:01
1 X 1-12-2014:12:02
1 X
I've been trying to create a Spark cluster on EC2 using the
documentation at https://spark.apache.org/docs/latest/ec2-scripts.html
(with Spark 1.1.1).
Running the script successfully creates some EC2 instances, HDFS etc.,
but it appears to fail to copy across the actual files needed to run
Spark.
I
true though I was hoping to avoid having to sort... maybe there's no way
around it. Thanks!
BTW, the same error as above also happens on 1.1.0 (just tested).
For converting an Array or any List to an RDD, we can try using:
sc.parallelize(groupedScore)//or whatever the name of the list
variable is
On Mon, Dec 1, 2014 at 8:14 PM, Xuefeng Wu ben...@gmail.com wrote:
Hi, I have a problem, it is easy in Scala code, but I can not take the top
N
Or setting the HADOOP_CONF_DIR property. Either way, you must have the YARN
configuration available to the submitting application to allow the use of
"yarn-client" or "yarn-cluster".
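For example (paths hypothetical):

export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit --master yarn-client ...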
The attached stack trace below doesn’t provide any information as to why the
job failed.
mn
On Nov 27,
Hi,
You can associate all the messages of a 3min interval with a unique key and
then group by and finally add up.
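A hedged sketch of that idea, assuming records: RDD[(String, String, Long)] of (itemId, itemType, epochMillis):

import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

val counts = records
  .map { case (id, typ, ts) => ((id, typ, ts / (3 * 60 * 1000)), 1L) } // 3-minute bucket key
  .reduceByKey(_ + _)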
Thanks
On Dec 1, 2014 9:02 PM, pankaj pankaje...@gmail.com wrote:
Hi,
My incoming message has time stamp as one field and i have to perform
aggregation over 3 minute of time
Thanks for your reply, but I'm still running into issues
installing/configuring the native libraries for MLlib. Here are the steps
I've taken, please let me know if anything is incorrect.
- Download Spark source
- unzip and compile using `mvn -Pnetlib-lgpl -DskipTests clean package `
- Run
Hi,
Suppose I keep a batch size of 3 minutes. In one batch there can be incoming
records with any timestamp, so it is difficult to keep track of when the
3-minute interval starts and ends. I am doing the output operation on worker
nodes in foreachPartition, not in the driver (foreachRDD), so I cannot use
scala> textFile.count()
java.lang.VerifyError: class
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CompleteRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
I tried ./make-distribution.sh -Dhadoop.version=2.5.0 and
rdd.top collects it on the master...
If you want top-k for a key, run map/mapPartitions, use a bounded
priority queue, and reduceByKey the queues.
I experimented with top-k from Algebird and a bounded priority queue wrapped
over java.util.PriorityQueue (Spark's default)... the BPQ is faster.
Code example is here:
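The linked example is elided above. Purely as an illustration of the idea, a bounded top-k per key can be sketched with aggregateByKey (Spark 1.1+), assuming scores: RDD[(Int, Int)] of (age, score) and k = 10:

import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

val topPerAge = scores.aggregateByKey(List.empty[Int])(
  (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(10), // fold one score into a bounded list
  (a, b)   => (a ++ b).sorted(Ordering[Int].reverse).take(10)    // merge two bounded lists
)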
Have you checked out the wiki here?
http://spark.apache.org/docs/latest/building-with-maven.html
A couple things I did differently from you:
1) I got the bits directly from github (https://github.com/apache/spark/). Use
branch 1.1 for spark 1.1
2) Execute the Maven command in cmd (PowerShell misses
Hi,
I am using the openNLP NER (Token Name Finder) for parsing unstructured
data. In order to speed up my process (to quickly train models and analyze
the documents from the models), I want to use Spark, and I saw on the web
that it is possible to connect openNLP with Spark using UIMAFit, but I
No, it should support any data source that has a schema and can produce
rows.
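As a hedged illustration, a minimal relation against the data sources API as it later stabilized (Spark 1.3-era package names; the class and its contents are hypothetical):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class MyRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  // any source works as long as it can state a schema...
  override def schema: StructType = StructType(Seq(StructField("value", StringType)))
  // ...and produce rows
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
}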
On Mon, Dec 1, 2014 at 1:34 AM, Niranda Perera nira...@wso2.com wrote:
Hi Michael,
About this new data source API, what type of data sources would it
support? Does it have to be RDBMS necessarily?
Cheers
On
Hi everyone,
I'm interested in empirically measuring how much faster Spark works in comparison to
Hadoop for certain problems and the input corpus I currently work with (I've read
Matei Zaharia's "Resilient Distributed Datasets: A Fault-Tolerant Abstraction
for In-Memory Cluster Computing" paper and I
Thank you very much for your reply.
I have a cluster of 8 nodes: m1, m2, m3 .. m8. m1 is configured as the Spark master
node; the rest of the nodes are all worker nodes. I also configured m3 as the
History Server, but the history server fails to start. I ran FlumeEventCount on
m1 using the right
Hi,
Is there a way to REMOVE a jar (added via addJar) from a Spark context? We have
a long-running context used by the spark-jobserver, but after trying to
update versions of classes already in the classpath via addJars, the
context still runs the old versions. It would be helpful if I could remove
Thanks Yanbo! That works!
The only issue is that it won't print the predicted value for lp.features
from the code line below.
model.predictOnValues(testData.map(lp = (lp.label, lp.features))).print()
It prints the test input data correctly, but it keeps on printing “0.0” as the
predicted
Hi Gurus,
I did not look at the code yet. I wonder if StreamingLinearRegressionWithSGD
http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.html
is equivalent to
LinearRegressionWithSGD
Environment: Spark 1.1, 4 Node Spark and Hadoop Dev cluster - 6 cores, 32 GB
Ram each. Default serialization, Standalone, no security
Data was sqooped from relational DB to HDFS and Data is partitioned across
HDFS uniformly. I am reading a fact table about 8 GB in size and one small
dim table
Try
(hdfs:///localhost:8020/user/data/*)
with three slashes.
Thx
tri
-Original Message-
From: Benjamin Cuthbert [mailto:cuthbert@gmail.com]
Sent: Monday, December 01, 2014 4:41 PM
To: user@spark.apache.org
Subject: hdfs streaming context
All,
Is it possible to stream on HDFS directory
Have you tried just passing a path to ssc.textFileStream() ? It
monitors the path for new files by looking at mtime/atime ; all
new/touched files in the time window appear as an rdd in the dstream.
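A hedged sketch (path and batch interval are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))
val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")
lines.count().print() // counts the records in each batch of newly appearing files
ssc.start()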
On 1 December 2014 at 14:41, Benjamin Cuthbert cuthbert@gmail.com wrote:
All,
Is it possible
Yes, in fact, that's the only way it works. You need
hdfs://localhost:8020/user/data, I believe.
(No it's not correct to write hdfs:///...)
On Mon, Dec 1, 2014 at 10:41 PM, Benjamin Cuthbert
cuthbert@gmail.com wrote:
All,
Is it possible to stream on HDFS directory and listen for multiple
Thanks Sean,
That worked just removing the /* and leaving it as /user/data
Seems to be streaming in.
On 1 Dec 2014, at 22:50, Sean Owen so...@cloudera.com wrote:
Yes, in fact, that's the only way it works. You need
hdfs://localhost:8020/user/data, I believe.
(No it's not correct to
For the streaming example I am working on, it's accepted (hdfs:///user/data)
without the localhost info.
Let me dig through my hdfs config.
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, December 01, 2014 4:50 PM
To: Benjamin Cuthbert
Cc:
Yes, but you can't follow three slashes with host:port. No host
probably defaults to whatever is found in your HDFS config.
On Mon, Dec 1, 2014 at 11:02 PM, Bui, Tri tri@verizonwireless.com wrote:
For the streaming example I am working on, Its accepted (hdfs:///user/data)
without the
Yep. No localhost.
Usually, I use hdfs:///user/data to indicate I want HDFS, or file:///user/data
to indicate a local file directory.
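For example:

ssc.textFileStream("hdfs:///user/data") // HDFS, host:port taken from the config
ssc.textFileStream("file:///user/data") // local filesystem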
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, December 01, 2014 5:06 PM
To: Bui, Tri
Cc: Benjamin Cuthbert;
Hi Debasish,
I found test code in map translate;
would it collect all products too?
+ val sortedProducts = products.toArray.sorted(ord.reverse)
Yours respectfully, Xuefeng Wu (吴雪峰)
On Dec 2, 2014, at 1:33 AM, Debasish Das debasish.da...@gmail.com wrote:
rdd.top collects it on master...
If you want
Hi, how do I pass Java options (such as -XX:MaxMetaspaceSize=100M) when
launching the AM or task containers?
This is related to running Spark on YARN (Hadoop 2.3.0). In the MapReduce case,
setting a property such as
mapreduce.map.java.opts would do the job.
Any help would be highly appreciated.
"file = transform file into a bunch of records"
What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf the terasort example), but if
your example just loads each file locally, then this may cause problems.
Instead, you should load each file
Actually, I'm working with a binary format. The API allows reading out a
single record at a time, but I'm not sure how to get those records into
spark (without reading everything into memory from a single file at once).
On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg andy.tw...@gmail.com wrote:
Hi,
have a look at the documentation for spark.driver.extraJavaOptions (which
seems to have disappeared since I looked it up last week)
and spark.executor.extraJavaOptions at
http://spark.apache.org/docs/latest/configuration.html#runtime-environment.
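For example, a hedged spark-submit invocation (flags illustrative; MaxMetaspaceSize assumes a Java 8 JVM):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:MaxMetaspaceSize=100M" \
  --conf "spark.executor.extraJavaOptions=-XX:MaxMetaspaceSize=100M" \
  ...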
Tobias
Thanks Tobias for the answer. Does it work for the driver as well?
Regards, Mohammad
On Monday, December 1, 2014 5:30 PM, Tobias Pfeiffer t...@preferred.jp
wrote:
Hi,
have a look at the documentation for spark.driver.extraJavaOptions (which seems
to have disappeared since I looked it up
Please check whether
https://github.com/apache/spark/pull/3409#issuecomment-64045677 solves the
problem for launching the AM.
Thanks.
Zhan Zhang
On Dec 1, 2014, at 4:49 PM, Mohammad Islam misla...@yahoo.com.INVALID wrote:
Hi,
How to pass the Java options (such as -XX:MaxMetaspaceSize=100M) when
Could you modify your function so that it streams through the files record
by record and outputs them to hdfs, then read them all in as RDDs and take
the union? That would only use bounded memory.
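A hedged sketch of that shape; convertToHdfs and outputPathFor are hypothetical helpers you would supply:

val rdds = inputFiles.map { f =>
  convertToHdfs(f)              // streams the binary file out to HDFS, record by record
  sc.textFile(outputPathFor(f)) // reads the converted records back as an RDD
}
val all = sc.union(rdds)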
On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote:
Actually, I'm working with a
This works as expected in the 1.1 branch:
from pyspark.sql import *
rdd = sc.parallelize([range(0, 10), range(10, 20), range(20, 30)])
# define the schema
schemaString = "value1 value2 value3 value4 value5 value6 value7 value8 value9 value10"
fields = [StructField(field_name, IntegerType(), True) for field_name in schemaString.split()]
Yep, that's definitely possible. It's one of the workarounds I was
considering. I was just curious if there was a simpler (and perhaps more
efficient) approach.
Keith
On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg andy.tw...@gmail.com wrote:
Could you modify your function so that it streams
Yes, the issue appears to be due to the 2GB block size limitation. I am
hence looking for (user, product) block sizing suggestions to work around
the block size limitation.
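As a hedged illustration of increasing the block count (values purely illustrative):

import org.apache.spark.mllib.recommendation.ALS
// more blocks means smaller per-block serialized chunks, helping stay under the 2 GB limit
val model = ALS.train(ratings, 20, 10, 0.01, 200) // rank, iterations, lambda, blocks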
On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote:
(It won't be that, since you see that the error occur
If you are able to use YARN in your hadoop cluster, then the following
technique is pretty straightforward:
http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/
We use this in our system and it's super easy to execute from our Tomcat
application.
You may be able to construct RDDs directly from an iterator (not sure);
you may have to subclass your own RDD.
On 1 December 2014 at 18:40, Keith Simmons ke...@pulse.io wrote:
Yep, that's definitely possible. It's one of the workarounds I was
considering. I was just curious if there was a
applySchema() only accepts an RDD of Row/list/tuple; it does not work with
numpy.array.
After applySchema(), the Python RDD will be pickled and unpickled in the
JVM, so you will not get any benefit from using numpy.array.
It will work if you convert ndarray into list:
schemaRDD =
Hi,
I face the following exception when submitting a Spark application. The log
file shows:
14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener
threw an exception
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
at
What is the application that you are submitting? It looks like you might have
obtained the FileSystem inside the app and then closed it there.
Thanks
Best Regards
On Tue, Dec 2, 2014 at 11:59 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
I face the following exception when submit a spark