Can you key your RDD by some key and use reduceByKey? In fact, if you are
merging a bunch of maps, you can create a set of (k, v) pairs in your
mapPartitions and then reduceByKey using some merge function. The reduce will
happen in parallel on multiple nodes in this case. You'll end up with just a single
Hi Nick,
How does reduce work? I thought that after reducing within each executor, it
would reduce in parallel across the executors instead of pulling everything to
the driver and reducing there.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
Hi,
I am reading data from an HBase table into an RDD and then, using foreach on that
RDD, I am doing some processing on every Result of the HBase table. After this
processing I want to store the processed data back into another HBase table.
How can I do that? If I use standard Hadoop and HBase classes to
Please see sample code attached at
https://issues.apache.org/jira/browse/SPARK-944.
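(Not the SPARK-944 sample itself, but one common approach is to write Put objects
through TableOutputFormat. The sketch below assumes a pair RDD named processedRdd
of (rowKey, value) strings and made-up table/column names.)

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "processed_table")
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

processedRdd.map { case (rowKey, value) =>
  val put = new Put(Bytes.toBytes(rowKey))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
  (new ImmutableBytesWritable, put)
}.saveAsNewAPIHadoopDataset(job.getConfiguration)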
Is it possible to use Shark over streaming data?
I did not find any mention of that on the website. When you run Shark it
gives you a shell to run your queries over stored data. Is there any way to
do the same over streaming data?
--
Thanks
And if you want to use the SQL CLI (based on Catalyst) as it works in Shark,
you can also check out https://github.com/amplab/shark/pull/337 :)
This preview version doesn't require Hive to be set up in the cluster.
(Don't forget to also put the hive-site.xml under SHARK_HOME/conf.)
Cheng Hao
Thanks for the hint.
I removed the signature info from that jar and the JVM is happy now.
But a problem remains: several copies of the same jar in different versions is not good.
Spark itself is very, very promising; I am very excited.
Thank you all
toivo
I am running a Spark Streaming job to count the top 10 hashtags over the last
5-minute window, querying every 1 sec.
It takes approximately 1.4 sec (end-to-end delay) to answer most of the queries,
but there are a few instances in between when it takes considerably more
time (around 15 sec) due to
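(For reference, the kind of windowed query described above might look roughly like
the sketch below; the stream name and parameters are assumptions, not the actual job.)

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint("checkpoint-dir")   // required for the incremental (inverse-reduce) form

val counts = hashTags.map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b,
    Minutes(5), Seconds(1))

counts.foreachRDD { rdd =>
  val top10 = rdd.map(_.swap).sortByKey(ascending = false).take(10)
  println(top10.mkString(", "))
}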
Hi Nilmish,
I have the same problem. I am wondering how you measure the latency.
Regards,
Yingjun
You can measure the latency from the logs. Search for phrases like Total delay
in the logs; this denotes the total end-to-end delay for a particular query.
hello guys,
has anybody had experience with the library Augustus as a serializer for
scoring models?
it looks very promising, and I even found a hint of a connection between Augustus
and Spark
all the best
It's worth mentioning that Augustus is a Python-based library. On a
related note, in Java-land, I have had good experiences with jpmml's
projects:
On Tue, Jun 10, 2014 at 7:52 AM, filipus floe...@gmail.com wrote:
hello guys,
has anybody experiances with the library augustus as a serializer
Hello!
Spark Streaming supports HDFS as input source, and also Akka actor
receivers, or TCP socket receivers.
For my use case I think it's probably more convenient to read the data
directly from Actors, because I already need to set up a multi-node Akka
cluster (on the same nodes that Spark runs
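(For context, a receiver actor in the Spark 1.x Akka API might look roughly like
this sketch; the actor and stream names are made up.)

import akka.actor.{Actor, Props}
import org.apache.spark.streaming.receiver.ActorHelper

class LineReceiver extends Actor with ActorHelper {
  def receive = {
    case line: String => store(line)   // hand each incoming message to Spark Streaming
  }
}

val lines = ssc.actorStream[String](Props[LineReceiver], "line-receiver")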
I am getting a strange null pointer exception when trying to list the first
entry of a JavaPairRDD after calling groupByKey on it. Following is my code:
JavaPairRDD<Tuple3<String, String, String>, List<String>> KeyToAppList =
    KeyToApp.distinct().groupByKey();
Hey Nilesh,
Great to hear you're using Spark Streaming. In my opinion the crux of your
question comes down to what you want to do with the data in the future
and/or whether there is utility in using it from more than one Spark/Streaming
job.
1). *One-time-use, fire and forget* - as you rightly point
Hi Nilmish,
What's the data rate/node when you see the high latency? (It seems the
latency keeps increasing.) Do you still see it if you lower the data rate or
the frequency of the windowed query?
How can I measure the data rate per node?
I am feeding the data through the Kafka API. I only know the total inflow data
rate, which remains almost constant. How can I figure out how much data is
distributed to the nodes in my cluster?
The latency does not keep increasing indefinitely. It goes up for
Hi Yingjun,
Do you see a stable latency, or does the latency keep increasing? And could you
provide some details about the input data rate/node, batch interval,
windowDuration and slideDuration when you see the high latency?
Good morning,
I have taken the socketTextStream example and instead of running on a local
Spark instance, I have pushed it to my Spark cluster in AWS (1 master with 5
slave nodes). I am getting the following error that appears to indicate that
all the slaves are trying to read from localhost:
Oh, I mean the average data rate per node.
But in case you want to know the input activity on each node (I use a custom
receiver instead of Kafka), I usually search for records like this in the logs to
get a sense: BlockManagerInfo: Added input ... on [hostname:port] (size: xxx KB)
I also see some spikes in
You can use the master's IP address (or whichever machine you chose to run
the nc command on) instead of localhost.
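(For example, a sketch with a placeholder hostname; every receiver then connects to
the box that is actually running nc -lk 9999:)

val lines = ssc.socketTextStream("ec2-master.example.com", 9999)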
Worked! Thanks so much!
Fred
Fred Wolfinger
Research Staff Member, CyberPoint Labs
direct +1 410 779 6741
mobile +1 443 655 3322
CyberPoint International
621 East Pratt Street, Suite 300
Baltimore MD 21202-3140
phone +1 410 779 6700
www.cyberpointllc.com
I should point out that if you don't want to take a polyglot approach to
languages and want to reside solely on the JVM, then you can just use plain old
Java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala and load them up in another process and call the relevant
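(A minimal sketch of that idea; the file name is made up, and a simple
weights-plus-intercept model such as LogisticRegressionModel is assumed.)

import java.io._
import org.apache.spark.mllib.classification.LogisticRegressionModel

// in the training process
val out = new ObjectOutputStream(new FileOutputStream("model.bin"))
out.writeObject(model)
out.close()

// in another JVM process (no SparkContext needed to score single points)
val in = new ObjectInputStream(new FileInputStream("model.bin"))
val loaded = in.readObject().asInstanceOf[LogisticRegressionModel]
val score = loaded.predict(features)   // features: org.apache.spark.mllib.linalg.Vector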
Hi all,
I have built Shark-0.9.1 using sbt with the below command:
SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly
My Hadoop cluster also has version 2.0.0-mr1-cdh4.6.0.
But when I try to execute the below command from the Spark shell, which reads a
file from HDFS, I get the IPC
Hi Qi Ping,
You don't have to distribute these files; they are automatically packaged
in the assembly jar, which is already shipped to the worker nodes.
Other people have run into the same issue. See if the instructions here are
of any help:
Event logs are different from writing using a logger like log4j. The event
logs are the kind of data shown in the history server.
For my team, we use com.typesafe.scalalogging.slf4j.Logging. Our logs show
up in /etc/spark/work/app-id/executor-id/stderr and stdout.
All of our logging seems
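(A sketch of that logger setup, assuming scalalogging-slf4j 1.x on the classpath;
the class name is made up.)

import com.typesafe.scalalogging.slf4j.Logging

class PartitionWorker extends Logging with Serializable {
  def process(record: String): Unit = {
    logger.info(s"processing record $record")   // ends up in the executor's stderr/stdout files
  }
}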
Question on the input and output for ALS.train() and
MatrixFactorizationModel.predict().
My input is a list of Ratings(user_id, product_id, rating) and my ratings are
on a scale of 1-5 (inclusive). When I compute predictions over the
superset of all (user_id, product_id) pairs, the ratings
Hi,
I have the same problem when running a Kafka to Spark Streaming pipeline from
Java with explicitly specified message decoders. I had thought that it was
related to the Eclipse environment, as suggested here, but that's not the case. I
have coded an example based on class:
I had this same problem as well. I ended up just adding the necessary code
in KafkaUtil and compiling my own spark jar. Something like this for the
raw stream:
def createRawStream(
jssc: JavaStreamingContext,
kafkaParams: JMap[String, String],
topics: JMap[String, JInt]
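(For reference, a sketch of the Scala call such a helper would wrap in the Spark
1.0 streaming-kafka API; kafkaParams and topics are assumed to be a
Map[String, String] and a Map[String, Int] of topic -> thread count.)

import kafka.serializer.DefaultDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val raw = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER_2)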
For trainImplicit(), the output is an approximation of a matrix of 0s
and 1s, so the values are generally (not always) in [0,1].
But for train(), you should be predicting the original input matrix
as-is, as I understand it. You should get output in about the same range
as the input, but again not
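(A sketch contrasting the two; the rank/iterations/lambda/alpha values are only
placeholders, and data is assumed to be an RDD of (user, product, rating) triples.)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = data.map { case (user, product, stars) => Rating(user, product, stars) }

// explicit feedback: predictions live roughly on the original 1-5 scale (not clamped to it)
val explicitModel = ALS.train(ratings, 20, 10, 0.01)

// implicit feedback: predictions approximate a 0/1 preference matrix
val implicitModel = ALS.trainImplicit(ratings, 20, 10, 0.01, 1.0)

val predicted = explicitModel.predict(1, 42)   // predicted rating for user 1, product 42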
I added https://issues.apache.org/jira/browse/SPARK-2103 to track
this. I also ran into it. I don't have a fix, but, somehow I think
someone with more understanding of Scala and Manifest objects might
see the easy fix.
On Tue, Jun 10, 2014 at 5:15 PM, mpieck mpi...@gazeta.pl wrote:
Hi,
I have
The executors showing CANNOT FIND ADDRESS are not listed in the Executors tab
at the top of the Spark UI.
I am using Spark 1.0.0 compiled with Hadoop 1.2.1.
I have a toy spark-streaming-kafka program. It reads from a kafka queue and
does
stream
.map { case (k, v) => (v, 1) }
.reduceByKey(_ + _)
.print()
using a 1 second interval on the stream.
The docs say to make Spark and
After doing a groupBy operation, I have the following result:
val res =
(ID1,ArrayBuffer((145804601,ID1,japan)))
(ID3,ArrayBuffer((145865080,ID3,canada),
(145899640,ID3,china)))
(ID2,ArrayBuffer((145752760,ID2,usa),
(145934200,ID2,usa)))
Now I need to output for each group,
We're running into an issue where periodically the master loses connectivity
with workers in the spark cluster. We believe this issue tends to manifest
when the cluster is under heavy load, but we're not entirely sure when it
happens. I've seen one or two other messages to this list about this
I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
do not seem to be able to start the history server on the master node. I
used the following command:
./start-history-server.sh /root/spark_log
The error message says that the logging directory /root/spark_log does not
What's the permission on /root itself?
On Jun 10, 2014 6:29 PM, zhen z...@latrobe.edu.au wrote:
I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
do not seem to be able to start the history server on the master node. I
used the following command:
My output is a set of tuples and when I output it using saveAsTextFile, my
file looks as follows:
(field1_tup1, field2_tup1, field3_tup1,...)
(field1_tup2, field2_tup2, field3_tup2,...)
In Spark, is there some way I can simply have it output in CSV format as
follows (i.e. without the
Thanks for the clarification.
What is the proper way to configure RDDs when your aggregate data size
exceeds your available working memory size? In particular, in addition to
typical operations, I'm performing cogroups, joins, and coalesces/shuffles.
I see that the default storage level for
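(One common approach, as a sketch rather than a universal answer: persist the large
intermediate RDDs serialized with spill-to-disk instead of the default MEMORY_ONLY.
The RDD names are made up.)

import org.apache.spark.storage.StorageLevel

val joined = left.cogroup(right).persist(StorageLevel.MEMORY_AND_DISK_SER)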
We are seeing this issue as well.
We run on YARN and see logs about lost executors. It looks like some stages had
to be re-run to compute the RDD partitions lost with the executor.
We were able to complete 20 iterations with a 20% full matrix but not beyond
that (100 GB total).
On Tue, Jun 10, 2014 at 8:32
you can just use something like this:
myRdd.map(_.productIterator.mkString(",")).saveAsTextFile(...)
On Tue, Jun 10, 2014 at 6:34 PM, SK skrishna...@gmail.com wrote:
My output is a set of tuples and when I output it using saveAsTextFile, my
file looks as follows:
(field1_tup1, field2_tup1,
It would be better to add one more transformation step before saveAsTextFile,
like:
rdd.map(tuple => "%s,%s,%s".format(tuple._1, tuple._2,
tuple._3)).saveAsTextFile(...)
This manually converts to the format you want before writing to HDFS.
Thanks
Jerry
-Original Message-
From: SK
Thanks. Now I know how to broadcast the dataset, but I still wonder, after
broadcasting the dataset, how I can apply my algorithm to train the model
on the workers. To describe my question in detail, the following code is used
to train an LDA (Latent Dirichlet Allocation) model with JGibbLDA on a single
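(A rough sketch of using the broadcast inside worker-side code; trainLda and
paramGrid are hypothetical placeholders, not JGibbLDA's actual API.)

val corpusBc = sc.broadcast(corpus)

val models = sc.parallelize(paramGrid).map { params =>
  trainLda(corpusBc.value, params)   // each worker reads its local copy of the broadcast
}.collect()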
I checked the permission on root and it is the following:
drwxr-xr-x 20 root root 4096 Jun 11 01:05 root
So anyway, I changed to use /tmp/spark_log instead and this time I made sure
that all permissions are given to /tmp and /tmp/spark_log like below. But it
still does not work:
drwxrwxrwt 8
Can you try file:/root/spark_log?
2014-06-10 19:22 GMT-07:00 zhen z...@latrobe.edu.au:
I checked the permission on root and it is the following:
drwxr-xr-x 20 root root 4096 Jun 11 01:05 root
So anyway, I changed to use /tmp/spark_log instead and this time I made
sure
that all
Someone suggested that I use Mahout, but I'm not familiar with it, and in that
case using Mahout would add difficulties to my program. I'd like to run the
algorithm in Spark. I'm a beginner; can you give me some suggestions?
Hi,
What I (seem to) know about the RDD persistence API is as follows:
- cache() and persist() are not actions. They only mark the RDD.
- unpersist() is also not an action. It only removes the marking. But if the
RDD is already in memory, it is unloaded.
And there seems to be no API to forcefully
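(A small sketch of the marking vs. materialization behavior described above;
the file name is made up.)

val rdd = sc.textFile("data.txt").map(_.length)
rdd.cache()       // only marks the RDD for caching; nothing is computed yet
rdd.count()       // the first action actually materializes and caches the partitions
rdd.unpersist()   // removes the marking and drops any blocks already cached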
Thanks Sean. I realized that I was supplying train() with a very low rank
so I will retry with something higher and then play with lambda as-needed.
On Tue, Jun 10, 2014 at 4:58 PM, Sean Owen so...@cloudera.com wrote:
For trainImplicit(), the output is an approximation of a matrix of 0s
and
BTW, it is possible that rdd.first() does not compute all of the partitions.
So first() cannot be used for the situation below.
-Original Message-
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Wednesday, June 11, 2014 11:40 AM
To: user@spark.apache.org
Have you considered the garbage collection impact and whether it coincides with
your latency spikes? You can enable GC logging by changing the Spark
configuration for your job.
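(For example, something along these lines; the exact flag set is just an
illustration.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")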
Hi, as I searched for the keyword Total delay in the console log, the delay
keeps increasing. I am not sure what this total delay
No, I meant pass the path to the history server start script.
2014-06-10 19:33 GMT-07:00 zhen z...@latrobe.edu.au:
Sure here it is:
drwxrwxrwx 2 1000 root 4096 Jun 11 01:05 spark_logs
Zhen
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and
is overriding the application’s settings. Take a look in there and delete that
line if possible.
Matei
On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka aliaksei.lito...@gmail.com
wrote:
I am testing my application in
Yep, it gives tons of errors. I was able to make it work with sudo. Looks
like an ownership issue.
Cheers
k/
On Tue, Jun 10, 2014 at 6:29 PM, zhen z...@latrobe.edu.au wrote:
I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
do not seem to be able to start the history