Hi,
I am testing Spark Streaming with Kafka. The application looks like this:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import net.sf.json.JSONObject
/**
* Created by
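The rest of the snippet was cut off; a minimal Scala sketch of how a direct
Kafka stream is typically built from these imports in Spark 1.4 (the broker
address and topic name are placeholders):

// Minimal sketch; "broker1:9092" and "mytopic" are placeholders.
val sparkConf = new SparkConf().setAppName("KafkaTest")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))
stream.map { case (_, json) => JSONObject.fromObject(json) }.print()
ssc.start()
ssc.awaitTermination()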
It is caused by a dependency conflict: spark-core 1.4 needs
javax.servlet-api 3.0.1. It can be resolved by adding the dependency
"javax.servlet" % "javax.servlet-api" % "3.0.1" to build.sbt.
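The full build.sbt line would look like this (a sketch using sbt's standard
libraryDependencies syntax):

libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"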
From: wwyando...@hotmail.com
To: user@spark.apache.org
Subject: Spark1.4 application throw
Why do you even want to stop it? You can join it with an RDD that reloads the
newest hashtags from disk at a regular interval; see the sketch below.
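A rough sketch of the idea, assuming the hashtags are written to a known path
and the stream elements are twitter4j Status objects (the path is a
placeholder):

// transform runs on the driver each batch, so the tag set is reloaded
// from disk every interval without restarting the Twitter stream.
val filtered = tweets.transform { rdd =>
  val tags = rdd.sparkContext.textFile("hdfs:///tags/latest").collect().toSet
  rdd.filter(status => tags.exists(tag => status.getText.contains(tag)))
}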
On Sun, Jul 19, 2015 at 7:40 AM, Zoran Jeremic zoran.jere...@gmail.com
wrote:
Hi,
I have a twitter spark stream initialized in the following way:
val
Are you running something on port 0 already? Looks like the Spark app master
is failing.
On 19 Jul 2015 06:13, Chester @work ches...@alpinenow.com wrote:
It might be a network issue. The error states it failed to bind the server IP
address.
Chester
Sent from my iPhone
On Jul 18, 2015, at 11:46 AM,
Could you name the storage service that you are using? Most of them
provide an S3-like REST API endpoint for you to hit.
Thanks
Best Regards
On Fri, Jul 17, 2015 at 2:06 PM, Schmirr Wurst schmirrwu...@gmail.com
wrote:
Hi,
I wonder how to use S3-compatible storage in Spark?
If I'm using
Hi All ,
I have an XML file with the same tag repeated multiple times, as below. Please
suggest what would be the best way to process this data inside Spark as ...
How can I extract each opening and closing tag and process them, or how can I
combine multiple lines into a single line?
<review>
</review>
<review>
Just to add more information: I have checked the status of this file, and not
a single block is corrupted.
[hadoop@ip-172-31-24-27 ~]$ hadoop fsck /ankit -files -blocks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Connecting to
You mean this does not work?
SELECT key, count(value) from table group by key
On Sun, Jul 19, 2015 at 2:28 PM, N B nb.nos...@gmail.com wrote:
Hello,
How do I go about performing the equivalent of the following SQL clause in
Spark Streaming? I will be using this on a Windowed DStream.
You would need to write an XML input format that can parse the XML into
records based on start/end tags.
Mahout has an XmlInputFormat implementation you should be able to import:
https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
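A sketch of wiring it up (the input path is a placeholder; the conf key names
follow that class's source):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat

// Treat everything between <review> and </review> as one record.
val conf = new Configuration()
conf.set("xmlinput.start", "<review>")
conf.set("xmlinput.end", "</review>")
val reviews = sc.newAPIHadoopFile("hdfs:///data/reviews.xml",
  classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }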
Maybe you need to do the steps below (sketched in code after the link):
1) Swap key and value
2) Use sortByKey API
3) Swap key and value
4) Reduce result for top keys
http://stackoverflow.com/questions/29003246/how-to-achieve-sort-by-value-in-spark-java
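In code the recipe looks roughly like this, for a hypothetical (word, count)
pair RDD named counts:

val top10 = counts
  .map(_.swap)                   // 1) (count, word)
  .sortByKey(ascending = false)  // 2) highest counts first
  .map(_.swap)                   // 3) back to (word, count)
  .take(10)                      // 4) keep only the top keys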
On Sun, Jul 19, 2015 at 5:48 PM, N B nb.nos...@gmail.com wrote:
Hi
I'm trying to understand how Spark works under the hood, so I tried to read
the source code.
As I normally do, I downloaded the git source code and reverted to the very
first version (actually e5c4cd8a5e188592f8786a265c0cd073c69ac886, since the
first version even lacked the definition of RDD.scala).
Yes.
Sent from my iPhone
On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu
madhu.jahagir...@philips.com wrote:
All,
Can we run different versions of Spark using the same Mesos Dispatcher? For
example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?
Regards,
Madhu
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
    list.add(new Tuple2<String, String>("key1", "val1"));
Hi there,
I got an error when running a simple GraphX program.
My setup is: Spark 1.4.0, Hadoop YARN 2.5, Scala 2.10, with four virtual
machines.
If I construct a small graph (6 nodes, 4 edges), I run:
println("triangleCount: %s".format(
  hdfs_graph.triangleCount().vertices.count() ))
Hi Jerry,
It does not work directly for 2 reasons:
1. I am trying to do this using Spark Streaming (Window DStreams) and
DataFrames API does not work with Streaming yet.
2. The query equivalent has a distinct embedded in it i.e. I am looking
to achieve the equivalent of
SELECT key,
Hi Suyog,
That code outputs the following:
key2 val22 : 1
key1 val1 : 2
key2 val2 : 2
while the output I want to achieve would have been (with your example):
key1 : 2
key2 : 2
because there are 2 distinct values for each key (regardless of
their actual duplicate counts .. hence the
Hi Nikunj,
Sorry, I totally misread your question.
I think you need to first groupByKey (get all values of the same key together),
then follow with mapValues (probably put the values into a set and then take
its size, because you want a distinct count); a sketch follows.
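Something like this, assuming a windowed pair DStream named windowedPairs:

// Distinct count of values per key: gather values, dedupe via a Set.
val distinctCounts = windowedPairs
  .groupByKey()
  .mapValues(values => values.toSet.size)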
HTH,
Jerry
Sent from my iPhone
On
All,
Can we run different versions of Spark using the same Mesos Dispatcher? For
example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?
Regards,
Madhu Jahagirdar
Hello,
How do I go about performing the equivalent of the following SQL clause in
Spark Streaming? I will be using this on a Windowed DStream.
SELECT key, count(distinct(value)) from table group by key;
so for example, given the following dataset in the table:
key | value
----+------
k1  |
I want to use Pithos. Where can I specify that endpoint? Is it possible in
the URL?
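I can't speak for Pithos specifically, but with an S3-compatible store the
endpoint usually goes into the Hadoop configuration rather than the URL. A
sketch using the standard s3a keys (this needs a Hadoop build with the s3a
connector, 2.6+; the endpoint, credentials, and bucket are placeholders):

sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://pithos.example.org")
sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
val data = sc.textFile("s3a://my-bucket/some/path")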
2015-07-19 17:22 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:
Could you name the storage service that you are using? Most of them provide
an S3-like REST API endpoint for you to hit.
Thanks
Best Regards
Hi Jorn,
I didn't know that it is possible to change the filter without re-opening the
Twitter stream. Actually, I already asked that question earlier on
Stack Overflow
http://stackoverflow.com/questions/30960984/apache-spark-twitter-streaming
and I got the answer that it's not possible, but it would be
Hi,
I am using the code below to trigger a Spark job from a remote JVM.
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
/**
* @version 1.0, 15-Jul-2015
* @author ankit
*/
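The submission step presumably looks something like this Scala sketch (the
yarn Client API is not public and changes between versions; the flags follow
the 1.4 ClientArguments source, and the jar/class/arg values are
placeholders):

val hadoopConf = new Configuration()
val sparkConf = new SparkConf()
val clientArgs = Array(
  "--jar", "/path/to/app.jar",     // placeholder
  "--class", "com.example.MyApp",  // placeholder
  "--arg", "someInput")            // placeholder
new Client(new ClientArguments(clientArgs, sparkConf), hadoopConf, sparkConf).run()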
Hi TD,
Yay! Thanks for the help. That solved our issue of ever-increasing
processing time. I added filter functions to all our reduceByKeyAndWindow()
operations and now it's been stable for over 2 days already! :-)
One small piece of feedback about the API, though: the one that accepts the
filter function
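For anyone following along, a sketch of the variant being discussed (the pair
types and window/slide durations are made up):

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // values entering the window
  (a: Int, b: Int) => a - b,  // values leaving the window
  Seconds(300), Seconds(10),
  filterFunc = { case (_, count) => count > 0 }  // drop empty keys so state stays bounded
)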
Are you running something on port 0 already?
No, actually.
I tried multiple ways to avoid this problem, and it seems to disappear when
I set num-executors to 6 (my Hadoop cluster is 3 nodes).
Could num-executors have anything to do with the error I'm getting?
On Sun, Jul 19, 2015
Hello! I am doing a union of two DataFrames with the same schema but a
different number of rows. However, I am unable to pass an assertion. I
think it is this one here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
This is what happens when you create a DataFrame
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L430,
in your case, rdd1.values.flatMap(fun) will be executed
Did you try inputs.repartition(1).foreachRDD(..)?
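i.e. something along these lines (a sketch; the external command is a
placeholder):

inputs.repartition(1).foreachRDD { rdd =>
  // One partition means a single pipe per batch.
  rdd.pipe("/path/to/script.sh").collect().foreach(println)
}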
Thanks
Best Regards
On Fri, Jul 17, 2015 at 9:51 PM, PAULI, KEVIN CHRISTIAN
[AG-Contractor/1000] kevin.christian.pa...@monsanto.com wrote:
Spark newbie here, using Spark 1.3.1.
I’m consuming a stream and trying to pipe the data from the
Hi,
I recently founded the Toronto Apache Spark meetup and we are going to have
our first event in the last week of August. Could you add us to the list on
https://spark.apache.org/community.html
Link to the meetup page: http://www.meetup.com/Toronto-Apache-Spark
Cheers,
Mehrdad Pazooki
CEO and
I have only used client mode with both the 1.3 and 1.4 versions on Mesos.
I skimmed through
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala.
I would actually backport the Cluster Mode feature. Sorry, I don't have an
answer for this.
On
1.3 does not have the MesosDispatcher, nor support for Mesos cluster mode. Is
it still possible to create a dispatcher using 1.4 and run 1.3 using that
dispatcher?
From: Jerry Lam [chiling...@gmail.com]
Sent: Monday, July 20, 2015 8:27 AM
To:
Hi
I am executing a simple flow as shown below
data = sc.wholeTextFiles(...)
tokens = data.flatMap(function)
counts = tokens.map(lambda token: (token, 1))
counters = counts.reduceByKey(lambda a, b: a + b)
counters.sortBy(lambda x: x[1], False).saveAsTextFile(...)
There are some problems that I am
Thanks. My point is that earlier versions are normally much simpler, so
they're easier to follow, and the basic structure should at least bear great
similarity to the latest version.
On Sun, Jul 19, 2015 at 9:27 PM, Ted Yu yuzhih...@gmail.com wrote:
e5c4cd8a5e188592f8786a265 was from 2011.
Not sure
Hi
Apologies for posting these queries again, but I am reposting them as they
are unanswered and I am not able to determine the differences in parallelism
between Standalone and YARN modes and their dependence on physical cores.
This I need to understand so that I can decide in which mode we should deploy
Spark.
If
It depends on how you run the 1.3/1.4 versions of Spark. If you're giving it
different Docker images / tarballs of Spark, technically it should work,
since it's just launching a driver for you at the end of the day.
However, I haven't really tried it so let me know if you run into problems
with it.
Tim
e5c4cd8a5e188592f8786a265 was from 2011.
Not sure why you started with such an early commit.
Spark project has evolved quite fast.
I suggest you clone Spark project from github.com/apache/spark/ and start
with core/src/main/scala/org/apache/spark/rdd/RDD.scala
Cheers
On Sun, Jul 19, 2015 at
I was in a discussion with someone who works for a cloud provider which
offers Spark/Hadoop services. We got into a discussion of performance and
the bewildering array of machine types and the problem of selecting a
cluster with 20 Large instances vs. 10 Jumbo instances, or the trade-offs
between
Hi Mike,
Spark is rack-aware in its task scheduling. Currently Spark doesn't honor
any locality preferences when scheduling executors, but this is being
addressed in SPARK-4352, after which executor-scheduling will be rack-aware
as well.
-Sandy
On Sat, Jul 18, 2015 at 6:25 PM, Mike Frampton