spark1.4.0, streaming + kafka, throw org.apache.spark.util.TaskCompletionListenerException

2015-07-19 Thread Wwh 吴
Hi, I am testing Spark Streaming with Kafka; the application looks like this: import kafka.serializer.StringDecoder import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.kafka.KafkaUtils import net.sf.json.JSONObject /** * Created by
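The preview is truncated; a minimal Spark 1.4-era sketch of this kind of Streaming + Kafka setup (broker list, topic name and batch interval are placeholder assumptions, not the original poster's values):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder broker list and topic; adjust to your Kafka cluster.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("test-topic")

    // Receiver-less direct stream, as provided by spark-streaming-kafka in 1.4.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).print()   // print message payloads per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```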

RE: Spark1.4 application throw java.lang.NoClassDefFoundError: javax/servlet/FilterRegistration

2015-07-19 Thread Wwh 吴
It is caused by a dependency conflict: spark-core 1.4 needs javax.servlet-api 3.0.1. It can be resolved by adding the dependency to build.sbt like this: javax.servlet % javax.servlet-api % 3.0.1 From: wwyando...@hotmail.com To: user@spark.apache.org Subject: Spark1.4 application throw
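For reference, a minimal build.sbt sketch of the fix being described (the Spark version and the provided scope are assumptions to adapt to your project):

```scala
// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  // Pin the servlet API explicitly to resolve the FilterRegistration conflict.
  "javax.servlet" % "javax.servlet-api" % "3.0.1"
)
```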

Re: How to restart Twitter spark stream

2015-07-19 Thread Jörn Franke
Why do you even want to stop it? You can join it with an RDD loading the newest hash tags from disk at a regular interval. On Sun, Jul 19, 2015 at 7:40 AM, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi, I have a twitter spark stream initialized in the following way: val
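A rough sketch of the approach Jörn suggests, assuming a DStream of tweets and a hash-tag file on disk (the stream name, path and filter logic are placeholders, not the original poster's code):

```scala
// tweets: DStream[twitter4j.Status] created once; the hash-tag list is re-read
// from disk on every batch inside transform(), so no stream restart is needed.
val filtered = tweets.transform { rdd =>
  val hashTags = rdd.sparkContext
    .textFile("hdfs:///path/to/hashtags.txt")   // placeholder path
    .collect()
    .toSet
  rdd.filter(status => hashTags.exists(tag => status.getText.contains(tag)))
}
```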

Re: spark-shell with Yarn failed

2015-07-19 Thread ayan guha
Are you running something on port 0 already? Looks like spark app master is failing. On 19 Jul 2015 06:13, Chester @work ches...@alpinenow.com wrote: it might be a network issue. The error states failed to bind the server IP address Chester Sent from my iPhone On Jul 18, 2015, at 11:46 AM,

Re: use S3-Compatible Storage with spark

2015-07-19 Thread Akhil Das
Could you name the storage service that you are using? Most of them provide an S3-like REST API endpoint for you to hit. Thanks Best Regards On Fri, Jul 17, 2015 at 2:06 PM, Schmirr Wurst schmirrwu...@gmail.com wrote: Hi, I wonder how to use S3-compatible storage in Spark? If I'm using

XML Parsing

2015-07-19 Thread Ashish Soni
Hi All, I have an XML file with the same tag repeated multiple times, as below. Please suggest what would be the best way to process this data inside Spark as ... How can I extract each opening and closing tag and process them, or how can I combine multiple lines into a single line? review /review review

Re: Exception while triggering spark job from remote jvm

2015-07-19 Thread ankit tyagi
Just to add more information. I have checked the status of this file, not a single block is corrupted. *[hadoop@ip-172-31-24-27 ~]$ hadoop fsck /ankit -files -blocks* *DEPRECATED: Use of this script to execute hdfs command is deprecated.* *Instead use the hdfs command for it.* Connecting to

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B nb.nos...@gmail.com wrote: Hello, How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream.

Re: XML Parsing

2015-07-19 Thread Ram Sriharsha
You would need to write an XML input format that can parse the XML into records based on start/end tags. Mahout has an XmlInputFormat implementation you should be able to import: https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
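A hedged sketch of wiring that input format in from Spark for the review /review case in the original question, assuming the START_TAG_KEY/END_TAG_KEY constants from the linked class (the file path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat

val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<review>")
conf.set(XmlInputFormat.END_TAG_KEY, "</review>")

// Each record is the full text between one start tag and its matching end tag.
val reviews = sc.newAPIHadoopFile(
  "hdfs:///path/to/reviews.xml",
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
  .map { case (_, xml) => xml.toString }
```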

Re: Counting distinct values for a key?

2015-07-19 Thread suyog choudhari
Maybe you need to do the steps below: 1) Swap key and value 2) Use the sortByKey API 3) Swap key and value back 4) Reduce the result to the top keys http://stackoverflow.com/questions/29003246/how-to-achieve-sort-by-value-in-spark-java On Sun, Jul 19, 2015 at 5:48 PM, N B nb.nos...@gmail.com wrote: Hi
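A small sketch of the swap / sortByKey / swap pattern from those steps (the data is illustrative):

```scala
val counts = sc.parallelize(Seq(("a", 3), ("b", 7), ("c", 1)))

val sortedByValue = counts
  .map(_.swap)                      // 1) swap to (count, key)
  .sortByKey(ascending = false)     // 2) sort by the count
  .map(_.swap)                      // 3) swap back to (key, count)

sortedByValue.take(2).foreach(println)   // 4) keep only the top keys
```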

how to start reading the spark source code?

2015-07-19 Thread Yang
I'm trying to understand how Spark works under the hood, so I tried to read the source code. As I normally do, I downloaded the git source code and reverted to the very first version (actually e5c4cd8a5e188592f8786a265c0cd073c69ac886, since the first version even lacked the definition of RDD.scala)

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: All, Can we run different version of Spark using the same Mesos Dispatcher. For example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? Regards, Madhu

Re: Counting distinct values for a key?

2015-07-19 Thread suyog choudhari
public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("CountDistinct"); JavaSparkContext jsc = new JavaSparkContext(sparkConf); List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>(); list.add(new Tuple2<String, String>("key1", "val1"));

assertion failed error with GraphX

2015-07-19 Thread Jack Yang
Hi there, I got an error when running a simple GraphX program. My setup is: Spark 1.4.0, Hadoop YARN 2.5, Scala 2.10, with four virtual machines. If I construct a small graph (6 nodes, 4 edges) and run: println("triangleCount: %s".format(hdfs_graph.triangleCount().vertices.count()))
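A hedged sketch, assuming the 1.x triangleCount preconditions (canonically oriented edges with srcId < dstId and a graph repartitioned via partitionBy), which are a common source of assertion failures:

```scala
import org.apache.spark.graphx.PartitionStrategy

// triangleCount in GraphX 1.x assumes srcId < dstId for every edge and a
// partitioned graph; repartitioning first avoids one class of assertion errors.
val partitioned = hdfs_graph.partitionBy(PartitionStrategy.RandomVertexCut)
println("triangleCount: %s".format(partitioned.triangleCount().vertices.count()))
```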

Re: Counting distinct values for a key?

2015-07-19 Thread N B
Hi Jerry, It does not work directly for 2 reasons: 1. I am trying to do this using Spark Streaming (windowed DStreams) and the DataFrames API does not work with Streaming yet. 2. The query equivalent has a distinct embedded in it, i.e. I am looking to achieve the equivalent of SELECT key,

Re: Counting distinct values for a key?

2015-07-19 Thread N B
Hi Suyog, That code outputs the following: key2 val22 : 1 key1 val1 : 2 key2 val2 : 2 while the output I want to achieve would have been (with your example): key1 : 2 key2 : 2 because there are 2 distinct types of values for each key ( regardless of their actual duplicate counts .. hence the

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
Hi Nikunj, Sorry, I totally misread your question. I think you need to first groupByKey (get all values of the same key together), then follow with mapValues (probably put the values into a set and then take its size, because you want a distinct count). HTH, Jerry Sent from my iPhone On
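A minimal sketch of that suggestion (the data is illustrative, not the original poster's):

```scala
val pairs = sc.parallelize(Seq(
  ("key1", "val1"), ("key1", "val1"), ("key1", "val11"),
  ("key2", "val2"), ("key2", "val22")))

// Group all values per key, then count the distinct ones via a Set.
val distinctCounts = pairs
  .groupByKey()
  .mapValues(values => values.toSet.size)

distinctCounts.collect().foreach(println)   // (key1,2), (key2,2)
```

On large data the same result can be computed without materializing all values per key, for example with aggregateByKey over sets, but groupByKey keeps the sketch closest to the wording above.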

Spark Mesos Dispatcher

2015-07-19 Thread Jahagirdar, Madhu
All, Can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time? Regards, Madhu Jahagirdar The information contained in this message may be confidential and legally

Counting distinct values for a key?

2015-07-19 Thread N B
Hello, How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream. SELECT key, count(distinct(value)) from table group by key; so for example, given the following dataset in the table: key | value -+--- k1 |

Fwd: use S3-Compatible Storage with spark

2015-07-19 Thread Schmirr Wurst
I want to use Pithos; where can I specify that endpoint? Is it possible in the URL? 2015-07-19 17:22 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Could you name the storage service that you are using? Most of them provide an S3-like REST API endpoint for you to hit. Thanks Best Regards
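A hedged sketch of pointing Spark's Hadoop S3 layer at a custom endpoint such as Pithos; the property names follow the hadoop-aws s3a conventions and vary with the Hadoop version, and the host, credentials and bucket are placeholders:

```scala
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", "pithos.example.com")   // placeholder endpoint
hadoopConf.set("fs.s3a.access.key", "ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "SECRET_KEY")

// The bucket is then addressed through the s3a scheme rather than in the URL itself.
val data = sc.textFile("s3a://my-bucket/path/to/data")
```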

Re: How to restart Twitter spark stream

2015-07-19 Thread Zoran Jeremic
Hi Jorn, I didn't know that it is possible to change the filter without re-opening the Twitter stream. Actually, I already asked that question earlier on Stack Overflow http://stackoverflow.com/questions/30960984/apache-spark-twitter-streaming and I got the answer that it's not possible, but it would be

Exception while triggering spark job from remote jvm

2015-07-19 Thread ankit tyagi
Hi, I am using the code below to trigger a Spark job from a remote JVM. import org.apache.hadoop.conf.Configuration; import org.apache.spark.SparkConf; import org.apache.spark.deploy.yarn.Client; import org.apache.spark.deploy.yarn.ClientArguments; /** * @version 1.0, 15-Jul-2015 * @author ankit */

Re: Spark streaming Processing time keeps increasing

2015-07-19 Thread N B
Hi TD, Yay! Thanks for the help. That solved our issue of ever-increasing processing time. I added filter functions to all our reduceByKeyAndWindow() operations and now it's been stable for over 2 days already! :-). One small piece of feedback about the API though. The one that accepts the filter function
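For context, a sketch of the incremental reduceByKeyAndWindow variant being discussed, with the filter function that keeps state from growing (the DStream, durations and types are placeholder assumptions):

```scala
// pairs: DStream[(String, Int)] of per-batch counts (placeholder).
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,                 // reduce: add data entering the window
  (a: Int, b: Int) => a - b,                 // inverse reduce: subtract data leaving it
  Seconds(300),                              // window length
  Seconds(10),                               // slide interval
  4,                                         // number of partitions
  { case (_, count) => count > 0 }           // filter: drop keys whose count reached zero
)
```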

Re: spark-shell with Yarn failed

2015-07-19 Thread Amjad ALSHABANI
Are you running something on port 0 already? No, actually I tried multiple ways to avoid this problem, and it seems to disappear when I'm setting num-executors to 6 (my Hadoop cluster is 3 nodes). Could num-executors have anything to do with the error I'm getting?? On Sun, Jul 19, 2015

DataFrame Union not passing optimizer assertion

2015-07-19 Thread Brandon White
Hello! So I am doing a union of two dataframes with the same schema but a different number of rows. However, I am unable to pass an assertion. I think it is this one here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Re: Spark APIs memory usage?

2015-07-19 Thread Akhil Das
This is what happens when you create a DataFrame https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L430, in your case, rdd1.values.flatMap(fun) will be executed

Re: streaming and piping to R, sending all data in window to pipe()

2015-07-19 Thread Akhil Das
Did you try inputs.repartition(1).foreachRDD(..)? Thanks Best Regards On Fri, Jul 17, 2015 at 9:51 PM, PAULI, KEVIN CHRISTIAN [AG-Contractor/1000] kevin.christian.pa...@monsanto.com wrote: Spark newbie here, using Spark 1.3.1. I’m consuming a stream and trying to pipe the data from the
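A rough sketch of what that suggestion could look like on the consuming side, assuming inputs is the windowed DStream from the original message (the R script path is a placeholder):

```scala
// Collapsing each RDD into one partition means the external R script sees all
// of the window's records through a single pipe invocation.
inputs.repartition(1).foreachRDD { rdd =>
  val piped = rdd.pipe("/path/to/script.R")   // placeholder script path
  piped.collect().foreach(println)
}
```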

Toronto Apache Spark

2015-07-19 Thread Mehrdad Pazooki
Hi, I recently founded the Toronto Apache Spark meetup and we are going to have our first event in the last week of August. Could you add us to the list at https://spark.apache.org/community.html Link to the meetup page: http://www.meetup.com/Toronto-Apache-Spark Cheers, Mehrdad Pazooki CEO and

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
I have only used client mode with both the 1.3 and 1.4 versions on Mesos. I skimmed through https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala. I would actually backport the cluster-mode feature. Sorry, I don't have an answer for this. On

RE: Spark Mesos Dispatcher

2015-07-19 Thread Jahagirdar, Madhu
1.3 does not have MesosDispatcher and does not have support for Mesos cluster mode; is it still possible to create a Dispatcher using 1.4 and run 1.3 using that dispatcher? From: Jerry Lam [chiling...@gmail.com] Sent: Monday, July 20, 2015 8:27 AM To:

Error while Partitioning

2015-07-19 Thread rishikesh
Hi, I am executing a simple flow as shown below: data = sc.wholeTextFiles(...) tokens = data.flatMap(function) counts = tokens.map(lambda token: (token,1)) counters = counts.reduceByKey(lambda a,b: a+b) counters.sortBy(lambda x:x[1],False).saveAsTextFile(...) There are some problems that I am

Re: how to start reading the spark source code?

2015-07-19 Thread Yang
Thanks. My point is that earlier versions are normally much simpler, so they're easier to follow, and the basic structure should at least bear great similarity to the latest version. On Sun, Jul 19, 2015 at 9:27 PM, Ted Yu yuzhih...@gmail.com wrote: e5c4cd8a5e188592f8786a265 was from 2011. Not sure

Fwd: Problem in Understanding concept of Physical Cores

2015-07-19 Thread Aniruddh Sharma
Hi, Apologies for posting these queries again, but I am reposting as they are unanswered and I am not able to determine the differences in parallelism between Standalone and YARN modes and their dependence on physical cores. I need to understand this so that I can decide in which mode we should deploy Spark. If

Re: Spark Mesos Dispatcher

2015-07-19 Thread Tim Chen
Depends on how you run 1.3/1.4 versions of Spark, if you're giving it different Docker images / tar balls of Spark, technically it should work since it's just launching a driver for you at the end of the day. However, I haven't really tried it so let me know if you run into problems with it. Tim

Re: how to start reading the spark source code?

2015-07-19 Thread Ted Yu
e5c4cd8a5e188592f8786a265 was from 2011. Not sure why you started with such an early commit. The Spark project has evolved quite fast. I suggest you clone the Spark project from github.com/apache/spark/ and start with core/src/main/scala/org/apache/spark/rdd/RDD.scala Cheers On Sun, Jul 19, 2015 at

Looking for a few Spark Benchmarks

2015-07-19 Thread Steve Lewis
I was in a discussion with someone who works for a cloud provider that offers Spark/Hadoop services. We got into a discussion of performance, the bewildering array of machine types, and the problem of selecting a cluster with 20 Large instances vs. 10 Jumbo instances, or the trade-offs between

Re: [General Question] [Hadoop + Spark at scale] Spark Rack Awareness ?

2015-07-19 Thread Sandy Ryza
Hi Mike, Spark is rack-aware in its task scheduling. Currently Spark doesn't honor any locality preferences when scheduling executors, but this is being addressed in SPARK-4352, after which executor-scheduling will be rack-aware as well. -Sandy On Sat, Jul 18, 2015 at 6:25 PM, Mike Frampton