Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Zsolt Tóth
Hi, this is exactly the same as my issue, seems to be a bug in 1.5.x. (see my thread for details) 2015-11-19 11:20 GMT+01:00 Jeff Zhang : > Seems your jdbc url is not correct. Should be jdbc:mysql:// > 192.168.41.229:3306 > > On Thu, Nov 19, 2015 at 6:03 PM,

[SPARK STREAMING] multiple hosts and multiple ports for Stream job

2015-11-19 Thread diplomatic Guru
Hello team, I was wondering whether it is a good idea to have multiple hosts and multiple ports for a Spark job. Let's say there are two hosts and each has 2 ports. If this is not an issue, then what is the best way to do it? Currently, we pass it as an argument comma
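Passing the endpoints as a single comma-separated argument and parsing them in the driver is one common approach; a minimal parsing sketch (the `host:port` format is an assumption here, not stated in the thread):

```python
def parse_endpoints(arg):
    """Parse a comma-separated list of host:port pairs into (host, port) tuples."""
    endpoints = []
    for item in arg.split(","):
        host, _, port = item.strip().rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints

# Two hosts, two ports each, passed as one argument:
print(parse_endpoints("host1:5555,host1:5556,host2:5555"))
# [('host1', 5555), ('host1', 5556), ('host2', 5555)]
```

Each (host, port) pair can then be mapped to its own receiver/input stream and the resulting streams unioned.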

Re: Distinct on key-value pair of JavaRDD

2015-11-19 Thread Ramkumar V
I thought some specific function would be there, but I'm using reduceByKey now. It's working fine. Thanks a lot. *Thanks*, On Tue, Nov 17, 2015 at 6:21 PM, ayan guha wrote: > How about using reducebykey? > On 17 Nov 2015 22:00,
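For readers outside Spark, the reduceByKey trick for de-duplicating key-value pairs can be sketched in plain Python (a stand-in for the RDD API, not actual Spark code):

```python
def reduce_by_key(pairs, f):
    """Group pairs by key and fold each group's values with f (mimics RDD.reduceByKey)."""
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return list(out.items())

# Distinct on (key, value) pairs: key by the whole pair, keep one copy per key.
data = [("a", 1), ("b", 2), ("a", 1)]
distinct = [kv for kv, _ in reduce_by_key([((k, v), None) for k, v in data], lambda x, y: x)]
print(sorted(distinct))  # [('a', 1), ('b', 2)]
```

In Spark this is `rdd.map(kv => (kv, null)).reduceByKey((a, b) => a).keys`, which avoids the full shuffle-and-sort that a naive `distinct` on large pairs can incur.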

Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sanket Patil
Hey Sandip: TD has already outlined the right approach, but let me add a couple of thoughts as I recently worked on a similar project. I had to compute some real-time metrics on streaming data. Also, these metrics had to be aggregated for hour/day/week/month. My data pipeline was Kafka --> Spark

Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sandip Mehta
Thank you Sanket for the feedback. Regards SM > On 19-Nov-2015, at 1:57 PM, Sanket Patil wrote: > > Hey Sandip: > > TD has already outlined the right approach, but let me add a couple of > thoughts as I recently worked on a similar project. I had to compute some

[no subject]

2015-11-19 Thread aman solanki
Hi All, I want to know how one can get historical data of jobs, stages, tasks, etc. of a running spark application. Please share the information regarding the same. Thanks, Aman Solanki

Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread D
I am trying to write a simple "Hello World" kind of application using spark streaming and RabbitMq, in which Apache Spark Streaming will read messages from RabbitMq via the RabbitMqReceiver and print them in the console. But somehow I am not able to

Re: doubts on parquet

2015-11-19 Thread Cheng Lian
/cc Spark user list I'm confused here, you mentioned that you were writing Parquet files using MR jobs. What's the relation between that Parquet writing task and this JavaPairRDD one? Is it a separate problem? Spark supports dynamic partitioning (e.g. df.write.partitionBy("col1",

ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi, I try to throw an exception of my own exception class (MyException extends SparkException) on one of the executors. This works fine on Spark 1.3.x, 1.4.x but throws a deserialization/ClassNotFound exception on Spark 1.5.x. This happens only when I throw it on an executor, on the driver it

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-19 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the logs: WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 25.4 GB of 25.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. On Nov 19, 2015, at 12:20 PM,

External Table not getting updated from parquet files written by spark streaming

2015-11-19 Thread Abhishek Anand
Hi , I am using spark streaming to write the aggregated output as parquet files to the hdfs using SaveMode.Append. I have an external table created like : CREATE TABLE if not exists rolluptable USING org.apache.spark.sql.parquet OPTIONS ( path "hdfs:" ); I had an impression that in case

Re: Invocation of StreamingContext.stop() hangs in 1.5

2015-11-19 Thread jiten
Hi, Thanks to Ted Yu and Nilanjan. Stopping the streaming context asynchronously did the trick! Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Invocation-of-StreamingContext-stop-hangs-in-1-5-tp25402p25434.html Sent from the Apache Spark User

Shuffle performance tuning. How to tune netty?

2015-11-19 Thread t3l
I am facing a very tricky issue here. I have a treeReduce task. The reduce function returns a very large object; in fact it is a Map[Int, Array[Double]]. Each reduce task inserts and/or updates values in the map or updates the array. My problem is that this Map can become very large. Currently,

Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Sabarish Sasidharan
The stack trace is clear enough: Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method(reply-code=406, reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue 'hello1' in vhost '/': received 'true' but current is 'false', class-id=50,

Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Shuai Zheng
Hi All, I face a very weird case. I have already simplified the scenario as much as possible so everyone can replay it. My env: AWS EMR 4.1.0, Spark 1.5. My code can run without any problem in local mode, and it has no problem when it runs on an EMR cluster with one

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > > I have a random

Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is: import org.apache.spark.sql.functions._ val df = ... df.withColumn("id", monotonicallyIncreasingId()) -Andrew 2015-11-19 8:23 GMT-08:00 xiaohe lan : > Hi, > > I have some csv file in HDFS with headers like col1,

Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Jonathan Kelly
I don't know if this actually has anything to do with why your job is hanging, but since you are using EMR you should probably not set those fs.s3 properties but rather let it use EMRFS, EMR's optimized Hadoop FileSystem implementation for interacting with S3. One benefit is that it will

newbie: unable to use all my cores and memory

2015-11-19 Thread Andy Davidson
I am having a heck of a time figuring out how to utilize my cluster effectively. I am using the stand alone cluster manager. I have a master and 3 slaves. Each machine has 2 cores. I am trying to run a streaming app in cluster mode and pyspark at the same time. t1) On my console I see *

Re: Blocked REPL commands

2015-11-19 Thread Jacek Laskowski
Hi, Dunno the answer, but :reset should be blocked, too, for obvious reasons. ➜ spark git:(master) ✗ ./bin/spark-shell ... Welcome to Spark version 1.6.0-SNAPSHOT. Using Scala

spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Rachana Srivastava
Issue: I have a random forest model that I am trying to load during streaming using the following code. The code is working fine when I run it from Eclipse, but I get an NPE when running it using spark-submit. JavaStreamingContext jssc = new JavaStreamingContext(jsc,

Re: Error not found value sqlContext

2015-11-19 Thread Michael Armbrust
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j wrote: > HI All, > we have recently migrated from Spark 1.2.1 to Spark 1.4.0, I am fetching > data from an RDBMS using

Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Afshartous, Nick
Hi, On Spark 1.5 on EMR 4.1 the message below appears in stderr in the Yarn UI. ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. I do see that there is /usr/lib/spark/conf/log4j.properties Can someone please advise

Drop multiple columns in the DataFrame API

2015-11-19 Thread Benjamin Fradet
Hi everyone, I was wondering if there is a better way to drop multiple columns from a dataframe, or why there is no drop(cols: Column*) method in the dataframe API. Indeed, I tend to write code like this: val filteredDF = df.drop("colA") .drop("colB") .drop("colC") //etc which is a
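Until a variadic `drop` exists, folding the single-column `drop` over a column list avoids the repeated calls; in Scala this is `cols.foldLeft(df)(_ drop _)`. A plain-Python sketch of the same fold, using a dict-of-columns stand-in for the DataFrame (hypothetical, not the Spark API):

```python
from functools import reduce

def drop(df, col):
    """Return a copy of the dict-of-columns df without column col (immutable, like DataFrame.drop)."""
    return {c: v for c, v in df.items() if c != col}

df = {"colA": [1], "colB": [2], "colC": [3], "colD": [4]}
filtered = reduce(drop, ["colA", "colB", "colC"], df)  # fold drop over the column list
print(sorted(filtered))  # ['colD']
```

The fold keeps each step immutable, matching DataFrame semantics where every `drop` returns a new frame.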

Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
That definitely looks like a bug, go ahead with filing an issue. I'll check the Scala REPL source code to see what, if any, other commands there are that should be disabled. On 19 November 2015 at 12:54, Jacek Laskowski wrote: > Hi, > > Dunno the answer, but :reset should be

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-19 Thread swetha kasireddy
That was actually an issue with our Mesos. On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das wrote: > If possible, could you give us the root cause and solution for future > readers of this thread. > > On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy < >

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi Tamás, the exception class is in the application jar, I'm using the spark-submit script. 2015-11-19 11:54 GMT+01:00 Tamas Szuromi : > Hi Zsolt, > > How you load the jar and how you prepend it to the classpath? > > Tamas > > > > > On 19 November 2015 at 11:02, Zsolt

Moving avg in Spark streaming

2015-11-19 Thread anshu shukla
Is there a formal way to do a moving avg over a fixed window duration? I calculated a simple moving average by creating a count stream and a sum stream, then joined them and finally calculated the mean. This was not per time window since time periods were part of the tuples. -- Thanks & Regards, Anshu
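The count-stream/sum-stream approach described above computes, at each element, the mean of the last N values. Outside Spark the same logic can be sketched with a bounded deque (the window size of 3 is an arbitrary choice for illustration):

```python
from collections import deque

def moving_avg(stream, window):
    """Yield the mean of the last `window` values seen so far, for each element of stream."""
    buf = deque(maxlen=window)  # old values fall off automatically
    for x in stream:
        buf.append(x)
        yield sum(buf) / len(buf)

print(list(moving_avg([1, 2, 3, 4, 5], 3)))  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

In Spark Streaming the time-windowed analogue would be `reduceByWindow` / `reduceByKeyAndWindow` over the sum and count, then dividing, as the original poster did.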

Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-19 Thread swetha kasireddy
OK. We have a long-running streaming job. I was thinking that maybe we should have a cron to clear files that are older than 2 days. What would be an appropriate way to do that? On Wed, Nov 18, 2015 at 7:43 PM, Ted Yu wrote: > Have you seen SPARK-5836 ? > Note TD's comment
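A cron-driven cleanup can be sketched with find's `-mtime` age filter. The `/tmp/spark-*` path is an assumption; point it at whatever `spark.local.dir` is configured to on your nodes, and note that deleting files still referenced by a running job is unsafe:

```shell
# Delete temp/shuffle files older than 2 days under Spark's local dirs
# (/tmp/spark-* is an assumed default; use your spark.local.dir).
for d in /tmp/spark-*; do
  [ -d "$d" ] && find "$d" -type f -mtime +2 -delete || true
done
```

Run from crontab, e.g. `0 3 * * * /usr/local/bin/clean-spark-local.sh`.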

FastUtil DataStructures in Spark

2015-11-19 Thread swetha
Hi, Has anybody used FastUtil equivalent to HashSet for Strings in Spark? Any example would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FastUtil-DataStructures-in-Spark-tp25429.html Sent from the Apache Spark User List

Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Jeff Zhang
Seems your jdbc url is not correct. Should be jdbc:mysql:// 192.168.41.229:3306 On Thu, Nov 19, 2015 at 6:03 PM, wrote: > hi guy, > >I also found --driver-class-path and spark.driver.extraClassPath > is not working when I'm accessing mysql driver in my spark APP.

Why Spark Streaming keeps all batches in memory after processing?

2015-11-19 Thread Artem Moskvin
Hello there! Why does Spark Streaming keep all processed batches in memory? It leads to running out of memory on executors, but I really don't need the batches after processing. Can it be configured somewhere so that batches are not kept in memory after processing? Respectfully, Artem Moskvin

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Tamas Szuromi
Hi Zsolt, How do you load the jar and how do you prepend it to the classpath? Tamas On 19 November 2015 at 11:02, Zsolt Tóth wrote: > Hi, > > I try to throw an exception of my own exception class (MyException extends > SparkException) on one of the executors. This works

spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
I am working on a simple POC. I am running into a really strange problem. I wrote a Java streaming app that collects tweets using the spark twitter package and stores them to disk in JSON format. I noticed that when I run the code on my mac, the files are written to the local file system as I

Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2 when compile spark-1.5.2

2015-11-19 Thread ck...@126.com
Hey everyone, when compiling spark-1.5.2 using IntelliJ IDEA there is an error like this: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-core_2.10: wrap: java.lang.ClassNotFoundException: xsbt.CompilerInterface:

RE: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Afshartous, Nick
< log4j.properties file only exists on the master and not the slave nodes, so you are probably running into https://issues.apache.org/jira/browse/SPARK-11105, which has already been fixed in the not-yet-released Spark 1.6.0. EMR will upgrade to Spark 1.6.0 once it is released. Thanks for the

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm, I guess I do not - I get 'application_1445957755572_0176 does not have any log files.' Where can I enable log aggregation? On Nov 19, 2015, at 11:07 AM, Ted Yu wrote: Do you have YARN log aggregation enabled ? You can try retrieving log for

has any spark write orc document

2015-11-19 Thread zhangjp
Hi, is there any document for writing ORC from Spark, like the Parquet document? http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files Thanks

Re: has any spark write orc document

2015-11-19 Thread Jeff Zhang
It should be very similar to Parquet from the API perspective. Please refer to this doc: http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/ On Fri, Nov 20, 2015 at 2:59 PM, zhangjp <592426...@qq.com> wrote: > Hi, > has any spark write orc document which like the parquet

Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Daniel Carroza
Hi, Just answered on Github: https://github.com/Stratio/rabbitmq-receiver/issues/20 Regards Daniel Carroza Santana 2015-11-19 10:02

Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Do you have YARN log aggregation enabled ? You can try retrieving log for the container using the following command: yarn logs -applicationId application_1445957755572_0176 -containerId container_1445957755572_0176_01_03 Cheers On Thu, Nov 19, 2015 at 8:02 AM,

Spark 1.5.3 release

2015-11-19 Thread Madabhattula Rajesh Kumar
Hi, Please let me know Spark 1.5.3 release date details Regards, Rajesh

PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL transforms on a JSON data set that I load into a data frame. The data set is not large (~100GB) and most stages execute without any issues. However, some more complex stages tend to lose executors/nodes regularly. What

Re: Spark streaming and custom partitioning

2015-11-19 Thread Cody Koeninger
Not sure what you mean by "no documentation regarding ways to achieve effective communication between the 2", but the docs on integrating with kafka are at http://spark.apache.org/docs/latest/streaming-kafka-integration.html As far as custom partitioners go, Learning Spark from O'Reilly has a

RE: SequenceFile and object reuse

2015-11-19 Thread jeff saremi
Sandy, Ryan, Andrew Thanks very much. I think i now understand it better. Jeff From: ryan.blake.willi...@gmail.com Date: Thu, 19 Nov 2015 06:00:30 + Subject: Re: SequenceFile and object reuse To: sandy.r...@cloudera.com; jeffsar...@hotmail.com CC: user@spark.apache.org Hey Jeff, in addition

Receiver stage fails but Spark application stands RUNNING

2015-11-19 Thread Pierre Van Ingelandt
Hi, I currently have two Spark Streaming applications (Spark 1.3.1), one using a custom JMS Receiver and the other one a Kafka Receiver. Most of the time when a job fails (i.e. something like "Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times"), the application gets either

Spark streaming and custom partitioning

2015-11-19 Thread Sachin Mousli
Hi, I would like to implement a custom partitioning algorithm in a streaming environment, preferably in a generic manner using a single pass. The only sources I could find mention Apache Kafka, which adds a complexity I would like to avoid since there seems to be no documentation regarding ways to

Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross, This is most likely occurring because YARN is killing containers for exceeding physical memory limits. You can make this less likely to happen by bumping spark.yarn.executor.memoryOverhead to something higher than 10% of your spark.executor.memory. -Sandy On Thu, Nov 19, 2015 at 8:14
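The fix Sandy describes amounts to raising the off-heap allowance at submit time; a sketch (the values are illustrative, not taken from the thread):

```shell
# Give each executor container extra off-heap headroom beyond spark.executor.memory.
# In Spark 1.5.x the overhead defaults to max(384 MB, 10% of executor memory);
# bump it when YARN kills containers for exceeding physical memory limits.
spark-submit \
  --conf spark.executor.memory=20g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  my_job.py
```

YARN enforces the container limit on executor memory plus overhead, so the overhead must cover Python workers, netty buffers, and other off-heap usage.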

Re:

2015-11-19 Thread Dean Wampler
If you mean retaining data from past jobs, try running the history server, documented here: http://spark.apache.org/docs/latest/monitoring.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Re: spark with breeze error of NoClassDefFoundError

2015-11-19 Thread Ted Yu
I don't have Spark 1.4 source code on hand. You can use the following command: mvn dependency:tree to find out the answer to your question. Cheers On Wed, Nov 18, 2015 at 10:18 PM, Jack Yang wrote: > Back to my question. If I use “*provided*”, the jar file > will expect

Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Here are the parameters related to log aggregation: yarn.log-aggregation-enable = true; yarn.log-aggregation.retain-seconds = 2592000; yarn.nodemanager.log-aggregation.compression-type = gz

what is algorithm to optimize function with nonlinear constraints

2015-11-19 Thread Zhiliang Zhu
Hi all, I have an optimization problem. I have googled a lot but still did not find the exact algorithm or third-party open package to apply to it. Its type is like this. Objective function: f(x1, x2, ..., xn) (n >= 100, and f may be linear or non-linear). Constraint functions: x1 + x2 + ... +