Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Shixiong Zhu
Oh, I see. That's the total time of executing a query in Spark. In that case the difference is reasonable, considering Spark has much more work to do, e.g., launching tasks in executors. Best Regards, Shixiong Zhu 2015-07-26 16:16 GMT+08:00 Louis Hust louis.h...@gmail.com: Look at the given url:

Re: Writing binary files in Spark

2015-07-26 Thread Oren Shpigel
As I wrote before, the result of my pipeline is binary objects, which I want to write directly as raw bytes, not serialize them again. Is that possible? On Sat, Jul 25, 2015 at 11:28 AM Akhil Das ak...@sigmoidanalytics.com wrote: It's been added since Spark 1.1.0, I guess
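
A minimal sketch of one way to do this, assuming the spark-shell's sc and writing through the Hadoop FileSystem API; the RDD contents and the output path are placeholders of mine, not something confirmed in the thread:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.TaskContext

    val bytes = sc.parallelize(Seq(Array[Byte](1, 2, 3), Array[Byte](4, 5)))
    // Write each partition's objects as raw bytes, one file per partition,
    // bypassing Spark's serializers entirely.
    bytes.foreachPartition { iter =>
      val fs = FileSystem.get(new Configuration())
      val out = fs.create(new Path(s"/tmp/binary-out/part-${TaskContext.get.partitionId}"))
      try iter.foreach(b => out.write(b)) finally out.close()
    }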

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Jörn Franke
Use a Hadoop distribution that supports Windows and has Spark included. Generally, if you want to use Windows, you should use the server version. On Sat, Jul 25, 2015 at 20:11, Peter Leventis pleven...@telkomsa.net wrote: I just wanted an easy step-by-step guide as to exactly what version

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Pa Rö
I have seen that the tmpfs is full; how can I clear it? 2015-07-23 13:41 GMT+02:00 Pa Rö paul.roewer1...@googlemail.com: Hello Spark community, I have built an application with GeoMesa, Accumulo and Spark. It works in Spark local mode, but not on the Spark cluster. in short

Re: Parallelism of Custom receiver in spark

2015-07-26 Thread Michal Čizmazia
#1 see https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving #2 By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the
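
For #1, the linked section comes down to running several receiver instances and unioning their streams. A minimal sketch under that reading, with a hypothetical MyReceiver standing in for the custom receiver:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical receiver; a real one would start a thread in onStart
    // that feeds data to store(...).
    class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {}
      def onStop(): Unit = {}
    }

    val ssc = new StreamingContext(sc, Seconds(2))
    // Four receivers ingest in parallel; union merges their streams.
    val streams = (1 to 4).map(_ => ssc.receiverStream(new MyReceiver))
    val unified = ssc.union(streams)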

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Paolo Platter
If you want a performance boost, you need to load the full table into memory using caching and then execute your query directly on the cached DataFrame. Otherwise you use Spark only as a bridge and you don't leverage Spark's distributed in-memory engine. Paolo Sent from my Windows Phone
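
A minimal sketch of what Paolo describes, against the Spark 1.4 era API and assuming the spark-shell's sqlContext; the connection string and table name are placeholders:

    val url = "jdbc:mysql://localhost:3306/test"
    val props = new java.util.Properties() // add user/password as needed
    val df = sqlContext.read.jdbc(url, "my_table", props)

    // Pull the table into Spark's cache once...
    df.cache()
    df.registerTempTable("my_table_cached")

    // ...then query the cached DataFrame instead of going back to MySQL.
    sqlContext.sql("SELECT COUNT(*) FROM my_table_cached").show()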

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Ted Yu
You can list the files in tmpfs in reverse chronological order and remove the oldest until you have enough space. Cheers On Sun, Jul 26, 2015 at 12:43 AM, Pa Rö paul.roewer1...@googlemail.com wrote: I have seen that the tmpfs is full; how can I clear it? 2015-07-23 13:41 GMT+02:00 Pa Rö

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
I got it, thanks for that. 2015-07-26 17:21 GMT+08:00 Paolo Platter paolo.plat...@agilelab.it: If you want a performance boost, you need to load the full table into memory using caching and then execute your query directly on the cached DataFrame. Otherwise you use Spark only as a bridge and you

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Jerrick Hoang
How big is the dataset? How complicated is the query? On Sun, Jul 26, 2015 at 12:47 AM Louis Hust louis.h...@gmail.com wrote: Hi, all, I am using a Spark DataFrame to fetch a small table from MySQL, and I found it costs much more than directly accessing MySQL using JDBC. The time cost for Spark is about

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
Look at the given url: Code can be found at: https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java 2015-07-26 16:14 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: Could you clarify how you measure the Spark time cost? Is it the total time of running the query? If

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Shixiong Zhu
Could you clarify how you measure the Spark time cost? Is it the total time of running the query? If so, it's possible because the overhead of Spark dominates for small queries. Best Regards, Shixiong Zhu 2015-07-26 15:56 GMT+08:00 Jerrick Hoang jerrickho...@gmail.com: how big is the dataset?

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
Thanks for your explanation. 2015-07-26 16:22 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: Oh, I see. That's the total time of executing a query in Spark. In that case the difference is reasonable, considering Spark has much more work to do, e.g., launching tasks in executors. Best Regards, Shixiong Zhu

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-26 Thread Zoran Jeremic
Hi, I discovered what the problem is here. The Twitter public stream is limited to 1% of all tweets (https://goo.gl/kDwnyS), which is why I can't access all the tweets posted with a specific hashtag using the approach from my previous email, so I guess this approach would not work for me. The

Writing streaming data to cassandra creates duplicates

2015-07-26 Thread Priya Ch
Hi All, I have a problem when writing streaming data to Cassandra. Our existing product is on an Oracle DB, in which locks are maintained while writing data so that duplicates in the DB are avoided. But as Spark has a parallel processing architecture, if more than 1 thread is trying to write the same

Re: RDD[Future[T]] = Future[RDD[T]]

2015-07-26 Thread Ignacio Blasco
Maybe using mapPartitions and .sequence inside it? On 7/26/2015 10:22 PM, Ayoub benali.ayoub.i...@gmail.com wrote: Hello, I am trying to convert the result I get after doing some async IO: val rdd: RDD[T] = // some rdd val result: RDD[Future[T]] = rdd.map(httpCall) Is there a way

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Peter Leventis
Thank you for the answers. I followed numerous recipes, including videos, and encountered many obstacles, such as 7-Zip being unable to unzip the *.gz file and the need to use SBT. My situation is fixed: I use a Windows 7 PC (not Linux). I would be very grateful for an approach that simply works. This is

Schema evolution in tables

2015-07-26 Thread sim
The schema merging section (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging) of the Spark SQL documentation shows an example of schema evolution in a partitioned table. Is this functionality only available when creating a Spark SQL table?
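
For reference, the merging in that docs section is driven by the partitioned directory layout rather than any explicit table definition; a condensed sketch of the docs' example, assuming the spark-shell's sc and sqlContext:

    import sqlContext.implicits._

    // Two partitions of the same table with different but compatible schemas.
    sc.parallelize(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
      .write.parquet("data/test_table/key=1")
    sc.parallelize(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
      .write.parquet("data/test_table/key=2")

    // Reading the parent directory yields the merged schema:
    // single, double, triple, plus the partition column key.
    sqlContext.read.parquet("data/test_table").printSchema()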

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Sea
This exception is so ugly!!! The screen is full of these messages when the program runs for a long time, and they do not fail the job. I commented it out in the source code. I think this information is useless because the executor is already removed and I don't know what the executor id

Spark - Cassandra (timestamp question)

2015-07-26 Thread Ivan Babic
Hi, I am using Spark to load data from Cassandra. One of the fields in the C* table is a timestamp. When queried in C* it looks like this: 2015-06-01 02:56:07-0700. After loading the data into a Spark DataFrame (using sqlContext) and printing it from there, I lose the last field (the 4-digit time zone) and then
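
Not an answer from the thread, but worth noting: java.sql.Timestamp carries no zone, so the offset is lost as soon as the value leaves C*, and Spark renders the instant in the JVM's default zone. A sketch that re-attaches an explicit offset when formatting; the DataFrame df and column name ts are assumptions:

    import java.text.SimpleDateFormat

    // "Z" prints the numeric offset, e.g. -0700 (in the executor's zone).
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ")
    df.select("ts").rdd
      .map(row => fmt.format(row.getAs[java.sql.Timestamp](0)))
      .take(5)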

Re: RDD[Future[T]] = Future[RDD[T]]

2015-07-26 Thread Ayoub Benali
It doesn't work because mapPartitions expects a function f:(Iterator[T]) ⇒ Iterator[U] while .sequence wraps the iterator in a Future 2015-07-26 22:25 GMT+02:00 Ignacio Blasco elnopin...@gmail.com: Maybe using mapPartitions and .sequence inside it? El 26/7/2015 10:22 p. m., Ayoub
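
One way to reconcile the types, sketched here as my own workaround rather than anything from the thread: sequence the futures per partition and wait inside the task, so each executor task blocks on its own batch while the driver stays free. This gives up the strictly non-blocking requirement; the stub httpCall and the timeout are assumptions:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    // Stub standing in for the question's async IO call.
    def httpCall(x: Int): Future[String] = Future { s"response-$x" }

    val rdd = sc.parallelize(1 to 100)
    val resolved = rdd.mapPartitions { iter =>
      // Start every call in the partition, then block this task (not the
      // driver) until the whole batch completes.
      val batch = Future.sequence(iter.map(httpCall).toList)
      Await.result(batch, 10.minutes).iterator
    }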

Custom partitioner

2015-07-26 Thread Hafiz Mujadid
Hi, I have CSV data containing a date-time column. I want to partition my data into 12 partitions, with each partition containing the data of one month only. I don't understand how to write such a partitioner and how to use it to read and write data. Kindly help me in this regard.

Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
Hi, all, I am using a Spark DataFrame to fetch a small table from MySQL, and I found it costs much more than directly accessing MySQL using JDBC. The time cost for Spark is about 2033 ms, versus about 16 ms for direct access. Code can be found at:

spark as a lookup engine for dedup

2015-07-26 Thread Shushant Arora
Hi, I have a requirement to process a large number of events while ignoring duplicates. Events are consumed from Kafka and each event has an eventid. It may happen that an event was already processed and comes again at some other offset. 1. Can I use a Spark RDD to persist processed events and
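
A minimal sketch of one reading of point 1, entirely my assumption rather than something from the thread: keep the processed ids in a pair RDD and left-outer-join each incoming batch against it, dropping events whose id was already seen. The growing lineage would need periodic checkpointing in practice:

    import org.apache.spark.rdd.RDD

    var seenIds: RDD[(String, Unit)] = sc.emptyRDD[(String, Unit)]

    // Returns only events whose eventid has not been processed before,
    // and folds the new ids into the lookup RDD.
    def dedup(batch: RDD[(String, String)]): RDD[(String, String)] = {
      val fresh = batch.leftOuterJoin(seenIds)
        .collect { case (id, (event, None)) => (id, event) }
      seenIds = seenIds.union(fresh.mapValues(_ => ())).cache()
      fresh
    }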

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-26 Thread Zoran Jeremic
Yes. You're right. I didn't get it till now. Thanks. On Sun, Jul 26, 2015 at 7:36 AM, Ted Yu yuzhih...@gmail.com wrote: bq. [INFO] \- org.apache.spark:spark-core_2.10:jar:1.4.0:compile I think the above notation means spark-core_2.10 is the last dependency. Cheers On Thu, Jul 23, 2015 at

RDD[Future[T]] = Future[RDD[T]]

2015-07-26 Thread Ayoub
Hello, I am trying to convert the result I get after doing some async IO: val rdd: RDD[T] = // some rdd val result: RDD[Future[T]] = rdd.map(httpCall) Is there a way to collect all futures once they are completed in a *non blocking* (i.e. without scala.concurrent Await) and lazy way? If the

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Mridul Muralidharan
Simply customize your log4j config instead of modifying code if you don't want messages from that class. Regards, Mridul On Sunday, July 26, 2015, Sea 261810...@qq.com wrote: This exception is so ugly!!! The screen is full of these messages when the program runs a long time, and they

PYSPARK_DRIVER_PYTHON=ipython spark/bin/pyspark Does not create SparkContext

2015-07-26 Thread Zerony Zhao
Hello everyone, I have a newbie question. $SPARK_HOME/bin/pyspark will create a SparkContext automatically: [Spark ASCII welcome banner] version 1.4.1 Using Python version 2.7.3 (default,

Re: Custom partitioner

2015-07-26 Thread Ted Yu
You can write a subclass of Partitioner whose getPartition() returns the partition number corresponding to the given key. Take a look at core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala for an example. Cheers On Sun, Jul 26, 2015 at 1:43 PM, Hafiz Mujadid
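
Following that pointer, a minimal sketch of such a subclass for the 12-month case, assuming the data has been keyed by a parsed java.util.Date; the pair RDD rows in the usage comment is hypothetical:

    import org.apache.spark.Partitioner

    class MonthPartitioner extends Partitioner {
      override def numPartitions: Int = 12
      override def getPartition(key: Any): Int = key match {
        case d: java.util.Date =>
          val cal = java.util.Calendar.getInstance()
          cal.setTime(d)
          cal.get(java.util.Calendar.MONTH) // 0 (January) .. 11 (December)
        case _ => 0
      }
    }

    // Usage on an RDD[(java.util.Date, String)]:
    // val byMonth = rows.partitionBy(new MonthPartitioner)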

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Ted Yu
If I read the code correctly, that error message came from CoarseGrainedSchedulerBackend. There may be existing or future error messages, other than the one cited below, which are useful. Maybe change the log level of this message to DEBUG? Cheers On Sun, Jul 26, 2015 at 3:28 PM, Mridul
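
If editing log4j.properties is inconvenient, the level can also be raised programmatically; a sketch against the log4j 1.x API that Spark 1.x bundles, naming the logger after the class cited above:

    import org.apache.log4j.{Level, Logger}

    // Suppress everything below ERROR from the scheduler backend only,
    // leaving the rest of Spark's logging untouched.
    Logger.getLogger(
      "org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend")
      .setLevel(Level.ERROR)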

unserialize error in sparkR

2015-07-26 Thread Jennifer15
Hi, I have a newbie question: I get the following error when increasing the number of samples in my sample script samplescript.R (http://apache-spark-user-list.1001560.n3.nabble.com/file/n24002/samplescript.R), which is written for Spark 1.2 (there is no error for a small sample): Error in