RE: What's the benefit of RDD checkpoint against RDD save

2016-03-24 Thread Sun, Rui
As Mark said, checkpoint() can be called before calling any action on the RDD. The choice between checkpoint and saveXXX depends. If you just want to cut the long RDD lineage, and the data won’t be re-used later, then use checkpoint, because it is simple and the checkpoint data will be cleaned
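A minimal sketch of the checkpoint route described here, assuming a spark-shell session; the paths are illustrative, not taken from the thread:

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")        // hypothetical checkpoint directory
  val rdd = sc.textFile("hdfs:///data/input").map(_.length)   // hypothetical input
  rdd.checkpoint()   // mark for checkpointing; nothing is written yet
  rdd.count()        // the action materializes the RDD, writes it to the checkpoint dir and cuts the lineage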

Testing spark with AWS spot instances

2016-03-24 Thread Dillian Murphey
I'm very new to Apache Spark. I'm just a user, not a developer. I'm running a cluster with many spot instances. Am I correct in understanding that Spark can handle an unlimited number of spot instance failures and restarts? Sometimes all the spot instances will disappear without warning, and then

Re: python support of mapWithState

2016-03-24 Thread Cyril Scetbon
It's a pity; I now have to port Python code because the support across the three languages is not the same :( > On Mar 24, 2016, at 20:58, Holden Karau wrote: > > In general the Python API lags behind the Scala & Java APIs. The Scala & Java > APIs tend to be easier to keep in

Re: python support of mapWithState

2016-03-24 Thread Holden Karau
In general the Python API lags behind the Scala & Java APIs. The Scala & Java APIs tend to be easier to keep in sync since they are both in the JVM and a bit more work is needed to expose the same functionality from the JVM in Python (or re-implement the Scala code in Python where appropriate).
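For reference, a small sketch of the Scala-side API being discussed (a running count per key), assuming pairStream is a DStream[(String, Int)] built elsewhere:

  import org.apache.spark.streaming.{State, StateSpec}

  val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
    val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)            // keep the running total per key
    (key, sum)                   // record emitted downstream
  }
  val stateful = pairStream.mapWithState(StateSpec.function(mappingFunc))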

python support of mapWithState

2016-03-24 Thread Cyril Scetbon
Hi guys, It seems that mapWithState is not supported in Python. Am I right? Is there a reason there are 3 languages supported and one is behind the two others? Thanks

nullable in spark-sql

2016-03-24 Thread Koert Kuipers
In Spark 2, is nullable treated as reliable? Or is it just a hint for efficient code generation, the optimizer, etc.? The reason I ask is that I see a lot of code generated with if statements handling null for struct fields where nullable=false
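For context, a minimal sketch (in a spark-shell where sc and sqlContext exist) of where nullable=false enters a schema; whether the engine treats it as a guarantee or only as an optimizer hint is exactly the question above:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id",   LongType,   nullable = false),
    StructField("name", StringType, nullable = true)
  ))
  val rows = sc.parallelize(Seq(Row(1L, "a"), Row(2L, null)))
  val df = sqlContext.createDataFrame(rows, schema)
  df.printSchema()   // id: long (nullable = false), name: string (nullable = true)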

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
I ran: sqlContext.cacheTable("product") var df = sqlContext.sql("...complex query...") df.explain(true) ...and obtained: http://pastebin.com/k9skERsr ...where "[...]" corresponds to huge lists of records from the addressed table (product). The query is of the following form: "SELECT

Re: sliding Top N window

2016-03-24 Thread Lars Albertsson
I am not aware of any open source examples. If you search for usages of stream-lib or Algebird, you might be lucky. Twitter uses CMSs, so they might have shared some code or presentation. We created a proprietary prototype of the solution I described, but I am not at liberty to share code. We

Re: Converting String to Datetime using map

2016-03-24 Thread Mich Talebzadeh
Thanks. Tried this scala> val a = df.filter(col("Total") > "").map(p => Invoices(p(0).toString, p(1).toString.TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(p(1),"dd/MM/yyyy"),"yyyy-MM-dd")), p(2).toString.substring(1).replace(",", "").toDouble, p(3).toString.substring(1).replace(",", "").toDouble,

Re: Converting String to Datetime using map

2016-03-24 Thread Alexander Krasnukhin
You can invoke exactly the same functions on scala side as well i.e. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ Have you tried them? On Thu, Mar 24, 2016 at 10:29 PM, Mich Talebzadeh wrote: > > Hi, > > Read a CSV in with
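A sketch of the equivalent DataFrame-API call chain, assuming the df and the "Payment date" column from the schema shown later in this thread:

  import org.apache.spark.sql.functions.{col, unix_timestamp, from_unixtime, to_date}

  // parse "dd/MM/yyyy" strings, then convert the result into a proper date column
  val withDate = df.withColumn(
    "payment_date",
    to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy")))
  )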

Re: Reading Back a Cached RDD

2016-03-24 Thread Marco Colombo
You can persist off-heap, for example with Tachyon, now called Alluxio. Take a look at off-heap persistence. Regards On Thursday, 24 March 2016, Holden Karau wrote: > Even checkpoint() is maybe not exactly what you want, since if reference > tracking is turned on it will
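A sketch of the off-heap route, assuming Spark 1.5/1.6 where OFF_HEAP data goes to an external block store such as Tachyon/Alluxio; the configuration key and store URL below are assumptions, not taken from this thread:

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("off-heap-demo")
    .set("spark.externalBlockStore.url", "alluxio://master:19998")  // assumed external block store URL
  // ... build sc from conf and create rdd ...
  rdd.persist(StorageLevel.OFF_HEAP)
  rdd.count()   // action materializes the off-heap copy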

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-24 Thread Ted Yu
Thanks, Mark. Since checkpoint may get cleaned up later on, it seems option #2 (saveXXX) is viable. On Wed, Mar 23, 2016 at 8:01 PM, Mark Hamstra wrote: > Yes, the terminology is being used sloppily/non-standardly in this thread > -- "the last RDD" after a series of

Converting String to Datetime using map

2016-03-24 Thread Mich Talebzadeh
Hi, Read a CSV in with the following schema scala> df.printSchema root |-- Invoice Number: string (nullable = true) |-- Payment date: string (nullable = true) |-- Net: string (nullable = true) |-- VAT: string (nullable = true) |-- Total: string (nullable = true) I use mapping as below

Re: Extending Spark REST API

2016-03-24 Thread Ted Yu
bq. getServletHandlers is not intended for public use From MetricsSystem.scala: private[spark] class MetricsSystem private ( Looks like there is no easy way to extend the REST API. On Thu, Mar 24, 2016 at 1:09 PM, Sebastian Kochman < sebastian.koch...@outlook.com> wrote: > Hello, > I have a

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Ted Yu
Can you obtain output from explain(true) on the query after cacheTable() call ? Potentially related JIRA: [SPARK-13657] [SQL] Support parsing very long AND/OR expressions On Thu, Mar 24, 2016 at 12:55 PM, Mohamed Nadjib MAMI wrote: > Here is the stack trace:

Re: Column explode a map

2016-03-24 Thread Michał Zieliński
Thanks Sujit, Michael, The list of columns is data driven (and in the order of 100s), but your 2nd example looks exactly like the thing I want. Appreciate the help! On 24 March 2016 at 20:20, Michael Armbrust wrote: > If you know the map keys ahead of time then you can

Re: Column explode a map

2016-03-24 Thread Michael Armbrust
If you know the map keys ahead of time then you can just extract them directly. Here are a few examples. On Thu, Mar 24, 2016 at 12:01
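A small sketch of extracting known map keys into their own columns, using the events DataFrame (columns id and map, keys a/b/c) from the original question in this thread:

  import org.apache.spark.sql.functions.col

  val exploded = events.select(
    col("id"),
    col("map").getItem("a").as("a"),
    col("map").getItem("b").as("b"),
    col("map").getItem("c").as("c")
  )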

Re: calling individual columns from spark temporary table

2016-03-24 Thread Michael Armbrust
df.filter(col("paid") > "").select(col("name1").as("newName"), ...) On Wed, Mar 23, 2016 at 6:17 PM, Ashok Kumar wrote: > Thank you again > > For > > val r = df.filter(col("paid") > "").map(x => > (x.getString(0),x.getString(1).) > > Can you give an example of column

Extending Spark REST API

2016-03-24 Thread Sebastian Kochman
Hello, I have a question: is there a way to extend Spark REST API with higher-level, application-specific handlers? (exposing additional information, like app-specific metrics, but also taking some actions within an app) Yes, I could host my own REST endpoint within an app, but then I have to

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
Here is the stack trace: http://pastebin.com/ueHqiznH Here's the code: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val table = sqlContext.read.parquet("hdfs://...parquet_table") table.registerTempTable("table") sqlContext.sql("...complex query...").show() /** works */

Re: Direct Kafka input stream and window(…) function

2016-03-24 Thread Cody Koeninger
If this is related to https://issues.apache.org/jira/browse/SPARK-14105 , are you windowing before doing any transformations at all? Try using map to extract the data you care about before windowing. On Tue, Mar 22, 2016 at 12:24 PM, Cody Koeninger wrote: > I definitely have
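A sketch of the suggestion above, assuming directKafkaStream is a DStream[(String, String)] from createDirectStream: project down to the payload you need before applying window():

  import org.apache.spark.streaming.Minutes

  val slimmed = directKafkaStream
    .map { case (_, value) => value }   // keep only the message payload
    .window(Minutes(10), Minutes(2))    // window after the transformation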

Re: Reading Back a Cached RDD

2016-03-24 Thread Holden Karau
Even checkpoint() is maybe not exactly what you want, since if reference tracking is turned on it will get cleaned up once the original RDD is out of scope and GC is triggered. If you want to share persisted RDDs right now one way to do this is sharing the same spark context (using something like

Re: No active SparkContext

2016-03-24 Thread Max Schmidt
On 2016-03-24 18:00, Mark Hamstra wrote: You seem to be confusing the concepts of Job and Application. A Spark Application has a SparkContext. A Spark Application is capable of running multiple Jobs, each with its own ID, visible in the webUI. Obviously I mixed it up, but then I would like

Column explode a map

2016-03-24 Thread Michał Zieliński
Hi, Imagine you have a structure like this: val events = sqlContext.createDataFrame( Seq( ("a", Map("a"->1,"b"->1)), ("b", Map("b"->1,"c"->1)), ("c", Map("a"->1,"c"->1)) ) ).toDF("id","map") What I want to achieve is have the map values as separate columns. Basically I

Re: Reading Back a Cached RDD

2016-03-24 Thread Nicholas Chammas
Isn’t persist() only for reusing an RDD within an active application? Maybe checkpoint() is what you’re looking for instead? On Thu, Mar 24, 2016 at 2:02 PM Afshartous, Nick wrote: > > Hi, > > > After calling RDD.persist(), is then possible to come back later and >
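Besides checkpoint(), a common way to make an RDD usable from a later, separate shell session is to save it explicitly and read it back; a sketch with illustrative paths and a placeholder record type:

  // first session
  rdd.saveAsObjectFile("hdfs:///tmp/saved-rdd")

  // later, in a new spark-shell
  val restored = sc.objectFile[MyRecord]("hdfs:///tmp/saved-rdd")   // MyRecord is hypothetical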

Re: Slaves died, but jobs not picked up by standby slaves

2016-03-24 Thread Dillian Murphey
Never mind. What I was missing was waiting long enough :-O. Sry bout that. On Thu, Mar 24, 2016 at 11:20 AM, Dillian Murphey wrote: > Had 15 slaves. > > Added 10 more. > > Shut down some slaves in the first bunch of 15. > > The 10 slaves I added are sitting there

Slaves died, but jobs not picked up by standby slaves

2016-03-24 Thread Dillian Murphey
Had 15 slaves. Added 10 more. Shut down some slaves in the first bunch of 15. The 10 slaves I added are sitting there idle. Spark did not assign idle cores to pick up the slack. What am I missing? Thanks for any help. Confused. :-i

Reading Back a Cached RDD

2016-03-24 Thread Afshartous, Nick
Hi, After calling RDD.persist(), is it then possible to come back later and access the persisted RDD? Let's say, for instance, coming back and starting a new Spark shell session. How would one access the persisted RDD in the new shell session? Thanks, -- Nick

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Mich Talebzadeh
Hi Bijay. At the moment it is only a POC: getting CSV data for invoices on a daily basis, importing it into HDFS, and storing it in an ORC table (non-transactional, as Spark cannot read from a transactional one) in a Hive database. I have written both a Hive version and a Spark version. The Hive version is pretty stable, as below.

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Bijay Pathak
Hi, I have written a UDF for doing the same in a PySpark DataFrame, since some of my dates are before the standard Unix epoch of 1/1/1970. I have more than 250 columns and am applying the custom date_format UDF to more than 50 of them. I am getting OOM errors and poor performance because of the UDF. What's your

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Mich Talebzadeh
Minor correction: the UK date format is dd/MM/yyyy scala> sql("select paymentdate, TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS newdate from tmp").first res47: org.apache.spark.sql.Row = [10/02/2014,2014-02-10] Dr Mich Talebzadeh LinkedIn *

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Kasinathan, Prabhu
Can you try this one? spark-sql> select paymentdate, TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'MM/dd/yyyy'),'yyyy-MM-dd')) from tmp; 10/02/2014 2014-10-02 spark-sql> From: Tamas Szuromi > Date: Thursday, March

Re: No active SparkContext

2016-03-24 Thread Mark Hamstra
You seem to be confusing the concepts of Job and Application. A Spark Application has a SparkContext. A Spark Application is capable of running multiple Jobs, each with its own ID, visible in the webUI. On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt wrote: > Am 24.03.2016 um

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-24 Thread Yong Zhang
The patch works as I expect. The DAG shows the broadcast join in stage 4, eliminating the following stages, with the data generated right after it (there are no shuffle writes in this stage any more). Much faster than before. If you would like this patch back-ported to 1.5 and 1.6, please vote in

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Mich Talebzadeh
Thanks guys. Unfortunately neither is working sql("select paymentdate, unix_timestamp(paymentdate) from tmp").first res28: org.apache.spark.sql.Row = [10/02/2014,null] Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Tamas Szuromi
Actually, you should run sql("select paymentdate, unix_timestamp(paymentdate, "dd/MM/yyyy") from tmp").first But keep in mind you will get a unix timestamp! On 24 March 2016 at 17:29, Mich Talebzadeh wrote: > Thanks guys. > > Unfortunately neither is working > >

Re: multiple tables for join

2016-03-24 Thread Bijay Pathak
Hi, Can you elaborate on the issues you are facing? I am doing a similar kind of join, so I may be able to provide you with some suggestions and pointers. Thanks, Bijay On Thu, Mar 24, 2016 at 5:12 AM, pseudo oduesp wrote: > hi , i spent two months of my times to

Re: Forcing data from disk to memory

2016-03-24 Thread Daniel Imberman
Hi Takeshi, Thank you for getting back to me. If this is not possible then perhaps you can help me with the root problem that caused me to ask this question. Basically I have a job where I'm loading/persisting an RDD and running queries against it. The problem I'm having is that even though

Re: apache spark errors

2016-03-24 Thread Max Schmidt
I would recommend using JedisPool with the autocloseable pattern: private JedisPool pool = new JedisPool(host, port); try (Jedis jedis = pool.getResource()) { /*do magic to jedis*/ } pool.destroy(); We use this construct successfully in a foreachPartition action. On 24.03.2016 at 15:20,

Re: apache spark errors

2016-03-24 Thread Ted Yu
Can you instantiate the Jedis class in a standalone program and see its memory footprint? Just to get some idea of how much it costs. Also, can you look for another approach which doesn't require using ThreadLocal? On Thu, Mar 24, 2016 at 7:20 AM, Michel Hubert wrote: > No. > >

Re: Best way to determine # of workers

2016-03-24 Thread Aaron Jackson
Well, that's unfortunate; it just means I have to scrape the web UI for that information. As to why: I have a cluster that is being increased in size to accommodate the processing requirements of a large set of jobs. It's useful to know when the new workers have joined the Spark cluster. In my

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Ajay Chander
Mich, Can you try setting the value of paymentdate to this format, paymentdate='2015-01-01 23:59:59', then to_date(paymentdate), and see if it helps. On Thursday, March 24, 2016, Tamas Szuromi wrote: > Hi Mich, > > Take a look >

RE: apache spark errors

2016-03-24 Thread Michel Hubert
No. But I may be on to something. I use Jedis to send data to Redis. I used a ThreadLocal construct: private static final ThreadLocal<Jedis> jedis = new ThreadLocal<Jedis>() { @Override protected Jedis initialValue() { return new Jedis("10.101.41.19", 6379); } }; and then

Re: apache spark errors

2016-03-24 Thread Ted Yu
Do you have history server enabled ? Posting your code snippet would help us understand your use case (and reproduce the leak). Thanks On Thu, Mar 24, 2016 at 6:40 AM, Michel Hubert wrote: > > > org.apache.spark > spark-core_2.10 > 1.6.1 >

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Tamas Szuromi
Hi Mich, Take a look https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#unix_timestamp(org.apache.spark.sql.Column,%20java.lang.String) cheers, Tamas On 24 March 2016 at 14:29, Mich Talebzadeh wrote: > > Hi, > > I am trying to convert

RE: apache spark errors

2016-03-24 Thread Michel Hubert
org.apache.spark:spark-core_2.10:1.6.1 org.apache.spark:spark-streaming_2.10:1.6.1 org.apache.spark:spark-streaming-kafka_2.10:1.6.1 org.elasticsearch:elasticsearch

Re: apache spark errors

2016-03-24 Thread Ted Yu
Which release of Spark are you using? Have you looked at the tasks whose IDs were printed to see if there was more of a clue? Thanks On Thu, Mar 24, 2016 at 6:12 AM, Michel Hubert wrote: > HI, > > > > I constantly get these errors: > > > > 0[Executor task launch worker-15]

apache spark errors

2016-03-24 Thread Michel Hubert
Hi, I constantly get these errors: 0 [Executor task launch worker-15] ERROR org.apache.spark.executor.Executor - Managed memory leak detected; size = 6564500 bytes, TID = 38969 310002 [Executor task launch worker-12] ERROR org.apache.spark.executor.Executor - Managed memory leak detected;

Re: No active SparkContext

2016-03-24 Thread Max Schmidt
On 24.03.2016 at 10:34, Simon Hafner wrote: > 2016-03-24 9:54 GMT+01:00 Max Schmidt >: > > we're using with the java-api (1.6.0) a ScheduledExecutor that > continuously > > executes a SparkJob to a standalone cluster. > I'd recommend Scala. Why should

Re: Create one DB connection per executor

2016-03-24 Thread Gerard Maas
Hi Manas, The approach is correct, with one caveat: you may have several tasks executing in parallel in one executor. Having one single connection per JVM will either fail, if the connection is not thread-safe, or become a bottleneck because all tasks will be competing for the same resource. The best
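A sketch of the executor-level pattern being discussed: a JVM-wide singleton that lazily opens the connection (or, better, a small pool), so every task running on that executor reuses it. The JDBC URL is a placeholder:

  import java.sql.{Connection, DriverManager}

  object ConnectionHolder {
    lazy val conn: Connection = DriverManager.getConnection("jdbc:postgresql://db:5432/app")
  }

  rdd.foreachPartition { iter =>
    val conn = ConnectionHolder.conn   // one connection per executor JVM
    iter.foreach { row =>
      // write row with conn; synchronize or use a pool if the driver is not thread-safe
    }
  }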

Create one DB connection per executor

2016-03-24 Thread Manas
I understand that using foreachPartition I can create one DB connection per partition level. Is there a way to create a DB connection per executor level and share that for all partitions/tasks run within that executor? One approach I am thinking is to have a singleton with say a getConnection

REST-API for Killing a Streaming Application

2016-03-24 Thread Christian Kurz
Hello Spark Streaming Gurus, for better automation I need to manage my Spark Streaming Applications remotely. These applications read from Kafka and therefore have a receiver job and are started via spark-submit. For now I have only found a REST-API for killing Spark applications remotely,

Re: join multiple 14 tables with lot of columns

2016-03-24 Thread Malouke
No one? Please, I need help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/join-multiple-14-tables-with-lot-of-columns-tp26533p26587.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Ted Yu
Can you pastebin the stack trace ? If you can show snippet of your code, that would help give us more clue. Thanks > On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI wrote: > > Hi all, > I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a >

Re: Write RDD to Elasticsearch

2016-03-24 Thread Ted Yu
Consider using their forum: https://discuss.elastic.co/c/elasticsearch > On Mar 24, 2016, at 3:09 AM, Jakub Stransky wrote: > > Hi, > > I am trying to write JavaPairRDD into elasticsearch 1.7 using spark 1.2.1 > using elasticsearch-hadoop 2.0.2 > >

Re: what happened if cache a RDD for multiple time?

2016-03-24 Thread charles li
Hi Yash, that really helps me, many thanks. On Thu, Mar 24, 2016 at 7:07 PM, yash datta wrote: > Yes, That is correct. > > When you call cache on an RDD, internally it calls > persist(StorageLevel.MEMORY_ONLY) which further calls > > persist(StorageLevel.MEMORY_ONLY,

multiple tables for join

2016-03-24 Thread pseudo oduesp
hi, I spent two months of my time making 10 joins with the following tables: 1 GB table1, 3 GB table2, 500 MB table3, 400 MB table4, 20 MB table5, 100 MB table6, 30 MB table7, 40 MB table8, 700 MB table9, 800 MB table10. I use hivecontext.sql("select * from table1

Spark DataFrames Array functions

2016-03-24 Thread Viktor Taranenko
Hi there, I'm experimenting with Spark ml classification in Python and would like to raise some questions. We have input data with a format like {label: "string", field1: "string", field2: "string", field3: "array[string]"}. The idea is to build the text field with specified combinations on these

Re: what happened if cache a RDD for multiple time?

2016-03-24 Thread yash datta
Yes, That is correct. When you call cache on an RDD, internally it calls persist(StorageLevel.MEMORY_ONLY) which further calls persist(StorageLevel.MEMORY_ONLY, allowOverride=false) , if the RDD is not marked for localCheckpointing Below is what is finally triggered : /** * Mark this RDD for

Re: Output is being stored on the clusters (slaves).

2016-03-24 Thread Simon Hafner
2016-03-24 11:09 GMT+01:00 Shishir Anshuman : > I am using two Slaves to run the ALS algorithm. I am saving the predictions > in a textfile using : > saveAsTextFile(path) > > The predictions is getting stored on the slaves but I want the predictions > to be saved

Write RDD to Elasticsearch

2016-03-24 Thread Jakub Stransky
Hi, I am trying to write JavaPairRDD into elasticsearch 1.7 using spark 1.2.1 using elasticsearch-hadoop 2.0.2 JavaPairRDD output = ... final JobConf jc = new JobConf(output.context().hadoopConfiguration()); jc.set("mapred.output.format.class",

Output is being stored on the clusters (slaves).

2016-03-24 Thread Shishir Anshuman
I am using two slaves to run the ALS algorithm. I am saving the predictions in a text file using: *saveAsTextFile(path)* The predictions are getting stored on the slaves, but I want them to be saved on the Master. Any suggestion on how to achieve this? I am using Standalone
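Two commonly suggested approaches, sketched here rather than taken verbatim from this thread: write to shared storage visible from the master, or, only for small results, collect to the driver and write locally; paths are illustrative:

  // write to a shared filesystem such as HDFS
  predictions.saveAsTextFile("hdfs:///user/me/predictions")

  // or, only if the result fits in driver memory
  import java.io.PrintWriter
  val out = new PrintWriter("/home/me/predictions.txt")   // path on the driver/master node
  predictions.collect().foreach(out.println)
  out.close()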

Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
Hi all, I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a problem with table caching (sqlContext.cacheTable()), using spark-shell of Spark 1.5.1. After I run the sqlContext.cacheTable(table), the sqlContext.sql(query) takes longer the first time (well, for the lazy

Re: No active SparkContext

2016-03-24 Thread Simon Hafner
2016-03-24 9:54 GMT+01:00 Max Schmidt : > we're using with the java-api (1.6.0) a ScheduledExecutor that continuously > executes a SparkJob to a standalone cluster. I'd recommend Scala. > After each job we close the JavaSparkContext and create a new one. Why do that? You can

No active SparkContext

2016-03-24 Thread Max Schmidt
Hi there, we're using, with the Java API (1.6.0), a ScheduledExecutor that continuously submits a SparkJob to a standalone cluster. After each job we close the JavaSparkContext and create a new one. But sometimes the scheduling JVM crashes with: 24.03.2016-08:30:27:375# error - Application has

what happened if cache a RDD for multiple time?

2016-03-24 Thread charles li
happened to see this problem on stackoverflow: http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812 I think it's very interesting, and I think the answer posted by Aaron sounds promising, but I'm not sure, and I don't find the details

Fwd: Forcing data from disk to memory

2016-03-24 Thread Takeshi Yamamuro
just re-sent, -- Forwarded message -- From: Takeshi Yamamuro Date: Thu, Mar 24, 2016 at 5:19 PM Subject: Re: Forcing data from disk to memory To: Daniel Imberman Hi, We have no direct approach; we need to unpersist cached
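A sketch of the workaround implied above: drop the existing cached copy, re-persist at a memory-only level, and run an action to rebuild the cache:

  import org.apache.spark.storage.StorageLevel

  rdd.unpersist(blocking = true)
  rdd.persist(StorageLevel.MEMORY_ONLY)
  rdd.count()   // forces re-materialization into memory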

Re: Unit testing framework for Spark Jobs?

2016-03-24 Thread Shiva Ramagopal
Hi Lars, Very pragmatic ideas around testing of Spark applications end-to-end! -Shiva On Fri, Mar 18, 2016 at 12:35 PM, Lars Albertsson wrote: > I would recommend against writing unit tests for Spark programs, and > instead focus on integration tests of jobs or pipelines of

Re: Best way to determine # of workers

2016-03-24 Thread Takeshi Yamamuro
Hi, There is no way to get such information from your app. Why do you need that? thanks, maropu On Thu, Mar 24, 2016 at 8:23 AM, Ajaxx wrote: > I'm building some elasticity into my model and I'd like to know when my > workers have come online. It appears at present that