Re: Thanks For a Job Well Done !!!

2016-06-18 Thread Reynold Xin
Thanks for the kind words, Krishna! Please keep the feedback coming. On Saturday, June 18, 2016, Krishna Sankar wrote: > Hi all, > Just wanted to thank all for the Dataset API - most of the time we see > only bugs on these lists ;o). > > - Putting some context, this

Re: How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Burak Yavuz
Hi Jacek, Can't you simply have a mapPartitions task throw an exception or something? Are you trying to do something more esoteric? Best, Burak On Sat, Jun 18, 2016 at 5:35 AM, Jacek Laskowski wrote: > Hi, > > Following up on this question, is a stage considered failed only
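A minimal spark-shell sketch of Burak's suggestion (assuming the shell's `sc`; the exception is arbitrary): in local mode a single task failure fails the stage immediately, while on a cluster each task is retried spark.task.maxFailures times (4 by default) before the stage is marked failed.

    // Every task attempt in this stage throws, so the stage fails once
    // retries (if any) are exhausted and the job aborts with a SparkException.
    val rdd = sc.parallelize(1 to 100, 4)
    rdd.mapPartitions[Int] { _ =>
      throw new RuntimeException("boom")
    }.count()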

Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all, Just wanted to thank all for the Dataset API - most of the time we see only bugs on these lists ;o). - Putting some context, this weekend I was updating the SQL chapters of my book - it had all the ugliness of SchemaRDD, registerTempTable, take(10).foreach(println) and

Re: Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Akhil Das
spark.executor.instances is the parameter that you are looking for. Read more here: http://spark.apache.org/docs/latest/running-on-yarn.html On Sun, Jun 19, 2016 at 2:17 AM, Natu Lauchande wrote: > Hi, > > I am running some Spark loads. I notice that it only uses one
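For reference, a hedged sketch of setting it programmatically (the numbers are placeholders, not recommendations); on YARN the executor count is not derived from cluster size, so it must be set explicitly, here via SparkConf (equivalent to the --num-executors spark-submit flag):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("emr-job")
      .set("spark.executor.instances", "3") // e.g. one per core node (hypothetical)
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)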

Re: spark streaming - how to purge old data files in data directory

2016-06-18 Thread Akhil Das
Currently, there is no out-of-the-box solution for this. However, you can use HDFS utilities to remove older files (say, 24 hours old) from the directory. Another approach is discussed here
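One possible shape of such a cleanup, as a hedged sketch (the directory path and the 24-hour cutoff are assumptions, and this is not a built-in Spark facility): list the input directory via the Hadoop FileSystem API and delete files whose modification time is older than the cutoff.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())  // picks up HDFS config from the classpath
    val dir = new Path("/data/streaming/input")   // hypothetical input directory
    val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000
    fs.listStatus(dir)
      .filter(s => s.isFile && s.getModificationTime < cutoff)
      .foreach(s => fs.delete(s.getPath, false))  // non-recursive delete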

Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
Spark Streaming does not pick up old files by default, so you need to start your job with master=local[2] (it needs 2 or more worker threads: 1 to read the files and the other to do your computation). Once the job starts running, place your input files in the input directories and you can see
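A minimal sketch of that setup (directory and batch interval are hypothetical); note that only files that appear in the directory after start() are picked up:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("file-stream")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Counts the lines of each new file dropped into the directory, per batch.
    ssc.textFileStream("/data/streaming/input").count().print()
    ssc.start()
    ssc.awaitTermination()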

spark streaming - how to purge old data files in data directory

2016-06-18 Thread Vamsi Krishna
Hi, I'm on an HDP 2.3.2 cluster (Spark 1.4.1). I have a Spark Streaming app which uses 'textFileStream' to stream and process simple CSV files. I see that old data files are left in the data directory after they are processed. What is the right way to purge the old data files in the data directory on HDFS?

Re: Creating tables for JSON data

2016-06-18 Thread brendan kehoe
Hello, I downloaded Apache Spark pre-built for Hadoop 2.6. When I create a table, an empty directory with the same name is created in /user/hive/warehouse. I created tables with the following kind of statement:

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
Hi Mich again, Regarding batch window, etc. I have provided the sources, but I'm not currently calling the window function. Did you see the program source? It's only 100 lines. https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877 Then I would expect I'm using defaults, other than

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
OK, what is the setup for these, please: batch window, window length, sliding interval? And in each batch window, how much data do you get (number of messages in the topic)? Dr Mich Talebzadeh
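For anyone following along, a sketch of where those three settings live (all values are placeholders): the batch interval is fixed when the StreamingContext is created, while window length and sliding interval are per-operation arguments and must be multiples of the batch interval.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("windowing").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))         // batch window: 5s
    val stream = ssc.socketTextStream("localhost", 9999)     // hypothetical source
    stream.window(Seconds(30), Seconds(10)).count().print()  // window length 30s, slide 10s
    ssc.start()
    ssc.awaitTermination()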

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
I believe you have an issue with performance? Have you checked the Spark GUI (default port 4040) for details, including shuffles etc.? HTH Dr Mich Talebzadeh

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I'm attaching a picture from the streaming UI. On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams wrote: > There are 25 nodes in the spark cluster. > > On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh > wrote: >> how many nodes are in your

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
There are 25 nodes in the spark cluster. On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh wrote: > how many nodes are in your cluster? > > --num-executors 6 \ > --driver-memory 4G \ > --executor-memory 2G \ > --total-executor-cores 12 \ > > > Dr Mich Talebzadeh > >

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
How many nodes are in your cluster? --num-executors 6 \ --driver-memory 4G \ --executor-memory 2G \ --total-executor-cores 12 \ Dr Mich Talebzadeh

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I updated my app to Spark 1.5.2 streaming so that it consumes from Kafka using the direct API and inserts content into an HBase cluster, as described in this thread. I was away from this project for a while due to events in my family. Currently my scheduling delay is high, but the processing time
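For context, the Spark 1.x direct-API setup looks roughly like this (broker and topic names are hypothetical; see the linked gist for the actual program). The direct stream creates one RDD partition per Kafka partition and uses no receivers.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-direct")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))
    stream.map(_._2).count().print() // message values only, counted per batch
    ssc.start()
    ssc.awaitTermination()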

Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Natu Lauchande
Hi, I am running some Spark loads. I notice that it only uses one of the machines (instead of the 3 available) in the cluster. Is there any parameter that can be set to force it to use the whole cluster? I am using AWS EMR with YARN. Thanks, Natu

Re: unsubscribe error

2016-06-18 Thread Mich Talebzadeh
How do you unsubscribe when the email says marco.plata...@yahoo.it.invalid? When the list manager sends you a confirmation to unsubscribe, it is probably bouncing back! Dr Mich Talebzadeh

unsubscribe error

2016-06-18 Thread Marco Platania
Dear admin, I've tried to unsubscribe from this mailing list twice, but I'm still receiving emails. Can you please fix this? Thanks, Marco

CfP for Spark Summit Brussels, 2016

2016-06-18 Thread Jules Damji
Hello All, Just in case you missed it, Spark Summit is returning to Europe, October 25-27, 2016, and the Call for Presentations is open. Submit your CfP before July 1: https://spark-summit.org/eu-2016/ Cheers, Jules Community Evangelist, Databricks, Inc. Sent from my iPhone. Pardon the dumb

Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Biplob Biswas
Hi, I tried local[*] and local[2] and the result is the same. I don't really understand the problem here. How can I confirm that the files are read properly? Thanks & Regards Biplob Biswas On Sat, Jun 18, 2016 at 5:59 PM, Akhil Das wrote: > Looks like you need to set your

Running JavaBased Implementation of StreamingKmeans

2016-06-18 Thread Biplob Biswas
Hi, I implemented the StreamingKMeans example provided on the Spark website, but in Java. The full implementation is here: http://pastebin.com/CJQfWNvk But I am not getting anything in the output except occasional timestamps like the one below: --- Time:
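For comparison, a sketch of the Scala skeleton of that example (paths and parameters are hypothetical; the Java version follows the same calls): training vectors arrive as text such as "[1.0,2.0,3.0]" in the input directory.

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-kmeans")
    val ssc = new StreamingContext(conf, Seconds(10))
    val training = ssc.textFileStream("/train").map(Vectors.parse) // hypothetical directory
    val model = new StreamingKMeans()
      .setK(2)
      .setDecayFactor(1.0)
      .setRandomCenters(3, 0.0) // 3-dimensional data, zero initial weight
    model.trainOn(training)     // centers update as each batch arrives
    ssc.start()
    ssc.awaitTermination()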

Re: What does it mean when a executor has negative active tasks?

2016-06-18 Thread Brandon White
1.6 On Jun 18, 2016 10:02 AM, "Mich Talebzadeh" wrote: > Could be a bug, as there are no failed jobs. What version of Spark is this? > > > HTH > > Dr Mich Talebzadeh

Re: What does it mean when a executor has negative active tasks?

2016-06-18 Thread Mich Talebzadeh
Could be a bug, as there are no failed jobs. What version of Spark is this? HTH Dr Mich Talebzadeh

Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Jacek Laskowski
Hi Mich, That's correct -- they're indeed duplicates in the table but not at the OS level. The reason for this *might* be that you need to have separate stdout and stderr for the failed execution(s). I'm using --num-executors 2 and there are two executor backends. $ jps -l 28865 sun.tools.jps.Jps 802

Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Mich Talebzadeh
Can you please run jps on the 1-node host and send the output? Some of those executor IDs are just duplicates! HTH Dr Mich Talebzadeh

Re: How to enable core dump in spark

2016-06-18 Thread Jacek Laskowski
What about the user of the NodeManagers? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Jun 16, 2016 at 10:51 PM, prateek arora

Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Jacek Laskowski
Hi, Thanks Mich and Akhil for such prompt responses! Here's the screenshot [1] which is a part of https://issues.apache.org/jira/browse/SPARK-16047 I reported today (to have the executors sorted by status and id). [1]

Re: Can I control the execution of Spark jobs?

2016-06-18 Thread Jacek Laskowski
Hi, Ahh, that makes sense now. Spark works like this by default. You just run your first pipeline and then another one (and perhaps some more). Since the pipelines are processed serially (one by one), you implicitly create a dependency between Spark jobs. No special steps are needed, as the sketch below illustrates.
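A two-line illustration of that implicit ordering (assuming an existing `sc`): each action blocks the driver, so the second job cannot start before the first finishes.

    val rdd = sc.parallelize(1 to 1000000, 8)
    val total = rdd.sum()              // job 1 runs to completion here
    val n = rdd.distinct().count()     // job 2 starts only after job 1 returns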

Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Akhil Das
A screenshot of the executor tab will explain it better. Usually executors are allocated when the job is started; if you have a multi-node cluster, you'll see executors launched on different nodes. On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski wrote: > Hi, > > This is

Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Mich Talebzadeh
Hi Jacek, Can you take a snapshot of your GUI /executors and GUI /Environment? On a single-node cluster, is the executor ID the driver? But we can find out everything from the Environment snapshot (snipping tool). HTH Dr Mich Talebzadeh

Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
Looks like you need to set your master to local[2] or local[*] On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas wrote: > Hi, > > I implemented the streamingKmeans example provided in the spark website but > in Java. > The full implementation is here, > >

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Ted Yu
scala> ds.groupBy($"_1").count.select(expr("_1").as[String], expr("count").as[Long])
res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: int, count: bigint]

scala> ds.groupBy($"_1").count.select(expr("_1").as[String], expr("count").as[Long]).show
+---+-----+
| _1|count|
+---+-----+
|  1|

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
I am curious if there is a way to call this so that it becomes a compile error rather than a runtime error: // Note misspelled count and name ds.groupBy($"name").count.select('nam, $"coun").show More specifically, what are the best type-safety guarantees that Datasets provide? It seems like with
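One way to get a compile-time check is to stay in the typed API, as in this hedged sketch (the Person case class and session setup are hypothetical, assuming Spark 2.0): field access through a typed lambda is verified by the compiler, whereas string/Column names are only resolved at analysis time.

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().master("local[*]").appName("typed-vs-untyped").getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
    ds.map(p => p.name).show() // misspelling `p.nam` would not compile
    ds.select($"name").show()  // a misspelled Column name fails only at runtime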

Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Jacek Laskowski
Hi, This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT (today's build). I can understand that when a stage fails, a new executor entry shows up in the web UI under the Executors tab (that corresponds to a stage attempt). I understand that this is to keep the stdout and stderr logs for

Re: Python to Scala

2016-06-18 Thread Sivakumaran S
If you can identify a suitable Java example in the Spark directory, you can use that as a template and convert it to Scala code using http://javatoscala.com/ Siva > On 18-Jun-2016, at 6:27 AM, Aakash Basu wrote: > > I don't have a sound

Re: Python to Scala

2016-06-18 Thread Marco Mistroni
Hi, post the code. I code in Python and Scala on Spark, so I can help you. The APIs for Scala and Python are practically the same; the only difference is Python lambdas vs Scala inline functions. HTH On 18 Jun 2016 6:27 am, "Aakash Basu" wrote: > I don't have a sound

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
Looks like it was my own fault. I had Spark 2.0 cloned/built, but had the spark-shell in my path, so somehow 1.6.1 was being used instead of 2.0. Thanks On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro wrote: > Which version do you use? > I passed in 2.0-preview as follows;

Re: How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Jacek Laskowski
Hi, Following up on this question: is a stage considered failed only when there is a FetchFailed exception? Can I have a failed stage with only a single-stage job? Appreciate any help on this... (as my family doesn't like me spending the weekend with Spark :)) Regards, Jacek Laskowski

Re: Making spark read from sources other than HDFS

2016-06-18 Thread Mich Talebzadeh
Spark is capable of reading data from a variety of sources, including ordinary non-HDFS RDBMS databases. This requires a JDBC connection for that source, which is obviously not HDFS. What sort of storage do you have in mind? Can you access it via JDBC, ODBC, etc.? HTH Dr Mich Talebzadeh
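A hedged sketch of such a JDBC read (URL, table, and credentials are hypothetical, and the JDBC driver jar must be on the classpath):

    // Assuming the spark-shell's sqlContext (Spark 1.x); in 2.x use spark.read instead.
    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb") // hypothetical URL
      .option("dbtable", "public.orders")                  // hypothetical table
      .option("user", "spark")
      .option("password", "secret")
      .load()
    df.show()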

How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Jacek Laskowski
Hi, I'm trying to see some stats about failing stages in the web UI and want to "create" a few failed stages. Is this possible using spark-shell at all? Which setup of Spark/spark-shell would allow for such a scenario? I could write Scala code if that's the only way to have failing stages. Please

Making spark read from sources other than HDFS

2016-06-18 Thread Ramprakash Ramamoorthy
Hi team, I'm running Spark in cluster mode. We have custom file storage in our organisation. Can I plug in data from these custom (non-HDFS) sources? Can you please shed some light on this: where do I start? Do I have to tweak the Spark source code? (Where exactly do I

Re: Python to Scala

2016-06-18 Thread ayan guha
Post the code... someone will be able to help (yours truly included). On Sat, Jun 18, 2016 at 4:13 PM, Yash Sharma wrote: > A couple of things that can work: > - If you know the logic, just forget the Python script and write it in > Java/Scala from scratch. > - If you have

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
Which version do you use? I passed in 2.0-preview as follows: --- Spark context available as 'sc' (master = local[*], app id = local-1466234043659). Spark session available as 'spark'. Welcome to [Spark ASCII-art version banner]

Re: Python to Scala

2016-06-18 Thread Yash Sharma
A couple of things that can work:
- If you know the logic, just forget the Python script and write it in Java/Scala from scratch.
- If you have Python functions and libraries in use, PySpark is probably the best bet.
- If you have specific questions on how to solve a particular implementation issue, you

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
'$' is just replaced with 'Column' internally. // maropu On Sat, Jun 18, 2016 at 12:59 PM, Pedro Rodriguez wrote: > Thanks Xinh and Takeshi, > > I am trying to avoid map since my impression is that this uses a Scala > closure so is not optimized as well as doing
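Concretely, a small sketch of that equivalence (the DataFrame and its column names are arbitrary): the two selects below resolve to the same Column-based plan.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[*]").appName("dollar-column").getOrCreate()
    import spark.implicits._ // brings the $"..." string interpolator into scope

    val df = Seq(("a", 1), ("b", 2)).toDF("name", "n")
    df.select($"name").show()     // $-interpolator form
    df.select(col("name")).show() // equivalent explicit Column form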