Sorting the RDD

2016-03-02 Thread Angel Angel
Hello Sir/Madam, I am trying to sort the RDD using the *sortByKey* function but I am getting the following error. My code: 1) convert the RDD array into key-value pairs; 2) after that, sort by key, but I am getting the error *No implicit Ordering defined for Any*. [image: Inline image 1] thanks
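
The "No implicit Ordering defined" error means Scala cannot find an Ordering for the key type, typically because the key ended up typed as Any or as a custom class. A minimal sketch (hypothetical data) that gives the key a concrete type so sortByKey compiles, assuming sc is the shell's SparkContext:

    // Give the key a concrete type (Int here); Any has no implicit Ordering.
    val rows = sc.parallelize(Seq(Array("3", "c"), Array("1", "a"), Array("2", "b")))
    val pairs = rows.map(a => (a(0).toInt, a(1)))
    val sorted = pairs.sortByKey()
    sorted.collect().foreach(println)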

Re: select count(*) return wrong row counts

2016-03-02 Thread Mich Talebzadeh
This works fine: scala> sql("use oraclehadoop") res1: org.apache.spark.sql.DataFrame = [result: string] scala> sql("select count(1) from sales").show +---+ |_c0| +---+ |4991761| +---+ You can do sql("select count(*) from tablename") as it is not dynamic SQL. Does it actually work?

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Xiaoye Sun
Hi Jeff and Prabhu, Thanks for your help. I looked deep into the NodeManager log and found an error message like this: 2016-03-02 03:13:59,692 ERROR org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error opening leveldb file

Re: Spark sql query taking long time

2016-03-02 Thread Ted Yu
Have you seen the thread 'Filter on a column having multiple values' where Michael gave this example ? https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/107522969592/2840265927289860/2388bac36e.html FYI On Wed, Mar 2, 2016 at

Spark sql query taking long time

2016-03-02 Thread Angel Angel
Hello Sir/Madam, I am writing one application using Spark SQL. I made a very big table using the following command *val dfCustomers1 = sc.textFile("/root/Desktop/database.txt").map(_.split(",")).map(p => Customer1(p(0),p(1).trim.toInt, p(2).trim.toInt, p(3))).toDF* Now I want to search the
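
A sketch of this pattern, assuming a Customer1 case class along the lines of the post (field names and the "city" filter are hypothetical):

    case class Customer1(name: String, age: Int, score: Int, city: String)
    import sqlContext.implicits._
    val dfCustomers1 = sc.textFile("/root/Desktop/database.txt")
      .map(_.split(","))
      .map(p => Customer1(p(0), p(1).trim.toInt, p(2).trim.toInt, p(3)))
      .toDF()
    // Search with a DataFrame filter, or registerTempTable + sqlContext.sql(...)
    dfCustomers1.filter(dfCustomers1("city") === "Delhi").show()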

Re: Fair scheduler pool details

2016-03-02 Thread Mark Hamstra
If I'm understanding you correctly, then you are correct that the fair scheduler doesn't currently do everything that you want to achieve. Fair scheduler pools currently can be configured with a minimum number of cores that they will need before accepting Tasks, but there isn't a way to restrict
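
For reference, jobs are assigned to a pool per thread via a local property, while weight, minShare and schedulingMode live in conf/fairscheduler.xml; a short sketch (the pool name is hypothetical):

    // Jobs submitted from this thread go to the named pool; null reverts to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    val n = sc.textFile("hdfs:///data/input").count()
    sc.setLocalProperty("spark.scheduler.pool", null)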

Re: Using netlib-java in Spark 1.6 on linux

2016-03-02 Thread Sean Owen
This is really more a netlib question, but I'd guess strongly that you haven't installed libgfortran on your machines. OS X doesn't need it; netlib can't provide it though. On Thu, Mar 3, 2016 at 1:06 AM, cindymc wrote: > I want to take advantage of the Breeze linear

Re: Spark 1.5 on Mesos

2016-03-02 Thread Tim Chen
You shouldn't need to specify --jars at all since you only have one jar. The error is pretty odd as it suggests it's trying to load /opt/spark/Example but that doesn't really seem to be anywhere in your image or command. Can you paste your stdout from the driver task launched by the cluster

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Benjamin Kim
I want to ask about something related to this. Does anyone know if there is or will be a command-line equivalent of the spark-shell client for Livy Spark Server or any other Spark Job Server? The reason that I am asking is that spark-shell does not handle multiple users on the same server well. Since a

Stage contains task of large size

2016-03-02 Thread Bijuna
Spark users, We are running a Spark application in standalone mode. We see warn messages in the logs which say Stage 46 contains a task of very large size (983 KB). The maximum recommended task size is 100 KB. What is the recommended approach to fix this warning? Please let me know. Thank
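
This warning usually means a large object is being captured in the task closure and shipped with every task. A hedged sketch (hypothetical lookup table) of the common fix, broadcasting the object so it is shipped once per executor instead:

    val lookup = (1 to 100000).map(i => i -> s"value$i").toMap
    val bcLookup = sc.broadcast(lookup)                       // shipped once per executor, not per task
    val resolved = sc.parallelize(1 to 1000).map(i => bcLookup.value.getOrElse(i, "missing"))
    resolved.count()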

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Prashant Sharma
*This is a known issue. * https://issues.apache.org/jira/browse/SPARK-3200 Prashant Sharma On Thu, Mar 3, 2016 at 9:01 AM, Rahul Palamuttam wrote: > Thank you Jeff. > > I have filed a JIRA under the following link : > >

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Prabhu Joseph
Are all NodeManager services restarted after the change in yarn-site.xml? On Thu, Mar 3, 2016 at 6:00 AM, Jeff Zhang wrote: > The executor may fail to start. You need to check the executor logs, if > there's no executor log then you need to check node manager log. > > On Wed,

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Jeff Zhang
I can reproduce it in spark-shell. But it works for batch job. Looks like spark repl issue. On Thu, Mar 3, 2016 at 10:43 AM, Rahul Palamuttam wrote: > Hi All, > > We recently came across this issue when using the spark-shell and zeppelin. > If we assign the sparkcontext

Re: rdd cache name

2016-03-02 Thread charles li
thanks a lot, Xinh, that's very helpful for me. On Thu, Mar 3, 2016 at 12:54 AM, Xinh Huynh wrote: > Hi Charles, > > You can set the RDD name before using it. Just do before caching: > (Scala) myRdd.setName("Charles RDD") > (Python) myRdd.setName('Charles RDD') >

Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Rahul Palamuttam
Hi All, We recently came across this issue when using the spark-shell and zeppelin. If we assign the sparkcontext variable (sc) to a new variable and reference another variable in an RDD lambda expression we get a task not serializable exception. The following three lines of code illustrate this
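
A hedged sketch of the kind of spark-shell code described above (names are illustrative); the REPL's line wrappers can drag the non-serializable SparkContext held by the new variable into a closure that only meant to capture the other variable:

    val newSc = sc
    val offset = 10
    val rdd = newSc.parallelize(0 to 100).map(x => x + offset)   // Task not serializable in the REPL
    rdd.collect()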

select count(*) return wrong row counts

2016-03-02 Thread Jesse F Chen
I am finding a strange issue with Spark SQL where "select count(*) " returns wrong row counts for certain tables. I am using TPCDS tables, so here are the actual counts: Row

Re: Fair scheduler pool details

2016-03-02 Thread Eugene Morozov
Mark, I'm trying to configure a Spark cluster to share resources between two pools. I can do that by assigning minimal shares (it works fine), but that means a specific number of cores is going to be wasted by just being ready to run anything. While that's better than nothing, I'd like to specify

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
See below, and attached is the Dockerfile to build the Spark image (by the way, I just upgraded to 1.6). I am running the setup below - Mesos Master - Docker Container Mesos Slave 1 - Docker Container Mesos Slave 2 - Docker Container Marathon - Docker Container Spark MESOS Dispatcher - Docker

Re: Mapper side join with DataFrames API

2016-03-02 Thread Deepak Gopalakrishnan
Hello, I'm using 1.6.0 on EMR On Thu, Mar 3, 2016 at 12:34 AM, Yong Zhang wrote: > What version of Spark you are using? > > I am also trying to figure out how to do the map side join in Spark. > > In 1.5.x, there is a broadcast function in the Dataframe, and it caused >
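
For reference, since 1.5 the DataFrame API exposes a broadcast hint for map-side joins; a small sketch (frame and column names are made up), assuming the shell's sqlContext:

    import org.apache.spark.sql.functions.broadcast
    val largeDF = sqlContext.range(0, 1000000).toDF("id")
    val smallDF = sqlContext.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("id", "label")
    val joined = largeDF.join(broadcast(smallDF), "id")   // planned as a broadcast (map-side) join
    joined.count()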

Using netlib-java in Spark 1.6 on linux

2016-03-02 Thread cindymc
I want to take advantage of the Breeze linear algebra libraries, built on netlib-java, used heavily by SparkML. I've found this amazingly time-consuming to figure out, and have only been able to do so on MacOS. I want to do same on Linux: $ uname -a Linux slc10whv 3.8.13-68.3.4.el6uek.x86_64 #2

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
I think this could be the reason: DataFrame sorts the columns of each record lexicographically if we do a *select **. So, if we wish to maintain a specific column ordering while processing, we should do *select col1, col2...* instead of select *. However, this is just what I feel. Let's wait

Spark job on YARN ApplicationMaster DEBUG log

2016-03-02 Thread Prabhu Joseph
Hi All, I am trying to add DEBUG for the Spark ApplicationMaster but it is not working. On running the Spark job, I passed -Dlog4j.configuration=file:/opt/mapr/spark/spark-1.4.1/conf/log4j.properties The log4j.properties has log4j.rootCategory=DEBUG, console Spark Executor Containers have DEBUG logs but

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-03-02 Thread Koert Kuipers
with the locality issue resolved, i am still struggling with the new memory management. i am seeing tasks on tiny amounts of data take 15 seconds, of which 14 are spend in GC. with the legacy memory management (spark.memory.useLegacyMode = false ) they complete in 1 - 2 seconds. since we are

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Jeff Zhang
The executor may fail to start. You need to check the executor logs, if there's no executor log then you need to check node manager log. On Wed, Mar 2, 2016 at 4:26 PM, Xiaoye Sun wrote: > Hi all, > > I am very new to spark and yarn. > > I am running a BroadcastTest

Re: getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane
Done, thanks. https://issues.apache.org/jira/browse/SPARK-13631 Will continue discussion there. On Wed, Mar 2, 2016 at 4:09 PM Shixiong(Ryan) Zhu wrote: > I think it's a bug. Could you open a ticket here: > https://issues.apache.org/jira/browse/SPARK > > On Wed, Mar

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Cool. Here is it how it goes... I am reading Avro objects from a Kafka topic as a DStream, converting it into a DataFrame so that I can filter out records based on some conditions and finally do some aggregations on these filtered records. During the process I also need to tag each record based

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Sainath Palla
Hi Tariq, Can you tell in brief what kind of operation you have to do? I can try helping you out with that. In general, if you are trying to use any group operations you can use window operations. On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq wrote: > Hi Sainath, > > Thank

getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane
We are seeing something that looks a lot like a regression from spark 1.2. When we run jobs with multiple threads, we have a crash somewhere inside getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
We’re veering off from the original question of this thread, but to clarify, my comment earlier was this: So in short, DataFrames are the “new RDD”—i.e. the new base structure you should be using in your Spark programs wherever possible. RDDs are not going away, and clearly in your case

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi Sainath, Thank you for the prompt response! Could you please elaborate your answer a bit? I'm sorry I didn't quite get this. What kind of operation I can perform using SQLContext? It just helps us during things like DF creation, schema application etc, IMHO. [image: http://] Tariq,

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Sainath Palla
Instead of collecting the data frame, you can try using a sqlContext on the data frame. But it depends on what kind of operations you are trying to perform. On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq wrote: > Hi list, > > *Scenario :* > I am creating a DStream by reading

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni
Our data is made up of single text documents scraped off the web. We store these in an RDD. A DataFrame or similar structure makes no sense at that point. And the RDD is transient. So my point is: DataFrames should not replace the plain old RDD, since RDDs allow for more flexibility, and SQL

Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi list, *Scenario :* I am creating a DStream by reading an Avro object from a Kafka topic and then converting it into a DataFrame to perform some operations on the data. I call DataFrame.collect() and perform the intended operation on each Row of Array[Row] returned by DataFrame.collect().
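
For what it's worth, the Rows returned by collect() carry the DataFrame's schema, so fields can be read by name. A small sketch with hypothetical columns standing in for the Avro fields:

    val df = sqlContext.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("id", "name")
    df.collect().foreach { r =>
      val id = r.getAs[Long]("id")
      val name = r.getAs[String]("name")
      println(s"$id -> $name")
    }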

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread ayan guha
+1 on all the pointers. @Darren - it would probably be a good idea to explain your scenario a little more in terms of structured vs. unstructured datasets. Then people here can give you better input on how you can use DF. On Thu, Mar 3, 2016 at 9:43 AM, Nicholas Chammas

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
Plenty of people get their data in Parquet, Avro, or ORC files; or from a database; or do their initial loading of un- or semi-structured data using one of the various data source libraries which help with type-/schema-inference. All of

Re: Spark 1.5 on Mesos

2016-03-02 Thread Tim Chen
Hi Charles, I thought that's fixed with your patch in latest master now right? Ashish, yes please give me your docker image name (if it's in the public registry) and what you've tried and I can see what's wrong. I think it's most likely just the configuration of where the Spark home folder is in

Re: Spark 1.5 on Mesos

2016-03-02 Thread Charles Allen
Re: Spark on Mesos Warning regarding disk space: https://issues.apache.org/jira/browse/SPARK-12330 That's a spark flaw I encountered on a very regular basis on mesos. That and a few other annoyances are fixed in https://github.com/metamx/spark/tree/v1.5.2-mmx Here's another mild annoyance

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni
Dataframes are essentially structured tables with schemas. So where does the non typed data sit before it becomes structured if not in a traditional RDD? For us almost all the processing comes before there is structure to it. Sent from my Verizon Wireless 4G LTE smartphone

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
> However, I believe, investing (or having some members of your group) learn and invest in Scala is worthwhile for few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime

Re: Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Shixiong(Ryan) Zhu
See https://issues.apache.org/jira/browse/SPARK-12591 After applying the patch, it should work. However, if you want to enable "registrationRequired", you still need to register "org.apache.spark.streaming.util.OpenHashMapBasedStateMap", "org.apache.spark.streaming.util.EmptyStateMap" and
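
A hedged sketch of that registration, using the class names given above; since they are internal to Spark, Class.forName is one way to reference them (the thread notes more classes may be needed):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        Class.forName("org.apache.spark.streaming.util.OpenHashMapBasedStateMap"),
        Class.forName("org.apache.spark.streaming.util.EmptyStateMap")))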

AVRO vs Parquet

2016-03-02 Thread Timothy Spann
Which format is the best format for Spark SQL ad hoc queries and general data storage? There are lots of specialized cases, but generally accessing some but not all of the available columns with a reasonable subset of the data. I am leaning towards Parquet as it has great support in Spark. I also

Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Aris
Hello Spark folks and especially TD, I am using the Spark Streaming 1.6 mapWithState API, and I am trying to enforce Kryo Serialization with SparkConf.set("spark.kryo.registrationRequired", "true") However, this appears to be impossible! I registered all the classes that are my own, but I

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I have had no luck, and I would like to ask the Spark committers: will this ever be designed to run on Mesos? A Spark app as a Docker container is not working at all on Mesos; if anyone would like the code I can send it over to have a look. Ashish On Wed, Mar 2, 2016 at 12:23 PM, Sathish Kumaran

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Guru Medasani
Hi Yanlin, This is a fairly new effort and is not officially released/supported by Cloudera yet. I believe those numbers will be out once it is released. Guru Medasani gdm...@gmail.com > On Mar 2, 2016, at 10:40 AM, yanlin wang wrote: > > Did any one use Livy in real world

Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Xiaoye Sun
Hi all, I am very new to Spark and YARN. I am running a BroadcastTest example application using Spark 1.6.0 and Hadoop/Yarn 2.7.1 in a 5-node cluster. I configured my configuration files according to https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation 1. copy
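
For reference, the Spark-side configuration for dynamic allocation looks roughly like the sketch below (the executor bounds are hypothetical); the YARN side additionally needs the Spark shuffle jar on the NodeManager classpath and the spark_shuffle aux-service in yarn-site.xml, as the linked docs describe:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "10")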

Avro SerDe Issue w/ Manual Partitions?

2016-03-02 Thread Chris Miller
Hi, I have a strange issue occurring when I use manual partitions. If I create a table as follows, I am able to query the data with no problem: CREATE TABLE test1 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Thanks. Once you create the jira just reply to this email with the link. On Wednesday, March 2, 2016, Ewan Leith wrote: > Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if > we can, not sure if my own scala skills will be up to it but

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if we can, not sure if my own scala skills will be up to it but perhaps one of my colleagues' will :) Ewan I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Sumedh Wale
On Thursday 03 March 2016 12:47 AM, Benjamin Kim wrote: I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am able to download the file first locally using the SFTP Client in the spark-sftp package. Then, I load the file into a dataframe using the spark-csv

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue a few times. Can you create a JIRA ticket so we can track it? Would be even better if you are interested in working on a patch! Thanks. On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Hi Reynold, yes that would be perfect for our use case. I assume it doesn't exist though, otherwise I really need to go re-read the docs! Thanks to both of you for replying by the way, I know you must be hugely busy. Ewan Are you looking for "relaxed" mode that simply return nulls for fields

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Ewan Leith
The Apache Commons library will let you access files on an SFTP server via a Java library, no local file handling involved https://commons.apache.org/proper/commons-vfs/filesystems.html Hope this helps, Ewan I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Holden Karau
So doing a quick look through the README & code for spark-sftp it seems that the way this connector works is by downloading the file locally on the driver program and this is not configurable - so you would probably need to find a different connector (and you probably shouldn't use spark-sftp for

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Are you looking for a "relaxed" mode that simply returns nulls for fields that don't exist or have an incompatible schema? On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith wrote: > Thanks Michael, it's not a great example really, as the data I'm working with > has some

SFTP Compressed CSV into Dataframe

2016-03-02 Thread Benjamin Kim
I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am able to download the file first locally using the SFTP Client in the spark-sftp package. Then, I load the file into a dataframe using the spark-csv package, which automatically decompresses the file. I just

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Thanks Michael, it's not a great example really, as the data I'm working with has some source files that do fit the schema, and some that don't (out of millions that do work, perhaps 10 might not). In an ideal world for us the select would probably return the valid records only. We're trying

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread radoburansky
I am sure you have googled this: https://github.com/holdenk/spark-testing-base On Wed, Mar 2, 2016 at 6:54 PM, SRK [via Apache Spark User List] < ml-node+s1001560n2638...@n3.nabble.com> wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I > have core spark,

Re: spark streaming

2016-03-02 Thread Vinti Maheshwari
Thanks Shixiong. Sure. Please find the details: Spark version: 1.5.2 I am doing data aggregation using checkpointing, not sure if this is causing the issue. Also, I am using a perl_kafka producer to push data to Kafka and then my Spark program is reading it from Kafka. Not sure if I need to use

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Ricardo Paiva
I use plain old JUnit. Spark batch example: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext import org.junit.AfterClass import org.junit.Assert.assertEquals import org.junit.BeforeClass import org.junit.Test object TestMyCode
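
A minimal sketch along those lines (names are illustrative): create a local SparkContext inside the test, run a small job, and assert on the result.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.junit.Assert.assertEquals
    import org.junit.Test

    class SimpleSparkTest {
      @Test def wordCountWorks(): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
        try {
          val counts = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()
          assertEquals(2, counts("a"))
        } finally {
          sc.stop()   // always stop the context so later tests can create their own
        }
      }
    }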

Re: spark streaming

2016-03-02 Thread Shixiong(Ryan) Zhu
Hey, KafkaUtils.createDirectStream doesn't need a StorageLevel as it doesn't store blocks to BlockManager. However, the error is not related to StorageLevel. It may be a bug. Could you provide more info about it? E.g., Spark version, your codes, logs. On Wed, Mar 2, 2016 at 3:02 AM, Vinti

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Silvio Fiorito
Please check out the following for some good resources: https://github.com/holdenk/spark-testing-base https://spark-summit.org/east-2016/events/beyond-collect-and-parallelize-for-tests/ On 3/2/16, 12:54 PM, "SRK" wrote: >Hi, > >What is a good unit testing

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Yin Yang
Cycling prior bits: http://search-hadoop.com/m/q3RTto4sby1Cd2rt=Re+Unit+test+with+sqlContext On Wed, Mar 2, 2016 at 9:54 AM, SRK wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I > have > core spark, spark sql with dataframes

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
-dev +user StructType(StructField(data,ArrayType(StructType(StructField( > *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), > StructField(name,StringType,true)),true),true), StructField(othertype, >

Unit testing framework for Spark Jobs?

2016-03-02 Thread SRK
Hi, What is a good unit testing framework for Spark batch/streaming jobs? I have core spark, spark sql with dataframes and streaming api getting used. Any good framework to cover unit tests for these APIs? Thanks! -- View this message in context:

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread swetha kasireddy
Thanks. I tried this yesterday and it seems to be working. On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton wrote: > Hi, > > Based on the behaviour I've seen using parquet, the number of partitions > in the DataFrame will determine the number of files in each parquet > partition.

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Sumedh Wale
On Wednesday 02 March 2016 09:39 PM, Matthias Niehoff wrote: no, not to driver and executor but to the master and worker instances of the spark standalone cluster Why exactly does adding jars to driver/executor extraClassPath not

Re: please add Christchurch Apache Spark Meetup Group

2016-03-02 Thread Sean Owen
(I have the site's svn repo handy, so I just added it.) On Wed, Mar 2, 2016 at 5:16 PM, Raazesh Sainudiin wrote: > Hi, > > Please add Christchurch Apache Spark Meetup Group to the community list > here: > http://spark.apache.org/community.html > > Our Meetup URI is:

How to achieve nested for loop in Spark

2016-03-02 Thread Vikash Kumar
Can we implement nested for/while loops in Spark? I have to convert some SQL procedure code into Spark, and it has multiple loops and processing steps that I want to implement in Spark. How can I implement this? 1. open cursor and fetch for personType 2. open cursor and fetch for personGroup
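
Typically, nested cursor loops translate into joins and grouped aggregations rather than explicit loops; a hedged sketch with made-up personType/personGroup data, assuming the shell's sqlContext:

    val personType  = sqlContext.createDataFrame(Seq((1, "employee"), (2, "contractor"))).toDF("typeId", "typeName")
    val personGroup = sqlContext.createDataFrame(Seq((1, "groupA"), (1, "groupB"), (2, "groupC"))).toDF("typeId", "groupName")
    // One pass over the data replaces the nested cursor loops.
    val perTypeCounts = personType.join(personGroup, "typeId").groupBy("typeName").count()
    perTypeCounts.show()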

Re: Spark 1.5 on Mesos

2016-03-02 Thread Sathish Kumaran Vairavelu
Try passing jar using --jars option On Wed, Mar 2, 2016 at 10:17 AM Ashish Soni wrote: > I made some progress but now i am stuck at this point , Please help as > looks like i am close to get it working > > I have everything running in docker container including mesos slave

please add Christchurch Apache Spark Meetup Group

2016-03-02 Thread Raazesh Sainudiin
Hi, Please add Christchurch Apache Spark Meetup Group to the community list here: http://spark.apache.org/community.html Our Meetup URI is: http://www.meetup.com/Christchurch-Apache-Spark-Meetup/ Thanks, Raaz

Re: rdd cache name

2016-03-02 Thread Xinh Huynh
Hi Charles, You can set the RDD name before using it. Just do before caching: (Scala) myRdd.setName("Charles RDD") (Python) myRdd.setName('Charles RDD') Reference: PySpark doc: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD Fraction cached is the percentage of partitions

Re: Building a REST Service with Spark back-end

2016-03-02 Thread yanlin wang
Did any one use Livy in real world high concurrency web app? I think it uses spark submit command line to create job... How about job server or notebook comparing with Livy? Thx, Yanlin Sent from my iPhone > On Mar 2, 2016, at 6:24 AM, Guru Medasani wrote: > > Hi Don, >

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Prabhu Joseph
Matthias, Can you check appending the jars in LAUNCH_CLASSPATH of spark-1.4.1/sbin/spark_class 2016-03-02 21:39 GMT+05:30 Matthias Niehoff : > no, not to driver and executor but to the master and worker instances of > the spark standalone cluster > > Am 2.

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I made some progress but now I am stuck at this point. Please help, as it looks like I am close to getting it working. I have everything running in Docker containers, including the Mesos slave and master. When I try to submit the Pi example I get the error below: *Error: Cannot load main class from JAR

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Matthias Niehoff
no, not to driver and executor but to the master and worker instances of the spark standalone cluster Am 2. März 2016 um 17:05 schrieb Igor Berman : > spark.driver.extraClassPath > spark.executor.extraClassPath > > 2016-03-02 18:01 GMT+02:00 Matthias Niehoff < >

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Igor Berman
spark.driver.extraClassPath spark.executor.extraClassPath 2016-03-02 18:01 GMT+02:00 Matthias Niehoff : > Hi, > > we want to add jars to the Master and Worker class path mainly for logging > reason (we have a redis appender to send logs to redis -> logstash -> >

Add Jars to Master/Worker classpath

2016-03-02 Thread Matthias Niehoff
Hi, we want to add jars to the Master and Worker class path mainly for logging reason (we have a redis appender to send logs to redis -> logstash -> elasticsearch). While it is working with setting SPARK_CLASSPATH, this solution is afaik deprecated and should not be used. Furthermore we are also

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Daniel Siegmann
In the past I have seen this happen when I filled up HDFS and some core nodes became unhealthy. There was no longer anywhere to replicate the data. From your command it looks like you should have 1 master and 2 core nodes in your cluster. Can you verify both the core nodes are healthy? On Wed,

Re: Configuring Ports for Network Security

2016-03-02 Thread Guru Prateek Pinnadhari
Thanks for your response. End users and developers in our scenario need terminal / SSH access to the cluster. So cluster isolation from external networks is not an option. We use a Hortonworks based hadoop cluster. Knox is useful but as users also have shell access, we need iptables. Even

org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow

2016-03-02 Thread dmt
Hi, the following error is raised using Spark 1.5.2 or 1.6.0, in stand alone mode, on my computer. Has anyone had the same problem, and do you know what might cause this exception ? Thanks in advance. /16/03/02 15:12:27 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, 192.168.1.36):

Re: Configuring Ports for Network Security

2016-03-02 Thread Jörn Franke
You can make the nodes non-reachable from any computer external to the cluster. Applications can be deployed on an edge node that is connected to the cluster. Do you use Hadoop for managing the cluster? Then you may want to look at Apache Knox. > On 02 Mar 2016, at 15:14, zgpinnadhari

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Guru Medasani
Hi Don, Here is another REST interface for interacting with Spark from anywhere. https://github.com/cloudera/livy Here is an example to estimate PI using Spark from Python using requests library. >>> data = { ... 'code': textwrap.dedent("""\ ... val

Configuring Ports for Network Security

2016-03-02 Thread zgpinnadhari
Hi, We want to use Spark in a secure cluster with iptables enabled. For this, we need a specific list of ports used by Spark so that we can whitelist them. From what I could learn from http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security there are

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Todd Nist
Have you looked at Apache Toree, http://toree.apache.org/. This was formerly the Spark-Kernel from IBM but contributed to apache. https://github.com/apache/incubator-toree You can find a good overview on the spark-kernel here:

Re: SparkR Count vs Take performance

2016-03-02 Thread Dirceu Semighini Filho
Thanks Sun, this explains why I was getting too many jobs running; my RDDs were empty. 2016-03-02 10:29 GMT-03:00 Sun, Rui : > This is nothing to do with object serialization/deserialization. It is > expected behavior that take(1) most likely runs slower than count() on an >

RE: SparkR Count vs Take performance

2016-03-02 Thread Sun, Rui
This is nothing to do with object serialization/deserialization. It is expected behavior that take(1) most likely runs slower than count() on an empty RDD. This is all about the algorithm with which take() is implemented. Take() 1. Reads one partition to get the elements 2. If the fetched

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Vinti Maheshwari
Increasing Spark_executors_instances to 4 worked. SPARK_EXECUTOR_INSTANCES="4" #Number of workers to start (Default: 2) Regards, Vinti On Wed, Mar 2, 2016 at 4:28 AM, Vinti Maheshwari wrote: > Thanks much Saisai. Got it. > So i think increasing worker executor memory

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Vinti Maheshwari
Thanks much Saisai. Got it. So i think increasing worker executor memory might work. Trying that. Regards, ~Vinti On Wed, Mar 2, 2016 at 4:21 AM, Saisai Shao wrote: > You don't have to specify the storage level for direct Kafka API, since it > doesn't require to store

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Saisai Shao
You don't have to specify the storage level for direct Kafka API, since it doesn't require to store the input data ahead of time. Only receiver-based approach could specify the storage level. Thanks Saisai On Wed, Mar 2, 2016 at 7:08 PM, Vinti Maheshwari wrote: > Hi All,
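
For reference, a sketch of the direct API the thread is discussing (broker and topic values are hypothetical, sc is assumed to exist, and spark-streaming-kafka must be on the classpath); note there is no StorageLevel parameter, unlike the receiver-based createStream:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("mytopic")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)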

Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Vinti Maheshwari
Hi All, I wanted to set *StorageLevel.MEMORY_AND_DISK_SER* in my spark-streaming program as currently I am getting a MetadataFetchFailedException. I am not sure where I should pass StorageLevel.MEMORY_AND_DISK, as it seems like createDirectStream doesn't allow passing that parameter. val

Re: [Spark 1.5.2]: Iterate through Dataframe columns and put it in map

2016-03-02 Thread Mohammad Tariq
Hi Divya, You could call the *collect()* method provided by the DataFrame API. This will give you an *Array[Row]*. You could then iterate over this array and create your map. Something like this : val mapOfVals = scala.collection.mutable.Map[String,String]() var rows = DataFrame.collect() rows.foreach(r
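
A hedged completion of that snippet (column names are hypothetical), with a tiny DataFrame included so it runs on its own; note that for duplicate keys, later rows overwrite earlier values:

    val df = sqlContext.createDataFrame(Seq(("Car", "Model1"), ("Bike", "Model2"))).toDF("Column1", "Column2")
    val mapOfVals = scala.collection.mutable.Map[String, String]()
    df.collect().foreach(r => mapOfVals.put(r.getAs[String]("Column1"), r.getAs[String]("Column2")))
    println(mapOfVals)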

spark streaming

2016-03-02 Thread Vinti Maheshwari
Hi All, I wanted to set *StorageLevel.MEMORY_AND_DISK_SER* in my spark-streaming program as currently I am getting a MetadataFetchFailedException. I am not sure where I should pass StorageLevel.MEMORY_AND_DISK, as it seems like createDirectStream doesn't allow passing that parameter. val

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Oleg Ruchovets
Here is my command: aws emr create-cluster --release-label emr-4.3.0 --name "ClusterJava8" --use-default-roles --applications Name=Ganglia Name=Hive Name=Hue Name=Mahout Name=Pig Name=Spark --ec2-attributes KeyName=CC-ES-Demo --instance-count 3 --instance-type m3.xlarge

Inconsistent performance across multiple iterations of same application

2016-03-02 Thread Harsh Rathi
Hi, I am doing spark-submit with same resources from a bash script on EC2 Cluster. But, the time taken in running this application is varying a lot from 5 mins to 60 mins across multiple iterations. I am restarting spark cluster with stop-all.sh and start-all.sh scripts after every run of

[Spark 1.5.2]: Iterate through Dataframe columns and put it in map

2016-03-02 Thread Divya Gehlot
Hi, I need to iterate through columns in a dataframe based on a certain condition and put them in a map. Dataset (Column1, Column2): (Car, Model1), (Bike, Model2), (Car, Model2), (Bike, Model2). I want to iterate through the above dataframe and put it in a map where Car is the key and Model1 and Model2

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread James Hammerton
Hi, Based on the behaviour I've seen using parquet, the number of partitions in the DataFrame will determine the number of files in each parquet partition. I.e. when you use "PARTITION BY" you're actually partitioning twice, once via the partitions spark has created internally and then again
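
In other words, repartition/coalesce before the write controls how many files land in each partition directory; a short sketch with made-up data and path:

    val df = sqlContext.createDataFrame(Seq((1, "2016-03-01"), (2, "2016-03-02"))).toDF("id", "date")
    // At most 4 files per "date" directory, since the DataFrame has 4 partitions when written.
    df.repartition(4).write.partitionBy("date").parquet("/tmp/parquet-out")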

rdd cache name

2016-03-02 Thread charles li
Hi there, I feel a little confused about *cache* in Spark. First, is there any way to *customize the cached RDD name*? It's not convenient for me when looking at the storage page; it shows the kind of RDD in the RDD Name column, and I would like it to show my customized name instead of things like 'rdd 1',

Re: Spark Mllib kmeans execution

2016-03-02 Thread Sonal Goyal
It will run distributed On Mar 2, 2016 3:00 PM, "Priya Ch" wrote: > Hi All, > > I am running k-means clustering algorithm. Now, when I am running the > algorithm as - > > val conf = new SparkConf > val sc = new SparkContext(conf) > . > . > val kmeans = new

Spark Mllib kmeans execution

2016-03-02 Thread Priya Ch
Hi All, I am running k-means clustering algorithm. Now, when I am running the algorithm as - val conf = new SparkConf val sc = new SparkContext(conf) . . val kmeans = new KMeans() val model = kmeans.run(RDD[Vector]) . . . The 'kmeans' object gets created on driver. Now does *kmeans.run() *get
