Re: detecting last record of partition

2016-10-13 Thread Holden Karau
It sounds like mapPartitionsWithIndex, rather than flatMap, will give you the information you want. On Thursday, October 13, 2016, Shushant Arora wrote: > Hi > > I have a transformation on a pair rdd using flatmap function. > > 1.Can I detect in flatmap whether the current
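For illustration, a minimal Scala sketch of the mapPartitionsWithIndex approach; the (String, Int) element type and the function name are hypothetical. The partition index arrives as the first argument, and a record is the last one of its partition exactly when the iterator has nothing left after it:

  import org.apache.spark.rdd.RDD

  // Tag each record with its partition index and whether it is the last record of that partition.
  def tagLastRecord(rdd: RDD[(String, Int)]): RDD[(String, Int, Int, Boolean)] =
    rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      // iter.hasNext is checked after the current element has been pulled,
      // so it is false only for the final record of the partition.
      iter.map { case (k, v) => (k, v, partitionIndex, !iter.hasNext) }
    }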

SparkR execution hangs when handling an RDD converted from a DataFrame

2016-10-13 Thread Lantao Jin
sqlContext <- sparkRHive.init(sc)
sqlString <- "SELECT key_id, rtl_week_beg_dt rawdate, gmv_plan_rate_amt value FROM metrics_moveing_detection_cube"
df <- sql(sqlString)
rdd <- SparkR:::toRDD(df)
# hang on case one: take from rdd
# take(rdd, 3)
# hang on case two: convert back to dataframe

detecting last record of partition

2016-10-13 Thread Shushant Arora
Hi, I have a transformation on a pair RDD using a flatMap function. 1. Can I detect in flatMap whether the current record is the last record of the partition being processed, and 2. what is the partition index of this partition? public Iterable> call(Tuple2 t) throws

[Spark 2.0.0] error when unioning to an empty dataset

2016-10-13 Thread Efe Selcuk
I have a use case where I want to build a dataset based on conditionally available data. I thought I'd do something like this:
  case class SomeData( ... )  // parameters are basic encodable types like strings and BigDecimals
  var data = spark.emptyDataset[SomeData]
  // loop, determining what
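For context, a hedged Scala sketch of the pattern being described (the SomeData fields and values are invented; the thread is about the error this pattern triggers against an empty Dataset in 2.0.0):

  import org.apache.spark.sql.{Dataset, SparkSession}

  case class SomeData(id: String, amount: BigDecimal)  // hypothetical fields

  val spark = SparkSession.builder().appName("union-empty").master("local[*]").getOrCreate()
  import spark.implicits._

  // Start from an empty Dataset and conditionally union batches onto it.
  var data: Dataset[SomeData] = spark.emptyDataset[SomeData]
  val batch = Seq(SomeData("a", BigDecimal(1.5))).toDS()
  data = data.union(batch)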

Re: No way to set mesos cluster driver memory overhead?

2016-10-13 Thread drewrobb
It seems like this is a real issue, so I've opened a JIRA ticket: https://issues.apache.org/jira/browse/SPARK-17928

Java.util.ArrayList is not a valid external type for schema of array

2016-10-13 Thread Mohamed Nadjib MAMI
In Spark 1.5.2 I had a job that read from a textFile and saved some data into a Parquet table. One value was of type `ArrayList` and was successfully saved as an "array" column in the Parquet table. I upgraded to Spark 2.0.1 and changed the necessary code (SparkConf to SparkSession, DataFrame
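A hedged sketch of one common fix in Spark 2.x: convert the java.util.ArrayList to a Scala Seq before putting it into a Row, since the 2.x encoders expect Scala collections for ArrayType columns (the column name, schema, and output path below are invented for illustration):

  import scala.collection.JavaConverters._
  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

  val spark = SparkSession.builder().appName("array-column").master("local[*]").getOrCreate()

  val javaList = new java.util.ArrayList[String]()
  javaList.add("a"); javaList.add("b")

  val schema = StructType(Seq(StructField("values", ArrayType(StringType))))
  // Passing the ArrayList itself is what triggers "not a valid external type for schema of array".
  val rows = Seq(Row(javaList.asScala.toSeq))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  df.write.mode("overwrite").parquet("/tmp/array_example")  // hypothetical path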

How to spark-submit using python subprocess module?

2016-10-13 Thread Vikram Kone
I have a Python script that is used to submit Spark jobs using the spark-submit tool. I want to execute the command and write the output both to STDOUT and a logfile in real time. I'm using Python 2.7 on an Ubuntu server. This is what I have so far in my SubmitJob.py script: #!/usr/bin/python #

Re: Can mapWithState state func be called every batchInterval?

2016-10-13 Thread manasdebashiskar
Actually, each element of mapWithState has a timeout component. You can write a function to "treat" your timeout. You can match it with your batch size and do fun stuff when the batch ends. People do session management with the same approach. When activity is registered the session is
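A minimal Scala sketch of the timeout handling described above, assuming a simple (String, Long) session counter; names and durations are illustrative only:

  import org.apache.spark.streaming.{Durations, State, StateSpec}

  def trackSessions(key: String, value: Option[Long], state: State[Long]): Option[(String, Long)] = {
    if (state.isTimingOut()) {
      // "Treat" the timeout here, e.g. emit or persist the final session value.
      Some((key, state.get()))
    } else {
      val updated = state.getOption().getOrElse(0L) + value.getOrElse(0L)
      state.update(updated)
      Some((key, updated))
    }
  }

  val spec = StateSpec.function(trackSessions _)
    .timeout(Durations.minutes(30))  // keys idle for 30 minutes hit the isTimingOut branch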

Re: Re-partitioning mapwithstateDstream

2016-10-13 Thread manasdebashiskar
StateSpec has a method numPartitions to set the initial number of partitions. That should do the trick. ...Manas
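For completeness, numPartitions chains onto the same StateSpec builder; a small sketch reusing the mapping-function shape from the previous example (the body is omitted and the values are illustrative):

  import org.apache.spark.streaming.{Durations, State, StateSpec}

  def mappingFunc(key: String, value: Option[Long], state: State[Long]): Option[(String, Long)] = None

  val spec = StateSpec.function(mappingFunc _)
    .numPartitions(32)               // initial number of partitions for the state RDDs
    .timeout(Durations.minutes(30))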

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful, and I'd also completely forgotten about the lack of built-in UDAF support, which is rather important. There is a PR to make it easier to call/register JVM UDFs from Python, which will hopefully help a bit there too. I'm getting

Re: No way to set mesos cluster driver memory overhead?

2016-10-13 Thread Michael Gummelt
We see users run in both the dispatcher and Marathon. I generally prefer Marathon, because there's a higher likelihood it has some feature you need that the dispatcher lacks (like in this case). It doesn't look like we support overhead for the driver. On Thu, Oct 13, 2016 at 10:42

Re: No way to set mesos cluster driver memory overhead?

2016-10-13 Thread Rodrick Brown
On Thu, Oct 13, 2016 at 1:42 PM, drewrobb wrote: > When using spark on mesos and deploying a job in cluster mode using > dispatcher, there appears to be no memory overhead configuration for the > launched driver processes ("--driver-memory" is the same as Xmx which is > the >

Re: spark on mesos memory sizing with offheap

2016-10-13 Thread Michael Gummelt
It doesn't look like we are. Can you file a JIRA? A workaround is to set spark.mesos.executor.memoryOverhead to be at least spark.memory.offHeap.size. This is how the container is sized:
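A hedged sketch of that workaround in SparkConf form, assuming the documented property names (spark.mesos.executor.memoryOverhead is in MiB); the values are examples only:

  import org.apache.spark.SparkConf

  // Off-heap memory is not counted in the Mesos container sizing, so bump the
  // executor overhead to at least the off-heap size as a workaround.
  val conf = new SparkConf()
    .set("spark.executor.memory", "3g")
    .set("spark.memory.offHeap.enabled", "true")
    .set("spark.memory.offHeap.size", "1g")
    .set("spark.mesos.executor.memoryOverhead", "1024")  // MiB; >= off-heap size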

No way to set mesos cluster driver memory overhead?

2016-10-13 Thread drewrobb
When using Spark on Mesos and deploying a job in cluster mode using the dispatcher, there appears to be no memory overhead configuration for the launched driver processes ("--driver-memory" is the same as Xmx, which is the same as the memory quota). This makes it almost a guarantee that a long-running

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
Hi, On 10/13/2016 04:35 PM, Cody Koeninger wrote: So I see in the logs that PIDRateEstimator is choosing a new rate, and the rate it's choosing is 100. But it's always choosing 100, while all the other variables (processing time, latestRate, etc.) change. Also, the records per batch is

Spark 2.0.0 TreeAggregate with larger depth will be OOM?

2016-10-13 Thread Jy Chen
Hi all, I'm using Spark 2.0.0 to train a model with more than 10 million parameters on about 500 GB of data. treeAggregate is used to aggregate the gradient. When I set depth = 2 or 3 it works, and depth = 3 is faster, so I set depth = 4 to obtain better performance, but now some executors will be OOM
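For reference, a toy sketch of how the depth parameter is passed to treeAggregate (the data here is tiny; the real job aggregates much larger gradient vectors):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("tree-aggregate").setMaster("local[*]"))

  // Toy "gradients": one small Array[Double] per record, spread over many partitions.
  val gradients = sc.parallelize(Seq.fill(1000)(Array(1.0, 2.0, 3.0)), numSlices = 100)

  val summed = gradients.treeAggregate(Array(0.0, 0.0, 0.0))(
    seqOp  = (acc, g) => acc.zip(g).map { case (a, b) => a + b },
    combOp = (a, b) => a.zip(b).map { case (x, y) => x + y },
    depth  = 3  // each extra level adds a round of intermediate aggregation before the driver
  )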

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Cody Koeninger
So I see in the logs that PIDRateEstimator is choosing a new rate, and the rate it's choosing is 100. That happens to be the default minimum of an (apparently undocumented) setting, spark.streaming.backpressure.pid.minRate. Try setting that to 1 and see if there's different behavior. BTW, how
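For reference, a hedged sketch of the settings discussed in this thread as they would appear on a SparkConf; the values are examples, not recommendations:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.streaming.backpressure.enabled", "true")
    .set("spark.streaming.backpressure.pid.minRate", "1")       // default minimum is 100
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // hard upper bound per partition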

pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-13 Thread Pietro Pugni
Hi there, I opened a question on StackOverflow at this link: http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972 I didn’t get any useful answer, so I’m writing here hoping that someone can

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Cody Koeninger
As Sean said, it's unreleased. If you want to try it out, build Spark: http://spark.apache.org/docs/latest/building-spark.html The easiest way to include the jar is probably to use mvn install to put it in your local repository, then link it in your application's mvn or sbt build file as
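A hedged sketch of the sbt side of that suggestion, assuming the locally built artifacts land in your local Maven repository with a snapshot version such as 2.1.0-SNAPSHOT (the actual version depends on the branch you build):

  // build.sbt (sketch; the version string is whatever your local build installed)
  resolvers += Resolver.mavenLocal

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-sql" % "2.1.0-SNAPSHOT" % "provided",
    "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0-SNAPSHOT"
  )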

spark on mesos memory sizing with offheap

2016-10-13 Thread vincent gromakowski
Hi, I am trying to understand how Mesos allocates memory when off-heap is enabled, but it seems that the framework only takes heap + 400 MB overhead into consideration for resource allocation. Example: spark.executor.memory=3g, spark.memory.offheap.size=1g ==> Mesos reports 3.4g allocated for

Re: Spyder and SPARK combination problem...Please help!

2016-10-13 Thread innocent73
Finally I found the solution! I changed Python's working directory settings as below:
  import os
  import sys
  os.chdir("C:\Python27")
  os.curdir
and it works like a charm :)

Spark security

2016-10-13 Thread Mendelson, Assaf
Hi, We have a Spark cluster and we wanted to add some security to it. I was looking at the documentation (at http://spark.apache.org/docs/latest/security.html) and had some questions. 1. Do all executors listen on the same blockManager port? For example, in YARN there are multiple

Re: spark with kerberos

2016-10-13 Thread Saisai Shao
I think security has nothing to do with which API you use, Spark SQL or the RDD API. Assuming you're running on a YARN cluster (currently the only cluster manager that supports Kerberos), you first need to get a Kerberos TGT in your local spark-submit process; after being authenticated by Kerberos,

Re: spark with kerberos

2016-10-13 Thread Denis Bolshakov
The problem happens when writing (reading works fine) with rdd.saveAsNewAPIHadoopFile. We use just RDD and HDFS, nothing else. Spark version 1.6.1. `Cluster A` - CDH 5.7.1, `Cluster B` - vanilla Hadoop 2.6.5, `Cluster C` - CDH 5.8.0. Best regards, Denis On 13 October 2016 at 13:06, ayan guha

Re: RowMatrix from DenseVector

2016-10-13 Thread Meeraj Kunnumpurath
Apologies, my oversight; I had a mix of mllib and ml imports. On Thu, Oct 13, 2016 at 2:27 PM, Meeraj Kunnumpurath <mee...@servicesymphony.com> wrote: > Hello, > > How do I create a row matrix from a dense vector. The following code, > doesn't compile. > > val features = df.rdd.map(r =>
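For the record, a sketch with consistent mllib imports (RowMatrix lives in org.apache.spark.mllib.linalg.distributed and expects mllib Vectors, not the newer ml ones); df is assumed to be the DataFrame from the original post:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val features = df.rdd.map(r =>
    Vectors.dense(r.getAs[Double]("constant"), r.getAs[Double]("sqft_living")))

  val rowMatrix = new RowMatrix(features, features.count(), 2)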

RowMatrix from DenseVector

2016-10-13 Thread Meeraj Kunnumpurath
Hello, How do I create a row matrix from a dense vector? The following code doesn't compile:
  val features = df.rdd.map(r => Vectors.dense(r.getAs[Double]("constant"), r.getAs[Double]("sqft_living")))
  val rowMatrix = new RowMatrix(features, features.count(), 2)
The compiler error: Error:(24,

[1.6.0] Skipped stages keep increasing and causes OOM finally

2016-10-13 Thread Mungeol Heo
Hello, My task is to update a dataframe in a while loop until there is no more data to update. The Spark SQL I use is like below:
  val hc = sqlContext
  hc.sql("use person")
  var temp_pair = hc.sql(""" select ROW_NUMBER() OVER (ORDER

Re: spark with kerberos

2016-10-13 Thread ayan guha
And a little more detail on the Spark version, Hadoop version, and distribution would also help... On Thu, Oct 13, 2016 at 9:05 PM, ayan guha wrote: > I think one point you need to mention is your target - HDFS, Hive or Hbase > (or something else) and which end points are used.

Re: spark with kerberos

2016-10-13 Thread ayan guha
I think one point you need to mention is your target - HDFS, Hive or HBase (or something else) and which endpoints are used. On Thu, Oct 13, 2016 at 8:50 PM, dbolshak wrote: > Hello community, > > We've a challenge and no ideas how to solve it. > > The problem, > >

spark with kerberos

2016-10-13 Thread dbolshak
Hello community, We have a challenge and no idea how to solve it. The problem: say we have the following environment: 1. `cluster A`, a cluster that does not use Kerberos; we use it as a source of data, and, importantly, we don't manage this cluster. 2. `cluster B`, a small cluster where our

Spark with kerberos

2016-10-13 Thread Denis Bolshakov
Hello community, We have a challenge and no idea how to solve it. The problem: say we have the following environment: 1. `cluster A`, a cluster that does not use Kerberos; we use it as a source of data, and, importantly, we don't manage this cluster. 2. `cluster B`, a small cluster where our

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Shady Xu
All nodes of my YARN cluster are running Java 7, but I submit the job from a Java 8 client. I realised I run the job in yarn-cluster mode, which is why setting '--driver-java-options' is effective. Now the problem is: why does submitting a job from a Java 8 client to a Java 7 cluster cause a

Re: DataFrame API: how to partition by a "virtual" column, or by a nested column?

2016-10-13 Thread Samy Dindane
This partially answers the question: http://stackoverflow.com/a/35449563/604041 On 10/04/2016 03:10 PM, Samy Dindane wrote: Hi, I have the following schema:
  root
   |- timestamp
   |- date
   |- year
   |- month
   |- day
   |- some_column
   |- some_other_column
I'd like to achieve either of these: 1)
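One common approach for the "virtual column" case is to materialize the derived columns before writing and then partition on them; a hedged Scala sketch, with df assumed to hold the schema above and the output path invented:

  import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

  val withParts = df
    .withColumn("year", year(col("timestamp")))
    .withColumn("month", month(col("timestamp")))
    .withColumn("day", dayofmonth(col("timestamp")))

  withParts.write
    .partitionBy("year", "month", "day")
    .parquet("/tmp/partitioned_output")  // hypothetical path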

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
Hey Cody, Thanks for the reply. Really helpful. Following your suggestion, I set spark.streaming.backpressure.enabled to true and maxRatePerPartition to 10. I know I can handle 100k records at the same time, but definitely not in 1 second (the batchDuration), so I expect the backpressure

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
You can specify it; it just doesn't do anything but cause a warning in Java 8. It won't work in general to have such a tiny PermGen. If it's working it means you're on Java 8 because it's ignored. You should set MaxPermSize if anything, not PermSize. However the error indicates you are not using

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Shady Xu
Solved the problem by specifying the PermGen size when submitting the job (even to just a few MB). Seems Java 8 has removed the Permanent Generation space, thus corresponding JVM arguments are ignored. But I can still use --driver-java-options "-XX:PermSize=80M -XX:MaxPermSize=100m" to specify

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
The error doesn't say you're out of memory, but says you're out of PermGen. If you see this, you aren't running Java 8 AFAIK, because 8 has no PermGen. But if you're running Java 7, and you go investigate what this error means, you'll find you need to increase PermGen. This is mentioned in the

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Mich Talebzadeh
Add --jars /spark-streaming-kafka_2.10-1.5.1.jar (you may need to download the jar file, or any newer version) to spark-shell. I also have spark-streaming-kafka-assembly_2.10-1.6.1.jar on the --jars list. HTH Dr Mich Talebzadeh

OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Shady Xu
Hi, I have a problem when running Spark SQL by PySpark on Java 8. Below is the log.
  16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: PermGen space at

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Sean Owen
I don't believe that's been released yet. It looks like it was merged into branches about a week ago. You're looking at unreleased docs too - have a look at http://spark.apache.org/docs/latest/ for the latest released docs. On Thu, Oct 13, 2016 at 9:24 AM JayKay

Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread JayKay
I want to work with the Kafka integration for structured streaming. I use Spark version 2.0.0 and start the spark-shell with:
  spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
As described here:

Re: Anyone attending spark summit?

2016-10-13 Thread Andrew Gelinas
ANDREW! Thank you. The code worked, you're a legend. I was going to register today and have now saved €€€. Owe you a beer. Gregory 2016-10-12 10:04 GMT+09:00 Andrew James : > Hey, I just found a promo code for Spark Summit Europe that saves 20%. > It's

receiving stream data options

2016-10-13 Thread vr spark
Hi, I have a continuous REST API stream that keeps emitting data in the form of JSON. I access the stream using Python's requests.get(url, stream=True, headers=headers). I want to receive the records in Spark and do further processing. I am not sure which is the best way to receive them in Spark. What are
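One option (a hedged sketch only, not necessarily the best fit) is a custom Spark Streaming receiver that reads the chunked HTTP response line by line and stores each JSON record; the endpoint and threading details are placeholders:

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.URL
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class RestStreamReceiver(endpoint: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      // Read the HTTP stream on a background thread and push each line (one JSON record) to Spark.
      new Thread("rest-stream-receiver") {
        override def run(): Unit = {
          val reader = new BufferedReader(new InputStreamReader(new URL(endpoint).openStream()))
          var line = reader.readLine()
          while (!isStopped() && line != null) {
            store(line)
            line = reader.readLine()
          }
          reader.close()
        }
      }.start()
    }

    def onStop(): Unit = {}  // nothing to clean up; the reading thread checks isStopped()
  }

  // Usage: ssc.receiverStream(new RestStreamReceiver("https://example.com/stream"))  // hypothetical URL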