SparkSQL DAG generation, DAG optimization, DAG execution

2016-09-09 Thread Rabin Banerjee
Hi All, I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally failing with a GC overhead error. Here is the program: 1. Read 3 files, one medium-sized and 2 small, and register them as DataFrames. 2. Fire SQL with complex aggregation and

Re: SparkR error: reference is ambiguous.

2016-09-09 Thread Bedrytski Aliaksandr
Hi, Can you use full-string queries in SparkR? Like (in Scala): df1.registerTempTable("df1") df2.registerTempTable("df2") val df3 = sqlContext.sql("SELECT * FROM df1 JOIN df2 ON df1.ra = df2.ra") Explicitly mentioning table names in the query often solves ambiguity problems. Regards --
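A runnable expansion of the snippet above, as a sketch (Spark 1.x API, assuming an existing SQLContext named sqlContext; the column name ra comes from the thread, the rest are placeholders). Qualifying each column with its table name is what removes the ambiguity:

    df1.registerTempTable("df1")
    df2.registerTempTable("df2")
    // Alias one of the duplicate columns so the output schema is unambiguous too.
    val df3 = sqlContext.sql(
      "SELECT df1.*, df2.ra AS ra2 FROM df1 JOIN df2 ON df1.ra = df2.ra")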

Re: Spark metrics when running with YARN?

2016-09-09 Thread Jacek Laskowski
Hi, That's correct. One app, one web UI. Open port 4041 and you'll see the other app. Jacek On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov" < vladimir.tretya...@sematext.com> wrote: > Hello again. > > I am trying to play with Spark version "2.11-2.0.0". > > The problem is that the REST API and UI show me
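Each SparkContext serves its own web UI, defaulting to port 4040 and falling back to 4041, 4042, and so on when the port is taken. A sketch of pinning the port per app instead (the port value is an assumption):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("second-app")
      .set("spark.ui.port", "4041") // serve this app's UI on a fixed port
    val sc = new SparkContext(conf)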

Re: Streaming Backpressure with Multiple Streams

2016-09-09 Thread Jeff Nadler
Yes, I'll test that next. On Sep 9, 2016 5:36 PM, "Cody Koeninger" wrote: > Does the same thing happen if you're only using direct stream plus back > pressure, not the receiver stream? > > On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote: > >> Maybe this is a

Re: Streaming Backpressure with Multiple Streams

2016-09-09 Thread Cody Koeninger
Does the same thing happen if you're only using direct stream plus back pressure, not the receiver stream? On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote: > Maybe this is a pretty esoteric implementation, but I'm seeing some bad > behavior with backpressure plus multiple Kafka
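A minimal sketch of the suggested test (Spark 1.x streaming API; the broker address and topic are placeholders): a single direct stream with backpressure enabled and no receiver-based stream in the same app.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("direct-only-backpressure")
      .set("spark.streaming.backpressure.enabled", "true") // rate control from batch feedback
    val ssc = new StreamingContext(conf, Seconds(5))
    // Direct stream: one RDD partition per Kafka partition, no receivers involved.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))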

Re: Spark Java Heap Error

2016-09-09 Thread Baktaawar
Hi, thanks, I tried that but got this error again: OOM. I am not sure what to do now. For spark.driver.maxResultSize I kept 2g. The rest I did as mentioned above: 16 GB for the driver and 2 GB for the executor. I have a 16 GB Mac. Please help; I am very delayed on my work because of this and not able to move ahead.
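As a sketch, the settings being described correspond to these spark-submit flags (the application file is a placeholder):

    spark-submit \
      --driver-memory 16g \
      --executor-memory 2g \
      --conf spark.driver.maxResultSize=2g \
      my_app.py

Note that a 16g driver heap on a 16 GB machine leaves little headroom for the OS and anything else running, so smaller values may be needed in practice.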

Streaming Backpressure with Multiple Streams

2016-09-09 Thread Jeff Nadler
Maybe this is a pretty esoteric implementation, but I'm seeing some bad behavior with backpressure plus multiple Kafka streams / direct streams. Here's the scenario: We have 1 Kafka topic using the reliable receiver (4 receivers, union the result). In the same app, we consume another Kafka

Spark Memory Allocation Exception

2016-09-09 Thread Sunil Tripathy
Hi, I am using Spark 1.6 to load a historical activity dataset for the last 3-4 years and write it to a Parquet file partitioned by day. I am getting the following exception when the insert command runs to insert the data into the Parquet partitions.

Re: classpath conflict with spark internal libraries and the spark shell.

2016-09-09 Thread Benyi Wang
I had a problem when I used "spark.executor.userClassPathFirst" before; I don't remember what the problem was. Alternatively, you can use --driver-class-path and "--conf spark.executor.extraClassPath". Unfortunately you may feel frustrated like me when trying to make it work. Depends on how you
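A sketch of that alternative (jar paths are placeholders): spark.executor.extraClassPath expects the path to already exist on each worker, while --jars ships the file out.

    spark-shell \
      --driver-class-path /path/to/httpclient-4.5.2.jar \
      --conf spark.executor.extraClassPath=/path/to/httpclient-4.5.2.jar \
      --jars file:/path/to/httpclient-4.5.2.jar

Entries in extraClassPath are prepended to the executor classpath, which is what lets the newer httpclient win over the version Spark bundles.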

Re: classpath conflict with spark internal libraries and the spark shell.

2016-09-09 Thread Colin Kincaid Williams
My bad, gothos on IRC pointed me to the docs: http://jhz.name/2016/01/10/spark-classpath.html Thanks Gothos! On Fri, Sep 9, 2016 at 9:23 PM, Colin Kincaid Williams wrote: > I'm using the spark shell v1.6.1. I have a classpath conflict, where I > have an external library ( not

classpath conflict with spark internal libraries and the spark shell.

2016-09-09 Thread Colin Kincaid Williams
I'm using the spark shell v1.6.1. I have a classpath conflict, where I have an external library (not OSS either :( , can't rebuild it) using httpclient-4.5.2.jar. I use spark-shell --jars file:/path/to/httpclient-4.5.2.jar However, Spark is using httpclient-4.3 internally. Then when I try to use

Spark with S3 DirectOutputCommitter

2016-09-09 Thread Srikanth
Hello, I'm trying to use DirectOutputCommitter for s3a in Spark 2.0. I've tried a few configs and none of them seem to work: the output always creates a _temporary directory, and the rename is killing performance. I read some notes about DirectOutputCommitter causing problems with speculation turned on. Was
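One commonly suggested mitigation, as a hedged sketch (not equivalent to a direct committer, but it removes the serial job-commit rename): the version 2 file output committer promotes task output into the destination at task commit time.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3a-write").getOrCreate()
    // Applies to Hadoop's FileOutputCommitter, which Spark's writers use.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")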

Re: Using sparkContext.stop()

2016-09-09 Thread Mich Talebzadeh
Hi, Are we talking about Spark Streaming here? Depending on what is streamed, you can work out an exit strategy through the total number of messages streamed in, or through a time window in which you monitor the duration and exit if the duration > the window allocated (not to be confused with windows
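A minimal sketch of the time-window variant of that exit strategy (assuming an already-configured StreamingContext named ssc; the one-hour budget and 10-second poll are assumptions):

    val budgetMs = 60 * 60 * 1000L          // the allocated window: one hour
    val start = System.currentTimeMillis()
    ssc.start()
    // Poll every 10s; stop gracefully once the duration exceeds the window.
    while (!ssc.awaitTerminationOrTimeout(10000)) {
      if (System.currentTimeMillis() - start > budgetMs) {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }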

Re: Assign values to existing column in SparkR

2016-09-09 Thread Deepak Sharma
DataFrames are immutable in nature, so I don't think you can directly assign or change values in a column. Thanks Deepak On Fri, Sep 9, 2016 at 10:59 PM, xingye wrote: > I have some questions about assigning values to a Spark dataframe. I want to > assign values to an
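Since DataFrames are immutable, the usual pattern is to derive a new DataFrame with the column replaced. A minimal sketch in Scala, given a DataFrame df (the SparkR equivalent likewise goes through a Column expression rather than a bare 0):

    import org.apache.spark.sql.functions.lit

    // withColumn replaces c_mon if it already exists, returning a new DataFrame.
    val updated = df.withColumn("c_mon", lit(0))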

scalable-deeplearning 1.0.0 released

2016-09-09 Thread Ulanov, Alexander
Dear Spark users and developers, I have released version 1.0.0 of the scalable-deeplearning package. This package is based on the implementation of artificial neural networks in Spark ML. It is intended for new Spark deep learning features that have not yet been merged into Spark ML, or that are too

Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
> Hi Jakob, I have a DataFrame with like 10 partitions, based on the exact > content of each partition I want to batch-load some other data from a DB. I > cannot operate in parallel due to resource constraints I have, hence I want to > sequentially iterate over each partition and perform operations.
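One way to do this, as a sketch (given the DataFrame df from the thread, and assuming each partition's rows fit in driver memory): run one job per partition index, so only a single partition is collected at a time.

    val rdd = df.rdd
    for (i <- 0 until rdd.getNumPartitions) {
      val rows = rdd
        .mapPartitionsWithIndex((idx, it) => if (idx == i) it else Iterator.empty)
        .collect() // only partition i reaches the driver
      // ... batch-load from the DB based on `rows` ...
    }

Each iteration schedules a job over all partitions (non-matching tasks return immediately); rdd.toLocalIterator is the cheaper alternative when per-row rather than per-partition processing is acceptable.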

Approximate Nearest Neighbors (ann) for Scala Spark

2016-09-09 Thread Kim, Min-Seok
Hi, I wrote a Scala implementation of Annoy (https://github.com/spotify/annoy), which is an ANN library: https://github.com/mskimm/annoy4s Because building the tree in Annoy is done on a single node, I thought of the following solution: - building the tree (index file) using `toLocalIterator` of RDD on the
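A sketch of that idea (the index.add call is a hypothetical stand-in; annoy4s' real API may differ): toLocalIterator pulls one partition at a time to the driver, so the whole dataset never needs to be resident there while the single-node build consumes it.

    // Assumed shape: RDD[(Int, Array[Float])] of item id and vector.
    rdd.toLocalIterator.foreach { case (id, vector) =>
      index.add(id, vector) // hypothetical single-node index builder
    }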

Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
Hi Sujeet, going sequentially over all parallel, distributed data seems like a counter-productive thing to do. What are you trying to accomplish? regards, --Jakob On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog wrote: > Hi, > Is there a way to iterate over a DataFrame with n

Using sparkContext.stop()

2016-09-09 Thread Bruno Faria
Hey all, I have created a Spark job that runs successfully, but if I do not use sc.stop() at the end, the job hangs. It shows some "cleaned accumulator 0" messages but never finishes. I intend to use these jobs in production via spark-submit and schedule them in cron. Is that the best practice
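On the hang: yes, stopping the context explicitly is standard for batch jobs. A minimal sketch with the work wrapped so the context is stopped even on failure (runJob is a hypothetical entry point; conf is the usual SparkConf):

    val sc = new SparkContext(conf)
    try {
      runJob(sc) // the actual batch work
    } finally {
      sc.stop()  // without this, lingering non-daemon threads can keep the JVM alive
    }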

accessing spark packages through proxy

2016-09-09 Thread Ulanov, Alexander
Dear Spark users, I am trying to use Spark packages, however I get the ivy error listed below. I checked JIRA and Stack Overflow, and it might be a proxy error; however, none of the proposed solutions worked for me. Could you suggest how to solve this issue?
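One fix that is often suggested for ivy resolution behind a proxy (host, port, and package are placeholders; whether it applies depends on the actual error): pass JVM proxy properties to the driver, which is the process that performs the resolve.

    spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 \
      --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"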

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a program, so the behavior that you're seeing is unexpected. I'll try to reproduce this myself by caching in PySpark under heavy memory pressure, but in the meantime the following questions will help me to debug: - Does

questions about using dapply

2016-09-09 Thread xingye
I have a question about using UDFs in SparkR. I'm converting some R code into SparkR.
- The original R code is: cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))
- If I use dapply and put the original apply function as a function for dapply, cols_in

SparkR error: reference is ambiguous.

2016-09-09 Thread xingye
Not sure whether this is the right distribution list to ask questions on. If not, can someone suggest a distribution list where I can find someone to help? I kept getting a "reference is ambiguous" error when implementing some SparkR code. 1. When I tried to assign values to a column using the

Assign values to existing column in SparkR

2016-09-09 Thread xingye
I have some questions about assigning values to a Spark dataframe. I want to assign values to an existing column of a Spark dataframe, but if I assign the value directly, I get the following error: df$c_mon <- 0 Error: class(value) == "Column" || is.null(value) is not TRUE Is there a way to solve this?

Re: Complex RDD operation as DataFrame UDF?

2016-09-09 Thread Thunder Stumpges
Bump, checking whether this is actually going to the group. I can't see my recent posts in the archives: http://apache-spark-user-list.1001560.n3.nabble.com/ Is there a reason it would not show up here? Thanks! On Tue, Sep 6, 2016 at 11:28 AM Thunder Stumpges wrote: > Hi

Spark + Parquet + IBM Block Storage at Bluemix

2016-09-09 Thread Daniel Lopes
Hi, can someone help? I'm trying to use Parquet on IBM Block Storage with Spark, but when I try to load I get this error using this config: credentials = { "name": "keystone", "auth_url": "https://identity.open.softlayer.com", "project":
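A hedged sketch of wiring Keystone credentials into the Hadoop Swift connector (property names follow the hadoop-openstack fs.swift.service.<name>.* convention; all values are placeholders, and IBM's object storage may instead require the Stocator driver):

    val hc = sc.hadoopConfiguration
    hc.set("fs.swift.service.keystone.auth.url", "https://identity.open.softlayer.com/v3/auth/tokens")
    hc.set("fs.swift.service.keystone.username", "<username>")
    hc.set("fs.swift.service.keystone.password", "<password>")
    hc.set("fs.swift.service.keystone.tenant", "<project-id>")
    // Then read through the swift:// scheme, container.service in the authority.
    val df = sqlContext.read.parquet("swift://mycontainer.keystone/data.parquet")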

spark nightly builds with Hadoop 2.7

2016-09-09 Thread Joseph Naegele
Hello, I'm using the Spark nightly build "spark-2.1.0-SNAPSHOT-bin-hadoop2.7" from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/ due to bugs in Spark 2.0.0 (SPARK-16740, SPARK-16802); however, I noticed that the recent builds only come in "-hadoop2.4-without-hive" and

Re: spark-deployer 3.0.1 released

2016-09-09 Thread Bhupendra Mishra
Great, will test and share feedback. Sent from my iPhone > On 09-Sep-2016, at 9:37 PM, pishen tsai wrote: > > spark-deployer is an sbt plugin that helps deploy a Spark stand-alone > cluster on EC2 and submit your Spark jobs. All the work is done in sbt. > > We just

spark-deployer 3.0.1 released

2016-09-09 Thread pishen tsai
spark-deployer is an sbt plugin that helps deploy a Spark stand-alone cluster on EC2 and submit your Spark jobs. All the work is done in sbt. We just released a new version with (hopefully) a better experience for Spark newbies. https://github.com/KKBOX/spark-deployer Please ask in our gitter

Re: Why does Spark take so much time for a simple task without calculation?

2016-09-09 Thread Bedrytski Aliaksandr
Hi xiefeng, Even if your RDDs are tiny and reduced to one partition, there is always orchestration overhead (sending tasks to executor(s), reducing results, etc.; these things are not free). If you need fast, [near] real-time processing, look towards Spark Streaming. Regards, -- Bedrytski

add jars like spark-csv to IPython notebook with pyspark

2016-09-09 Thread pseudo oduesp
Hi, how can I add a jar to an IPython notebook? I tried PYSPARK_SUBMIT_ARGS without success. Thanks
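One commonly suggested approach (the package coordinates are an example; the trailing pyspark-shell token is required on newer Spark versions): set PYSPARK_SUBMIT_ARGS in the shell that launches the notebook server, not in a notebook cell.

    export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"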

Re: year out of range

2016-09-09 Thread Daniel Lopes
Thanks Ayan! Daniel Lopes Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Thu, Sep 8, 2016 at 7:54 PM, ayan guha

Get spark metrics in code

2016-09-09 Thread Han JU
Hi, I'd like to know if there's a possibility to get Spark's metrics from code. For example: val sc = new SparkContext(conf) val result = myJob(sc, ...) result.save(...) val gauge = MetricSystem.getGauge("org.apache.spark") println(gauge.getValue) // or send it to internal

pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Ben Leslie
Hi, I'm trying to understand if there is any difference in correctness between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK). I can see that there may be differences in performance, but my expectation was that using either would result in the

Re: Does it run distributed if class not Serializable

2016-09-09 Thread Gourav Sengupta
And you are using JAVA? AND WHY? Regards, Gourav On Fri, Sep 9, 2016 at 11:47 AM, Yusuf Can Gürkan wrote: > Hi, > > If I don't make a class Serializable (... extends Serializable), will it > run distributed with executors or will it only run on the master machine? > >

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-09 Thread Arun Patel
Thank you, Yong. I just looked at it. There was a pull request (#73) as well. Anything wrong with that fix? Can I use a similar fix? On Thu, Sep 8, 2016 at 8:53 PM, Yong Zhang wrote: > Did you take a look at this ->

Does it run distributed if class not Serializable

2016-09-09 Thread Yusuf Can Gürkan
Hi, If I don't make a class Serializable (... extends Serializable), will it run distributed with executors, or will it only run on the master machine? Thanks
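For context, a minimal sketch of why this matters (assuming an existing SparkContext sc): objects referenced inside an RDD closure are serialized and shipped to executors, and if their class is not serializable the job fails at submission with a "Task not serializable" error; it does not silently fall back to running only on the master.

    class Multiplier(val factor: Int) extends Serializable

    val m = new Multiplier(3)
    // m is captured by the closure, so Multiplier must be Serializable.
    val out = sc.parallelize(1 to 10).map(_ * m.factor).collect()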

iterating over DataFrame Partitions sequentially

2016-09-09 Thread sujeet jog
Hi, Is there a way to iterate over a DataFrame with n partitions sequentially? Thanks, Sujeet

Video analytics on Spark

2016-09-09 Thread Priya Ch
Hi All, I have video surveillance data that needs to be processed in Spark. I am going through Spark + OpenCV. How do I load .mp4 files into an RDD? Can we do this directly, or does the video need to be converted to a SequenceFile? Thanks, Padma CH
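A hedged starting point for the loading half of the question (the path is a placeholder; OpenCV frame decoding is a separate step, left as a comment): sc.binaryFiles exposes whole files as byte streams, so no SequenceFile conversion is needed just to get the data into an RDD.

    // RDD of (file path, lazily readable byte stream), one record per .mp4 file.
    val videos = sc.binaryFiles("hdfs:///surveillance/*.mp4")
    // videos.map { case (path, stream) => decode frames with OpenCV here }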

Re: MLlib: Non-Linear Optimization

2016-09-09 Thread Nitin Sareen
Yes, we are using primarily these two algorithms: 1. Interior-point trust-region line-search algorithm 2. Active-set trust-region line-search algorithm We are performing optimizations with constraints, thresholds, etc. We are primarily using Lindo / SAS modules but want to get away

Re: Graphhopper/routing in Spark

2016-09-09 Thread Robin East
It’s not obvious to me how that would work. In principle I imagine you could have your source data loaded into HDFS and read by GraphHopper instances running on Spark workers. But a graph by its nature has items that have connections to potentially any other item, so GraphHopper instances would