DataFrames in Spark - Performance when interjected with RDDs

2015-09-07 Thread Pallavi Rao
Hello All, I had a question regarding the performance optimization (Catalyst Optimizer) of DataFrames. I understand that DataFrames are interoperable with RDDs. If I switch back and forth between DataFrames and RDDs, does the performance optimization still kick in? I need to switch to RDDs to
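
For illustration, a minimal sketch of the behavior in question (Spark 1.4/1.5-era API; the Event schema is a hypothetical example): Catalyst optimizes the DataFrame stages on either side of an RDD hop, but the lambda in the middle is opaque to it, so no filter or projection can be pushed across that boundary.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DfRddRoundTrip {
  // Hypothetical schema for illustration only.
  case class Event(id: Long, score: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-rdd-roundtrip"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Event(1L, 0.3), Event(2L, 0.9))).toDF()

    // These DataFrame operations are planned together by Catalyst.
    val filtered = df.filter($"score" > 0.5)

    // The RDD hop below is an optimization barrier: Catalyst cannot see into
    // the lambda, so nothing is pushed across it. Optimization resumes once
    // the data is a DataFrame again.
    val rescored = filtered.rdd
      .map(row => Event(row.getAs[Long]("id"), row.getAs[Double]("score") / 2))
      .toDF()

    rescored.filter($"score" > 0.25).show()
  }
}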

Code generation for GPU

2015-09-07 Thread lonikar
Hi, I am speaking at the Spark Europe summit on exploiting GPUs for columnar DataFrame operations. I was going through various blogs, talks and JIRAs by all of you, trying to figure out where to make changes for this proposal. First of all, I must thank the recent progress in project tungsten that

Spark SQL - UDF for scoring a model - take $"*"

2015-09-07 Thread Night Wolf
Is it possible to have a UDF which takes a variable number of arguments? e.g. df.select(myUdf($"*")) fails with org.apache.spark.sql.AnalysisException: unresolved operator 'Project [scalaUDF(*) AS scalaUDF(*)#26]; What I would like to do is pass in a generic data frame which can then be passed
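
One workaround from this era, as a hedged sketch (scoreRow is a hypothetical stand-in for the model-scoring logic): a Scala UDF can't take $"*", but it can take a single array column built from all of the DataFrame's columns cast to a common type.

import org.apache.spark.sql.functions.{array, col, udf}

val scoreRow = udf { (values: Seq[String]) => values.count(_ != null).toDouble }

// array() requires homogeneous element types, hence the cast to string.
val allCols = array(df.columns.map(c => col(c).cast("string")): _*)

df.select(scoreRow(allCols).as("score"))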

Re: Python Spark Streaming example with textFileStream does not work. Why?

2015-09-07 Thread Kamil Khadiev
I think the problem also depends on the file system. I use a Mac, and my program found the file, but only when I created a new one, not when I renamed or moved it. And in the logs 15/09/07 10:44:52 INFO FileInputDStream: New files at time 1441611892000 ms: I found my file, but I don't see any processing of the file in the logs

Java UDFs in GROUP BY expressions

2015-09-07 Thread James Aley
Hi everyone, I raised this JIRA ticket back in July: https://issues.apache.org/jira/browse/SPARK-9435 The problem is that it seems Spark SQL doesn't recognise columns we transform with a UDF when referenced in the GROUP BY clause. There's a minimal reproduction Java file attached to illustrate

Exception when restoring spark streaming with batch RDD from checkpoint.

2015-09-07 Thread ZhengHanbin
Hi, I am using Spark Streaming to join every RDD of a DStream to a standalone RDD, to generate a new DStream, as follows: def joinWithBatchEvent(contentFeature: RDD[(String, String)], batchEvent: DStream[((String, String), (Long, Double, Double))]) = {
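
For context, a simplified sketch of the join in question (key types reduced to String for illustration): DStream.transform exposes each batch RDD so it can be joined against the standalone RDD. One caveat that, as far as I understand, matters for checkpoint recovery: an RDD referenced from inside the closure is not saved by DStream checkpointing, so it must be re-creatable when the context is restored.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Key types reduced to String for illustration.
def joinWithBatchEvent(
    contentFeature: RDD[(String, String)],
    batchEvent: DStream[(String, (Long, Double, Double))]
): DStream[(String, ((Long, Double, Double), String))] = {
  batchEvent.transform(rdd => rdd.join(contentFeature))
}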

Zeppelin + Spark on EMR

2015-09-07 Thread shahab
Hi, I am trying to use Zeppelin to work with Spark on Amazon EMR. I used the script provided by Anders ( https://gist.github.com/andershammar/224e1077021d0ea376dd) to set up Zeppelin. Zeppelin can connect to Spark, but I get the following error when I run the tutorials:

Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
Hi All, I have been trying to send my application logs to a socket so that we can feed Logstash and check the application logs. Here is my log4j.properties file: main.logger=RFA,SA log4j.appender.SA=org.apache.log4j.net.SocketAppender log4j.appender.SA.Port=4560
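
For reference, a fuller sketch of what the quoted fragment usually expands to with the log4j 1.x SocketAppender; only the property lines quoted above come from the post, and the RemoteHost/ReconnectionDelay values here are assumptions.

# Sketch only: log4j 1.x SocketAppender will not connect anywhere
# without a RemoteHost (the Logstash/log4j listener host is assumed
# to be localhost here).
log4j.rootLogger=INFO,RFA,SA
log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.RemoteHost=localhost
log4j.appender.SA.Port=4560
log4j.appender.SA.ReconnectionDelay=10000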

OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Tóth
Hi, When I execute the Spark ML Logistic Regression example in pyspark I run into an OutOfMemory exception. I'm wondering if any of you experienced the same or have a hint about how to fix this. The interesting bit is that I only get the exception when I try to write the result DataFrame into a

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Tóth
Aaand, the error! :) Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf" Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf" Exception in thread "Thread-7" Exception: java.lang.OutOfMemoryError thrown

Re: Drools and Spark Integration - Need Help

2015-09-07 Thread Akhil Das
How are you integrating it with Spark? Thanks Best Regards On Fri, Sep 4, 2015 at 12:11 PM, Shiva moorthy wrote: > Hi Team, > > I am able to integrate Drools with Apache Spark, but after integration my > application runs slower. > Could you please give ideas about how

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-07 Thread Terry Hole
Xiangrui, Do you have any idea how to make this work? Thanks - Terry Terry Hole wrote on Sunday, 6 September 2015 at 17:41: > Sean > > Do you know how to tell the decision tree that the "label" is binary, or set > some attribute on the DataFrame to carry the number of classes? > > Thanks! > - Terry >
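
For reference, the usual fix from this era as a minimal sketch: StringIndexer attaches nominal-attribute metadata to the label column, which is how the tree learns the number of classes. Column names and trainingDf are assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.StringIndexer

// "label", "features" and trainingDf are hypothetical names.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(labelIndexer, dt)).fit(trainingDf)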

Re: Access a Broadcast variable causes Spark to launch a second context

2015-09-07 Thread sstraub
never mind, all of this was caused because somewhere in my code I wrote `def` instead of `val`, which caused `collectAsMap` to be executed on each call. Not sure why Spark at some point decided to create a new context, though... Anyway, sorry for the disturbance. sstraub wrote > Hi, > > I'm
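
For anyone hitting the same thing, the def-vs-val difference in miniature (a sketch; rdd stands for any pair RDD):

def lookupByDef = rdd.collectAsMap() // re-runs a Spark job on every access
val lookupByVal = rdd.collectAsMap() // runs one job; the map is reused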

Re: Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
I also tried placing my customized log4j.properties file under src/main/resources, still no luck. Won't the above step modify the default YARN and Spark log4j.properties? Anyhow, it's still taking log4j.properties from YARN. On 7 September 2015 at 19:25, Jeetendra Gangele

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-07 Thread Jörn Franke
Can you use a map or list with different properties as one parameter? Alternatively a string where parameters are comma-separated... On Mon, Sep 7, 2015 at 8:35 AM, Night Wolf wrote: > Is it possible to have a UDF which takes a variable number of arguments? > > e.g.

Re: spark-shell does not see conf folder content on emr-4

2015-09-07 Thread Akhil Das
You can also create a link to /etc/spark/conf from /usr/lib/spark/ Thanks Best Regards On Fri, Sep 4, 2015 at 2:40 AM, Alexander Pivovarov wrote: > Hi Everyone > > My question is specific to running spark-1.4.1 on emr-4.0.0 > > spark installed to /usr/lib/spark > conf

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Zvara
Hey, I'd try to debug, profile ResolvedDataSource. As far as I know, your write will be performed by the JVM. On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán wrote: > Unfortunately I'm getting the same error: > The other interesting things are that: > - the parquet files got

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread boci
Hi, Can you try using the save method instead of write? ex: out_df.save("path","parquet") b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Mon, Sep 7, 2015 at

Access a Broadcast variable causes Spark to launch a second context

2015-09-07 Thread sstraub
Hi, I'm working on a spark job that frequently iterates over huge RDDs and matches the elements against some Maps that easily fit into memory. So what I do is to broadcast that Map and reference it from my RDD. Works like a charm, until at some point it doesn't, and I can't figure out why...

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Tóth Zoltán
Unfortunately I'm getting the same error: The other interesting things are that: - the parquet files got actually written to HDFS (also with .write.parquet() ) - the application gets stuck in the RUNNING state for good even after the error is thrown 15/09/07 10:01:10 INFO spark.ContextCleaner:

Shared data between algorithms

2015-09-07 Thread Somabha Bhattacharjya
Hello All, I plan to run 2 clustering algorithms on shared data in Spark MLlib (Algo A starts first and modifies the data, then Algo B starts with the modified data; thereafter they run in parallel). Is it possible to share data between two algorithms in a single pipeline? Regards,
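
A minimal sketch of one way to do the handoff, assuming each algorithm exposes a run method over an RDD (algoA, algoB and input are hypothetical stand-ins): persist A's output so B and any parallel consumers reuse it instead of recomputing.

import org.apache.spark.storage.StorageLevel

// algoA, algoB and input are hypothetical.
val modified = algoA.run(input).persist(StorageLevel.MEMORY_AND_DISK)
val clustersB = algoB.run(modified) // reuses the cached data
// Jobs submitted from separate threads on the same SparkContext can then
// run against `modified` concurrently.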

Re: Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
anybody here to help? On 7 September 2015 at 17:53, Jeetendra Gangele wrote: > Hi All I have been trying to send my application related logs to socket so > that we can write log stash and check the application logs. > > here is my log4j.property file > >

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zsolt Tóth
Hi, I ran your example on Spark-1.4.1 and 1.5.0-rc3. It succeeds on 1.4.1 but throws the OOM on 1.5.0. Do any of you know which PR introduced this issue? Zsolt 2015-09-07 16:33 GMT+02:00 Zoltán Zvara : > Hey, I'd try to debug, profile ResolvedDataSource. As far as I

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Sandy Ryza
Hi Alex, If they're both configured correctly, there's no reason that Spark Standalone should provide performance or memory improvement over Spark on YARN. -Sandy On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov wrote: > Hi Everyone > > We are trying the latest aws

Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html The implementation seems to be missing backpropagation? Was there a good reason to omit BP? What are the drawbacks of a pure feedforward-only ANN? Thanks! -- Ruslan Dautkhanov

Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-07 Thread Nicholas R. Peterson
I'm trying to run a Spark 1.4.1 job on my CDH5.4 cluster, through Yarn. Serialization is set to use Kryo. I have a large object which I send to the executors as a Broadcast. The object seems to serialize just fine. When it attempts to deserialize, though, Kryo throws a ClassNotFoundException...
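
For reference, one mitigation commonly tried in threads like this, as a hedged sketch (MyBroadcastPayload is hypothetical, and whether this helps depends on which classloader Kryo resolves classes against): register the broadcast's class with Kryo up front rather than letting it be resolved by name at deserialization time.

import org.apache.spark.SparkConf

// MyBroadcastPayload stands in for the large broadcast object's class.
class MyBroadcastPayload extends Serializable

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyBroadcastPayload]))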

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
Thank you - it works if the file is created in Spark On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov wrote: > Read response from Cheng Lian on Aug/27th - it > looks the same problem. > > Workarounds > 1. write that parquet file in Spark; > 2.

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
No, it was created in Hive by CTAS, but any help is appreciated... On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov wrote: > That parquet table wasn't created in Spark, is it? > > There was a recent discussion on this list that complex data types in > Spark prior to 1.5

Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Thanks! It does not look like Spark ANN supports dropout/DropConnect or any other techniques that help avoid overfitting yet? http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf https://cs.nyu.edu/~wanli/dropc/dropc.pdf ps. There is a small copy-paste typo in

Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Found a dropout commit from avulanov: https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d It probably hasn't made its way to MLLib (yet?). -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang wrote: > Unfortunately, not

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
Read the response from Cheng Lian on Aug 27th - it looks like the same problem. Workarounds: 1. write that parquet file in Spark; 2. upgrade to Spark 1.5. -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote: > No, it was created in Hive by
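
Sketching workaround 1, under the assumption that the data can be regenerated from its source in Spark (a spark-shell session, output path hypothetical): a parquet file written by Spark's own writer reads back fine, per Alex's follow-up.

// In spark-shell (1.4.1), where sqlContext is predefined; df stands for
// the data rebuilt in Spark from its source, not read from the Hive table.
val out = "hdfs:///tmp/stats_spark" // hypothetical path
df.write.parquet(out)
val stats = sqlContext.read.parquet(out)
stats.printSchema()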

Re: Sending yarn application logs to web socket

2015-09-07 Thread Yana Kadiyska
Hopefully someone will give you a more direct answer, but whenever I'm having issues with log4j I always try -Dlog4j.debug=true. This will tell you which log4j settings are getting picked up from where. I've spent countless hours due to typos in the file, for example. On Mon, Sep 7, 2015 at 11:47

Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-07 Thread Ashish Dutt
Hello Sasha, I have no answer for Debian. My cluster is on Linux and I'm using CDH 5.4. Your question: "Error from python worker: /cube/PY/Python27/bin/python: No module named pyspark" On a single node (i.e. one server/machine/computer) I installed the pyspark binaries and it worked. Connected it to

Re: Spark ANN

2015-09-07 Thread Feynman Liang
Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on the roadmap for 1.6 though, and there is a spark package for dropout regularized logistic regression.

Re: Spark ANN

2015-09-07 Thread Feynman Liang
BTW thanks for pointing out the typos, I've included them in my MLP cleanup PR On Mon, Sep 7, 2015 at 7:34 PM, Feynman Liang wrote: > Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on > the roadmap for

Re: Spark ANN

2015-09-07 Thread Debasish Das
Not sure about dropout, but if you change the solver from breeze bfgs to breeze owlqn or breeze.proximal.NonlinearMinimizer, you can solve the ANN loss with L1 regularization, which will yield elastic-net-style sparse solutions. Using that, you can clean up edges which have 0.0 as weight... On Sep 7, 2015 7:35

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
That parquet table wasn't created in Spark, was it? There was a recent discussion on this list that complex data types in Spark prior to 1.5 were often incompatible with Hive, for example, if I remember correctly. On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: > I am trying to read

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
The same error if I do: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val results = sqlContext.sql("SELECT * FROM stats") but it does work from Hive shell directly... On Mon, Sep 7, 2015 at 1:56 PM, Alex Kozlov wrote: > I am trying to read an (array typed)

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-07 Thread Reynold Xin
On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg wrote: > > BTW, is it possible (or will it be) to use Tungsten with dynamic > allocation and the external shuffle manager? > > Yes - I think this already works. There isn't anything specific here related to Tungsten.

Spark summit Asia

2015-09-07 Thread Kevin Jung
Is there any plan to hold Spark summit in Asia? I'm very much looking forward to it. Kevin

Support of other languages?

2015-09-07 Thread Rahul Palamuttam
Hi, I wanted to know more about how Spark supports R and Python, with respect to what gets copied into the language environments. To clarify: I know that PySpark utilizes py4j sockets to pass pickled python functions between the JVM and the python daemons. However, I wanted to know how it

Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Gheorghe Postelnicu
Hi, The following code fails when compiled from SBT: package main.scala import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext object TestMain { def main(args: Array[String]): Unit = { implicit val sparkContext = new SparkContext() val sqlContext = new
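
For reference, the canonical toDF() pattern for this Spark version, as a sketch; the thread's actual failure turned out to be build-related (see the scala-reflect reply below).

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// The case class must be defined outside the method, or the implicit
// TypeTag needed by toDF() won't be found.
case class Person(name: String, age: Int)

object TestMain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    sc.parallelize(Seq(Person("a", 1), Person("b", 2))).toDF().show()
  }
}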

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Jonathan Coveney
Try adding the following to your build.sbt libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6" I believe that spark shades the scala library, and this is a library that it looks like you need in an unshaded way. 2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <

Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
I am trying to read an (array typed) parquet file in spark-shell (Spark 1.4.1 with Hadoop 2.6): {code} $ bin/spark-shell log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please initialize the log4j system properly. log4j:WARN See

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Jonathan Coveney
How are you building and running it? On Monday, September 7, 2015, Gheorghe Postelnicu < gheorghe.posteln...@gmail.com> wrote: > Interesting idea. Tried that, didn't work. Here is my new SBT file: > > name := """testMain""" > > scalaVersion := "2.11.6" > > libraryDependencies ++= Seq( >

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Gheorghe Postelnicu
sbt assembly; $SPARK_HOME/bin/spark-submit --class main.scala.TestMain --master "local[4]" target/scala-2.11/bof-assembly-0.1-SNAPSHOT.jar using Spark: /opt/spark-1.4.1-bin-hadoop2.6 On Mon, Sep 7, 2015 at 10:20 PM, Jonathan Coveney wrote: > How are you building and

Adding additional jars to distributed cache (yarn-client)

2015-09-07 Thread Srikanth Sundarrajan
Hi, Am trying to use JavaSparkContext() to create a new SparkContext and attempted to pass the requisite jars. But looks like they aren't getting added to the distributed cache automatically. Looking into YarnClientSchedulerBackend::start() and ClientArguments, it did seem like it would
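
For comparison, a sketch of how jars are typically attached via SparkConf (equivalent to spark-submit --jars); whether this path populates the YARN distributed cache in client mode is exactly what the thread is asking. Paths here are hypothetical.

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

// Paths are hypothetical examples.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("dist-cache-test")
  .setJars(Seq("hdfs:///libs/dep1.jar", "file:///opt/app/dep2.jar"))

val jsc = new JavaSparkContext(conf)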

Re: Spark ANN

2015-09-07 Thread Nick Pentreath
Haven't checked the actual code, but that doc says "MLPC employs backpropagation for learning the model. .."? — Sent from Mailbox On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov wrote: > http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html >

Re: Spark ANN

2015-09-07 Thread Feynman Liang
Backprop is used to compute the gradient here, which is then optimized by SGD or LBFGS here

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Alexander Pivovarov
Hi Sandy, thank you for your reply. Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB) with the EMR setting for Spark "maximizeResourceAllocation": "true". It is automatically converted to the Spark settings spark.executor.memory 47924M and spark.yarn.executor.memoryOverhead 5324. We also set