Re: deploy-mode flag in spark-sql cli

2016-06-29 Thread Saisai Shao
I think you cannot use the SQL client in cluster mode; the same goes for spark-shell/pyspark, which have a REPL. All these applications can only be started in client deploy mode. On Thu, Jun 30, 2016 at 12:46 PM, Mich Talebzadeh wrote: > Hi, > > When you use spark-shell or for

Re: deploy-mode flag in spark-sql cli

2016-06-29 Thread Mich Talebzadeh
Hi, When you use spark-shell or, for that matter, spark-sql, you are starting spark-submit under the bonnet. These two shells were created to make it easier to work on Spark. However, if you look at what $SPARK_HOME/bin/spark-sql does in the script, you will notice my point: exec

Re: Error report file is deleted automatically after spark application finished

2016-06-29 Thread dhruve ashar
You can look at the yarn-default configuration file. Check your log-related settings to see if log aggregation is enabled, and also the log retention duration to see if it's too small and files are being deleted. On Wed, Jun 29, 2016 at 4:47 PM, prateek arora wrote: >

deploy-mode flag in spark-sql cli

2016-06-29 Thread Huang Meilong
Hello, I added the deploy-mode flag in the spark-sql cli like this: $ spark-sql --deploy-mode cluster --master yarn -e "select * from mx" It showed an error saying "Cluster deploy mode is not applicable to Spark SQL shell", but "spark-sql --help" shows the "--deploy-mode" option. Is this a bug?

Regarding Decision Tree

2016-06-29 Thread Chintan Bhatt
Hello, I want to improve the decision tree in Spark. Can anyone help me with parameter tuning for such an improvement? -- CHINTAN BHATT Assistant Professor, U & P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of

Re: Using R code as part of a Spark Application

2016-06-29 Thread Sun Rui
Hi, Gilad, You can try the dapply() and gapply() functions in SparkR in Spark 2.0. Yes, it is required that R be installed on each worker node. However, if your Spark application is Scala/Java based, running R code on DataFrames is not supported for now. There is a closed JIRA

Re: Using R code as part of a Spark Application

2016-06-29 Thread Xinh Huynh
It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0: https://issues.apache.org/jira/browse/SPARK-6817 Here's some of the code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala /** * A function wrapper

Re: Aggregator (Spark 2.0) skips aggregation if zero(0) returns null

2016-06-29 Thread Koert Kuipers
It's the difference between a semigroup and a monoid, and yes, max does not easily fit into a monoid. See also the discussion here: https://issues.apache.org/jira/browse/SPARK-15598 On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela wrote: > OK. I see that, but the current (provided)
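To make the semigroup/monoid point concrete, here is a minimal Scala sketch (an illustration, not the code from the thread): wrapping the aggregation buffer in Option supplies the identity element that max itself lacks, so zero never has to return null.

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Sketch only: Option[Long] as the buffer type gives max a proper identity (None),
    // turning the semigroup (max) into a monoid the Aggregator can work with.
    object MaxAgg extends Aggregator[Long, Option[Long], Long] {
      def zero: Option[Long] = None                                    // identity element
      def reduce(b: Option[Long], a: Long): Option[Long] =
        Some(b.fold(a)(math.max(_, a)))                                // fold a new value into the buffer
      def merge(b1: Option[Long], b2: Option[Long]): Option[Long] =
        (b1.toSeq ++ b2.toSeq).reduceOption(_ max _)                   // combine partial maxima
      def finish(r: Option[Long]): Long = r.getOrElse(Long.MinValue)   // unwrap, with a fallback for empty input
      def bufferEncoder: Encoder[Option[Long]] = Encoders.kryo[Option[Long]]
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }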

Re: Can Spark Dataframes preserve order when joining?

2016-06-29 Thread Mich Talebzadeh
Hi, Well, I would not assume anything myself. If you want it ordered, do it explicitly. Let us take a simple case by creating three DFs based on existing tables: val s = HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID") val c =
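A minimal sketch of the "order it explicitly" advice, reusing the column names from the snippet above (the join key and everything else here are assumptions):

    // Assuming s and c are the DataFrames built from the Hive tables above.
    // Join output order is not guaranteed, so impose the ordering you need afterwards.
    val joined  = s.join(c, "CHANNEL_ID")
    val ordered = joined.orderBy(joined("TIME_ID").asc, joined("CHANNEL_ID").asc)
    ordered.show(10)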

Re: Possible to broadcast a function?

2016-06-29 Thread Bin Fan
Following this suggestion, Aaron, you may take a look at Alluxio as off-heap in-memory data storage for Spark job input/output, if that works for you. See more intro on how to run Spark with Alluxio as data input / output.

Re: Using R code as part of a Spark Application

2016-06-29 Thread Sean Owen
Here we are (or certainly I am) not talking about R Server, but plain vanilla R, as used with Spark and SparkR. Currently, SparkR doesn't distribute R code at all (it used to, sort of), so I'm wondering if that is changing back. On Wed, Jun 29, 2016 at 10:53 PM, John Aherne

Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
I don't think R server requires R on the executor nodes. I originally set up a SparkR cluster for our Data Scientist on Azure which required that I install R on each node, but for the R Server set up, there is an extra edge node with R server that they connect to. From what little research I was

PySpark crashed because "remote RPC client disassociated"

2016-06-29 Thread jw.cmu
I am running my own PySpark application (solving matrix factorization using Gemulla's DSGD algorithm). The program seemed to work fine on the smaller MovieLens dataset but failed on the larger Netflix data. It took about 14 hours to complete two iterations and lost an executor (I used 8 executors in total

Error report file is deleted automatically after spark application finished

2016-06-29 Thread prateek arora
Hi My Spark application was crashed and show information LogType:stdout Log Upload Time:Wed Jun 29 14:38:03 -0700 2016 LogLength:1096 Log Contents: # # A fatal error has been detected by the Java Runtime Environment: # # SIGILL (0x4) at pc=0x7f67baa0d221, pid=12207, tid=140083473176320 # #

Re: Using R code as part of a Spark Application

2016-06-29 Thread Sean Owen
Oh, interesting: does this really mean the return of distributing R code from driver to executors and running it remotely, or do I misunderstand? this would require having R on the executor nodes like it used to? On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh wrote: > There is

Re: Possible to broadcast a function?

2016-06-29 Thread Sean Owen
Ah, I completely read over the "250GB" part. Yeah you have a huge heap then and indeed you can run into problems with GC pauses. You can probably still manage such huge executors with a fair bit of care with the GC and memory settings, and, you have a good reason to consider this. In particular I

Friendly Reminder: Spark Summit EU CfP Deadline July 1, 2016

2016-06-29 Thread Jules Damji
Hello All, If you haven't submitted a CfP for Spark Summit EU, the deadline is this Friday, July 1st. Submit at https://spark-summit.org/eu-2016/ Cheers! Jules Spark Community Evangelist Databricks, Inc. Sent from my iPhone Pardon the dumb thumb typos :)

Kudu Connector

2016-06-29 Thread Benjamin Kim
I was wondering if anyone, who is a Spark Scala developer, would be willing to continue the work done for the Kudu connector? https://github.com/apache/incubator-kudu/tree/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu I have been testing and using Kudu for the past month and

Re: Set the node the spark driver will be started

2016-06-29 Thread Bryan Cutler
Hi Felix, I think the problem you are describing has been fixed in later versions, check out this JIRA https://issues.apache.org/jira/browse/SPARK-13803 On Wed, Jun 29, 2016 at 9:27 AM, Mich Talebzadeh wrote: > Fine. in standalone mode spark uses its own scheduling

Re: Unsubscribe - 3rd time

2016-06-29 Thread Mich Talebzadeh
Indeed Nicholas, a very valid point. An example of the new email listings, like below for ISUG etc., that allow all options including unsubscribe -End Original Message- *Site Links:* View post online View mailing list online

Apache Spark Is Hanging when fetch data from SQL Server 2008

2016-06-29 Thread Gastón Schabas
Hi everyone. I'm experiencing an issue when I try to fetch data from SQL Server. This is my context: Ubuntu 14.04 LTS, Apache Spark 1.4.0, SQL Server 2008, Scala 2.10.5, Sbt 0.13.11. I'm trying to fetch data from a table in SQL Server 2008 that has 85.000.000 records. I only need around 200.000
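A common cause of this kind of hang is pulling the whole table through a single JDBC connection. A hedged sketch (table, column, and connection details are all placeholders) that pushes the filter down to SQL Server and splits the read across partitions with the Spark 1.4 JDBC reader:

    import org.apache.spark.sql.SQLContext

    // sc is an existing SparkContext; every name below is a placeholder.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:sqlserver://host:1433;databaseName=mydb;user=u;password=p",
      // Push the predicate down so only the ~200.000 rows of interest leave SQL Server.
      "dbtable"         -> "(SELECT id, col1, col2 FROM big_table WHERE col1 = 'x') AS t",
      // Split the read into parallel partitions on a numeric column.
      "partitionColumn" -> "id",
      "lowerBound"      -> "1",
      "upperBound"      -> "200000",
      "numPartitions"   -> "8"
    )).load()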

groupBy cannot handle large RDDs

2016-06-29 Thread Kaiyin Zhong
Could anyone have a look at this? It looks like a bug: http://stackoverflow.com/questions/38106554/groupby-cannot-handle-large-rdds Best regards, Kaiyin ZHONG

Re: Unsubscribe - 3rd time

2016-06-29 Thread Nicholas Chammas
> I'm not sure I've ever come across an email list that allows you to unsubscribe by responding to the list with "unsubscribe". Many noreply lists (e.g. companies sending marketing email) actually work that way, which is probably what most people are used to these days. What this list needs is

Re: Using R code as part of a Spark Application

2016-06-29 Thread Jörn Franke
You still need SparkR > On 29 Jun 2016, at 19:14, John Aherne wrote: > > Microsoft Azure has an option to create a spark cluster with R Server. MS > bought RevoScale (I think that was the name) and just recently deployed it. > >> On Wed, Jun 29, 2016 at 10:53 AM,

Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
Microsoft Azure has an option to create a spark cluster with R Server. MS bought RevoScale (I think that was the name) and just recently deployed it. On Wed, Jun 29, 2016 at 10:53 AM, Xinh Huynh wrote: > There is some new SparkR functionality coming in Spark 2.0, such as >

Re: Unsubscribe - 3rd time

2016-06-29 Thread Jonathan Kelly
If at first you don't succeed, try, try again. But please don't. :) See the "unsubscribe" link here: http://spark.apache.org/community.html I'm not sure I've ever come across an email list that allows you to unsubscribe by responding to the list with "unsubscribe". At least, all of the Apache

Re: Using R code as part of a Spark Application

2016-06-29 Thread Xinh Huynh
There is some new SparkR functionality coming in Spark 2.0, such as "dapply". You could use SparkR to load a Parquet file and then run "dapply" to apply a function to each partition of a DataFrame. Info about loading Parquet file:

Re: Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
From what I've read, people had seen performance issues when the JVM used more than 60 GiB of memory. I haven't tested it myself, but I guess that's not true? Also, how does one optimize memory when the driver allocates some on one node? For example, let's say my cluster has N nodes each with 500 GiB

Re: Set the node the spark driver will be started

2016-06-29 Thread Mich Talebzadeh
Fine. In standalone mode Spark uses its own scheduling as opposed to Yarn or anything else. As a matter of interest, can you start spark-submit from any node in the cluster? Do these all have the same or similar CPU and RAM? HTH Dr Mich Talebzadeh LinkedIn *

Re: Possible to broadcast a function?

2016-06-29 Thread Sean Owen
If you have one executor per machine, which is the right default thing to do, and this is a singleton in the JVM, then this does just have one copy per machine. Of course an executor is tied to an app, so if you mean to hold this data across executors that won't help. On Wed, Jun 29, 2016 at
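A minimal sketch of the per-JVM singleton idea (the names and the loading step are assumptions, not code from the thread):

    // A lazy val inside a Scala object is initialized at most once per JVM, so with
    // one executor per machine this yields one copy of the data per machine.
    object SharedLookup {
      lazy val table: Map[String, String] = loadTable()   // built on first use in each executor
      private def loadTable(): Map[String, String] =
        Map("key" -> "value")                             // placeholder for the expensive load
    }

    // Used inside a task (rdd is any RDD[String]); the lookup is resolved on the executor:
    // rdd.map(k => SharedLookup.table.getOrElse(k, "missing"))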

Re: Unsubscribe - 3rd time

2016-06-29 Thread Mich Talebzadeh
LOL. Bravely said Joaquin. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your

RE: Unsubscribe - 3rd time

2016-06-29 Thread Joaquin Alzola
And 3rd time is not enough to know that unsubscribe is done through --> user-unsubscr...@spark.apache.org From: Steve Florence [mailto:sflore...@ypm.com] Sent: 29 June 2016 16:47 To: user@spark.apache.org Subject: Unsubscribe - 3rd time This email is confidential and may be subject to

Unsubscribe - 3rd time

2016-06-29 Thread Steve Florence

Re: Possible to broadcast a function?

2016-06-29 Thread Sonal Goyal
Have you looked at Alluxio? (earlier tachyon) Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Do tasks from the same application run in different JVMs

2016-06-29 Thread Mathieu Longtin
Same JVMs. On Wed, Jun 29, 2016 at 8:48 AM Huang Meilong wrote: > Hi, > > In spark, tasks from different applications run in different JVMs, then > what about tasks from the same application? > -- Mathieu Longtin 1-514-803-8977

Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
The user guide describes a broadcast as a way to move a large dataset to each node: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input
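For reference, a minimal sketch of the broadcast API the guide is describing (the data here is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // The broadcast value is shipped to each executor once and cached there read-only.
    val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

    val codes = sc.parallelize(Seq("US", "DE", "FR"))
    val named = codes.map(c => countryNames.value.getOrElse(c, "unknown")).collect()
    named.foreach(println)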

Re: Using R code as part of a Spark Application

2016-06-29 Thread sujeet jog
Try Spark pipeRDDs; you can invoke the R script from pipe, push the stuff you want to do to the Rscript stdin, p On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau wrote: > Hello, > > > > I want to use R code as part of spark application (the same way I would do >
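A hedged sketch of the pipe() approach in Scala (the script path is a placeholder, and the script must be available on every worker):

    // RDD.pipe() writes each partition's elements to the external process's stdin,
    // one per line, and returns whatever the process prints to stdout as an RDD[String].
    val rows   = sc.parallelize(Seq("1,2,3", "4,5,6"))
    val scored = rows.pipe("Rscript /path/to/score.R")   // placeholder R script
    scored.collect().foreach(println)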

Re: Spark jobs

2016-06-29 Thread sujeet jog
Check if this helps: from multiprocessing import Process def training(): print("Training Workflow") cmd = "spark/bin/spark-submit ./ml.py &" os.system(cmd) w_training = Process(target = training) On Wed, Jun 29, 2016 at 6:28 PM, Joaquin Alzola

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Michael Segel
Hi, I’m not sure I understand your initial question… Depending on the compression algo, you may or may not be able to split the file. So if its not splittable, you have a single long running thread. My guess is that you end up with a very long single partition. If so, if you repartition,

Using R code as part of a Spark Application

2016-06-29 Thread Gilad Landau
Hello, I want to use R code as part of a Spark application (the same way I would do with Scala/Python). I want to be able to run R syntax as a map function on a big Spark dataframe loaded from a parquet file. Is this even possible, or is the only way to use R as part of RStudio orchestration

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke
Does the same happen if all the tables are in ORC format? It might be just simpler to convert the text table to ORC since it is rather small > On 29 Jun 2016, at 15:14, Mich Talebzadeh wrote: > > Hi all, > > It finished in 2 hours 18 minutes! > > Started at >
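If converting makes sense, a hedged sketch of doing it from Spark via HiveContext (the table names are placeholders):

    // Create an ORC copy of the small text table; hiveContext is an existing HiveContext.
    hiveContext.sql(
      "CREATE TABLE small_table_orc STORED AS ORC AS SELECT * FROM small_table_text")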

Can Spark Dataframes preserve order when joining?

2016-06-29 Thread Jestin Ma
If it’s not too much trouble, could I get some pointers/help on this? (see link) http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order -also, as a side question, do

Spark RDD aggregate action behaves strangely

2016-06-29 Thread Kaiyin Zhong
Could anyone have a look at this? http://stackoverflow.com/questions/38100918/spark-rdd-aggregate-action-behaves-strangely Thanks! Best regards, Kaiyin ZHONG

Spark jobs

2016-06-29 Thread Joaquin Alzola
Hi, This is a total newbie question, but I can't seem to find the link. When I create a spark-submit Python script to be launched, how should I call it from the main Python script with a subprocess.Popen? BR Joaquin This email is confidential and may be subject to privilege. If you

Do tasks from the same application run in different JVMs

2016-06-29 Thread Huang Meilong
Hi, In spark, tasks from different applications run in different JVMs, then what about tasks from the same application?

Metadata for the StructField

2016-06-29 Thread Ted Yu
You can specify Metadata for the StructField : case class StructField( name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty) { FYI On Wed, Jun 29, 2016 at 2:50 AM, pooja mehta wrote: > Hi, > > Want to add a
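A small sketch of what that looks like in practice (the field names are invented):

    import org.apache.spark.sql.types._

    // Build metadata with MetadataBuilder and attach it via StructField's metadata parameter.
    val ageMeta = new MetadataBuilder()
      .putString("description", "customer age in years")
      .build()

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true, metadata = ageMeta)
    ))

    // The metadata travels with the schema and can be read back later:
    // df.schema("age").metadata.getString("description")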

Re: Set the node the spark driver will be started

2016-06-29 Thread Felix Massem
In addition we are not using Yarn we are using the standalone mode and the driver will be started with the deploy-mode cluster Thx Felix Felix Massem | IT-Consultant | Karlsruhe mobil: +49 (0) 172.2919848 <> www.codecentric.de | blog.codecentric.de

Spark sql dataframe

2016-06-29 Thread pooja mehta
Hi, Want to add a metadata field to StructField case class in spark. case class StructField(name: String) And how to carry over the metadata in query execution.

[no subject]

2016-06-29 Thread pooja mehta
Hi, Want to add a metadata field to StructField case class in spark. case class StructField(name: String) And how to carry over the metadata in query execution.

Re: Running into issue using SparkIMain

2016-06-29 Thread Jayant Shekhar
Hello, Found a workaround to it. Installed scala and added the scala jars to the classpath before starting the web application. Now it works smoothly - just that it adds an extra step for the users to do. Would next look into making it work with the scala jar files contained in the war. Thx

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke
I think the TEZ engine is much more maintained with respect to optimizations related to ORC, Hive, vectorizing, and querying than the MR engine. It will definitely be better to use it. MR is also deprecated in Hive 2.0. For me it does not make sense to use MR with Hive later than 1.1. As I

Re: Set the node the spark driver will be started

2016-06-29 Thread Felix Massem
Hey Mich, the distribution is essentially nonexistent. Right now I have 15 applications and all 15 drivers are running on one node. This is just after giving all machines a little more memory. Before, I had about 15 applications and about 13 drivers were running on one machine. While trying to

Job aborted due to not serializable exception

2016-06-29 Thread Paolo Patierno
Hi, following the socketStream[T] function implementation from the official Spark GitHub repo: def socketStream[T]( hostname: String, port: Int, converter: JFunction[InputStream, java.lang.Iterable[T]],
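For comparison, a hedged Scala sketch of the same API; keeping the converter in a standalone serializable object (rather than an anonymous class that captures outer state) is one common way to avoid task-not-serializable errors. This is an illustration, not the fix for the thread's exact case:

    import java.io.{BufferedReader, InputStream, InputStreamReader}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The converter holds no reference to a non-serializable enclosing object.
    object LineConverter extends Serializable {
      def toLines(in: InputStream): Iterator[String] = {
        val reader = new BufferedReader(new InputStreamReader(in))
        Iterator.continually(reader.readLine()).takeWhile(_ != null)
      }
    }

    // ssc is an existing StreamingContext (e.g. new StreamingContext(sc, Seconds(1))).
    // val stream = ssc.socketStream[String]("localhost", 9999, LineConverter.toLines, StorageLevel.MEMORY_ONLY)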

Driver zombie process (standalone cluster)

2016-06-29 Thread Tomer Benyamini
Hi, I'm trying to run spark applications on a standalone cluster, running on top of AWS. Since my slaves are spot instances, in some cases they are being killed and lost due to bid prices. When apps are running during this event, sometimes the spark application dies - and the driver process just

Re: Best practice for handing tables between pipeline components

2016-06-29 Thread Chanh Le
Hi Everett, We have been using Alluxio for the last 2 months. We implemented Alluxio for sharing data between Spark jobs, isolating Spark to the processing layer only and Alluxio to the storage layer. > On Jun 29, 2016, at 2:52 AM, Everett Anderson > wrote: > > Thanks! Alluxio