Re: Databricks fails to read the csv file with blank line at the file header

2016-03-26 Thread Koert Kuipers
To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals with this... On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote: > > Hi, > > I have a standard csv file (saved as csv in HDFS) that has first

Databricks fails to read the csv file with blank line at the file header

2016-03-26 Thread Mich Talebzadeh
Hi, I have a standard csv file (saved as csv in HDFS) that has a blank first line above the header, as follows: [blank line] Date, Type, Description, Value, Balance, Account Name, Account Number [blank line] 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN
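
A minimal workaround sketch for this thread (assuming Spark 1.6 with the spark-csv package on the classpath and an existing SparkContext `sc`; the HDFS paths are hypothetical): filter out the blank lines first, then let spark-csv treat the first remaining line as the header.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// drop blank lines so the header row becomes the first line of the cleaned copy;
// coalesce(1) keeps a single output file so the header stays first
sc.textFile("hdfs:///tmp/statement.csv")
  .filter(_.trim.nonEmpty)
  .coalesce(1)
  .saveAsTextFile("hdfs:///tmp/statement_clean")

// spark-csv can now pick up the header normally
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///tmp/statement_clean")
df.printSchema()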

Re: Limit pyspark.daemon threads

2016-03-26 Thread Sven Krasser
Hey Ken, 1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap storage option using Alluxio, formerly Tachyon, with which I have no experience, however.) 2. The worker memory setting is unfortunately not a hard maximum. What happens is that during aggregation the Python daemon

Re: Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Thanks Ted. I am more interested in the general availability of Hive 2 on the Spark 1.6 engine, as opposed to vendor-specific custom builds. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Hive on Spark engine

2016-03-26 Thread Ted Yu
According to: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_HDP_RelNotes/bk_HDP_RelNotes-20151221.pdf Spark 1.5.2 comes out of the box. I suggest moving questions on HDP to the Hortonworks forum. Cheers On Sat, Mar 26, 2016 at 3:32 PM, Mich Talebzadeh wrote: >

Re: Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Thanks Jörn. Just to be clear, do they get Hive working with Spark 1.6 out of the box (binary download)? The usual work-around is to build your own package and get the Hadoop-assembly jar file copied over to $HIVE_HOME/lib. Cheers Dr Mich Talebzadeh LinkedIn *

Re: Hive on Spark engine

2016-03-26 Thread Jörn Franke
If you check the newest Hortonworks distribution you will see that it generally works. Maybe you can borrow some of their packages. Alternatively, it should also be available in other distributions. > On 26 Mar 2016, at 22:47, Mich Talebzadeh wrote: > > Hi, > > I am

Hive on Spark engine

2016-03-26 Thread Mich Talebzadeh
Hi, I am running Hive 2 and now Spark 1.6.1, but I still do not see any sign that Hive can utilise a Spark engine higher than 1.3.1. My understanding was that there was a mismatch in the Hadoop assembly jar files that prevented Hive from running on Spark using the binary downloads. I just tried

Re: Limit pyspark.daemon threads

2016-03-26 Thread Carlile, Ken
This is extremely helpful! I’ll have to talk to my users about how the python memory limit should be adjusted and what their expectations are. I’m fairly certain we bumped it up in the dark past when jobs were failing because of insufficient memory for the python processes.  So

Re: Limit pyspark.daemon threads

2016-03-26 Thread Sven Krasser
My understanding is that the spark.executor.cores setting controls the number of worker threads in the executor in the JVM. Each worker thread communicates then with a pyspark daemon process (these are not threads) to stream data into Python. There should be one daemon process per worker thread
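
For reference, the settings discussed in this thread can be wired up like this (a sketch only; the values are arbitrary examples, and the spark.python.worker.memory key only affects PySpark applications):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("pyspark-daemon-limits")
  .set("spark.executor.cores", "4")            // worker threads per executor JVM
  .set("spark.python.worker.memory", "512m")   // per-Python-worker aggregation buffer before spilling (PySpark only)
  .set("spark.executor.memory", "8g")          // JVM heap, which is where cached RDDs live
val sc = new SparkContext(conf)

In a PySpark application the same keys go on the Python-side SparkConf; the point here is only which setting maps to which resource.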

A problem involving Spark & HBase.

2016-03-26 Thread ManasjyotiSharma
Disclaimer: This is more of a design question. I am very new to Spark and HBase. This is going to be my first project using these 2 technologies, and for the last 2 months or so I have just been going over different resources to get a grasp of Spark and HBase. My question is mainly in terms

Fwd: This simple UDF is not working!

2016-03-26 Thread Mich Talebzadeh
Thanks great Dhaval. scala> import java.text.SimpleDateFormat import java.text.SimpleDateFormat scala> scala> import java.sql.Date import java.sql.Date scala> scala> import scala.util.{Try, Success, Failure} import scala.util.{Try, Success, Failure} scala> val toDate = udf{(out:String, form:

Re: whether a certain piece can be assigned to a specified node by some code in my program.

2016-03-26 Thread Ted Yu
Please take a look at the following method: /** * Get the preferred locations of a partition, taking into account whether the * RDD is checkpointed. */ final def preferredLocations(split: Partition): Seq[String] = { checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {

whether a certain piece can be assigned to a specified node by some code in my program.

2016-03-26 Thread chenyong
I am a newbie to the great Spark framework. After reading some materials about Spark, I know that an RDD dataset is actually broken into pieces and distributed among several nodes. I am wondering whether a certain piece can be assigned to a specified node by some code in my program. Or
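
A small sketch along the lines of Ted's pointer (assuming an existing SparkContext `sc`; the host names are hypothetical): location preferences can be supplied when an RDD is created with makeRDD and read back with preferredLocations. The scheduler treats them as hints, not guarantees.

// each element becomes one partition, tagged with its preferred hosts
val data = Seq(
  (1 to 100,   Seq("node1.example.com")),
  (101 to 200, Seq("node2.example.com"))
)
val rdd = sc.makeRDD(data)

// inspect what the scheduler will see for each partition
rdd.partitions.foreach(p => println(s"partition ${p.index} prefers ${rdd.preferredLocations(p)}"))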

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-26 Thread Ted Yu
That's quite informative, Michal, though I can't read the first few slides, which are not in English. On Sat, Mar 26, 2016 at 6:12 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > Ted, > > Sure. This was presented by my colleague during Data Science London > meetup. The talk was

Re: Create one DB connection per executor

2016-03-26 Thread Manas
Thanks much Gerard & Manas for your inputs. I'll keep in mind the connection pooling part. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Create-one-DB-connection-per-executor-tp26588p26601.html Sent from the Apache Spark User List mailing list archive at
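
For reference, a minimal sketch of the per-partition connection pattern that usually comes up in this discussion (the JDBC URL, credentials, and table are hypothetical, and `records` is assumed to be an RDD[(Int, String)]). A true per-executor pool would instead live in a lazily initialised singleton object on the executors.

import java.sql.DriverManager

records.foreachPartition { rows =>
  // one connection per partition (i.e. per task), reused for every row in that partition
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "secret")
  val stmt = conn.prepareStatement("INSERT INTO events (id, name) VALUES (?, ?)")
  try {
    rows.foreach { case (id, name) =>
      stmt.setInt(1, id)
      stmt.setString(2, name)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}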

Re: Is this expected in Spark 1.6.1, derby.log file created when spark shell starts

2016-03-26 Thread Ted Yu
Same with master branch. I found derby.log in the following two files: .gitignore:derby.log dev/.rat-excludes:derby.log FYI On Sat, Mar 26, 2016 at 4:09 AM, Mich Talebzadeh wrote: > Having moved to Spark 1.6.1, I have noticed thar whenerver I start a > spark-sql or

Re: Limit pyspark.daemon threads

2016-03-26 Thread Carlile, Ken
Thanks, Sven! I know that I've messed up the memory allocation, but I'm trying not to think too much about that (because I've advertised it to my users as "90GB for Spark works!", and that's how it displays in the Spark UI, totally ignoring the python processes). So I'll need to deal

Fwd: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-26 Thread Michał Zieliński
Ted, Sure. This was presented by my colleague during Data Science London meetup. The talk was about "Scalable Predictive Pipelines with Spark & Scala". Link to the meetup and slides below: http://www.meetup.com/Data-Science-London/events/229755935/

Is this expected in Spark 1.6.1, derby.log file created when spark shell starts

2016-03-26 Thread Mich Talebzadeh
Having moved to Spark 1.6.1, I have noticed that whenever I start spark-sql or spark-shell, a derby.log file is created in the directory! cat derby.log Sat Mar 26 11:18:55 GMT 2016: Booting Derby version The Apache Software Foundation
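
One way to tame this (a sketch, not something the thread itself settled on): Derby writes derby.log to the current working directory unless the derby.stream.error.file JVM property points elsewhere, so setting that property before the Hive metastore first boots Derby relocates the file. The path below is a placeholder.

// equivalent to launching the shell with:
//   --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp/derby.log"
// must run before anything touches the Hive metastore (and therefore Derby)
System.setProperty("derby.stream.error.file", "/tmp/derby.log")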

Re: Finding out the time a table was created

2016-03-26 Thread Mich Talebzadeh
Hi Ted, I moved to Spark 1.6. Still the same issue outstanding. [spark-shell startup banner: Spark version 1.6.1] Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25) Type
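
For the original question in this thread (when a table was created), one option for Hive-backed tables is to read the CreateTime field that the metastore reports, e.g. via DESCRIBE FORMATTED (the table name below is a placeholder, and a Hive-backed sqlContext is assumed):

// the output of DESCRIBE FORMATTED on a Hive metastore table includes a CreateTime row
sqlContext.sql("DESCRIBE FORMATTED my_table")
  .collect()
  .foreach(println)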

Re: is there any way to submit spark application from outside of spark cluster

2016-03-26 Thread Hyukjin Kwon
Hi, For a RESTful API for submitting an application, please take a look at this link: http://arturmkrtchyan.com/apache-spark-hidden-rest-api On 26 Mar 2016 12:07 p.m., "vetal king" wrote: > Prateek > > It's possible to submit a Spark application from an outside application. If
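
Besides the hidden REST endpoint linked above, another option (a sketch; it assumes a Spark installation is reachable from the submitting machine, and the paths, class name, and master URL are placeholders) is the programmatic SparkLauncher API that ships with Spark 1.4+:

import org.apache.spark.launcher.SparkLauncher

// builds and runs a spark-submit from any JVM that can reach the cluster
val process = new SparkLauncher()
  .setSparkHome("/opt/spark")                 // placeholder
  .setAppResource("/path/to/my-app.jar")      // placeholder
  .setMainClass("com.example.MyApp")          // placeholder
  .setMaster("spark://master-host:7077")      // placeholder
  .setDeployMode("cluster")
  .launch()

process.waitFor()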

Re: This simple UDF is not working!

2016-03-26 Thread Dhaval Modi
Hi Mich, You can try this: val toDate = udf{(out:String, form: String) => { val format = new SimpleDateFormat(s"$form"); Try(new Date(format.parse(out.toString()).getTime)) match { case Success(t) => Some(t) case Failure(_) => None }}}; Usage: src = src.withColumn(s"$columnName",
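
A self-contained version of the UDF sketched above (lightly cleaned; the usage line is truncated in the archive, so the column name and date pattern below are placeholders):

import java.text.SimpleDateFormat
import java.sql.Date
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.functions.{udf, lit}

// returns Some(java.sql.Date) on success, None (SQL NULL) on a parse failure
val toDate = udf { (out: String, form: String) =>
  val format = new SimpleDateFormat(form)
  Try(new Date(format.parse(out).getTime)) match {
    case Success(t) => Some(t)
    case Failure(_) => None
  }
}

// usage sketch: `src` is a DataFrame with a string column "payment_date" (placeholder)
val converted = src.withColumn("payment_date", toDate(src("payment_date"), lit("dd/MM/yyyy")))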