Re: DataFrame column name restriction

2015-04-11 Thread Michael Armbrust
That is a good question. Names with `.` in them are, in particular, broken by SPARK-5632 (https://issues.apache.org/jira/browse/SPARK-5632), which I'd like to fix. There is a more general question of whether strings that are passed to DataFrames should be treated as quoted identifiers (i.e. `as
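A minimal sketch of the workaround implied here, assuming Spark 1.3-era APIs and a hypothetical input with a column named "a.b": rename dotted columns up front so later expressions resolve them without tripping over SPARK-5632.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)          // assumes an existing SparkContext `sc`
    val df = sqlContext.jsonFile("events.json")  // hypothetical file with a column named "a.b"

    // Renaming the dotted column sidesteps the broken `.` resolution.
    val cleaned = df.withColumnRenamed("a.b", "a_b")
    cleaned.select("a_b").show()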

Re: Keep local variable

2015-04-11 Thread Tassilo Klein
Hi Gerard, thanks for the hint with the Singleton object. Seems very interesting. However, when my singleton object (e.g., a handle to my DB) is supposed to have a non-serializable member variable, I will again have a problem, won't I? At least I always run into issues where Python tries to
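A minimal sketch of the singleton pattern Gerard suggested, in Scala terms (DbClient, its connect/insert calls, and rdd are hypothetical stand-ins): because a Scala object lives once per JVM, the non-serializable handle is built lazily on each executor and never shipped from the driver.

    object DbHolder {
      // Created on first use inside each executor JVM; never serialized.
      lazy val client: DbClient = DbClient.connect("db-host:5432")
    }

    rdd.foreachPartition { records =>
      val db = DbHolder.client            // resolved locally on the executor
      records.foreach(r => db.insert(r))
    }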

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-11 Thread Guillaume Pitel
Hi, I had to set up a cron job for cleanup in $SPARK_HOME/work and in $SPARK_LOCAL_DIRS. Here are the cron lines. Unfortunately they're for *nix machines; I guess you will have to adapt them seriously for Windows.
12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
32 * * * *

Re: HiveThriftServer2

2015-04-11 Thread Cheng Lian
Unfortunately the spark-hive-thriftserver hasn't been published yet; you may either publish it locally or use it as an unmanaged SBT dependency. On 4/8/15 8:58 AM, Mohammed Guller wrote: Hi – I want to create an instance of HiveThriftServer2 in my Scala application, so I imported the
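A hedged sketch of the two options Cheng mentions (the version number is a placeholder):

    // Option 1 (unmanaged): copy the locally built spark-hive-thriftserver
    // jar into the project's lib/ directory; sbt adds lib/*.jar automatically.

    // Option 2 (publish locally): after `sbt publishLocal` in the Spark
    // source tree, declare a normal dependency in build.sbt:
    libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"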

Re: Yarn application state monitor thread dying on IOException

2015-04-11 Thread Steve Loughran
On 10 Apr 2015, at 13:40, Lorenz Knies m...@l1024.org wrote: I would consider it a bug that the "Yarn application state monitor" thread dies on an, I think, even expected exception (at least in the Java methods called further down the stack). What do you think? Is it a problem that we
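Not Spark's actual code, just a sketch of the alternative being argued for: a monitor loop that treats IOException as expected and keeps polling instead of letting the thread die (pollApplicationState is hypothetical).

    while (!Thread.currentThread().isInterrupted) {
      try {
        pollApplicationState()          // hypothetical call that may throw IOException
      } catch {
        case _: java.io.IOException =>  // expected transient failure: swallow and retry
      }
      Thread.sleep(1000)                // poll interval
    }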

Tasks going into NODE_LOCAL at beginning of job

2015-04-11 Thread Jeetendra Gangele
I have 3 transformations and then I am running foreach. The job's process is going into NODE_LOCAL level, no executor is assigned, and it waits for a long time with no task running. Regards, Jeetendra
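One knob worth checking in situations like this (a hedged suggestion, not from the thread itself): spark.locality.wait, which controls how long the scheduler holds out for NODE_LOCAL slots before falling back to less-local ones.

    val conf = new org.apache.spark.SparkConf()
      .setAppName("locality-example")
      .set("spark.locality.wait", "1000")   // milliseconds; default is 3000
    val sc = new org.apache.spark.SparkContext(conf)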

RE: HiveThriftServer2

2015-04-11 Thread Mohammed Guller
Thanks, Cheng. BTW, there is another thread on the same topic. It looks like the thrift-server will be published for 1.3.1. Mohammed From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Saturday, April 11, 2015 5:37 AM To: Mohammed Guller; user@spark.apache.org Subject: Re: HiveThriftServer2

Re: Spark on Mesos / Executor Memory

2015-04-11 Thread Tim Chen
(Adding spark user list) Hi Tom, If I understand correctly, you're saying that you're running into memory problems because the scheduler is allocating too many CPUs and not enough memory to accommodate them, right? In the case of fine-grained mode I don't think that's a problem, since we have a fixed
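A sketch of the settings in play here (values are placeholders): executor memory is fixed per executor, while the coarse/fine-grained switch changes how Mesos allocates CPUs.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.mesos.coarse", "false")   // fine-grained mode
      .set("spark.executor.memory", "4g")   // fixed memory per executor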

Re: Spark support for Hadoop Formats (Avro)

2015-04-11 Thread ๏̯͡๏
The read seems to be successful, as the values for each field in the record are different and correct. The problem is when I collect it or trigger the next processing (join with another table); each of these probably triggers serialization, and that's when all the fields in the record get the value of the first
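A hedged guess at the classic cause behind this symptom, sketched for illustration (the path and record type are placeholders): Hadoop input formats reuse the record object, so anything that retains references across records (collect, a join's shuffle, a cache) can end up with many pointers to one reused record unless each record is copied out first.

    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val raw = sc.newAPIHadoopFile(
      "hdfs:///data/events.avro",                    // hypothetical path
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])

    // Copy each record out of the reused container before collecting or
    // joining; otherwise every element may point at the same object, which
    // looks exactly like "all fields get the first record's values".
    val records = raw.map { case (k, _) =>
      GenericData.get().deepCopy(k.datum().getSchema, k.datum())
    }
    records.collect()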

Re: not found: value SQLContextSingleton

2015-04-11 Thread Tathagata Das
Have you created a class called SQLContextSingleton? If so, is it on the compile classpath? On Fri, Apr 10, 2015 at 6:47 AM, Mukund Ranjan (muranjan) muran...@cisco.com wrote: Hi All, Any idea why I am getting this error? wordsTenSeconds.foreachRDD((rdd: RDD[String], time: Time)
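The name matches the lazily instantiated singleton from the Spark Streaming examples; a minimal sketch of it, in case the class was never actually defined in the project:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object SQLContextSingleton {
      @transient private var instance: SQLContext = _
      // One SQLContext per JVM, created on first use.
      def getInstance(sparkContext: SparkContext): SQLContext = synchronized {
        if (instance == null) instance = new SQLContext(sparkContext)
        instance
      }
    }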

Spark support for Hadoop Formats (Avro)

2015-04-11 Thread ๏̯͡๏
We have very large processing being done on Hadoop (400 M/R jobs, 1-day duration, 100s of TB of data, 100s of joins). We are exploring Spark as an alternative to speed up our processing time. We use Scala + Scoobi today, and Avro is the data format across steps. I observed a strange behavior: I read

Re: Microsoft SQL jdbc support from spark sql

2015-04-11 Thread Cheng Lian
Your first DDL should be correct (as long as the JDBC URL is correct). The string after USING should be the data source name (org.apache.spark.sql.jdbc or simply jdbc). The SQLException here indicates that Spark SQL couldn't find the SQL Server JDBC driver on the classpath. As Denny said,
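A hedged sketch of the DDL Cheng describes (URL, table, and credentials are placeholders); the SQL Server driver jar still has to be on the classpath, e.g. via spark-submit's --jars or --driver-class-path.

    sqlContext.sql("""
      CREATE TEMPORARY TABLE orders
      USING org.apache.spark.sql.jdbc
      OPTIONS (
        url 'jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=u;password=p',
        dbtable 'dbo.orders'
      )
    """)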

Re: Spark SQL or rules hot reload

2015-04-11 Thread Cheng Lian
What do you mean by rules? Spark SQL optimization rules? Currently these are entirely private to Spark SQL and are not configurable at runtime. Cheng On 4/10/15 2:55 PM, Bruce Dou wrote: Hi, how do I manage the life cycle of Spark SQL and the rules applied to the data stream? Enabling or

Re: How to use Joda Time with Spark SQL?

2015-04-11 Thread Cheng Lian
One possible approach is to define a UDT (user-defined type) for Joda time. A UDT maps an arbitrary type to and from Spark SQL data types. You may check ExamplePointUDT [1] for more details. [1]:
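A rough sketch against the Spark 1.x developer API (details of UserDefinedType may differ by version) that stores a Joda DateTime as epoch millis:

    import org.apache.spark.sql.types._
    import org.joda.time.DateTime

    class JodaDateTimeUDT extends UserDefinedType[DateTime] {
      override def sqlType: DataType = LongType
      override def serialize(obj: Any): Any = obj match {
        case dt: DateTime => dt.getMillis        // store as epoch millis
      }
      override def deserialize(datum: Any): DateTime = datum match {
        case millis: Long => new DateTime(millis)
      }
      override def userClass: Class[DateTime] = classOf[DateTime]
    }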

Unusual behavior with leftouterjoin

2015-04-11 Thread ๏̯͡๏
I have two RDDs: leftRDD = RDD[(Long, (DetailInputRecord, VISummary, Long))] and rightRDD = RDD[(Long, com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum)]. DetailInputRecord is an object that contains (guid, sessionKey, sessionStartDate, siteID). There are 10 records in leftRDD
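A minimal sketch of the operation under discussion, with simplified element types standing in for DetailInputRecord/VISummary (assumes an existing SparkContext `sc`): leftOuterJoin keeps every left record and wraps right-side matches in Option.

    import org.apache.spark.rdd.RDD

    val left: RDD[(Long, String)] = sc.parallelize(Seq(1L -> "a", 2L -> "b"))
    val right: RDD[(Long, Int)]   = sc.parallelize(Seq(1L -> 10))

    // Unmatched keys survive on the left with None on the right.
    val joined: RDD[(Long, (String, Option[Int]))] = left.leftOuterJoin(right)
    joined.collect().foreach(println)   // (1,(a,Some(10))), (2,(b,None))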

Re: Unusual behavior with leftouterjoin

2015-04-11 Thread ๏̯͡๏
I took that RDD, ran through it, and printed 4 elements from it; they all printed correctly.
val x = viEvents.map { case (itemId, event) =>
  println(event.get("guid"), itemId, event.get("itemId"), event.get("siteId"))
  (itemId, event)
}
The above code prints