Delete checkpointed data for a single dataset?

2019-10-23 Thread Isabelle Phan
Hello, In a non-streaming application, I am using the checkpoint feature to truncate the lineage of complex datasets. The checkpointed data, which is stored in HDFS, is deleted at the end of the job. I am looking for a way to delete unused checkpointed data earlier, before the job ends.
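
As far as I know, Spark exposes no public API for dropping a single dataset's checkpoint mid-job, so a common workaround is to manage the checkpoint directory by hand and delete it through the Hadoop FileSystem API once nothing downstream still references the checkpointed data. A minimal sketch follows; the paths and the complexDs dataset are hypothetical, and it assumes removal is safe at that point. Alternatively, setting spark.cleaner.referenceTracking.cleanCheckpoints=true asks Spark to clean checkpoint files once the corresponding RDD is garbage collected.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical path; assumes a SparkSession named `spark` is in scope.
    val checkpointDir = "hdfs:///tmp/my-job-checkpoints"
    spark.sparkContext.setCheckpointDir(checkpointDir)

    val checkpointed = complexDs.checkpoint() // truncates the lineage

    // ... work with `checkpointed` ...

    // Delete the checkpoint files as soon as they are no longer needed,
    // instead of waiting for the end of the job. Assumes no later stage
    // still reads the checkpointed data.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.delete(new Path(checkpointDir), true) // recursive delete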

Read ORC file with subset of schema

2019-08-30 Thread Isabelle Phan
Hello, When reading an older ORC file whose schema is a subset of the current schema, the reader throws an error. Please see sample code below (run on Spark 2.1). The same commands on a Parquet file do not error out; they return the new column with null values. Is there a setting to add to the
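
A possible workaround sketch, assuming the error comes from forcing the newer schema onto the old file: read the old ORC file with its own schema, then append the missing column as nulls to mimic Parquet's behavior. The path, column name, and type here are invented for illustration.

    import org.apache.spark.sql.functions.lit

    // Read the old file as-is, then add the column that only exists in
    // the newer schema, filled with nulls.
    val oldDf = spark.read.orc("/data/old_orc_file")
    val patched = oldDf.withColumn("new_col", lit(null).cast("string"))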

ClassCastException when using SparkSQL Window function

2016-11-17 Thread Isabelle Phan
Hello, I have a simple session table, which tracks pages users visited with a sessionId. I would like to apply a window function by sessionId, but am hitting a type cast exception. I am using Spark 1.5.0. Here is sample code: scala> df.printSchema root |-- sessionid: string (nullable = true)
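
For context, a minimal window-by-session sketch on the schema shown; the ts ordering column and the choice of row_number are assumptions (on Spark 1.5.x the function is named rowNumber).

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Partition the window by session, ordering pages by an assumed
    // timestamp column `ts`.
    val w = Window.partitionBy("sessionid").orderBy("ts")
    val ranked = df.withColumn("page_number", row_number().over(w))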

Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
lect * from temp.log"? Adding a where clause to the query with a partition condition will help Spark prune the request to just the required partitions (vs. all, which is proving expensive). On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan <nlip...@gma
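
The suggested pruning, sketched with a hypothetical partition column dt:

    // The partition predicate lets Spark load metadata for the matching
    // partitions only, instead of all partitions of temp.log.
    val df = sqlContext.sql("select * from temp.log where dt = '2015-12-11'")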

Re: DataFrame creation delay?

2015-12-10 Thread Isabelle Phan
a lot for your help, Isabelle On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote: If you run sqlContext.table("...").registerTempTable("...") that temp table will cache the lookup of partitions. On Fri, Sep 4, 2015 at 1:1
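
A sketch of the suggested pattern; the log_cached name is invented:

    // Resolve the table once and register it as a temp table; queries
    // against the temp table reuse the cached partition lookup.
    sqlContext.table("temp.log").registerTempTable("log_cached")
    val df = sqlContext.sql("select count(*) from log_cached")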

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Hi, you can try: if your table is under location "/test/table/" on HDFS and has partitions "/test/table/dt=2012" and "/test/table/dt=2013", then: df.write.mode(SaveMode.Append).partitio
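
A reconstruction of the quoted suggestion as a runnable sketch; the constant dt value and the lit column are assumptions:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.lit

    // Adding the partition value as a constant column lets partitionBy
    // route all rows into the single directory /test/table/dt=2014.
    df.withColumn("dt", lit("2014"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("dt")
      .save("/test/table/")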

How to catch error during Spark job?

2015-10-27 Thread Isabelle Phan
Hello, I had a question about error handling in a Spark job: if an exception occurs during the job, what is the best way to get notified of the failure? Can Spark jobs return different exit codes? For example, I wrote a dummy Spark job that just throws an Exception, as follows: import
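
One common approach, sketched under the assumption that the job is launched by an external scheduler that inspects the driver's exit code:

    object MyJob {
      // Stand-in for the real job body, mirroring the dummy job above.
      def runJob(): Unit = throw new RuntimeException("boom")

      def main(args: Array[String]): Unit = {
        try {
          runJob()
        } catch {
          case e: Exception =>
            e.printStackTrace()
            sys.exit(1) // non-zero exit code signals failure to the caller
        }
      }
    }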

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
Thanks Michael and Ali for the reply! I'll make sure to use unresolved columns when working with self-joins then. As pointed out by Ali, isn't there still an issue with the aliasing? It works when using the org.apache.spark.sql.functions.col(colName: String) method, but not when using
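
For reference, the unresolved-column form being discussed, with assumed column names key and value:

    import org.apache.spark.sql.functions.col

    // Aliasing each side and referring to columns through col("alias.name")
    // keeps them unresolved until the join is analyzed, so the two sides
    // stay distinguishable.
    val joined = df.as("a").join(df.as("b"), col("a.key") === col("b.key"))
    joined.select(col("a.key"), col("b.value"))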

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
joins as it eagerly binds to a specific column in a way that breaks when we do the rewrite of one side of the query. Using the apply method constructs a resolved column eagerly (which loses the alias information). On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <nlip...@gm

How to distinguish columns when joining DataFrames with shared parent?

2015-10-20 Thread Isabelle Phan
Hello, When joining 2 DataFrames which originate from the same initial DataFrame, why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read? Let me illustrate this question with a simple example (run on Spark 1.5.1): //my initial DataFrame scala> df
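
A sketch of the ambiguity (the DataFrame and column names are assumed): both join inputs descend from the same df, so apply() resolves the column to the same underlying attribute on either side.

    // Both frames share `df` as parent...
    val purchases = df.filter(df("amount") > 0)
    val refunds   = df.filter(df("amount") < 0)
    // ...so df("key"), purchases("key") and refunds("key") all resolve to
    // the same attribute, and the join condition cannot tell them apart.
    val joined = purchases.join(refunds, purchases("key") === refunds("key"))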

Re: DataFrame creation delay?

2015-09-04 Thread Isabelle Phan
On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com> wrote: What format is this table? For Parquet and other optimized formats we cache a bunch of file metadata on first access to make interactive queries faster.

DataFrame creation delay?

2015-09-03 Thread Isabelle Phan
Hello, I am using SparkSQL to query some Hive tables. Most of the time, when I create a DataFrame using the sqlContext.sql("select * from table") command, DataFrame creation takes less than 0.5 seconds. But there is one table for which it takes almost 12 seconds! scala> val start =
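
The timing pattern implied by the truncated snippet, with an invented table name:

    val start = System.currentTimeMillis()
    val df = sqlContext.sql("select * from my_slow_table")
    println(s"DataFrame created in ${System.currentTimeMillis() - start} ms")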

Re: How to determine the value for spark.sql.shuffle.partitions?

2015-09-03 Thread Isabelle Phan
+1 I had the exact same question as I am working on my first Spark applications. Hope someone can share some best practices. Thanks! Isabelle On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman wrote: Hi all, The number of partitions greatly affects the speed and efficiency
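
For reference, the setting under discussion; the value here is purely illustrative (the default is 200), and the right number depends on the data volume per shuffle:

    sqlContext.setConf("spark.sql.shuffle.partitions", "400")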

DataFrame rollup with alias?

2015-08-23 Thread Isabelle Phan
Hello, I am new to Spark and just running some tests to get familiar with the APIs. When calling the rollup function on my DataFrame, I get different results when I alias the columns I am grouping on (see below for an example data set). I was expecting the alias function to affect only the column name.
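
A sketch of the comparison being described, with an invented schema (dept, month, sales):

    import org.apache.spark.sql.functions.sum

    // One would expect these to differ only in the grouping column's name,
    // not in the aggregated results.
    val plain   = df.rollup(df("dept"), df("month")).agg(sum("sales"))
    val aliased = df.rollup(df("dept").as("department"), df("month")).agg(sum("sales"))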