Delete checkpointed data for a single dataset?

2019-10-23 Thread Isabelle Phan
Hello, In a non-streaming application, I am using the checkpoint feature to truncate the lineage of complex datasets. The checkpointed data, which is stored in HDFS, is deleted at the end of the job. I am looking for a way to delete unused checkpointed data earlier, before the job ends.
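
As far as I know, Spark exposes no public API for dropping a single dataset's checkpoint mid-job, so a common workaround is to manage the checkpoint directory by hand and delete it through the Hadoop FileSystem API once nothing downstream still references the checkpointed data. A minimal sketch follows; the paths and the complexDs dataset are hypothetical, and it assumes removal is safe at that point. Alternatively, setting spark.cleaner.referenceTracking.cleanCheckpoints=true asks Spark to clean checkpoint files once the corresponding RDD is garbage collected.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical path; assumes a SparkSession named `spark` is in scope.
    val checkpointDir = "hdfs:///tmp/my-job-checkpoints"
    spark.sparkContext.setCheckpointDir(checkpointDir)

    val checkpointed = complexDs.checkpoint() // truncates the lineage

    // ... work with `checkpointed` ...

    // Delete the checkpoint files as soon as they are no longer needed,
    // instead of waiting for the end of the job. Assumes no later stage
    // still reads the checkpointed data.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.delete(new Path(checkpointDir), true) // recursive delete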

Read ORC file with subset of schema

2019-08-30 Thread Isabelle Phan
Hello, When reading an older ORC file whose schema is a subset of the current schema, the reader throws an error. Please see sample code below (run on Spark 2.1). The same commands on a Parquet file do not error out; they return the new column with null values. Is there a setting to add to the
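
A possible workaround sketch, assuming the error comes from forcing the newer schema onto the old file: read the old ORC file with its own schema, then append the missing column as nulls to mimic Parquet's behavior. The path, column name, and type here are invented for illustration.

    import org.apache.spark.sql.functions.lit

    // Read the old file as-is, then add the column that only exists in
    // the newer schema, filled with nulls.
    val oldDf = spark.read.orc("/data/old_orc_file")
    val patched = oldDf.withColumn("new_col", lit(null).cast("string"))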

ClassCastException when using SparkSQL Window function

2016-11-17 Thread Isabelle Phan
Hello, I have a simple session table, which tracks pages users visited with a sessionId. I would like to apply a window function by sessionId, but am hitting a type cast exception. I am using Spark 1.5.0. Here is sample code: scala> df.printSchema root |-- sessionid: string (nullable = true)
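
For context, a minimal window-by-session sketch on the schema shown; the ts ordering column and the choice of row_number are assumptions (on Spark 1.5.x the function is named rowNumber).

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Partition the window by session, ordering pages by an assumed
    // timestamp column `ts`.
    val w = Window.partitionBy("sessionid").orderBy("ts")
    val ranked = df.withColumn("page_number", row_number().over(w))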

Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
lect * from temp.log"? Adding a where clause to the query with a partition condition will help Spark prune the request to just the required partitions (vs. all, which is proving expensive). On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan <nlip...@gma
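
The suggested pruning, sketched with a hypothetical partition column dt:

    // The partition predicate lets Spark load metadata for the matching
    // partitions only, instead of all partitions of temp.log.
    val df = sqlContext.sql("select * from temp.log where dt = '2015-12-11'")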

Re: DataFrame creation delay?

2015-12-10 Thread Isabelle Phan
a lot for your help, Isabelle On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote: If you run sqlContext.table("...").registerTempTable("...") that temp table will cache the lookup of partitions. On Fri, Sep 4, 2015 at 1:1
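
A sketch of the suggested pattern; the log_cached name is invented:

    // Resolve the table once and register it as a temp table; queries
    // against the temp table reuse the cached partition lookup.
    sqlContext.table("temp.log").registerTempTable("log_cached")
    val df = sqlContext.sql("select count(*) from log_cached")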

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Hi, you can try: if your table is under location "/test/table/" on HDFS and has partitions "/test/table/dt=2012" and "/test/table/dt=2013", then: df.write.mode(SaveMode.Append).partitio
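
A reconstruction of the quoted suggestion as a runnable sketch; the constant dt value and the lit column are assumptions:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.lit

    // Adding the partition value as a constant column lets partitionBy
    // route all rows into the single directory /test/table/dt=2014.
    df.withColumn("dt", lit("2014"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("dt")
      .save("/test/table/")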

How to catch error during Spark job?

2015-10-27 Thread Isabelle Phan
Hello, I had a question about error handling in a Spark job: if an exception occurs during the job, what is the best way to get notified of the failure? Can Spark jobs return different exit codes? For example, I wrote a dummy Spark job that just throws an Exception, as follows: import
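
One common approach, sketched under the assumption that the job is launched by an external scheduler that inspects the driver's exit code:

    object MyJob {
      // Stand-in for the real job body, mirroring the dummy job above.
      def runJob(): Unit = throw new RuntimeException("boom")

      def main(args: Array[String]): Unit = {
        try {
          runJob()
        } catch {
          case e: Exception =>
            e.printStackTrace()
            sys.exit(1) // non-zero exit code signals failure to the caller
        }
      }
    }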

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
Thanks Michael and Ali for the reply! I'll make sure to use unresolved columns when working with self-joins then. As pointed out by Ali, isn't there still an issue with the aliasing? It works when using the org.apache.spark.sql.functions.col(colName: String) method, but not when using
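
For reference, the unresolved-column form being discussed, with assumed column names key and value:

    import org.apache.spark.sql.functions.col

    // Aliasing each side and referring to columns through col("alias.name")
    // keeps them unresolved until the join is analyzed, so the two sides
    // stay distinguishable.
    val joined = df.as("a").join(df.as("b"), col("a.key") === col("b.key"))
    joined.select(col("a.key"), col("b.value"))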

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
joins as it eagerly binds to a specific column in a way that breaks when we do the rewrite of one side of the query. Using the apply method constructs a resolved column eagerly (which loses the alias information). On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <nlip...@gm

How to distinguish columns when joining DataFrames with shared parent?

2015-10-20 Thread Isabelle Phan
Hello, When joining 2 DataFrames which originate from the same initial DataFrame, why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read? Let me illustrate this question with a simple example (run on Spark 1.5.1): //my initial DataFrame scala> df
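
A sketch of the ambiguity (the DataFrame and column names are assumed): both join inputs descend from the same df, so apply() resolves the column to the same underlying attribute on either side.

    // Both frames share `df` as parent...
    val purchases = df.filter(df("amount") > 0)
    val refunds   = df.filter(df("amount") < 0)
    // ...so df("key"), purchases("key") and refunds("key") all resolve to
    // the same attribute, and the join condition cannot tell them apart.
    val joined = purchases.join(refunds, purchases("key") === refunds("key"))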

Re: DataFrame creation delay?

2015-09-04 Thread Isabelle Phan
On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com> wrote: What format is this table? For Parquet and other optimized formats we cache a bunch of file metadata on first access to make interactive queries faster.

DataFrame creation delay?

2015-09-03 Thread Isabelle Phan
Hello, I am using SparkSQL to query some Hive tables. Most of the time, when I create a DataFrame using the sqlContext.sql("select * from table") command, DataFrame creation takes less than 0.5 seconds. But there is one table for which it takes almost 12 seconds! scala> val start =
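
The timing pattern implied by the truncated snippet, with an invented table name:

    val start = System.currentTimeMillis()
    val df = sqlContext.sql("select * from my_slow_table")
    println(s"DataFrame created in ${System.currentTimeMillis() - start} ms")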

Re: How to determine the value for spark.sql.shuffle.partitions?

2015-09-03 Thread Isabelle Phan
+1 I had the exact same question as I am working on my first Spark applications. Hope someone can share some best practices. Thanks! Isabelle On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman wrote: Hi all, The number of partitions greatly affects the speed and efficiency
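
For reference, the setting under discussion; the value here is purely illustrative (the default is 200), and the right number depends on the data volume per shuffle:

    sqlContext.setConf("spark.sql.shuffle.partitions", "400")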

DataFrame rollup with alias?

2015-08-23 Thread Isabelle Phan
Hello, I am new to Spark and just running some tests to get familiar with the APIs. When calling the rollup function on my DataFrame, I get different results when I alias the columns I am grouping on (see below for an example data set). I was expecting the alias function to affect only the column name.
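
A sketch of the comparison being described, with an invented schema (dept, month, sales):

    import org.apache.spark.sql.functions.sum

    // One would expect these to differ only in the grouping column's name,
    // not in the aggregated results.
    val plain   = df.rollup(df("dept"), df("month")).agg(sum("sales"))
    val aliased = df.rollup(df("dept").as("department"), df("month")).agg(sum("sales"))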