Re: A DataFrame cache bug

2017-02-22 Thread gen tang
Hi, The example that I provided was not very clear, so I have added a clearer example in JIRA. Thanks Cheers Gen On Wed, Feb 22, 2017 at 3:47 PM, gen tang <gen.tan...@gmail.com> wrote: > Hi Kazuaki Ishizaki > > Thanks a lot for your help. It works. However, an even stranger bug appea

Re: A DataFrame cache bug

2017-02-21 Thread gen tang
"overwrite").parquet(dir) > spark.catalog.refreshByPath(dir) // insert a NEW statement > val df1 = spark.read.parquet(dir) > df1.count // outputs 1000, which is correct; in fact, operations other than > df1.filter("id>10") return the correct result. > f(df1).count // out
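The code above is truncated in the archive; a minimal end-to-end sketch of the repro, assuming f as defined in the original post below, with the path and row counts invented for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-bug-repro").getOrCreate()
    val dir = "/tmp/cache_bug_demo"  // hypothetical path

    // Write an initial dataset and materialize a cached filter of it via f.
    spark.range(100).write.mode("overwrite").parquet(dir)
    val df0 = spark.read.parquet(dir)
    f(df0).count  // caches filter("id>10") computed over the old files

    // Overwrite the same path, invalidate Spark's metadata, and re-read.
    spark.range(1000).write.mode("overwrite").parquet(dir)
    spark.catalog.refreshByPath(dir)
    val df1 = spark.read.parquet(dir)

    df1.count     // 1000, correct: the re-read sees the new files
    f(df1).count  // reportedly served from the stale cache instead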

Re: A DataFrame cache bug

2017-02-21 Thread gen tang
Hi All, I might have found a related issue on JIRA: https://issues.apache.org/jira/browse/SPARK-15678 This issue is closed; maybe we should reopen it. Thanks Cheers Gen On Wed, Feb 22, 2017 at 1:57 PM, gen tang <gen.tan...@gmail.com> wrote: > Hi All, > > I found a strange bug w

A DataFrame cache bug

2017-02-21 Thread gen tang
Hi All, I found a strange bug related to reading data from an updated path combined with the cache operation. Please consider the following code:

    import org.apache.spark.sql.DataFrame

    def f(data: DataFrame): DataFrame = {
      val df = data.filter("id>10")
      df.cache
      df.count
      df
    }
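A hedged aside, not from the thread itself: when a cached plan can be matched against stale data like this, explicitly dropping the cache before re-reading sidesteps the problem. Both calls below are standard Spark APIs:

    df.unpersist()              // drop the one DataFrame cached inside f
    spark.catalog.clearCache()  // or drop every cache entry in the session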

Fwd: DataFrame slowdown with Tungsten turned on

2015-11-05 Thread gen tang
-- Forwarded message -- From: gen tang <gen.tan...@gmail.com> Date: Fri, Nov 6, 2015 at 12:14 AM Subject: Re: DataFrame slowdown with Tungsten turned on To: "Cheng, Hao" <hao.ch...@intel.com> Hi, My application is as follows: 1. create dataframe from h

Potential bug in broadcastNestedLoopJoin or the default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi, Recently I used Spark SQL to do a join on a non-equality condition, for example condition1 OR condition2. Spark uses broadcastNestedLoopJoin for this. Assume that one of the DataFrames (df1) is created neither from a Hive table nor from a local collection, and the other one (df2) is created from a Hive table. For
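To make the scenario concrete, a non-equi join of the kind described looks roughly like this (the column names are invented for illustration):

    // An OR of non-equality predicates cannot be planned as a hash-based
    // equi-join, so Spark falls back to a nested-loop join, broadcasting
    // one side if its estimated size is under the broadcast threshold.
    val joined = df1.join(df2, df1("a") > df2("b") || df1("c") < df2("d"))
    joined.explain()  // the physical plan shows BroadcastNestedLoopJoin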

Re: Potential bug in broadcastNestedLoopJoin or the default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Not sure how you created the df1 instance, but we had better reflect its real size in its statistics and let the framework decide what to do; hopefully Spark SQL can support non-equal joins for large tables in the next release. Hao *From:* gen tang [mailto:gen.tan
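A common workaround when a table's size statistics are misleading is to disable automatic broadcasting altogether; a sketch in Spark 1.x terms, assuming an existing sqlContext:

    // Setting the threshold to -1 stops Spark from broadcasting any table
    // automatically, avoiding OOMs when size estimates are wrong.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")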

Spark on teradata?

2015-01-07 Thread gen tang
Hi, I have a stupid question: is it possible to use Spark on a Teradata data warehouse? I have read some articles on the internet that say yes; however, I couldn't find any example of this. Thanks in advance. Cheers Gen
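For what it's worth, the usual route is Spark's generic JDBC data source with Teradata's JDBC driver on the classpath; a sketch in Spark 1.4+ syntax, where the host, database, table, and credentials are placeholders, and the driver class and URL format should be checked against Teradata's documentation:

    // Read a Teradata table into a DataFrame over JDBC (the driver jar
    // must be on the classpath, e.g. supplied via --jars).
    val tdDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:teradata://td-host/DATABASE=mydb")  // placeholder
      .option("driver", "com.teradata.jdbc.TeraDriver")        // driver class
      .option("dbtable", "mytable")
      .option("user", "user")
      .option("password", "secret")
      .load()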