Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
> > Jacek > > On 29 Dec 2016 5:03 p.m., "Yin Huai" <yh...@databricks.com> wrote: > >> Hi all, >> >> Apache Spark 2.1.0 is the second release of Spark 2.x line. This release >> makes significant strides in the production readiness of Structured

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-29 Thread Yin Huai
Hello, We spent some time preparing artifacts and changes to the website (including the release notes). I just sent out the announcement. 2.1.0 is officially released. Thanks, Yin On Wed, Dec 28, 2016 at 12:42 PM, Justin Miller < justin.mil...@protectwise.com> wrote: > Interesting, because

[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all, Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Yin Huai
Hello Michael, Thank you for reporting this issue. It will be fixed by https://github.com/apache/spark/pull/16080. Thanks, Yin On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao wrote: > Hi! > > Do you have real HIVE installation? > Have you built Spark 2.1 & Spark 2.0 with

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
Hi, we made the change because the partitioning discovery logic was too flexible and it introduced problems that were very confusing to users. To make your case work, we have introduced a new data source option called basePath. You can use DataFrame df =
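
For reference, a minimal Scala sketch of the basePath option described above (directory layout and column names are hypothetical):

    import org.apache.spark.sql.DataFrame

    // Data laid out as /data/events/year=2015/month=12/...
    // basePath tells partition discovery to keep year/month as partition columns
    // even though only one partition directory is being read.
    val df: DataFrame = sqlContext.read
      .option("basePath", "/data/events")
      .parquet("/data/events/year=2015/month=12")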

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
No problem! Glad it helped! On Thu, Jan 7, 2016 at 12:05 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi Yin, thanks much your answer solved my problem. Really appreciate it! > > Regards > > > On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai <yh...@databricks.com> wrot

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Yin Huai
Hi Eran, Can you try 1.6? With the change in https://github.com/apache/spark/pull/10288, JSON data source will not throw a runtime exception if there is any record that it cannot parse. Instead, it will put the entire record to the column of "_corrupt_record". Thanks, Yin On Sun, Dec 20, 2015
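
A small sketch of what that looks like in 1.6 (the file path is hypothetical):

    // Unparseable records no longer throw at read time; they land in the
    // _corrupt_record column (which only appears when such records exist).
    val logs = sqlContext.read.json("/data/events.json")
    logs.filter(logs("_corrupt_record").isNotNull).show()
    val clean = logs.filter(logs("_corrupt_record").isNull)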

Re: DataFrame equality does not working in 1.5.1

2015-11-05 Thread Yin Huai
Can you attach the result of eventDF.filter($"entityType" === "user").select("entityId").distinct.explain(true)? Thanks, Yin On Thu, Nov 5, 2015 at 1:12 AM, 千成徳 wrote: > Hi All, > > I have data frame like this. > > Equality expression is not working in 1.5.1 but, works as

Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Yin Huai
Hi Ross, What version of Spark are you using? There were two issues that affected the results of window functions in the Spark 1.5 branch. Both of these issues have been fixed and will be released with Spark 1.5.2 (this release will happen soon). For more details of these two issues, you can take a look at

Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread Yin Huai
@filthysocks, can you get the output of jmap -histo before the OOM ( http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html)? On Mon, Oct 26, 2015 at 6:35 AM, filthysocks wrote: > We upgrade from 1.4.1 to 1.5 and it's a pain > see > >

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Yin Huai
btw, what version of Spark did you use? On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote: > I've connected Spark SQL to the Hive Metastore and currently I'm running > SQL > code via pyspark. Typically everything works fine, but sometimes after a > long-running Spark SQL job I

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Hi Pala, Can you add the full stacktrace of the exception? For now, can you use create temporary function to workaround the issue? Thanks, Yin On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia < mchett...@rocketfuelinc.com.invalid> wrote: > +user list > > On Tue, Sep 29, 2015 at 3:43 PM, Pala

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$a

Re: Generic DataType in UDAF

2015-09-25 Thread Yin Huai
Hi Ritesh, Right now, we only allow specific data types defined in the inputSchema. Supporting abstract types (e.g. NumericType) may cause the logic of a UDAF to be more complex. It will be great to understand the use cases first. What kinds of input data types do you want to support, and

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue. On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote: > >> Hi Jerry, >> >> Looks like it

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
/browse/SPARK-10731>). >> >> Best Regards, >> >> Jerry >> >> >> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote: >> >>> btw, does 1.4 has the same problem? >>> >>> On Mon, Sep 21, 2015 at 10:01

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > Hi Spark Developers, > > I just ran some very simple operations on a dataset. I was surprise by the > execution plan of take(1),

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote: > Hi Jerry, > > Looks like it is a Python-specific issue. Can you create a JIRA? > > Thanks, > > Yin > > On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <

Re: Null Value in DecimalType column of DataFrame

2015-09-17 Thread Yin Huai
tting. Is this really the expected behavior? Never seen something > returning null in other Scala tools that I used. > > Regards, > > > 2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>: > >> btw, move it to user list. >> >> On Mon, Sep 14, 2015 at 2:

Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
btw, move it to user list. On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote: > A scale of 10 means that there are 10 digits at the right of the decimal > point. If you also have precision 10, the range of your data will be [0, 1) > and casting "10.5"

Re: java.util.NoSuchElementException: key not found

2015-09-11 Thread Yin Huai
Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422, it has been fixed in branch 1.5. 1.5.1 release will have it. On Fri, Sep 11, 2015 at 3:35 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi all , > After upgrade spark to 1.5 , Streaming throw >

Re: How to evaluate custom UDF over window

2015-08-24 Thread Yin Huai
For now, user-defined window functions are not supported. We will add this in the future. On Mon, Aug 24, 2015 at 6:26 AM, xander92 alexander.fra...@ompnt.com wrote: The ultimate aim of my program is to be able to wrap an arbitrary Scala function (mostly will be statistics / customized rolling window

Re: SQLContext Create Table Problem

2015-08-19 Thread Yin Huai
Can you try to use HiveContext instead of SQLContext? Your query is trying to create a table and persist the metadata of the table in the metastore, which is only supported by HiveContext. On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan yu...@useinsider.com wrote: Hello, I’m trying to create a
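
A minimal sketch of the suggested switch (the table name and schema are hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext can persist table metadata in the metastore; plain SQLContext cannot.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS events (id INT, name STRING)")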

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Yin Huai
Hi Todd, We have not yet had a chance to update it. We will update it after the 1.5 release. Thanks, Yin On Thu, Aug 13, 2015 at 6:49 AM, Todd bit1...@163.com wrote: Hi, I got a question about the spark-sql-perf project by Databricks at https://github.com/databricks/spark-sql-perf/ The

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
We do this in SparkILoop ( https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037). What version of Spark are you using? How did you add the spark-csv jar? On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
No problem:) Glad to hear that! On Thu, Jul 16, 2015 at 8:22 PM, Koert Kuipers ko...@tresata.com wrote: that solved it, thanks! On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers ko...@tresata.com wrote: thanks i will try 1.4.1 On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai yh...@databricks.com

Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Yin Huai
Your query will be partitioned once. Then, a single Window operator will evaluate these three functions. As mentioned by Harish, you can take a look at the plan (sql("your sql ...").explain()). On Mon, Jul 13, 2015 at 12:26 PM, Harish Butani rhbutani.sp...@gmail.com wrote: Just once. You can see

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon, Can you explain what you meant by "It simply does not work"? Did you not see new data files? Thanks, Yin On Fri, Jul 10, 2015 at 11:55 AM, Brandon White bwwintheho...@gmail.com wrote: Why does this not work? Is insert into broken in 1.3.1? It does not throw any errors, fail, or

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
Jerrick, Let me ask a few clarification questions. What is the version of Spark? Is the table a hive table? What is the format of the table? Is the table partitioned? Thanks, Yin On Sun, Jul 12, 2015 at 6:01 PM, ayan guha guha.a...@gmail.com wrote: Describe computes statistics, so it will

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
variable, however, as the behavior of the shell changed from hanging to quitting when the env var value got to 1g. /Sim Simeon Simeonov, Founder CTO, Swoop http://swoop.com/ @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746 From: Yin Huai yh...@databricks.com

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
4g /Sim Simeon Simeonov, Founder CTO, Swoop http://swoop.com/ @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746 From: Yin Huai yh...@databricks.com Date: Monday, July 6, 2015 at 12:59 AM To: Simeon Simeonov s...@swoop.com Cc: Denny Lee denny.g@gmail.com

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
the test file so you can reproduce this in your own environment. /Sim Simeon Simeonov, Founder CTO, Swoop http://swoop.com/ @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746 From: Yin Huai yh...@databricks.com Date: Sunday, July 5, 2015 at 11:04 PM To: Denny Lee denny.g

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
Sim, Can you increase the PermGen size? Please let me know what your setting is when the problem disappears. Thanks, Yin On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote: I had run into the same problem where everything was working swimmingly with Spark 1.3.1. When I

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
HiveContext's methods). Can you just use the sqlContext created by the shell and try again? Thanks, Yin On Thu, Jul 2, 2015 at 12:50 PM, Yin Huai yh...@databricks.com wrote: Hi Sim, Spark 1.4.0's memory consumption on PermGen is higher then Spark 1.3 (explained in https://issues.apache.org/jira/browse

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim, Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you add --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m to the command you used to launch the Spark shell? This will increase the PermGen size

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Yin Huai
The function accepted by explode is f: Row => TraversableOnce[A]. Seems user_mentions is an array of structs. So, can you change your pattern matching to the following? case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map(elem => ...) On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones

Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James, Maybe it's the DISTINCT causing the issue. I rewrote the query as follows. Maybe this one can finish faster. select sum(cnt) as uses, count(id) as users from ( select count(*) cnt, cast(id as string) as id, from usage_events where

Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
btw, the user list will be a better place for this thread. On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai yh...@databricks.com wrote: Is it the full stack trace? On Thu, Jun 18, 2015 at 6:39 AM, Sea 261810...@qq.com wrote: Hi, all: I want to run spark sql on yarn(yarn-client), but ... I already

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
Are you writing to an existing hive orc table? On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian lian.cs@gmail.com wrote: Thanks for reporting this. Would you mind to help creating a JIRA for this? On 6/16/15 2:25 AM, patcharee wrote: I found if I move the partitioned columns in schemaString

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
to open a jira about removing this requirement for inserting into an existing hive table. Thanks, Yin On Thu, Jun 18, 2015 at 9:39 PM, Yin Huai yh...@databricks.com wrote: Are you writing to an existing hive orc table? On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian lian.cs@gmail.com wrote

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Yin Huai
Hi John, Did you also set spark.sql.planner.externalSort to true? You probably will not see executors lost with this conf. For now, maybe you can manually split the query into two parts, one for the skewed keys and one for the other records. Then, you union the results of these two parts together. Thanks,
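
A rough sketch of that manual split, assuming the Spark 1.5+ DataFrame API (table names, key column, and the hot-key list are all hypothetical):

    import sqlContext.implicits._

    // left and right are the two sides of the join (hypothetical tables).
    val left  = sqlContext.table("left_tbl")
    val right = sqlContext.table("right_tbl")

    // Handle the heavily skewed keys separately, then union the two results.
    val hotKeys = Seq("k1", "k2")
    val skewed = left.filter($"key".isin(hotKeys: _*))
      .join(right.filter($"key".isin(hotKeys: _*)), "key")
    val rest = left.filter(!$"key".isin(hotKeys: _*))
      .join(right.filter(!$"key".isin(hotKeys: _*)), "key")
    val result = skewed.unionAll(rest)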

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Yin Huai
shell session, every time I do a write I get a random number of tasks failing on the first run with the NPE. Using dynamic allocation of executors in YARN mode. No speculative execution is enabled. On Tue, Jun 16, 2015 at 3:11 PM, Yin Huai yh...@databricks.com wrote: I saw it once but I

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Yin Huai
I saw it once but I was not clear how to reproduce it. The jira I created is https://issues.apache.org/jira/browse/SPARK-7837. More information will be very helpful. Were those errors from speculative tasks or regular tasks (the first attempt of the task)? Is this error deterministic (can you

Re: Issues with `when` in Column class

2015-06-12 Thread Yin Huai
Hi Chris, Have you imported org.apache.spark.sql.functions._? Thanks, Yin On Fri, Jun 12, 2015 at 8:05 AM, Chris Freeman cfree...@alteryx.com wrote: I’m trying to iterate through a list of Columns and create new Columns based on a condition. However, the when method keeps giving me errors

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-05 Thread Yin Huai
Hi Doug, For now, I think you can use sqlContext.sql("USE databaseName") to change the current database. Thanks, Yin On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai yh...@databricks.com wrote: Hi Doug, sqlContext.table does not officially support database names. It only supports table names

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Yin Huai
Are you using RC4? On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf nightwolf...@gmail.com wrote: Thanks Yin, that seems to work with the Shell. But on a compiled application with Spark-submit it still fails with the same exception. On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai yh...@databricks.com

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-04 Thread Yin Huai
. Thanks, Doug On Jun 3, 2015, at 8:21 PM, Yin Huai yh...@databricks.com wrote: Hi Doug, Actually, sqlContext.table does not support database name in both Spark 1.3 and Spark 1.4. We will support it in future version. Thanks, Yin On Wed, Jun 3, 2015 at 10:45 AM, Doug

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-03 Thread Yin Huai
Hi Doug, Actually, sqlContext.table does not support database names in either Spark 1.3 or Spark 1.4. We will support this in a future version. Thanks, Yin On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog doug.sparku...@dugos.com wrote: Hi, sqlContext.table(“db.tbl”) isn’t working for me, I get a

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Yin Huai
Does it happen every time you read a parquet source? On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg arp...@spotify.com wrote: The log is from the log aggregation tool (hortonworks, yarn logs ...), so both executors and driver. I'll send a private mail to you with the full logs. Also, tried

Re: Spark 1.3.0 - 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-30 Thread Yin Huai
Looks like your program somehow picked up an older version of Parquet (Spark 1.3.1 uses Parquet 1.6.0rc3, and it seems the NO_FILTER field was introduced in 1.6.0rc2). Is it possible for you to check the Parquet lib version in your classpath? Thanks, Yin On Sat, May 30, 2015 at 2:44 PM, ogoh

Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar, We just added it in Spark 1.4, where you can use a window function in HiveContext to do it. Assuming you want to calculate the cumulative sum for every flag, import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions._ df.select( $"flag", $"price",
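
A fuller version of that truncated snippet, assuming a DataFrame df with flag and price columns (the table name below is hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    // In Spark 1.4/1.5, window functions require a HiveContext (sqlContext in a
    // Hive-enabled shell). df is a hypothetical table with flag and price columns.
    val df = sqlContext.table("prices")

    // Running total of price within each flag value, from the first row up to the current row.
    val w = Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0)
    df.select($"flag", $"price", sum($"price").over(w).as("cumulative_sum"))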

Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread Yin Huai
Can you attach the full stack trace? Thanks, Yin On Fri, May 8, 2015 at 4:44 PM, barmaley o...@solver.com wrote: Given a registered table from data frame, I'm able to execute queries like sqlContext.sql(SELECT STDDEV(col1) FROM table) from Spark Shell just fine. However, when I run exactly

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote: The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, My data looks like

Re: Convert DStream to DataFrame

2015-04-24 Thread Yin Huai
Hi Sergio, I missed this thread somehow... For the error "case classes cannot have more than 22 parameters.", it is a limitation of Scala (see https://issues.scala-lang.org/browse/SI-7296). You can follow the instruction at

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, My data looks like this: +---++--+ |

Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Yin Huai
Hi Shuai, You can use as to create a table alias. For example, df1.as("df1"). Then you can use $"df1.col" to refer to it. Thanks, Yin On Thu, Apr 23, 2015 at 11:14 AM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I use 1.3.1 When I have two DF and join them on a same name key, after
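
A minimal sketch of that alias pattern; df1 and df2 are the two DataFrames being joined and the key/column names are hypothetical:

    import sqlContext.implicits._

    // Alias each side so the shared column name can be referenced unambiguously.
    val joined = df1.as("a").join(df2.as("b"), $"a.id" === $"b.id")
    joined.select($"a.id", $"b.someCol")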

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex, Can you try 1.3.1? With PR 5339 http://github.com/apache/spark/pull/5339 , after we get a hive parquet from metastore and convert it to our native parquet code path, we will cache the converted relation. For now, the first access to that hive parquet table reads all of the footers

Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar, Can you try 1.3.1 ( https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it still shows the error? Thanks, Yin On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote: This is strange. cc the dev list since it might be a bug. On Thu, Apr 16,

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can you share code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi As per JIRA this issue is resolved, but i am still facing this issue. SPARK-2734 - DROP TABLE should also uncache table -- [image: Sigmoid Analytics]

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
, Yin Huai yh...@databricks.com wrote: Can your code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi As per JIRA this issue is resolved, but i am still facing this issue. SPARK-2734 - DROP TABLE should also uncache table

Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For Map type column, fields['driver'] is the syntax to retrieve the map value (in the schema, you can see fields: map). The syntax of fields.driver is used for struct type. On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco jc.francisc...@gmail.com wrote: Hi, I'm new with both Cassandra and Spark

Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal, spark.sql.shuffle.partitions is the property to control the number of tasks after shuffle (to generate t2, there is a shuffle for the aggregations specified by groupBy and agg.) You can use sqlContext.setConf(spark.sql.shuffle.partitions, newNumber) or sqlContext.sql(set
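
For reference, a minimal sketch of both ways to change it (the value 50 is arbitrary):

    // Applies to shuffles introduced by groupBy/agg, joins, etc. (default is 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")
    // or, equivalently, via SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=50")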

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have a parquet file of around 55M rows (~ 1G on disk). Performing simple grouping operation is pretty

Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For cast, you can use the selectExpr method. For example, df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2"). Or, df.select(df("colA").cast("int"), ...) On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com wrote: val df = Seq(("test", 1)).toDF("col1", "col2") You can

Re: rdd.toDF().saveAsParquetFile(tachyon://host:19998/test)

2015-03-28 Thread Yin Huai
You are hitting https://issues.apache.org/jira/browse/SPARK-6330. It has been fixed in 1.3.1, which will be released soon. On Fri, Mar 27, 2015 at 10:42 PM, sud_self 852677...@qq.com wrote: spark version is 1.3.0 with tanhyon-0.6.1 QUESTION DESCRIPTION:

Re: Use pig load function in spark

2015-03-23 Thread Yin Huai
Hello Kevin, You can take a look at our generic load function https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#generic-loadsave-functions . For example, you can use val df = sqlContext.load("/myData", "parquet") to load a Parquet dataset stored in /myData as a DataFrame

Re: Date and decimal datatype not working

2015-03-23 Thread Yin Huai
--- Ananda Basak Ph: 425-213-7092 *From:* BASAK, ANANDA *Sent:* Tuesday, March 17, 2015 3:08 PM *To:* Yin Huai *Cc:* user@spark.apache.org *Subject:* RE: Date and decimal datatype not working Ok, thanks for the suggestions. Let me try and will confirm all. Regards Ananda

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yin Huai
Yin, thanks a lot for that! Will give it a shot and let you know. On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote: Was the OOM thrown during the execution of first stage (map) or the second stage (reduce)? If it was the second stage, can you increase the value

Re: Spark SQL filter DataFrame by date?

2015-03-19 Thread Yin Huai
Can you add your code snippet? Seems it's missing in the original email. Thanks, Yin On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka ken...@gmail.com wrote: I'm trying to filter a DataFrame by a date column, with no luck so far. Here's what I'm doing: When I run reqs_day.count() I get zero,

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
created https://issues.apache.org/jira/browse/SPARK-6413 to track the improvement on the output of DESCRIBE statement. On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com wrote: Hi Christian, Your table is stored correctly in Parquet format. For saveAsTable, the table created

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian, Your table is stored correctly in Parquet format. For saveAsTable, the table created is *not* a Hive table, but a Spark SQL data source table ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources). We are only using Hive's metastore to store the metadata (to

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yin Huai
: Hi Yin, Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The number of tasks launched is equal to the number of parquet files. Do you have any idea on how to deal with this situation? Thanks a lot On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote: Seems

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Yin Huai
Hi Roberto, For now, if the timestamp is a top-level column (not a field in a struct), you can use backticks to quote the column name, like `timestamp`. Thanks, Yin On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio roberto.coluc...@gmail.com wrote: Hey Cheng, thank you so much for your
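
A small sketch of that quoting (the table and the other column are hypothetical):

    // Backticks let the parser treat the reserved word as a plain column name.
    val recent = sqlContext.sql("SELECT `timestamp`, value FROM events WHERE `timestamp` IS NOT NULL")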

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
it mean 'resolved attribute'? On Mar 17, 2015 8:14 PM, Yin Huai yh...@databricks.com wrote: The number is an id we used internally to identify an resolved Attribute. Looks like basic_null_diluted_d was not resolved since there is no id associated with it. On Tue, Mar 17, 2015 at 2:08 PM

Re: Date and decimal datatype not working

2015-03-17 Thread Yin Huai
p(0) is a String, so you need to explicitly convert it to a Long, e.g. p(0).trim.toLong. You also need to do it for p(2). For the BigDecimal values, you need to create BigDecimal objects from your String values. On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA ab9...@att.com wrote: Hi All,

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
. I will check it soon and update. Can you elaborate what does it mean the # and the number? Is that a reference to the field in the rdd? Thank you, Ophir On Mar 17, 2015 7:06 PM, Yin Huai yh...@databricks.com wrote: Seems basic_null_diluted_d was not resolved? Can you check

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
Seems basic_null_diluted_d was not resolved? Can you check if basic_null_diluted_d is in your table? On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen oph...@gmail.com wrote: Hi Guys, I'm registering a function using: sqlc.registerFunction(makeEstEntry,ReutersDataFunctions.makeEstEntry _) Then I

Re: Loading in json with spark sql

2015-03-13 Thread Yin Huai
Seems you want to use array for the field of providers, like providers:[{id: ...}, {id:...}] instead of providers:{{id: ...}, {id:...}} On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 kpe...@gmail.com wrote: Hi All, I was noodling around with loading in a json file into spark sql's hive context and
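
A sketch of the suggested layout, with one JSON object per line as jsonFile expects (field names and path are hypothetical):

    // {"name": "acme", "providers": [{"id": "p1"}, {"id": "p2"}]}
    val df = sqlContext.jsonFile("/data/records.json")
    df.printSchema()  // providers is inferred as an array of structs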

Re: Spark SQL. Cast to Bigint

2015-03-13 Thread Yin Huai
Are you using SQLContext? Right now, the parser in the SQLContext is quite limited on the data type keywords that it handles (see here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala#L391) and unfortunately bigint is not handled

Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Yin Huai
Hi Nitay, Can you try using backticks to quote the column name? Like org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")? Thanks, Yin On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: Thanks for reporting. This was a result of a change

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon, No, you do not need to install Hive to use Spark SQL. Thanks, Yin On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote: Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement:

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1 (the versions that we support). Seems https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently. On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote: The temp table in metastore can not be

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
@Shahab, based on https://issues.apache.org/jira/browse/HIVE-5472, current_date was added in Hive *1.2.0 (not 0.12.0)*. In my previous email, I meant current_date is in neither Hive 0.12.0 nor Hive 0.13.1 (Spark SQL currently supports these two Hive versions). On Tue, Mar 3, 2015 at 8:55 AM,

Re: Issues reading in Json file with spark sql

2015-03-02 Thread Yin Huai
Is the string of the above JSON object in the same line? jsonFile requires that every line is a JSON object or an array of JSON objects. On Mon, Mar 2, 2015 at 11:28 AM, kpeng1 kpe...@gmail.com wrote: Hi All, I am currently having issues reading in a json file using spark sql's api. Here is

Re: [SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been fixed in 1.2.1 and 1.3. On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Can someone confirm if they can run UDFs in group by in spark1.2? I have two builds running -- one from a custom

Re: SQLContext.applySchema strictness

2015-02-13 Thread Yin Huai
Hi Justin, It is expected. We do not check if the provided schema matches rows since all rows need to be scanned to give a correct answer. Thanks, Yin On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony justin.pih...@gmail.com wrote: Per the documentation: It is important to make sure that

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai
Regarding backticks: Right. You need backticks to quote the column name timestamp because timestamp is a reserved keyword in our parser. On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: actually i tried in spark shell , got same error and then for some reason i

Re: Can we execute create table and load data commands against Hive inside HiveContext?

2015-02-10 Thread Yin Huai
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory was introduced in Hive 0.14 and Spark SQL only supports Hive 0.12 and 0.13.1. Can you change the setting of hive.security.authorization.manager to something accepted by 0.12 or 0.13.1? On Thu, Feb 5, 2015

Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Yin Huai
Can you try using backticks to quote the field name? Like `f:price`. On Tue, Feb 10, 2015 at 5:47 AM, presence2001 neil.andra...@thefilter.com wrote: Hi list, I have some data with a field name of f:price (it's actually part of a JSON structure loaded from ElasticSearch via
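
A minimal sketch of that quoting (the table name is hypothetical):

    // Backticks keep the colon from being interpreted by the SQL parser.
    val prices = sqlContext.sql("SELECT `f:price` FROM docs")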

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Yin Huai
Hello Michael, In Spark SQL, we have our internal concepts of Output Partitioning (representing the partitioning scheme of an operator's output) and Required Child Distribution (representing the requirement of input data distribution of an operator) for a physical operator. Let's say we have two

Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Yin Huai
Yeah, it's a bug. It has been fixed by https://issues.apache.org/jira/browse/SPARK-3891 in master. On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at the source code for AbstractGenericUDAFResolver, the following (non-deprecated) method should be called: public

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Yin Huai
Hello Jianshi, You meant you want to convert a Map to a Struct, right? We can extract some useful functions from JsonRDD.scala, so others can access them. Thanks, Yin On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I checked the source code for inferSchema. Looks

Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Yin Huai
Hello Jonathan, There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of containsNull for array values. So, for a Hive table, containsNull will always be true for an array, and we should ignore this field for Hive. This issue has been

Re: can't get smallint field from hive on spark

2014-11-26 Thread Yin Huai
For hive on spark, did you mean the thrift server of Spark SQL or https://issues.apache.org/jira/browse/HIVE-7292? If you meant the latter one, I think Hive's mailing list will be a good place to ask (see https://hive.apache.org/mailing_lists.html). Thanks, Yin On Wed, Nov 26, 2014 at 10:49 PM,

Re: Spark SQL Join returns less rows that expected

2014-11-25 Thread Yin Huai
I guess you want to use split("\\|") instead of split("|"). On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian lian.cs@gmail.com wrote: Which version are you using? Or if you are using the most recent master or branch-1.2, which commit are you using? On 11/25/14 4:08 PM, david wrote: Hi, I
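
For reference, a quick illustration of the difference:

    // String.split takes a regular expression; an unescaped "|" matches the empty
    // string and splits between every character, so the pipe must be escaped.
    val fields = "a|b|c".split("\\|")   // Array("a", "b", "c")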

Re: How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Yin Huai
Hello Jianshi, The reason for that error is that we do not have a Spark SQL data type for Scala's BigInt. You can use Decimal in your case. Thanks, Yin On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an error during rdd.registerTempTable(...) saying
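
A minimal sketch of the suggested change (the case class fields are hypothetical):

    // scala.math.BigDecimal maps to Spark SQL's DecimalType, while BigInt has no mapping.
    case class Record(id: Long, amount: BigDecimal)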

Re: Converting a json struct to map

2014-11-19 Thread Yin Huai
Oh, actually, we do not support MapType provided by the schema given to jsonRDD at the moment (my bad..). Daniel, you need to wait for the patch of 4476 (I should have one soon). Thanks, Yin On Wed, Nov 19, 2014 at 2:32 PM, Daniel Haviv danielru...@gmail.com wrote: Thank you Michael I will

Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian, Right now, MapType is not supported in the StructType provided to jsonRDD/jsonFile. We will add the support. I have created https://issues.apache.org/jira/browse/SPARK-4302 to track this issue. Thanks, Yin On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote: I'm

Re: [SQL] PERCENTILE is not working

2014-11-05 Thread Yin Huai
Hello Kevin, https://issues.apache.org/jira/browse/SPARK-3891 will fix this bug. Thanks, Yin On Wed, Nov 5, 2014 at 8:06 PM, Cheng, Hao hao.ch...@intel.com wrote: Which version are you using? I can reproduce that in the latest code, but with different exception. I've filed an bug

Re: spark sql create nested schema

2014-11-04 Thread Yin Huai
Hello Tridib, For your case, you can use StructType(StructField("ParentInfo", parentInfo, true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the StructType representing the schema (parentInfo and childInfo are two existing StructTypes). You can take a look at our docs (
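
A minimal sketch of such a nested schema, using the org.apache.spark.sql.types package layout of Spark 1.3+ (in 1.1/1.2 these classes live in org.apache.spark.sql); the field names are hypothetical:

    import org.apache.spark.sql.types._

    // parentInfo and childInfo are themselves StructTypes.
    val parentInfo = StructType(StructField("name", StringType, true) :: Nil)
    val childInfo  = StructType(StructField("age", IntegerType, true) :: Nil)

    val schema = StructType(
      StructField("ParentInfo", parentInfo, true) ::
      StructField("ChildInfo", childInfo, true) :: Nil)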
