Re: Spark SQL joins taking too long

2016-01-27 Thread Raghu Ganti
Why would changing the order of the join make such a big difference?

I will try the repartition, although it does not make sense to me why
repartitioning should help, since the data itself is so small!

Regards,
Raghu

> On Jan 27, 2016, at 20:08, Cheng, Hao  wrote:
> 
> Another possibility is the parallelism: it is probably 1 or some other
> small value, since the input data size is not that big.
> 
> If that is the case, you can try something like:
>  
> Df1.repartition(10).registerTempTable("hospitals");
> Df2.repartition(10).registerTempTable("counties");
> …
> And then do the join.
>  
>  
> From: Raghu Ganti [mailto:raghuki...@gmail.com] 
> Sent: Thursday, January 28, 2016 3:06 AM
> To: Ted Yu; Дмитро Попович
> Cc: user
> Subject: Re: Spark SQL joins taking too long
>  
> The problem is with the way the Spark query plan is being created, IMO. What was 
> happening before is that the order of the tables mattered: when the larger 
> table was given first, the query took a very long time (~53 mins to complete). I 
> changed the order of the tables to put the smaller one first (and also 
> replaced the one-row table with the full one) and modified the query to look 
> like this:
> 
> SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess' 
> AND ST_Intersects(c.shape, h.location)
> 
> With the above query, things worked like a charm (<1min to finish the entire 
> execution and join on 3141 polygons with 6.5k points).
> 
> Do let me know if you need more info in order to pinpoint the issue.
> 
> Regards,
> Raghu
>  
> On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu  wrote:
> What's the type of shape column ?
>  
> Can you disclose what SomeUDF does (by showing the code) ?
>  
> Cheers
>  
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:
> Hi,
> 
> I create two tables, one counties with just one row (it actually has 2k
> rows, but I used only one) and another hospitals, which has 6k rows. The
> join command I use is as follows, which takes way too long to run and has
> never finished successfully (even after nearly 10mins). The following is
> what I have:
> 
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); //has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
> counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); //this takes too long!
> 
> //whereas the following which is the exact equivalent of the above gets done
> very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
> SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); //gives me the correct answer of 8
> 
> Any suggestions on what I can do to optimize and debug this piece of code?
> 
> Regards,
> Raghu
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 
>  
>  
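
The repartition suggestion quoted above amounts to forcing more parallelism
before the join runs. A minimal Scala sketch of that idea, assuming the
SQLContext (sqlCtx) and the two DataFrames (df1 for hospitals, df2 for
counties) from the original post; the partition count of 10 is only an
illustrative value:

// With tiny inputs each DataFrame often ends up with a single partition,
// so the join runs with almost no parallelism; check before tuning.
println(df1.rdd.partitions.length)
println(df2.rdd.partitions.length)

// Repartition before registering the temp tables, as suggested above.
df1.repartition(10).registerTempTable("hospitals")
df2.repartition(10).registerTempTable("counties")

// The number of post-shuffle partitions Spark SQL uses (default 200) can
// also be lowered for small data.
sqlCtx.setConf("spark.sql.shuffle.partitions", "10")

val joined = sqlCtx.sql(
  "SELECT h.name, c.name FROM hospitals h JOIN counties c " +
  "ON SomeUDF(c.shape, h.location)")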


Re: Spark SQL joins taking too long

2016-01-27 Thread Raghu Ganti
The problem is with the way the Spark query plan is being created, IMO. What
was happening before is that the order of the tables mattered: when the
larger table was given first, the query took a very long time (~53 mins to
complete). I changed the order of the tables to put the smaller one first
(and also replaced the one-row table with the full one) and modified the
query to look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME =
'Dutchess' AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1min to finish the
entire execution and join on 3141 polygons with 6.5k points).

Do let me know if you need more info in order to pinpoint the issue.

Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu  wrote:

> What's the type of shape column ?
>
> Can you disclose what SomeUDF does (by showing the code) ?
>
> Cheers
>
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:
>
>> Hi,
>>
>> I create two tables, one counties with just one row (it actually has 2k
>> rows, but I used only one) and another hospitals, which has 6k rows. The
>> join command I use is as follows, which takes way too long to run and has
>> never finished successfully (even after nearly 10mins). The following is
>> what I have:
>>
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); //has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h
>> JOIN
>> counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); //this takes too long!
>>
>> //whereas the following which is the exact equivalent of the above gets
>> done
>> very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
>> SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); //gives me the correct answer of 8
>>
>> Any suggestions on what I can do to optimize and debug this piece of code?
>>
>> Regards,
>> Raghu
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>
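
When reordering the tables changes the runtime this drastically, the physical
plan is the first thing to check. A small sketch, assuming the same sqlCtx and
registered temp tables as above; the exact plan text depends on the Spark
version and the data:

val fast = sqlCtx.sql(
  "SELECT c.NAME, h.name FROM counties c, hospitals h " +
  "WHERE c.NAME = 'Dutchess' AND ST_Intersects(c.shape, h.location)")

// Prints the parsed, analyzed, optimized and physical plans. A join whose
// only condition is a UDF has no equi-join keys, so the physical plan will
// typically show a cartesian or nested-loop style join, which is very
// sensitive to which side is the large one.
fast.explain(true)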


Re: Spark SQL joins taking too long

2016-01-26 Thread Raghu Ganti
Yes, SomeUDF is Contains; shape is a UDT that maps a custom geometry type to
the SQL binary type.

The custom geometry type is a Java class. Please let me know if you need
further info.

Regards
Raghu

> On Jan 26, 2016, at 17:13, Ted Yu  wrote:
> 
> What's the type of shape column ?
> 
> Can you disclose what SomeUDF does (by showing the code) ?
> 
> Cheers
> 
>> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:
>> Hi,
>> 
>> I create two tables, one counties with just one row (it actually has 2k
>> rows, but I used only one) and another hospitals, which has 6k rows. The
>> join command I use is as follows, which takes way too long to run and has
>> never finished successfully (even after nearly 10mins). The following is
>> what I have:
>> 
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); //has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
>> counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); //this takes too long!
>> 
>> //whereas the following which is the exact equivalent of the above gets done
>> very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
>> SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); //gives me the correct answer of 8
>> 
>> Any suggestions on what I can do to optimize and debug this piece of code?
>> 
>> Regards,
>> Raghu
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
> 
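
The shape column described above (a UserDefinedType that stores a custom
geometry class as SQL binary) roughly follows the pattern below in the Spark
1.5/1.6 API. This is only a sketch: the class name JavaPoint comes from the
other threads in this digest, and the byte-level encoding is left
unimplemented:

import org.apache.spark.sql.types._

class PointUDT extends UserDefinedType[JavaPoint] {
  // The underlying storage type: the geometry is kept as binary.
  override def sqlType: DataType = BinaryType

  override def serialize(obj: Any): Any = obj match {
    case p: JavaPoint => ??? // encode the geometry as an Array[Byte]
  }

  override def deserialize(datum: Any): JavaPoint = datum match {
    case bytes: Array[Byte] => ??? // decode the bytes back to a JavaPoint
  }

  override def userClass: Class[JavaPoint] = classOf[JavaPoint]
}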


Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Ah, OK! I am new to Scala and will take a look at Scala case classes. It
would be awesome if you could provide some pointers.

Thanks,
Raghu

On Wed, Jan 20, 2016 at 12:25 PM, Andy Grove 
wrote:

> I'm talking about implementing CustomerRecord as a scala case class,
> rather than as a Java class. Scala case classes implement the scala.Product
> trait, which Catalyst is looking for.
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 10:21 AM, Raghu Ganti 
> wrote:
>
>> Is it not internal to the Catalyst implementation? I should not be
>> modifying the Spark source to get things to work, do I? :-)
>>
>> On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti 
>> wrote:
>>
>>> Case classes where?
>>>
>>> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
>>> wrote:
>>>
>>>> Honestly, moving to Scala and using case classes is the path of least
>>>> resistance in the long term.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Andy.
>>>>
>>>> --
>>>>
>>>> Andy Grove
>>>> Chief Architect
>>>> AgilData - Simple Streaming SQL that Scales
>>>> www.agildata.com
>>>>
>>>>
>>>> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
>>>> wrote:
>>>>
>>>>> Thanks for your reply, Andy.
>>>>>
>>>>> Yes, that is what I concluded based on the Stack trace. The problem is
>>>>> stemming from Java implementation of generics, but I thought this will go
>>>>> away if you compiled against Java 1.8, which solves the issues of proper
>>>>> generic implementation.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Also, are you saying that in order for my example to work, I would
>>>>> need to move to Scala and have the UDT implemented in Scala?
>>>>>
>>>>>
>>>>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
>>>>> wrote:
>>>>>
>>>>>> Catalyst is expecting a class that implements scala.Row or
>>>>>> scala.Product and is instead finding a Java class. I've run into this 
>>>>>> issue
>>>>>> a number of times. Dataframe doesn't work so well with Java. Here's a 
>>>>>> blog
>>>>>> post with more information on this:
>>>>>>
>>>>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Andy.
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Andy Grove
>>>>>> Chief Architect
>>>>>> AgilData - Simple Streaming SQL that Scales
>>>>>> www.agildata.com
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I created a custom UserDefinedType in Java as follows:
>>>>>>>
>>>>>>> SQLPoint = new UserDefinedType() {
>>>>>>> //overriding serialize, deserialize, sqlType, userClass functions
>>>>>>> here
>>>>>>> }
>>>>>>>
>>>>>>> When creating a dataframe, I am following the manual mapping, I have
>>>>>>> a
>>>>>>> constructor for JavaPoint - JavaPoint(double x, double y) and a
>>>>>>> Customer
>>>>>>> record as follows:
>>>>>>>
>>>>>>> public class CustomerRecord {
>>>>>>> private int id;
>>>>>>> private String name;
>>>>>>> private Object location;
>>>>>>>
>>>>>>> //setters and getters follow here
>>>>>>> }
>>>>>>>
>>>>>>> Following the example in Spark source, when I create a RDD as
>>>>>>> follows:
>>>>>>>
>>>>>>> sc.textFile(inputFileName).map(new Function<String, CustomerRecord>() {
>>>>>>> //call method
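
Andy's suggestion above is to replace the CustomerRecord Java bean with a
Scala case class, since case classes extend scala.Product, which is what
Catalyst pattern-matches on. A minimal sketch, assuming sc, sqlContext and
inputFileName as in the original post; the parseLine helper and the
comma-separated layout are hypothetical:

// The location field assumes JavaPoint carries an @SQLUserDefinedType
// annotation pointing at its UDT; otherwise use a plain supported type.
case class CustomerRecord(id: Int, name: String, location: JavaPoint)

// Hypothetical parser for one input line.
def parseLine(line: String): CustomerRecord = {
  val f = line.split(",")
  CustomerRecord(f(0).toInt, f(1), new JavaPoint(f(2).toDouble, f(3).toDouble))
}

import sqlContext.implicits._
val df = sc.textFile(inputFileName).map(parseLine).toDF()
df.registerTempTable("customers")
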

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Is it not internal to the Catalyst implementation? I should not have to
modify the Spark source to get things to work, should I? :-)

On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti  wrote:

> Case classes where?
>
> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
> wrote:
>
>> Honestly, moving to Scala and using case classes is the path of least
>> resistance in the long term.
>>
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
>> wrote:
>>
>>> Thanks for your reply, Andy.
>>>
>>> Yes, that is what I concluded based on the Stack trace. The problem is
>>> stemming from Java implementation of generics, but I thought this will go
>>> away if you compiled against Java 1.8, which solves the issues of proper
>>> generic implementation.
>>>
>>> Any ideas?
>>>
>>> Also, are you saying that in order for my example to work, I would need
>>> to move to Scala and have the UDT implemented in Scala?
>>>
>>>
>>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
>>> wrote:
>>>
>>>> Catalyst is expecting a class that implements scala.Row or
>>>> scala.Product and is instead finding a Java class. I've run into this issue
>>>> a number of times. Dataframe doesn't work so well with Java. Here's a blog
>>>> post with more information on this:
>>>>
>>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Andy.
>>>>
>>>> --
>>>>
>>>> Andy Grove
>>>> Chief Architect
>>>> AgilData - Simple Streaming SQL that Scales
>>>> www.agildata.com
>>>>
>>>>
>>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I created a custom UserDefinedType in Java as follows:
>>>>>
>>>>> SQLPoint = new UserDefinedType() {
>>>>> //overriding serialize, deserialize, sqlType, userClass functions here
>>>>> }
>>>>>
>>>>> When creating a dataframe, I am following the manual mapping, I have a
>>>>> constructor for JavaPoint - JavaPoint(double x, double y) and a
>>>>> Customer
>>>>> record as follows:
>>>>>
>>>>> public class CustomerRecord {
>>>>> private int id;
>>>>> private String name;
>>>>> private Object location;
>>>>>
>>>>> //setters and getters follow here
>>>>> }
>>>>>
>>>>> Following the example in Spark source, when I create a RDD as follows:
>>>>>
>>>>> sc.textFile(inputFileName).map(new Function() {
>>>>> //call method
>>>>> CustomerRecord rec = new CustomerRecord();
>>>>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>>>>> });
>>>>>
>>>>> This results in a MatchError. The stack trace is as follows:
>>>>>
>>>>> scala.MatchError: [B@45aa3dd5 (of class [B)
>>>>> at
>>>>>
>>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>>>>> at
>>>>>
>>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>>>>> at
>>>>>
>>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>>>>> at
>>>>>
>>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>>>>> at
>>>>>
>>>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>>>> at
>>>>>
>>>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>>>> at
>>>>>

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Case classes where?

On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
wrote:

> Honestly, moving to Scala and using case classes is the path of least
> resistance in the long term.
>
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
> wrote:
>
>> Thanks for your reply, Andy.
>>
>> Yes, that is what I concluded based on the Stack trace. The problem is
>> stemming from Java implementation of generics, but I thought this will go
>> away if you compiled against Java 1.8, which solves the issues of proper
>> generic implementation.
>>
>> Any ideas?
>>
>> Also, are you saying that in order for my example to work, I would need
>> to move to Scala and have the UDT implemented in Scala?
>>
>>
>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
>> wrote:
>>
>>> Catalyst is expecting a class that implements scala.Row or scala.Product
>>> and is instead finding a Java class. I've run into this issue a number of
>>> times. Dataframe doesn't work so well with Java. Here's a blog post with
>>> more information on this:
>>>
>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> --
>>>
>>> Andy Grove
>>> Chief Architect
>>> AgilData - Simple Streaming SQL that Scales
>>> www.agildata.com
>>>
>>>
>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I created a custom UserDefinedType in Java as follows:
>>>>
>>>> SQLPoint = new UserDefinedType() {
>>>> //overriding serialize, deserialize, sqlType, userClass functions here
>>>> }
>>>>
>>>> When creating a dataframe, I am following the manual mapping, I have a
>>>> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
>>>> record as follows:
>>>>
>>>> public class CustomerRecord {
>>>> private int id;
>>>> private String name;
>>>> private Object location;
>>>>
>>>> //setters and getters follow here
>>>> }
>>>>
>>>> Following the example in Spark source, when I create a RDD as follows:
>>>>
>>>> sc.textFile(inputFileName).map(new Function() {
>>>> //call method
>>>> CustomerRecord rec = new CustomerRecord();
>>>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>>>> });
>>>>
>>>> This results in a MatchError. The stack trace is as follows:
>>>>
>>>> scala.MatchError: [B@45aa3dd5 (of class [B)
>>>> at
>>>>
>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>>>> at
>>>>
>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>>>> at
>>>>
>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>>>> at
>>>>
>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>>>> at
>>>>
>>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>>> at
>>>>
>>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>>> at
>>>>
>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>> at
>>>>
>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>> at
>>>>
>>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>>> at
>>>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>>> at
>>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>> at
>>>> scala.col

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Thanks for your reply, Andy.

Yes, that is what I concluded based on the stack trace. The problem stems
from the Java implementation of generics, but I thought this would go away
if you compiled against Java 1.8, which solves the issues of proper
generics implementation.

Any ideas?

Also, are you saying that in order for my example to work, I would need to
move to Scala and have the UDT implemented in Scala?


On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
wrote:

> Catalyst is expecting a class that implements scala.Row or scala.Product
> and is instead finding a Java class. I've run into this issue a number of
> times. Dataframe doesn't work so well with Java. Here's a blog post with
> more information on this:
>
> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran  wrote:
>
>> Hi,
>>
>> I created a custom UserDefinedType in Java as follows:
>>
>> SQLPoint = new UserDefinedType<JavaPoint>() {
>> //overriding serialize, deserialize, sqlType, userClass functions here
>> }
>>
>> When creating a dataframe, I am following the manual mapping, I have a
>> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
>> record as follows:
>>
>> public class CustomerRecord {
>> private int id;
>> private String name;
>> private Object location;
>>
>> //setters and getters follow here
>> }
>>
>> Following the example in Spark source, when I create a RDD as follows:
>>
>> sc.textFile(inputFileName).map(new Function<String, CustomerRecord>() {
>> //call method
>> CustomerRecord rec = new CustomerRecord();
>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>> });
>>
>> This results in a MatchError. The stack trace is as follows:
>>
>> scala.MatchError: [B@45aa3dd5 (of class [B)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> at scala.collection.TraversableOnce$class.to
>> (TraversableOnce.scala:273)
>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>> at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>> at
>>
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>> at
>>
>> org.apache.spa

Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Great, I got that to work following your example! Thanks.

A follow-up question: if I had a custom SQL type (UserDefinedType), how can
I map the data from the RDD to this type in the DataFrame?

Regards

On Mon, Jan 18, 2016 at 1:35 PM, Ted Yu  wrote:

> By SparkSQLContext, I assume you mean SQLContext.
> From the doc for SQLContext#createDataFrame():
>
>*  dataFrame.registerTempTable("people")
>*  sqlContext.sql("select name from people").collect.foreach(println)
>
> If you want to persist table externally, you need Hive, etc
>
> Regards
>
> On Mon, Jan 18, 2016 at 10:28 AM, Raghu Ganti 
> wrote:
>
>> This requires Hive to be installed and uses HiveContext, right?
>>
>> What is the SparkSQLContext useful for?
>>
>> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>>
>>> Please take a look
>>> at 
>>> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>>
>>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran 
>>> wrote:
>>>
>>>> Is creating a table using the SparkSQLContext currently supported?
>>>>
>>>> Regards,
>>>> Raghu
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>
>>
>
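
On the follow-up question about getting a custom UserDefinedType from an RDD
into a DataFrame: one way in the 1.5/1.6 API is to build Row objects that
carry the user-space objects and pass an explicit schema whose column uses
the UDT. A sketch, assuming a PointUDT/JavaPoint pair like the one discussed
elsewhere in this digest, plus sc, sqlContext and inputFileName from the
earlier posts:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// The location column is typed with the custom UDT; Spark calls the UDT's
// serialize() to convert each JavaPoint to the underlying sqlType.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("location", new PointUDT, nullable = true)))

val rowRDD = sc.textFile(inputFileName).map { line =>
  val f = line.split(",")
  Row(f(0).toInt, f(1), new JavaPoint(f(2).toDouble, f(3).toDouble))
}

val df = sqlContext.createDataFrame(rowRDD, schema)
df.registerTempTable("customers")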


Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
This requires Hive to be installed and uses HiveContext, right?

What is the SparkSQLContext useful for?

On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:

> Please take a look
> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>
> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>
>> Is creating a table using the SparkSQLContext currently supported?
>>
>> Regards,
>> Raghu
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>
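
To make the temp-table vs. Hive distinction concrete: a short sketch,
assuming an existing DataFrame df with a name column and the usual
sc/sqlContext; the table names are only placeholders:

// A temp table just registers the DataFrame under a name for SQL queries
// against this SQLContext; nothing is persisted and no Hive is required.
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").collect().foreach(println)

// Persisting a table definition across applications needs a HiveContext
// backed by a metastore.
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.sql("CREATE TABLE IF NOT EXISTS people_persisted (name STRING)")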


Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Btw, thanks a lot for all your quick responses - they are very useful and I
definitely appreciate them :-)

On Mon, Jan 18, 2016 at 1:28 PM, Raghu Ganti  wrote:

> This requires Hive to be installed and uses HiveContext, right?
>
> What is the SparkSQLContext useful for?
>
> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>
>> Please take a look
>> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>
>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>>
>>> Is creating a table using the SparkSQLContext currently supported?
>>>
>>> Regards,
>>> Raghu
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>


Re: SQL UDF problem (with re to types)

2016-01-14 Thread Raghu Ganti
Would this go away if the Spark source was compiled against Java 1.8 (since
the problem of type erasure is solved through proper generics
implementation in Java 1.8).

On Thu, Jan 14, 2016 at 1:42 PM, Michael Armbrust 
wrote:

> We automatically convert types for UDFs defined in Scala, but we can't do
> it in Java because the types are erased by the compiler.  If you want to
> use double you should cast before calling the UDF.
>
> On Wed, Jan 13, 2016 at 8:10 PM, Raghu Ganti  wrote:
>
>> So, when I try BigDecimal, it works. But, should it not parse based on
>> what the UDF defines? Am I missing something here?
>>
>> On Wed, Jan 13, 2016 at 4:57 PM, Ted Yu  wrote:
>>
>>> Please take a look
>>> at 
>>> sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleSum.java
>>> which shows a UserDefinedAggregateFunction that works on DoubleType column.
>>>
>>> sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java
>>> shows how it is registered.
>>>
>>> Cheers
>>>
>>> On Wed, Jan 13, 2016 at 11:58 AM, raghukiran 
>>> wrote:
>>>
>>>> While registering and using SQL UDFs, I am running into the following
>>>> problem:
>>>>
>>>> UDF registered:
>>>>
>>>> ctx.udf().register("Test", new UDF1() {
>>>> /**
>>>>  *
>>>>  */
>>>> private static final long serialVersionUID =
>>>> -8231917155671435931L;
>>>>
>>>> public String call(Double x) throws Exception {
>>>> return "testing";
>>>> }
>>>> }, DataTypes.StringType);
>>>>
>>>> Usage:
>>>> query = "SELECT Test(82.4)";
>>>> result = sqlCtx.sql(query).first();
>>>> System.out.println(result.toString());
>>>>
>>>> Problem: Class Cast exception thrown
>>>> Caused by: java.lang.ClassCastException: java.math.BigDecimal cannot be
>>>> cast
>>>> to java.lang.Double
>>>>
>>>> This problem occurs with Spark v1.5.2 and 1.6.0.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-UDF-problem-with-re-to-types-tp25968.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>
>>
>
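
Two concrete ways out of the BigDecimal vs. Double mismatch described above,
following the advice in this thread; a short sketch assuming the same sqlCtx
and the already-registered Java UDF named Test:

// Option 1: cast the literal (or column) to DOUBLE in SQL so it matches
// the UDF1<Double, String> signature of the Java UDF.
val r1 = sqlCtx.sql("SELECT Test(CAST(82.4 AS DOUBLE))").first()

// Option 2: register the UDF in Scala, where Spark converts the input
// types automatically.
sqlCtx.udf.register("TestScala", (x: Double) => "testing")
val r2 = sqlCtx.sql("SELECT TestScala(82.4)").first()
println(s"$r1 $r2")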


Re: SQL UDF problem (with re to types)

2016-01-13 Thread Raghu Ganti
So, when I try BigDecimal, it works. But should it not parse based on what
the UDF defines? Am I missing something here?

On Wed, Jan 13, 2016 at 4:57 PM, Ted Yu  wrote:

> Please take a look
> at sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleSum.java
> which shows a UserDefinedAggregateFunction that works on DoubleType column.
>
> sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java
> shows how it is registered.
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:58 AM, raghukiran  wrote:
>
>> While registering and using SQL UDFs, I am running into the following
>> problem:
>>
>> UDF registered:
>>
>> ctx.udf().register("Test", new UDF1<Double, String>() {
>> /**
>>  *
>>  */
>> private static final long serialVersionUID =
>> -8231917155671435931L;
>>
>> public String call(Double x) throws Exception {
>> return "testing";
>> }
>> }, DataTypes.StringType);
>>
>> Usage:
>> query = "SELECT Test(82.4)";
>> result = sqlCtx.sql(query).first();
>> System.out.println(result.toString());
>>
>> Problem: Class Cast exception thrown
>> Caused by: java.lang.ClassCastException: java.math.BigDecimal cannot be
>> cast
>> to java.lang.Double
>>
>> This problem occurs with Spark v1.5.2 and 1.6.0.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-UDF-problem-with-re-to-types-tp25968.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>