Spark SQL/DDFs for production

2015-07-21 Thread bipin
Hi, I want to ask about an issue I have faced while using Spark. I load dataframes from parquet files. Some dataframes' parquet files have lots of partitions, 10 million rows. Running a where id = x query on such a dataframe scans all partitions. When saving to an rdd object/parquet there is a partition column. The
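
The usual answer at the time, sketched below under two assumptions (Spark 1.4+ with the DataFrameWriter API, and a filter column literally named id, since the post is truncated): write the parquet partitioned by the filter column, so an equality filter prunes directories instead of scanning every partition.

    // Sketch, assuming Spark 1.4+ and an existing sqlContext; the paths
    // and the "id" column are illustrative, not from the original post.
    val df = sqlContext.read.parquet("/data/events")

    // Persist the data partitioned by the filter column: each distinct id
    // becomes its own directory (id=1/, id=2/, ...).
    df.write.partitionBy("id").parquet("/data/events_by_id")

    // Reading back, an equality filter on the partition column can be
    // pruned to the matching directory instead of scanning everything.
    val pruned = sqlContext.read.parquet("/data/events_by_id").filter("id = 42")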

Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
,Utm_Campaign), left). When I do this I get the error: too many arguments for method apply. Thanks, Bipin
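
For reference, the DataFrame API of that era takes several join conditions as one combined Column rather than a tuple of columns, which is what triggers the apply error. A minimal sketch; Utm_Campaign is from the snippet, the frames and the other column are invented:

    // Sketch: combine the predicates with && into a single Column instead
    // of passing multiple columns to one apply call. df1/df2 are stand-ins.
    val joined = df1.join(df2,
      df1("CustomerID") === df2("CustomerID") &&
        df1("Utm_Campaign") === df2("Utm_Campaign"),
      "left_outer")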

Re: BigDecimal problem in parquet file

2015-06-18 Thread Bipin Nag
writing wide tables. Cheng On 6/15/15 5:48 AM, Bipin Nag wrote: Hi Davies, I have tried recent 1.4 and 1.5-snapshot to 1) open the parquet and save it again, or 2) apply the schema to the rdd and save the dataframe as parquet, but now I get this error (right in the beginning

Re: BigDecimal problem in parquet file

2015-06-15 Thread Bipin Nag
recently. On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag bipin@gmail.com wrote: Hi Cheng, Yes, some rows contain Unit instead of decimal values. I believe some rows from the original source I had don't have any value, i.e. they are null, and that shows up as Unit. How does the spark-sql or parquet
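
A hedged sketch of one way to scrub those rows before the schema is applied, reusing the objectFile pattern that appears later in this thread; the BoxedUnit match is the key bit, the rest is assumed:

    // Sketch: replace Unit placeholders with real nulls before building Rows,
    // so the parquet writer never sees scala.runtime.BoxedUnit.
    import org.apache.spark.sql.Row

    val raw = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name)
    val cleaned = raw.map { arr =>
      Row(arr.map {
        case _: scala.runtime.BoxedUnit => null  // missing value in the source
        case v => v
      }: _*)
    }
    val df = sqlContext.createDataFrame(cleaned, myschema)  // myschema as in the thread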

Re: BigDecimal problem in parquet file

2015-06-12 Thread Bipin Nag
to change it properly. Thanks for helping out. Bipin On 12 June 2015 at 14:57, Cheng Lian lian.cs@gmail.com wrote: On 6/10/15 8:53 PM, Bipin Nag wrote: Hi Cheng, I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an existing parquet file, then repartitioning

Re: BigDecimal problem in parquet file

2015-06-10 Thread Bipin Nag
/" + name); For applying the schema and saving the parquet:

    val myschema = schemamap(name)
    val myrdd = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name)
      .map(x => org.apache.spark.sql.Row(x: _*))
    val actualdata = sqlContext.createDataFrame(myrdd, myschema)
    actualdata.saveAsParquetFile

Re: Error in using saveAsParquetFile

2015-06-09 Thread Bipin Nag
@gmail.com wrote: I suspect that Bookings and Customerdetails both have a PolicyType field, one is a string and the other is an int. Cheng On 6/8/15 9:15 PM, Bipin Nag wrote: Hi Jeetendra, Cheng, I am using the following code for joining: val Bookings = sqlContext.load("/home/administrator
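
Given that diagnosis, one hedged fix (a sketch, not the thread's confirmed resolution) is to make the two PolicyType columns agree before joining or appending; whether to cast int-to-string or the reverse depends on the actual data:

    // Sketch, Spark 1.3-era API: add a casted copy of the conflicting column
    // and use it for the join/save. Frame and column names are from the
    // thread; the direction of the cast is a guess.
    import org.apache.spark.sql.types.StringType

    val bookingsFixed = Bookings.withColumn(
      "PolicyTypeStr", Bookings("PolicyType").cast(StringType))

Then join on and save PolicyTypeStr instead of the original column, so both sides carry the same parquet type.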

BigDecimal problem in parquet file

2015-06-09 Thread bipin
Hi, When I try to save my data frame as a parquet file I get the following error: java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.types.Decimal at org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)

Error in using saveAsParquetFile

2015-06-08 Thread bipin
Hi, I get this error message when saving a table: parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary PolicyType (UTF8) != optional int32 PolicyType at

Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
found a column with conflicting data types. Cheng On 6/8/15 5:29 PM, bipin wrote: Hi, I get this error message when saving a table: parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary PolicyType (UTF8

Create dataframe from saved objectfile RDD

2015-06-01 Thread bipin
/bipin/rawdata/" + name) But I get java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to org.apache.spark.sql.Row. How do I work around this? Is there a better way?
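
The usual workaround, sketched with an invented two-column schema: objectFile hands back Array[Object], not Row, so wrap each array before calling createDataFrame:

    // Sketch: map each Array[Object] into a Row, then attach an explicit
    // schema. The schema below is hypothetical; use the table's real one.
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val raw = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name)
    val rows = raw.map(arr => Row(arr: _*))

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("value", StringType, nullable = true)))

    val df = sqlContext.createDataFrame(rows, schema)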

Re: How to group multiple row data ?

2015-05-01 Thread Bipin Nag
:42 PM, bipin bipin@gmail.com wrote: Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event, CreatedOn), the first 3 are Long ints, Event can only be 1, 2 or 3, and CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet out of them such that I can infer

How to group multiple row data ?

2015-04-29 Thread bipin
Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event, CreatedOn), the first 3 are Long ints, Event can only be 1, 2 or 3, and CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet out of them such that I can infer that a Customer registered an event going from 1 to 2, and if
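
A sketch of one way to do it with the RDD API of that era (collect_list for DataFrames came later); the column positions, the Int event type, and the sort-then-read-off approach are all assumptions:

    // Sketch: key by (CustomerID, SupplierID, ProductID), order each group's
    // events by CreatedOn, and emit the event sequence. A value like
    // Seq(1, 2) then means the customer moved from event 1 to event 2.
    import java.sql.Timestamp

    val grouped = ddf.rdd
      .map(r => ((r.getLong(0), r.getLong(1), r.getLong(2)),
                 (r.getInt(3), r.getAs[Timestamp](4))))
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._2.getTime).map(_._1))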

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I was running the spark shell and sql with the --jars option containing the paths when I got my error. I am not sure what the correct way to add jars is. I tried placing the jar inside the directory you said but still get the error. I will give the code you posted a try. Thanks.

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I am running the queries from spark-sql. I don't think it can communicate with the thrift server. Can you tell me how I should run the queries to make it work?

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
Looks like a good option. BTW, v3.0 is around the corner. http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html Thanks

Sqoop parquet file not working in spark

2015-04-13 Thread bipin
Hi, I imported a table from MS SQL Server with Sqoop 1.4.5 in parquet format. But when I try to load it from the Spark shell, it throws an error like: scala> val df1 = sqlContext.load("/home/bipin/Customer2") scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel
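
The truncated error doesn't say, but a common culprit with Sqoop-written datasets is the Kite-style .metadata directory sitting next to the part files. A hedged workaround is to point Spark at the parquet files directly; whether the glob resolves depends on the version, and the path is from the post:

    // Sketch: read the data files themselves, skipping any sidecar
    // metadata directory Sqoop may have written into /home/bipin/Customer2.
    val df1 = sqlContext.parquetFile("/home/bipin/Customer2/*.parquet")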

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Bipin Nag
. There were some thoughts - credit to Cheng Lian for this - about making the JDBC data source extensible for third-party support, possibly via Slick. On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com wrote: Hi, I am trying to pull data from a MS SQL server. I have tried using

Microsoft SQL jdbc support from spark sql

2015-04-06 Thread bipin
Hi, I am trying to pull data from a MS SQL server. I have tried using spark.sql.jdbc:

    CREATE TEMPORARY TABLE c
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
      dbtable "Customer"
    );

But it shows: java.sql.SQLException: No suitable driver found
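
For what it's worth, a Scala-side sketch of the same load on the 1.3-era API; a missing driver jar on the driver classpath is the usual cause of "No suitable driver", so the classpath note below matters as much as the driver option (the credentials are invented):

    // Sketch, Spark 1.3-era API. Assumes Microsoft's sqljdbc4.jar is actually
    // on the driver classpath (e.g. spark-shell --driver-class-path
    // /path/to/sqljdbc4.jar); --jars alone was often not enough for JDBC then.
    val customer = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:sqlserver://10.1.0.12:1433;databaseName=dbname;user=u;password=p",
      "dbtable" -> "Customer",
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"))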