Spark SQL/DDF's for production

2015-07-21 Thread bipin
Hi, I want to raise an issue I have faced while using Spark. I load dataframes
from parquet files. Some of these dataframes' parquet files have lots of
partitions and around 10 million rows.

Running a where id = x query on such a dataframe scans all partitions. When
saving to an RDD object file or parquet there is a partition column, and the
mentioned where query on that partition column should zero in on and open only
the candidate partitions. Sometimes I also need to create an index on other
columns to speed things up. Without indexes I feel it is not production ready.

I see two parts to making this work:
Ability of Spark SQL to create/use indexes - mentioned in the documentation as
still to be implemented.
Parquet index support - arriving in v2.0, while the current version is v1.8.

When can we hope to get index support that Spark SQL/Catalyst can use? Is
anyone using Spark SQL in production? How did you handle this?
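
On the partition-column point above, here is a minimal sketch of the
directory-partitioning route that already works without indexes, assuming
Spark 1.4's DataFrameWriter; df, the id column and the output path are
placeholders:

// Write the data partitioned by the filter column; every distinct id value
// becomes its own id=<value>/ subdirectory under the output path.
df.write.partitionBy("id").parquet("/data/events")

// Reading it back, an equality filter on the partition column only opens the
// matching subdirectory instead of scanning every file.
val events = sqlContext.read.parquet("/data/events")
events.filter(events("id") === 42).count()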



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDF-s-for-production-tp23926.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
Hi, I need to join with multiple conditions. Can anyone tell me how to specify
that? For example, this is what I am trying to do:

val Lead_all = Leads.
  join(Utm_Master,
    Leaddetails.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign")
      ==
    Utm_Master.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
    "left")

When I do this I get the error: too many arguments for method apply.
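
For comparison, a minimal sketch of one way to combine several join conditions
into a single expression, assuming the Spark 1.3+ DataFrame API and the
table/column names from the snippet above:

// Each condition is an ordinary Column expression; && combines them, and the
// combined expression is passed as the single join condition.
val Lead_all = Leads.join(Utm_Master,
  Leads("LeadSource")   === Utm_Master("LeadSource")   &&
  Leads("Utm_Source")   === Utm_Master("Utm_Source")   &&
  Leads("Utm_Medium")   === Utm_Master("Utm_Medium")   &&
  Leads("Utm_Campaign") === Utm_Master("Utm_Campaign"),
  "left_outer")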

Thanks
Bipin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Join-Conditions-in-dataframe-join-tp23606.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: BigDecimal problem in parquet file

2015-06-18 Thread Bipin Nag
I increased the memory limits; now it works fine.

Thanks for the help.




On 18 June 2015 at 04:01, Cheng Lian lian.cs@gmail.com wrote:

  Does increasing executor memory fix the memory problem?

 How many columns does the schema contain? Parquet can be super memory
 consuming when writing wide tables.

 Cheng


 On 6/15/15 5:48 AM, Bipin Nag wrote:

 Hi Davies,

 I have tried the recent 1.4 and a 1.5-snapshot to 1) open the parquet file and
 save it again, or 2) apply the schema to the RDD and save the dataframe as
 parquet, but now I get this error (right at the beginning):

 java.lang.OutOfMemoryError: Java heap space
 at
 parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
 at
 parquet.bytes.CapacityByteArrayOutputStream.init(CapacityByteArrayOutputStream.java:57)
 at
 parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.init(ColumnChunkPageWriteStore.java:68)
 at
 parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.init(ColumnChunkPageWriteStore.java:48)
 at
 parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
 at
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
 at
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 at
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
 at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
 at
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
 at
 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at
 org.apache.spark.sql.parquet.ParquetOutputWriter.init(newParquet.scala:111)
 at
 org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:244)
 at
 org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:386)
 at
 org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:298)
 at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
 $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:142)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
 at parquet.io.api.Binary.fromByteArray(Binary.java:159)
 at
 parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.init(PlainValuesDictionary.java:94)
 at
 parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.init(PlainValuesDictionary.java:67)
 at parquet.column.Encoding$4.initDictionary(Encoding.java:131)
 at
 parquet.column.impl.ColumnReaderImpl.init(ColumnReaderImpl.java:325)
 at
 parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
 at
 parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
 at
 parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:267)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
 at
 parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
 at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
 at
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
 at
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
 at
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
 at
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
 $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:152)

  I am not sure if this is related to your patch or some

Re: BigDecimal problem in parquet file

2015-06-15 Thread Bipin Nag
Hi Davies,

I have tried the recent 1.4 and a 1.5-snapshot to 1) open the parquet file and
save it again, or 2) apply the schema to the RDD and save the dataframe as
parquet, but now I get this error (right at the beginning):

java.lang.OutOfMemoryError: Java heap space
at
parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
at
parquet.bytes.CapacityByteArrayOutputStream.init(CapacityByteArrayOutputStream.java:57)
at
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.init(ColumnChunkPageWriteStore.java:68)
at
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.init(ColumnChunkPageWriteStore.java:48)
at
parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
at
parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
at
parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
at
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:178)
at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
at
parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
at
parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at
org.apache.spark.sql.parquet.ParquetOutputWriter.init(newParquet.scala:111)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:244)
at
org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:386)
at
org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:298)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:142)
at
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
at
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at parquet.io.api.Binary.fromByteArray(Binary.java:159)
at
parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.init(PlainValuesDictionary.java:94)
at
parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.init(PlainValuesDictionary.java:67)
at parquet.column.Encoding$4.initDictionary(Encoding.java:131)
at
parquet.column.impl.ColumnReaderImpl.init(ColumnReaderImpl.java:325)
at
parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
at
parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
at
parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:267)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
at
parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
at
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:152)

I am not sure if this is related to your patch or some other bug. The original
error doesn't show up in the newer versions, so this is the problem to fix now.

Thanks

On 13 June 2015 at 06:31, Davies Liu dav...@databricks.com wrote:

 Maybe it's related to a bug, which is fixed by
 https://github.com/apache/spark/pull/6558 recently.

 On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag bipin@gmail.com wrote:
  Hi Cheng,
 
  Yes, some rows contain unit instead of decimal values. I believe some
 rows
  from original source I had

Re: BigDecimal problem in parquet file

2015-06-12 Thread Bipin Nag
Hi Cheng,

Yes, some rows contain unit instead of decimal values. I believe some rows
from the original source don't have any value, i.e. they are null, and that
shows up as unit. How do Spark SQL and Parquet handle null in place of decimal
values, assuming that field is nullable? I will have to convert it properly.
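
One possible cleanup, as a sketch only (the BoxedUnit-to-null mapping is an
assumption; the paths and the schemamap helper are the ones from the code
quoted further down): replace the boxed unit values with null while building
the Rows, so a nullable DecimalType field stays null.

import org.apache.spark.sql.Row

val myrdd = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name).map { arr =>
  Row(arr.map {
    case _: scala.runtime.BoxedUnit => null   // value was missing in the source table
    case other                      => other
  }: _*)
}
val actualdata = sqlContext.createDataFrame(myrdd, schemamap(name))
actualdata.saveAsParquetFile("/home/bipin/stageddata/" + name)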

Thanks for helping out.
Bipin

On 12 June 2015 at 14:57, Cheng Lian lian.cs@gmail.com wrote:

  On 6/10/15 8:53 PM, Bipin Nag wrote:

   Hi Cheng,

 I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an
 existing parquet file, then repartitioning and saving it. Doing this gives the
 error. The code for this doesn't look like it is causing the problem. I have a
 feeling the source - the existing parquet file - is the culprit.

 I created that parquet file using a JdbcRDD (pulled from Microsoft SQL Server).
 First I saved the JdbcRDD as an object file on disk, then loaded it again, made
 a dataframe from it using a schema and saved it as parquet.

 Following is the code:
 For saving the JdbcRDD:
  name - fully qualified table name
  pk - string for primary key
  lastpk - last id to pull
 val myRDD = new JdbcRDD( sc, () =>
   DriverManager.getConnection(url, username, password),
   "SELECT * FROM " + name + " WITH (NOLOCK) WHERE ? <= " + pk + " and " + pk + " <= ?",
   1, lastpk, 1, JdbcRDD.resultSetToObjectArray)
 myRDD.saveAsObjectFile("rawdata/" + name);

 For applying the schema and saving the parquet:
 val myschema = schemamap(name)
 val myrdd =
   sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name).map(x =>
     org.apache.spark.sql.Row(x: _*))

  Have you tried to print out x here to check its contents? My guess is that x
 actually contains unit values. For example, the following Spark shell code can
 reproduce a similar exception:

 import org.apache.spark.sql.types._
 import org.apache.spark.sql.Row

 val schema = StructType(StructField("dec", DecimalType(10, 0)) :: Nil)
 val rdd = sc.parallelize(1 to 10).map(_ => Array(())).map(arr => Row(arr: _*))
 val df = sqlContext.createDataFrame(rdd, schema)

 df.saveAsParquetFile("file:///tmp/foo")

val actualdata = sqlContext.createDataFrame(myrdd, myschema)
actualdata.saveAsParquetFile("/home/bipin/stageddata/" + name)

  The schema StructType can be built manually, though I pull the table's
 metadata and generate one. It is a simple string translation (see the SQL
 Server docs
 https://msdn.microsoft.com/en-us/library/ms378878%28v=sql.110%29.aspx
 and/or the Spark data types
 https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#data-types
 )

  That is how I created the parquet file. Any help to solve the issue is
 appreciated.
  Thanks
  Bipin


 On 9 June 2015 at 20:44, Cheng Lian lian.cs@gmail.com wrote:

 Would you please provide a snippet that reproduces this issue? What
 version of Spark were you using?

 Cheng

 On 6/9/15 8:18 PM, bipin wrote:

 Hi,
 When I try to save my data frame as a parquet file I get the following
 error:

 java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
 org.apache.spark.sql.types.Decimal
 at

 org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
 at

 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
 at

 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
 at

 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
 at

 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
 at
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
 at
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
 at
 org.apache.spark.sql.parquet.ParquetRelation2.org
 $apache$spark$sql$parquet$ParquetRelation2$writeShard$1(newParquet.scala:671)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$anonfun$insert$2.apply(newParquet.scala:689)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$anonfun$insert$2.apply(newParquet.scala:689)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 How to fix this problem ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/BigDecimal-problem-in-parquet-file-tp23221.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: BigDecimal problem in parquet file

2015-06-10 Thread Bipin Nag
Hi Cheng,

I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an
existing parquet file, then repartitioning and saving it. Doing this gives the
error. The code for this doesn't look like it is causing the problem. I have a
feeling the source - the existing parquet file - is the culprit.

I created that parquet file using a JdbcRDD (pulled from Microsoft SQL Server).
First I saved the JdbcRDD as an object file on disk, then loaded it again, made
a dataframe from it using a schema and saved it as parquet.

Following is the code:
For saving the JdbcRDD:
 name - fully qualified table name
 pk - string for primary key
 lastpk - last id to pull
val myRDD = new JdbcRDD( sc, () =>
  DriverManager.getConnection(url, username, password),
  "SELECT * FROM " + name + " WITH (NOLOCK) WHERE ? <= " + pk + " and " + pk + " <= ?",
  1, lastpk, 1, JdbcRDD.resultSetToObjectArray)
myRDD.saveAsObjectFile("rawdata/" + name);

For applying the schema and saving the parquet:
val myschema = schemamap(name)
val myrdd =
  sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name).map(x =>
    org.apache.spark.sql.Row(x: _*))
val actualdata = sqlContext.createDataFrame(myrdd, myschema)
actualdata.saveAsParquetFile("/home/bipin/stageddata/" + name)

The schema StructType can be built manually, though I pull the table's metadata
and generate one. It is a simple string translation (see the SQL Server docs
https://msdn.microsoft.com/en-us/library/ms378878%28v=sql.110%29.aspx
and/or the Spark data types
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#data-types)
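
For concreteness, a sketch of the kind of string translation meant here; the
helper names and the handful of SQL Server types covered are illustrative only:

import org.apache.spark.sql.types._

// Map a SQL Server type name onto a Spark SQL DataType (subset; extend as needed).
def sqlServerTypeToSpark(typeName: String): DataType = typeName.toLowerCase match {
  case "bigint"                        => LongType
  case "int" | "smallint" | "tinyint"  => IntegerType
  case "bit"                           => BooleanType
  case "float" | "real"                => DoubleType
  case "decimal" | "numeric" | "money" => DecimalType(19, 4)  // precision/scale illustrative
  case "datetime" | "smalldatetime"    => TimestampType
  case "date"                          => DateType
  case _                               => StringType          // varchar, nvarchar, text, ...
}

// Build a StructType from (columnName, typeName, isNullable) metadata tuples.
def schemaFromMetadata(meta: Seq[(String, String, Boolean)]): StructType =
  StructType(meta.map { case (col, tpe, nullable) =>
    StructField(col, sqlServerTypeToSpark(tpe), nullable)
  })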

That is how I created the parquet file. Any help to solve the issue is
appreciated.
Thanks
Bipin


On 9 June 2015 at 20:44, Cheng Lian lian.cs@gmail.com wrote:

 Would you please provide a snippet that reproduces this issue? What version
 of Spark were you using?

 Cheng

 On 6/9/15 8:18 PM, bipin wrote:

 Hi,
 When I try to save my data frame as a parquet file I get the following
 error:

 java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
 org.apache.spark.sql.types.Decimal
 at

 org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
 at

 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
 at

 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
 at

 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
 at

 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
 at
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
 at
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
 at
 org.apache.spark.sql.parquet.ParquetRelation2.org
 $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 How to fix this problem ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/BigDecimal-problem-in-parquet-file-tp23221.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Error in using saveAsParquetFile

2015-06-09 Thread Bipin Nag
Cheng, you were right. It works when I remove the field from either one. I
should have checked the types beforehand. What confused me is that Spark
attempted the join and threw the error midway. It isn't quite there yet.
Thanks for the help.
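
For readers hitting the same exception, a sketch of the workaround, assuming
the DataFrames from the code quoted below; dropping the conflicting column from
either side works, and since Spark 1.3 has no DataFrame.drop, a select does the
job:

// Keep every Customerdetails column except the conflicting PolicyType, so only
// one PolicyType column (here the Bookings one) survives into the joined output.
val cdNoPolicy = CD.select(CD.columns.filter(_ != "PolicyType").map(CD.col): _*)
val r2 = cdNoPolicy.join(r1, cdNoPolicy.col("CustomerID") === r1.col("ID2"), "left_outer")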

On Mon, Jun 8, 2015 at 8:29 PM Cheng Lian lian.cs@gmail.com wrote:

  I suspect that Bookings and Customerdetails both have a PolicyType field,
 one is string and the other is an int.


 Cheng


 On 6/8/15 9:15 PM, Bipin Nag wrote:

  Hi Jeetendra, Cheng

  I am using following code for joining

 val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
 val Customerdetails =
   sqlContext.load("/home/administrator/stageddata/Customerdetails")

 val CD = Customerdetails.
   where($"CreatedOn" > "2015-04-01 00:00:00.0").
   where($"CreatedOn" < "2015-05-01 00:00:00.0")

 // Bookings by CD
 val r1 = Bookings.
   withColumnRenamed("ID", "ID2")
 val r2 = CD.
   join(r1, CD.col("CustomerID") === r1.col("ID2"), "left")

 r2.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL");

  @Cheng I am not appending the joined table to an existing parquet file;
 it is a new file.
  @Jeetendra I have a rather large parquet file and it also contains some
 confidential data. Can you tell me what you need to check in it?

  Thanks


 On 8 June 2015 at 16:47, Jeetendra Gangele gangele...@gmail.com wrote:

 When are you loading these Parquet files?
 Can you please share the code where you are passing the Parquet file to
 Spark?

 On 8 June 2015 at 16:39, Cheng Lian lian.cs@gmail.com wrote:

 Are you appending the joined DataFrame whose PolicyType is string to an
 existing Parquet file whose PolicyType is int? The exception indicates that
 Parquet found a column with conflicting data types.

 Cheng


 On 6/8/15 5:29 PM, bipin wrote:

 Hi I get this error message when saving a table:

 parquet.io.ParquetDecodingException: The requested schema is not
 compatible
 with the file schema. incompatible types: optional binary PolicyType
 (UTF8)
 != optional int32 PolicyType
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
 at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
 at parquet.schema.MessageType.accept(MessageType.java:55)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
 at

 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
 at

 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at

 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at

 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at
 org.apache.spark.sql.parquet.ParquetRelation2.org
 $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 I joined two tables both loaded from parquet file, the joined table when
 saved throws this error. I could not find anything about this error.
 Could
 this be a bug ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




   --
  Hi,

  Find my attached resume. I have total around 7 years of work experience.
 I worked for Amazon and Expedia in my

BigDecimal problem in parquet file

2015-06-09 Thread bipin
Hi,
When I try to save my data frame as a parquet file I get the following
error:

java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
org.apache.spark.sql.types.Decimal
at
org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
at
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
at
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
at
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
at
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

How do I fix this problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/BigDecimal-problem-in-parquet-file-tp23221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error in using saveAsParquetFile

2015-06-08 Thread bipin
Hi I get this error message when saving a table:

parquet.io.ParquetDecodingException: The requested schema is not compatible
with the file schema. incompatible types: optional binary PolicyType (UTF8)
!= optional int32 PolicyType
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
at parquet.schema.MessageType.accept(MessageType.java:55)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
at
parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
at 
parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I joined two tables, both loaded from parquet files; the joined table throws
this error when saved. I could not find anything about this error. Could this
be a bug?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
Hi Jeetendra, Cheng

I am using the following code for the join:

val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
val Customerdetails =
  sqlContext.load("/home/administrator/stageddata/Customerdetails")

val CD = Customerdetails.
  where($"CreatedOn" > "2015-04-01 00:00:00.0").
  where($"CreatedOn" < "2015-05-01 00:00:00.0")

// Bookings by CD
val r1 = Bookings.
  withColumnRenamed("ID", "ID2")
val r2 = CD.
  join(r1, CD.col("CustomerID") === r1.col("ID2"), "left")

r2.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL");

@Cheng I am not appending the joined table to an existing parquet file; it
is a new file.
@Jeetendra I have a rather large parquet file and it also contains some
confidential data. Can you tell me what you need to check in it?

Thanks


On 8 June 2015 at 16:47, Jeetendra Gangele gangele...@gmail.com wrote:

 When are you loading these Parquet files?
 Can you please share the code where you are passing the Parquet file to Spark?

 On 8 June 2015 at 16:39, Cheng Lian lian.cs@gmail.com wrote:

 Are you appending the joined DataFrame whose PolicyType is string to an
 existing Parquet file whose PolicyType is int? The exception indicates that
 Parquet found a column with conflicting data types.

 Cheng


 On 6/8/15 5:29 PM, bipin wrote:

 Hi I get this error message when saving a table:

 parquet.io.ParquetDecodingException: The requested schema is not
 compatible
 with the file schema. incompatible types: optional binary PolicyType
 (UTF8)
 != optional int32 PolicyType
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
 at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
 at parquet.schema.MessageType.accept(MessageType.java:55)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
 at

 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
 at

 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at

 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at

 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at
 org.apache.spark.sql.parquet.ParquetRelation2.org
 $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 I joined two tables both loaded from parquet file, the joined table when
 saved throws this error. I could not find anything about this error.
 Could
 this be a bug ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Hi,

 Find my attached resume. I have total around 7 years of work experience.
 I worked for Amazon and Expedia in my previous assignments and currently I
 am working with start- up technology company called Insideview in hyderabad.

 Regards
 Jeetendra



Create dataframe from saved objectfile RDD

2015-06-01 Thread bipin
Hi, what is the method to create a dataframe from an RDD that was saved as an
object file? I don't have a Java object, but I do have a StructType that I want
to use as the schema for the dataframe. How can I load the object file without
the object class?

I tried retrieving it as Row:
val myrdd =
  sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/rawdata/" + name)

But I get:
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
org.apache.spark.sql.Row

How can I work around this? Is there a better way?
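
One workaround, as a sketch, assuming the object files hold the Array[Object]
rows written by JdbcRDD.resultSetToObjectArray and that myschema is the
StructType already at hand:

import org.apache.spark.sql.Row

// Load the raw object arrays, wrap each one in a generic Row,
// then attach the StructType when building the dataframe.
val raw  = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name)
val rows = raw.map(arr => Row(arr: _*))
val df   = sqlContext.createDataFrame(rows, myschema)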




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Create-dataframe-from-saved-objectfile-RDD-tp23095.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to group multiple row data ?

2015-05-01 Thread Bipin Nag
OK, consider the case where there are multiple event triggers for a given
customer/vendor/product, like 1,1,2,2,3, arranged in the order of *event
occurrence* (time stamp). The output should then be two groups, (1,2) and
(1,2,3): the doublet from the first occurrences of 1 and 2, and the triplet
from the later occurrences of 1, 2 and 3.

On 29 April 2015 at 18:04, Manoj Awasthi awasthi.ma...@gmail.com wrote:

 Sorry but I didn't fully understand the grouping. This line:

  The group must only take the closest previous trigger. The first one
 hence shows alone.

 Can you please explain further?


 On Wed, Apr 29, 2015 at 4:42 PM, bipin bipin@gmail.com wrote:

 Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event,
 CreatedOn), the first 3 are Long ints and event can only be 1,2,3 and
 CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet
 out
 of them such that I can infer that Customer registered event from 1to 2
 and
 if present to 3 timewise and preserving the number of entries. For e.g.

 Before processing:
 10001, 132, 2002, 1, 2012-11-23
 10001, 132, 2002, 1, 2012-11-24
 10031, 102, 223, 2, 2012-11-24
 10001, 132, 2002, 2, 2012-11-25
 10001, 132, 2002, 3, 2012-11-26
 (total 5 rows)

 After processing:
 10001, 132, 2002, 2012-11-23, 1
 10031, 102, 223, 2012-11-24, 2
 10001, 132, 2002, 2012-11-24, 1,2,3
 (total 5 in last field - comma separated!)

 The group must only take the closest previous trigger. The first one hence
 shows alone. Can this be done using spark sql ? If it needs to processed
 in
 functionally in scala, how to do this. I can't wrap my head around this.
 Can
 anyone help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-group-multiple-row-data-tp22701.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





How to group multiple row data ?

2015-04-29 Thread bipin
Hi, I have a dataframe with schema (CustomerID, SupplierID, ProductID, Event,
CreatedOn); the first three are Long ints, Event can only be 1, 2 or 3, and
CreatedOn is a timestamp. How can I make triplet/doublet/singlet groups out of
them such that I can infer that a customer progressed from event 1 to 2 and, if
present, to 3 over time, while preserving the number of entries? For example:

Before processing:
10001, 132, 2002, 1, 2012-11-23
10001, 132, 2002, 1, 2012-11-24
10031, 102, 223, 2, 2012-11-24
10001, 132, 2002, 2, 2012-11-25
10001, 132, 2002, 3, 2012-11-26
(total 5 rows)

After processing:
10001, 132, 2002, 2012-11-23, 1 
10031, 102, 223, 2012-11-24, 2
10001, 132, 2002, 2012-11-24, 1,2,3
(total 5 in last field - comma separated!)

A group must only take the closest previous trigger, which is why the first
row shows up alone. Can this be done using Spark SQL? If it needs to be
processed functionally in Scala, how would I do that? I can't wrap my head
around this. Can anyone help?
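
A sketch of one way to do it functionally, outside Spark SQL; the case class,
the names and the RDD wiring are assumptions. Per (CustomerID, SupplierID,
ProductID) the events are time-ordered and folded into groups: an event joins
the most recent group only when its number is strictly greater than that
group's last event, otherwise it opens a new group.

import java.sql.Timestamp
import org.apache.spark.rdd.RDD

case class Trigger(customerId: Long, supplierId: Long, productId: Long,
                   event: Int, createdOn: Timestamp)

def groupTriggers(triggers: RDD[Trigger]): RDD[((Long, Long, Long), (Timestamp, List[Int]))] =
  triggers
    .groupBy(t => (t.customerId, t.supplierId, t.productId))
    .flatMapValues { ts =>
      val ordered = ts.toSeq.sortBy(_.createdOn.getTime)
      // Fold the time-ordered events into groups, keeping the most recent group at the head.
      ordered.foldLeft(List.empty[(Timestamp, List[Int])]) { (groups, t) =>
        groups match {
          case (start, events) :: rest if events.last < t.event =>
            (start, events :+ t.event) :: rest        // extend the closest previous trigger
          case _ =>
            (t.createdOn, List(t.event)) :: groups    // start a new group
        }
      }.reverse
    }

On the sample data above this yields (10001,132,2002) -> (2012-11-23, List(1))
and (2012-11-24, List(1, 2, 3)), plus (10031,102,223) -> (2012-11-24, List(2)).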



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-group-multiple-row-data-tp22701.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I was running the Spark shell and spark-sql with the --jars option containing
the paths when I got my error. I am not sure what the correct way to add jars
is. I tried placing the jar inside the directory you mentioned but still get
the error. I will give the code you posted a try. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22514.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I am running the queries from spark-sql. I don't think it can communicate
with the Thrift server. Can you tell me how I should run the queries to make
this work?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22516.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
Looks like a good option. BTW, v3.0 is around the corner:
http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Sqoop parquet file not working in spark

2015-04-13 Thread bipin
Hi, I imported a table from MS SQL Server with Sqoop 1.4.5 in Parquet format.
But when I try to load it from the Spark shell, it throws an error like:

scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
during a parallel computation: java.lang.NullPointerException
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
.
.
.
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at
scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at
scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at 
scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I looked at the Sqoop parquet folder and its structure is different from the
one that I created in Spark. How can I make the parquet file work?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sqoop-parquet-file-not-working-in-spark-tp22477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Bipin Nag
Thanks for the information. Hopefully this will happen in the near future. For
now my best bet would be to export the data and import it into Spark SQL.

On 7 April 2015 at 11:28, Denny Lee denny.g@gmail.com wrote:

 At this time, the JDBC data source is not extensible, so it cannot support
 SQL Server. There were some thoughts - credit to Cheng Lian for this - about
 making the JDBC data source extensible for third-party support, possibly via
 Slick.


 On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com wrote:

 Hi, I am trying to pull data from ms-sql server. I have tried using the
 spark.sql.jdbc

 CREATE TEMPORARY TABLE c
 USING org.apache.spark.sql.jdbc
 OPTIONS (
   url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
   dbtable "Customer"
 );

 But it shows java.sql.SQLException: No suitable driver found for
 jdbc:sqlserver

 I have jdbc drivers for mssql but i am not sure how to use them I provide
 the jars to the sql shell and then tried the following:

 CREATE TEMPORARY TABLE c
 USING com.microsoft.sqlserver.jdbc.SQLServerDriver
 OPTIONS (
   url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
   dbtable "Customer"
 );

 But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
 class com.microsoft.sqlserver.jdbc.SQLServerDriver)

 Can anyone tell what is the proper way to connect to ms-sql server.
 Thanks






 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-
 sql-tp22399.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Microsoft SQL jdbc support from spark sql

2015-04-06 Thread bipin
Hi, I am trying to pull data from an MS SQL server. I have tried using
spark.sql.jdbc:

CREATE TEMPORARY TABLE c
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
  dbtable "Customer"
);

But it shows java.sql.SQLException: No suitable driver found for
jdbc:sqlserver

I have the JDBC drivers for MSSQL but I am not sure how to use them. I provided
the jars to the SQL shell and then tried the following:

CREATE TEMPORARY TABLE c
USING com.microsoft.sqlserver.jdbc.SQLServerDriver
OPTIONS (
  url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
  dbtable "Customer"
);

But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
class com.microsoft.sqlserver.jdbc.SQLServerDriver)

Can anyone tell me the proper way to connect to an MS SQL server?
Thanks
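
For what it is worth, a sketch of the same thing from the Scala shell instead
of the SQL CLI; the connection details are the hypothetical ones from above,
USER/PASS are placeholders, and sqljdbc4.jar still has to be on the driver and
executor classpath (e.g. spark-shell --jars plus --driver-class-path):

// Register the SQL Server driver so java.sql.DriverManager can resolve the URL,
// then go through the generic JDBC data source.
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

val customer = sqlContext.load("org.apache.spark.sql.jdbc", Map(
  "url"     -> "jdbc:sqlserver://10.1.0.12:1433;databaseName=dbname;user=USER;password=PASS",
  "dbtable" -> "Customer"))

customer.printSchema()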






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org