Spark SQL/DDF's for production

2015-07-21 Thread bipin
Hi, I want to ask about an issue I have faced while using Spark. I load
dataframes from parquet files. Some of these parquet files have lots of
partitions and more than 10 million rows.

Running "where id = x" query on dataframe scans all partitions. When saving
to rdd object/parquet there is a partition column. The mentioned "where"
query on the partition column should zero in and only open possible
partitions. Sometimes I need to create index on other columns too to speed
things up. Without index I feel its not production ready.
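
To illustrate the partition-pruning part, here is a rough sketch of what I
mean (untested; it assumes the Spark 1.4 DataFrameWriter API, and the path,
column name and dataframe `df` are made up for the example):

// write the data laid out by the filter column, one directory per value
df.write.partitionBy("id").parquet("/data/events")

// reading back, a filter on the partition column should only open the matching directories
val events = sqlContext.read.parquet("/data/events")
events.where(events("id") === 42).count()

(In practice the partition column would be something lower-cardinality than a
raw id, otherwise the directory count explodes.)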

I see there are two parts to this:
- The ability of Spark SQL to create/use indexes - mentioned in the
documentation as yet to be implemented
- Parquet index support - arriving in Parquet v2.0; currently it is at v1.8

When can we hope to get index support that Spark SQL/Catalyst can use? Is
anyone using Spark SQL in production? How did you handle this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDF-s-for-production-tp23926.html



Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I have a hunch I want to share: I feel that data is not being deallocated
from memory (at least not the way it was in 1.3); once it goes in-memory it
just stays there.

Spark SQL itself works fine: the same query, when run in a fresh shell, won't
throw that error, but when run in a shell that has already been used for
other queries, it does.

Also, I read on the Spark blog that Project Tungsten is making changes to
memory management, and that the first changes would land in 1.4. Maybe it is
related to that.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23608.html



Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I will second this. I very rarely used to get out-of-memory errors in 1.3;
now I get these errors all the time. I could work in a 1.3 spark-shell for
long periods of time without Spark throwing that error, whereas in 1.4 the
shell needs to be restarted or gets killed frequently.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23607.html



Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
Hi, I need to join with multiple conditions. Can anyone tell me how to
specify that? For example, this is what I am trying to do:

val Lead_all = Leads.
 | join(Utm_Master,
Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign")
==
Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

When I do this I get the error: too many arguments for method apply.
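
A sketch of what I think the column-expression form should look like
(untested; it assumes Leaddetails and Utm_Master are the two dataframes being
joined):

// combine the per-column equalities into one join expression with &&
val Lead_all = Leaddetails.join(
  Utm_Master,
  Leaddetails("LeadSource")   === Utm_Master("LeadSource")   &&
  Leaddetails("Utm_Source")   === Utm_Master("Utm_Source")   &&
  Leaddetails("Utm_Medium")   === Utm_Master("Utm_Medium")   &&
  Leaddetails("Utm_Campaign") === Utm_Master("Utm_Campaign"),
  "left")

Is this the right way to express it?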

Thanks
Bipin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Join-Conditions-in-dataframe-join-tp23606.html



Re: BigDecimal problem in parquet file

2015-06-18 Thread Bipin Nag
I increased the memory limits; now it works fine.

Thanks for the help.




On 18 June 2015 at 04:01, Cheng Lian  wrote:

>  Does increasing executor memory fix the memory problem?
>
> How many columns does the schema contain? Parquet can be super memory
> consuming when writing wide tables.
>
> Cheng
>
>
> On 6/15/15 5:48 AM, Bipin Nag wrote:
>
>  Hi Davies,
>
>  I have tried recent 1.4 and 1.5-snapshot builds to 1) open the parquet
> file and save it again, or 2) apply the schema to the RDD and save the
> dataframe as parquet, but now I get this error (right at the beginning):
>
> java.lang.OutOfMemoryError: Java heap space
> at
> parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
> at
> parquet.bytes.CapacityByteArrayOutputStream.(CapacityByteArrayOutputStream.java:57)
> at
> parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:68)
> at
> parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:48)
> at
> parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
> at
> parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
> at
> parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
> at
> parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:178)
> at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
> at
> parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
> at
> parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
> at
> parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at
> org.apache.spark.sql.parquet.ParquetOutputWriter.(newParquet.scala:111)
> at
> org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:244)
> at
> org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:386)
> at
> org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:298)
> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
> $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:142)
> at
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
> at
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at parquet.io.api.Binary.fromByteArray(Binary.java:159)
> at
> parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.(PlainValuesDictionary.java:94)
> at
> parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.(PlainValuesDictionary.java:67)
> at parquet.column.Encoding$4.initDictionary(Encoding.java:131)
> at
> parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:325)
> at
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
> at
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
> at
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:267)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
> at
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
> at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
> at
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
> at
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
> at
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
> at
> org.apache.spark.InterruptibleIt

Re: BigDecimal problem in parquet file

2015-06-15 Thread Bipin Nag
Hi Davies,

I have tried recent 1.4 and 1.5-snapshot builds to 1) open the parquet file
and save it again, or 2) apply the schema to the RDD and save the dataframe
as parquet, but now I get this error (right at the beginning):

java.lang.OutOfMemoryError: Java heap space
at
parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
at
parquet.bytes.CapacityByteArrayOutputStream.(CapacityByteArrayOutputStream.java:57)
at
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:68)
at
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:48)
at
parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
at
parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
at
parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
at
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:178)
at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
at
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at
org.apache.spark.sql.parquet.ParquetOutputWriter.(newParquet.scala:111)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:244)
at
org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:386)
at
org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:298)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:142)
at
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
at
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at parquet.io.api.Binary.fromByteArray(Binary.java:159)
at
parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.(PlainValuesDictionary.java:94)
at
parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.(PlainValuesDictionary.java:67)
at parquet.column.Encoding$4.initDictionary(Encoding.java:131)
at
parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:325)
at
parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
at
parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
at
parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:267)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
at
parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
at
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:152)

I am not sure whether this is related to your patch or to some other bug. The
earlier error doesn't show up in newer versions, so this is the problem to
fix now.

Thanks

On 13 June 2015 at 06:31, Davies Liu  wrote:

> Maybe it's related to a bug, which is fixed by
> https://github.com/apache/spark/pull/6558 recently.
>
> On Fri, Jun 12, 2015 at 5:38 AM, Bipin Nag  wrote:
> > Hi Cheng,
> >
> > Yes, some rows contain unit instead of decimal values. I believe some
> rows
> > from original source I had don't have any val

Re: BigDecimal problem in parquet file

2015-06-12 Thread Bipin Nag
Hi Cheng,

Yes, some rows contain unit instead of decimal values. I believe some rows
from the original source don't have any value, i.e. it is null, and that
shows up as unit. How do Spark SQL and Parquet handle null in place of
decimal values, assuming that field is nullable? I will have to convert these
values properly.
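
For reference, a rough sketch of what I plan to try (untested; `raw` stands
for the RDD[Array[Object]] loaded from the object file and `myschema` for the
StructType with the nullable decimal field):

import org.apache.spark.sql.Row

// map the BoxedUnit placeholders to SQL NULL before building Rows
val rows = raw.map { arr =>
  Row(arr.map {
    case _: scala.runtime.BoxedUnit => null   // missing value -> NULL
    case v                          => v
  }: _*)
}
val df = sqlContext.createDataFrame(rows, myschema)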

Thanks for helping out.
Bipin

On 12 June 2015 at 14:57, Cheng Lian  wrote:

>  On 6/10/15 8:53 PM, Bipin Nag wrote:
>
>   Hi Cheng,
>
>  I am using Spark 1.3.1 binary available for Hadoop 2.6. I am loading an
> existing parquet file, then repartitioning and saving it. Doing this gives
> the error. The code for this doesn't look like causing  problem. I have a
> feeling the source - the existing parquet is the culprit.
>
> I created that parquet using a jdbcrdd (pulled from microsoft sql server).
> First I saved jdbcrdd as an objectfile on disk. Then loaded it again, made
> a dataframe from it using a schema then saved it as a parquet.
>
>  Following is the code :
>  For saving jdbcrdd:
>   name - fullqualifiedtablename
>   pk - string for primarykey
>   pklast - last id to pull
>  val myRDD = new JdbcRDD( sc, () =>
> DriverManager.getConnection(url,username,password) ,
> "SELECT * FROM " + name + " WITH (NOLOCK) WHERE ? <= "+pk+" and
> "+pk+" <= ?",
> 1, lastpk, 1, JdbcRDD.resultSetToObjectArray)
> myRDD.saveAsObjectFile("rawdata/"+name);
>
>  For applying schema and saving the parquet:
> val myschema = schemamap(name)
> val myrdd =
> sc.objectFile[Array[Object]]("/home/bipin/rawdata/"+name).map(x =>
> org.apache.spark.sql.Row(x:_*))
>
>   Have you tried to print out x here to check its contents? My guess is
> that x actually contains unit values. For example, the follow Spark shell
> code can reproduce a similar exception:
>
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
>
> val schema = StructType(StructField("dec", DecimalType(10, 0)) :: Nil)
> val rdd = sc.parallelize(1 to 10).map(_ => Array(())).map(arr => Row(arr: _*))
> val df = sqlContext.createDataFrame(rdd, schema)
>
> df.saveAsParquetFile("file:///tmp/foo")
>
>val actualdata = sqlContext.createDataFrame(myrdd, myschema)
> actualdata.saveAsParquetFile("/home/bipin/stageddata/"+name)
>
>  Schema structtype can be made manually, though I pull table's metadata
> and make one. It is a simple string translation (see sql docs
> <https://msdn.microsoft.com/en-us/library/ms378878%28v=sql.110%29.aspx>
> and/or spark datatypes
> <https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#data-types>
> )
>
>  That is how I created the parquet file. Any help to solve the issue is
> appreciated.
>  Thanks
>  Bipin
>
>
> On 9 June 2015 at 20:44, Cheng Lian  wrote:
>
>> Would you please provide a snippet that reproduce this issue? What
>> version of Spark were you using?
>>
>> Cheng
>>
>> On 6/9/15 8:18 PM, bipin wrote:
>>
>>> Hi,
>>> When I try to save my data frame as a parquet file I get the following
>>> error:
>>>
>>> java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
>>> org.apache.spark.sql.types.Decimal
>>> at
>>>
>>> org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
>>> at
>>>
>>> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
>>> at
>>>
>>> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
>>> at
>>>
>>> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
>>> at
>>>
>>> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>>> at
>>> parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>>> at
>>> parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>>> at
>>> org.apache.spark.sql.parquet.ParquetRelation2.org
>>> $apache$spark$sql$parquet$ParquetRelation2$writeShard$1(newParquet.scala:671)
>>> at
>>>
>>> org.apache.spark.sql.parquet.ParquetRelation2$anonfun$insert$2.apply(newParquet.scala:689)
>>> at
>>>
>>> org.apache.spark.sql.parquet.ParquetRelation2$anonfun$insert$2.apply(newParquet.scala:689)
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

Re: BigDecimal problem in parquet file

2015-06-10 Thread Bipin Nag
Hi Cheng,

I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an
existing parquet file, then repartitioning and saving it. Doing this gives
the error. The code for this doesn't look like it is causing the problem; I
have a feeling the source - the existing parquet file - is the culprit.

I created that parquet file using a JdbcRDD (pulled from Microsoft SQL
Server). First I saved the JdbcRDD as an object file on disk, then loaded it
again, made a dataframe from it using a schema, and saved it as parquet.

Following is the code :
For saving jdbcrdd:
 name - fullqualifiedtablename
 pk - string for primarykey
 pklast - last id to pull
val myRDD = new JdbcRDD( sc, () =>
DriverManager.getConnection(url,username,password) ,
"SELECT * FROM " + name + " WITH (NOLOCK) WHERE ? <= "+pk+" and
"+pk+" <= ?",
1, lastpk, 1, JdbcRDD.resultSetToObjectArray)
myRDD.saveAsObjectFile("rawdata/"+name);

For applying schema and saving the parquet:
val myschema = schemamap(name)
val myrdd =
sc.objectFile[Array[Object]]("/home/bipin/rawdata/"+name).map(x =>
org.apache.spark.sql.Row(x:_*))
val actualdata = sqlContext.createDataFrame(myrdd, myschema)
actualdata.saveAsParquetFile("/home/bipin/stageddata/"+name)

The schema StructType can be made manually, though I pull the table's
metadata and build one; it is a simple string translation (see the SQL Server
docs
<https://msdn.microsoft.com/en-us/library/ms378878%28v=sql.110%29.aspx>
and/or the Spark data types
<https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#data-types>).

That is how I created the parquet file. Any help to solve the issue is
appreciated.
Thanks
Bipin


On 9 June 2015 at 20:44, Cheng Lian  wrote:

> Would you please provide a snippet that reproduce this issue? What version
> of Spark were you using?
>
> Cheng
>
> On 6/9/15 8:18 PM, bipin wrote:
>
>> Hi,
>> When I try to save my data frame as a parquet file I get the following
>> error:
>>
>> java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
>> org.apache.spark.sql.types.Decimal
>> at
>>
>> org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
>> at
>>
>> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
>> at
>>
>> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
>> at
>>
>> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
>> at
>>
>> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>> at
>> parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>> at
>> parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>> at
>> org.apache.spark.sql.parquet.ParquetRelation2.org
>> $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671)
>> at
>>
>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>> at
>>
>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> How to fix this problem ?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/BigDecimal-problem-in-parquet-file-tp23221.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>


BigDecimal problem in parquet file

2015-06-09 Thread bipin
Hi,
When I try to save my data frame as a parquet file I get the following
error:

java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
org.apache.spark.sql.types.Decimal
at
org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
at
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
at
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
at
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
at
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:671)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

How to fix this problem ?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/BigDecimal-problem-in-parquet-file-tp23221.html



Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
Cheng, you were right. It works when I remove the field from either one. I
should have checked the types beforehand. What confused me is that Spark
attempted the join and threw the error midway; it isn't quite there yet.
Thanks for the help.
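
For the record, a variant I may try instead of dropping the field (untested):
keep both columns by renaming one side before the join, so the written schema
does not have two conflicting PolicyType columns.

// rename the Bookings copy, the same way ID is already renamed to ID2
val r1 = Bookings.
  withColumnRenamed("ID", "ID2").
  withColumnRenamed("PolicyType", "BookingPolicyType")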

On Mon, Jun 8, 2015 at 8:29 PM Cheng Lian  wrote:

>  I suspect that Bookings and Customerdetails both have a PolicyType field,
> one is string and the other is an int.
>
>
> Cheng
>
>
> On 6/8/15 9:15 PM, Bipin Nag wrote:
>
>  Hi Jeetendra, Cheng
>
>  I am using following code for joining
>
> val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
> val Customerdetails =
> sqlContext.load("/home/administrator/stageddata/Customerdetails")
>
> val CD = Customerdetails.
> where($"CreatedOn" > "2015-04-01 00:00:00.0").
> where($"CreatedOn" < "2015-05-01 00:00:00.0")
>
> //Bookings by CD
> val r1 = Bookings.
> withColumnRenamed("ID","ID2")
> val r2 = CD.
> join(r1,CD.col("CustomerID") === r1.col("ID2"),"left")
>
> r2.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL");
>
>  @Cheng I am not appending the joined table to an existing parquet file,
> it is a new file.
>  @Jitender I have a rather large parquet file and it also contains some
> confidential data. Can you tell me what you need to check in it.
>
>  Thanks
>
>
> On 8 June 2015 at 16:47, Jeetendra Gangele  wrote:
>
>> Parquet file when are you loading these file?
>> can you please share the code where you are passing parquet file to
>> spark?.
>>
>> On 8 June 2015 at 16:39, Cheng Lian  wrote:
>>
>>> Are you appending the joined DataFrame whose PolicyType is string to an
>>> existing Parquet file whose PolicyType is int? The exception indicates that
>>> Parquet found a column with conflicting data types.
>>>
>>> Cheng
>>>
>>>
>>> On 6/8/15 5:29 PM, bipin wrote:
>>>
>>>> Hi I get this error message when saving a table:
>>>>
>>>> parquet.io.ParquetDecodingException: The requested schema is not
>>>> compatible
>>>> with the file schema. incompatible types: optional binary PolicyType
>>>> (UTF8)
>>>> != optional int32 PolicyType
>>>> at
>>>>
>>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
>>>> at
>>>>
>>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
>>>> at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
>>>> at
>>>>
>>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
>>>> at
>>>>
>>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
>>>> at parquet.schema.MessageType.accept(MessageType.java:55)
>>>> at
>>>> parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
>>>> at
>>>> parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
>>>> at
>>>> parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
>>>> at
>>>>
>>>> parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
>>>> at
>>>>
>>>> parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
>>>> at
>>>> parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
>>>> at
>>>>
>>>> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>>>> at
>>>>
>>>> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>>>> at
>>>> org.apache.spark.sql.parquet.ParquetRelation2.org
>>>> $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
>>>> at
>>>>
>>>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>>> at
>>>>
>>>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>>> at
>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>>> at org.apache.spark.sched

Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
Hi Jeetendra, Cheng

I am using following code for joining

val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
val Customerdetails =
sqlContext.load("/home/administrator/stageddata/Customerdetails")

val CD = Customerdetails.
where($"CreatedOn" > "2015-04-01 00:00:00.0").
where($"CreatedOn" < "2015-05-01 00:00:00.0")

//Bookings by CD
val r1 = Bookings.
withColumnRenamed("ID","ID2")
val r2 = CD.
join(r1,CD.col("CustomerID") === r1.col("ID2"),"left")

r2.saveAsParquetFile("/home/administrator/stageddata/BOOKING_FULL");

@Cheng I am not appending the joined table to an existing parquet file; it
is a new file.
@Jeetendra I have a rather large parquet file and it also contains some
confidential data. Can you tell me what you need to check in it?

Thanks


On 8 June 2015 at 16:47, Jeetendra Gangele  wrote:

> Parquet file when are you loading these file?
> can you please share the code where you are passing parquet file to spark?.
>
> On 8 June 2015 at 16:39, Cheng Lian  wrote:
>
>> Are you appending the joined DataFrame whose PolicyType is string to an
>> existing Parquet file whose PolicyType is int? The exception indicates that
>> Parquet found a column with conflicting data types.
>>
>> Cheng
>>
>>
>> On 6/8/15 5:29 PM, bipin wrote:
>>
>>> Hi I get this error message when saving a table:
>>>
>>> parquet.io.ParquetDecodingException: The requested schema is not
>>> compatible
>>> with the file schema. incompatible types: optional binary PolicyType
>>> (UTF8)
>>> != optional int32 PolicyType
>>> at
>>>
>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
>>> at
>>>
>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
>>> at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
>>> at
>>>
>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
>>> at
>>>
>>> parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
>>> at parquet.schema.MessageType.accept(MessageType.java:55)
>>> at
>>> parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
>>> at
>>> parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
>>> at
>>> parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
>>> at
>>>
>>> parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
>>> at
>>>
>>> parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
>>> at
>>> parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
>>> at
>>>
>>> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>>> at
>>>
>>> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>>> at
>>> org.apache.spark.sql.parquet.ParquetRelation2.org
>>> $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
>>> at
>>>
>>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>> at
>>>
>>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> I joined two tables both loaded from parquet file, the joined table when
>>> saved throws this error. I could not find anything about this error.
>>> Could
>>> this be a bug ?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Hi,
>
> Find my attached resume. I have total around 7 years of work experience.
> I worked for Amazon and Expedia in my previous assignments and currently I
> am working with start- up technology company called Insideview in hyderabad.
>
> Regards
> Jeetendra
>


Error in using saveAsParquetFile

2015-06-08 Thread bipin
Hi I get this error message when saving a table:

parquet.io.ParquetDecodingException: The requested schema is not compatible
with the file schema. incompatible types: optional binary PolicyType (UTF8)
!= optional int32 PolicyType
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
at
parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
at parquet.schema.MessageType.accept(MessageType.java:55)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
at 
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I joined two tables, both loaded from parquet files; the joined table throws
this error when saved. I could not find anything about this error. Could this
be a bug?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html



Create dataframe from saved objectfile RDD

2015-05-31 Thread bipin
Hi, what is the method to create a ddf (dataframe) from an RDD which is saved
as an object file? I don't have a Java object, only a StructType that I want
to use as the schema for the ddf. How do I load the object file without the
object?

I tried retrieving it as Row:

val myrdd =
sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/rawdata/"+name)

But I get:

java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
org.apache.spark.sql.Row

How do I work around this? Is there a better way?
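
For reference, a minimal sketch of the workaround I am considering: load the
file as the original Array[Object] rows and convert them, with `myschema`
being the StructType I want to apply (untested):

import org.apache.spark.sql.Row

// read back the Array[Object] rows the file was written with, then wrap them in Row
val raw  = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name)
val rows = raw.map(arr => Row(arr: _*))
val df   = sqlContext.createDataFrame(rows, myschema)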




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Create-dataframe-from-saved-objectfile-RDD-tp23095.html



Re: How to group multiple row data ?

2015-04-30 Thread Bipin Nag
OK, consider the case where there are multiple event triggers for a given
customer/vendor/product, like 1,1,2,2,3, arranged in the order of *event
occurrence* (time stamp). The output should be two groups, (1,2) and
(1,2,3): the doublet is the first occurrence of 1,2 and the triplet the later
occurrences 1,2,3.
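
To show roughly what I mean, a sketch of one way to build the groups at the
RDD level (untested, with a made-up Event case class), under the simplified
rule that a new chain starts whenever an event does not continue the previous
one; the "closest previous trigger" attachment above may need extra
bookkeeping of open chains:

import java.sql.Timestamp

case class Event(customerId: Long, supplierId: Long, productId: Long,
                 event: Int, createdOn: Timestamp)

// split one key's time-ordered events into chains; a chain continues only
// while each event is exactly one more than the previous event
def chains(evs: Seq[Event]): Seq[Seq[Event]] =
  evs.sortBy(_.createdOn.getTime).foldLeft(Vector.empty[Vector[Event]]) {
    case (acc, e) if acc.nonEmpty && e.event == acc.last.last.event + 1 =>
      acc.init :+ (acc.last :+ e)      // continue the most recent chain
    case (acc, e) =>
      acc :+ Vector(e)                 // start a new chain
  }

// eventsRdd: RDD[Event] built from the ddf
val grouped = eventsRdd
  .groupBy(e => (e.customerId, e.supplierId, e.productId))
  .flatMap { case ((c, s, p), evs) =>
    chains(evs.toSeq).map(g => (c, s, p, g.head.createdOn, g.map(_.event).mkString(",")))
  }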

On 29 April 2015 at 18:04, Manoj Awasthi  wrote:

> Sorry but I didn't fully understand the grouping. This line:
>
> >> The group must only take the closest previous trigger. The first one
> hence shows alone.
>
> Can you please explain further?
>
>
> On Wed, Apr 29, 2015 at 4:42 PM, bipin  wrote:
>
>> Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event,
>> CreatedOn), the first 3 are Long ints and event can only be 1,2,3 and
>> CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet
>> out
>> of them such that I can infer that Customer registered event from 1to 2
>> and
>> if present to 3 timewise and preserving the number of entries. For e.g.
>>
>> Before processing:
>> 10001, 132, 2002, 1, 2012-11-23
>> 10001, 132, 2002, 1, 2012-11-24
>> 10031, 102, 223, 2, 2012-11-24
>> 10001, 132, 2002, 2, 2012-11-25
>> 10001, 132, 2002, 3, 2012-11-26
>> (total 5 rows)
>>
>> After processing:
>> 10001, 132, 2002, 2012-11-23, "1"
>> 10031, 102, 223, 2012-11-24, "2"
>> 10001, 132, 2002, 2012-11-24, "1,2,3"
>> (total 5 in last field - comma separated!)
>>
>> The group must only take the closest previous trigger. The first one hence
>> shows alone. Can this be done using spark sql ? If it needs to processed
>> in
>> functionally in scala, how to do this. I can't wrap my head around this.
>> Can
>> anyone help.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-group-multiple-row-data-tp22701.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


How to group multiple row data ?

2015-04-29 Thread bipin
Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event,
CreatedOn); the first three are Long ints, Event can only be 1, 2 or 3, and
CreatedOn is a timestamp. How can I make a group (triplet/doublet/singlet)
out of them such that I can infer that a customer progressed from event 1 to
2, and if present to 3, in time order, while preserving the number of
entries? For example:

Before processing:
10001, 132, 2002, 1, 2012-11-23
10001, 132, 2002, 1, 2012-11-24
10031, 102, 223, 2, 2012-11-24
10001, 132, 2002, 2, 2012-11-25
10001, 132, 2002, 3, 2012-11-26
(total 5 rows)

After processing:
10001, 132, 2002, 2012-11-23, "1" 
10031, 102, 223, 2012-11-24, "2"
10001, 132, 2002, 2012-11-24, "1,2,3"
(total 5 in last field - comma separated!)

A group must only take the closest previous trigger, so the first one shows
alone. Can this be done using Spark SQL? If it needs to be processed
functionally in Scala, how do I do this? I can't wrap my head around it. Can
anyone help?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-group-multiple-row-data-tp22701.html



How to merge two dataframes with same schema

2015-04-22 Thread bipin
I have looked into the sqlContext documentation, but there is nothing on how
to merge two dataframes. How can I do this?
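
In case it helps frame the question: is unionAll the intended way (a minimal
sketch, assuming df1 and df2 are dataframes with identical schemas)?

// concatenate the rows of the two dataframes; the schemas must match
val merged = df1.unionAll(df2)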

Thanks
Bipin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-two-dataframes-with-same-schema-tp22606.html



Spark REPL no progress when run in cluster

2015-04-21 Thread bipin
Hi all,
I am facing an issue: whenever I run a job on my Mesos cluster, I cannot see
any progress in my terminal. It shows:

[Stage 0:>(0 + 0) / 204]
I have set up the cluster on AWS EC2 manually. I first run the Mesos master
and slaves, then run Spark on each slave machine, and finally a spark-shell
with the Mesos master as argument. The data is located on HDFS. The shell
loads properly, but I cannot run the program over my Spark cluster. The same
program runs fine on standalone Spark. I want to test the script on a
cluster.

Here is my spark-defaults.conf:

spark.executor.uri      hdfs://masterip/user/ubuntu/spark-1.4.0-SNAPSHOT-bin-2.5.0-cdh5.3.2.tgz
spark.master            mesos://masterip:5050
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://masterip/user/ubuntu/sparkeventlogs
spark.executor.memory   6g
spark.driver.memory     5g
spark.serializer        org.apache.spark.serializer.KryoSerializer

What could be the problem here? Has anyone come across this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-REPL-no-progress-when-run-in-cluster-tp22592.html



Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
Looks like a good option. BTW, v3.0 is around the corner:
http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22521.html



Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I am running the queries from spark-sql. I don't think it can communicate
with the Thrift server. Can you tell me how I should run the queries to make
it work?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22516.html



Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I was running the Spark shell and SQL with the --jars option containing the
paths when I got my error; I am not sure what the correct way to add jars is.
I tried placing the jar inside the directory you said, but I still get the
error. I will give the code you posted a try. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22514.html



Sqoop parquet file not working in spark

2015-04-13 Thread bipin
Hi, I imported a table from MS SQL Server with Sqoop 1.4.5 in parquet format,
but when I try to load it from the Spark shell, it throws an error like:

scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
during a parallel computation: java.lang.NullPointerException
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
.
.
.
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at
scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at
scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at 
scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I looked at the Sqoop parquet folder and its structure is different from the
one that I created in Spark. How can I make the parquet file work?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sqoop-parquet-file-not-working-in-spark-tp22477.html



Re: Microsoft SQL jdbc support from spark sql

2015-04-06 Thread Bipin Nag
Thanks for the information. Hopefully this will happen in the near future.
For now my best bet would be to export the data and import it into Spark SQL.

On 7 April 2015 at 11:28, Denny Lee  wrote:

> At this time, the JDBC Data source is not extensible so it cannot support
> SQL Server.   There was some thoughts - credit to Cheng Lian for this -
>  about making the JDBC data source extensible for third party support
> possibly via slick.
>
>
> On Mon, Apr 6, 2015 at 10:41 PM bipin  wrote:
>
>> Hi, I am trying to pull data from ms-sql server. I have tried using the
>> spark.sql.jdbc
>>
>> CREATE TEMPORARY TABLE c
>> USING org.apache.spark.sql.jdbc
>> OPTIONS (
>> url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
>> dbtable "Customer"
>> );
>>
>> But it shows java.sql.SQLException: No suitable driver found for
>> jdbc:sqlserver
>>
>> I have jdbc drivers for mssql but i am not sure how to use them I provide
>> the jars to the sql shell and then tried the following:
>>
>> CREATE TEMPORARY TABLE c
>> USING com.microsoft.sqlserver.jdbc.SQLServerDriver
>> OPTIONS (
>> url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
>> dbtable "Customer"
>> );
>>
>> But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
>> class com.microsoft.sqlserver.jdbc.SQLServerDriver)
>>
>> Can anyone tell what is the proper way to connect to ms-sql server.
>> Thanks
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-
>> sql-tp22399.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Microsoft SQL jdbc support from spark sql

2015-04-06 Thread bipin
Hi, I am trying to pull data from MS SQL Server. I have tried using the
spark.sql.jdbc data source:

CREATE TEMPORARY TABLE c
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
dbtable "Customer"
);

But it shows java.sql.SQLException: No suitable driver found for
jdbc:sqlserver

I have JDBC drivers for MSSQL but I am not sure how to use them. I provided
the jars to the SQL shell and then tried the following:

CREATE TEMPORARY TABLE c
USING com.microsoft.sqlserver.jdbc.SQLServerDriver
OPTIONS (
url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
dbtable "Customer"
);

But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
class com.microsoft.sqlserver.jdbc.SQLServerDriver)

Can anyone tell me the proper way to connect to MS SQL Server?
Thanks






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399.html