Help needed with Py4J

2015-05-20 Thread Addanki, Santosh Kumar
Hi Colleagues

We need to call a Scala class from PySpark in an IPython notebook.

We tried something like the below:

from py4j.java_gateway import java_import

java_import(sparkContext._jvm, 'mynamespace')

myScalaClass = sparkContext._jvm.SimpleScalaClass()

myScalaClass.sayHello("World") works fine.

But

When we try to pass the SparkContext to our class, it fails like below:

myContext = _jvm.MySQLContext(sparkContext) fails with


AttributeError                            Traceback (most recent call last)
<ipython-input-19-34330244f574> in <module>()
----> 1 z = _jvm.MySQLContext(sparkContext)

C:\Users\i033085\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
    690
    691         args_command = ''.join(
--> 692             [get_command_part(arg, self._pool) for arg in new_args])
    693
    694         command = CONSTRUCTOR_COMMAND_NAME +\

C:\Users\i033085\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_command_part(parameter, python_proxy_pool)
    263             command_part += ';' + interface
    264         else:
--> 265             command_part = REFERENCE_TYPE + parameter._get_object_id()
    266
    267         command_part += '\n'

AttributeError: 'SparkContext' object has no attribute '_get_object_id'




And

myContext = _jvm.MySQLContext(sparkContext._jsc) fails with


Constructor org.apache.spark.sql.MySQLContext([class 
org.apache.spark.api.java.JavaSparkContext]) does not exist





Would this be possible ... or are there serialization issues that make it not possible?

If not, what options do we have to instantiate our own SQLContext, written in Scala, from PySpark...
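
For completeness, a minimal sketch of a variant we are considering, which passes the underlying Scala SparkContext instead (this assumes MySQLContext's constructor takes a Scala SparkContext; sparkContext._jsc.sc() exposes it through Py4J, and the class/namespace names are our own):

from py4j.java_gateway import java_import

# Make our namespace visible on the gateway JVM (namespace is illustrative)
java_import(sparkContext._jvm, 'mynamespace')

# _jsc is the JavaSparkContext wrapper; .sc() returns the underlying Scala
# SparkContext, which is what a Scala-side constructor typically expects
scala_spark_context = sparkContext._jsc.sc()
myContext = sparkContext._jvm.MySQLContext(scala_spark_context)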



Best Regards,

Santosh






Re: Help needed with Py4J

2015-05-20 Thread Addanki, Santosh Kumar
Yeah ... I am able to instantiate the simple Scala class, as explained below, which is from the same JAR.

Regards
Santosh


On May 20, 2015, at 7:26 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

Are your jars included in both the driver and worker class paths?
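
For example (the jar path is a placeholder), something along these lines should put them on both:

bin/pyspark --jars /path/to/my-scala-classes.jar --driver-class-path /path/to/my-scala-classes.jar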







--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau



External Data Source in Spark

2015-03-02 Thread Addanki, Santosh Kumar
Hi Colleagues,

Currently we have implemented the External Data Source API and are able to push down filters and projections.

Could you provide some info on how joins could also be pushed down to the original data source when both data sources come from the same database?

We briefly looked at DataSourceStrategy.scala but could not get far.

Best Regards
Santosh



External Data Source in SPARK

2015-02-09 Thread Addanki, Santosh Kumar
Hi,

We implemented an External Data Source by extending TableScan, and we added the classes to the classpath.
The data source works fine when run in the Spark shell.

But currently we are unable to use this same data source in the Python environment. When we execute the following in an IPython notebook:

sqlContext.sql("CREATE TEMPORARY TABLE dataTable USING MyDataSource OPTIONS (partitions '2')")

we get the following error:

Py4JJavaError: An error occurred while calling o78.sql.
: java.lang.RuntimeException: Failed to load class for data source: MyDataSource


How can we expose this data source for consumption in the PySpark environment as well?
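
For reference, a sketch of what we are attempting, on the assumption that the data source jar must also be on the PySpark driver and executor classpaths and that the class should be referenced by its fully-qualified name (the package name below is illustrative):

# Fully-qualified data source name is illustrative
sqlContext.sql(
    "CREATE TEMPORARY TABLE dataTable "
    "USING com.mycompany.sources.MyDataSource "
    "OPTIONS (partitions '2')")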


Regards,
Santosh


saveAsParquetFile and DirectFileOutputCommitter Class not found Error

2014-12-07 Thread Addanki, Santosh Kumar
Hi,

When we try to call saveAsParquetFile on a SchemaRDD, we get the following error:


Py4JJavaError: An error occurred while calling o384.saveAsParquetFile.
: java.lang.NoClassDefFoundError: 
org/apache/hadoop/mapreduce/lib/output/DirectFileOutputCommitter
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.execute(ParquetTableOperations.scala:240)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at 
org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
at 
org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)



https://issues.apache.org/jira/browse/SPARK-3595 seems to have addressed this issue of respecting the OutputCommitter, but when I pull from master and try the same, I still encounter this issue.

I am on a MapR distribution, and my org/apache/hadoop/mapreduce/lib/output package does not contain DirectFileOutputCommitter.
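
As a quick sanity check (a minimal sketch), something like the following should tell us from PySpark whether the class is visible to the driver JVM at all:

# Returns the Class object if the committer is on the driver classpath;
# otherwise Py4J raises an error wrapping ClassNotFoundException
sc._jvm.java.lang.Class.forName(
    "org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter")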

Best Regards,
Santosh


Hive Context and Mapr

2014-11-03 Thread Addanki, Santosh Kumar
Hi

We are currently using the MapR distribution.

To read files from the file system we specify the following:

test = sc.textFile("mapr/mycluster/user/mapr/test.csv")

This works fine from the SparkContext.

But ...

Currently we are trying to create a table in Hive using the HiveContext from Spark.

So when I specify something like:

from pyspark.sql import HiveContext
hc = HiveContext(sc)

hc.sql("Create table deleteme_1 " + createString)

hc.sql("LOAD DATA LOCAL INPATH 'mapr/BI2-104/user/mapr/test.csv' INTO TABLE deleteme_1")

we get the following error:

FAILED: RuntimeException java.io.IOException: No FileSystem for scheme: maprfs


An error occurred while calling o79.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
RuntimeException java.io.IOException: No FileSystem for scheme: maprfs
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:302)
   at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
   at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
   at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
   at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
   at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
   at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
   at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:745)
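
One thing we are considering (a sketch, assuming the MapR client classes such as com.mapr.fs.MapRFileSystem are on the driver classpath) is to register the maprfs:// scheme explicitly on the Hadoop configuration before creating the HiveContext:

# Register the maprfs:// scheme on the Hadoop configuration Spark/Hive will use
# (assumes the MapR client jar providing com.mapr.fs.MapRFileSystem is on the classpath)
sc._jsc.hadoopConfiguration().set("fs.maprfs.impl", "com.mapr.fs.MapRFileSystem")

from pyspark.sql import HiveContext
hc = HiveContext(sc)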

Best Regards
Santosh



Schema RDD and saveAsTable in hive

2014-11-03 Thread Addanki, Santosh Kumar
Hi,

I have a SchemaRDD created like below:

schemaTransactions = sqlContext.applySchema(transactions, schema)

When I try to save the SchemaRDD as a table using:

schemaTransactions.saveAsTable("transactions")

I get the error below:


Py4JJavaError: An error occurred while calling o70.saveAsTable.
: java.lang.AssertionError: assertion failed: No plan for 
InsertIntoCreatedTable None, transactions
SparkLogicalPlan (ExistingRdd 
[GUID#21,INSTANCE#22,TRANSID#23,CONTEXT_ID#24,DIALOG_STEP#25,REPORT#26,ACCOUNT#27,MANDT#28,ACTION#29,TASKTYPE#30,TCODE#31,F12#32,F13#33,STARTDATE#34,STARTTIME#35,F16#36,RESPTIME#37,F18#38,F19#39,F20#40,F21#41],
 MapPartitionsRDD[11] at mapPartitions at SQLContext.scala:522)


Also, I have copied my hive-site.xml to the Spark conf folder and started the Thrift server. So where does saveAsTable store the table? In Hive?
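
For context, a minimal sketch of the variant we are testing, on the assumption that saveAsTable needs a HiveContext to persist through the configured Hive metastore (variable names as above):

from pyspark.sql import HiveContext

hc = HiveContext(sc)
schemaTransactions = hc.applySchema(transactions, schema)

# With a HiveContext, saveAsTable should create a managed table registered in the
# configured metastore (stored under the Hive warehouse directory), which also
# makes it visible to the Thrift server
schemaTransactions.saveAsTable("transactions")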

Best Regards
Santosh



Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi

We were using Hortonworks 2.4.1 as our Hadoop distribution and have now switched to MapR.

Previously, to read a text file we would use:

test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test")


What would be the equivalent for MapR?

Best Regards
Santosh


RE: Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi

We would like to do this in the PySpark environment, i.e. something like:

test = sc.textFile("maprfs:///user/root/test") or
test = sc.textFile("hdfs:///user/root/test")


Currently, when we try:
test = sc.textFile("maprfs:///user/root/test")

it throws the error:
No FileSystem for scheme: maprfs


Best Regards
Santosh



From: Vladimir Rodionov [mailto:vrodio...@splicemachine.com]
Sent: Wednesday, October 01, 2014 3:59 PM
To: Addanki, Santosh Kumar
Cc: user@spark.apache.org
Subject: Re: Spark And Mapr

There is doc on MapR:

http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications

-Vladimir Rodionov




RE: SchemaRDD and RegisterAsTable

2014-09-18 Thread Addanki, Santosh Kumar
Hi Denny

Thanks for the reply. I have tried the same and it seems to work.

I had a quick question though. I have configured Spark to use the Hive metastore (MySQL).

When I connect against the Thrift Server using hive, it seems to schedule a MapReduce job when I query against the table. When I run the same using beeline, it seems to use the Spark context to execute.

Is this correct, or is something wrong with my setup?


My understanding was that the Thrift Server was just a HiveQL frontend and the underlying query execution would be done by Spark.

Regards
Santosh

From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Wednesday, September 17, 2014 10:14 PM
To: user@spark.apache.org; Addanki, Santosh Kumar
Subject: Re: SchemaRDD and RegisterAsTable

The registered table is stored within the Spark context itself. To have the table available to the Thrift Server, you can save the table from the Spark context into the Hive context so that the Thrift Server process can see it. If you are using Derby as your metastore, then the Thrift Server should be accessing it, as you would want to utilize the same Hive configuration (i.e. hive-site.xml). You may want to migrate your metastore to MySQL or Postgres, as they handle concurrency better than Derby.
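
As a rough illustration of the difference, a sketch assuming a HiveContext backed by the shared metastore (rows and schema are placeholders for your own data):

from pyspark.sql import HiveContext

hc = HiveContext(sc)
srdd = hc.applySchema(rows, schema)

# Only visible inside this SparkContext / HiveContext process
srdd.registerTempTable("transactions_tmp")

# Persisted through the shared metastore, so the Thrift Server and beeline can see it
srdd.saveAsTable("transactions")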

HTH!
Denny











SchemaRDD and RegisterAsTable

2014-09-17 Thread Addanki, Santosh Kumar
Hi,

We built Spark 1.1.0 with Maven using:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive clean package


The Thrift Server has been configured to use the Hive metastore.

When a SchemaRDD is registered as a table, where does the metadata of this table get stored? Can it be stored in the configured Hive metastore?

Also, if the Thrift Server is not configured to use the Hive metastore, it uses its own default (probably Derby) metastore. So would the table metadata be stored in that metastore?

Best Regards
Santosh