Help needed with Py4J
Hi Colleagues,

We need to call a Scala class from PySpark in an IPython notebook. We tried something like the following:

from py4j.java_gateway import java_import
java_import(sparkContext._jvm, 'mynamespace')
myScalaClass = sparkContext._jvm.SimpleScalaClass()
myScalaClass.sayHello("World")

This works fine. But when we try to pass the SparkContext to our class, it fails:

myContext = _jvm.MySQLContext(sparkContext)

fails with

AttributeError Traceback (most recent call last)
<ipython-input-19-34330244f574> in <module>()
----> 1 z = _jvm.MySQLContext(sparkContext)

C:\Users\i033085\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
    690
    691         args_command = ''.join(
--> 692             [get_command_part(arg, self._pool) for arg in new_args])
    693
    694         command = CONSTRUCTOR_COMMAND_NAME +\

C:\Users\i033085\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_command_part(parameter, python_proxy_pool)
    263         command_part += ';' + interface
    264     else:
--> 265         command_part = REFERENCE_TYPE + parameter._get_object_id()
    266
    267     command_part += '\n'

AttributeError: 'SparkContext' object has no attribute '_get_object_id'

And

myContext = _jvm.MySQLContext(sparkContext._jsc)

fails with

Constructor org.apache.spark.sql.MySQLContext([class org.apache.spark.api.java.JavaSparkContext]) does not exist

Would this be possible, or are there serialization issues that make it impossible? If not, what options do we have to instantiate our own SQLContext, written in Scala, from PySpark?

Best Regards,
Santosh
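P.S. For reference, a minimal sketch of what we plan to try next. It assumes our MySQLContext constructor takes a Scala SparkContext; the unwrapping via ._jsc.sc() is the part we are unsure about.

from py4j.java_gateway import java_import

jvm = sparkContext._jvm
java_import(jvm, 'org.apache.spark.sql.MySQLContext')
# sparkContext._jsc is the JavaSparkContext wrapper; sc() should return the
# underlying org.apache.spark.SparkContext, which is what a Scala-side
# SQLContext subclass would normally expect in its constructor.
scala_sc = sparkContext._jsc.sc()
myContext = jvm.MySQLContext(scala_sc)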
Re: Help needed with Py4J
Yeah, I am able to instantiate the simple Scala class, which is from the same JAR, as shown in the quoted mail below.

Regards,
Santosh

On May 20, 2015, at 7:26 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

Are your jars included in both the driver and worker class paths?

On Wednesday, May 20, 2015, Addanki, Santosh Kumar <santosh.kumar.adda...@sap.com> wrote:
[original message quoted in full; see above]

--
Cell: 425-233-8271
Twitter: https://twitter.com/holdenkarau
LinkedIn: https://www.linkedin.com/in/holdenkarau
External Data Source in Spark
Hi Colleagues,

We have currently implemented the External Data Source API and are able to push down filters and projections. Could you provide some information on how joins could also be pushed down to the underlying data source when both sides of the join come from the same database? We briefly looked at DataSourceStrategy.scala but could not get far.

Best Regards,
Santosh
External Data Source in SPARK
Hi,

We implemented an external data source by extending TableScan and added the classes to the classpath. The data source works fine when run in the Spark shell, but currently we are unable to use the same data source in the Python environment. When we execute the following in an IPython notebook:

sqlContext.sql("CREATE TEMPORARY TABLE dataTable USING MyDataSource OPTIONS (partitions '2')")

we get the following error:

Py4JJavaError: An error occurred while calling o78.sql.
: java.lang.RuntimeException: Failed to load class for data source: MyDataSource

How can we expose this data source for consumption in the PySpark environment as well?

Regards,
Santosh
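P.S. For reference, a sketch of what we plan to try next. The jar path and the package name com.mycompany.spark.MyDataSource are illustrative, and the idea that the class must be on both driver and executor classpaths and referenced by its fully qualified name is our assumption.

# Launch PySpark with the data source jar on both classpaths, e.g.:
#   pyspark --jars /path/to/mydatasource.jar --driver-class-path /path/to/mydatasource.jar
# Then reference the data source by its fully qualified class name:
sqlContext.sql(
    "CREATE TEMPORARY TABLE dataTable "
    "USING com.mycompany.spark.MyDataSource "
    "OPTIONS (partitions '2')")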
saveAsParquetFile and DirectFileOutputCommitter Class not found Error
Hi,

When we try to call saveAsParquetFile on a SchemaRDD we get the following error:

Py4JJavaError: An error occurred while calling o384.saveAsParquetFile.
: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/output/DirectFileOutputCommitter
    at org.apache.spark.sql.parquet.InsertIntoParquetTable.execute(ParquetTableOperations.scala:240)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
    at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
    at org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)

https://issues.apache.org/jira/browse/SPARK-3595 seems to have addressed this issue of respecting the configured OutputCommitter, but when I pull from master and try the same, I still encounter the problem. I am on a MapR distribution, and my org/apache/hadoop/mapreduce/lib/output package does not contain DirectFileOutputCommitter.

Best Regards,
Santosh
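P.S. A workaround sketch we are experimenting with, not verified. The idea that the MapR distribution's mapred-site.xml points the old-API committer at DirectFileOutputCommitter, and that the Parquet insert path honours this key, are both assumptions; the key and the default committer class are standard Hadoop ones.

conf = sc._jsc.hadoopConfiguration()
# See which committer the distribution has configured:
print(conf.get("mapred.output.committer.class"))
# Point it back at the stock committer before writing Parquet:
conf.set("mapred.output.committer.class",
         "org.apache.hadoop.mapred.FileOutputCommitter")
# schemaRDD is the SchemaRDD from above; the output path is illustrative.
schemaRDD.saveAsParquetFile("/user/mapr/transactions.parquet")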
Hive Context and Mapr
Hi,

We are currently using the MapR distribution. To read files from the file system we specify the path as follows:

test = sc.textFile("mapr/mycluster/user/mapr/test.csv")

This works fine from the SparkContext. But currently we are trying to create a table in Hive using the HiveContext from Spark. When we run something like:

from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("Create table deleteme_1 " + createString)
hc.sql("LOAD DATA LOCAL INPATH 'mapr/BI2-104/user/mapr/test.csv' INTO TABLE deleteme_1")

we get the following error:

FAILED: RuntimeException java.io.IOException: No FileSystem for scheme: maprfs
An error occurred while calling o79.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: RuntimeException java.io.IOException: No FileSystem for scheme: maprfs
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:302)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
    at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
    at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Best Regards,
Santosh
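P.S. A small diagnostic we plan to run first. The property names follow standard Hadoop conventions (fs.<scheme>.impl and fs.defaultFS); the expected values describe a MapR client setup and are assumptions on our part.

conf = sc._jsc.hadoopConfiguration()
# On a correctly configured MapR client we would expect com.mapr.fs.MapRFileSystem
# and maprfs:/// here; None would suggest the MapR Hadoop configuration files are
# not on the classpath of the process that runs the Hive commands.
print(conf.get("fs.maprfs.impl"))
print(conf.get("fs.defaultFS"))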
Schema RDD and saveAsTable in hive
Hi,

I have a SchemaRDD created as follows:

schemaTransactions = sqlContext.applySchema(transactions, schema)

When I try to save the SchemaRDD as a table using:

schemaTransactions.saveAsTable("transactions")

I get the error below:

Py4JJavaError: An error occurred while calling o70.saveAsTable.
: java.lang.AssertionError: assertion failed: No plan for InsertIntoCreatedTable None, transactions
 SparkLogicalPlan (ExistingRdd [GUID#21,INSTANCE#22,TRANSID#23,CONTEXT_ID#24,DIALOG_STEP#25,REPORT#26,ACCOUNT#27,MANDT#28,ACTION#29,TASKTYPE#30,TCODE#31,F12#32,F13#33,STARTDATE#34,STARTTIME#35,F16#36,RESPTIME#37,F18#38,F19#39,F20#40,F21#41], MapPartitionsRDD[11] at mapPartitions at SQLContext.scala:522)

Also, I have copied my hive-site.xml to the Spark conf folder and started the Thrift server. Where does saveAsTable store the table? In Hive?

Best Regards,
Santosh
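P.S. A sketch of what we plan to try next. Our working assumption is that saveAsTable needs Hive support, i.e. a HiveContext rather than a plain SQLContext, so that the insert into the created table can actually be planned; the variable and table names are the same as above.

from pyspark.sql import HiveContext

hc = HiveContext(sc)
schemaTransactions = hc.applySchema(transactions, schema)
# With a HiveContext the table should end up in the warehouse directory of the
# metastore configured in hive-site.xml (Derby by default if none is configured).
schemaTransactions.saveAsTable("transactions")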
Spark And Mapr
Hi,

We were using Hortonworks 2.4.1 as our Hadoop distribution and have now switched to MapR. Previously, to read a text file we would use:

test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test")

What would be the equivalent for MapR?

Best Regards,
Santosh
RE: Spark And Mapr
Hi,

We would like to do this in the PySpark environment, i.e. something like:

test = sc.textFile("maprfs:///user/root/test")

or

test = sc.textFile("hdfs:///user/root/test")

Currently, when we try

test = sc.textFile("maprfs:///user/root/test")

it throws the error: No FileSystem for scheme: maprfs

A sketch of a workaround we are considering is at the bottom of this mail.

Best Regards,
Santosh

From: Vladimir Rodionov [mailto:vrodio...@splicemachine.com]
Sent: Wednesday, October 01, 2014 3:59 PM
To: Addanki, Santosh Kumar
Cc: user@spark.apache.org
Subject: Re: Spark And Mapr

There is doc on MapR: http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications

-Vladimir Rodionov

On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar <santosh.kumar.adda...@sap.com> wrote:
[original message quoted in full; see above]
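The workaround sketch mentioned above: fs.maprfs.impl follows the standard Hadoop fs.<scheme>.impl pattern, and com.mapr.fs.MapRFileSystem is the file system class we understand the MapR client ships; whether setting it programmatically is enough, rather than fixing core-site.xml on the classpath, is an assumption.

# Register the maprfs scheme on the Hadoop configuration Spark uses, then read.
# This still requires the MapR client jars on the driver and executor classpaths.
sc._jsc.hadoopConfiguration().set("fs.maprfs.impl", "com.mapr.fs.MapRFileSystem")
test = sc.textFile("maprfs:///user/root/test")
print(test.count())  # quick sanity check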
RE: SchemaRDD and RegisterAsTable
Hi Denny,

Thanks for the reply. I have tried the same and it seems to work. I had a quick question, though. I have configured the Thrift server to use a Hive metastore (MySQL). When I connect against the Thrift server using hive, it seems to schedule a MapReduce job when I query against the table. When I run the same query using beeline, it seems to use the Spark context to execute it. Is this correct, or is something wrong with my setup? My understanding was that the Thrift server is just a HiveQL frontend and the underlying query execution would be done by Spark. A short sketch of what we took from your reply is at the bottom of this mail.

Regards,
Santosh

From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Wednesday, September 17, 2014 10:14 PM
To: user@spark.apache.org; Addanki, Santosh Kumar
Subject: Re: SchemaRDD and RegisterAsTable

The registered table is stored within the Spark context itself. To have the table available for the Thrift server, you can save the table into the Hive context so that the Thrift server process can see it. If you are using Derby as your metastore, then the Thrift server should be accessing this, as you would want to utilize the same Hive configuration (i.e. hive-site.xml). You may want to migrate your metastore to MySQL or Postgres, as it will handle concurrency better than Derby.

HTH!
Denny

On September 17, 2014 at 21:47:50, Addanki, Santosh Kumar (santosh.kumar.adda...@sap.com) wrote:
[original message quoted in full; see below]
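A minimal sketch of what we took away from the reply (method names are from the Spark 1.1 Python API; the table names are illustrative):

from pyspark.sql import HiveContext

hc = HiveContext(sc)
schemaRdd = hc.applySchema(rows, schema)   # rows and schema built as before
# A table registered this way lives only inside this SparkContext and is not
# visible to the Thrift server process:
schemaRdd.registerAsTable("tx_temp")
# Saving through the HiveContext writes the table metadata into the shared Hive
# metastore (hive-site.xml), which is what the Thrift server reads:
schemaRdd.saveAsTable("tx_persisted")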
SchemaRDD and RegisterAsTable
Hi,

We built Spark 1.1.0 with Maven using:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive clean package

and the Thrift server has been configured to use the Hive metastore. When a SchemaRDD is registered as a table, where does the metadata of this table get stored? Can it be stored in the configured Hive metastore, or somewhere else? Also, if the Thrift server is not configured to use the Hive metastore, it uses its own default (probably Derby) metastore. Would the table metadata then be stored in that metastore?

Best Regards,
Santosh