Re: Ubuntu 18.04: Docker: start-master.sh: command not found

2021-03-31 Thread JB Data31
Locate the *start-master.sh* file with a command executed as root: *find / -name
"start-master.sh" -print*.
Check that the directory found is actually on $PATH.
As a first step, *cd* into the directory containing *start-master.sh* and run it
directly with *./start-master.sh*.
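If it helps, the same checks can be scripted; a minimal Python sketch, assuming
only the /opt/spark install path already used in this thread:

import os
import shutil

print(os.environ.get("SPARK_HOME"))      # expect /opt/spark
print(shutil.which("start-master.sh"))   # None unless its directory is on PATH
print("/opt/spark/sbin" in os.environ.get("PATH", "").split(":"))  # is the sbin dir listed?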

@*JB*Δ 



On Wed, 31 Mar 2021 at 12:35, GUINKO Ferdinand
wrote:

>
> I have exited from the container, and logged in using:
>
> sudo docker run -it -p8080:8080 ubuntu
>
> Then I tried to launch the Standalone Spark Master Server by doing:
>
> start-master.sh
>
> and got the following message:
>
> bash: start-master.sh: command not found
>
> So I started the process of setting the environment variables again by doing:
>
> echo "export SPARK_HOME=/opt/spark" >> ~/.profile
> export 
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
> echo $PATH
> echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
>
> Here is the output of the file .profile:
>
> root@291b0eb654ea:/# cat ~/.profile
> # ~/.profile: executed by Bourne-compatible login shells.
>
> if [ "$BASH" ]; then
>   if [ -f ~/.bashrc ]; then
> . ~/.bashrc
>   fi
> fi
>
> mesg n 2> /dev/null || true
> export SPARK_HOME=/opt/spark
> export 
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:/opt/spark/bin:/opt/spark/sbin
> export PYSPARK_PYTHON=/usr/bin/python3
>
> I also typed this command:
>
> root@291b0eb654ea:/# source ~/.profile
>
> I am still getting the following message:
>
> bash: start-master.sh: command not found
>
> What am I missing, please?
>
> --
> «What the world needs most is men: not men who can be bought and sold, but
> men who are deeply loyal and upright, men who are not afraid to call sin by
> its name, men whose conscience is as faithful to duty as the compass is to
> the pole, men who would defend justice and truth even if the universe were to
> collapse.» Ellen Gould WHITE, Education, p. 55
>
> On Tuesday, 30 March 2021 21:01 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Those two export lines set SPARK_HOME and PATH *as environment variables* in
> the session you have in Ubuntu.
>
> Check this website for more info
>
> bash - How do I add environment variables? - Ask Ubuntu
> 
>
>
> If you are familiar with windows, then they are equivalent to Windows
> environment variables. For example, note SPARK_HOME
>
>
> [image: image.png]
>
>
> Next, you are trying to start Spark in standalone mode. Check the
> following link
>
> Spark Standalone Mode - Spark 3.1.1 Documentation (apache.org)
> 
>
> to learn more about it
>
> Also check the logfile generated by invoking start-master.sh, as shown in your
> output:
>
> starting org.apache.spark.deploy.master.Master, logging to
> */opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-23d865d7f117.out*
>
> HTH
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On Tue, 30 Mar 2021 at 20:48, GUINKO Ferdinand <
> tonguimferdin...@guinko.net> wrote:
>
>> This is what I am having now:
>>
>> root@33z261w1a18:/opt/spark# *export SPARK_HOME=/opt/spark*
>> root@33z261w1a18:/opt/spark# *export
>> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin*
>> root@33z261w1a18:/opt/spark# *echo $PATH*
>>
>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:/opt/spark/bin:/opt/spark/sbin
>> root@33z261w1a18:/opt/spark# *start-master.sh*
>> starting org.apache.spark.deploy.master.Master, logging to
>> /opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-23d865d7f117.out
>> root@33z261w1a18:/opt/spark#
>>
>> It seems that SPARK was trying to start but couldn't.
>>
>> Please would you explain to me what the following lines mean and do:
>>
>> export SPARK_HOME=/opt/spark
>> export
>> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
>>
>> then
>>
>> echo $PATH
>>
>>
>> Thank you for the assistance.
>>
>> --
>> «What the world needs most is men: not men who can be bought and sold, but
>> men who are deeply loyal and upright, men who are not afraid to ca

Re: convert java dataframe to pyspark dataframe

2021-03-31 Thread Aditya Singh
Thanks a lot, this was really helpful.

On Wed, 31 Mar 2021 at 4:13 PM, Khalid Mammadov 
wrote:

> I think what you want to achieve is what PySpark is actually doing in its
> API under the hood.
>
> So, specifically, you need to look at PySpark's implementation of the
> DataFrame, SparkSession and SparkContext API. Under the hood that is what is
> happening: it starts a py4j gateway and delegates all Spark operations and
> object creation to the JVM.
>
> Look for example here, here and here, where
> SparkSession/SparkContext (Python) communicates with the JVM and creates
> SparkSession/SparkContext on the JVM side. The rest of the PySpark code then
> delegates execution to them.
>
> Once you have these objects created by your own custom Java/Scala application
> which holds the Spark objects, you can play around with the rest of DataFrame
> creation and passing back and forth. But, I must admit, this is not part of
> the official documentation; you are playing with Spark internals, which are
> subject to change (often).
>
> So, I am not sure what your actual requirement is, but you will need to
> implement your own custom version of the PySpark API to get all the
> functionality you need and control on the JVM side.
>
>
> On 31/03/2021 06:49, Aditya Singh wrote:
>
> Thanks a lot Khalid for replying.
>
> I have one question though. The approach you showed needs an understanding
> on the Python side, beforehand, of the data types of the dataframe's columns.
> Can we implement a generic approach where this info is not required and we
> just have the Java dataframe as input on the Python side?
>
> Also, one more question: in my use case I will be sending a dataframe from
> Java to Python, and then on the Python side there will be some transformation
> done on the dataframe (including using Python UDFs), but no actions will be
> performed there; it will then be sent back to Java, where actions will be
> performed. So I also wanted to ask if this is feasible and, if yes, do we
> need to send some special jars to the executors so that they can execute
> the UDFs on the dataframe.
>
> On Wed, 31 Mar 2021 at 3:37 AM, Khalid Mammadov 
> wrote:
>
>> Hi Aditya,
>>
>>
>> I think your original question was how to convert a DataFrame from a
>> Spark session created on Java/Scala to a DataFrame on a Spark session
>> created from Python (PySpark).
>>
>> So, as I have answered on your SO question:
>>
>>
>> There is a missing call to *entry_point* before calling getDf() in your
>> code
>>
>> So, try this :
>>
>> app = gateway.entry_point
>> j_df = app.getDf()
>>
>> Additionally, I have created a working copy using Python and Scala (hope you
>> don't mind) below. It shows how, on the Scala side, a py4j gateway is started
>> with a Spark session and a sample DataFrame, and how, on the Python side, I
>> have accessed that DataFrame and converted it to a Python List[Tuple] before
>> converting it back to a DataFrame for a Spark session on the Python side:
>>
>> *Python:*
>>
>> from py4j.java_gateway import JavaGateway
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, IntegerType, StructField
>>
>> if __name__ == '__main__':
>>     gateway = JavaGateway()
>>
>>     spark_app = gateway.entry_point
>>     df = spark_app.df()
>>
>>     # Note: "apply" here comes from Scala's companion object, used to
>>     # access elements of an array
>>     df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]
>>
>>     spark = (SparkSession
>>              .builder
>>              .appName("My PySpark App")
>>              .getOrCreate())
>>
>>     schema = StructType([
>>         StructField("a", IntegerType(), True),
>>         StructField("b", IntegerType(), True)])
>>
>>     df = spark.createDataFrame(df_to_list_tuple, schema)
>>
>>     df.show()
>>
>>
>> *Scala:*
>>
>> import java.nio.file.{Path, Paths}
>> import org.apache.spark.sql.SparkSession
>> import py4j.GatewayServer
>>
>> object SparkApp {
>>   val myFile: Path = Paths.get(System.getProperty("user.home") +
>>     "/dev/sample_data/games.csv")
>>
>>   val spark = SparkSession.builder()
>>     .master("local[*]")
>>     .appName("My app")
>>     .getOrCreate()
>>
>>   val df = spark
>>     .read
>>     .option("header", "True")
>>     .csv(myFile.toString)
>>     .collect()
>> }
>> object Py4JServerApp extends App {
>>
>>
>>   val server = new GatewayServer(SparkApp)
>>   server.start()
>>
>>   print("Started and running...")
>> }
>>
>>
>>
>>
>>
>> Regards,
>> Khalid
>>
>>
>> On 30/03/2021 07:57, Aditya Singh wrote:
>>
>> Hi Sean,
>>
>> Thanks a lot for replying and apologies for the late reply(I somehow
>> missed this mail before) but I am under the impression that passing the
>> py4j.java_gateway.JavaGateway object 

Re: Source.getBatch and schema vs qe.analyzed.schema?

2021-03-31 Thread Bartosz Konieczny
Hi Jacek,

An interesting question! I don't know the exact answer and will be happy to
learn along the way :) Below you can find my understanding of these two things,
hoping it helps a little.

For me, we can distinguish two different source categories. The first of them
is a source with a fixed schema. A good example is Apache Kafka, which
exposes the topic name, key and value, and you can't change that; it's always
the same, whether you run the reader in Company A or in Company B. What
changes is the logic for extracting data from the key, value or headers. But
that's business-specific, not data store-specific. You can find the schema
implementation here: Kafka


The second type is a source with a user-defined schema, like an RDBMS table or
a schemaless NoSQL store. Here, predicting the schema will not only be
business-specific, but also data store-specific; you can set any name for a
primary key column, and there is no fixed rule like "key" or "value" as in
Kafka. To avoid runtime errors (i.e. to favor a fail-fast approach before the
data is read), Spark can use the metadata to analyze the schema specified by
the user and confirm it, or fail fast, before reading the data.
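
As a small, hedged illustration of these two categories using PySpark's
StructType (the Kafka column list is simplified, and the user-defined column
names below are made up):

from pyspark.sql.types import (StructType, StructField, StringType,
                               BinaryType, IntegerType)

# Category 1: a fixed, store-defined schema, roughly what the Kafka source
# exposes (the real source also adds partition, offset and timestamp columns).
kafka_like_schema = StructType([
    StructField("key", BinaryType(), True),
    StructField("value", BinaryType(), True),
    StructField("topic", StringType(), True),
])

# Category 2: a user-defined schema, e.g. for an RDBMS table or a schemaless
# NoSQL store; nothing store-specific constrains these names, so they must be
# validated against the store's metadata before reading.
user_defined_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("customer_name", StringType(), True),
])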

Best,
Bartosz.



On Mon, Mar 29, 2021 at 1:07 PM Jacek Laskowski  wrote:

> Hi,
>
> I've been developing a data source with a source and sink for Spark
> Structured Streaming.
>
> I've got a question about Source.getBatch [1]:
>
> def getBatch(start: Option[Offset], end: Offset): DataFrame
>
> getBatch returns a streaming DataFrame between the offsets so the idiom
> (?) is to have a code as follows:
>
> val relation = new MyRelation(...)(sparkSession)
> val plan = LogicalRelation(relation, isStreaming = true)
> new Dataset[Row](sparkSession, plan, RowEncoder(schema))
>
> Note the use of schema [2] that is another part of the Source abstraction:
>
> def schema: StructType
>
> This is the "source" of my question. Is the above OK in a streaming sink /
> Source.getBatch?
>
> Since there are no interim operators that could change attributes (schema)
> I think it's OK.
>
> I've seen the following code and that made me wonder whether it's better
> or not compared to the solution above:
>
> val relation = new MyRelation(...)(sparkSession)
> val plan = LogicalRelation(relation, isStreaming = true)
>
> // When would we have to execute plan?
> val qe = sparkSession.sessionState.executePlan(plan)
> new Dataset[Row](sparkSession, plan, RowEncoder(qe.analyzed.schema))
>
> When would or do we have to use qe.analyzed.schema vs simply schema? Could
> qe.analyzed.schema help avoid some edge cases, and is it the preferred
> approach?
>
> Thank you for any help you can offer. Much appreciated.
>
> [1]
> https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61
> [2]
> https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L35
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>


-- 
Bartosz Konieczny
data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode


Re: Introducing Gallia: a Scala+Spark library for data manipulation

2021-03-31 Thread galliaproject
I posted another update on the scala mailing list:
https://users.scala-lang.org/t/introducing-gallia-a-library-for-data-manipulation/7112/11

It notably pertains to:
- A full *RDD*-powered example:
https://github.com/galliaproject/gallia-genemania-spark#description (via
EMR)
- New license (*BSL*): https://github.com/galliaproject/gallia-core/bsl.md;
basically it's *free for essential or small* entities



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Ubuntu 18.04: Docker: start-master.sh: command not found

2021-03-31 Thread GUINKO Ferdinand

I have exited from the container, and logged in using:

sudo docker run -it -p8080:8080 ubuntu

Then I tried to launch the Standalone Spark Master Server by doing:

start-master.sh

and got the following message:

bash: start-master.sh: command not found

So I started the process of setting the environment variables again by doing:

echo "export SPARK_HOME=/opt/spark" >> ~/.profile
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
echo $PATH
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

Here is the output of the file .profile:

root@291b0eb654ea:/# cat ~/.profile
# ~/.profile: executed by Bourne-compatible login shells.

if [ "$BASH" ]; then
  if [ -f ~/.bashrc ]; then
    . ~/.bashrc
  fi
fi

mesg n 2> /dev/null || true
export SPARK_HOME=/opt/spark
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:/opt/spark/bin:/opt/spark/sbin
export PYSPARK_PYTHON=/usr/bin/python3

I also typed this command:

root@291b0eb654ea:/# source ~/.profile

I am still getting the following message:

bash: start-master.sh: command not found

What am I missing, please?

-- 
«What the world needs most is men: not men who can be bought and sold, but men
who are deeply loyal and upright, men who are not afraid to call sin by its
name, men whose conscience is as faithful to duty as the compass is to the
pole, men who would defend justice and truth even if the universe were to
collapse.» Ellen Gould WHITE, Education, p. 55

On Tuesday, 30 March 2021 21:01 GMT, Mich Talebzadeh
wrote:
Those two export lines set SPARK_HOME and PATH as environment variables in the
session you have in Ubuntu.

Check this website for more info: bash - How do I add environment variables? -
Ask Ubuntu

If you are familiar with Windows, then they are equivalent to Windows
environment variables. For example, note SPARK_HOME.

Next, you are trying to start Spark in standalone mode. Check the following
link to learn more about it: Spark Standalone Mode - Spark 3.1.1 Documentation
(apache.org)

Also check the logfile generated by invoking start-master.sh, shown in your
output:

starting org.apache.spark.deploy.master.Master, logging to
/opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-23d865d7f117.out

HTH

 
   view my Linkedin profile
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.
On Tue, 30 Mar 2021 at 20:48, GUINKO Ferdinand wrote:

This is what I am having now:

root@33z261w1a18:/opt/spark# export SPARK_HOME=/opt/spark
root@33z261w1a18:/opt/spark# export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
root@33z261w1a18:/opt/spark# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:/opt/spark/bin:/opt/spark/sbin
root@33z261w1a18:/opt/spark# start-master.sh
starting org.apache.spark.deploy.master.Master, logging to
/opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-23d865d7f117.out
root@33z261w1a18:/opt/spark#

It seems that Spark was trying to start but couldn't.

Please would you explain to me what the following lines mean and do:

export SPARK_HOME=/opt/spark
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

then

echo $PATH

Thank you for the assistance.

-- 
«What the world needs most is men: not men who can be bought and sold, but men
who are deeply loyal and upright, men who are not afraid to call sin by its
name, men whose conscience is as faithful to duty as the compass is to the
pole, men who would defend justice and truth even if the universe were to
collapse.» Ellen Gould WHITE, Education, p. 55

On Tuesday, 30 March 2021 15:00 GMT, Mich Talebzadeh wrote:

Ok, simple:

export SPARK_HOME=/opt/spark
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

then

echo $PATH
which start-master.sh

and send the output. HTH

   view my Linkedin profile

Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from
relying on this email's technical content is explicitly disclaimed. The author
will in no case be liable for any monetary damages arising from such loss,
damage or destruction.

On Tue, 30 Mar 2021 at 13:23, GUINKO Ferdinand wrote:

Where is SPARK? I also tried this:

root@33z261w1a18:/opt/spark# whereis spark-shell
spark-shell: /opt/spark/bin/spark-shell2.cmd /opt/

Re: Spark thrift server ldap

2021-03-31 Thread Pavel Solomin
Hello!

Can anyone give this question about LDAP some attention, please?

My colleague and I have faced similar issues trying to set up LDAP for the
Thrift Server, and actually ended up patching the Hive Service code and shading
Spark's LdapAuthenticationProviderImpl.java with our own custom jar, passed
in `spark.driver.extraClassPath`.
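
For context, a minimal, hedged sketch of what passing a jar via
`spark.driver.extraClassPath` looks like; the jar path below is made up, and
for the Thrift Server the same key is normally supplied when the server is
launched, since the driver classpath must be set before the JVM starts:

from pyspark.sql import SparkSession

# Hypothetical jar path; replace with wherever the patched classes actually live.
CUSTOM_JAR = "/opt/jars/custom-ldap-auth.jar"

spark = (SparkSession.builder
         .appName("extraClassPath-illustration")
         # spark.driver.extraClassPath prepends entries to the driver's classpath
         # and is only read when the driver JVM is launched.
         .config("spark.driver.extraClassPath", CUSTOM_JAR)
         .getOrCreate())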

We are up for proposing a patch for the Spark Thrift Server codebase. For such a
patch, does it make sense to import LdapAuthenticationProviderImpl.java from the
Hive code? We are not sure what the reasons were for the Spark codebase to shade
Hive's `org.apache.hive.service` in the first place.

Thank you!



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: convert java dataframe to pyspark dataframe

2021-03-31 Thread Khalid Mammadov
I think what you want to achieve is what PySpark is actually doing in
its API under the hood.


So, specifically, you need to look at PySpark's implementation of the
DataFrame, SparkSession and SparkContext API. Under the hood that is what
is happening: it starts a py4j gateway and delegates all Spark operations
and object creation to the JVM.


Look for example here, here and here, where
SparkSession/SparkContext (Python) communicates with the JVM and creates
SparkSession/SparkContext on the JVM side. The rest of the PySpark code
then delegates execution to them.


Once you have these objects created by your own custom Java/Scala application
which holds the Spark objects, you can play around with the rest of
DataFrame creation and passing back and forth. But, I must admit, this
is not part of the official documentation; you are playing with Spark
internals, which are subject to change (often).


So, I am not sure what your actual requirement is, but you will need to
implement your own custom version of the PySpark API to get all the
functionality you need and control on the JVM side.
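
To make the delegation concrete, here is a minimal sketch (relying on PySpark
internals that may change between versions, so treat it as illustrative only)
of how a Python DataFrame wraps a JVM-side handle, and how such a handle can be
wrapped back:

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("py4j-delegation-demo").getOrCreate()

df = spark.range(3)   # a thin Python wrapper around a JVM DataFrame
j_df = df._jdf        # the underlying py4j handle to the JVM object (internal attribute)
print(type(j_df))     # py4j.java_gateway.JavaObject

# Wrapping a JVM handle back into a Python DataFrame (Spark 3.1-era internal
# constructor: it takes the handle plus the session's SQLContext). This only
# works when the handle comes from the same JVM that backs this SparkSession.
df_again = DataFrame(j_df, spark._wrapped)
df_again.show()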



On 31/03/2021 06:49, Aditya Singh wrote:

Thanks a lot Khalid for replying.

I have one question though. The approach you showed needs an
understanding on the Python side, beforehand, of the data types of
the dataframe's columns. Can we implement a generic approach where this
info is not required and we just have the Java dataframe as input on
the Python side?


Also, one more question: in my use case I will be sending a dataframe
from Java to Python, and then on the Python side there will be some
transformation done on the dataframe (including using Python UDFs), but
no actions will be performed there; it will then be sent back to Java,
where actions will be performed. So I also wanted to ask if this is
feasible and, if yes, do we need to send some special jars to the
executors so that they can execute the UDFs on the dataframe.


On Wed, 31 Mar 2021 at 3:37 AM, Khalid Mammadov
<khalidmammad...@gmail.com> wrote:


Hi Aditya,


I think your original question was how to convert a DataFrame
from a Spark session created on Java/Scala to a DataFrame on a Spark
session created from Python (PySpark).

So, as I have answered on your SO question:


There is a missing call to *entry_point* before calling getDf() in
your code

So, try this:

app = gateway.entry_point
j_df = app.getDf()

Additionally, I have created a working copy using Python and Scala
(hope you don't mind) below. It shows how, on the Scala side, a py4j
gateway is started with a Spark session and a sample DataFrame, and
how, on the Python side, I have accessed that DataFrame and converted
it to a Python List[Tuple] before converting it back to a DataFrame
for a Spark session on the Python side:

*Python:*

from py4j.java_gateway import JavaGateway
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StructField

if __name__ == '__main__':
    gateway = JavaGateway()

    spark_app = gateway.entry_point
    df = spark_app.df()

    # Note: "apply" here comes from Scala's companion object, used to
    # access elements of an array
    df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]

    spark = (SparkSession
             .builder
             .appName("My PySpark App")
             .getOrCreate())

    schema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", IntegerType(), True)])

    df = spark.createDataFrame(df_to_list_tuple, schema)

    df.show()

*Scala:*

import java.nio.file.{Path, Paths}
import org.apache.spark.sql.SparkSession
import py4j.GatewayServer

object SparkApp {
  val myFile: Path = Paths.get(System.getProperty("user.home") +
    "/dev/sample_data/games.csv")

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("My app")
    .getOrCreate()

  val df = spark
    .read
    .option("header", "True")
    .csv(myFile.toString)
    .collect()
}

object Py4JServerApp extends App {
  val server = new GatewayServer(SparkApp)
  server.start()

  print("Started and running...")
}


Regards,
Khalid


On 30/03/2021 07:57, Aditya Singh wrote:

Hi Sean,

Thanks a lot for replying and apologies for the late reply(I
somehow missed this mail before) but I am under the impression
that passing the py4j.java_gateway.JavaGateway object lets the
pyspark access the spark context created on the java side.
My use case is exactly what you mentioned in the last email. I
want to access the same spark session across java and pyspark. So
how can we share the spark context and in turn spark session,
across java and pyspark.

Regards,
Aditya


Re: convert java dataframe to pyspark dataframe

2021-03-31 Thread Aditya Singh
Thanks a lot Khalid for replying.

I have one question though. The approach you showed needs an understanding
on the Python side, beforehand, of the data types of the dataframe's columns.
Can we implement a generic approach where this info is not required and we
just have the Java dataframe as input on the Python side?

Also, one more question: in my use case I will be sending a dataframe from
Java to Python, and then on the Python side there will be some transformation
done on the dataframe (including using Python UDFs), but no actions will be
performed there; it will then be sent back to Java, where actions will be
performed. So I also wanted to ask if this is feasible and, if yes, do we need
to send some special jars to the executors so that they can execute the UDFs
on the dataframe.

On Wed, 31 Mar 2021 at 3:37 AM, Khalid Mammadov 
wrote:

> Hi Aditya,
>
>
> I think your original question was how to convert a DataFrame from a Spark
> session created on Java/Scala to a DataFrame on a Spark session created
> from Python (PySpark).
>
> So, as I have answered on your SO question:
>
>
> There is a missing call to *entry_point* before calling getDf() in your
> code
>
> So, try this :
>
> app = gateway.entry_point
> j_df = app.getDf()
>
> Additionally, I have created a working copy using Python and Scala (hope you
> don't mind) below. It shows how, on the Scala side, a py4j gateway is started
> with a Spark session and a sample DataFrame, and how, on the Python side, I
> have accessed that DataFrame and converted it to a Python List[Tuple] before
> converting it back to a DataFrame for a Spark session on the Python side:
>
> *Python:*
>
> from py4j.java_gateway import JavaGateway
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, IntegerType, StructField
>
> if __name__ == '__main__':
>     gateway = JavaGateway()
>
>     spark_app = gateway.entry_point
>     df = spark_app.df()
>
>     # Note: "apply" here comes from Scala's companion object, used to
>     # access elements of an array
>     df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]
>
>     spark = (SparkSession
>              .builder
>              .appName("My PySpark App")
>              .getOrCreate())
>
>     schema = StructType([
>         StructField("a", IntegerType(), True),
>         StructField("b", IntegerType(), True)])
>
>     df = spark.createDataFrame(df_to_list_tuple, schema)
>
>     df.show()
>
>
> *Scala:*
>
> import java.nio.file.{Path, Paths}
> import org.apache.spark.sql.SparkSession
> import py4j.GatewayServer
>
> object SparkApp {
>   val myFile: Path = Paths.get(System.getProperty("user.home") +
>     "/dev/sample_data/games.csv")
>
>   val spark = SparkSession.builder()
>     .master("local[*]")
>     .appName("My app")
>     .getOrCreate()
>
>   val df = spark
>     .read
>     .option("header", "True")
>     .csv(myFile.toString)
>     .collect()
> }
>
> object Py4JServerApp extends App {
>
>   val server = new GatewayServer(SparkApp)
>   server.start()
>
>   print("Started and running...")
> }
>
>
>
>
>
> Regards,
> Khalid
>
>
> On 30/03/2021 07:57, Aditya Singh wrote:
>
> Hi Sean,
>
> Thanks a lot for replying and apologies for the late reply(I somehow
> missed this mail before) but I am under the impression that passing the
> py4j.java_gateway.JavaGateway object lets the pyspark access the spark
> context created on the java side.
> My use case is exactly what you mentioned in the last email. I want to
> access the same spark session across java and pyspark. So how can we share
> the spark context and in turn spark session, across java and pyspark.
>
> Regards,
> Aditya
>
> On Fri, 26 Mar 2021 at 6:49 PM, Sean Owen  wrote:
>
>> The problem is that both of these are not sharing a SparkContext as far
>> as I can see, so there is no way to share the object across them, let alone
>> languages.
>>
>> You can of course write the data from Java, read it from Python.
>>
>> In some hosted Spark products, you can access the same session from two
>> languages and register the DataFrame as a temp view in Java, then access it
>> in Pyspark.
>>
>>
>> On Fri, Mar 26, 2021 at 8:14 AM Aditya Singh 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am a newbie to spark and trying to pass a java dataframe to pyspark.
>>> Foloowing link has details about what I am trying to do:-
>>>
>>>
>>> https://stackoverflow.com/questions/66797382/creating-pysparks-spark-context-py4j-java-gateway-object
>>>
>>> Can someone please help me with this?
>>>
>>> Thanks,
>>>
>>