Re:

2022-04-02 Thread Bitfox
Nice reading. Can you give a comparison of Hive on MR3 and Hive on Tez?

Thanks

On Sat, Apr 2, 2022 at 7:17 PM Sungwoo Park  wrote:

> Hi Spark users,
>
> We have published an article where we evaluate the performance of Spark
> 2.3.8 and Spark 3.2.1 (along with Hive 3). If interested, please see:
>
> https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/
>
> --- SW
>


Question for so many SQL tools

2022-03-25 Thread Bitfox
Just a question: why are there so many SQL-based tools for data jobs?

The ones I know,

Spark
Flink
Ignite
Impala
Drill
Hive
…

They seem to do similar jobs, IMO.
Thanks


Re: GraphX Support

2022-03-25 Thread Bitfox
BTW, is MLlib still in active development?

Thanks

On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:

> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
> wrote:
>
>> Hello!
>>
>>
>>
>> My team and I are evaluating GraphX as a possible solution. Would someone
>> be able to speak to the support of this Spark feature? Is there active
>> development or is GraphX in maintenance mode (e.g. updated to ensure
>> functionality with new Spark releases)?
>>
>>
>>
>> Thanks in advance for your help!
>>
>>
>>
>> --
>>
>> Jacob H. Marquez
>>
>> He/Him
>>
>> Data & Applied Scientist
>>
>> Microsoft Cloud Data Sciences
>>
>>
>>
>


Re: Continuous ML model training in stream mode

2022-03-18 Thread Bitfox
For online recommendation systems, continuous training is needed. :)
We are a live video service where the content changes every minute, so a
real-time recommendation system is a must.
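
For anyone curious, below is a minimal sketch of what continuous training looks
like with MLlib's DStream-based StreamingKMeans (the streaming implementation
Sean mentions further down). The SparkContext sc, the 10-second batch interval,
the input directory, and the 3-dimensional comma-separated features are all
placeholder assumptions, not our production setup:

from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import StreamingKMeans

ssc = StreamingContext(sc, 10)   # 10-second micro-batches

# each incoming line is assumed to look like "x1,x2,x3"
training = ssc.textFileStream("/tmp/train_dir") \
              .map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))

# the model keeps updating its cluster centers as new batches arrive
model = StreamingKMeans(k=3, decayFactor=0.9).setRandomCenters(3, 1.0, 42)
model.trainOn(training)

ssc.start()
ssc.awaitTermination()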


On Fri, Mar 18, 2022 at 3:31 AM Sean Owen  wrote:

> (Thank you, not sure that was me though)
> I don't know of plans to expose the streaming impls in ML, as they still
> work fine in MLlib and they also don't come up much. Continuous training is
> relatively rare, maybe under-appreciated, but rare in practice.
>
> On Thu, Mar 17, 2022 at 1:57 PM Gourav Sengupta 
> wrote:
>
>> Dear friends,
>>
>> a few years ago, I was in a London meetup seeing Sean (Owen) demonstrate
>> how we can try to predict the gender of individuals who are responding to
>> tweets after accepting privacy agreements, in case I am not wrong.
>>
>> It was real time, it was spectacular, and it was the presentation that
>> set me into data science and its applications.
>>
>> Thanks Sean! :)
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>> On Tue, Mar 15, 2022 at 9:39 PM Artemis User 
>> wrote:
>>
>>> Thanks Sean!  Well, it looks like we have to abandon our structured
>>> streaming model to use DStream for this, or do you see possibility to use
>>> structured streaming with ml instead of mllib?
>>>
>>> On 3/15/22 4:51 PM, Sean Owen wrote:
>>>
>>> There is a streaming k-means example in Spark.
>>> https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
>>>
>>> On Tue, Mar 15, 2022, 3:46 PM Artemis User 
>>> wrote:
>>>
 Has anyone done any experiments of training an ML model using stream
 data? especially for unsupervised models?   Any suggestions/references
 are highly appreciated...

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>


Re: Continuous ML model training in stream mode

2022-03-18 Thread Bitfox
We keep training the model on input content from a stream, but the framework
is TensorFlow, not Spark.

On Wed, Mar 16, 2022 at 4:46 AM Artemis User  wrote:

> Has anyone done any experiments of training an ML model using stream
> data? especially for unsupervised models?   Any suggestions/references
> are highly appreciated...
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Play data development with Scala and Spark

2022-03-16 Thread Bitfox
Hello,

I have written a free book which is available online, giving a beginner
introduction to Scala and Spark development.

https://github.com/bitfoxtop/Play-Data-Development-with-Scala-and-Spark/blob/main/PDDWS2-v1.pdf

If you can read Chinese then you are welcome to give any feedback. I will
update the content in my free time.

Thank you.


Re: Question on List to DF

2022-03-16 Thread Bitfox
Thank you, that makes sense.
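
As a side note, PySpark can get to a similar one-liner without the implicit
machinery, through createDataFrame — a quick sketch (the "fruit" column name
just mirrors the Scala example):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# a list of 1-tuples plus a list of column names
df1 = spark.createDataFrame([("apple",), ("orange",), ("cherry",)], ["fruit"])

# or a plain list with an element type, then rename the default "value" column
df2 = spark.createDataFrame(["apple", "orange", "cherry"], StringType()).toDF("fruit")

df1.show()
df2.show()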

On Wed, Mar 16, 2022 at 2:03 PM Lalwani, Jayesh  wrote:

> The toDF function in scala uses a bit of Scala magic that allows you to
> add methods to existing classes. Here’s a link to explanation
> https://www.oreilly.com/library/view/scala-cookbook/9781449340292/ch01s11.html
>
>
>
> In short, you can implement a class that extends the List class and add
> methods to your  list class, and you can implement an implicit converter
> that converts from List to your class. When the Scala compiler sees that
> you are calling a function on a List object that doesn’t exist in the List
> class, it will look for implicit converters that convert List object to
> another object that has the function, and will automatically call it.
>
> So, if you have a class
>
> Class MyList extends List {
> def toDF(colName: String): DataFrame{
> …..
> }
> }
>
> and a implicit converter
> implicit def convertListToMyList(list: List): MyList {
>
> ….
> }
>
> when you do
> List("apple","orange","cherry").toDF("fruit")
>
>
>
> Internally, Scala will generate the code as
> convertListToMyList(List("apple","orange","cherry")).toDF("fruit")
>
>
>
>
>
> *From: *Bitfox 
> *Date: *Wednesday, March 16, 2022 at 12:06 AM
> *To: *"user @spark" 
> *Subject: *[EXTERNAL] Question on List to DF
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> I am wondering why the list in scala spark can be converted into a
> dataframe directly?
>
>
>
> scala> val df = List("apple","orange","cherry").toDF("fruit")
>
> *df*: *org.apache.spark.sql.DataFrame* = [fruit: string]
>
>
>
> scala> df.show
>
> +--+
>
> | fruit|
>
> +--+
>
> | apple|
>
> |orange|
>
> |cherry|
>
> +--+
>
>
>
> I don't think pyspark can convert that as well.
>
>
>
> Thank you.
>


Question on List to DF

2022-03-15 Thread Bitfox
I am wondering why a list in Scala Spark can be converted into a DataFrame
directly.

scala> val df = List("apple","orange","cherry").toDF("fruit")

*df*: *org.apache.spark.sql.DataFrame* = [fruit: string]


scala> df.show

+--+

| fruit|

+--+

| apple|

|orange|

|cherry|

+--+


I don't think PySpark can do the conversion that directly.


Thank you.


Re: Unsubscribe

2022-03-11 Thread Bitfox
please send an empty email to:
user-unsubscr...@spark.apache.org
to unsubscribe yourself from the list.


On Sat, Mar 12, 2022 at 2:42 PM Aziret Satybaldiev <
satybaldiev.azi...@gmail.com> wrote:

>


Re: Issue while creating spark app

2022-02-26 Thread Bitfox
Java SDK installed?

On Sun, Feb 27, 2022 at 5:39 AM Sachit Murarka 
wrote:

> Hello ,
>
> Thanks for replying. I have installed Scala plugin in IntelliJ  first then
> also it's giving same error
>
> Cannot find project Scala library 2.12.12 for module SparkSimpleApp
>
> Thanks
> Rajat
>
> On Sun, Feb 27, 2022, 00:52 Bitfox  wrote:
>
>> You need to install scala first, the current version for spark is 2.12.15
>> I would suggest you install scala by sdk which works great.
>>
>> Thanks
>>
>> On Sun, Feb 27, 2022 at 12:10 AM rajat kumar 
>> wrote:
>>
>>> Hello Users,
>>>
>>> I am trying to create spark application using Scala(Intellij).
>>> I have installed Scala plugin in intelliJ still getting below error:-
>>>
>>> Cannot find project Scala library 2.12.12 for module SparkSimpleApp
>>>
>>>
>>> Could anyone please help what I am doing wrong?
>>>
>>> Thanks
>>>
>>> Rajat
>>>
>>


Re: Issue while creating spark app

2022-02-26 Thread Bitfox
You need to install Scala first; the current version for Spark is 2.12.15.
I would suggest you install Scala with sdk, which works great.

Thanks

On Sun, Feb 27, 2022 at 12:10 AM rajat kumar 
wrote:

> Hello Users,
>
> I am trying to create spark application using Scala(Intellij).
> I have installed Scala plugin in intelliJ still getting below error:-
>
> Cannot find project Scala library 2.12.12 for module SparkSimpleApp
>
>
> Could anyone please help what I am doing wrong?
>
> Thanks
>
> Rajat
>


Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Bitfox
I have been using TensorFlow for a long time; it's not hard to implement a
distributed training job at all, either by model parallelism or data
parallelism. I don't think there is much need to extend Spark to support
TensorFlow jobs. Just my thoughts...
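
For example, in TF 2.x a data-parallel multi-worker job only needs a
distribution strategy. A rough sketch (the model and dataset are placeholders,
the cluster wiring via the TF_CONFIG environment variable on each worker is
omitted, and in older TF 2.x releases the strategy lives under
tf.distribute.experimental):

import tensorflow as tf

# each worker reads its role from TF_CONFIG and joins the cluster
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # variables created inside the scope are mirrored across workers
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

# model.fit(train_dataset, epochs=...) then runs data-parallel training,
# with gradients all-reduced across the workers every step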


On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta 
wrote:

> Hi,
>
> I do not think that there is any reason for using over engineered
> platforms like Petastorm and Ray, except for certain use cases.
>
> What Ray is doing, except for certain use cases, could have been easily
> done by SPARK, I think, had the open source community got that steer. But
> maybe I am wrong and someone should be able to explain why the SPARK open
> source community cannot develop the capabilities which are so natural to
> almost all use cases of data processing in SPARK where the data gets
> consumed by deep learning frameworks and we are asked to use Ray or
> Petastorm?
>
> For those of us who are asking what does native integrations means please
> try to compare delta between release 2.x and 3.x and koalas before 3.2 and
> after 3.2.
>
> I am sure that the SPARK community can push for extending the dataframes
> from SPARK to deep learning and other frameworks by natively integrating
> them.
>
>
> Regards,
> Gourav Sengupta
>
>
> On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari 
> wrote:
>
>> Currently we are trying AnalyticsZoo and Ray
>>
>>
>> Von meinem iPhone gesendet
>>
>> Am 23.02.2022 um 04:53 schrieb Bitfox :
>>
>> 
>> tensorflow itself can implement the distributed computing via a
>> parameter server. Why did you want spark here?
>>
>> regards.
>>
>> On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
>>  wrote:
>>
>>> Thanks Sean for your response. !!
>>>
>>>
>>>
>>> Want to add some more background here.
>>>
>>>
>>>
>>> I am using Spark3.0+ version with Tensorflow 2.0+.
>>>
>>> My use case is not for the image data but for the Time-series data where
>>> I am using LSTM and transformers to forecast.
>>>
>>>
>>>
>>> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
>>> there has been no major development recently on those libraries. I faced
>>> the issue of version dependencies on those and had a hard time fixing the
>>> library compatibilities. Hence a couple of below doubts:-
>>>
>>>
>>>
>>>- Does *Horovod* have any dependencies?
>>>- Any other library which is suitable for my use case.?
>>>- Any example code would really be of great help to understand.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vijayant
>>>
>>>
>>>
>>> *From:* Sean Owen 
>>> *Sent:* Wednesday, February 23, 2022 8:40 AM
>>> *To:* Vijayant Kumar 
>>> *Cc:* user @spark 
>>> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>>>
>>>
>>>
>>> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware of
>>> Phishing Scams, Report questionable emails to s...@mavenir.com
>>>
>>> Sure, Horovod is commonly used on Spark for this:
>>>
>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>
>>>
>>>
>>> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
>>> vijayant.ku...@mavenir.com.invalid> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> Anyone using Apache spark with TensorFlow for building models. My
>>> requirement is to use TensorFlow distributed model training across the
>>> Spark executors.
>>>
>>> Please help me with some resources or some sample code.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vijayant
>>> --
>>>
>>> This e-mail message may contain confidential or proprietary information
>>> of Mavenir Systems, Inc. or its affiliates and is intended solely for the
>>> use of the intended recipient(s). If you are not the intended recipient of
>>> this message, you are hereby notified that any review, use or distribution
>>> of this information is absolutely prohibited and we request that you delete
>>> all copies in your control and contact us by e-mailing to
>>> secur...@mavenir.com. This message contains the views of its author and
>>> may not necessarily reflect the views of Mavenir Systems, Inc. or its
>>> affiliates, who employ syst

Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bitfox
From my viewpoint, if there were such a pay-as-you-go service, I would like to
use it. Otherwise I have to deploy a regular Spark cluster on GCP/AWS etc., and
the cost is not low.

Thanks.

On Wed, Feb 23, 2022 at 4:00 PM bo yang  wrote:

> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernnetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
How can I specify the cluster memory and cores?
For instance, I want to run a job with 16 cores and 300 GB of memory for about
1 hour. Do you have a SaaS solution for this? I can pay as I did.

Thanks

On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:

> It is not a standalone spark cluster. In some details, it deploys a Spark
> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
> and an extra REST Service. When people submit Spark application to that
> REST Service, the REST Service will create a CRD inside the
> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
> Spark application. The one click tool intends to hide these details, so
> people could just submit Spark and do not need to deal with too many
> deployment details.
>
> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>
>> Can it be a cluster installation of spark? or just the standalone node?
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
Can it be a cluster installation of Spark, or just a single standalone node?

Thanks

On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:

> Hi Spark Community,
>
> We built an open source tool to deploy and run Spark on Kubernetes with a
> one click command. For example, on AWS, it could automatically create an
> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
> be able to use curl or a CLI tool to submit Spark application. After the
> deployment, you could also install Uber Remote Shuffle Service to enable
> Dynamic Allocation on Kuberentes.
>
> Anyone interested in using or working together on such a tool?
>
> Thanks,
> Bo
>
>


Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-22 Thread Bitfox
TensorFlow itself can implement distributed computing via a parameter server.
Why do you want Spark here?

regards.

On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
 wrote:

> Thanks Sean for your response. !!
>
>
>
> Want to add some more background here.
>
>
>
> I am using Spark3.0+ version with Tensorflow 2.0+.
>
> My use case is not for the image data but for the Time-series data where I
> am using LSTM and transformers to forecast.
>
>
>
> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
> there has been no major development recently on those libraries. I faced
> the issue of version dependencies on those and had a hard time fixing the
> library compatibilities. Hence a couple of below doubts:-
>
>
>
>- Does *Horovod* have any dependencies?
>- Any other library which is suitable for my use case.?
>- Any example code would really be of great help to understand.
>
>
>
> Thanks,
>
> Vijayant
>
>
>
> *From:* Sean Owen 
> *Sent:* Wednesday, February 23, 2022 8:40 AM
> *To:* Vijayant Kumar 
> *Cc:* user @spark 
> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>
>
>
> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware of
> Phishing Scams, Report questionable emails to s...@mavenir.com
>
> Sure, Horovod is commonly used on Spark for this:
>
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
>
>
> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
> vijayant.ku...@mavenir.com.invalid> wrote:
>
> Hi All,
>
>
>
> Anyone using Apache spark with TensorFlow for building models. My
> requirement is to use TensorFlow distributed model training across the
> Spark executors.
>
> Please help me with some resources or some sample code.
>
>
>
> Thanks,
>
> Vijayant
> --
>
> This e-mail message may contain confidential or proprietary information of
> Mavenir Systems, Inc. or its affiliates and is intended solely for the use
> of the intended recipient(s). If you are not the intended recipient of this
> message, you are hereby notified that any review, use or distribution of
> this information is absolutely prohibited and we request that you delete
> all copies in your control and contact us by e-mailing to
> secur...@mavenir.com. This message contains the views of its author and
> may not necessarily reflect the views of Mavenir Systems, Inc. or its
> affiliates, who employ systems to monitor email messages, but make no
> representation that such messages are authorized, secure, uncompromised, or
> free from computer viruses, malware, or other defects. Thank You
>
> --
>
> This e-mail message may contain confidential or proprietary information of
> Mavenir Systems, Inc. or its affiliates and is intended solely for the use
> of the intended recipient(s). If you are not the intended recipient of this
> message, you are hereby notified that any review, use or distribution of
> this information is absolutely prohibited and we request that you delete
> all copies in your control and contact us by e-mailing to
> secur...@mavenir.com. This message contains the views of its author and
> may not necessarily reflect the views of Mavenir Systems, Inc. or its
> affiliates, who employ systems to monitor email messages, but make no
> representation that such messages are authorized, secure, uncompromised, or
> free from computer viruses, malware, or other defects. Thank You
>


Re: Unsubscribe

2022-02-09 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.

On Thu, Feb 10, 2022 at 1:38 AM Yogitha Ramanathan 
wrote:

>


Re: Help With unstructured text file with spark scala

2022-02-09 Thread Bitfox
Hi,

I am not sure about the whole picture, but if you want a Scala implementation,
I think you could use regexes to match and capture the keywords.
Here is one I wrote that you can modify on your end.

import scala.io.Source
import scala.collection.mutable.ArrayBuffer

// triples from the 3-field lines, pairs from the 2-field lines
val list1 = ArrayBuffer[(String, String, String)]()
val list2 = ArrayBuffer[(String, String)]()

// three '#'-separated fields vs. two '#'-separated fields
val patt1 = """^(.*)#(.*)#([^#]*)$""".r
val patt2 = """^(.*)#([^#]*)$""".r

val file = "1.txt"
val lines = Source.fromFile(file).getLines()

for (x <- lines) {
  x match {
    case patt1(k, v, z) => list1 += ((k, v, z))
    case patt2(k, v)    => list2 += ((k, v))
    case _              => println("no match")
  }
}



Now list1 and list2 hold the elements you wanted; you can convert them to
DataFrames easily.


Thanks.

On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa 
wrote:

> Hello
>
>
> Yes, for this block I can open as csv with # delimiter, but have the block
> that is no csv format.
>
> This is the likely key value.
>
> We have two different layouts in the same file. This is the “problem”.
>
> Thanks for your time.
>
>
>
> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>
>
> On 9 Feb 2022, at 00:58, Bitfox  wrote:
>
> Hello
>
> You can treat it as a csf file and load it from spark:
>
> >>> df = spark.read.format("csv").option("inferSchema",
> "true").option("header", "true").option("sep","#").load(csv_file)
> >>> df.show()
> ++---+-+
> |   Plano|Código Beneficiário|Nome Beneficiário|
> ++---+-+
> |58693 - NACIONAL ...|   65751353|   Jose Silva|
> |58693 - NACIONAL ...|   65751388|  Joana Silva|
> |58693 - NACIONAL ...|   65751353| Felipe Silva|
> |58693 - NACIONAL ...|   65751388|  Julia Silva|
> ++---+-+
>
>
> cat csv_file:
>
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>
>
> Regards
>
>
> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
> wrote:
>
>> Hi
>> I have to transform unstructured text to dataframe.
>> Could anyone please help with Scala code ?
>>
>> Dataframe need as:
>>
>> operadora filial unidade contrato empresa plano codigo_beneficiario
>> nome_beneficiario
>>
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>
>> Contrato#898011000 - FUNDACAO GERDAU
>> Empresa#FUNDACAO GERDAU
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Help With unstructured text file with spark scala

2022-02-08 Thread Bitfox
Hello,

You can treat it as a CSV file and load it with Spark:

>>> df = spark.read.format("csv").option("inferSchema",
"true").option("header", "true").option("sep","#").load(csv_file)
>>> df.show()
+--------------------+-------------------+-----------------+
|               Plano|Código Beneficiário|Nome Beneficiário|
+--------------------+-------------------+-----------------+
|58693 - NACIONAL ...|           65751353|       Jose Silva|
|58693 - NACIONAL ...|           65751388|      Joana Silva|
|58693 - NACIONAL ...|           65751353|     Felipe Silva|
|58693 - NACIONAL ...|           65751388|      Julia Silva|
+--------------------+-------------------+-----------------+


cat csv_file:

Plano#Código Beneficiário#Nome Beneficiário
58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
58693 - NACIONAL R COPART PJCE#065751388#Julia Silva


Regards



On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
wrote:

> Hi
> I have to transform unstructured text to dataframe.
> Could anyone please help with Scala code ?
>
> Dataframe need as:
>
> operadora filial unidade contrato empresa plano codigo_beneficiario
> nome_beneficiario
>
> Relação de Beneficiários Ativos e Excluídos
> Carteira em#27/12/2019##Todos os Beneficiários
> Operadora#AMIL
> Filial#SÃO PAULO#Unidade#Guarulhos
>
> Contrato#123456 - Test
> Empresa#Test
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>
> Contrato#898011000 - FUNDACAO GERDAU
> Empresa#FUNDACAO GERDAU
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: add an auto_increment column

2022-02-08 Thread Bitfox
Maybe col func is not even needed here. :)

>>> df.select(F.dense_rank().over(wOrder).alias("rank"),
"fruit","amount").show()

+----+------+------+
|rank| fruit|amount|
+----+------+------+
|   1|cherry|     5|
|   2| apple|     3|
|   2|tomato|     3|
|   3|orange|     2|
+----+------+------+
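
For a strictly sequential auto-increment id (closer to MySQL's behavior than a
rank), row_number over the same window also works, and
monotonically_increasing_id gives unique but non-contiguous ids without any
shuffle — a sketch on the same df:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.window import Window
>>> w = Window.orderBy(F.col("amount").desc())
>>> df.withColumn("id", F.row_number().over(w)).show()   # 1,2,3,4 with no gaps
>>> df.withColumn("id", F.monotonically_increasing_id()).show()   # unique, not contiguous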




On Tue, Feb 8, 2022 at 3:50 PM Mich Talebzadeh 
wrote:

> simple either rank() or desnse_rank()
>
> >>> from pyspark.sql import functions as F
> >>> from pyspark.sql.functions import col
> >>> from pyspark.sql.window import Window
> >>> wOrder = Window().orderBy(df['amount'].desc())
> >>> df.select(F.rank().over(wOrder).alias("rank"), col('fruit'),
> col('amount')).show()
> ++--+--+
> |rank| fruit|amount|
> ++--+--+
> |   1|cherry| 5|
> |   2| apple| 3|
> |   2|tomato| 3|
> |   4|orange| 2|
> ++--+--+
>
> >>> df.select(F.dense_rank().over(wOrder).alias("rank"), col('fruit'),
> col('amount')).show()
> ++--+--+
> |rank| fruit|amount|
> ++--+--+
> |   1|cherry| 5|
> |   2| apple| 3|
> |   2|tomato| 3|
> |   3|orange| 2|
> ++--+--+
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 7 Feb 2022 at 01:27,  wrote:
>
>> For a dataframe object, how to add a column who is auto_increment like
>> mysql's behavior?
>>
>> Thank you.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


foreachRDD question

2022-02-07 Thread Bitfox
Hello list,

for the code in the link:
https://github.com/apache/spark/blob/v3.2.1/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

I am not sure why the RDD-to-DataFrame logic is enclosed in a foreachRDD block.
What is the use of foreachRDD?
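
For context, this is roughly the pattern I mean, sketched in PySpark rather
than the linked Scala example (it assumes a SparkSession named spark, a
StreamingContext ssc, and a DStream of words called words):

def process(time, rdd):
    # foreachRDD hands each micro-batch to this driver-side function as a
    # plain RDD, where the DataFrame/SQL API can then be applied
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda w: (w,)), ["word"])
        df.createOrReplaceTempView("words")
        spark.sql("select word, count(*) as total from words group by word").show()

words.foreachRDD(process)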


Thanks in advance.


Re: Unsubscribe

2022-02-05 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.

On Sun, Feb 6, 2022 at 2:21 PM Rishi Raj Tandon 
wrote:

> Unsubscribe
>


Re: Python performance

2022-02-04 Thread Bitfox
Please see this test of mine:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

Don’t use the Python RDD API; use the DataFrame API instead.

Regards

On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar 
wrote:

> I'm looking into using Python interface with Spark and came across this
> [1] chart showing some performance hit when going with Python RDD. Data is
> ~ 7 years and for older version of Spark. Is this still the case with more
> recent Spark releases?
>
> I'm trying to understand what to expect from Python and Spark and under
> what conditions.
>
> [1]
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> Thanks,
> //hinko
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re:

2022-01-31 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.


On Mon, Jan 31, 2022 at 10:11 PM  wrote:

> unsubscribe
>
>
>


Re:

2022-01-31 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.


On Mon, Jan 31, 2022 at 10:23 PM Gaetano Fabiano 
wrote:

> Unsubscribe
>
> Inviato da iPhone
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unsubscribe

2022-01-31 Thread Bitfox
The signature in your message shows how to unsubscribe.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

On Mon, Jan 31, 2022 at 7:53 PM Lucas Schroeder Rossi 
wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Bitfox
Hi

In PySpark, why do RDDs need to be serialised/deserialised while DataFrames don't?
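
To make the question concrete, these two versions of the same word count are
what I am comparing — a small self-contained sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

words = ["spark", "flink", "spark", "hive"]          # placeholder data
df = spark.createDataFrame([(w,) for w in words], ["word"])
rdd = sc.parallelize(words)

# DataFrame version: the whole plan runs in the JVM (Catalyst/Tungsten),
# no Python code is invoked per row
df.groupBy("word").count().show()

# RDD version: each lambda runs in Python worker processes, so every record
# is pickled from the JVM to Python and the results are pickled back
print(rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect())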

Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov 
wrote:

> Your scala program does not use any Spark API hence faster that others. If
> you write the same code in pure Python I think it will be even faster than
> Scala program, especially taking into account these 2 programs runs on a
> single VM.
>
> Regarding Dataframe and RDD I would suggest to use Dataframes anyway since
> it's recommended approach since Spark 2.0.
> RDD for Pyspark is slow as others said it needs to be
> serialised/deserialised.
>
> One general note is that Spark is written Scala and core is running on JVM
> and Python is wrapper around Scala API and most of PySpark APIs are
> delegated to Scala/JVM to be executed. Hence most of big data
> transformation tasks will complete almost at the same time as they (Scala
> and Python) use the same API under the hood. Therefore you can also observe
> that APIs are very similar and code is written in the same fashion.
>
>
> On Sun, 30 Jan 2022, 10:10 Bitfox,  wrote:
>
>> Hello list,
>>
>> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
>> pure scala program. The result shows the pyspark RDD is too slow.
>>
>> For the operations and dataset please see:
>>
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>>
>> *program* *time*
>> scala program 49s
>> pyspark dataframe 56s
>> scala RDD 1m31s
>> pyspark RDD 7m15s
>>
>


Re: [ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Bitfox
What’s the difference between Spark and Kyuubi?

Thanks

On Mon, Jan 31, 2022 at 2:45 PM Vino Yang  wrote:

> Hi all,
>
> The Apache Kyuubi (Incubating) community is pleased to announce that
> Apache Kyuubi (Incubating) 1.4.1-incubating has been released!
>
> Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for
> large-scale data processing and analytics, built on top of Apache Spark
> and designed to support more engines (i.e. Apache Flink).
>
> Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
> for end-users to manipulate large-scale data with pre-programmed and
> extensible Spark SQL engines.
>
> We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
> and data lakes.
>
> This "out-of-the-box" model minimizes the barriers and costs for end-users
> to use Spark at the client side.
>
> At the server-side, Kyuubi server and engine's multi-tenant architecture
> provides the administrators a way to achieve computing resource isolation,
> data security, high availability, high client concurrency, etc.
>
> The full release notes and download links are available at:
> Release Notes: https://kyuubi.apache.org/release/1.4.1-incubating.html
>
> To learn more about Apache Kyuubi (Incubating), please see
> https://kyuubi.apache.org/
>
> Kyuubi Resources:
> - Issue: https://github.com/apache/incubator-kyuubi/issues
> - Mailing list: d...@kyuubi.apache.org
>
> We would like to thank all contributors of the Kyuubi community and
> Incubating
> community who made this release possible!
>
> Thanks,
> On behalf of Apache Kyuubi (Incubating) community
>


Re: unsubscribe

2022-01-30 Thread Bitfox
The signature in your mail shows the info:

To unsubscribe e-mail: user-unsubscr...@spark.apache.org



On Sun, Jan 30, 2022 at 8:50 PM Lucas Schroeder Rossi 
wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


why the pyspark RDD API is so slow?

2022-01-30 Thread Bitfox
Hello list,

I did a comparison of pyspark RDD, scala RDD, pyspark dataframe and a pure
scala program. The results show that the pyspark RDD is much too slow.

For the operations and dataset please see:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

The result table is below.
Can you give suggestions on how to optimize the RDD operation?

Thanks a lot.


program            time
scala program      49s
pyspark dataframe  56s
scala RDD          1m31s
pyspark RDD        7m15s


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Bitfox
Is there a guide for upgrading from 3.2.0 to 3.2.1?

thanks

On Sat, Jan 29, 2022 at 9:14 AM huaxin gao  wrote:

> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


may I need a join here?

2022-01-23 Thread Bitfox
>>> df.show(3)
+----+-----+
|word|count|
+----+-----+
|  on|    1|
| dec|    1|
|2020|    1|
+----+-----+
only showing top 3 rows

>>> df2.show(3)
+--------+-----+
|stopword|count|
+--------+-----+
|    able|    1|
|   about|    1|
|   above|    1|
+--------+-----+
only showing top 3 rows

>>> df3 = df.filter(~col("word").isin(df2.stopword))
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1733, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) stopword#4 missing
from word#0,count#1L in operator !Filter NOT word#0 IN (stopword#4).;
!Filter NOT word#0 IN (stopword#4)
+- LogicalRDD [word#0, count#1L], false





The filter method doesn't work here.
Maybe I need a join between the two DataFrames?
What's the syntax for that?
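
Something like a left anti join is what I have in mind — an untested sketch:

>>> df3 = df.join(df2, df.word == df2.stopword, "left_anti")
>>> df3.show(3)

or collecting the stopwords to the driver first and filtering with a plain
Python list:

>>> stop = [r.stopword for r in df2.select("stopword").collect()]
>>> df3 = df.filter(~col("word").isin(stop))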



Thank you and regards,

Bitfox


Question about ports in spark

2022-01-23 Thread Bitfox
Hello

When Spark started on my home server, I saw two ports open: 8080 for the
master web UI and 8081 for the worker web UI.
If I keep these two ports open without any network filter, are there any
security issues?

Thanks


Re: How to make batch filter

2022-01-02 Thread Bitfox
I always use the DataFrame API, though I am pretty familiar with general SQL.
I used the method you provided to create a big filter, as described here:

https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/

Thanks


On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh 
wrote:

> Well the short answer is there is no such thing as which one is more
> performant. Your mileage varies.
>
> SQL is a domain-specific language used in programming and designed for
> managing data held in a relational database management system, or for
> stream processing in a relational data stream management system.
>
>
> A DataFrame is a *Dataset* organised into named columns. It is
> conceptually equivalent to a table in a relational database or a data frame
> in R/Python, but with richer optimizations under the hood. DataFrames can
> be constructed from a wide array of sources such as: structured data
> files, tables in Apache Hive, Google BigQuery, other external databases, or
> existing RDDs.
>
>
> You use sql-API to interact from the underlying data read through by
> constructing a dataframe on it
>
>
> The way I use it is to use either
>
>
> from pyspark.sql.functions import col
>
> DF =  spark.table("alayer.joint_accounts_view")
>
> DF.select(col('transactiondate'),col('transactiontype')).orderBy(col("transactiondate")).show(5)
>
> OR
>
>
> DF.createOrReplaceTempView("tmp") ## create a temporary view
> spark.sql("select transactiondate, transactiontype from tmp order by
> transactiondate").show(5)
>
> You use as you choose. Under the hood, these APIs are using a common
> layer. So the performance for me as a practitioner (i.e. which one is more
> performant) does not come into it.
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Jan 2022 at 12:11, Bitfox  wrote:
>
>> May I ask for daraframe API and sql API, which is better on performance?
>> Thanks
>>
>> On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> your notes are really great, it really brought back the old days again
>>> :) thanks.
>>>
>>> Just to note a few points that I found useful related to this question:
>>> 1. cores and threads - page 5
>>> 2. executor cores and number settings - page 6..
>>>
>>>
>>> I think that the following example may be of use, note that I have one
>>> driver and that has 8 cores as I am running PYSPARK 3.1.2 in local mode,
>>> but this will give a way to find out a bit more possibly:
>>>
>>> 
>>> >>> from pyspark.sql.types import *
>>> >>> #create the filter dataframe, there are easier ways to do the below
>>> >>> spark.createDataFrame(list(map(lambda filter:
>>> pyspark.sql.Row(filter), [0, 1, 2, 4, 7, 9])),
>>> StructType([StructField("filter_value",
>>> IntegerType())])).createOrReplaceTempView("filters")
>>> >>> #create the main table
>>> >>> spark.range(100).createOrReplaceTempView("test_base")
>>> >>> spark.sql("SELECT id, FLOOR(RAND() * 10) rand FROM
>>> test_base").createOrReplaceTempView("test")
>>> >>> #see the partitions in the filters and the main table
>>> >>> spark.sql("SELECT * FROM filters").rdd.getNumPartitions()
>>> 8
>>> >>> spark.sql("SELECT * FROM test").rdd.getNumPartitions()
>>> 8
>>> >>> #see the number of partitions in the filtered join output, I am
>>> assuming implicit casting here
>>> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value
>>> FROM filters)").rdd.getNumPartitions()
>>> 200
>>> >>> spark.sql("SET spark.sql.shuffle.partitions=10")
>>> DataFrame[key: string, value: string]
>>> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value
>>> FROM filters)").rdd.getNumPartitions()
>>> 10
>>> ===

Re: How to make batch filter

2022-01-02 Thread Bitfox
May I ask: between the DataFrame API and the SQL API, which is better for performance?
Thanks

On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta 
wrote:

> Hi Mich,
>
> your notes are really great, it really brought back the old days again :)
> thanks.
>
> Just to note a few points that I found useful related to this question:
> 1. cores and threads - page 5
> 2. executor cores and number settings - page 6..
>
>
> I think that the following example may be of use, note that I have one
> driver and that has 8 cores as I am running PYSPARK 3.1.2 in local mode,
> but this will give a way to find out a bit more possibly:
>
> 
> >>> from pyspark.sql.types import *
> >>> #create the filter dataframe, there are easier ways to do the below
> >>> spark.createDataFrame(list(map(lambda filter: pyspark.sql.Row(filter),
> [0, 1, 2, 4, 7, 9])), StructType([StructField("filter_value",
> IntegerType())])).createOrReplaceTempView("filters")
> >>> #create the main table
> >>> spark.range(100).createOrReplaceTempView("test_base")
> >>> spark.sql("SELECT id, FLOOR(RAND() * 10) rand FROM
> test_base").createOrReplaceTempView("test")
> >>> #see the partitions in the filters and the main table
> >>> spark.sql("SELECT * FROM filters").rdd.getNumPartitions()
> 8
> >>> spark.sql("SELECT * FROM test").rdd.getNumPartitions()
> 8
> >>> #see the number of partitions in the filtered join output, I am
> assuming implicit casting here
> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value FROM
> filters)").rdd.getNumPartitions()
> 200
> >>> spark.sql("SET spark.sql.shuffle.partitions=10")
> DataFrame[key: string, value: string]
> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value FROM
> filters)").rdd.getNumPartitions()
> 10
> ====
>
> Please do refer to the following page for adaptive sql execution in SPARK
> 3, it will be of massive help particularly in case you are handling skewed
> joins, https://spark.apache.org/docs/latest/sql-performance-tuning.html
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Sun, Jan 2, 2022 at 11:24 AM Bitfox  wrote:
>
>> Thanks Mich. That looks good.
>>
>> On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh 
>> wrote:
>>
>>> LOL.
>>>
>>> You asking these questions takes me back to summer 2016 when I started
>>> writing notes on spark. Obviously earlier versions but the notion of RDD,
>>> Local, standalone, YARN etc. are still valid. Those days there were no k8s
>>> and the public cloud was not widely adopted.  I browsed it and it was
>>> refreshing for me. Anyway you may find some points addressing your
>>> questions that you tend to ask.
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 2 Jan 2022 at 00:20, Bitfox  wrote:
>>>
>>>> One more question, for this big filter, given my server has 4 Cores,
>>>> will spark (standalone mode) split the RDD to 4 partitions automatically?
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Create a list of values that you don't want anf filter oon those
>>>>>
>>>>> >>> DF = spark.range(10)
>>>>> >>> DF
>>>>> DataFrame[id: bigint]
>>>>> >>>
>>>>> >>> array = [1, 2, 3, 8]  # don't want these
>>>>> >>> DF.filter(DF.id.isin(array) == False).show()
>>>>> +---+
>>>>> | id|
>>>>> +---+
>>>>> |  0|
>>>>> |  4|
>>>>> |  5|
>>>>> |  6|
>>>>> |  7|
>>>>

Re: How to make batch filter

2022-01-02 Thread Bitfox
Thanks Mich. That looks good.

On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh 
wrote:

> LOL.
>
> You asking these questions takes me back to summer 2016 when I started
> writing notes on spark. Obviously earlier versions but the notion of RDD,
> Local, standalone, YARN etc. are still valid. Those days there were no k8s
> and the public cloud was not widely adopted.  I browsed it and it was
> refreshing for me. Anyway you may find some points addressing your
> questions that you tend to ask.
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Jan 2022 at 00:20, Bitfox  wrote:
>
>> One more question, for this big filter, given my server has 4 Cores, will
>> spark (standalone mode) split the RDD to 4 partitions automatically?
>>
>> Thanks
>>
>> On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh 
>> wrote:
>>
>>> Create a list of values that you don't want anf filter oon those
>>>
>>> >>> DF = spark.range(10)
>>> >>> DF
>>> DataFrame[id: bigint]
>>> >>>
>>> >>> array = [1, 2, 3, 8]  # don't want these
>>> >>> DF.filter(DF.id.isin(array) == False).show()
>>> +---+
>>> | id|
>>> +---+
>>> |  0|
>>> |  4|
>>> |  5|
>>> |  6|
>>> |  7|
>>> |  9|
>>> +---+
>>>
>>>  or use binary NOT operator:
>>>
>>>
>>> >>> DF.filter(*~*DF.id.isin(array)).show()
>>>
>>> +---+
>>>
>>> | id|
>>>
>>> +---+
>>>
>>> |  0|
>>>
>>> |  4|
>>>
>>> |  5|
>>>
>>> |  6|
>>>
>>> |  7|
>>>
>>> |  9|
>>>
>>> +---+
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 1 Jan 2022 at 20:59, Bitfox  wrote:
>>>
>>>> Using the dataframe API I need to implement a batch filter:
>>>>
>>>> DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)
>>>>
>>>> There are a lot of keywords should be filtered for the same column in
>>>> where statement.
>>>>
>>>> How can I make it more smater? UDF or others?
>>>>
>>>> Thanks & Happy new Year!
>>>> Bitfox
>>>>
>>>


Re: How to make batch filter

2022-01-01 Thread Bitfox
One more question: for this big filter, given that my server has 4 cores, will
Spark (standalone mode) split the RDD into 4 partitions automatically?

Thanks

On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh 
wrote:

> Create a list of values that you don't want anf filter oon those
>
> >>> DF = spark.range(10)
> >>> DF
> DataFrame[id: bigint]
> >>>
> >>> array = [1, 2, 3, 8]  # don't want these
> >>> DF.filter(DF.id.isin(array) == False).show()
> +---+
> | id|
> +---+
> |  0|
> |  4|
> |  5|
> |  6|
> |  7|
> |  9|
> +---+
>
>  or use binary NOT operator:
>
>
> >>> DF.filter(*~*DF.id.isin(array)).show()
>
> +---+
>
> | id|
>
> +---+
>
> |  0|
>
> |  4|
>
> |  5|
>
> |  6|
>
> |  7|
>
> |  9|
>
> +---+
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Jan 2022 at 20:59, Bitfox  wrote:
>
>> Using the dataframe API I need to implement a batch filter:
>>
>> DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)
>>
>> There are a lot of keywords should be filtered for the same column in
>> where statement.
>>
>> How can I make it more smater? UDF or others?
>>
>> Thanks & Happy new Year!
>> Bitfox
>>
>


Re: How to make batch filter

2022-01-01 Thread Bitfox
That’s great thanks.

On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh 
wrote:

> Create a list of values that you don't want anf filter oon those
>
> >>> DF = spark.range(10)
> >>> DF
> DataFrame[id: bigint]
> >>>
> >>> array = [1, 2, 3, 8]  # don't want these
> >>> DF.filter(DF.id.isin(array) == False).show()
> +---+
> | id|
> +---+
> |  0|
> |  4|
> |  5|
> |  6|
> |  7|
> |  9|
> +---+
>
>  or use binary NOT operator:
>
>
> >>> DF.filter(*~*DF.id.isin(array)).show()
>
> +---+
>
> | id|
>
> +---+
>
> |  0|
>
> |  4|
>
> |  5|
>
> |  6|
>
> |  7|
>
> |  9|
>
> +---+
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Jan 2022 at 20:59, Bitfox  wrote:
>
>> Using the dataframe API I need to implement a batch filter:
>>
>> DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)
>>
>> There are a lot of keywords should be filtered for the same column in
>> where statement.
>>
>> How can I make it more smater? UDF or others?
>>
>> Thanks & Happy new Year!
>> Bitfox
>>
>


How to make batch filter

2022-01-01 Thread Bitfox
Using the DataFrame API I need to implement a batch filter:

DF.select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)

There are a lot of keywords that should be filtered out on the same column in
the where statement.

How can I make it smarter? A UDF or something else?
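
For the record, this is roughly what I mean by the UDF option — a sketch that
assumes a SparkSession named spark, a DataFrame DF as above, and placeholder
values for the keyword list and the column name:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

keywords = {"a", "b", "c"}                      # the values to filter out
bc = spark.sparkContext.broadcast(keywords)

keep = F.udf(lambda v: v not in bc.value, BooleanType())
result = DF.where(keep(F.col("col_name")))

Although for a plain keyword list, ~col("col_name").isin([...]) or a left anti
join against a small DataFrame of keywords avoids the Python UDF overhead.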

Thanks & Happy new Year!
Bitfox


my first data science project with spark

2021-12-26 Thread bitfox

Hello list,

Thanks to the Spark project and the community, I have finished my first data 
statistics project with Spark.

The url:  https://github.com/bitfoxtop/EmailRankings
Surely this is not really big data... I could even write a Python script to 
finish the job more quickly.

But since the job was done in Spark, I want to share it here.

Thanks for your reviews.

regards
Bitfox

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread bitfox
Thanks a lot, Hollis. It was indeed due to the PyPI version. Now I have 
updated it.


$ pip3 -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

$ pip3 install sparkmeasure
Collecting sparkmeasure
  Using cached 
https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl

Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0

$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
...

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from 
range(1000) cross join range(1000) cross join range(100)").show()')

+---------+
| count(1)|
+---------+
|100000000|
+---------+
...


Hope it helps others who have hit the same issue.
Happy holidays. :0

Bitfox
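
PS: for a quick-and-dirty comparison without sparkmeasure, plain wall-clock
timing around an action also works (spark.time() is a Scala-shell helper, as
far as I know) — a sketch:

import time

t0 = time.perf_counter()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
print("elapsed: %.3f s" % (time.perf_counter() - t0))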


On 2021-12-25 09:48, Hollis wrote:

 Replied mail 
From: Mich Talebzadeh
Date: 12/25/2021 00:25
To: Sean Owen
Cc: user, Luca Canali
Subject: Re: measure running time

Hi Sean,

I have already discussed an issue in my case with Spark 3.1.1 and
sparkmeasure  with the author Luca Canali on this matter. It has been
reproduced. I think we ought to wait for a patch.

HTH,

Mich

   view my Linkedin profile [1]

Disclaimer: Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.

On Fri, 24 Dec 2021 at 14:51, Sean Owen  wrote:


You probably did not install it on your cluster, nor included the
python package with your app

On Fri, Dec 24, 2021, 4:35 AM  wrote:


but I already installed it:

Requirement already satisfied: sparkmeasure in
/usr/local/lib/python2.7/dist-packages

so how? thank you.

On 2021-12-24 18:15, Hollis wrote:

Hi bitfox,

you need pip install sparkmeasure firstly. then can lanch in

pysaprk.



from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select

count(*)

from range(1000) cross join range(1000) cross join
range(100)").show()')
+-+

| count(1)|
+-+
|1|
+-+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages

ch.cern.sparkmeasure:spark-measure_2.12:0.17


I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:

Thanks Gourav and Luca. I will try with the tools you provide

in

the

Github.

On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a

simplistic

approach that may lead you to miss important details, in

particular

when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be

quite

useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of

automating

collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at

all in

distributed computation. Just saying that an operation in RDD

and

Dataframe can be compared based on their start and stop time

may

not

provide any valid information.

You will have to look into the details of timing and the

steps.

For

example, please look at the SPARK UI to see how timings are

calculated

in distributed computing mode, there are several well written

papers

on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the

command?

I just want to compare the running time of the RDD API and

dataframe


API, in my this blog:










https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.











-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org










-

To uns

df.show() to text file

2021-12-24 Thread bitfox

Hello list,

spark newbie here :0
How can I write the df.show() output to a text file on the filesystem?
I run it in the pyspark shell, not as a standalone Python program.
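
One workaround sketch, since df.show() just prints to stdout (the path and the
row count are placeholders, and df is assumed to exist already):

import contextlib

with open("/tmp/df_show.txt", "w") as f, contextlib.redirect_stdout(f):
    df.show(20, truncate=False)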

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread bitfox

As you see below:

$ pip install sparkmeasure
Collecting sparkmeasure
  Using cached 
https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl

Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0


$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
..

from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'


That still doesn't work.
I run Spark 3.2.0 on an Ubuntu system.

Regards.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread bitfox

but I already installed it:

Requirement already satisfied: sparkmeasure in 
/usr/local/lib/python2.7/dist-packages


so how? thank you.

On 2021-12-24 18:15, Hollis wrote:

Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in pyspark.


from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis
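
For reference, sparkMeasure also has a begin/end style API that avoids
wrapping the workload in a string; this is a sketch from memory of the
project's README, so the method names are worth double-checking there:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)

stagemetrics.begin()        # start collecting stage metrics
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()          # stop collecting

stagemetrics.print_report() # elapsed time, executor time, shuffle bytes, ...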

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:

Thanks Gourav and Luca. I will try with the tools you provide in the
Github.

On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:

hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe
API, in my this blog:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() it doesn't work.
Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Dataframe's storage size

2021-12-23 Thread bitfox

Hello

Is it possible to know a dataframe's total storage size in bytes? Something 
like:



df.size()

Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1660, in 
__getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, 
name))

AttributeError: 'DataFrame' object has no attribute 'size'

Of course it doesn't work, but it would be great if such a method existed.

Thanks.
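
There is no built-in DataFrame.size() in PySpark, but a couple of
approximations are possible; a sketch under the assumption that an estimate
is good enough:

from pyspark import StorageLevel

df = spark.range(0, 1000000)

# Option 1: cache the dataframe, force materialization, then read the
# in-memory size from the "Storage" tab of the Spark UI.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()

# Option 2: the optimizer's own size estimate in bytes. This goes through
# py4j internals, so treat it as best-effort rather than a stable API.
print(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes())

# Option 3: df.explain(mode="cost") prints the same sizeInBytes estimate
# alongside the optimized plan.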

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-23 Thread bitfox

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try with the tools you provide in the 
Github.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe

API, in my this blog:


https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.



-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-23 Thread bitfox
Thanks Gourav and Luca. I will try with the tools you provide in the 
Github.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe

API, in my this blog:


https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.



-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



measure running time

2021-12-23 Thread bitfox

hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe 
API, in my this blog:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() it doesn't work.
Thank you.
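
As far as I know spark.time() only exists in the Scala shell, so in pyspark
a plain Python timer around an action is the simplest substitute. A minimal
sketch (the input path is made up; remember that transformations are lazy,
so only time a call that ends in an action such as count()):

import time

def timed(label, fn):
    # run fn() and print the wall-clock time it took
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

rdd = spark.sparkContext.textFile("/tmp/emails.txt")
df = spark.read.text("/tmp/emails.txt")

timed("RDD count", lambda: rdd.filter(lambda line: "@" in line).count())
timed("DataFrame count", lambda: df.filter(df.value.contains("@")).count())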

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Unable to use WriteStream to write to delta file.

2021-12-17 Thread bitfox
May I ask why you don’t  use spark.read and spark.write instead of 
readStream and writeStream? Thanks.
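
For what it is worth, that particular NoSuchMethodError on
ParquetSchemaConverter.checkFieldNames is usually a sign that the delta-core
jar was built for an older Spark than the 3.2 runtime (Delta Lake 1.1.x is
the line that targets Spark 3.2, if I remember correctly), rather than a
problem with the readStream/writeStream code itself. For reference, a
minimal sketch of the streaming pattern being discussed, with made-up paths;
note that a checkpointLocation is required for writeStream:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# streaming CSV sources need an explicit schema
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

stream_df = (spark.readStream
             .schema(schema)
             .option("header", "true")
             .csv("/tmp/in/"))                       # hypothetical input directory

query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/chk/")  # required for streaming writes
         .outputMode("append")
         .start("/tmp/out/delta"))                   # hypothetical Delta table path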


On 2021-12-17 15:09, Abhinav Gundapaneni wrote:

Hello Spark community,

I’m using Apache spark(version 3.2) to read a CSV file to a
dataframe using ReadStream, process the dataframe and write the
dataframe to Delta file using WriteStream. I’m getting a failure
during the WriteStream process. I’m trying to run the script locally
in my windows 11 machine. Below is the stack trace of the error I’m
facing. Please let me know if there’s anything that I’m missing.

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldNames(Lscala/collection/Seq;)V
	at org.apache.spark.sql.delta.schema.SchemaUtils$.checkFieldNames(SchemaUtils.scala:958)
	at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata(OptimisticTransaction.scala:216)
	at org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata$(OptimisticTransaction.scala:214)
	at org.apache.spark.sql.delta.OptimisticTransaction.verifyNewMetadata(OptimisticTransaction.scala:80)
	at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:208)
	at org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:195)
	at org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:80)
	at org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:101)
	at org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata$(ImplicitMetadataOperation.scala:62)
	at org.apache.spark.sql.delta.sources.DeltaSink.updateMetadata(DeltaSink.scala:37)
	at org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:59)
	at org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata$(ImplicitMetadataOperation.scala:50)
	at org.apache.spark.sql.delta.sources.DeltaSink.updateMetadata(DeltaSink.scala:37)
	at org.apache.spark.sql.delta.sources.DeltaSink.$anonfun$addBatch$1(DeltaSink.scala:80)
	at org.apache.spark.sql.delta.sources.DeltaSink.$anonfun$addBatch$1$adapted(DeltaSink.scala:54)
	at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:188)
	at org.apache.spark.sql.delta.sources.DeltaSink.addBatch(DeltaSink.scala:54)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:600)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:598)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:598)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:228)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:193)
	at org.apache.spark.sql.execution.streaming.OneTimeExecutor.execute(TriggerExecutor.scala:39)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:187)
	at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:303)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)


issue on define a dataframe

2021-12-14 Thread bitfox

Hello,

Spark newbie here :)

Why can't I create a dataframe with just one column?

for instance, this works:


df=spark.createDataFrame([("apple",2),("orange",3)],["name","count"])



But this doesn't work:


df=spark.createDataFrame([("apple"),("orange")],["name"])

Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/session.py", line 675, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/opt/spark/python/pyspark/sql/session.py", line 700, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/opt/spark/python/pyspark/sql/session.py", line 512, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/opt/spark/python/pyspark/sql/session.py", line 439, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/opt/spark/python/pyspark/sql/session.py", line 439, in 
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/opt/spark/python/pyspark/sql/types.py", line 1067, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: 


how can I fix it?

Thanks
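
The reason is on the Python side: ("apple") is just the string "apple"
(parentheses alone do not make a tuple), so Spark receives bare strings and
cannot infer a row schema from them. A one-element tuple needs a trailing
comma; a small sketch:

# the trailing comma makes ("apple",) a 1-tuple, i.e. a row with one column
df = spark.createDataFrame([("apple",), ("orange",)], ["name"])
df.show()

# alternative: pass plain strings with an explicit element type, then name the column
df2 = spark.createDataFrame(["apple", "orange"], "string").toDF("name")
df2.show()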

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: About some Spark technical assistance

2021-12-12 Thread bitfox

GitHub URL, please.

On 2021-12-13 01:06, sam smith wrote:

Hello guys,

I am replicating a paper's algorithm (graph coloring algorithm) in
Spark under Java, and thought about asking you guys for some
assistance to validate / review my 600 lines of code. Any volunteers
to share the code with ?
Thanks


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: creating database issue

2021-12-07 Thread bitfox

Also, I can't start the spark-sql shell; the error is below.
Does this mean I need to install Hive on the local machine?

Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader jdk.internal.loader.ClassLoaders$AppClassLoader@5ffd2b27, see the next exception for details.
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
	at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
	at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
	at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
	at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
	at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
	at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
	at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
	at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
	at com.jolbox.bonecp.BoneCP.(BoneCP.java:416)
	... 89 more
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader jdk.internal.loader.ClassLoaders$AppClassLoader@5ffd2b27, see the next exception for details.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
	... 105 more


Thanks.
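
A guess, since the trace stops at the Derby layer: the embedded Derby
metastore only allows one active connection, so a second pyspark/spark-sql
session started from the same working directory, or a crashed one that left
its lock file behind, typically produces exactly this "Failed to start
database 'metastore_db'" error. A quick check from that directory (a sketch;
db.lck and dbex.lck are Derby's lock files):

import os

for name in ("metastore_db/db.lck", "metastore_db/dbex.lck"):
    print(name, "exists:", os.path.exists(name))

If a lock is left over from a dead session it can usually be removed, or the
shell can be started from a different directory; for real concurrent access
an external Hive metastore is the usual answer, so a local Hive install is
not strictly required just to run spark-sql.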

On 2021/12/8 9:28, bitfox wrote:

Hello

This is just a standalone deployment for testing purpose.
The version:
Spark 3.2.0 (git revision 5d45a415f3) built for Hadoop 3.3.1
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3.2 -Phive -Phive-thriftserver


I just started one master and one worker for the test.

Thanks


On 2021/12/8 9:15, Qian Sun wrote:
   It seems to be a hms question. Would u like to provide the 
information about spark version, hive version and spark application 
configuration?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: creating database issue

2021-12-07 Thread bitfox

Hello

This is just a standalone deployment for testing purpose.
The version:
Spark 3.2.0 (git revision 5d45a415f3) built for Hadoop 3.3.1
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3.2 -Phive -Phive-thriftserver


I just started one master and one worker for the test.

Thanks


On 2021/12/8 9:15, Qian Sun wrote:

   It seems to be a hms question. Would u like to provide the information about 
spark version, hive version and spark application configuration?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



creating database issue

2021-12-07 Thread bitfox

Sorry, I am a newbie to Spark.

When I create a database in the pyspark shell, following the book 
Learning Spark 2.0, I get:


>>> spark.sql("CREATE DATABASE learn_spark_db")
21/12/08 09:01:34 WARN HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
21/12/08 09:01:34 WARN HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
21/12/08 09:01:39 WARN ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so 
recording the schema version 2.3.0
21/12/08 09:01:39 WARN ObjectStore: setMetaStoreSchemaVersion called but 
recording version is disabled: version = 2.3.0, comment = Set by 
MetaStore pyh@185.213.174.249
21/12/08 09:01:40 WARN ObjectStore: Failed to get database default, 
returning NoSuchObjectException
21/12/08 09:01:40 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
21/12/08 09:01:40 WARN ObjectStore: Failed to get database 
learn_spark_db, returning NoSuchObjectException


Can you point out what is wrong?

Thanks.
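
As far as I can tell, most of those lines are warnings rather than errors:
Spark checks whether a database exists before creating it, so the
NoSuchObjectException messages are normal first-run noise from the embedded
metastore. A minimal sketch that makes the warehouse location explicit when
building the session (the path is just an example):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-test")
         .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  # example path
         .enableHiveSupport()   # this build has Hive support compiled in
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("SHOW DATABASES").show()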

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org