Re:

2022-04-02 Thread Bitfox
Nice reading. Can you give a comparison of Hive on MR3 and Hive on Tez?

Thanks

On Sat, Apr 2, 2022 at 7:17 PM Sungwoo Park  wrote:

> Hi Spark users,
>
> We have published an article where we evaluate the performance of Spark
> 2.3.8 and Spark 3.2.1 (along with Hive 3). If interested, please see:
>
> https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/
>
> --- SW
>


Re: out of memory error

2022-03-29 Thread Bitfox
That might be it. When I switched to a 4 GB node it just worked.

Thanks

On Tue, Mar 29, 2022 at 8:49 PM Pau Tallada  wrote:

> I don't know what to say.
> If it fails with OutOfMemory, then you have to assign more memory to it.
>
> Also, a 2 GB VM for a Hadoop node is too tiny. The Hadoop ecosystem is
> usually memory-intensive.
>
> Missatge de Bitfox  del dia dt., 29 de març 2022 a les
> 14:46:
>
>> Yes, a quite small table with 1 rows for test purposes.
>>
>> Thanks
>>
>> On Tue, Mar 29, 2022 at 8:43 PM Pau Tallada  wrote:
>>
>>> Hi,
>>>
>>> I think it depends a lot on the data volume you are trying to process.
>>> Does it work with a smaller table?
>>>
>>> Missatge de Bitfox  del dia dt., 29 de març 2022 a
>>> les 14:39:
>>>
>>>> 0: jdbc:hive2://localhost:1/default> set
>>>> hive.tez.container.size=1024;
>>>>
>>>> No rows affected (0.027 seconds)
>>>>
>>>>
>>>> 0: jdbc:hive2://localhost:1/default> set hive.execution.engine;
>>>>
>>>> +---+
>>>>
>>>> |set|
>>>>
>>>> +---+
>>>>
>>>> | hive.execution.engine=mr  |
>>>>
>>>> +---+
>>>>
>>>> 1 row selected (0.048 seconds)
>>>>
>>>>
>>>> 0: jdbc:hive2://localhost:1/default> set
>>>> mapreduce.map.memory.mb=1024;
>>>>
>>>> No rows affected (0.032 seconds)
>>>>
>>>> 0: jdbc:hive2://localhost:1/default> set
>>>> mapreduce.map.java.opts=-Xmx1024m;
>>>>
>>>> No rows affected (0.01 seconds)
>>>>
>>>> 0: jdbc:hive2://localhost:1/default> set
>>>> mapreduce.reduce.memory.mb=1024;
>>>>
>>>> No rows affected (0.014 seconds)
>>>>
>>>> 0: jdbc:hive2://localhost:1/default> set
>>>> mapreduce.reduce.java.opts=-Xmx1024m;
>>>>
>>>> No rows affected (0.015 seconds)
>>>>
>>>>
>>>> 0: jdbc:hive2://localhost:1/default> select job,count(*) as dd from
>>>> ppl group by job limit 10;
>>>>
>>>> Error: Error while processing statement: FAILED: Execution Error,
>>>> return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
>>>> (state=08S01,code=2)
>>>>
>>>>
>>>>
>>>>
>>>> Sorry my test VM has 2gb Ram only. So I set all the above memory size
>>>> to 1GB.
>>>>
>>>> But it still gets the same error.
>>>>
>>>>
>>>>
>>>> please help. thanks.
>>>>
>>>>
>>>>
>>>> On Tue, Mar 29, 2022 at 8:32 PM Pau Tallada  wrote:
>>>>
>>>>> I assume you have to increase container size (if using tez/yarn)
>>>>>
>>>>> Missatge de Bitfox  del dia dt., 29 de març 2022 a
>>>>> les 14:30:
>>>>>
>>>>>> My hive run out of memory even for a small query:
>>>>>>
>>>>>> 2022-03-29T20:26:51,440  WARN [Thread-1329] mapred.LocalJobRunner:
>>>>>> job_local300585280_0011
>>>>>>
>>>>>> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>> at
>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>>>>>> ~[hadoop-mapreduce-client-common-3.3.2.jar:?]
>>>>>>
>>>>>> at
>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
>>>>>> ~[hadoop-mapreduce-client-common-3.3.2.jar:?]
>>>>>>
>>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>>
>>>>>>
>>>>>> hadoop-3.3.2
>>>>>>
>>>>>> hive-3.1.2
>>>>>>
>>>>>> java version "1.8.0_321"
>>>>>>
>>>>>>
>>>>>>
>>>>>> How to fix this? thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> Pau Tallada Crespí
>>>>> Departament de Serveis
>>>>> Port d'Informació Científica (PIC)
>>>>> Tel: +34 93 170 2729
>>>>> --
>>>>>
>>>>>
>>>
>>> --
>>> --
>>> Pau Tallada Crespí
>>> Departament de Serveis
>>> Port d'Informació Científica (PIC)
>>> Tel: +34 93 170 2729
>>> --
>>>
>>>
>
> --
> --
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> --
>
>


Re: out of memory error

2022-03-29 Thread Bitfox
Yes, a quite small table with 1 rows for test purposes.

Thanks

On Tue, Mar 29, 2022 at 8:43 PM Pau Tallada  wrote:

> Hi,
>
> I think it depends a lot on the data volume you are trying to process.
> Does it work with a smaller table?
>
> Missatge de Bitfox  del dia dt., 29 de març 2022 a les
> 14:39:
>
>> 0: jdbc:hive2://localhost:1/default> set hive.tez.container.size=1024;
>>
>> No rows affected (0.027 seconds)
>>
>>
>> 0: jdbc:hive2://localhost:1/default> set hive.execution.engine;
>>
>> +---+
>>
>> |set|
>>
>> +---+
>>
>> | hive.execution.engine=mr  |
>>
>> +---+
>>
>> 1 row selected (0.048 seconds)
>>
>>
>> 0: jdbc:hive2://localhost:1/default> set mapreduce.map.memory.mb=1024;
>>
>> No rows affected (0.032 seconds)
>>
>> 0: jdbc:hive2://localhost:1/default> set
>> mapreduce.map.java.opts=-Xmx1024m;
>>
>> No rows affected (0.01 seconds)
>>
>> 0: jdbc:hive2://localhost:1/default> set
>> mapreduce.reduce.memory.mb=1024;
>>
>> No rows affected (0.014 seconds)
>>
>> 0: jdbc:hive2://localhost:1/default> set
>> mapreduce.reduce.java.opts=-Xmx1024m;
>>
>> No rows affected (0.015 seconds)
>>
>>
>> 0: jdbc:hive2://localhost:1/default> select job,count(*) as dd from
>> ppl group by job limit 10;
>>
>> Error: Error while processing statement: FAILED: Execution Error, return
>> code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
>> (state=08S01,code=2)
>>
>>
>>
>>
>> Sorry my test VM has 2gb Ram only. So I set all the above memory size to
>> 1GB.
>>
>> But it still gets the same error.
>>
>>
>>
>> please help. thanks.
>>
>>
>>
>> On Tue, Mar 29, 2022 at 8:32 PM Pau Tallada  wrote:
>>
>>> I assume you have to increase container size (if using tez/yarn)
>>>
>>> Missatge de Bitfox  del dia dt., 29 de març 2022 a
>>> les 14:30:
>>>
>>>> My hive run out of memory even for a small query:
>>>>
>>>> 2022-03-29T20:26:51,440  WARN [Thread-1329] mapred.LocalJobRunner:
>>>> job_local300585280_0011
>>>>
>>>> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>>>> ~[hadoop-mapreduce-client-common-3.3.2.jar:?]
>>>>
>>>> at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
>>>> ~[hadoop-mapreduce-client-common-3.3.2.jar:?]
>>>>
>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>
>>>>
>>>>
>>>> hadoop-3.3.2
>>>>
>>>> hive-3.1.2
>>>>
>>>> java version "1.8.0_321"
>>>>
>>>>
>>>>
>>>> How to fix this? thanks.
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> --
>>> Pau Tallada Crespí
>>> Departament de Serveis
>>> Port d'Informació Científica (PIC)
>>> Tel: +34 93 170 2729
>>> --
>>>
>>>
>
> --
> --
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> --
>
>


Re: out of memory error

2022-03-29 Thread Bitfox
0: jdbc:hive2://localhost:1/default> set hive.tez.container.size=1024;

No rows affected (0.027 seconds)


0: jdbc:hive2://localhost:1/default> set hive.execution.engine;

+---+

|set|

+---+

| hive.execution.engine=mr  |

+---+

1 row selected (0.048 seconds)


0: jdbc:hive2://localhost:1/default> set mapreduce.map.memory.mb=1024;

No rows affected (0.032 seconds)

0: jdbc:hive2://localhost:1/default> set
mapreduce.map.java.opts=-Xmx1024m;

No rows affected (0.01 seconds)

0: jdbc:hive2://localhost:1/default> set
mapreduce.reduce.memory.mb=1024;

No rows affected (0.014 seconds)

0: jdbc:hive2://localhost:1/default> set
mapreduce.reduce.java.opts=-Xmx1024m;

No rows affected (0.015 seconds)


0: jdbc:hive2://localhost:1/default> select job,count(*) as dd from ppl
group by job limit 10;

Error: Error while processing statement: FAILED: Execution Error, return
code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
(state=08S01,code=2)




Sorry, my test VM has only 2 GB of RAM, so I set all of the above memory
settings to 1 GB.

But it still gets the same error.

Please help. Thanks.
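
One note that may help: the stack trace in the quoted message below shows the
job running through mapred.LocalJobRunner, i.e. the tasks execute inside the
submitting client JVM, so the mapreduce.*.java.opts settings above may not take
effect at all. A minimal sketch of raising the client-side heap instead, using
the standard hive-env.sh knobs (the 1024 MB value is only illustrative for a
2 GB VM):

# in conf/hive-env.sh
export HADOOP_HEAPSIZE=1024
# or pass the option straight to the client JVM
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"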



On Tue, Mar 29, 2022 at 8:32 PM Pau Tallada  wrote:

> I assume you have to increase container size (if using tez/yarn)
>
> Missatge de Bitfox  del dia dt., 29 de març 2022 a les
> 14:30:
>
>> My hive run out of memory even for a small query:
>>
>> 2022-03-29T20:26:51,440  WARN [Thread-1329] mapred.LocalJobRunner:
>> job_local300585280_0011
>>
>> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>>
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>> ~[hadoop-mapreduce-client-common-3.3.2.jar:?]
>>
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
>> ~[hadoop-mapreduce-client-common-3.3.2.jar:?]
>>
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>
>>
>>
>> hadoop-3.3.2
>>
>> hive-3.1.2
>>
>> java version "1.8.0_321"
>>
>>
>>
>> How to fix this? thanks.
>>
>>
>>
>>
>
> --
> --
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> --
>
>


out of memory error

2022-03-29 Thread Bitfox
My Hive runs out of memory even for a small query:

2022-03-29T20:26:51,440  WARN [Thread-1329] mapred.LocalJobRunner:
job_local300585280_0011

java.lang.Exception: java.lang.OutOfMemoryError: Java heap space

at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
~[hadoop-mapreduce-client-common-3.3.2.jar:?]

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
~[hadoop-mapreduce-client-common-3.3.2.jar:?]

Caused by: java.lang.OutOfMemoryError: Java heap space



hadoop-3.3.2

hive-3.1.2

java version "1.8.0_321"



How to fix this? Thanks.


Re: Hive 3 with tez issue

2022-03-28 Thread Bitfox
Or, is there a standard installation guide for integrating Tez with Hive 3?

Thank you.

On Mon, Mar 28, 2022 at 12:21 PM Bitfox  wrote:

> When I had this config in hive-env.sh:
>
> export
> HADOOP_CLASSPATH=/opt/tez/conf:/opt/tez/*:/opt/tez/lib/*:$HADOOP_CLASSPATH
>
>
>
> and when I started Hive I got the error:
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FsTracer.get(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/htrace/core/Tracer;
>
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:304)
>
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:289)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:172)
>
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>
> at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
>
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:624)
>
> at
> org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:591)
>
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:747)
>
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>
>
>
>
>
> But when I removed that line in hive-env.sh and started hive again, it
> just worked.
>
>
> hadoop-3.3.2
>
> hive-3.1.2
>
> tez-0.10.1
>
> java version "1.8.0_321"
>
>
> All of them were installed in a local node for development purposes.
>
>
> Please help with this issue. Thanks.
>
> Bitfox
>
>
>


Hive 3 with tez issue

2022-03-27 Thread Bitfox
When I had this config in hive-env.sh:

export
HADOOP_CLASSPATH=/opt/tez/conf:/opt/tez/*:/opt/tez/lib/*:$HADOOP_CLASSPATH



and when I started Hive I got the error:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FsTracer.get(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/htrace/core/Tracer;

at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:304)

at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:289)

at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:172)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)

at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)

at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:624)

at
org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:591)

at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:747)

at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:323)

at org.apache.hadoop.util.RunJar.main(RunJar.java:236)





But when I removed that line in hive-env.sh and started hive again, it just
worked.


hadoop-3.3.2

hive-3.1.2

tez-0.10.1

java version "1.8.0_321"


All of them were installed on a single local node for development purposes.


Please help with this issue. Thanks.

Bitfox
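
A hedged guess at the cause: the /opt/tez/lib/* wildcard can pull in the older
Hadoop (and HTrace) jars that ship inside the full Tez binary tarball, and
those would shadow the Hadoop 3.3.2 classes, which fits a NoSuchMethodError on
FsTracer.get. A quick check, and a sketch of the usual workaround (paths are
illustrative):

ls /opt/tez/lib | grep -Ei 'hadoop|htrace'

# if bundled Hadoop jars show up, either use the "minimal" Tez tarball (which
# excludes them) or move them aside before exporting HADOOP_CLASSPATH
mkdir -p /opt/tez/lib/excluded
mv /opt/tez/lib/hadoop-*.jar /opt/tez/lib/excluded/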


Question for so many SQL tools

2022-03-25 Thread Bitfox
Just a question: why are there so many SQL-based tools for data jobs?

The ones I know,

Spark
Flink
Ignite
Impala
Drill
Hive
…

They do similar jobs, IMO.
Thanks


Re: GraphX Support

2022-03-25 Thread Bitfox
BTW, is MLlib still in active development?

Thanks

On Tue, Mar 22, 2022 at 07:11 Sean Owen  wrote:

> GraphX is not active, though still there and does continue to build and
> test with each Spark release. GraphFrames kind of superseded it, but is
> also not super active FWIW.
>
> On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez 
> wrote:
>
>> Hello!
>>
>>
>>
>> My team and I are evaluating GraphX as a possible solution. Would someone
>> be able to speak to the support of this Spark feature? Is there active
>> development or is GraphX in maintenance mode (e.g. updated to ensure
>> functionality with new Spark releases)?
>>
>>
>>
>> Thanks in advance for your help!
>>
>>
>>
>> --
>>
>> Jacob H. Marquez
>>
>> He/Him
>>
>> Data & Applied Scientist
>>
>> Microsoft Cloud Data Sciences
>>
>>
>>
>


Re: Continuous ML model training in stream mode

2022-03-18 Thread Bitfox
For online recommendation systems, continuous training is needed. :)
We run a live video player where the content changes every minute, so a
real-time recommendation system is a must.
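
For reference, the streaming k-means API mentioned further down this thread
lives in the old spark.mllib package; a minimal PySpark sketch, where
trainingStream and testStream are assumed to be existing DStreams of vectors:

from pyspark.mllib.clustering import StreamingKMeans

model = StreamingKMeans(k=5, decayFactor=0.7).setRandomCenters(dim=3, weight=1.0, seed=42)
model.trainOn(trainingStream)         # centers keep updating as batches arrive
model.predictOn(testStream).pprint()  # per-batch cluster assignments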


On Fri, Mar 18, 2022 at 3:31 AM Sean Owen  wrote:

> (Thank you, not sure that was me though)
> I don't know of plans to expose the streaming impls in ML, as they still
> work fine in MLlib and they also don't come up much. Continuous training is
> relatively rare, maybe under-appreciated, but rare in practice.
>
> On Thu, Mar 17, 2022 at 1:57 PM Gourav Sengupta 
> wrote:
>
>> Dear friends,
>>
>> a few years ago, I was in a London meetup seeing Sean (Owen) demonstrate
>> how we can try to predict the gender of individuals who are responding to
>> tweets after accepting privacy agreements, in case I am not wrong.
>>
>> It was real time, it was spectacular, and it was the presentation that
>> set me into data science and its applications.
>>
>> Thanks Sean! :)
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>> On Tue, Mar 15, 2022 at 9:39 PM Artemis User 
>> wrote:
>>
>>> Thanks Sean!  Well, it looks like we have to abandon our structured
>>> streaming model to use DStream for this, or do you see possibility to use
>>> structured streaming with ml instead of mllib?
>>>
>>> On 3/15/22 4:51 PM, Sean Owen wrote:
>>>
>>> There is a streaming k-means example in Spark.
>>> https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
>>>
>>> On Tue, Mar 15, 2022, 3:46 PM Artemis User 
>>> wrote:
>>>
 Has anyone done any experiments of training an ML model using stream
 data? especially for unsupervised models?   Any suggestions/references
 are highly appreciated...

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>


Re: Continuous ML model training in stream mode

2022-03-18 Thread Bitfox
We keep training with the input content from a stream, but the framework is
TensorFlow, not Spark.

On Wed, Mar 16, 2022 at 4:46 AM Artemis User  wrote:

> Has anyone done any experiments of training an ML model using stream
> data? especially for unsupervised models?   Any suggestions/references
> are highly appreciated...
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Does Apache Kafka support IPv6/IPv4 or IPv6-only networks?

2022-03-17 Thread Bitfox
From my experience, it supports both.


On Thu, Mar 17, 2022 at 10:18 PM 5  wrote:

> Hi, everyone, does Apache Kafka support IPv6/IPv4 or IPv6-only networks?


Play data development with Scala and Spark

2022-03-16 Thread Bitfox
Hello,

I have written a free book, available online, giving a beginner's
introduction to Scala and Spark development.

https://github.com/bitfoxtop/Play-Data-Development-with-Scala-and-Spark/blob/main/PDDWS2-v1.pdf

If you can read Chinese then you are welcome to give any feedback. I will
update the content in my free time.

Thank you.


Re: Question on List to DF

2022-03-16 Thread Bitfox
Thank you, that makes sense.

On Wed, Mar 16, 2022 at 2:03 PM Lalwani, Jayesh  wrote:

> The toDF function in scala uses a bit of Scala magic that allows you to
> add methods to existing classes. Here’s a link to explanation
> https://www.oreilly.com/library/view/scala-cookbook/9781449340292/ch01s11.html
>
>
>
> In short, you can implement a class that extends the List class and add
> methods to your  list class, and you can implement an implicit converter
> that converts from List to your class. When the Scala compiler sees that
> you are calling a function on a List object that doesn’t exist in the List
> class, it will look for implicit converters that convert List object to
> another object that has the function, and will automatically call it.
>
> So, if you have a class
>
> class MyList(list: List[String]) {
>   def toDF(colName: String): DataFrame = {
>     …..
>   }
> }
>
> and an implicit converter
> implicit def convertListToMyList(list: List[String]): MyList = {
>   ….
> }
>
> when you do
> List("apple","orange","cherry").toDF("fruit")
>
>
>
> Internally, Scala will generate the code as
> convertListToMyList(List("apple","orange","cherry")).toDF("fruit")
>
>
>
>
>
> *From: *Bitfox 
> *Date: *Wednesday, March 16, 2022 at 12:06 AM
> *To: *"user @spark" 
> *Subject: *[EXTERNAL] Question on List to DF
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> I am wondering why the list in scala spark can be converted into a
> dataframe directly?
>
>
>
> scala> val df = List("apple","orange","cherry").toDF("fruit")
>
> *df*: *org.apache.spark.sql.DataFrame* = [fruit: string]
>
>
>
> scala> df.show
>
> +--+
>
> | fruit|
>
> +--+
>
> | apple|
>
> |orange|
>
> |cherry|
>
> +--+
>
>
>
> I don't think pyspark can convert that as well.
>
>
>
> Thank you.
>


Question on List to DF

2022-03-15 Thread Bitfox
I am wondering why a List in Scala Spark can be converted into a DataFrame
directly?

scala> val df = List("apple","orange","cherry").toDF("fruit")

*df*: *org.apache.spark.sql.DataFrame* = [fruit: string]


scala> df.show

+--+

| fruit|

+--+

| apple|

|orange|

|cherry|

+--+


I don't think PySpark can do that.


Thank you.
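
For comparison, PySpark has no implicit toDF on a plain Python list, but
createDataFrame covers the same need in one line (a minimal sketch, assuming
an existing SparkSession named spark):

df = spark.createDataFrame([("apple",), ("orange",), ("cherry",)], ["fruit"])
df.show()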


Re: Unsubscribe

2022-03-11 Thread Bitfox
Please send an empty email to:
user-unsubscr...@spark.apache.org
to unsubscribe yourself from the list.


On Sat, Mar 12, 2022 at 2:42 PM Aziret Satybaldiev <
satybaldiev.azi...@gmail.com> wrote:

>


insufficient memory

2022-03-10 Thread Bitfox
Hello

My VM has only 4 GB of memory, with 2 GB free for use.
When I run drill-embedded I get the error:

OpenJDK 64-Bit Server VM warning: INFO:
os::commit_memory(0x0007, 4294967296, 0) failed; error='Not
enough space' (errno=12)

#

# There is insufficient memory for the Java Runtime Environment to continue.

# Native memory allocation (mmap) failed to map 4294967296 bytes for
committing reserved memory.



How can I run it successfully?


Thanks
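
If it helps: the 4294967296-byte (4 GB) request looks like Drill's default JVM
sizing rather than anything query-related. A sketch of shrinking it in
conf/drill-env.sh; the variable names below are the usual ones but please
verify them against your drill-env.sh, and the 1G values are only illustrative:

export DRILL_HEAP="1G"
export DRILL_MAX_DIRECT_MEMORY="1G"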


Hive with tez engine gets the error

2022-03-10 Thread Bitfox
Hive with the Tez engine can't run. Errors:


0: jdbc:hive2://localhost:1/default> select * from people;

Error: java.io.IOException: java.io.IOException:
com.google.protobuf.ServiceException: java.lang.NoSuchFieldError: PARSER
(state=,code=0)



Apache Hive (version 2.3.9)

Hadoop 3.3.1

Tez: I have tried both 0.9 and 0.10


$ protoc --version

libprotoc 2.5.0


Can you help?

Thanks.
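
A hedged observation: NoSuchFieldError: PARSER is usually a protobuf version
clash, i.e. a class generated against one protobuf-java release is being loaded
next to a different runtime jar, which fits the Hive 2.x vs Hadoop 3.x mismatch
suggested later in this thread. A quick check of what is actually on the
classpath (paths are illustrative):

find $HIVE_HOME/lib $HADOOP_HOME/share $TEZ_HOME -name 'protobuf-java-*.jar'
# more than one version in the result usually explains the error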


Re: Hive 3 and Java 11 issue

2022-03-10 Thread Bitfox
That sounds bad. All our apps are running on JDK 11.

On Thu, Mar 10, 2022 at 5:06 PM Pau Tallada  wrote:

> I think only JDK8 is supported yet
>
> Missatge de Bitfox  del dia dj., 10 de març 2022 a les
> 2:39:
>
>> my java version:
>>
>> openjdk version "11.0.13" 2021-10-19
>>
>>
>> I can't run hive 3.1.2.
>>
>> The error include:
>>
>>
>> Exception in thread "main" java.lang.ClassCastException: class
>> jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class
>> java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader
>> and java.net.URLClassLoader are in module java.base of loader 'bootstrap')
>>
>>
>> So I am asking Hive 3 doesn't support java 11 yet?
>>
>>
>> Thanks.
>>
>
>
> --
> --
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> --
>
>


Hive 3 and Java 11 issue

2022-03-09 Thread Bitfox
my java version:

openjdk version "11.0.13" 2021-10-19


I can't run hive 3.1.2.

The error includes:


Exception in thread "main" java.lang.ClassCastException: class
jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class
java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader
and java.net.URLClassLoader are in module java.base of loader 'bootstrap')


So I am asking: does Hive 3 not support Java 11 yet?


Thanks.


Re: protobuf.ServiceException

2022-03-09 Thread Bitfox
Does this URL work for you?
https://hive.apache.org/

The regular Hive homepage disappears when I try to download Hive 3.

Thanks

On Thu, Mar 10, 2022 at 7:05 AM Aaron Grubb  wrote:

> My understanding is Hive 2.x is for Hadoop 2.x, while Hive 3.x is for
> Hadoop 3.x, so I would guess that's where your problem lies.
>
> On Thu, 2022-03-10 at 06:57 +0800, Bitfox wrote:
>
> Hello
>
> In beeline I am getting the error:
>
> 0: jdbc:hive2://localhost:1/default> select * from people;
>
> Error: java.io.IOException: java.io.IOException:
> com.google.protobuf.ServiceException: java.lang.NoSuchFieldError: PARSER
> (state=,code=0)
>
>
>
>  Apache Hive (version 2.3.9)
>
> Hadoop 3.3.1
>
>
> $ protoc --version
>
> libprotoc 2.5.0
>
>
>
> I can't catch what the points are.
>
> Can you help? Thanks.
>
>
>
>


protobuf.ServiceException

2022-03-09 Thread Bitfox
Hello

In beeline I am getting the error:

0: jdbc:hive2://localhost:1/default> select * from people;

Error: java.io.IOException: java.io.IOException:
com.google.protobuf.ServiceException: java.lang.NoSuchFieldError: PARSER
(state=,code=0)



 Apache Hive (version 2.3.9)

Hadoop 3.3.1


$ protoc --version

libprotoc 2.5.0



I can't figure out what the problem is.

Can you help? Thanks.


Re: question about a beeline variable

2022-02-27 Thread Bitfox
I got it: it's the null value in Hive.

0: jdbc:hive2://localhost:1/default> select size(null);

+--+

| _c0  |

+--+

| -1   |

+--+


Thanks

On Sun, Feb 27, 2022 at 4:02 PM Bitfox  wrote:

> what does this -1 value mean?
>
> > set mapred.reduce.tasks;
>
> +-+
>
> |   set   |
>
> +-+
>
> | mapred.reduce.tasks=-1  |
>
> +-+
>
> 1 row selected (0.014 seconds)
>


question about a beeline variable

2022-02-27 Thread Bitfox
what does this -1 value mean?

> set mapred.reduce.tasks;

+-+

|   set   |

+-+

| mapred.reduce.tasks=-1  |

+-+

1 row selected (0.014 seconds)


Re: Issue while creating spark app

2022-02-26 Thread Bitfox
Java SDK installed?

On Sun, Feb 27, 2022 at 5:39 AM Sachit Murarka 
wrote:

> Hello ,
>
> Thanks for replying. I have installed Scala plugin in IntelliJ  first then
> also it's giving same error
>
> Cannot find project Scala library 2.12.12 for module SparkSimpleApp
>
> Thanks
> Rajat
>
> On Sun, Feb 27, 2022, 00:52 Bitfox  wrote:
>
>> You need to install scala first, the current version for spark is 2.12.15
>> I would suggest you install scala by sdk which works great.
>>
>> Thanks
>>
>> On Sun, Feb 27, 2022 at 12:10 AM rajat kumar 
>> wrote:
>>
>>> Hello Users,
>>>
>>> I am trying to create spark application using Scala(Intellij).
>>> I have installed Scala plugin in intelliJ still getting below error:-
>>>
>>> Cannot find project Scala library 2.12.12 for module SparkSimpleApp
>>>
>>>
>>> Could anyone please help what I am doing wrong?
>>>
>>> Thanks
>>>
>>> Rajat
>>>
>>


Re: Issue while creating spark app

2022-02-26 Thread Bitfox
You need to install Scala first; the current version for Spark is 2.12.15.
I would suggest you install Scala with sdk, which works great.

Thanks

On Sun, Feb 27, 2022 at 12:10 AM rajat kumar 
wrote:

> Hello Users,
>
> I am trying to create spark application using Scala(Intellij).
> I have installed Scala plugin in intelliJ still getting below error:-
>
> Cannot find project Scala library 2.12.12 for module SparkSimpleApp
>
>
> Could anyone please help what I am doing wrong?
>
> Thanks
>
> Rajat
>


Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Bitfox
I have been using TensorFlow for a long time; it's not hard to implement a
distributed training job at all, either by model parallelism or data
parallelism. I don't think there is much need to develop Spark to support
TensorFlow jobs. Just my thoughts...
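
For what it's worth, the parameter-server wiring in TF 2.x is mostly just
cluster metadata handed to each process. A minimal sketch of the TF_CONFIG one
worker would carry (hostnames, ports and counts are illustrative):

import json, os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief":  ["chief0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
        "ps":     ["ps0:2222"],
    },
    "task": {"type": "worker", "index": 0},  # each process gets its own type/index
})
# tf.distribute.experimental.ParameterServerStrategy can then pick this up via
# a TFConfigClusterResolver on the coordinator.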


On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta 
wrote:

> Hi,
>
> I do not think that there is any reason for using over engineered
> platforms like Petastorm and Ray, except for certain use cases.
>
> What Ray is doing, except for certain use cases, could have been easily
> done by SPARK, I think, had the open source community got that steer. But
> maybe I am wrong and someone should be able to explain why the SPARK open
> source community cannot develop the capabilities which are so natural to
> almost all use cases of data processing in SPARK where the data gets
> consumed by deep learning frameworks and we are asked to use Ray or
> Petastorm?
>
> For those of us who are asking what does native integrations means please
> try to compare delta between release 2.x and 3.x and koalas before 3.2 and
> after 3.2.
>
> I am sure that the SPARK community can push for extending the dataframes
> from SPARK to deep learning and other frameworks by natively integrating
> them.
>
>
> Regards,
> Gourav Sengupta
>
>
> On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari 
> wrote:
>
>> Currently we are trying AnalyticsZoo and Ray
>>
>>
>> Von meinem iPhone gesendet
>>
>> Am 23.02.2022 um 04:53 schrieb Bitfox :
>>
>> 
>> tensorflow itself can implement the distributed computing via a
>> parameter server. Why did you want spark here?
>>
>> regards.
>>
>> On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
>>  wrote:
>>
>>> Thanks Sean for your response. !!
>>>
>>>
>>>
>>> Want to add some more background here.
>>>
>>>
>>>
>>> I am using Spark3.0+ version with Tensorflow 2.0+.
>>>
>>> My use case is not for the image data but for the Time-series data where
>>> I am using LSTM and transformers to forecast.
>>>
>>>
>>>
>>> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
>>> there has been no major development recently on those libraries. I faced
>>> the issue of version dependencies on those and had a hard time fixing the
>>> library compatibilities. Hence a couple of below doubts:-
>>>
>>>
>>>
>>>- Does *Horovod* have any dependencies?
>>>- Any other library which is suitable for my use case.?
>>>- Any example code would really be of great help to understand.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vijayant
>>>
>>>
>>>
>>> *From:* Sean Owen 
>>> *Sent:* Wednesday, February 23, 2022 8:40 AM
>>> *To:* Vijayant Kumar 
>>> *Cc:* user @spark 
>>> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>>>
>>>
>>>
>>> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware of
>>> Phishing Scams, Report questionable emails to s...@mavenir.com
>>>
>>> Sure, Horovod is commonly used on Spark for this:
>>>
>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>
>>>
>>>
>>> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
>>> vijayant.ku...@mavenir.com.invalid> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> Anyone using Apache spark with TensorFlow for building models. My
>>> requirement is to use TensorFlow distributed model training across the
>>> Spark executors.
>>>
>>> Please help me with some resources or some sample code.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vijayant
>>> --
>>>
>>> This e-mail message may contain confidential or proprietary information
>>> of Mavenir Systems, Inc. or its affiliates and is intended solely for the
>>> use of the intended recipient(s). If you are not the intended recipient of
>>> this message, you are hereby notified that any review, use or distribution
>>> of this information is absolutely prohibited and we request that you delete
>>> all copies in your control and contact us by e-mailing to
>>> secur...@mavenir.com. This message contains the views of its author and
>>> may not necessarily reflect the views of Mavenir Systems, Inc. or its
>>> affiliates, who employ syst

Re: help with beeline connection to hive

2022-02-23 Thread Bitfox
I resolved the issue this way:
https://github.com/chuqbach/Big-Data-Installation/issues/2

Thanks for your help.

Regards


On Wed, Feb 23, 2022 at 5:49 PM Mich Talebzadeh 
wrote:

> and check that beeline thrift server is indeed running (mine runs on port
> 10099)
>
> netstat -plten|grep 10099
> tcp0  0 0.0.0.0:10099   0.0.0.0:*
>  LISTEN  1
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Feb 2022 at 09:42, Mich Talebzadeh 
> wrote:
>
>> beeline -u jdbc:hive2://localhost:1/default
>> org.apache.hive.jdbc.HiveDriver -n > is running> -p 
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 09:35, Aaron Grubb  wrote:
>>
>>> Try username "root" and empty password. That works for me on 3.1.2.
>>>
>>> On Wed, 2022-02-23 at 10:16 +0800, Bitfox wrote:
>>>
>>> Hello
>>>
>>> I have hive 2.3.9 installed by default on localhost for testing.
>>> HDFS is also installed on localhost, which works correctly b/c I have
>>> already used the file storage feature.
>>>
>>> I didn't change any configure files for hive.
>>>
>>> I can login into hive shell:
>>>
>>> hive> show databases;
>>>
>>> OK
>>>
>>> default
>>>
>>> Time taken: 4.458 seconds, Fetched: 1 row(s)
>>>
>>>
>>>
>>> After started hiveserver2 which works online, I can't connect via
>>> beeline:
>>>
>>>
>>> beeline> !connect jdbc:hive2://localhost:1/default
>>>
>>> Connecting to jdbc:hive2://localhost:1/default
>>>
>>> Enter username for jdbc:hive2://localhost:1/default: APP
>>>
>>> Enter password for jdbc:hive2://localhost:1/default: 
>>>
>>> 22/02/23 10:11:41 [main]: WARN jdbc.HiveConnection: Failed to connect to
>>> localhost:1
>>>
>>> Could not open connection to the HS2 server. Please check the server URI
>>> and if the URI is correct, then ask the administrator to check the server
>>> status.
>>>
>>> Error: Could not open client transport with JDBC Uri:
>>> jdbc:hive2://localhost:1/default: java.net.ConnectException: Connection
>>> refused (Connection refused) (state=08S01,code=0)
>>>
>>>
>>>
>>> It prompts me to input username and password.
>>>
>>> I have tried both empty user/pass, and user "APP"/pass "mine".
>>>
>>> Neither of them will work.
>>>
>>>
>>>
>>> How can I fix this and connect to Hive correctly via beeline?
>>>
>>>
>>> Sorry I am the newbie to Hive.
>>>
>>>
>>> Thanks a lot.
>>>
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bitfox
From my viewpoint, if there were such a pay-as-you-go service I would like to
use it. Otherwise I have to deploy a regular Spark cluster on GCP/AWS etc.,
and the cost is not low.

Thanks.

On Wed, Feb 23, 2022 at 4:00 PM bo yang  wrote:

> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernnetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
How can I specify the cluster memory and cores?
For instance, I want to run a job with 16 cores and 300 GB of memory for
about 1 hour. Do you have a SaaS solution for this? I can pay as I go.

Thanks
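
For context, on Spark-on-Kubernetes the sizing is normally plain spark-submit
configuration rather than something the deployment tool decides. A sketch of
roughly 16 cores / 300 GB split across four executors; the API-server URL,
image name and exact numbers are placeholders:

spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=70g \
  --conf spark.driver.memory=8g \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar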

On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:

> It is not a standalone spark cluster. In some details, it deploys a Spark
> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
> and an extra REST Service. When people submit Spark application to that
> REST Service, the REST Service will create a CRD inside the
> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
> Spark application. The one click tool intends to hide these details, so
> people could just submit Spark and do not need to deal with too many
> deployment details.
>
> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>
>> Can it be a cluster installation of spark? or just the standalone node?
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
Can it be a cluster installation of Spark, or just a standalone node?

Thanks

On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:

> Hi Spark Community,
>
> We built an open source tool to deploy and run Spark on Kubernetes with a
> one click command. For example, on AWS, it could automatically create an
> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
> be able to use curl or a CLI tool to submit Spark application. After the
> deployment, you could also install Uber Remote Shuffle Service to enable
> Dynamic Allocation on Kuberentes.
>
> Anyone interested in using or working together on such a tool?
>
> Thanks,
> Bo
>
>


Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-22 Thread Bitfox
TensorFlow itself can implement distributed computing via a parameter server.
Why do you want Spark here?

regards.

On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
 wrote:

> Thanks Sean for your response. !!
>
>
>
> Want to add some more background here.
>
>
>
> I am using Spark3.0+ version with Tensorflow 2.0+.
>
> My use case is not for the image data but for the Time-series data where I
> am using LSTM and transformers to forecast.
>
>
>
> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
> there has been no major development recently on those libraries. I faced
> the issue of version dependencies on those and had a hard time fixing the
> library compatibilities. Hence a couple of below doubts:-
>
>
>
>- Does *Horovod* have any dependencies?
>- Any other library which is suitable for my use case.?
>- Any example code would really be of great help to understand.
>
>
>
> Thanks,
>
> Vijayant
>
>
>
> *From:* Sean Owen 
> *Sent:* Wednesday, February 23, 2022 8:40 AM
> *To:* Vijayant Kumar 
> *Cc:* user @spark 
> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>
>
>
> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware of
> Phishing Scams, Report questionable emails to s...@mavenir.com
>
> Sure, Horovod is commonly used on Spark for this:
>
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
>
>
> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
> vijayant.ku...@mavenir.com.invalid> wrote:
>
> Hi All,
>
>
>
> Anyone using Apache spark with TensorFlow for building models. My
> requirement is to use TensorFlow distributed model training across the
> Spark executors.
>
> Please help me with some resources or some sample code.
>
>
>
> Thanks,
>
> Vijayant
> --
>
> This e-mail message may contain confidential or proprietary information of
> Mavenir Systems, Inc. or its affiliates and is intended solely for the use
> of the intended recipient(s). If you are not the intended recipient of this
> message, you are hereby notified that any review, use or distribution of
> this information is absolutely prohibited and we request that you delete
> all copies in your control and contact us by e-mailing to
> secur...@mavenir.com. This message contains the views of its author and
> may not necessarily reflect the views of Mavenir Systems, Inc. or its
> affiliates, who employ systems to monitor email messages, but make no
> representation that such messages are authorized, secure, uncompromised, or
> free from computer viruses, malware, or other defects. Thank You
>
> --
>
> This e-mail message may contain confidential or proprietary information of
> Mavenir Systems, Inc. or its affiliates and is intended solely for the use
> of the intended recipient(s). If you are not the intended recipient of this
> message, you are hereby notified that any review, use or distribution of
> this information is absolutely prohibited and we request that you delete
> all copies in your control and contact us by e-mailing to
> secur...@mavenir.com. This message contains the views of its author and
> may not necessarily reflect the views of Mavenir Systems, Inc. or its
> affiliates, who employ systems to monitor email messages, but make no
> representation that such messages are authorized, secure, uncompromised, or
> free from computer viruses, malware, or other defects. Thank You
>


help with beeline connection to hive

2022-02-22 Thread Bitfox
Hello

I have Hive 2.3.9 installed with the default configuration on localhost for
testing. HDFS is also installed on localhost and works correctly, because I
have already used the file storage feature.

I didn't change any configuration files for Hive.

I can login into hive shell:

hive> show databases;

OK

default

Time taken: 4.458 seconds, Fetched: 1 row(s)



After starting hiveserver2, which is up and running, I can't connect via beeline:


beeline> !connect jdbc:hive2://localhost:1/default

Connecting to jdbc:hive2://localhost:1/default

Enter username for jdbc:hive2://localhost:1/default: APP

Enter password for jdbc:hive2://localhost:1/default: 

22/02/23 10:11:41 [main]: WARN jdbc.HiveConnection: Failed to connect to
localhost:1

Could not open connection to the HS2 server. Please check the server URI
and if the URI is correct, then ask the administrator to check the server
status.

Error: Could not open client transport with JDBC Uri:
jdbc:hive2://localhost:1/default: java.net.ConnectException: Connection
refused (Connection refused) (state=08S01,code=0)



It prompts me to input username and password.

I have tried both empty user/pass, and user "APP"/pass "mine".

Neither of them will work.



How can I fix this and connect to Hive correctly via beeline?


Sorry, I am a newbie to Hive.


Thanks a lot.


Re: Unsubscribe

2022-02-09 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.

On Thu, Feb 10, 2022 at 1:38 AM Yogitha Ramanathan 
wrote:

>


Re: Help With unstructured text file with spark scala

2022-02-09 Thread Bitfox
Hi

I am not sure about the whole situation, but if you want to do it in Scala,
I think you could use regexes to match and capture the keywords.
Here is one you can modify on your end.

import scala.io.Source
import scala.collection.mutable.ArrayBuffer

val list1 = ArrayBuffer[(String, String, String)]()
val list2 = ArrayBuffer[(String, String)]()

val patt1 = """^(.*)#(.*)#([^#]*)$""".r
val patt2 = """^(.*)#([^#]*)$""".r

val file = "1.txt"
val lines = Source.fromFile(file).getLines()

for (x <- lines) {
  x match {
    case patt1(k, v, z) => list1 += ((k, v, z))
    case patt2(k, v)    => list2 += ((k, v))
    case _              => println("no match")
  }
}



Now list1 and list2 hold the elements you wanted; you can convert them to a
DataFrame easily.


Thanks.

On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa 
wrote:

> Hello
>
>
> Yes, this block I can open as CSV with a # delimiter, but there is a block
> that is not in CSV format.
>
> That one is more like key/value.
>
> We have two different layouts in the same file. This is the “problem”.
>
> Thanks for your time.
>
>
>
> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>
>
> On 9 Feb 2022, at 00:58, Bitfox  wrote:
>
> Hello
>
> You can treat it as a csf file and load it from spark:
>
> >>> df = spark.read.format("csv").option("inferSchema",
> "true").option("header", "true").option("sep","#").load(csv_file)
> >>> df.show()
> ++---+-+
> |   Plano|Código Beneficiário|Nome Beneficiário|
> ++---+-+
> |58693 - NACIONAL ...|   65751353|   Jose Silva|
> |58693 - NACIONAL ...|   65751388|  Joana Silva|
> |58693 - NACIONAL ...|   65751353| Felipe Silva|
> |58693 - NACIONAL ...|   65751388|  Julia Silva|
> ++---+-+
>
>
> cat csv_file:
>
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>
>
> Regards
>
>
> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
> wrote:
>
>> Hi
>> I have to transform unstructured text to dataframe.
>> Could anyone please help with Scala code ?
>>
>> Dataframe need as:
>>
>> operadora filial unidade contrato empresa plano codigo_beneficiario
>> nome_beneficiario
>>
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>
>> Contrato#898011000 - FUNDACAO GERDAU
>> Empresa#FUNDACAO GERDAU
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Help With unstructured text file with spark scala

2022-02-08 Thread Bitfox
Hello

You can treat it as a CSV file and load it from Spark:

>>> df = spark.read.format("csv").option("inferSchema",
"true").option("header", "true").option("sep","#").load(csv_file)

>>> df.show()

++---+-+

|   Plano|Código Beneficiário|Nome Beneficiário|

++---+-+

|58693 - NACIONAL ...|   65751353|   Jose Silva|

|58693 - NACIONAL ...|   65751388|  Joana Silva|

|58693 - NACIONAL ...|   65751353| Felipe Silva|

|58693 - NACIONAL ...|   65751388|  Julia Silva|

++---+-+



cat csv_file:


Plano#Código Beneficiário#Nome Beneficiário

58693 - NACIONAL R COPART PJCE#065751353#Jose Silva

58693 - NACIONAL R COPART PJCE#065751388#Joana Silva

58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva

58693 - NACIONAL R COPART PJCE#065751388#Julia Silva



Regards



On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
wrote:

> Hi
> I have to transform unstructured text to dataframe.
> Could anyone please help with Scala code ?
>
> Dataframe need as:
>
> operadora filial unidade contrato empresa plano codigo_beneficiario
> nome_beneficiario
>
> Relação de Beneficiários Ativos e Excluídos
> Carteira em#27/12/2019##Todos os Beneficiários
> Operadora#AMIL
> Filial#SÃO PAULO#Unidade#Guarulhos
>
> Contrato#123456 - Test
> Empresa#Test
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>
> Contrato#898011000 - FUNDACAO GERDAU
> Empresa#FUNDACAO GERDAU
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: add an auto_increment column

2022-02-08 Thread Bitfox
Maybe the col function is not even needed here. :)

>>> df.select(F.dense_rank().over(wOrder).alias("rank"),
"fruit","amount").show()

++--+--+

|rank| fruit|amount|

++--+--+

|   1|cherry| 5|

|   2| apple| 3|

|   2|tomato| 3|

|   3|orange| 2|

++--+--+
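
If the goal is a real auto-increment id rather than a rank, two common PySpark
options, sketched with illustrative column names: monotonically_increasing_id
gives unique but non-contiguous ids cheaply, while row_number over a window
gives contiguous 1..N ids at the cost of a global sort.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1 = df.withColumn("id", F.monotonically_increasing_id())  # unique, gaps allowed

w = Window.orderBy(F.col("amount").desc())
df2 = df.withColumn("id", F.row_number().over(w))           # contiguous 1..N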




On Tue, Feb 8, 2022 at 3:50 PM Mich Talebzadeh 
wrote:

> simple either rank() or desnse_rank()
>
> >>> from pyspark.sql import functions as F
> >>> from pyspark.sql.functions import col
> >>> from pyspark.sql.window import Window
> >>> wOrder = Window().orderBy(df['amount'].desc())
> >>> df.select(F.rank().over(wOrder).alias("rank"), col('fruit'),
> col('amount')).show()
> ++--+--+
> |rank| fruit|amount|
> ++--+--+
> |   1|cherry| 5|
> |   2| apple| 3|
> |   2|tomato| 3|
> |   4|orange| 2|
> ++--+--+
>
> >>> df.select(F.dense_rank().over(wOrder).alias("rank"), col('fruit'),
> col('amount')).show()
> ++--+--+
> |rank| fruit|amount|
> ++--+--+
> |   1|cherry| 5|
> |   2| apple| 3|
> |   2|tomato| 3|
> |   3|orange| 2|
> ++--+--+
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 7 Feb 2022 at 01:27,  wrote:
>
>> For a dataframe object, how to add a column who is auto_increment like
>> mysql's behavior?
>>
>> Thank you.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


foreachRDD question

2022-02-07 Thread Bitfox
Hello list,

for the code in the link:
https://github.com/apache/spark/blob/v3.2.1/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

I am not sure why the RDD-to-DataFrame logic is enclosed in a foreachRDD
block. What's the use of foreachRDD?


Thanks in advance.
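
In case it helps: foreachRDD is the DStream output operation that hands each
micro-batch to a driver-side callback as a plain RDD, and that callback is
where per-batch DataFrame/SQL logic can run, which is why the example wraps
the conversion there. A minimal sketch, assuming an existing SparkSession
named spark and a DStream of words named lines:

def process(time, rdd):
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda w: (w,)), ["word"])
        df.createOrReplaceTempView("words")
        spark.sql("SELECT word, count(*) AS total FROM words GROUP BY word").show()

lines.foreachRDD(process)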


Re: Unsubscribe

2022-02-05 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.

On Sun, Feb 6, 2022 at 2:21 PM Rishi Raj Tandon 
wrote:

> Unsubscribe
>


Re: Python performance

2022-02-04 Thread Bitfox
Please see this test of mine:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

Don't use Python RDDs; use DataFrames instead.

Regards
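
To make the point concrete, the same word count two ways; the DataFrame version
stays inside the JVM and Catalyst, while the RDD version ships every record
through Python lambdas (a sketch, assuming an existing SparkSession named spark
and a file words.txt with one word per line):

spark.read.text("words.txt").groupBy("value").count().show()  # DataFrame path

sc = spark.sparkContext
(sc.textFile("words.txt")
   .map(lambda w: (w, 1))
   .reduceByKey(lambda a, b: a + b)
   .take(10))                                                  # Python RDD path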

On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar 
wrote:

> I'm looking into using Python interface with Spark and came across this
> [1] chart showing some performance hit when going with Python RDD. Data is
> ~ 7 years and for older version of Spark. Is this still the case with more
> recent Spark releases?
>
> I'm trying to understand what to expect from Python and Spark and under
> what conditions.
>
> [1]
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> Thanks,
> //hinko
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re:

2022-01-31 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.


On Mon, Jan 31, 2022 at 10:11 PM  wrote:

> unsubscribe
>
>
>


Re:

2022-01-31 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.


On Mon, Jan 31, 2022 at 10:23 PM Gaetano Fabiano 
wrote:

> Unsubscribe
>
> Inviato da iPhone
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unsubscribe

2022-01-31 Thread Bitfox
The signature in your message shows how to unsubscribe.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

On Mon, Jan 31, 2022 at 7:53 PM Lucas Schroeder Rossi 
wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Bitfox
Hi

In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't. Why?

Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov 
wrote:

> Your Scala program does not use any Spark API, hence it is faster than the
> others. If you write the same code in pure Python I think it will be even
> faster than the Scala program, especially taking into account that these 2
> programs run on a single VM.
>
> Regarding DataFrame and RDD, I would suggest using DataFrames anyway since
> that has been the recommended approach since Spark 2.0.
> The RDD API in PySpark is slow because, as others said, data needs to be
> serialised/deserialised.
>
> One general note is that Spark is written in Scala and its core runs on the
> JVM; Python is a wrapper around the Scala API, and most PySpark APIs are
> delegated to Scala/JVM for execution. Hence most big data transformation
> tasks will complete in almost the same time, as they (Scala and Python) use
> the same API under the hood. Therefore you can also observe that the APIs
> are very similar and code is written in the same fashion.
>
>
> On Sun, 30 Jan 2022, 10:10 Bitfox,  wrote:
>
>> Hello list,
>>
>> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
>> pure scala program. The result shows the pyspark RDD is too slow.
>>
>> For the operations and dataset please see:
>>
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>>
>> program             time
>> scala program       49s
>> pyspark dataframe   56s
>> scala RDD           1m31s
>> pyspark RDD         7m15s
>>
>
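
A small self-contained sketch of the serialisation point above (toy data, not
the blog's dataset): the RDD version ships every record to Python worker
processes to run the lambdas, while the DataFrame version uses built-in
expressions that execute entirely on the JVM.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
lines = ["a b a", "b c", "a"] * 100000

# RDD word count: Python lambdas, so records are pickled to Python and back.
rdd_counts = (sc.parallelize(lines)
                .flatMap(lambda line: line.split(" "))
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
print(rdd_counts.take(3))

# DataFrame word count: Catalyst expressions, rows never leave the JVM.
df_counts = (spark.createDataFrame([(l,) for l in lines], ["line"])
                  .select(F.explode(F.split("line", " ")).alias("word"))
                  .groupBy("word").count())
df_counts.show(3)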


Re: [ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Bitfox
What’s the difference between Spark and Kyuubi?

Thanks

On Mon, Jan 31, 2022 at 2:45 PM Vino Yang  wrote:

> Hi all,
>
> The Apache Kyuubi (Incubating) community is pleased to announce that
> Apache Kyuubi (Incubating) 1.4.1-incubating has been released!
>
> Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for
> large-scale data processing and analytics, built on top of Apache Spark
> and designed to support more engines (i.e. Apache Flink).
>
> Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
> for end-users to manipulate large-scale data with pre-programmed and
> extensible Spark SQL engines.
>
> We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
> and data lakes.
>
> This "out-of-the-box" model minimizes the barriers and costs for end-users
> to use Spark at the client side.
>
> At the server-side, Kyuubi server and engine's multi-tenant architecture
> provides the administrators a way to achieve computing resource isolation,
> data security, high availability, high client concurrency, etc.
>
> The full release notes and download links are available at:
> Release Notes: https://kyuubi.apache.org/release/1.4.1-incubating.html
>
> To learn more about Apache Kyuubi (Incubating), please see
> https://kyuubi.apache.org/
>
> Kyuubi Resources:
> - Issue: https://github.com/apache/incubator-kyuubi/issues
> - Mailing list: d...@kyuubi.apache.org
>
> We would like to thank all contributors of the Kyuubi community and
> Incubating
> community who made this release possible!
>
> Thanks,
> On behalf of Apache Kyuubi (Incubating) community
>


Re: unsubscribe

2022-01-30 Thread Bitfox
The signature in your mail shows the info:

To unsubscribe e-mail: user-unsubscr...@spark.apache.org



On Sun, Jan 30, 2022 at 8:50 PM Lucas Schroeder Rossi 
wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


why the pyspark RDD API is so slow?

2022-01-30 Thread Bitfox
Hello list,

I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a pure
scala program. The result shows the pyspark RDD is too slow.

For the operations and dataset please see:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/

The result table is below.
Can you give suggestions on how to optimize the RDD operation?

Thanks a lot.


program             time
scala program       49s
pyspark dataframe   56s
scala RDD           1m31s
pyspark RDD         7m15s


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Bitfox
Is there a guide for upgrading from 3.2.0 to 3.2.1?

thanks

On Sat, Jan 29, 2022 at 9:14 AM huaxin gao  wrote:

> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


Re: [ANNOUNCE] Apache Kafka 3.1.0

2022-01-24 Thread Bitfox
Must Spark 3, Kafka 3, Scala 3, and Python 3 all work together if my project
uses these stacks?

Thanks

On Tue, Jan 25, 2022 at 1:04 AM David Jacot  wrote:

> The Apache Kafka community is pleased to announce the release for
> Apache Kafka 3.1.0.
>
> It is a major release that includes many new features, including:
>
> * Apache Kafka supports Java 17
> * The FetchRequest supports Topic IDs (KIP-516)
> * Extend SASL/OAUTHBEARER with support for OIDC (KIP-768)
> * Add broker count metrics (KIP-748)
> * Differentiate consistently metric latency measured in millis and
> nanos (KIP-773)
> * The eager rebalance protocol is deprecated (KAFKA-13439)
> * Add TaskId field to StreamsException (KIP-783)
> * Custom partitioners in foreign-key joins (KIP-775)
> * Fetch/findSessions queries with open endpoints for
> SessionStore/WindowStore (KIP-766)
> * Range queries with open endpoints (KIP-763)
> * Add total blocked time metric to Streams (KIP-761)
> * Add additional configuration to control MirrorMaker2 internal topics
> naming convention (KIP-690)
>
> You may read a more detailed list of features in the 3.1.0 blog post:
> https://blogs.apache.org/kafka/
>
> All of the changes in this release can be found in the release notes:
> https://www.apache.org/dist/kafka/3.1.0/RELEASE_NOTES.html
>
> You can download the source and binary release (Scala 2.12 and 2.13) from:
> https://kafka.apache.org/downloads#3.1.0
>
>
> ---
>
>
> Apache Kafka is a distributed streaming platform with four core APIs:
>
> ** The Producer API allows an application to publish a stream of records to
> one or more Kafka topics.
>
> ** The Consumer API allows an application to subscribe to one or more
> topics and process the stream of records produced to them.
>
> ** The Streams API allows an application to act as a stream processor,
> consuming an input stream from one or more topics and producing an
> output stream to one or more output topics, effectively transforming the
> input streams to output streams.
>
> ** The Connector API allows building and running reusable producers or
> consumers that connect Kafka topics to existing applications or data
> systems. For example, a connector to a relational database might
> capture every change to a table.
>
>
> With these APIs, Kafka can be used for two broad classes of application:
>
> ** Building real-time streaming data pipelines that reliably get data
> between systems or applications.
>
> ** Building real-time streaming applications that transform or react
> to the streams of data.
>
>
> Apache Kafka is in use at large and small companies worldwide, including
> Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest, Rabobank,
> Target, The New York Times, Uber, Yelp, and Zalando, among others.
>
> A big thank you for the following 114 contributors to this release!
>
> A. Sophie Blee-Goldman, Alexander Iskuskov, Alexander Stohr, Almog
> Gavra, Andras Katona, Andrew Patterson, Andy Chambers, Andy Lapidas,
> Anna Sophie Blee-Goldman, Antony Stubbs, Arjun Satish, Bill Bejeck,
> Boyang Chen, Bruno Cadonna, CHUN-HAO TANG, Cheng Tan, Chia-Ping Tsai,
> Chris Egerton, Christo Lolov, Colin P. McCabe, Cong Ding, Daniel
> Urban, David Arthur, David Jacot, David Mao, Dmitriy Fishman, Edoardo
> Comar, Ewen Cheslack-Postava, Greg Harris, Guozhang Wang, Igor Soarez,
> Ismael Juma, Israel Ekpo, Ivan Ponomarev, Jakub Scholz, James Galasyn,
> Jason Gustafson, Jeff Kim, Jim Galasyn, JoeCqupt, Joel Hamill, John
> Gray, John Roesler, Jongho Jeon, Jorge Esteban Quilcate Otoya, Jose
> Sancio, Josep Prat, José Armando García Sancio, Jun Rao, Justine
> Olshan, Kalpesh Patel, Kamal Chandraprakash, Kevin Zhang, Kirk True,
> Konstantine Karantasis, Kowshik Prakasam, Leah Thomas, Lee Dongjin,
> Lucas Bradstreet, Luke Chen, Manikumar Reddy, Matthew Wong, Matthias
> J. Sax, Michael Carter, Mickael Maison, Nigel Liang, Niket, Niket
> Goel, Oliver Hutchison, Omnia G H Ibrahim, Patrick Stuedi, Phil
> Hardwick, Prateek Agarwal, Rajini Sivaram, Randall Hauch, René Kerner,
> Richard Yu, Rohan, Ron Dagostino, Ryan Dielhenn, Sanjana Kaundinya,
> Satish Duggana, Sergio Peña, Sherzod Mamadaliev, Stanislav Vodetskyi,
> Ted Yu, Tom Bentley, Tomas Forsman, Tomer Wizman, Uwe Eisele, Victoria
> Xia, Viktor Somogyi-Vass, Vincent Jiang, Walker Carlson, Weisheng
> Yang, Xavier Léauté, Yanwen(Jason) Lin, Yi Ding, Zara Lim, andy0x01,
> dengziming, feyman2016, ik, ik.lim, jem, jiangyuan, kpatelatwork,
> leah, loboya~, lujiefsi, sebbASF, singingMan, vamossagar12,
> wenbingshen
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at
> https://kafka.apache.org/
>
> Thank you!
>
>
> Regards,
>
> David
>


may I need a join here?

2022-01-23 Thread Bitfox
>>> df.show(3)

+----+-----+
|word|count|
+----+-----+
|  on|    1|
| dec|    1|
|2020|    1|
+----+-----+

only showing top 3 rows


>>> df2.show(3)

+--------+-----+
|stopword|count|
+--------+-----+
|    able|    1|
|   about|    1|
|   above|    1|
+--------+-----+

only showing top 3 rows


>>> df3=df.filter(~col("word").isin(df2.stopword ))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1733, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) stopword#4
missing from word#0,count#1L in operator !Filter NOT word#0 IN (stopword#4).;
!Filter NOT word#0 IN (stopword#4)
+- LogicalRDD [word#0, count#1L], false





The filter method doesn't work here.

Maybe I need a join between the two DFs?

What's the syntax for this?



Thank you and regards,

Bitfox
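
A minimal sketch of the join approach, assuming df(word, count) and
df2(stopword, count) as shown above: isin() needs a plain Python list (or a
SQL subquery), it cannot reference another DataFrame's column, which is what
triggers the AnalysisException. A left_anti join keeps only the rows of df
whose word has no match in df2.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("on", 1), ("dec", 1), ("2020", 1)], ["word", "count"])
df2 = spark.createDataFrame([("able", 1), ("about", 1), ("on", 1)], ["stopword", "count"])

# Keep only the words that do NOT appear in the stopword table.
df3 = df.join(df2, df.word == df2.stopword, "left_anti")
df3.show()

# Alternative for a small stopword table: collect it and use isin().
stopwords = [r.stopword for r in df2.select("stopword").collect()]
df.filter(~df.word.isin(stopwords)).show()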


Question about ports in spark

2022-01-23 Thread Bitfox
Hello

When Spark started on my home server, I saw there were two ports open:
8080 for the master and 8081 for the worker.
If I keep these two ports open without any network filter, are there
security issues?

Thanks


Re: [RELEASE CANDIDATE] mod_perl-2.0.12 RC2

2022-01-08 Thread Bitfox
Is there any update on libapr?

Thanks

On Sun, Jan 9, 2022 at 2:31 AM Steve Hay  wrote:

> On Sat, 18 Dec 2021 at 11:21, Steve Hay  wrote:
> >
> > Please download, test, and report back on this mod_perl 2.0.12 release
> > candidate.
> >
>
> Still waiting to see the necessary votes from other committers before
> I can release this.
>
> FWIW it's all good here (Windows 10) with httpd 2.4.51 / perl 5.34.0.
>


Re: Regarding contribution to Apache-Beam

2022-01-04 Thread Bitfox
Hello

Maybe begin from this content?
https://beam.apache.org/contribute/

Thanks

On Wed, Jan 5, 2022 at 1:43 PM Devangi Das 
wrote:

> Hello!
> I want to contribute to Apache Beam .I have a fair knowledge of java and
> python but I'm new to Go language.kindly guide me how to start contributing
> to the project.
> Thank you;
> Devangi
>


Re: How to make batch filter

2022-01-02 Thread Bitfox
I always use the dataframe API, though I am pretty familiar with general SQL.
I used the method you provided to create a big filter, as described here:

https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/

Thanks


On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh 
wrote:

> Well the short answer is there is no such thing as which one is more
> performant. Your mileage varies.
>
> SQL is a domain-specific language used in programming and designed for
> managing data held in a relational database management system, or for
> stream processing in a relational data stream management system.
>
>
> A DataFrame is a *Dataset* organised into named columns. It is
> conceptually equivalent to a table in a relational database or a data frame
> in R/Python, but with richer optimizations under the hood. DataFrames can
> be constructed from a wide array of sources such as: structured data
> files, tables in Apache Hive, Google BigQuery, other external databases, or
> existing RDDs.
>
>
> You use the SQL API to interact with the underlying data by first
> constructing a dataframe on it.
>
>
> The way I use it is to use either
>
>
> from pyspark.sql.functions import col
>
> DF =  spark.table("alayer.joint_accounts_view")
>
> DF.select(col('transactiondate'),col('transactiontype')).orderBy(col("transactiondate")).show(5)
>
> OR
>
>
> DF.createOrReplaceTempView("tmp") ## create a temporary view
> spark.sql("select transactiondate, transactiontype from tmp order by
> transactiondate").show(5)
>
> You use as you choose. Under the hood, these APIs are using a common
> layer. So the performance for me as a practitioner (i.e. which one is more
> performant) does not come into it.
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Jan 2022 at 12:11, Bitfox  wrote:
>
>> May I ask, between the dataframe API and the SQL API, which is better in performance?
>> Thanks
>>
>> On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> your notes are really great, it really brought back the old days again
>>> :) thanks.
>>>
>>> Just to note a few points that I found useful related to this question:
>>> 1. cores and threads - page 5
>>> 2. executor cores and number settings - page 6..
>>>
>>>
>>> I think that the following example may be of use, note that I have one
>>> driver and that has 8 cores as I am running PYSPARK 3.1.2 in local mode,
>>> but this will give a way to find out a bit more possibly:
>>>
>>> 
>>> >>> from pyspark.sql.types import *
>>> >>> #create the filter dataframe, there are easier ways to do the below
>>> >>> spark.createDataFrame(list(map(lambda filter:
>>> pyspark.sql.Row(filter), [0, 1, 2, 4, 7, 9])),
>>> StructType([StructField("filter_value",
>>> IntegerType())])).createOrReplaceTempView("filters")
>>> >>> #create the main table
>>> >>> spark.range(100).createOrReplaceTempView("test_base")
>>> >>> spark.sql("SELECT id, FLOOR(RAND() * 10) rand FROM
>>> test_base").createOrReplaceTempView("test")
>>> >>> #see the partitions in the filters and the main table
>>> >>> spark.sql("SELECT * FROM filters").rdd.getNumPartitions()
>>> 8
>>> >>> spark.sql("SELECT * FROM test").rdd.getNumPartitions()
>>> 8
>>> >>> #see the number of partitions in the filtered join output, I am
>>> assuming implicit casting here
>>> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value
>>> FROM filters)").rdd.getNumPartitions()
>>> 200
>>> >>> spark.sql("SET spark.sql.shuffle.partitions=10")
>>> DataFrame[key: string, value: string]
>>> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value
>>> FROM filters)").rdd.getNumPartitions()
>>> 10
>>> ===

Re: How to make batch filter

2022-01-02 Thread Bitfox
May I ask, between the dataframe API and the SQL API, which is better in performance?
Thanks

On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta 
wrote:

> Hi Mich,
>
> your notes are really great, it really brought back the old days again :)
> thanks.
>
> Just to note a few points that I found useful related to this question:
> 1. cores and threads - page 5
> 2. executor cores and number settings - page 6..
>
>
> I think that the following example may be of use, note that I have one
> driver and that has 8 cores as I am running PYSPARK 3.1.2 in local mode,
> but this will give a way to find out a bit more possibly:
>
> 
> >>> from pyspark.sql.types import *
> >>> #create the filter dataframe, there are easier ways to do the below
> >>> spark.createDataFrame(list(map(lambda filter: pyspark.sql.Row(filter),
> [0, 1, 2, 4, 7, 9])), StructType([StructField("filter_value",
> IntegerType())])).createOrReplaceTempView("filters")
> >>> #create the main table
> >>> spark.range(100).createOrReplaceTempView("test_base")
> >>> spark.sql("SELECT id, FLOOR(RAND() * 10) rand FROM
> test_base").createOrReplaceTempView("test")
> >>> #see the partitions in the filters and the main table
> >>> spark.sql("SELECT * FROM filters").rdd.getNumPartitions()
> 8
> >>> spark.sql("SELECT * FROM test").rdd.getNumPartitions()
> 8
> >>> #see the number of partitions in the filtered join output, I am
> assuming implicit casting here
> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value FROM
> filters)").rdd.getNumPartitions()
> 200
> >>> spark.sql("SET spark.sql.shuffle.partitions=10")
> DataFrame[key: string, value: string]
> >>> spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value FROM
> filters)").rdd.getNumPartitions()
> 10
> ====
>
> Please do refer to the following page for adaptive sql execution in SPARK
> 3, it will be of massive help particularly in case you are handling skewed
> joins, https://spark.apache.org/docs/latest/sql-performance-tuning.html
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Sun, Jan 2, 2022 at 11:24 AM Bitfox  wrote:
>
>> Thanks Mich. That looks good.
>>
>> On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh 
>> wrote:
>>
>>> LOL.
>>>
>>> You asking these questions takes me back to summer 2016 when I started
>>> writing notes on spark. Obviously earlier versions but the notion of RDD,
>>> Local, standalone, YARN etc. are still valid. Those days there were no k8s
>>> and the public cloud was not widely adopted.  I browsed it and it was
>>> refreshing for me. Anyway you may find some points addressing your
>>> questions that you tend to ask.
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 2 Jan 2022 at 00:20, Bitfox  wrote:
>>>
>>>> One more question, for this big filter, given my server has 4 Cores,
>>>> will spark (standalone mode) split the RDD to 4 partitions automatically?
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Create a list of values that you don't want and filter on those
>>>>>
>>>>> >>> DF = spark.range(10)
>>>>> >>> DF
>>>>> DataFrame[id: bigint]
>>>>> >>>
>>>>> >>> array = [1, 2, 3, 8]  # don't want these
>>>>> >>> DF.filter(DF.id.isin(array) == False).show()
>>>>> +---+
>>>>> | id|
>>>>> +---+
>>>>> |  0|
>>>>> |  4|
>>>>> |  5|
>>>>> |  6|
>>>>> |  7|
>>>>

Re: How to make batch filter

2022-01-02 Thread Bitfox
Thanks Mich. That looks good.

On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh 
wrote:

> LOL.
>
> You asking these questions takes me back to summer 2016 when I started
> writing notes on spark. Obviously earlier versions but the notion of RDD,
> Local, standalone, YARN etc. are still valid. Those days there were no k8s
> and the public cloud was not widely adopted.  I browsed it and it was
> refreshing for me. Anyway you may find some points addressing your
> questions that you tend to ask.
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Jan 2022 at 00:20, Bitfox  wrote:
>
>> One more question, for this big filter, given my server has 4 Cores, will
>> spark (standalone mode) split the RDD to 4 partitions automatically?
>>
>> Thanks
>>
>> On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh 
>> wrote:
>>
>>> Create a list of values that you don't want and filter on those
>>>
>>> >>> DF = spark.range(10)
>>> >>> DF
>>> DataFrame[id: bigint]
>>> >>>
>>> >>> array = [1, 2, 3, 8]  # don't want these
>>> >>> DF.filter(DF.id.isin(array) == False).show()
>>> +---+
>>> | id|
>>> +---+
>>> |  0|
>>> |  4|
>>> |  5|
>>> |  6|
>>> |  7|
>>> |  9|
>>> +---+
>>>
>>>  or use binary NOT operator:
>>>
>>>
>>> >>> DF.filter(*~*DF.id.isin(array)).show()
>>>
>>> +---+
>>>
>>> | id|
>>>
>>> +---+
>>>
>>> |  0|
>>>
>>> |  4|
>>>
>>> |  5|
>>>
>>> |  6|
>>>
>>> |  7|
>>>
>>> |  9|
>>>
>>> +---+
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 1 Jan 2022 at 20:59, Bitfox  wrote:
>>>
>>>> Using the dataframe API I need to implement a batch filter:
>>>>
>>>> DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)
>>>>
>>>> There are a lot of keywords that should be filtered out for the same
>>>> column in the where statement.
>>>>
>>>> How can I make it smarter? A UDF or something else?
>>>>
>>>> Thanks & Happy new Year!
>>>> Bitfox
>>>>
>>>


Re: How to make batch filter

2022-01-01 Thread Bitfox
One more question: for this big filter, given my server has 4 cores, will
spark (standalone mode) split the RDD into 4 partitions automatically?

Thanks

On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh 
wrote:

> Create a list of values that you don't want and filter on those
>
> >>> DF = spark.range(10)
> >>> DF
> DataFrame[id: bigint]
> >>>
> >>> array = [1, 2, 3, 8]  # don't want these
> >>> DF.filter(DF.id.isin(array) == False).show()
> +---+
> | id|
> +---+
> |  0|
> |  4|
> |  5|
> |  6|
> |  7|
> |  9|
> +---+
>
>  or use binary NOT operator:
>
>
> >>> DF.filter(*~*DF.id.isin(array)).show()
>
> +---+
>
> | id|
>
> +---+
>
> |  0|
>
> |  4|
>
> |  5|
>
> |  6|
>
> |  7|
>
> |  9|
>
> +---+
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Jan 2022 at 20:59, Bitfox  wrote:
>
>> Using the dataframe API I need to implement a batch filter:
>>
>> DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)
>>
>> There are a lot of keywords that should be filtered out for the same
>> column in the where statement.
>>
>> How can I make it smarter? A UDF or something else?
>>
>> Thanks & Happy new Year!
>> Bitfox
>>
>


Re: How to make batch filter

2022-01-01 Thread Bitfox
That’s great thanks.

On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh 
wrote:

> Create a list of values that you don't want and filter on those
>
> >>> DF = spark.range(10)
> >>> DF
> DataFrame[id: bigint]
> >>>
> >>> array = [1, 2, 3, 8]  # don't want these
> >>> DF.filter(DF.id.isin(array) == False).show()
> +---+
> | id|
> +---+
> |  0|
> |  4|
> |  5|
> |  6|
> |  7|
> |  9|
> +---+
>
>  or use binary NOT operator:
>
>
> >>> DF.filter(*~*DF.id.isin(array)).show()
>
> +---+
>
> | id|
>
> +---+
>
> |  0|
>
> |  4|
>
> |  5|
>
> |  6|
>
> |  7|
>
> |  9|
>
> +---+
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Jan 2022 at 20:59, Bitfox  wrote:
>
>> Using the dataframe API I need to implement a batch filter:
>>
>> DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)
>>
>> There are a lot of keywords that should be filtered out for the same
>> column in the where statement.
>>
>> How can I make it smarter? A UDF or something else?
>>
>> Thanks & Happy new Year!
>> Bitfox
>>
>


How to make batch filter

2022-01-01 Thread Bitfox
Using the dataframe API I need to implement a batch filter:

DF. select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …)

There are a lot of keywords that should be filtered out for the same column
in the where statement.

How can I make it smarter? A UDF or something else?

Thanks & Happy new Year!
Bitfox
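
A minimal sketch of two ways to express "this column is not equal to any of N
keywords" without typing N conditions, assuming a small keyword list (for a
very large list, a left_anti join against a keyword DataFrame scales better):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("x",), ("y",)], ["word"])
keywords = ["a", "b", "c"]

# 1) Negated isin().
df.where(~F.col("word").isin(keywords)).show()

# 2) Build the same predicate programmatically, equivalent to
#    (word != 'a') AND (word != 'b') AND (word != 'c').
pred = reduce(lambda acc, k: acc & (F.col("word") != k), keywords, F.lit(True))
df.where(pred).show()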


Re: [ANNOUNCE] Apache Pulsar 2.7.4 released

2021-12-27 Thread bitfox

What new features are there for streaming development, then? Thanks


On 2021-12-27 22:52, guo jiwei wrote:
The Apache Pulsar team is proud to announce Apache Pulsar version 
2.7.4.


Pulsar is a highly scalable, low latency messaging platform running on
commodity hardware. It provides simple pub-sub semantics over topics,
guaranteed at-least-once delivery of messages, automatic cursor 
management for

subscribers, and cross-datacenter replication.

For Pulsar release details and downloads, visit:

https://pulsar.apache.org/download

Release Notes are at:
http://pulsar.apache.org/release-notes

We would like to thank the contributors that made the release possible.


Regards
Jiwei Guo (Tboy)


my first data science project with spark

2021-12-26 Thread bitfox

Hello list,

Thanks to the Spark project and the community, I have made my first data 
statistics project with Spark.

The url:  https://github.com/bitfoxtop/EmailRankings
Surely this is not that big-data... I can even write a python script to 
finish the job more quickly.

But since the job was done in Spark I want to share it here.

Thanks for your reviews.

regards
Bitfox

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread bitfox
Thanks a lot Hollis. It was indeed due to the PyPI version. Now I have updated 
it.


$ pip3 -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

$ pip3 install sparkmeasure
Collecting sparkmeasure
  Using cached 
https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl

Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0

$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
...

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from 
range(1000) cross join range(1000) cross join range(100)").show()')

+---------+
| count(1)|
+---------+
|100000000|
+---------+
...


Hope it helps others who have met the same issue.
Happy holidays. :0

Bitfox


On 2021-12-25 09:48, Hollis wrote:

 Replied mail 

 From
 Mich Talebzadeh

 Date
 12/25/2021 00:25

 To
 Sean Owen

 Cc
 user、Luca Canali

 Subject
 Re: measure running time

Hi Sean,

I have already discussed an issue in my case with Spark 3.1.1 and
sparkmeasure  with the author Luca Canali on this matter. It has been
reproduced. I think we ought to wait for a patch.

HTH,

Mich

   view my Linkedin profile [1]

Disclaimer: Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.

On Fri, 24 Dec 2021 at 14:51, Sean Owen  wrote:


You probably did not install it on your cluster, nor included the
python package with your app

On Fri, Dec 24, 2021, 4:35 AM  wrote:


but I already installed it:

Requirement already satisfied: sparkmeasure in
/usr/local/lib/python2.7/dist-packages

so how? thank you.

On 2021-12-24 18:15, Hollis wrote:

Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in
pyspark.



from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from
range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages

ch.cern.sparkmeasure:spark-measure_2.12:0.17


I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:

Thanks Gourav and Luca. I will try with the tools you provide

in

the

Github.

On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a

simplistic

approach that may lead you to miss important details, in

particular

when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be

quite

useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of

automating

collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at

all in

distributed computation. Just saying that an operation in RDD

and

Dataframe can be compared based on their start and stop time

may

not

provide any valid information.

You will have to look into the details of timing and the

steps.

For

example, please look at the SPARK UI to see how timings are

calculated

in distributed computing mode, there are several well written

papers

on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the

command?

I just want to compare the running time of the RDD API and

dataframe


API, in my this blog:










https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.











-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org










-

To uns

df.show() to text file

2021-12-24 Thread bitfox

Hello list,

spark newbie here :0
How can I write the df.show() result to a text file on the system?
I run in the pyspark shell, not a standalone Python client program.

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
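
A minimal sketch, assuming the pyspark shell: df.show() just prints to stdout,
so its output can be redirected to a file; for the full data (not only the
pretty-printed preview), writing the DataFrame out is usually better.

import contextlib
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple", 2), ("orange", 3)], ["name", "count"])

# Capture the pretty-printed preview into a text file.
with open("/tmp/df_show.txt", "w") as f, contextlib.redirect_stdout(f):
    df.show()

# Write the actual data out as files instead of the preview.
df.coalesce(1).write.mode("overwrite").csv("/tmp/df_csv", header=True)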



Re: measure running time

2021-12-24 Thread bitfox

As you see below:

$ pip install sparkmeasure
Collecting sparkmeasure
  Using cached 
https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl

Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.14.0


$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
..

from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'


That still doesn't work.
I run Spark 3.2.0 on an Ubuntu system.

Regards.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-24 Thread bitfox

but I already installed it:

Requirement already satisfied: sparkmeasure in 
/usr/local/lib/python2.7/dist-packages


so how? thank you.

On 2021-12-24 18:15, Hollis wrote:

Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in pyspark.


from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from
range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:

Thanks Gourav and Luca. I will try with the tools you provide in

the

Github.

On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a

simplistic

approach that may lead you to miss important details, in

particular

when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of

automating

collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may

not

provide any valid information.

You will have to look into the details of timing and the steps.

For

example, please look at the SPARK UI to see how timings are

calculated

in distributed computing mode, there are several well written

papers

on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and

dataframe


API, in my this blog:




https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.





-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Dataframe's storage size

2021-12-23 Thread bitfox

Hello

Is it possible to know a dataframe's total storage size in bytes? such 
as:



df.size()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1660, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'size'

Sure, it won't work, but if there were such a method that would be great.

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
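
A hedged sketch for the size question: the snippet below reads the Catalyst
optimizer's sizeInBytes estimate through the internal _jdf handle, so treat it
as an approximation and as unofficial API that may change between Spark
versions; for the exact in-memory footprint, cache the DataFrame and check the
Storage tab of the Web UI.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple", 2), ("orange", 3)], ["name", "count"])

# Optimizer's size estimate (internal API, approximate).
size_estimate = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print("estimated size in bytes:", size_estimate.toString())

# Exact cached size: materialise the cache, then look at the Storage tab
# of the Spark Web UI (http://<driver>:4040/storage/).
df.cache()
df.count()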



Re: measure running time

2021-12-23 Thread bitfox

Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't load from the module sparkmeasure:


from sparkmeasure import StageMetrics

Traceback (most recent call last):
  File "", line 1, in 
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try with the tools you provide in the 
Github.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe

API, in my this blog:


https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.



-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: measure running time

2021-12-23 Thread bitfox
Thanks Gourav and Luca. I will try with the tools you provide in the 
Github.


On 2021-12-23 23:40, Luca Canali wrote:

Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.

WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill down. See
https://spark.apache.org/docs/latest/monitoring.html

You can also have a look at this tool that takes care of automating
collecting and aggregating some executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user 
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation in RDD and
Dataframe can be compared based on their start and stop time may not
provide any valid information.

You will have to look into the details of timing and the steps. For
example, please look at the SPARK UI to see how timings are calculated
in distributed computing mode, there are several well written papers
on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM  wrote:


hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe

API, in my this blog:


https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/


I tried spark.time() it doesn't work.
Thank you.



-

To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



measure running time

2021-12-23 Thread bitfox

hello community,

In pyspark how can I measure the running time to the command?
I just want to compare the running time of the RDD API and dataframe 
API, in my this blog:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time() it doesn't work.
Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
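
A minimal sketch for timing in pyspark: spark.time() is a Scala-side method
and is not available in PySpark, so the usual workaround is to time an action
yourself (transformations are lazy, so wrap the call that actually triggers
the work):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)

start = time.perf_counter()
df.selectExpr("sum(id)").collect()      # the action is what gets timed
print(f"elapsed: {time.perf_counter() - start:.3f}s")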



Re: Unable to use WriteStream to write to delta file.

2021-12-17 Thread bitfox
May I ask why you don’t  use spark.read and spark.write instead of 
readStream and writeStream? Thanks.


On 2021-12-17 15:09, Abhinav Gundapaneni wrote:

Hello Spark community,

I’m using Apache spark(version 3.2) to read a CSV file to a
dataframe using ReadStream, process the dataframe and write the
dataframe to Delta file using WriteStream. I’m getting a failure
during the WriteStream process. I’m trying to run the script locally
in my windows 11 machine. Below is the stack trace of the error I’m
facing. Please let me know if there’s anything that I’m missing.

java.lang.NoSuchMethodError:
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldNames(Lscala/collection/Seq;)V


at
org.apache.spark.sql.delta.schema.SchemaUtils$.checkFieldNames(SchemaUtils.scala:958)


at
org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata(OptimisticTransaction.scala:216)


at
org.apache.spark.sql.delta.OptimisticTransactionImpl.verifyNewMetadata$(OptimisticTransaction.scala:214)


at
org.apache.spark.sql.delta.OptimisticTransaction.verifyNewMetadata(OptimisticTransaction.scala:80)


at
org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata(OptimisticTransaction.scala:208)


at
org.apache.spark.sql.delta.OptimisticTransactionImpl.updateMetadata$(OptimisticTransaction.scala:195)


at
org.apache.spark.sql.delta.OptimisticTransaction.updateMetadata(OptimisticTransaction.scala:80)


at
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:101)


at
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata$(ImplicitMetadataOperation.scala:62)


at
org.apache.spark.sql.delta.sources.DeltaSink.updateMetadata(DeltaSink.scala:37)


at
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:59)


at
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata$(ImplicitMetadataOperation.scala:50)


at
org.apache.spark.sql.delta.sources.DeltaSink.updateMetadata(DeltaSink.scala:37)


at
org.apache.spark.sql.delta.sources.DeltaSink.$anonfun$addBatch$1(DeltaSink.scala:80)


at
org.apache.spark.sql.delta.sources.DeltaSink.$anonfun$addBatch$1$adapted(DeltaSink.scala:54)


at
org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:188)


at
org.apache.spark.sql.delta.sources.DeltaSink.addBatch(DeltaSink.scala:54)


at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:600)


at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)


at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)


at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)


at
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)

at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)


at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:598)


at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)


at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)


at
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)


at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:598)


at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:228)


at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)


at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)


at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)


at
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)


at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:193)


at
org.apache.spark.sql.execution.streaming.OneTimeExecutor.execute(TriggerExecutor.scala:39)


at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:187)


at
org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:303)


at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)


at
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)


issue on define a dataframe

2021-12-14 Thread bitfox

Hello,

Spark newbie here :)

Why can't I create a dataframe with just one column?

for instance, this works:


df=spark.createDataFrame([("apple",2),("orange",3)],["name","count"])



But this can't work:


df=spark.createDataFrame([("apple"),("orange")],["name"])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/session.py", line 675, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/opt/spark/python/pyspark/sql/session.py", line 700, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/opt/spark/python/pyspark/sql/session.py", line 512, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/opt/spark/python/pyspark/sql/session.py", line 439, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/opt/spark/python/pyspark/sql/session.py", line 439, in <genexpr>
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/opt/spark/python/pyspark/sql/types.py", line 1067, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'str'>


how can I fix it?

Thanks

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
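
A minimal sketch of the fix: in Python, ("apple") is just the string "apple",
so the list becomes a list of bare strings and Spark cannot infer a row
schema; a one-column row needs a one-element tuple (note the trailing comma)
or an explicit element type.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: one-element tuples.
df = spark.createDataFrame([("apple",), ("orange",)], ["name"])
df.show()

# Option 2: bare strings plus an explicit element type, then rename the column.
df2 = spark.createDataFrame(["apple", "orange"], "string").toDF("name")
df2.show()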



Re: About some Spark technical assistance

2021-12-12 Thread bitfox

github url please.

On 2021-12-13 01:06, sam smith wrote:

Hello guys,

I am replicating a paper's algorithm (graph coloring algorithm) in
Spark under Java, and thought about asking you guys for some
assistance to validate / review my 600 lines of code. Any volunteers
to share the code with ?
Thanks


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: creating database issue

2021-12-07 Thread bitfox

Also, I can't start the spark-sql shell; the error is below.
Does this mean I need to install Hive on the local machine?

Caused by: java.sql.SQLException: Failed to start database 
'metastore_db' with class loader 
jdk.internal.loader.ClassLoaders$AppClassLoader@5ffd2b27, see the next 
exception for details.
	at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
	at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)

at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at java.base/java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown 
Source)

at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.(BoneCP.java:416)
... 89 more
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with 
class loader jdk.internal.loader.ClassLoaders$AppClassLoader@5ffd2b27, 
see the next exception for details.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown 
Source)
	at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown 
Source)

... 105 more


Thanks.

On 2021/12/8 9:28, bitfox wrote:

Hello

This is just a standalone deployment for testing purpose.
The version:
Spark 3.2.0 (git revision 5d45a415f3) built for Hadoop 3.3.1
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3.2 -Phive -Phive-thriftserver


I just started one master and one worker for the test.

Thanks


On 2021/12/8 9:15, Qian Sun wrote:
   It seems to be a hms question. Would u like to provide the 
information about spark version, hive version and spark application 
configuration?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: creating database issue

2021-12-07 Thread bitfox

Hello

This is just a standalone deployment for testing purpose.
The version:
Spark 3.2.0 (git revision 5d45a415f3) built for Hadoop 3.3.1
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 
-Phadoop-3.2 -Phive -Phive-thriftserver


I just started one master and one worker for the test.

Thanks


On 2021/12/8 9:15, Qian Sun wrote:

   It seems to be a hms question. Would u like to provide the information about 
spark version, hive version and spark application configuration?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



creating database issue

2021-12-07 Thread bitfox

Sorry, I am a newbie to Spark.

When I create a database in the pyspark shell, following the content of the 
book Learning Spark 2.0, I get:


>>> spark.sql("CREATE DATABASE learn_spark_db")
21/12/08 09:01:34 WARN HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
21/12/08 09:01:34 WARN HiveConf: HiveConf of name 
hive.stats.retries.wait does not exist
21/12/08 09:01:39 WARN ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so 
recording the schema version 2.3.0
21/12/08 09:01:39 WARN ObjectStore: setMetaStoreSchemaVersion called but 
recording version is disabled: version = 2.3.0, comment = Set by 
MetaStore pyh@185.213.174.249
21/12/08 09:01:40 WARN ObjectStore: Failed to get database default, 
returning NoSuchObjectException
21/12/08 09:01:40 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
21/12/08 09:01:40 WARN ObjectStore: Failed to get database 
learn_spark_db, returning NoSuchObjectException


Can you point out to me what is wrong?

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
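
For the pyspark part, a hedged note: on a fresh embedded Derby metastore,
those "Failed to get database ..., returning NoSuchObjectException" lines are
only WARNs emitted while the metastore is initialised for the first time; the
database is normally created anyway, which a quick check can confirm.

spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("SHOW DATABASES").show()
spark.sql("USE learn_spark_db")

If spark-sql then fails with "Failed to start database 'metastore_db'", a
common cause is that another shell (for example a still-running pyspark
session) holds the embedded Derby metastore lock in the same working
directory, since embedded Derby allows only one active connection.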



Re: [OUTREACH} December '21 Edition of 'Happenings in the Neighborhood' is out now

2021-12-06 Thread bitfox

Is there a blog comparing Apache Pulsar and Apache Spark?

Thanks


On 2021-12-07 09:46, Aaron Williams wrote:

Hello Apache Pulsar Neighbors,

For this issue [1], we have three new committers, a
new milestone, and lots of talks. Plus our normal features of a Stack
Overflow question and some community stats.

If you have anything that you think your neighbors would find
interesting,
we have created #blogs-articles and #event-decks channels on the
Apache
Pulsar slack Workspace to capture them.

Thank you,
Aaron Williams
Resident of the Apache Pulsar Neighborhood


Links:
--
[1] 
https://medium.com/apache-pulsar-neighborhood/happenings-in-the-ap-neighborhood-dec-21-abef29b04f0f