How to recursively aggregate tree-like (hierarchical) data using Spark?

2018-09-25 Thread newroyker
The problem statement and an approach to solve it recursively are described
here:
https://stackoverflow.com/questions/52508872/how-to-recursively-aggregate-treelikehierarchical-data-using-spark

Looking for more elegant/performant solutions, if they exist. TIA!
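
For reference, a minimal sketch of one iterative approach (the schema, column
names and sample data below are assumptions, not taken from the linked
question): keep a running total per node and repeatedly roll child subtotals
up one level until the totals stop changing, so after as many passes as the
tree is deep each node's total covers its whole subtree.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TreeAggregateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("tree-agg").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per node with a nullable parent id and a value.
    val nodes = Seq(
      (1, Option.empty[Int], 10L), // root
      (2, Some(1), 20L),
      (3, Some(1), 30L),
      (4, Some(2), 40L)
    ).toDF("id", "parent_id", "value")

    // Start each node's total at its own value, then roll child totals up one
    // level per iteration until nothing changes any more.
    var totals = nodes.withColumn("total", $"value").select("id", "parent_id", "total").cache()
    var changed = true
    while (changed) {
      val childSums = totals
        .groupBy("parent_id")
        .agg(sum("total").as("child_total"))
        .withColumnRenamed("parent_id", "id")

      val next = nodes
        .join(childSums, Seq("id"), "left")
        .withColumn("total", $"value" + coalesce($"child_total", lit(0L)))
        .select("id", "parent_id", "total")
        .cache()

      changed = next.except(totals).count() > 0
      totals = next
    }

    totals.orderBy("id").show() // each node's total now covers its whole subtree
    spark.stop()
  }
}

The convergence check via except() is the expensive part; for trees of known
depth you could simply loop a fixed number of times, and for deep trees
checkpointing between iterations keeps the lineage short.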





RE: Python kubernetes spark 2.4 branch

2018-09-25 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi Ilan/ Yinan,
Yes, my test case is also similar to the one described in
https://issues.apache.org/jira/browse/SPARK-24736

My spark-submit is as follows:

./spark-submit --deploy-mode cluster --master k8s://https://10.75.145.23:8443 \
  --conf spark.app.name=spark-py --properties-file /tmp/program_files/spark_py.conf \
  --py-files http://10.75.145.25:80/Spark/getNN.py \
  http://10.75.145.25:80/Spark/test.py

The following error is observed:

+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=192.168.1.22 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner http://10.75.145.25:80/Spark/test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/phoenix-4.13.1-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Traceback (most recent call last):
  File "/tmp/spark-4c428c98-e123-4c29-a9f5-ef85f207e229/test.py", line 13, in <module>
    from getNN import *
ImportError: No module named getNN
2018-09-25 16:19:57 INFO ShutdownHookManager:54 - Shutdown hook called
2018-09-25 16:19:57 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-4c428c98-e123-4c29-a9f5-ef85f207e229

I am observing the same kind of behaviour as mentioned in
https://issues.apache.org/jira/browse/SPARK-24736 (the file gets downloaded
and is available in the pod).

The same happens with local files as well:

./spark-submit --deploy-mode cluster --master k8s://https://10.75.145.23:8443 \
  --conf spark.app.name=spark-py --properties-file /tmp/program_files/spark_py.conf \
  --py-files ./getNN.py http://10.75.145.25:80/Spark/test.py

test.py depends on getNN.py.


However, the same setup works on the spark 2.2 k8s branch.


Regards
Surya

From: Ilan Filonenko 
Sent: Wednesday, September 26, 2018 2:06 AM
To: liyinan...@gmail.com
Cc: Garlapati, Suryanarayana (Nokia - IN/Bangalore); Spark dev list;
user@spark.apache.org
Subject: Re: Python kubernetes spark 2.4 branch

Is this in reference to: https://issues.apache.org/jira/browse/SPARK-24736 ?

On Tue, Sep 25, 2018 at 12:38 PM Yinan Li <liyinan...@gmail.com> wrote:
Can you give more details on how you ran your app? Did you build your own
image, and which image are you using?

On Tue, Sep 25, 2018 at 10:23 AM Garlapati, Suryanarayana (Nokia - IN/Bangalore)
<suryanarayana.garlap...@nokia.com> wrote:
Hi,
I am trying to run spark python test cases on k8s based on tag spark-2.4-rc1.
When the dependent files are passed through the --py-files option, they are not
getting resolved by the main python script. Please let me know if this is a
known issue.

Regards
Surya



[Spark SQL] why does spark sql hash() return the same hash value though the keys/exprs are not the same

2018-09-25 Thread Gokula Krishnan D
Hello All,

I am calculating the hash value of a few columns to determine whether a record
is an insert/delete/update, but I found a scenario that is a little weird:
some of the records return the same hash value even though the keys are
totally different.

For instance:

scala> spark.sql("select hash('40514X'),hash('41751')").show()

+---+---+

|hash(40514)|hash(41751)|

+---+---+

|  976573657|  976573657|

+---+---+

scala> spark.sql("select hash('14589'),hash('40004')").show()

+---+---+

|hash(14589)|hash(40004)|

+---+---+

|  777096871|  777096871|

+---+---+
I do understand that hash() returns an integer; have these reached the max
value?
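
For what it's worth, a sketch of one way to look at this (assuming the same
spark-shell session; the key column names below are made up): Spark SQL's
hash() is a 32-bit Murmur3 hash, so with only about 4 billion possible outputs,
collisions between unrelated keys are expected and are not a sign of overflow.
For insert/update/delete detection, a longer digest such as sha2 over the key
columns makes accidental collisions practically impossible.

scala> import org.apache.spark.sql.functions._
scala> val df = Seq(("40514X", "41751"), ("14589", "40004")).toDF("key_a", "key_b")  // hypothetical keys
scala> df.select(hash($"key_a").as("hash_a"), hash($"key_b").as("hash_b"), sha2($"key_a", 256).as("sha_a"), sha2($"key_b", 256).as("sha_b")).show(false)

Comparing the key columns directly, or building a row digest with sha2 over
concat_ws of all tracked columns, avoids relying on a 32-bit value for
equality decisions.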

Thanks & Regards,
Gokula Krishnan (Gokul)


Re: [Spark SQL]: Java Spark Classes With Attributes of Type Set In Datasets

2018-09-25 Thread Dillon Dukek
Actually, walking through it in a debug terminal, it appears that the
deserializer can properly transform the data on read to an ArrayType, but the
serializer doesn't know what to do when we try to go back out from the
internal Spark representation:

tags, if (isnull(lambdavariable(MapObjects_loopValue0,
MapObjects_loopIsNull0, ObjectType(class ), true).getTags)) null
else named_struct()


On Tue, Sep 25, 2018 at 2:27 PM ddukek  wrote:

> I'm trying to use a data model that has an instance variable that is a Set.
> If I leave the type as the abstract Set interface, I get an error because Set
> is an interface and cannot be instantiated. If I then try to make the
> variable a concrete implementation of Set, I get an analysis exception:
>
> "org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due
> to data type mismatch: input to function named_struct requires at least one
> argument".
>
> If I then change the type to a List, the program works just fine. I'm
> using Dataset operations and the Encoders.bean method to map the rows
> to the proper type.
>
> Is there a way to get around this without forcing me to use a List in my
> model?
>


can Spark 2.4 work on JDK 11?

2018-09-25 Thread kant kodali
Hi All,

Can Spark 2.4 work on JDK 11? I feel like there are a lot of features added in
JDK 9, 10, and 11 that could make the deployment process a whole lot better,
and of course some more syntactic sugar similar to Scala.

Thanks!


[Spark SQL]: Java Spark Classes With Attributes of Type Set In Datasets

2018-09-25 Thread ddukek
I'm trying to use a data model that has an instance variable that is a Set. If
I leave the type as the abstract Set interface, I get an error because Set
is an interface and cannot be instantiated. If I then try to make the
variable a concrete implementation of Set, I get an analysis exception:

"org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due
to data type mismatch: input to function named_struct requires at least one
argument".

If I then change the type to a List, the program works just fine. I'm
using Dataset operations and the Encoders.bean method to map the rows
to the proper type.

Is there a way to get around this without forcing me to use a List in my
model?
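
For what it's worth, a minimal sketch of the List-based workaround described
above, written in Scala with JavaBean-style accessors so that Encoders.bean
still applies (the class and field names are hypothetical): keep the field as
a java.util.List inside the Dataset and re-impose Set semantics only after
leaving it.

import org.apache.spark.sql.{Encoders, SparkSession}
import scala.beans.BeanProperty

// Hypothetical bean: the Set-typed field is modelled as java.util.List, which
// Encoders.bean maps to an ArrayType; duplicates are removed outside the Dataset.
class TagEvent(@BeanProperty var id: String,
               @BeanProperty var tags: java.util.List[String]) {
  def this() = this(null, null) // no-arg constructor required by bean encoders
}

object SetAsListSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("set-as-list").getOrCreate()

    val event = new TagEvent("a", java.util.Arrays.asList("x", "y", "x"))
    val ds = spark.createDataset(Seq(event))(Encoders.bean(classOf[TagEvent]))
    ds.show(false)

    // Re-impose Set semantics only after collecting back to the driver.
    ds.collect().foreach(e => println(new java.util.HashSet[String](e.getTags)))
    spark.stop()
  }
}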






Re: Python kubernetes spark 2.4 branch

2018-09-25 Thread Yinan Li
Can you give more details on how you ran your app? Did you build your own
image, and which image are you using?

On Tue, Sep 25, 2018 at 10:23 AM Garlapati, Suryanarayana (Nokia -
IN/Bangalore)  wrote:

> Hi,
>
> I am trying to run spark python test cases on k8s based on tag
> spark-2.4-rc1. When the dependent files are passed through the --py-files
> option, they are not getting resolved by the main python script. Please let
> me know if this is a known issue.
>
>
>
> Regards
>
> Surya
>
>
>


Python kubernetes spark 2.4 branch

2018-09-25 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi,
I am trying to run spark python test cases on k8s based on tag spark-2.4-rc1.
When the dependent files are passed through the --py-files option, they are not
getting resolved by the main python script. Please let me know if this is a
known issue.

Regards
Surya



can I model any arbitrary data structure as an RDD?

2018-09-25 Thread kant kodali
Hi All,

I am wondering if I can model any arbitrary data structure as an RDD. For
example, can I model red-black trees, suffix trees, radix trees, splay
trees, Fibonacci heaps, tries, linked lists, etc. as RDDs? If so, how?

To implement a custom RDD I have to implement the compute and getPartitions
functions. Does this mean that as long as I can store the above data
structures in some storage and implement the compute and getPartitions
functions, I am good? I also wonder whether every data structure is
parallelizable in the first place.
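
For illustration, a minimal custom-RDD sketch (the linked-list shape and every
name below are made up for the example): anything you can cut into
independently computable pieces can be exposed through getPartitions and
compute, much like Spark's own parallelized-collection RDD. Structures that
are inherently sequential to traverse still have to be flattened or
partitioned on the driver first, which is where the "is it parallelizable"
question really bites.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A hypothetical singly linked list living on the driver.
case class Node[T](value: T, next: Option[Node[T]])

// One partition carries the values of one chunk of the list.
class ListPartition[T](override val index: Int, val values: Seq[T]) extends Partition

// Minimal sketch: flatten the list into fixed-size chunks on the driver and let
// each partition's compute() simply iterate over its own chunk.
class LinkedListRDD[T: scala.reflect.ClassTag](sc: SparkContext, head: Node[T], chunkSize: Int)
    extends RDD[T](sc, Nil) {

  private def values(start: Node[T]): Seq[T] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    var cur: Option[Node[T]] = Some(start)
    while (cur.isDefined) { buf += cur.get.value; cur = cur.get.next }
    buf.toVector
  }

  override protected def getPartitions: Array[Partition] =
    values(head).grouped(chunkSize).zipWithIndex
      .map { case (chunk, i) => new ListPartition(i, chunk): Partition }
      .toArray

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    split.asInstanceOf[ListPartition[T]].values.iterator
}

// Usage sketch (values must be serializable, since each chunk ships with its task):
//   val list = Node(1, Some(Node(2, Some(Node(3, None)))))
//   val rdd  = new LinkedListRDD(spark.sparkContext, list, chunkSize = 2)
//   rdd.map(_ * 10).collect()   // Array(10, 20, 30)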

Thanks!