Re: Why Spark generates Java code and not Scala?

2019-11-11 Thread Marcin Tustin
Well TIL.

For those also newly informed:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-whole-stage-codegen.html
https://mail-archives.apache.org/mod_mbox/spark-dev/201911.mbox/browser
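
For anyone who wants to see it firsthand, here is a minimal sketch of dumping the
generated code for a query (assuming a Spark 2.x+ session; the toy query below is
only there to give whole-stage codegen something to work on):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.execution.debug._   // adds debugCodegen() to Datasets

  val spark = SparkSession.builder()
    .appName("codegen-peek")
    .master("local[*]")
    .getOrCreate()

  // A trivial query that is eligible for whole-stage code generation.
  val df = spark.range(0, 1000)
    .selectExpr("id * 2 AS doubled")
    .filter("doubled > 10")

  // Prints the Java source Spark generated for this physical plan.
  df.debugCodegen()

The dumped source is plain Java, which Spark then compiles in-memory (with Janino,
as Holden notes below).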


On Sun, Nov 10, 2019 at 7:57 AM Holden Karau  wrote:

> If you look inside the code generation, you'll see that we generate Java code
> and compile it with Janino. For interested folks, the conversation has moved
> over to the dev@ list.
>
> On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin
>  wrote:
>
>> What do you mean by this? Spark is written in a combination of Scala and
>> Java, and then compiled to Java bytecode, as is typical for both Scala and
>> Java. If there's additional bytecode generation happening, it's Java bytecode,
>> because the platform runs on the JVM.
>>
>> On Sat, Nov 9, 2019 at 12:47 PM Bartosz Konieczny <
>> bartkoniec...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> A few days ago I got an intriguing but hard-to-answer question:
>>> "Why Spark generates Java code and not Scala code?"
>>> (https://github.com/bartosz25/spark-scala-playground/issues/18)
>>>
>>> Since I'm not sure about the exact answer, I'd like to ask you to
>>> confirm or refute my thinking. I was looking for the reasons in JIRA and in
>>> the research paper "Spark SQL: Relational Data Processing in Spark"
>>> (http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf)
>>> but found nothing explaining why Java over Scala. The only ticket I found
>>> was about why Scala and not Java, but it concerned data types
>>> (https://issues.apache.org/jira/browse/SPARK-5193).
>>> That's why I'm writing here.
>>>
>>> My guesses about choosing Java code are:
>>> - Java runtime compiler libs are more mature and production-ready than
>>> Scala's - or at least, they were at implementation time
>>> - the Scala compiler tends to be slower than Java's:
>>> https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
>>> - the Scala compiler seems to be more complex, so debugging & maintaining it
>>> would be harder
>>> - it was easier to represent a pure Java OO design than a mixed FP/OO design
>>> in Scala?
>>>
>>> Thank you for your help.
>>>
>>> --
>>> Bartosz Konieczny
>>> data engineer
>>> https://www.waitingforcode.com
>>> https://github.com/bartosz25/
>>> https://twitter.com/waitingforcode
>>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
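
To make Holden's point about Janino concrete, here is a minimal, self-contained
sketch of compiling a string of Java source in-memory with Janino and invoking it
via reflection (the class and method are made up for illustration and are not
Spark's actual generated code):

  import org.codehaus.janino.SimpleCompiler

  // A toy stand-in for the kind of Java source whole-stage codegen emits.
  val javaSource =
    """public class GeneratedAdder {
      |  public int add(int a, int b) { return a + b; }
      |}""".stripMargin

  val compiler = new SimpleCompiler()
  compiler.cook(javaSource)                        // compile the Java source in-memory
  val clazz = compiler.getClassLoader.loadClass("GeneratedAdder")
  val instance = clazz.getDeclaredConstructor().newInstance()
  val result = clazz
    .getMethod("add", classOf[Int], classOf[Int])
    .invoke(instance, Integer.valueOf(1), Integer.valueOf(2))   // returns 3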


Re: Collecting large dataset

2019-09-05 Thread Marcin Tustin
Stop using collect for this purpose. Either continue your further
processing in Spark (maybe you need to use streaming), or sink the data to
something that can accept it (GCS/S3/Azure
Storage/Redshift/Elasticsearch/whatever), and have further processing read
from that sink.
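
A minimal sketch of that pattern, assuming an S3 destination (the bucket and paths
below are hypothetical, and the s3a connector is assumed to be on the classpath):

  import org.apache.spark.sql.{DataFrame, SparkSession}

  val spark = SparkSession.builder().appName("sink-instead-of-collect").getOrCreate()

  // Hypothetical source of the large result that was previously collect()-ed.
  val result: DataFrame = spark.read.parquet("s3a://my-bucket/input/")

  // Instead of result.collect(), persist the full dataset to durable storage...
  result.write.mode("overwrite").parquet("s3a://my-bucket/exports/large-result/")

  // ...and let the driver-side application (or any other consumer) read from the
  // sink rather than pulling every row through the driver's heap.
  val readBack = spark.read.parquet("s3a://my-bucket/exports/large-result/")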

On Thu, Sep 5, 2019 at 2:23 PM Rishikesh Gawade 
wrote:

> Hi.
> I have been trying to collect a large dataset (about 2 GB in size, 30
> columns, more than a million rows) onto the driver side. I am aware that
> collecting such a huge dataset isn't recommended; however, the application
> within which the Spark driver is running requires that data.
> While collecting the dataframe, the Spark job throws an error:
> TaskResultLost (result set lost from BlockManager).
> I searched for solutions around this and set the following properties:
> spark.blockManager.port, spark.driver.blockManager.port, and maxResultSize
> to 0 (unlimited), and the application within which the Spark driver is
> running has 28 GB of max heap size.
> And yet the error arises again.
> There are 22 executors running in my cluster.
> Is there any config/necessary step that I am missing before collecting
> such large data?
> Or is there any other effective approach that would guarantee collecting
> such large data without failure?
>
> Thanks,
> Rishikesh
>


Re: How to combine all rows into a single row in DataFrame

2019-08-19 Thread Marcin Tustin
It sounds like you want to aggregate your rows in some way. I actually just
wrote a blog post about that topic:
https://medium.com/@albamus/spark-aggregating-your-data-the-fast-way-e37b53314fad
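
As a rough sketch of the reduce idea raised in the question below (the per-row
serializer is a placeholder for the existing protobuf logic, and plain byte-array
concatenation stands in for whatever custom merge is actually needed):

  import org.apache.spark.sql.{Dataset, Encoders, Row, SparkSession}

  val spark = SparkSession.builder().appName("combine-serialized-rows").getOrCreate()

  // Placeholder for the real per-row protobuf serialization done inside map().
  def serializeRow(row: Row): Array[Byte] = row.mkString("|").getBytes("UTF-8")

  val df = spark.read.parquet("hdfs:///path/to/input")   // hypothetical input
  val serialized: Dataset[Array[Byte]] = df.map(row => serializeRow(row))(Encoders.BINARY)

  // reduce pairwise-combines the values and returns a single Array[Byte] to the
  // driver; swap ++ for any custom merge logic.
  val combined: Array[Byte] = serialized.reduce((a, b) => a ++ b)

Note that this still brings the combined bytes back to the driver, so it only makes
sense when the merged message comfortably fits in driver memory.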

On Mon, Aug 19, 2019 at 4:24 PM Rishikesh Gawade 
wrote:

> Hi All,
> I have been trying to serialize a dataframe in protobuf format. So far, I
> have been able to serialize every row of the dataframe by using the map
> function, with the serialization logic inside the lambda function. The
> resultant dataframe consists of rows in serialized format (1 row = 1
> serialized message).
> I wish to form a single protobuf-serialized message for this dataframe, and
> in order to do that I need to combine all the serialized rows using some
> custom logic, very similar to the one used in the map operation.
> I am assuming that this would be possible by using the reduce operation on
> the dataframe; however, I am unaware of how to go about it.
> Any suggestions/approach would be much appreciated.
>
> Thanks,
> Rishikesh
>