Re: disable spark disk cache

2019-03-03 Thread Hien Luu
Hi Andrey, Below is the description of MEMORY_ONLY from https://spark.apache.org/docs/latest/rdd-programming-guide.html "Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're
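A minimal sketch (as run in spark-shell; paths are hypothetical) of keeping data memory-only, so partitions that do not fit are recomputed rather than spilled to disk:

  import org.apache.spark.storage.StorageLevel

  // MEMORY_ONLY: no disk spill; partitions that don't fit in memory are recomputed when needed.
  val events = spark.sparkContext.textFile("/data/events")      // hypothetical path
  events.persist(StorageLevel.MEMORY_ONLY)

  // The same storage level can be passed when persisting a DataFrame/Dataset.
  val eventsDF = spark.read.parquet("/data/events.parquet")     // hypothetical path
  eventsDF.persist(StorageLevel.MEMORY_ONLY)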

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-28 Thread Hien Luu
|-- name: string (nullable = true) > ||-- age: integer (nullable = false) > > > On Wed, Feb 27, 2019 at 6:51 PM Hien Luu wrote: > >> Thanks for looking into this. Does this mean string fields should always >> be nullable? >> >> You are right that the re

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-27 Thread Hien Luu
ll"]},{"name":"age","type":"int"}]} > > scala> dfKV.select(from_avro('value, avroTypeStruct)).show > +-+ > |from_avro(value, struct)| > +-----+ > | [Mary Jane, 25]|

to_avro and from_avro not working with struct type in spark 2.4

2019-02-26 Thread Hien Luu
Hi, I ran into a pretty weird issue with to_avro and from_avro where it was not able to parse the data in a struct correctly. Please see the simple and self-contained example below. I am using Spark 2.4. I am not sure if I missed something. This is how I start the spark-shell on my Mac:
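For context, a minimal sketch of the kind of struct round trip being discussed, as it would be run in spark-shell (the --packages coordinates are an assumption for Spark 2.4 / Scala 2.11; the schema string follows the fragment quoted in the follow-up above):

  // spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0   (coordinates assumed)
  import org.apache.spark.sql.avro.{from_avro, to_avro}
  import org.apache.spark.sql.functions.struct
  import spark.implicits._

  val df = Seq(("Mary Jane", 25)).toDF("name", "age")

  // Avro schema for the struct; the union with "null" mirrors the nullable string field.
  val avroTypeStruct =
    """{"type":"record","name":"struct","fields":[
      |  {"name":"name","type":["string","null"]},
      |  {"name":"age","type":"int"}]}""".stripMargin

  // Serialize the struct to Avro bytes, then decode it back.
  val dfKV = df.select(to_avro(struct($"name", $"age")).alias("value"))
  dfKV.select(from_avro($"value", avroTypeStruct)).show(false)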

Re: How to register custom structured streaming source

2018-07-22 Thread Hien Luu
Hi Farshid, Take a look at this example on github - https://github.com/hienluu/structured-streaming-sources. Cheers, Hien On Thu, Jul 12, 2018 at 12:52 AM Farshid Zavareh wrote: > Hello. > > I need to create a custom streaming source by extending *FileStreamSource*. > The idea is to override

Re: Writing custom Structured Streaming receiver

2018-03-04 Thread Hien Luu
Finally got a toy version of a Structured Streaming DataSource V2 source working with Apache Spark 2.3. Tested locally and on Databricks community edition. Source code is here - https://github.com/hienluu/wikiedit-streaming

Re: Writing custom Structured Streaming receiver

2018-03-03 Thread Hien Luu
I finally got around to implementing a custom structured streaming receiver (source) to read Wikipedia edit events from the IRC server. It works fine locally as well as in spark-shell on my laptop. However, it failed with the following exception when running in Databricks community edition. It

RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2018-01-06 Thread Hien Luu
Hi Kant, I am not sure whether you have come up with a solution yet, but the following works for me (in Scala): val emp_info = """ [ {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]}, {"name": "bar", "address": {"state": "OH",
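A minimal sketch of one approach that works in Spark 2.2 (the second record's remaining fields and the flattened column selection are filled in for illustration, since the message is truncated here):

  import org.apache.spark.sql.functions.explode
  import spark.implicits._

  val emp_info = """[
    {"name": "foo", "address": {"state": "CA", "country": "USA"},
     "docs": [{"subject": "english", "year": 2016}]},
    {"name": "bar", "address": {"state": "OH", "country": "USA"},
     "docs": [{"subject": "math", "year": 2017}]}
  ]"""                                               // second record completed for illustration

  // Parse the JSON array into a DataFrame, then pull out the columns of interest,
  // exploding the docs array into one row per document.
  val empDF = spark.read.json(Seq(emp_info).toDS)
  empDF
    .select($"name", $"address.state", $"address.country", explode($"docs").alias("doc"))
    .select($"name", $"state", $"country", $"doc.subject", $"doc.year")
    .show(false)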

Re: Writing custom Structured Streaming receiver

2017-11-28 Thread Hien Luu
Cool. Thanks nezhazheng. I will give it a shot.

Re: Writing custom Structured Streaming receiver

2017-11-20 Thread Hien Luu
Hi TD, I looked at the DataStreamReader class and it looks like we can specify an FQCN as a source (provided that it implements trait Source). The DataSource.lookupDataSource function will try to load this FQCN during the creation of a DataSource object instance inside DataStreamReader.load(). Will
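For illustration, a hedged sketch of what this looks like from the caller's side ("com.example.streaming.WikiEditSource" and its option are hypothetical placeholders for whatever class implements the source provider):

  // format() accepts a fully qualified class name; DataSource.lookupDataSource
  // resolves it when load() is called, provided the class is on the classpath
  // and implements the streaming source provider contract.
  val edits = spark.readStream
    .format("com.example.streaming.WikiEditSource")   // hypothetical FQCN
    .option("channel", "#en.wikipedia")               // options are source-specific
    .load()

  edits.writeStream
    .format("console")
    .start()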

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Hien Luu
r local repo? > > Best, RS > On May 2, 2016 8:51 PM, "Hien Luu" <hien...@gmail.com> wrote: > >> Hi all, >> >> I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0. >> It kept getting "Operation timed out" while building Spark P

Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Hien Luu
Hi all, I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0. It kept getting "Operation timed out" while building the Spark Project Docker Integration Tests module (see the error below). Has anyone run into this problem before? If so, how did you resolve or work around it? [INFO]

Re: Spark Streaming updateStateByKey Implementation

2015-11-09 Thread Hien Luu
n > from you. See StateDStream.scala, there is everything you need to know. > > On Fri, Nov 6, 2015 at 6:25 PM Hien Luu <h...@linkedin.com> wrote: > >> Hi, >> >> I am interested in learning about the implementation of >> updateStateByKey. Does anyone kn

Spark Streaming updateStateByKey Implementation

2015-11-06 Thread Hien Luu
Hi, I am interested in learning about the implementation of updateStateByKey. Does anyone know of a JIRA or design doc I can read? I did a quick search and couldn't find much info on the implementation. Thanks in advance, Hien
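For context, a minimal usage sketch of updateStateByKey (a running word count; the input source and checkpoint path are hypothetical). The implementation itself lives in StateDStream.scala, as pointed out in the reply above:

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/state-checkpoint")              // stateful DStreams need checkpointing

  val words = ssc.socketTextStream("localhost", 9999)  // hypothetical input
    .flatMap(_.split(" "))
    .map(word => (word, 1))

  // The update function folds this batch's values into the previous state for each key.
  val runningCounts = words.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0))
  }

  runningCounts.print()
  ssc.start()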

Re: Spark job workflow engine recommendations

2015-10-07 Thread Hien Luu
>> >> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> In my opinion, choosing some particular project among its peers should >>> leave enough room for future growth (which may come faster than you >>> initially think). >>

Re: Spark job workflow engine recommendations

2015-08-11 Thread Hien Luu
initially think). Cheers On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu h...@linkedin.com wrote: Scalability is a known issue due to the current architecture. However, this will only be applicable if you run more than 20K jobs per day. On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu yuzhih...@gmail.com wrote

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Hien Luu
This blog outlines a few things that make Spark faster than MapReduce - https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html On Fri, Aug 7, 2015 at 9:13 AM, Muler mulugeta.abe...@gmail.com wrote: Consider the classic word count application over a 4 node cluster with a sizable

Re: Spark job workflow engine recommendations

2015-08-07 Thread Hien Luu
Looks like Oozie can satisfy most of your requirements. On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm looking for open source workflow tools/engines that allow us to schedule spark jobs on a datastax cassandra cluster. Since there are tonnes of