Re: NullPointerException error when in repartition

2018-04-12 Thread Junfeng Chen
Hi, I know it, but my purpose is to transform the JSON strings in a Dataset into a Dataset of rows, while spark.readStream can only read JSON files from a specified path. https://stackoverflow.com/questions/48617474/how-to-convert-json-dataset-to-dataframe-in-spark-structured-streaming gives an essential
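A minimal sketch of that conversion (Scala; the schema, column, and variable names are illustrative assumptions, not from the original post):

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    // assumed schema of the incoming JSON documents
    val schema = new StructType()
      .add("id", StringType)
      .add("ts", TimestampType)

    // jsonStrings: a Dataset[String] of raw JSON (placeholder input);
    // parse each string and flatten the resulting struct into columns
    val parsed = jsonStrings.toDF("value")
      .select(from_json(col("value"), schema).as("data"))
      .select("data.*")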

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-12 Thread Haoyuan Li
This link should be helpful: https://alluxio.org/docs/1.7/en/Running-Spark-on-Alluxio.html Best regards, Haoyuan (HY) alluxio.com | alluxio.org | powered by Alluxio On Thu, Apr 12, 2018 at 6:32 PM, jb44

Re: Do partition by and order by work only in the stateful case?

2018-04-12 Thread Tathagata Das
The traditional SQL windows with `over` are not supported in streaming. Only time-based windows, that is, `window("timestamp", "10 minutes")`, are supported in streaming. On Thu, Apr 12, 2018 at 7:34 PM, kant kodali wrote: > Hi All, > > Do partition by and order by work only
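For reference, a minimal sketch of the supported time-based windowing (Scala; the column names and the streaming DataFrame `events` are assumptions):

    import org.apache.spark.sql.functions.{col, window}

    // count events per id over 10-minute event-time windows
    val counts = events
      .groupBy(window(col("timestamp"), "10 minutes"), col("id"))
      .count()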

Do partition by and order by work only in the stateful case?

2018-04-12 Thread kant kodali
Hi All, Do partition by and order by work only in the stateful case? For example: select row_number() over (partition by id order by timestamp) from table gives me SEVERE: Exception occured while submitting the query: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException:

Spark LOCAL mode and external jar (extraClassPath)

2018-04-12 Thread jb44
I'm running Spark in LOCAL mode and trying to get it to talk to Alluxio. I'm getting the error: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. The cause of this error is apparently that Spark cannot find the Alluxio client jar in its classpath. I have looked at the
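Since spark.driver.extraClassPath is a launch-time setting, one sketch of a workaround (the jar path and app name are placeholders, not verified against this setup) is to pass the client jar when submitting; in local mode the executors run inside the driver JVM, so the driver classpath is the one that matters:

    spark-submit \
      --master "local[*]" \
      --conf spark.driver.extraClassPath=/path/to/alluxio-client.jar \
      your-app.jar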

Re: NullPointerException error when in repartition

2018-04-12 Thread Tathagata Das
Have you read through the documentation of Structured Streaming? https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html One of the basic mistakes you are making is defining the dataset with `spark.read()`. You define a streaming Dataset with `spark.readStream()`. On
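For reference, a minimal streaming read might look like this (Scala sketch; the schema and path are placeholders — streaming file sources require an explicit user-supplied schema):

    // jsonSchema: a StructType describing the files (assumed)
    val streamingDs = spark.readStream
      .schema(jsonSchema)           // required for streaming file sources
      .json("/path/to/json/dir")    // placeholder directory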

Re: Live Stream Code Reviews :)

2018-04-12 Thread Gourav Sengupta
Hi, This is definitely one of the best messages ever in this group. The videos are absolutely fantastic in case anyone is trying to learn about contributing to Spark; I have been through one of them. Just trying to repeat the steps in the video (without of course doing anything really stupid)

Re: Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Ah yes, really good point, 11am Pacific :) On Thu, Apr 12, 2018 at 1:01 PM, Marco Mistroni wrote: > PST I believe, like last time > Works out 9pm BST & 10pm CET if I'm correct > > On Thu, Apr 12, 2018, 8:47 PM Matteo Olivi wrote: >> Hi, >> 11 am

Re: Live Stream Code Reviews :)

2018-04-12 Thread Marco Mistroni
PST I believe, like last time. Works out 9pm BST & 10pm CET if I'm correct. On Thu, Apr 12, 2018, 8:47 PM Matteo Olivi wrote: > Hi, > 11 am in which timezone? > > On Thu, 12 Apr 2018, 21:23 Holden Karau wrote: > >> Hi Y'all, >> >> If you're

Re: Live Stream Code Reviews :)

2018-04-12 Thread Matteo Olivi
Hi, 11 am in which timezone? On Thu, 12 Apr 2018, 21:23 Holden Karau wrote: > Hi Y'all, > > If you're interested in learning more about how the development process in > Apache Spark works, I've been doing a weekly live-streamed code review most > Fridays at 11am. This

Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Hi Y'all, If you're interested in learning more about how the development process in Apache Spark works, I've been doing a weekly live-streamed code review most Fridays at 11am. This week's will be on Twitch/YouTube ( https://www.twitch.tv/holdenkarau / https://www.youtube.com/watch?v=vGVSa9KnD80 ).

Re: Spark Kubernetes Volumes

2018-04-12 Thread Anirudh Ramanathan
There's a JIRA, SPARK-23529, that deals with mounting hostPath volumes. I propose we extend that PR/JIRA to encompass all the different volume types and allow mounting them into the driver/executors. On Thu, Apr 12, 2018 at 10:55 AM Yinan Li

Re: Spark Kubernetes Volumes

2018-04-12 Thread Yinan Li
Hi Marius, Spark on Kubernetes does not yet support mounting user-specified volumes natively. But mounting volumes is supported in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator. Please see

Re: Spark is only using one worker machine when more are available

2018-04-12 Thread Gourav Sengupta
Hi, Just for the sake of clarity, can you please give the full statement for reading the data from the largest table? I mean not the programmatic one, but the one which has the full statement in it. Regards, Gourav Sengupta On Thu, Apr 12, 2018 at 7:19 AM, Jhon Anderson Cardenas Diaz <

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread surender kumar
The question was not about what kind of sampling, but about random sampling per user. There's no value associated with the items to create strata. If you read Matteo's answer, that's the way to go about it. -Surender On Thursday, 12 April, 2018, 5:49:43 PM IST, Gourav Sengupta

Spark Kubernetes Volumes

2018-04-12 Thread Marius
Hey, I have a question regarding the Spark on Kubernetes feature. I would like to mount a pre-populated Kubernetes volume into the execution pods of Spark. One of my tools that I invoke using Spark's pipe command requires these files to be available on a POSIX-compatible FS, and they are
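For context, RDD.pipe forwards each partition's records to an external process over stdin/stdout, roughly like this (Scala sketch; the tool path and records are placeholders):

    // each element is written to the tool's stdin, one per line;
    // the tool's stdout lines become the elements of the result RDD
    val out = sc.parallelize(Seq("rec1", "rec2")).pipe("/path/to/tool")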

Re: Spark is only using one worker machine when more are available

2018-04-12 Thread Jhon Anderson Cardenas Diaz
Hi. On Spark standalone I think you cannot specify the number of worker machines to use, but you can achieve that in this way: https://stackoverflow.com/questions/39399205/spark-standalone-number-executors-cores-control . For example, if you want your jobs to run on the 10 machines using all
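A sketch of the settings that link discusses (the property names are standard Spark configuration; the numbers are illustrative only):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.cores.max", "40")      // total cores the app may take cluster-wide
      .config("spark.executor.cores", "4")  // cores per executor, so executors spread across workers
      .getOrCreate()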

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Gourav Sengupta
Hi, There is an option for stratified sampling available in Spark: https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling . Also, there is a method called randomSplit which may be called on DataFrames in case we want to split them into training and test data. Please let
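For the randomSplit part, a minimal sketch (Scala; the ratios and seed are arbitrary, and `df` is a placeholder DataFrame):

    // split a DataFrame into training and test sets
    val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)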

unsubscribe

2018-04-12 Thread varma dantuluri
-- Regards, Varma Dantuluri

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread surender kumar
Thanks Matteo, this should work! -Surender On Thursday, 12 April, 2018, 1:13:38 PM IST, Matteo Cossu wrote: I don't think it's trivial. Anyway, the naive solution would be a cross join between users x items. But this can be very, very expensive. I've encountered

Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

2018-04-12 Thread Szuromi Tamás
Hi Igor, Have you started the external shuffle service manually? Cheers 2018-04-12 10:48 GMT+02:00 igor.berman : > Hi, > any input on whether this is expected: > the driver starts but is unable to connect to the external shuffle service on one of > the nodes (no matter what the
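For reference, the application only opts in to the service; the service process itself must already be running on each node (a sketch — the sbin script ships with the standard Spark distribution, but check your deployment):

    import org.apache.spark.SparkConf

    // app side: have executors register shuffle files with the external service
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
    // node side (outside the app): $SPARK_HOME/sbin/start-shuffle-service.sh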

Re: NullPointerException error when in repartition

2018-04-12 Thread Junfeng Chen
Hi Tathagata, I have tried structured streaming, but the line > Dataset rowDataset = spark.read().json(jsondataset); always throws > Queries with streaming sources must be executed with writeStream.start(). But what I need to do in this step is only transforming JSON string data to a Dataset.

Driver aborts on Mesos when unable to connect to one of external shuffle services

2018-04-12 Thread igor.berman
Hi, any input on whether this is expected: the driver starts but is unable to connect to the external shuffle service on one of the nodes (no matter what the reason is). This makes the framework go to Inactive mode in the Mesos UI. However, it seems that the driver doesn't exit and continues to execute tasks (or tries to).

[Structured Streaming] File source, Parquet format: use of the mergeSchema option.

2018-04-12 Thread Gerard Maas
Hi, I'm looking into the Parquet format support for the File source in Structured Streaming. The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1] What would be the practical use of that in a streaming context? In its batch counterpart,

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Matteo Cossu
I don't think it's trivial. Anyway, the naive solution would be a cross join between users x items. But this can be very, very expensive. I've encountered a similar problem once; here is how I solved it: - create a new RDD with (itemID, index) where the index is a unique integer between 0 and
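The first step described above maps naturally onto zipWithIndex (Scala sketch; `items` is a placeholder RDD, and the rest of the recipe is cut off in the archive):

    // assign every item a unique, contiguous index in [0, n)
    val indexed = items.zipWithIndex()   // RDD[(Item, Long)]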

Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
It's not very surprising that doing this sort of RDD to DF conversion inside DStream.foreachRDD has weird corner cases like this. In fact, you are going to have additional problems with partial parquet files (when there are failures) in this approach. I strongly suggest that you use Structured

Re: Does structured streaming support Spark Kafka Direct?

2018-04-12 Thread Tathagata Das
The parallelism is the same for Structured Streaming. In fact, the Kafka Structured Streaming source is based on the same principle as the DStream Kafka Direct approach, hence it has very similar behavior. On Tue, Apr 10, 2018 at 11:03 PM, SRK wrote: > hi, > > We have code based on
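A minimal Kafka source in Structured Streaming, for comparison (Scala sketch; the broker addresses and topic are placeholders):

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
      .option("subscribe", "topic1")
      .load()
    // as with the direct DStream, Spark partitions map 1:1 to Kafka partitions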

Fwd: pyspark: APID is coming as null

2018-04-12 Thread nirav nishith
insert_push_body_df = spark.sql('''select CASE WHEN t2.offer_id='' then NULL else t2.offer_id end as offer_id,\
    CASE WHEN t2.content_set_id='' then NULL else t2.content_set_id end as content_set_id,\
    CASE WHEN t2.post_id='' then NULL else t2.post_id end as post_id,t2.nuid,t2.apid,\

Re: How to use disk instead of just InMemoryRelation when use JDBC datasource in SPARKSQL?

2018-04-12 Thread Takeshi Yamamuro
You want to use `Dataset.persist(StorageLevel.MEMORY_AND_DISK)`? On Thu, Apr 12, 2018 at 1:12 PM, Louis Hust wrote: > We want to extract data from MySQL and do the calculation in Spark SQL. > The SQL explain looks like below. > > > REGIONKEY#177,N_COMMENT#178] PushedFilters: [],
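A sketch of that suggestion end to end (Scala; the JDBC URL and table are placeholders):

    import org.apache.spark.storage.StorageLevel

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")  // placeholder
      .option("dbtable", "lineitem")               // placeholder
      .load()

    // keep partitions that don't fit in memory on disk instead of recomputing
    df.persist(StorageLevel.MEMORY_AND_DISK)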

Problem running Kubernetes example v2.2.0-kubernetes-0.5.0

2018-04-12 Thread Rico Bergmann
Hi! I was trying to get the SparkPi example running using the spark-on-k8s distro from kubespark. But I get the following error: + /sbin/tini -s -- driver [FATAL tini (11)] exec driver failed: No such file or directory Did anyone get the example running on a Kubernetes cluster? Best, Rico.