Re: Spark job crashes intermittently with FileNotFoundException during shuffle

2017-07-24 Thread Martin Peng
Could anyone shed some light on this issue? Thanks, Martin. 2017-07-21 18:58 GMT-07:00 Martin Peng: > Hi, > > I have several Spark jobs, including both batch jobs and streaming jobs, to > process the system logs and analyze them. We are using Kafka as the pipeline > to

Conflict resolution for data in Spark Streaming

2017-07-24 Thread Biplob Biswas
Hi, I have a situation where updates are coming from 2 different data sources, and this data at times arrives in the same batch, as defined by the streaming context's batch duration of 500 ms (recommended for Spark according to the documentation). Now that is not the problem; the problem is that when
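
One common way to reconcile two updates for the same key that land in the same micro-batch is to reduce by key and keep the newer record. A minimal sketch, assuming a hypothetical Update type carrying an event timestamp:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type; ts is the event timestamp used to pick a winner.
case class Update(key: String, ts: Long, value: String)

// Within each micro-batch, keep only the most recent update per key.
def resolve(updates: DStream[Update]): DStream[(String, Update)] =
  updates
    .map(u => (u.key, u))
    .reduceByKey((a, b) => if (a.ts >= b.ts) a else b)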

Re: how to convert the binary from Kafka to string please

2017-07-24 Thread 萝卜丝炒饭
Hi, Would you write some sample code please? I checked the select method, but I do not know how to cast it or how to set value.deserializer. Thanks ---Original--- From: "Szuromi Tamás" Date: 2017/7/24 16:32:52 To: "萝卜丝炒饭"<1427357...@qq.com>; Cc:

NullPointer when collecting a dataset grouped by a column

2017-07-24 Thread Aseem Bansal
I was doing this: dataset.groupBy("column").collectAsList(). It worked for a small dataset, but for a bigger dataset I got a NullPointerException that went down into Spark's code. Is this known behaviour? Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org
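
For what it's worth, groupBy on its own returns a RelationalGroupedDataset rather than a collectible Dataset, so an aggregate is needed before collecting. A minimal sketch, assuming the goal is to gather the rows of each group (the column names are illustrative):

import org.apache.spark.sql.functions.collect_list

// Aggregate each group into a concrete column before collecting;
// "othercol" is an illustrative column name.
val grouped = dataset
  .groupBy("column")
  .agg(collect_list("othercol").as("values"))

// Collect only once the aggregated result is small enough for the driver.
val rows = grouped.collectAsList()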

Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Thanks Pierce. That compilation looks very cool. Now, as always, the question is what is the best fit for the job at hand, and I don't think there is a single answer. Dr Mich Talebzadeh

Re: How to configure Spark with Java

2017-07-24 Thread Patrik Medvedev
What exactly do you need? Basically, you need to add the Spark libraries to your pom. Mon, 24 Jul 2017 at 6:22, amit kumar singh: > Hello everyone > > I want to use Spark with the Java API > > Please let me know how I can configure it > > > Thanks > A
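
A minimal sketch of what that pom.xml addition might look like; the Spark version (2.1.0) and Scala suffix (_2.11) are assumptions and should match your cluster:

<dependencies>
  <!-- Core Spark; version and Scala suffix are assumptions. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
  </dependency>
  <!-- Spark SQL for the Dataset/DataFrame Java API. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
  </dependency>
</dependencies>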

Re: Is there a way to run Spark SQL through REST?

2017-07-24 Thread Sumedh Wale
Yes, using the new Spark Structured Streaming you can keep submitting streaming jobs against the same SparkContext in different requests (or you can create a new SparkContext if required in a request). The SparkJob implementation will get a handle to the SparkContext, which

Re: Spark 2.0 and Oracle 12.1 error

2017-07-24 Thread Cassa L
Hi, another question related to this: has anyone tried transactions using Oracle JDBC and Spark? How do you do it, given that the code will be distributed across workers? Do I combine certain queries to make sure they don't get distributed? Regards, Leena On Fri, Jul 21, 2017 at 1:50 PM, Cassa L
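
One common pattern is to scope each transaction to a single partition with foreachPartition, so no transaction ever spans executors. A minimal sketch, where the dataset name, connection URL, credentials and INSERT statement are all placeholders:

import java.sql.DriverManager
import org.apache.spark.sql.Row

// One connection, and therefore one transaction, per partition.
dataset.foreachPartition { (rows: Iterator[Row]) =>
  val conn = DriverManager.getConnection(
    "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password")
  conn.setAutoCommit(false) // group the partition's writes into one transaction
  try {
    val stmt = conn.prepareStatement("INSERT INTO target_table VALUES (?)")
    rows.foreach { row =>
      stmt.setString(1, row.getString(0))
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}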

Re: how to convert the binary from Kafka to string please

2017-07-24 Thread Michael Armbrust
There are end-to-end examples of using Kafka in this blog: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html On Sun, Jul 23, 2017 at 7:44 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all > > I want to change the binary from
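
For reference, the cast described in that post looks roughly like this; the broker address and topic are placeholders:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder
  .option("subscribe", "mytopic")                  // placeholder
  .load()

// key and value arrive as binary; CAST turns them into strings.
val strings = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")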

Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Now they are bringing up Ampool with Spark for real-time analytics. Dr Mich Talebzadeh

Re: using Kudu with Spark

2017-07-24 Thread Pierce Lamb
Hi Mich, I tried to compile a list of datastores that connect to Spark and provide a bit of context. The list may help you in your research: https://stackoverflow.com/a/39753976/3723346 I'm going to add Kudu, Druid and Ampool from this thread. I'd like to point out SnappyData

Parquet error while saving in HDFS

2017-07-24 Thread unk1102
Hi, I am getting the following error and I am not sure why; it seems like a race condition, but I don't use any threads. Just one thread, which owns the Spark context, is writing to HDFS with one Parquet partition. I am using Scala 2.10 and Spark 1.5.1. Please guide. Thanks in advance. java.io.IOException: The file

Real-world Spark code

2017-07-24 Thread Adaryl Wakefield
Does anybody know of publicly available GitHub repos of real-world Spark applications written in Scala? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net

Re: how to set the assignee in JIRA please?

2017-07-24 Thread 萝卜丝炒饭
Another question about contributing: after a pull request is created, what should the creator do next? Who will close it? ---Original--- From: "Hyukjin Kwon" Date: 2017/7/25 09:15:49 To: "Marcelo Vanzin"; Cc:

how to set the assignee in JIRA please?

2017-07-24 Thread 萝卜丝炒饭
Hi all, If I want to work on an issue registered in JIRA, how do I set the assignee to myself, please? Thanks

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Marcelo Vanzin
We don't generally set assignees. Submit a PR on GitHub and the PR will be linked on JIRA; if your PR is merged, then the bug is assigned to you. On Mon, Jul 24, 2017 at 5:57 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > If I want to work on an issue registered in JIRA, how do I set

Re: how to set the assignee in JIRA please?

2017-07-24 Thread 萝卜丝炒饭
Hi Vanzin, Kwon, thanks for your help. ---Original--- From: "Hyukjin Kwon" Date: 2017/7/25 09:04:44 To: "Marcelo Vanzin"; Cc: "user";"萝卜丝炒饭"<1427357...@qq.com>; Subject: Re: how to set the assignee in JIRA please?

Re: how to convert the binary from Kafka to string please

2017-07-24 Thread 萝卜丝炒饭
Hi Armbrust, It works well. Thanks. ---Original--- From: "Michael Armbrust" Date: 2017/7/25 04:58:44 To: "萝卜丝炒饭"<1427357...@qq.com>; Cc: "user"; Subject: Re: how to convert the binary from Kafka to string please There are end to end

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
However, I see some JIRAs get assigned to someone from time to time. Were those mistakes, or may I ask when someone gets assigned? When I started contributing to Spark a few years ago, I was confused by this, and I am pretty sure some people are still confused. I usually say something like

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
I see. In any event, it sounds like it is not required in order to work on an issue - http://spark.apache.org/contributing.html : "... There is no need to be the Assignee of the JIRA to work on it, though you are welcome to comment that you have begun work.." I was just wondering out of curiosity. It should

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
I think that's described in the link I used - http://spark.apache.org/contributing.html. On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote: Another question about contributing: after a pull request is created, what should the creator do next? Who will close it? ---Original---

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Marcelo Vanzin
On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon wrote: > However, I see some JIRAs get assigned to someone from time to time. Were those > mistakes, or may I ask when someone gets assigned? I'm not sure if there are any guidelines on when to assign; since there has been

How to list only errors for a stage

2017-07-24 Thread jeff saremi
On the Spark status UI you can click Stages in the menu and see Active (and completed) stages. For an active stage, you can see Succeeded/Total and a count of failed tasks in parentheses. I'm looking for a way to go straight to the failed tasks and list the errors. Currently I must go into

Re: Spark job crashes intermittently with FileNotFoundException during shuffle

2017-07-24 Thread 周康
You can also check whether there is enough space left on the executor node to store the shuffle files. 2017-07-25 13:01 GMT+08:00 周康 : > First, Spark handles task failures, so if the job ended normally this error > can be ignored. > Second, when using BypassMergeSortShuffleWriter,

Re: Spark job crashes intermittently with FileNotFoundException during shuffle

2017-07-24 Thread 周康
First, Spark handles task failures, so if the job ended normally this error can be ignored. Second, when using BypassMergeSortShuffleWriter, Spark first writes the data file and then writes an index file. You can check for "Failed to delete temporary index file at" or "fail to rename file" in the related executor

Re: Spark job crashes intermittently with FileNotFoundException during shuffle

2017-07-24 Thread 周康
After looking into FileOutputStream I saw this in the Javadoc: "If the file exists but is a directory rather than a regular file, does not exist but cannot be created, or cannot be opened for any other reason then a FileNotFoundException is thrown." So you can check the executor node first (maybe no

Re: How to list only errors for a stage

2017-07-24 Thread 周康
Maybe you can click the Status column header in the Tasks section; failed tasks will then appear first. 2017-07-25 10:02 GMT+08:00 jeff saremi : > On the Spark status UI you can click Stages in the menu and see Active > (and completed) stages. For an active stage, you can see

Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Sounds like Druid can do the same? Dr Mich Talebzadeh

Re: custom joins on dataframe

2017-07-24 Thread Jörn Franke
It might be faster if you add a column with the hash result to the dataframe before the join and then simply do a normal join on that column. > On 22. Jul 2017, at 17:39, Stephen Fletcher wrote: > > Normally a family of joins (left, right outer, inner) are
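
A minimal sketch of that suggestion, where "a", "b" and the DataFrame names are illustrative:

import org.apache.spark.sql.functions.{col, hash}

// Materialize the hash once as a column on each side, then join on it
// like any other column.
val leftHashed  = left.withColumn("joinHash", hash(col("a"), col("b")))
val rightHashed = right.withColumn("joinHash", hash(col("a"), col("b")))

val joined = leftHashed.join(rightHashed, Seq("joinHash"))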

Re: using Kudu with Spark

2017-07-24 Thread Jörn Franke
I guess you have to find out yourself with experiments. Cloudera has some benchmarks, but it always depends on what you test, your data volume, and what is meant by "fast". It is also more than a file format, with servers that communicate with each other etc. - more complexity. Of course there are

Union large number of DataFrames

2017-07-24 Thread julio . cesare
Hi there! Let's imagine I have a large number of very small DataFrames with the same schema (a list of DataFrames: allDFs) and I want to create one large Dataset from them. I have been trying this: -> allDFs.reduce((a, b) => a.union(b)) And after that, this one: -> allDFs.reduce((a, b) =>
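
For what it's worth, a pairwise reduce builds a logical plan that nests one union per DataFrame, which can get expensive for a long list. One workaround is to union the underlying RDDs in a single step; a sketch, assuming all DataFrames share the schema of the first:

// SparkContext.union combines all RDDs at once instead of building a
// deeply nested chain of union nodes in the plan.
val schema = allDFs.head.schema
val unioned = spark.createDataFrame(
  spark.sparkContext.union(allDFs.map(_.rdd)),
  schema)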

using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Hi, has anyone had experience using Kudu for faster analytics with Spark? How efficient is it compared to using HBase and other traditional storage for fast-changing data, please? Any insight will be appreciated. Thanks, Dr Mich Talebzadeh

Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Yes, this storage layer is something I have been investigating in my own lab for mixed loads such as a Lambda Architecture. It offers the convenience of a columnar RDBMS (much like Sybase IQ). Kudu tables look like those in SQL relational databases, each with a primary key made up of one or more
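
For reference, reading a Kudu table from Spark with the kudu-spark connector looks roughly like this; the master address and table name are placeholders:

// Read a Kudu table as a DataFrame via the kudu-spark connector.
val df = spark.read
  .options(Map(
    "kudu.master" -> "kudu-master:7051", // placeholder
    "kudu.table"  -> "my_table"))        // placeholder
  .format("org.apache.kudu.spark.kudu")
  .load()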

Re: how to convert the binary from Kafka to string please

2017-07-24 Thread Szuromi Tamás
Hi, You can cast it to a string in a select, or you can set the value.deserializer parameter for Kafka. Cheers, 2017-07-24 4:44 GMT+02:00 萝卜丝炒饭 <1427357...@qq.com>: > Hi all > > I want to convert the binary from Kafka to a string. Would you help me > please? > > val df =
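
Note that the value.deserializer route applies to the DStream-based Kafka integration rather than the DataFrame reader, which always delivers bytes. A sketch of the consumer parameters for spark-streaming-kafka-0-10, with placeholder broker and group id:

import org.apache.kafka.common.serialization.StringDeserializer

// Consumer parameters for the direct DStream Kafka integration.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "host1:9092", // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-group")   // placeholder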

RE: Is there a difference between these aggregations

2017-07-24 Thread yohann jardin
Seen directly in the code:

/**
 * Aggregate function: returns the average of the values in a group.
 * Alias for avg.
 *
 * @group agg_funcs
 * @since 1.4.0
 */
def mean(e: Column): Column = avg(e)

That's the same when the argument is the column name. So there is no difference between

Re: Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
Is there any difference between using agg or select to do the aggregations? On Mon, Jul 24, 2017 at 5:08 PM, yohann jardin wrote: > Seen directly in the code: > > /** > * Aggregate function: returns the average of the values in a group. > * Alias for avg. > *
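
One way to check is to compare the plans of the two forms; for a simple global aggregate like this they coincide:

import org.apache.spark.sql.functions.avg

// Both express the same global aggregate over the whole Dataset;
// explain(true) prints the analyzed and optimized plans for comparison.
dataset.select(avg("mycol")).explain(true)
dataset.agg(avg("mycol")).explain(true)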

Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
Hi, On reading a complex JSON, Spark infers the schema as follows:

root
 |-- header: struct (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- deviceObjects: array (nullable = true)
 |    |

Re: Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
To avoid confusion: the query I am referring to above is over some numeric element inside *a: struct (nullable = true)*. On Mon, Jul 24, 2017 at 4:04 PM, Patrick wrote: > Hi, > > On reading a complex JSON, Spark infers the schema as follows: > > root > |-- header: struct
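
For reference, nested struct fields can be addressed with dot notation and arrays unpacked with explode; a sketch against the inferred schema above (any field beyond those shown is an assumption):

import org.apache.spark.sql.functions.{col, explode}

// Dot notation reaches into structs; explode yields one row per element
// of the deviceObjects array.
val flattened = df.select(
  col("header.deviceId"),
  col("header.sessionId"),
  explode(col("payload.deviceObjects")).as("deviceObject"))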

Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
If I want to aggregate the mean and subtract it from my column, I can do either of the following in the Spark 2.1.0 Java API. Is there any difference between these? I couldn't find anything from reading the docs.

dataset.select(mean("mycol"))
dataset.agg(mean("mycol"))
dataset.select(avg("mycol"))