Re: how to set the assignee in JIRA please?

2017-07-25 Thread ??????????
I agree not to close the old PRs without a good reason. As you suggest, reviewing them is the way to close them. Thanks. ---Original--- From: "Hyukjin Kwon" Date: 2017/7/26 12:15:45 To: "??"<1427357...@qq.com>; Cc: "user @spark"; Subject:

Re: how to set the assignee in JIRA please?

2017-07-25 Thread Hyukjin Kwon
That one is waiting for a review, as seen. There have been a few discussions about this. I am personally against closing a PR only because it is old. I have periodically made PRs to close other inactive PRs (e.g., ones not responsive to review comments or Jenkins failures). So, I guess most of such PRs are

Re: how to set the assignee in JIRA please?

2017-07-25 Thread ??????????
Hi all, I find some PRs were created one year ago, and the last comment was several months ago. No one closes or rejects them. Take 6880, for example: do we just leave it like this? ---Original--- From: "Hyukjin Kwon" Date: 2017/7/25 09:25:28 To: "??"<1427357...@qq.com>; Cc: "user

Need some help around a Spark Error

2017-07-25 Thread Debabrata Ghosh
Hi, while executing a Spark SQL query, I am hitting the following error. I wonder if you can please help me with a possible cause and resolution. Here is the stacktrace for the same: 07/25/2017 02:41:58 PM - DataPrep.py 323 - __main__ - ERROR - An error occurred while calling

[SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-25 Thread Priyank Shrivastava
I am trying to write key-values to Redis using a DataStreamWriter object via the pyspark Structured Streaming APIs. I am using Spark 2.2. Since the Foreach sink is not supported for Python, as noted here, I am
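Since the Foreach sink was Scala/Java-only in Spark 2.2, one commonly suggested workaround was to implement the writer on the JVM side. A minimal sketch of such a ForeachWriter, assuming the Jedis Redis client, a localhost Redis, and "key"/"value" column names, none of which come from the original message:

    import org.apache.spark.sql.{ForeachWriter, Row}
    import redis.clients.jedis.Jedis

    // Writes each streaming row to Redis; one connection per partition/epoch.
    class RedisWriter extends ForeachWriter[Row] {
      @transient private var jedis: Jedis = _

      override def open(partitionId: Long, version: Long): Boolean = {
        jedis = new Jedis("localhost", 6379) // assumed host/port
        true
      }

      override def process(row: Row): Unit =
        jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))

      override def close(errorOrNull: Throwable): Unit =
        if (jedis != null) jedis.close()
    }

    // streamingDF.writeStream.foreach(new RedisWriter).start()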

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-25 Thread Martin Peng
Cool~ Thanks Kang! I will check and let you know. Sorry for the delay, as there is an urgent customer issue today. Best Martin 2017-07-24 22:15 GMT-07:00 周康 : > * If the file exists but is a directory rather than a regular file, does > * not exist but cannot be created, or

Re: some Ideas on expressing Spark SQL using JSON

2017-07-25 Thread Sathish Kumaran Vairavelu
Just a thought. SQL itself is a DSL. Why a DSL on top of another DSL? On Tue, Jul 25, 2017 at 4:47 AM kant kodali wrote: > Hi All, > > I am thinking of expressing Spark SQL using JSON in the following way. > > For Example: > > *Query using Spark DSL* > >

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Burak Yavuz
I think Kant meant time windowing functions. You can use `window(TIMESTAMP, '24 hours', '24 hours')` On Tue, Jul 25, 2017 at 9:26 AM, Keith Chapman wrote: > Here is an example of a window lead function, > > select *, lead(someColumn1) over ( partition by someColumn2
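For reference, window() is also registered as a SQL function, so the windowed aggregation from the original question can be written as a raw SQL string. A minimal sketch, where the table name (events) is an assumption and the grouping expression is repeated in the SELECT; the column names come from the thread's DSL example (backticks around TIMESTAMP since it is also a type keyword):

    val result = spark.sql("""
      SELECT window(`TIMESTAMP`, '24 hours', '24 hours') AS time_window,
             hourlyPay,
             sum(hourlyPay) AS total
      FROM events
      WHERE name = 'john'
      GROUP BY window(`TIMESTAMP`, '24 hours', '24 hours'), hourlyPay
    """)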

Re: real world spark code

2017-07-25 Thread Matei Zaharia
You can also find a lot of GitHub repos for external packages here: http://spark.apache.org/third-party-projects.html Matei > On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft > wrote: > > There’s a number of real-world open source Spark applications in the sciences: >

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
Here is an example of a window lead function, select *, lead(someColumn1) over ( partition by someColumn2 order by someColumn13 asc nulls first) as someName from someTable Regards, Keith. http://keith-chapman.com On Tue, Jul 25, 2017 at 9:15 AM, kant kodali wrote: > How

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread kant kodali
How do I specify windowInterval and slideInterval using a raw SQL string? On Tue, Jul 25, 2017 at 8:52 AM, Keith Chapman wrote: > You could issue a raw SQL query to Spark; there is no particular advantage > or disadvantage to doing so. Spark builds a logical plan from

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
You could issue a raw SQL query to Spark; there is no particular advantage or disadvantage to doing so. Spark builds a logical plan from the raw SQL (or DSL) and optimizes on that. Ideally you would end up with the same physical plan, irrespective of whether it is written in raw SQL or the DSL. Regards,
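A quick way to check this yourself is to compare the physical plans. A minimal sketch, assuming df is an existing DataFrame with a name column (the view name "people" is also assumed); the filter value follows the thread's example:

    import org.apache.spark.sql.functions.col

    df.createOrReplaceTempView("people")
    val viaDsl = df.filter(col("name") === "john")
    val viaSql = spark.sql("SELECT * FROM people WHERE name = 'john'")

    viaDsl.explain() // the two physical plans should come out the same
    viaSql.explain()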

Re: real world spark code

2017-07-25 Thread Frank Austin Nothaft
There’s a number of real-world open source Spark applications in the sciences: genomics: github.com/bigdatagenomics/adam <— core is scala, has py/r wrappers https://github.com/broadinstitute/gatk <— core is java

Re: real world spark code

2017-07-25 Thread Jörn Franke
Continuous integration (Travis, Jenkins) and reporting on unit tests, integration tests, etc., for each source code version. > On 25. Jul 2017, at 16:58, Adaryl Wakefield > wrote: > > ci+reporting? I’ve never heard of that term before. What is that? > > Adaryl

Re: How to list only errors for a stage

2017-07-25 Thread jeff saremi
Thank you. That helps. From: 周康 Sent: Monday, July 24, 2017 8:04:51 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: How to list only errors for a stage Maybe you can click the header of the Status column in the Tasks section, then failed task

RE: real world spark code

2017-07-25 Thread Adaryl Wakefield
ci+reporting? I’ve never heard of that term before. What is that? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net www.linkedin.com/in/bobwakefieldmba Twitter:

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Thanks Xiayun Sun and Robin East for your inputs. It makes sense to me. Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 9:55 AM, Xiayun Sun wrote: > I'm guessing by "part files" you mean files like part-r-0. These are > actually different from hadoop

Re: Nested JSON Handling in Spark 2.1

2017-07-25 Thread Patrick
Hi, I would appreciate some suggestions on how to achieve top-level struct treatment for nested JSON when stored in Parquet format, or any other solutions for the best performance using Spark 2.1. Thanks in advance. On Mon, Jul 24, 2017 at 4:11 PM, Patrick wrote: > To avoid
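One common approach (a sketch only, not necessarily what this thread settled on) is to flatten the nested struct into top-level columns before writing Parquet, so that column pruning and filter pushdown apply to the inner fields. The schema here, an "id" column plus a struct column named "payload", and the paths are assumptions:

    // Read nested JSON, then promote the struct's fields to top-level columns
    // with the "payload.*" star expansion before writing Parquet.
    val df = spark.read.json("/path/to/input.json")
    val flattened = df.select("id", "payload.*")
    flattened.write.parquet("/path/to/output")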

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Xiayun Sun
I'm guessing by "part files" you mean files like part-r-0. These are actually different from the Hadoop "block size", which is the value actually used for partitioning. It looks like your HDFS block size is the default 128 MB: 258.2 GB in 500 part files -> around 528 MB per part file -> each part file
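(Completing that arithmetic with the default 128 MB block size: a ~528 MB file spans roughly four to five HDFS blocks, so 500 files yield about 500 × 4.6 ≈ 2290 input splits, matching the task count observed in the original question.)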

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Robin East
sc.textFile will use the Hadoop TextInputFormat (I believe); this will use the Hadoop block size to read records from HDFS. Most likely the block size is 128 MB. I'm not sure you can do anything about the number of tasks generated to read from HDFS.

Re: Spark Data Frame Writer - Range Partitioning

2017-07-25 Thread Jain, Nishit
But wouldn’t the partitioning column partition the data only in the Spark RDD? Would it also partition the data on disk when it is written (dividing the data into folders)? From: ayan guha > Date: Friday, July 21, 2017 at 3:25 PM To: "Jain, Nishit"
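For what it's worth, the DataFrame writer's partitionBy does control the on-disk layout, creating one directory per distinct value of each partition column. A minimal sketch, with the column and path names assumed:

    df.write
      .partitionBy("year", "month")
      .parquet("/path/to/output")
    // Produces directories such as /path/to/output/year=2017/month=07/part-...

This directory layout is independent of the in-memory RDD partitioning, and note that it is value-based rather than true range partitioning.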

Re: real world spark code

2017-07-25 Thread Jörn Franke
Look for the ones that have unit and integration tests as well as CI + reporting on code quality. All the others are just toy examples. Well, they should be :) > On 25. Jul 2017, at 01:08, Adaryl Wakefield > wrote: > > Anybody know of publicly available GitHub repos

Re: real world spark code

2017-07-25 Thread Xiayun Sun
Usually I look in the GitHub repos of those big-name companies that I know are actively doing machine learning. For example, here are two Spark-related repos from SoundCloud: - https://github.com/soundcloud/spark-pagerank - https://github.com/soundcloud/cosine-lsh-join-spark On 25 July 2017 at

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Excuse the many mails on this post. I found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D wrote: > In addition to that, > > tried to

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
In addition to that, I tried to read the same file with 3000 partitions, but it used 3070 partitions and took more time than before; please refer to the attachment. Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D wrote: > Hello
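The 3000 -> 3070 result is consistent with minPartitions being only a lower bound: the actual splits are still computed from the file and block sizes. If an exact partition count matters more than the cost of a shuffle, one option is to repartition after the read; a minimal sketch, with the input path assumed:

    val inputRdd = sc.textFile("/path/to/input", 3000) // at least ~3000 partitions
    val exact    = inputRdd.repartition(500)           // exactly 500, at the cost of a shuffle
    println(exact.getNumPartitions)

(coalesce(500) would avoid the shuffle when only reducing the partition count.)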

[Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Hello All, I have an HDFS file with approx. *1.5 billion records* in 500 part files (258.2 GB in size), and when I tried to execute the following I could see that it used 2290 tasks, but it was supposed to be 500, like the HDFS file, wasn't it? val inputFile = val inputRdd = sc.textFile(inputFile)

some Ideas on expressing Spark SQL using JSON

2017-07-25 Thread kant kodali
Hi All, I am thinking of expressing Spark SQL using JSON in the following way. For Example: *Query using Spark DSL* DS.filter(col("name").equalTo("john")) .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 hours"), df1.col("hourlyPay"))

Re: ClassNotFoundException for Workers

2017-07-25 Thread 周康
Ensure com.amazonaws.services.s3.AmazonS3ClientBuilder is on your classpath, which includes your application jar and the jars attached to the executors. 2017-07-20 6:12 GMT+08:00 Noppanit Charassinvichai : > I have this spark job which is using the S3 client in mapPartitions. And I get > this
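A related pattern (a sketch under assumed names, not code from the original job) is to build the S3 client inside mapPartitions so it is constructed on each executor rather than serialized from the driver; the AWS SDK jar still has to reach the executors' classpath, e.g. via spark-submit --jars or --packages. Here keysRdd is assumed to be an RDD[String] of object keys and "my-bucket" a placeholder bucket:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder

    val sizes = keysRdd.mapPartitions { iter =>
      val s3 = AmazonS3ClientBuilder.defaultClient() // one client per partition
      iter.map { key =>
        val obj = s3.getObject("my-bucket", key)     // assumed bucket name
        (key, obj.getObjectMetadata.getContentLength)
      }
    }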

What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread kant kodali
Hi All, I just want to run a Spark Structured Streaming job similar to this: DS.filter(col("name").equalTo("john")) .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 hours"), df1.col("hourlyPay")) .agg(sum("hourlyPay").as("total")); I am wondering if I can