Re: care to share latest pom for spark scala applications eclipse?

2017-02-24 Thread Marco Mistroni
Hi, I am using sbt to generate the Eclipse project file. These are my dependencies; they'll probably translate to something like this in mvn dependencies. The group and version are the same for all packages listed below: org.apache.spark 2.1.0 spark-core_2.11 spark-streaming_2.11 spark-mllib_2.11 spark-sql_2.11
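For reference, the dependencies listed above could be declared in a build.sbt roughly as follows. This is only a sketch, not the poster's actual build file: the project name and the exact Scala patch version (2.11.8) are assumptions, and the `%%` operator is what appends the `_2.11` suffix to each artifact name. In a pom.xml the same coordinates would presumably appear as groupId org.apache.spark, artifactId spark-core_2.11 (and so on), version 2.1.0.

```scala
// build.sbt (sketch; name and scalaVersion patch level are assumptions)
name := "spark-scala-app"

// Any 2.11.x release should match the _2.11 artifacts mentioned in the message.
scalaVersion := "2.11.8"

// Single version shared by all Spark modules, as stated in the message.
val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-mllib"     % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion
)
```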

Re: Is there any limit on number of tasks per stage attempt?

2017-02-24 Thread Jacek Laskowski
Hi, I think it's the size of the type used to count the partitions, which I think is Int. I don't think there's another reason. Jacek On 23 Feb 2017 5:01 a.m., "Parag Chaudhari" wrote: > Hi, > > Is there any limit on number of tasks per stage attempt? > > > *Thanks,* > >

Re: Is there a list of missing optimizations for typed functions?

2017-02-24 Thread Jacek Laskowski
Hi Justin, I have never seen such a list. I think the area is in heavy development esp. optimizations for typed operations. There's a JIRA to somehow find out more on the behavior of Scala code (non-Column-based one from your list) but I've seen no activity in this area. That's why for now

Re: RDD blocks on Spark Driver

2017-02-24 Thread Jacek Laskowski
Hi, Guess you're using local mode, which has only one executor, called driver. Is my guess correct? Jacek On 23 Feb 2017 2:03 a.m., wrote: > Hello, > > Had a question. When I look at the executors tab in Spark UI, I notice > that some RDD blocks are assigned to the driver

Re: Get S3 Parquet File

2017-02-24 Thread Benjamin Kim
Gourav, I’ll start experimenting with Spark 2.1 to see if this works. Cheers, Ben > On Feb 24, 2017, at 5:46 AM, Gourav Sengupta wrote: > > Hi Benjamin, > > First of all, fetching data from S3 while writing code in an on-premise system > is a very bad idea. You

Re: Duplicate Rank for within same partitions

2017-02-24 Thread Yong Zhang
What you described is not clear here. Do you want to rank your data based on (date, hour, language, item_type, time_zone), and sort by score; or do you want to rank your data based on (date, hour) and sort by language, item_type, time_zone and score? If you mean the first one, then your Spark
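The two readings of the question can be sketched with Spark SQL window functions as below. This is a hedged illustration only, not the code from the thread: the sample rows, the app name, the local master and the descending sort on score are all assumptions, and only the column names come from the message above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

object RankSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .appName("rank-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; only the column names come from the thread.
    val df = Seq(
      ("2017-02-24", 10, "en", "video", "UTC", 0.9),
      ("2017-02-24", 10, "en", "video", "UTC", 0.7),
      ("2017-02-24", 10, "fr", "audio", "CET", 0.8)
    ).toDF("date", "hour", "language", "item_type", "time_zone", "score")

    // Reading 1: one ranking per (date, hour, language, item_type, time_zone),
    // ordered by score (descending order is an assumption).
    val byAllKeys = Window
      .partitionBy("date", "hour", "language", "item_type", "time_zone")
      .orderBy(col("score").desc)

    // Reading 2: one ranking per (date, hour), ordered by the remaining
    // columns and then by score.
    val byDateHour = Window
      .partitionBy("date", "hour")
      .orderBy(col("language"), col("item_type"), col("time_zone"), col("score").desc)

    df.withColumn("rank", rank().over(byAllKeys)).show()
    df.withColumn("rank", rank().over(byDateHour)).show()

    spark.stop()
  }
}
```

Note that `rank()` assigns the same rank to rows that tie on the ordering columns; if unique positions are needed, `row_number()` from the same `functions` object is the usual alternative.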

Re: Apache Spark MLlib

2017-02-24 Thread Jon Gregg
Here's a high-level overview of Spark's ML Pipelines from around when they came out: https://www.youtube.com/watch?v=OednhGRp938. But reading your description, you might be able to build a basic version of this without ML. Spark has broadcast variables

Re: Get S3 Parquet File

2017-02-24 Thread Gourav Sengupta
Hi Benjamin, First of all, fetching data from S3 while writing code in an on-premise system is a very bad idea. You might want to first copy the data into local HDFS before running your code. Of course, this depends on the volume of data and the internet speed that you have. The platform which makes

Duplicate Rank within same Partitions

2017-02-24 Thread Dana Ram Meghwal
Hey Guys, I am new to Spark. I am trying to write a Spark script which involves finding the rank of records over the same data partitions (I will be clear in a short while). I have a table which has the following column names, and example data looks like this (records are around 20 million for each pair of

care to share latest pom for spark scala applications eclipse?

2017-02-24 Thread nancy henry
Hi Guys, Could one of you who is successfully able to build Maven packages in Eclipse Scala IDE please share your pom.xml?