Re: care to share latest pom for spark scala applications eclipse?

2017-02-24 Thread Marco Mistroni
Hi,
I am using sbt to generate the Eclipse project file.
These are my dependencies; they'll probably translate to something like this
in Maven dependencies:


The groupId and version are the same for all artifacts listed below:

groupId:  org.apache.spark
version:  2.1.0

artifactIds:
spark-core_2.11
spark-streaming_2.11
spark-mllib_2.11
spark-sql_2.11
spark-streaming-flume-sink_2.11
spark-streaming-kafka-0-10_2.11
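
Since I'm on sbt anyway, here's roughly what the build.sbt side looks like for
these -- a sketch only, assuming Scala 2.11 and the Spark 2.1.0 version listed
above:

// build.sbt sketch -- %% appends the Scala binary suffix (_2.11) automatically
scalaVersion := "2.11.8"

val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                 % sparkVersion,
  "org.apache.spark" %% "spark-streaming"            % sparkVersion,
  "org.apache.spark" %% "spark-mllib"                % sparkVersion,
  "org.apache.spark" %% "spark-sql"                  % sparkVersion,
  "org.apache.spark" %% "spark-streaming-flume-sink" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)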

hth




On Fri, Feb 24, 2017 at 8:16 AM, nancy henry 
wrote:

> Hi Guys,
>
> Could one of you who has successfully built Maven packages in the Eclipse
> Scala IDE please share your pom.xml?
>
>
>
>


Re: Is there any limit on number of tasks per stage attempt?

2017-02-24 Thread Jacek Laskowski
Hi,

I think it's the size of the type used to count the partitions, which I
believe is Int. I don't think there's any other reason.
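
If that's right, the theoretical cap is simply whatever a 32-bit Int can hold.
A quick check in the Scala REPL (assuming the partition count really is an Int):

// Upper bound implied by a 32-bit Int partition count
Int.MaxValue   // 2147483647, i.e. roughly 2.1 billion tasks per stage attempt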

Jacek

On 23 Feb 2017 5:01 a.m., "Parag Chaudhari"  wrote:

> Hi,
>
> Is there any limit on number of tasks per stage attempt?
>
>
> *Thanks,*
>
> *​Parag​*
>


Re: Is there a list of missing optimizations for typed functions?

2017-02-24 Thread Jacek Laskowski
Hi Justin,

I have never seen such a list. I think the area is under heavy development,
especially optimizations for typed operations.

There's a JIRA to somehow find out more about the behavior of Scala code
(the non-Column-based one from your list), but I've seen no activity in this
area. That's why, for now, Column-based untyped queries can be faster, due
to more optimizations being applied. The same goes for UDFs.
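
You can see the difference in the physical plans yourself. A sketch, assuming
a SparkSession named spark and a Parquet file at a hypothetical path:

// Sketch: untyped (Column-based) vs. typed filter on a Parquet-backed Dataset.
// The Column predicate is visible to Catalyst; the Scala lambda is opaque to it.
import spark.implicits._

case class Rec(col: Int)

val ds = spark.read.parquet("/tmp/recs.parquet").as[Rec]

ds.where($"col" > 1).explain()   // the scan should list something like PushedFilters: [..., GreaterThan(col,1)]
ds.filter(_.col > 1).explain()   // the lambda runs after deserialization, so that filter is not pushed down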

Jacek

On 23 Feb 2017 7:52 a.m., "Justin Pihony"  wrote:

> I was curious if there was introspection of certain typed functions and ran
> the following two queries:
>
> ds.where($"col" > 1).explain
> ds.filter(_.col > 1).explain
>
> And found that the typed function does NOT result in a PushedFilter. I
> imagine this is due to a limited view of the function, so I have two
> questions really:
>
> 1.) Is there a list of the methods that lose some of the optimizations that
> you get from non-functional methods? Is it any method that accepts a
> generic
> function?
> 2.) Is there any work to attempt reflection and gain some of these
> optimizations back? I couldn't find anything in JIRA.
>
> Thanks,
> Justin Pihony
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Is-there-a-list-of-missing-optimizations-for-typed-
> functions-tp28418.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: RDD blocks on Spark Driver

2017-02-24 Thread Jacek Laskowski
Hi,

I'm guessing you're using local mode, which has only one executor, called
driver. Is my guess correct?
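
One quick way to check, a sketch assuming a Spark 2.x session named spark (the
same fields exist on SparkContext in 1.x):

// Confirm whether the app runs in local mode, where the driver JVM also acts
// as the single executor and therefore holds RDD blocks.
println(spark.sparkContext.master)    // e.g. "local[*]" in local mode
println(spark.sparkContext.isLocal)   // true when running in local mode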

Jacek

On 23 Feb 2017 2:03 a.m.,  wrote:

> Hello,
>
> Had a question. When I look at the executors tab in Spark UI, I notice
> that some RDD blocks are assigned to the driver as well. Can someone please
> tell me why?
>
> Thanks for the help.
>


Re: Get S3 Parquet File

2017-02-24 Thread Benjamin Kim
Gourav,

I’ll start experimenting with Spark 2.1 to see if this works.

Cheers,
Ben


> On Feb 24, 2017, at 5:46 AM, Gourav Sengupta  
> wrote:
> 
> Hi Benjamin,
> 
> First of all, fetching data from S3 while running your code on an on-premise 
> system is a very bad idea. You might want to first copy the data into local 
> HDFS before running your code. Of course, this depends on the volume of data 
> and the internet speed that you have.
> 
> The platform which makes your data processing at least 10 times faster is 
> Spark 2.1. And trust me, you do not want to be writing code which you need to 
> update once again in 6 months because newer versions of Spark deprecate it.
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 
> On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim  > wrote:
> Hi Gourav,
> 
> My answers are below.
> 
> Cheers,
> Ben
> 
> 
>> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta > > wrote:
>> 
>> Can I ask where are you running your CDH? Is it on premise or have you 
>> created a cluster for yourself in AWS? Our cluster is on premise in our data 
>> center.
>> 
>> Also, I have really never seen s3a used before; that was used long ago, back 
>> when writing S3 files took a long time, but I think that you are reading it. 
>> 
>> Any idea why you are not migrating to Spark 2.1? Besides speed, there are 
>> lots of APIs which are new and the existing ones are being deprecated. 
>> Therefore there is a very high chance that you are already working on code 
>> which is being deprecated by the Spark community right now. We use CDH and 
>> upgrade with whatever Spark version they include, which is 1.6.0. We are 
>> waiting for the move to Spark 2.0/2.1.
>> 
>> And besides that, would you not want to work on a platform which is at least 
>> 10 times faster? What would that be?
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim > > wrote:
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
>> file from AWS S3. We can read the schema and show some data when the file is 
>> loaded into a DataFrame, but when we try to do some operations, such as 
>> count, we get this error below.
>> 
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
>> credentials from any provider in the chain
>> at 
>> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>> at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>> at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>> at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>> at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>> at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>> at 
>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>> at 
>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>> at 
>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>> at 
>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>> at 
>> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
>> at 
>> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>> at 
>> 

Re: Duplicate Rank within same partitions

2017-02-24 Thread Yong Zhang
What you described is not clear here.


Do you want to rank your data based on (date, hour, language, item_type,
time_zone) and sort by score, or do you want to rank your data based on
(date, hour) and sort by language, item_type, time_zone and score?


If you mean the first one, then your Spark code looks right, but the example
you gave didn't include "time_zone", which may be the reason the rank starts
from 1 again.


In a Spark window specification, partitionBy is for the columns you want to
group by, and orderBy decides the ordering within each partition. Both can
take multiple columns.
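
For reference, the first interpretation would look roughly like this in Scala,
a sketch only, with the column names assumed from your description:

// Sketch: row_number over a window partitioned by the grouping columns and
// ordered by score descending. df stands for your input DataFrame.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window
  .partitionBy("date", "hour", "language", "item_type", "time_zone")
  .orderBy(col("score").cast("float").desc)

val ranked = df.withColumn("rank", row_number().over(w))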


Yong


From: Dana Ram Meghwal 
Sent: Friday, February 24, 2017 2:08 AM
To: user@spark.apache.org
Subject: Fwd: Duplicate Rank within same partitions


-- Forwarded message --
From: Dana Ram Meghwal >
Date: Thu, Feb 23, 2017 at 10:40 PM
Subject: Duplicate Rank within same partitions
To: user-h...@spark.apache.org


Hey Guys,

I am new to Spark. I am trying to write a Spark script which involves finding
the rank of records over the same data partitions (I will be clearer in a
short while).


I have a table which has the following column names, and the example data
looks like this (there are around 20 million records for each combination of
date, hour, language and item_type):

Id   language   date       hour   item_type   score
1    hindi      20170220   00     song        10
2    hindi      20170220   00     song        12
3    hindi      20170220   00     song        15
.
.
.
till 20 million


4    english    20170220   00     song        9
5    english    20170220   00     song        18
6    english    20170220   00     song        12
.
.
.
till 20 million


Now I want to rank them over language, date, hour, item_type

so finally it will look like this

Id   language   date       hour   item_type   score   rank
1    hindi      20170220   00     song        10      1
2    hindi      20170220   00     song        12      2
3    hindi      20170220   00     song        15      3

4    english    20170220   00     song        9       1
6    english    20170220   00     song        12      2
5    english    20170220   00     song        18      3



To solve this I use the rank function in Spark. The code looks like the
following:

1. Converting the RDD to a DataFrame:

rdd_with_final_score_df = spark.read.json(rdd_with_final_score).repartition(1)

2. Setting the window specification:

w = Window.partitionBy("dt", "hour", "language", "item_type", "time_zone") \
    .orderBy(rdd_with_final_score_df.score.cast("float").desc())

3. Calculating ranks after repartitioning to 1 partition:

rdd_with_final_score_df_rank_df = rdd_with_final_score_df.repartition(1) \
    .withColumn('rank', row_number().over(w))

Now, the number of rows in "rdd_with_final_score" is so high that this RDD is
distributed across machines in the cluster.


I am getting results, but I am getting duplicate ranks within each partition.

For example:

Id   language   date       hour   item_type   score   rank
1    hindi      20170220   00     song        10      1
2    hindi      20170220   00     song        12      2
3    hindi      20170220   00     song        15      1


Here, record 1 and record 3 have the same rank, but it is expected that they
should have different ranks, or that the rank should be unique for different
score values.

Is it the case that the rank is getting calculated separately for each
partition of the RDD, and then the partitions are merged, so that multiple
rows end up with the same rank?


It would be very helpful if you guys could help me understand what is going
on here and how we can solve this. I thought repartitioning would work, but
it did not.


I tried to use rowsBetween or rangeBetween, but it was giving this error:

pyspark.sql.utils.AnalysisException: u'Window Frame ROWS BETWEEN 1 PRECEDING
AND 1 FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW;'








--
Dana Ram Meghwal
Software Engineer
dana...@saavn.com




--
Dana Ram Meghwal
Software Engineer
dana...@saavn.com



Re: Apache Spark MLIB

2017-02-24 Thread Jon Gregg
Here's a high-level overview of Spark's ML Pipelines from around when it came
out: https://www.youtube.com/watch?v=OednhGRp938

But reading your description, you might be able to build a basic version of
this without ML. Spark has broadcast variables that would allow you to put
flagged IP ranges into an array and make them available on every node. Then
you can use filters to detect users who've logged in from a flagged IP range.
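
Something like this, roughly. A sketch only, where the column names ("user",
"ip"), the prefix-based matching, and the example ranges are all placeholders:

// Sketch: broadcast a small set of flagged IP prefixes to every node and
// filter a DataFrame of login events (logins) against it.
import org.apache.spark.sql.functions.{col, udf}

val flaggedPrefixes = Set("203.0.113.", "198.51.100.")        // placeholder flagged ranges
val flagged = spark.sparkContext.broadcast(flaggedPrefixes)   // shipped once to each node

val isFlagged = udf((ip: String) => flagged.value.exists(ip.startsWith))

val suspiciousLogins = logins.filter(isFlagged(col("ip")))
suspiciousLogins.select("user", "ip").show()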

Jon Gregg

On Thu, Feb 23, 2017 at 9:19 PM, Mina Aslani  wrote:

> Hi,
>
> I am going to start working on anomaly detection using Spark MLlib. Please
> note that I have not used Spark so far.
>
> I would like to read some data and, if a user logged in from a different IP
> address which is not common, consider it an anomaly, similar to what
> Apple/Google does.
>
> My preferred language of programming is Java.
>
> I am wondering if you can let me know about books/workshops which can guide
> me on which ML algorithm to use and how to implement it. I would like to
> know about Spark's supervised/unsupervised options and the suggested algorithm.
>
> I would really appreciate it if you shared your thoughts/experience/insight
> with me.
>
> Best regards,
> Mina
>


Re: Get S3 Parquet File

2017-02-24 Thread Gourav Sengupta
Hi Benjamin,

First of all, fetching data from S3 while running your code on an on-premise
system is a very bad idea. You might want to first copy the data into
local HDFS before running your code. Of course, this depends on the volume of
data and the internet speed that you have.

The platform which makes your data processing at least 10 times faster is
Spark 2.1. And trust me, you do not want to be writing code which you need to
update once again in 6 months because newer versions of Spark deprecate it.
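
That said, the stack trace below is specifically about credentials ("Unable to
load AWS credentials from any provider in the chain"). One common way to hand
them to the s3a connector, a sketch only with placeholder values, assuming a
Spark 2.x session named spark (on 1.6 the same keys go on sc.hadoopConfiguration):

// Sketch: supplying S3A credentials explicitly through the Hadoop configuration.
// The key/secret are placeholders; prefer instance profiles or environment
// variables over hard-coding credentials in real code.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val df = spark.read.parquet("s3a://your-bucket/path/to/file.parquet")
df.count()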


Regards,
Gourav Sengupta



On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim  wrote:

> Hi Gourav,
>
> My answers are below.
>
> Cheers,
> Ben
>
>
> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta 
> wrote:
>
> Can I ask where are you running your CDH? Is it on premise or have you
> created a cluster for yourself in AWS? Our cluster is on premise in our
> data center.
>
> Also, I have really never seen s3a used before; that was used long ago,
> back when writing S3 files took a long time, but I think that you are
> reading it.
>
> Any idea why you are not migrating to Spark 2.1? Besides speed, there are
> lots of APIs which are new and the existing ones are being deprecated.
> Therefore there is a very high chance that you are already working on code
> which is being deprecated by the Spark community right now. We use CDH
> and upgrade with whatever Spark version they include, which is 1.6.0. We
> are waiting for the move to Spark 2.0/2.1.
>
> And besides that, would you not want to work on a platform which is at
> least 10 times faster? What would that be?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim  wrote:
>
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
>> Parquet file from AWS S3. We can read the schema and show some data when
>> the file is loaded into a DataFrame, but when we try to do some operations,
>> such as count, we get this error below.
>>
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS
>> credentials from any provider in the chain
>> at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.
>> getCredentials(AWSCredentialsProviderChain.java:117)
>> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke
>> (AmazonS3Client.java:3779)
>> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBu
>> cket(AmazonS3Client.java:1107)
>> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBu
>> cketExist(AmazonS3Client.java:1070)
>> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSys
>> tem.java:239)
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.
>> java:2711)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem
>> .java:2748)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:
>> 2730)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReade
>> r.java:385)
>> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(
>> ParquetRecordReader.java:162)
>> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordR
>> eader.java:145)
>> at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(
>> SqlNewHadoopRDD.scala:180)
>> at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD
>> .scala:126)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.
>> scala:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Can anyone help?
>>
>> Cheers,
>> Ben
>>
>>
>> -
>> To unsubscribe e-mail: 

Duplicate Rank within same Partitions

2017-02-24 Thread Dana Ram Meghwal
Hey Guys,

I am new to Spark. I am trying to write a Spark script which involves finding
the rank of records over the same data partitions (I will be clearer in a
short while).


I have a table which has the following column names, and the example data
looks like this (there are around 20 million records for each combination of
date, hour, language and item_type):

Id   language   date       hour   item_type   score
1    hindi      20170220   00     song        10
2    hindi      20170220   00     song        12
3    hindi      20170220   00     song        15
.
.
.
till 20 million


4    english    20170220   00     song        9
5    english    20170220   00     song        18
6    english    20170220   00     song        12
.
.
.
till 20 million


Now I want to rank them over language, date, hour, item_type

so finally it will look like this

Id   language   date       hour   item_type   score   rank
1    hindi      20170220   00     song        10      1
2    hindi      20170220   00     song        12      2
3    hindi      20170220   00     song        15      3

4    english    20170220   00     song        9       1
6    english    20170220   00     song        12      2
5    english    20170220   00     song        18      3



To solve this I use the rank function in Spark. The code looks like the
following:

1. Converting the RDD to a DataFrame:

rdd_with_final_score_df = spark.read.json(rdd_with_final_score).repartition(1)

2. Setting the window specification:

w = Window.partitionBy("dt", "hour", "language", "item_type", "time_zone") \
    .orderBy(rdd_with_final_score_df.score.cast("float").desc())

3. Calculating ranks after repartitioning to 1 partition:

rdd_with_final_score_df_rank_df = rdd_with_final_score_df.repartition(1) \
    .withColumn('rank', row_number().over(w))

Now, the number of rows in "rdd_with_final_score" is so high that this RDD is
distributed across machines in the cluster.


I am getting results, but I am getting duplicate ranks within each partition.

For example:

Id   language   date       hour   item_type   score   rank
1    hindi      20170220   00     song        10      1
2    hindi      20170220   00     song        12      2
3    hindi      20170220   00     song        15      1


Here, record 1 and record 3 have the same rank, but it is expected that they
should have different ranks, or that the rank should be unique for different
score values.

Is it the case that the rank is getting calculated separately for each
partition of the RDD, and then the partitions are merged, so that multiple
rows end up with the same rank?


It would be very helpful if you guys could help me understand what is going
on here and how we can solve this. I thought repartitioning would work, but
it did not.


I tried to use rowsBetween or rangeBetween, but it was giving this error:

pyspark.sql.utils.AnalysisException: u'Window Frame ROWS BETWEEN 1 PRECEDING
AND 1 FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW;'

Thanks a ton

--
Dana Ram Meghwal
Software Engineer
dana...@saavn.com


care to share latest pom for spark scala applications eclipse?

2017-02-24 Thread nancy henry
Hi Guys,

Could one of you who has successfully built Maven packages in the Eclipse
Scala IDE please share your pom.xml?