Re: Get S3 Parquet File

2017-02-23 Thread Femi Anthony
Have you tried reading using s3n, which is a slightly older protocol? I'm
not sure how compatible s3a is with older versions of Spark.


Femi
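
For reference, a minimal sketch of the s3n variant (an editorial illustration, not from the thread): the property names below are the standard Hadoop s3n credential keys, and the bucket path and key values are placeholders.

// Sketch only: s3n uses different credential property names than s3a.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")       // placeholder
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")   // placeholder

val df = sqlContext.read.parquet("s3n://your-bucket/path/to/file.parquet")  // placeholder path
df.count()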

On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim  wrote:

> Hi Gourav,
>
> My answers are below.
>
> Cheers,
> Ben
>
>
> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta 
> wrote:
>
> Can I ask where you are running your CDH? Is it on premise, or have you
> created a cluster for yourself in AWS? Our cluster is on premise in our
> data center.
>
> Also, I have really never seen s3a used before; it was used way back
> when writing s3 files took a long time, but I think that you are
> reading it.
>
> Any ideas why you are not migrating to Spark 2.1? Besides speed, there are
> lots of APIs which are new, and the existing ones are being deprecated.
> Therefore there is a very high chance that you are already working on code
> which is being deprecated by the Spark community right now. We use CDH
> and upgrade with whatever Spark version they include, which is 1.6.0. We
> are waiting for the move to Spark 2.0/2.1.
>
> And besides that, would you not want to work on a platform which is at
> least 10 times faster? What would that be?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim  wrote:
>
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
>> Parquet file from AWS S3. We can read the schema and show some data when
>> the file is loaded into a DataFrame, but when we try to do some operations,
>> such as count, we get this error below.
>>
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS
>> credentials from any provider in the chain
>> at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.
>> getCredentials(AWSCredentialsProviderChain.java:117)
>> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke
>> (AmazonS3Client.java:3779)
>> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBu
>> cket(AmazonS3Client.java:1107)
>> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBu
>> cketExist(AmazonS3Client.java:1070)
>> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSys
>> tem.java:239)
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.
>> java:2711)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem
>> .java:2748)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:
>> 2730)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReade
>> r.java:385)
>> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(
>> ParquetRecordReader.java:162)
>> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordR
>> eader.java:145)
>> at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(
>> SqlNewHadoopRDD.scala:180)
>> at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD
>> .scala:126)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.
>> scala:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Can anyone help?
>>
>> Cheers,
>> Ben
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Hi Gourav,

My answers are below.

Cheers,
Ben


> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta  
> wrote:
> 
> Can I ask where you are running your CDH? Is it on premise, or have you 
> created a cluster for yourself in AWS? Our cluster is on premise in our data 
> center.
> 
> Also, I have really never seen s3a used before; it was used way back when 
> writing s3 files took a long time, but I think that you are reading it.
> 
> Any ideas why you are not migrating to Spark 2.1? Besides speed, there are 
> lots of APIs which are new, and the existing ones are being deprecated. 
> Therefore there is a very high chance that you are already working on code 
> which is being deprecated by the Spark community right now. We use CDH and 
> upgrade with whatever Spark version they include, which is 1.6.0. We are 
> waiting for the move to Spark 2.0/2.1.
> 
> And besides that, would you not want to work on a platform which is at least 
> 10 times faster? What would that be?
> 
> Regards,
> Gourav Sengupta
> 
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim  > wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
> file from AWS S3. We can read the schema and show some data when the file is 
> loaded into a DataFrame, but when we try to do some operations, such as 
> count, we get this error below.
> 
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
> credentials from any provider in the chain
> at 
> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Can anyone help?
> 
> Cheers,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Fwd: Duplicate Rank for within same partitions

2017-02-23 Thread Dana Ram Meghwal
-- Forwarded message --
From: Dana Ram Meghwal 
Date: Thu, Feb 23, 2017 at 10:40 PM
Subject: Duplicate Rank for within same partitions
To: user-h...@spark.apache.org


Hey Guys,

I am new to Spark. I am trying to write a Spark script which involves
finding the rank of records within the same data partitions (I will make
this clear in a short while).


I have a table with the following column names; the example data looks like
this (there are around 20 million records for each combination of date, hour,
language and item_type):

Id   language   date       hour   item_type   score
1    hindi      20170220   00     song        10
2    hindi      20170220   00     song        12
3    hindi      20170220   00     song        15
.
.
.
till 20 million


4    english    20170220   00     song        9
5    english    20170220   00     song        18
6    english    20170220   00     song        12
.
.
.
till 20 million


Now I want to rank them over language, date, hour, item_type

so finally it will look like this

Id   language   date       hour   item_type   score   rank
1    hindi      20170220   00     song        10      1
2    hindi      20170220   00     song        12      2
3    hindi      20170220   00     song        15      3

4    english    20170220   00     song        9       1
6    english    20170220   00     song        12      2
5    english    20170220   00     song        18      3



To solve this I use the rank function in Spark.

The code looks like the following:

1. converting the rdd to a dataframe

rdd_with_final_score_df = spark.read.json(rdd_with_final_score).repartition(1)

2. setting the window specification

w = Window.partitionBy("dt", "hour", "language", "item_type", "time_zone").orderBy(rdd_with_final_score_df.score.cast("float").desc())

3. calculating ranks after repartitioning to 1 partition

rdd_with_final_score_df_rank_df = rdd_with_final_score_df.repartition(1).withColumn('rank', row_number().over(w))

Now the number of rows in "rdd_with_final_score" is so high that this RDD is
distributed across machines in the cluster.


I am getting results, but within each partition I am getting duplicate ranks.

For example:
Id   language   date       hour   item_type   score   rank
1    hindi      20170220   00     song        10      1
2    hindi      20170220   00     song        12      2
3    hindi      20170220   00     song        15      1


Here record 1 and record 3 have the same rank, but it is expected that they
should have different ranks, i.e. the rank should be unique for different score
values.

Is it the case that the rank is getting calculated separately for each partition
of the RDD and then merged, and because of that multiple rows get the same rank?


It would be very helpful if you guys could help me understand what is
going on here and how we can solve this. I thought repartitioning would work,
but it did not.


I tried to use rowsBetween or rangeBetween, but it gave this error:

pyspark.sql.utils.AnalysisException: u'Window Frame ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;'








-- 
Dana Ram Meghwal
Software Engineer
dana...@saavn.com
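
A small self-contained Scala illustration of the expected behaviour (the code above is PySpark; the DataFrame API is the same). Note that the window in the snippet above also partitions by "time_zone", which is not in the schema shown; if that column varies between rows, the rank restarts for every distinct value, which could explain the duplicates. Column names and values below are taken from the example data; a SparkSession named spark is assumed (e.g. spark-shell).

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val df = Seq(
  (1, "hindi",   "20170220", "00", "song", 10.0),
  (2, "hindi",   "20170220", "00", "song", 12.0),
  (3, "hindi",   "20170220", "00", "song", 15.0),
  (4, "english", "20170220", "00", "song",  9.0)
).toDF("id", "language", "dt", "hour", "item_type", "score")

// Rows of the same (dt, hour, language, item_type) group are shuffled to the
// same task by the window operator itself, so no manual repartition(1) is
// needed (it only creates a single-task bottleneck).
val w = Window
  .partitionBy("dt", "hour", "language", "item_type")
  .orderBy(df("score").desc)

// row_number() assigns 1..n exactly once within each window partition.
val ranked = df.withColumn("rank", row_number().over(w))
ranked.show()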






Re: Get S3 Parquet File

2017-02-23 Thread Gourav Sengupta
Can I ask where you are running your CDH? Is it on premise, or have you
created a cluster for yourself in AWS?

Also, I have really never seen s3a used before; it was used way back when
writing s3 files took a long time, but I think that you are reading it.

Any ideas why you are not migrating to Spark 2.1? Besides speed, there are
lots of APIs which are new, and the existing ones are being deprecated.
Therefore there is a very high chance that you are already working on code
which is being deprecated by the Spark community right now.

And besides that, would you not want to work on a platform which is at least
10 times faster?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim  wrote:

> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
> Parquet file from AWS S3. We can read the schema and show some data when
> the file is loaded into a DataFrame, but when we try to do some operations,
> such as count, we get this error below.
>
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS
> credentials from any provider in the chain
> at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.
> getCredentials(AWSCredentialsProviderChain.java:117)
> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.
> invoke(AmazonS3Client.java:3779)
> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.
> headBucket(AmazonS3Client.java:1107)
> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.
> doesBucketExist(AmazonS3Client.java:1070)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(
> S3AFileSystem.java:239)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(
> FileSystem.java:2711)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
> FileSystem.java:2748)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at parquet.hadoop.ParquetFileReader.readFooter(
> ParquetFileReader.java:385)
> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(
> ParquetRecordReader.java:162)
> at parquet.hadoop.ParquetRecordReader.initialize(
> ParquetRecordReader.java:145)
> at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.
> (SqlNewHadoopRDD.scala:180)
> at org.apache.spark.rdd.SqlNewHadoopRDD.compute(
> SqlNewHadoopRDD.scala:126)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:229)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Can anyone help?
>
> Cheers,
> Ben
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Apache Spark MLIB

2017-02-23 Thread Mina Aslani
Hi,

I am going to start working on anomaly detection using Spark MLlib. Please
note that I have not used Spark so far.

I would like to read some data and, if a user logs in from a different IP
address which is not common, consider it an anomaly, similar to what
Apple/Google do.

My preferred programming language is Java.

I am wondering if you can let me know about books/workshops which would guide
me on the ML algorithm to use and how to implement it. I would like to know
about the Spark supervised/unsupervised options and the suggested algorithm.

I really appreciate if you share you thoughts/experience/insight with me.

Best regards,
Mina


Spark: Continuously reading data from Cassandra

2017-02-23 Thread Tech Id
Hi,

Can anyone help with
http://stackoverflow.com/questions/42428080/spark-continuously-reading-data-from-cassandra
?

Thanks
TI


Spark executor on Docker runs as root

2017-02-23 Thread Ji Yan
Dear spark users,

When running Spark on Docker, the spark executors by default always run as
root. Is there a way to change this to other users?

Thanks
Ji



Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Aakash,

Here is a code snippet for the keys.

val accessKey = "---"
val secretKey = "---"

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)
hadoopConf.set("spark.hadoop.fs.s3a.access.key",accessKey)
hadoopConf.set("spark.hadoop.fs.s3a.secret.key",secretKey)

val df = 
sqlContext.read.parquet("s3a://aps.optus/uc2/BI_URL_DATA_HLY_20170201_09.PARQUET.gz")
df.show
df.count

When we do the count, then the error happens.

Thanks,
Ben
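
One thing worth checking (an editorial note, not from the thread): the spark.hadoop.* prefixed properties are normally set on the SparkConf (or passed with --conf to spark-submit) rather than on the Hadoop Configuration, so that every executor's Hadoop configuration receives the S3A credentials and not only the driver's. A minimal sketch, with placeholder keys and path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: credentials and path are placeholders.
val conf = new SparkConf()
  .setAppName("s3a-parquet-read")
  .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
  .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Spark copies the spark.hadoop.* entries into the Hadoop Configuration used
// on both the driver and the executors.
val df = sqlContext.read.parquet("s3a://your-bucket/path/to/file.parquet")
df.count()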


> On Feb 23, 2017, at 10:31 AM, Aakash Basu  wrote:
> 
> Hey,
> 
> Please recheck your access key and secret key being used to fetch the parquet 
> file. It seems to be a credential error. Either mismatch/load. If load, then 
> first use it directly in code and see if the issue resolves, then it can be 
> hidden and read from Input Params.
> 
> Thanks,
> Aakash.
> 
> 
> On 23-Feb-2017 11:54 PM, "Benjamin Kim"  > wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
> file from AWS S3. We can read the schema and show some data when the file is 
> loaded into a DataFrame, but when we try to do some operations, such as 
> count, we get this error below.
> 
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
> credentials from any provider in the chain
> at 
> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
> at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Can anyone help?
> 
> Cheers,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Re: Get S3 Parquet File

2017-02-23 Thread Aakash Basu
Hey,

Please recheck your access key and secret key being used to fetch the
parquet file. It seems to be a credential error. Either mismatch/load. If
load, then first use it directly in code and see if the issue resolves,
then it can be hidden and read from Input Params.

Thanks,
Aakash.


On 23-Feb-2017 11:54 PM, "Benjamin Kim"  wrote:

We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet
file from AWS S3. We can read the schema and show some data when the file
is loaded into a DataFrame, but when we try to do some operations, such as
count, we get this error below.

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS
credentials from any provider in the chain
at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.
getCredentials(AWSCredentialsProviderChain.java:117)
at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.
invoke(AmazonS3Client.java:3779)
at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.
headBucket(AmazonS3Client.java:1107)
at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.
doesBucketExist(AmazonS3Client.java:1070)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(
S3AFileSystem.java:239)
at org.apache.hadoop.fs.FileSystem.createFileSystem(
FileSystem.java:2711)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
FileSystem.java:2748)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at parquet.hadoop.ParquetFileReader.readFooter(
ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(
ParquetRecordReader.java:162)
at parquet.hadoop.ParquetRecordReader.initialize(
ParquetRecordReader.java:145)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.
(SqlNewHadoopRDD.scala:180)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(
SqlNewHadoopRDD.scala:126)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(
MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(
MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(
Executor.scala:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Can anyone help?

Cheers,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
file from AWS S3. We can read the schema and show some data when the file is 
loaded into a DataFrame, but when we try to do some operations, such as count, 
we get this error below.

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
credentials from any provider in the chain
at 
com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:180)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Can anyone help?

Cheers,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark SQL : Join operation failure

2017-02-23 Thread neil90
It might be a memory issue. Try adding .persist(StorageLevel.MEMORY_AND_DISK) so that
if the DataFrame can't fit into memory it will spill parts of it to disk.


from pyspark import StorageLevel

cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
joined_df.persist(StorageLevel.MEMORY_AND_DISK)
joined_df.write.save("/user/data/output")




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-operation-failure-tp28414p28422.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: New Amazon AMIs for EC2 script

2017-02-23 Thread neil90
You should look into AWS EMR instead, with adding pip install steps to the
launch process. They have a pretty nice Jupyter notebook script that setups
up jupyter and lets you choose what packages you want to install -
https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-Amazon-AMIs-for-EC2-script-tp28419p28421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-23 Thread nguyen duc Tuan
I do a self-join. I tried to cache the transformed dataset before joining,
but it didn't help either.
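
For reference, a minimal Scala sketch of the Spark 2.1 approximate similarity self-join being discussed, using MinHashLSH (SignRandomProjectionLSH is not in Spark master, per the discussion below). The input DataFrame, column names, and the interpretation of the 0.6 threshold as a Jaccard-distance threshold are assumptions.

import org.apache.spark.ml.feature.MinHashLSH

// `data` is assumed to have an "id" column and a sparse binary Vector column "features".
val mh = new MinHashLSH()
  .setNumHashTables(2)          // the value used in the experiments below
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(data)
val hashed = model.transform(data).cache()   // cache the transformed dataset before the self-join

// Self-join: returns pairs whose distance is below the threshold; keep each
// pair once and drop the trivial (a, a) matches.
val pairs = model.approxSimilarityJoin(hashed, hashed, 0.6)
  .filter("datasetA.id < datasetB.id")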

2017-02-23 13:25 GMT+07:00 Nick Pentreath :

> And to be clear, are you doing a self-join for approx similarity? Or
> joining to another dataset?
>
>
>
> On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan 
> wrote:
>
>> Hi Seth,
>> Here's the parameters that I used in my experiments.
>> - Number of executors: 16
>> - Executor's memories: vary from 1G -> 2G -> 3G
>> - Number of cores per executor: 1-> 2
>> - Driver's memory:  1G -> 2G -> 3G
>> - The similar threshold: 0.6
>> MinHash:
>> - number of hash tables: 2
>> SignedRandomProjection:
>> - Number of hash tables: 2
>>
>> 2017-02-23 0:13 GMT+07:00 Seth Hendrickson 
>> :
>>
>> I'm looking into this a bit further, thanks for bringing it up! Right now
>> the LSH implementation only uses OR-amplification. The practical
>> consequence of this is that it will select too many candidates when doing
>> approximate near neighbor search and approximate similarity join. When we
>> add AND-amplification I think it will become significantly more usable. In
>> the meantime, I will also investigate scalability issues.
>>
>> Can you please provide every parameter you used? It will be very helfpul
>> :) For instance, the similarity threshold, the number of hash tables, the
>> bucket width, etc...
>>
>> Thanks!
>>
>> On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath > > wrote:
>>
>> The original Uber authors provided this performance test result:
>> https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_
>> mrg_-vLro
>>
>> This was for MinHash only though, so it's not clear about what the
>> scalability is for the other metric types.
>>
>> The SignRandomProjectionLSH is not yet in Spark master (see
>> https://issues.apache.org/jira/browse/SPARK-18082). It could be there
>> are some implementation details that would make a difference here.
>>
>> By the way, what is the join threshold you use in approx join?
>>
>> Could you perhaps create a JIRA ticket with the details in order to track
>> this?
>>
>>
>> On Sun, 12 Feb 2017 at 22:52 nguyen duc Tuan 
>> wrote:
>>
>> After all, I switched back to LSH implementation that I used before (
>> https://github.com/karlhigley/spark-neighbors ). I can run on my dataset
>> now. If someone has any suggestion, please tell me.
>> Thanks.
>>
>> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan :
>>
>> Hi Timur,
>> 1) Our data is transformed to dataset of Vector already.
>> 2) If I use RandomSignProjectLSH, the job dies after I call
>> approximateSimilarJoin. I tried to use Minhash instead, but the job is still
>> slow. I don't think the problem is related to the GC. The time for GC is
>> small compared with the time for computation. Here are some screenshots of my
>> job.
>> Thanks
>>
>> 2017-02-12 8:01 GMT+07:00 Timur Shenkao :
>>
>> Hello,
>>
>> 1) Are you sure that your data is "clean"?  No unexpected missing values?
>> No strings in unusual encoding? No additional or missing columns ?
>> 2) How long does your job run? What about garbage collector parameters?
>> Have you checked what happens with jconsole / jvisualvm ?
>>
>> Sincerely yours, Timur
>>
>> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan 
>> wrote:
>>
>> Hi Nick,
>> Because we use *RandomSignProjectionLSH*, the only parameter
>> for LSH is the number of hashes. I tried with a small number of hashes (2), but
>> the error still happens, and it happens when I call the similarity join.
>> After transformation, the size of the dataset is about 4G.
>>
>> 2017-02-11 3:07 GMT+07:00 Nick Pentreath :
>>
>> What other params are you using for the lsh transformer?
>>
>> Are the issues occurring during transform or during the similarity join?
>>
>>
>> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan 
>> wrote:
>>
>> hi Das,
>> In general, I will apply them to larger datasets, so I want to use LSH,
>> which is more scalable than the approaches you suggested. Have you
>> tried LSH in Spark 2.1.0 before? If yes, how do you set the
>> parameters/configuration to make it work?
>> Thanks.
>>
>> 2017-02-10 19:21 GMT+07:00 Debasish Das :
>>
>> If it is 7m rows and 700k features (or say 1m features) brute force row
>> similarity will run fine as well...check out spark-4823...you can compare
>> quality with approximate variant...
>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>>
>> Hi everyone,
>> Since spark 2.1.0 introduces LSH (http://spark.apache.org/docs/
>> latest/ml-features.html#locality-sensitive-hashing), we want to use LSH
>> to find approximately nearest neighbors. Basically, We have dataset with
>> about 7M rows. we want to use cosine distance to meassure the similarity
>> between items, so we use 

Re: New Amazon AMIs for EC2 script

2017-02-23 Thread Nicholas Chammas
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A
related issue from the current issue tracker that you may want to
follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74

As I said there, I think requiring custom AMIs is one of the major
maintenance headaches of spark-ec2. I solved this problem in my own
project, Flintrock , by working with
the default Amazon Linux AMIs and letting people more freely bring their
own AMI.

Nick


On Thu, Feb 23, 2017 at 7:23 AM in4maniac  wrote:

> Hi all,
>
> I have been using the EC2 script to launch R pyspark clusters for a while
> now. As we use a lot of packages such as numpy and scipy with openblas,
> scikit-learn, bokeh, vowpal wabbit, pystan, etc., all this time we
> have
> been building AMIs on top of the standard spark-AMIs at
> https://github.com/amplab/spark-ec2/tree/branch-1.6/ami-list/us-east-1
>
> Mainly, I have done the following:
> - updated yum
> - Changed the standard python to python 2.7
> - changed pip to 2.7 and installed a lot of libraries on top of the
> existing
> AMIs and created my own AMIs to avoid having to bootstrap.
>
> But the ec2 standard AMIs are from *early February 2014* and have now
> become extremely fragile. For example, when I update a certain library,
> ipython would break, or pip would break, and so forth.
>
> Can someone please direct me to a more up-to-date AMI that I can use with
> more confidence? And I am also interested to know what things need to be in
> the AMI if I wanted to build an AMI from scratch (last resort :( ).
>
> And isn't it time to have a ticket in the spark project to build a new
> suite
> of AMIs for the EC2 script?
> https://issues.apache.org/jira/browse/SPARK-922
>
> Many thanks
> in4maniac
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-Amazon-AMIs-for-EC2-script-tp28419.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Structured Streaming: How to handle bad input

2017-02-23 Thread Sam Elamin
Hi Jayesh

So you have 2 problems here

1) Data was loaded in the wrong format
2) Once you have handled the wrong data, the Spark job will continually retry
the failed batch

For 2 it's very easy to go into the checkpoint directory and delete that
offset manually and make it seem like it never happened.

However, for point 1 the issue is a little trickier. If you receive
bad data then perhaps your first port of call should be a cleaning process
to ensure your data is at least parsable, then move it to another directory
which Spark Streaming is looking at.

It is unreasonable to expect Spark to both do the streaming and handle bad
data for you, yet remain extremely simple and easy to use.

That said, I personally would have a conversation with the provider of the
data.


In this scenario I just ensure that these providers keep the format of
the data correct, whether it's CSV, JSON, Avro, Parquet or whatever. I
should hope that whatever service/company is providing this data is providing
it "correctly" to a set definition; otherwise you will have to do a
pre-cleaning step.


Perhaps someone else can suggest a better/cleaner approach

Regards
Sam
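
As a sketch of the pre-cleaning / tolerant-parsing idea mentioned above (an editorial illustration, not from the thread): Spark's CSV source can be told to drop malformed rows instead of failing, given an explicit schema. Paths and the schema below are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-stream").getOrCreate()

// Placeholder schema for the incoming files.
val schema = new StructType()
  .add("name", StringType)
  .add("score", IntegerType)

// "DROPMALFORMED" silently discards rows that do not match the schema, so a
// file with, say, a string where an integer is expected no longer kills the
// query. The default "PERMISSIVE" mode nulls out the offending fields instead.
val input = spark.readStream
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("s3a://your-bucket/incoming/")                      // placeholder path

val query = input.writeStream
  .format("parquet")
  .option("path", "s3a://your-bucket/clean/")              // placeholder path
  .option("checkpointLocation", "s3a://your-bucket/chk/")  // placeholder path
  .start()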







On Thu, Feb 23, 2017 at 2:09 PM, JayeshLalwani <
jayesh.lalw...@capitalone.com> wrote:

> What is a good way to make a Structured Streaming application deal with bad
> input? Right now, the problem is that bad input kills the Structured
> Streaming application. This is highly undesirable, because a Structured
> Streaming application has to be always on
>
> For example, here is a very simple structured streaming program
>
>
>
>
> Now, I drop in a CSV file with the following data into my bucket
>
>
>
> Obviously the data is in the wrong format
>
> The executor and driver come crashing down
> 17/02/23 08:53:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
> java.lang.NumberFormatException: For input string: "Iron man"
> at
> java.lang.NumberFormatException.forInputString(
> NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:580)
> at java.lang.Integer.parseInt(Integer.java:615)
> at scala.collection.immutable.StringLike$class.toInt(
> StringLike.scala:272)
> at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(
> CSVInferSchema.scala:250)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$
> csvParser$3.apply(CSVRelation.scala:125)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$
> csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$
> buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$
> buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(
> FileScanRDD.scala:102)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.
> nextIterator(FileScanRDD.scala:166)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(
> FileScanRDD.scala:102)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.
> hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$
> anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 2.apply(SparkPlan.scala:231)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 2.apply(SparkPlan.scala:225)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$25.apply(RDD.scala:826)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$25.apply(RDD.scala:826)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:282)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 

Structured Streaming: How to handle bad input

2017-02-23 Thread JayeshLalwani
What is a good way to make a Structured Streaming application deal with bad
input? Right now, the problem is that bad input kills the Structured
Streaming application. This is highly undesirable, because a Structured
Streaming application has to be always on

For example, here is a very simple structured streaming program




Now, I drop in a CSV file with the following data into my bucket



Obviously the data is in the wrong format

The executor and driver come crashing down
17/02/23 08:53:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NumberFormatException: For input string: "Iron man"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:250)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/02/23 08:53:40 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
localhost, executor driver): java.lang.NumberFormatException: For input
string: "Iron man"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:250)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at

[Spark Streaming] Batch versus streaming

2017-02-23 Thread Charles O. Bajomo
Hello, 

I am reading data from a JMS queue and I need to prevent any data loss, so I 
have a custom Java receiver that only acks messages once they have been stored. 
Sometimes my program crashes because I can't control the flow rate from the 
queue; it overwhelms the job and I end up losing data. So I tried converting 
it into a batch job where the driver reads off the JMS queue indefinitely, 
generates an RDD to process the data, and then acks the messages once it's done. 
I was surprised it worked. Is there any reason not to just use this in place of 
streaming? I am using Spark 1.5.0, by the way.

Thanks 
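
For reference, a rough Scala sketch of the "ack only after store" reliable-receiver pattern described above, using the Spark 1.x Receiver API. The JMS wiring (how the MessageConsumer is created, the acknowledgement mode) is a placeholder and must be serializable.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import javax.jms.{MessageConsumer, TextMessage}

class JmsReliableReceiver(createConsumer: () => MessageConsumer)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = {
    new Thread("jms-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    val consumer = createConsumer()        // assumed to use CLIENT_ACKNOWLEDGE
    while (!isStopped()) {
      val msg = consumer.receive(1000)
      if (msg != null) {
        // Blocking store: returns only after Spark has reliably stored the block.
        store(ArrayBuffer(msg.asInstanceOf[TextMessage].getText))
        // Ack the JMS message only after the data is safely stored.
        msg.acknowledge()
      }
    }
  }
}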


Re: unsubscribe

2017-02-23 Thread Ganesh

Thank you for cat facts.

"A group of cats is called a clowder"

MEEOW


To unsubscribe please enter your credit card details followed by your pin.

CAT-FACTS



On 24/02/17 00:04, Donam Kim wrote:

catunsub

2017-02-23 20:28 GMT+11:00 Ganesh Krishnan >:


Thank you for subscribing to "cat facts"

Did you know that a cat's whiskers is used to determine if it can
wiggle through a hole?


To unsubscribe reply with keyword "catunsub"

Thank you

On Feb 23, 2017 8:25 PM, "Donam Kim" > wrote:

unsubscribe






Re: quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread Sam Elamin
I personally use spark-submit as it's agnostic to which platform your Spark
clusters are running on, e.g. EMR, Dataproc, Databricks, etc.


On Thu, 23 Feb 2017 at 08:53, nancy henry  wrote:

> Hi Team,
>
> I have a set of hc.sql("hivequery") kind of scripts which I am running right
> now in spark-shell.
>
> How should I schedule them in production:
> by making it spark-shell -i script.scala,
> or by keeping it in a jar file built through Eclipse and using spark-submit
> with deploy mode cluster?
>
> which is advisable?
>
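
For illustration, a hedged example of the spark-submit route (class name, jar name, and master are placeholders; the exact flags depend on your cluster manager):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.HiveQueryJob \
  hive-query-job.jar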


New Amazon AMIs for EC2 script

2017-02-23 Thread in4maniac
Hi all,

I have been using the EC2 script to launch R pyspark clusters for a while
now. As we use a lot of packages such as numpy and scipy with openblas,
scikit-learn, bokeh, vowpal wabbit, pystan, etc., all this time we have
been building AMIs on top of the standard spark-AMIs at
https://github.com/amplab/spark-ec2/tree/branch-1.6/ami-list/us-east-1 

Mainly, I have done the following:
- updated yum
- Changed the standard python to python 2.7
- changed pip to 2.7 and installed a lot of libraries on top of the existing
AMIs and created my own AMIs to avoid having to bootstrap.

But the ec2 standard AMIs are from *early February 2014* and have now
become extremely fragile. For example, when I update a certain library,
ipython would break, or pip would break, and so forth.

Can someone please direct me to a more up-to-date AMI that I can use with
more confidence? And I am also interested to know what things need to be in
the AMI if I wanted to build an AMI from scratch (last resort :( ).

And isn't it time to have a ticket in the spark project to build a new suite
of AMIs for the EC2 script? https://issues.apache.org/jira/browse/SPARK-922 

Many thanks
in4maniac 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-Amazon-AMIs-for-EC2-script-tp28419.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark join over sorted columns of dataset.

2017-02-23 Thread Rohit Verma
Hi

When joining two datasets on a column, how can the join be optimized if the 
join columns are pre-sorted within each dataset,
so that the sorting phase of Spark's sort-merge join can be skipped?

Regards
Rohit
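
One approach worth noting (an editorial sketch, not from the thread; it assumes Spark 2.x and that persisting the data as tables is acceptable): write both sides out as bucketed, sorted tables, so the sort-merge join can reuse that layout and avoid the shuffle, and in many cases the sort. Table names, column names, and bucket counts are placeholders.

// Sketch only: bucketBy/sortBy require saveAsTable.
dfA.write
  .bucketBy(200, "key")
  .sortBy("key")
  .saveAsTable("table_a")

dfB.write
  .bucketBy(200, "key")
  .sortBy("key")
  .saveAsTable("table_b")

val joined = spark.table("table_a").join(spark.table("table_b"), "key")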
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Scala functions for dataframes

2017-02-23 Thread Advait Mohan Raut
Hi Team,

I am using Scala Spark DataFrames for data operations over CSV files.

There is common transformation code being used by multiple process flows.
Hence I wish to create Scala functions for it [with def fn_name()].
All process flows will use the functionality implemented inside these Scala 
functions.


Typical transformations on the data are like the following:

  1.  Modify multiple columns
  2.  Changing a column conditioned on one or more columns
  3.  Date time format manipulations
  4.  Applying regex over one or more columns.

For such transformations:


  1.  What is the best way to perform these operations ?
  2.  Can we do such operations without sql queries on dataframes ?
  3.  If there is no choice other than running sql queries then what is the 
best way to write generic scala functions for that ?
  4.  Also, if all input dataframes have different 
schemas but share the constant column names which we need to process, what should 
be the preferred choice in this case?

Please let me know if you need more clarification on this.




Regards
Advait
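
One common pattern, sketched below as an editorial illustration (column names, formats and the input DataFrame rawDf are assumptions): write each reusable transformation as a DataFrame => DataFrame function and chain them with transform, which answers question 2 above in that no SQL strings are needed. As long as the referenced column names exist in a DataFrame's schema, the same functions work across inputs with otherwise different schemas (question 4).

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Normalize a date column given its source format (format string is an example).
def normalizeDate(colName: String, fmt: String)(df: DataFrame): DataFrame =
  df.withColumn(colName, to_date(unix_timestamp(col(colName), fmt).cast("timestamp")))

// Strip non-alphanumeric characters from a column and derive a conditional column.
def cleanCodes(df: DataFrame): DataFrame =
  df.withColumn("code", regexp_replace(col("code"), "[^A-Za-z0-9]", ""))
    .withColumn("status", when(col("amount") > 0, "ACTIVE").otherwise(col("status")))

// Each process flow chains the shared functions on its own DataFrame.
val result = rawDf
  .transform(normalizeDate("event_date", "dd/MM/yyyy"))
  .transform(cleanCodes)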






Support for decimal separator (comma or period) in spark 2.1

2017-02-23 Thread Arkadiusz Bicz
Hi Team,

I would like to know if it is possible to specify a decimal locale for the
DataFrameReader for CSV.

I have CSV files from a locale where the decimal separator is a comma, like
0,32, instead of the US way like 0.32.

Is there a way in the current version of Spark to specify the locale, e.g.:

spark.read.option("sep",";").option("header", "true").option("inferSchema",
"true").format("csv").load("nonuslocalized.csv")

If not, should I create a ticket in JIRA for this? I can work on a solution if
one is not available.

Best Regards,

Arkadiusz Bicz
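
In the meantime, a possible workaround sketch (an editorial suggestion, not a confirmed reader option; file and column names are placeholders): read the comma-decimal columns as strings, then replace the comma and cast.

import org.apache.spark.sql.functions._

// Leave inferSchema off so the numeric columns arrive as strings.
val raw = spark.read
  .option("sep", ";")
  .option("header", "true")
  .format("csv")
  .load("nonuslocalized.csv")

// "price" is a placeholder column name.
val fixed = raw.withColumn("price", regexp_replace(col("price"), ",", ".").cast("double"))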


Re: unsubscribe

2017-02-23 Thread Ganesh Krishnan
Thank you for subscribing to "cat facts"

Did you know that a cat's whiskers is used to determine if it can wiggle
through a hole?


To unsubscribe reply with keyword "catunsub"

Thank you

On Feb 23, 2017 8:25 PM, "Donam Kim"  wrote:

> unsubscribe
>


unsubscribe

2017-02-23 Thread Donam Kim
unsubscribe


quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread nancy henry
Hi Team,

I have a set of hc.sql("hivequery") kind of scripts which I am running right
now in spark-shell.

How should I schedule them in production:
by making it spark-shell -i script.scala,
or by keeping it in a jar file built through Eclipse and using spark-submit
with deploy mode cluster?

which is advisable?