Unusual bug, please help me, I can do nothing!!!

2022-03-30 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in 
Windows cmd. The first startup is normal, but when I use "Ctrl+C" to force the 
Spark window to close, it cannot start normally again. The error message is as 
follows: "Failed to initialize Spark 
session. org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@x.168.137.41:49963".
When I add "x.168.137.41" to 'etc/hosts' it works fine, but after using 
"Ctrl+C" again, it once more cannot start normally. Please help me.

Error bug, please help me!!!

2022-03-20 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in 
Windows cmd. The first startup is normal, but when I use "Ctrl+C" to force the 
Spark window to close, it cannot start normally again. The error message is as 
follows: "Failed to initialize Spark 
session. org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@x.168.137.41:49963".
When I add "x.168.137.41" to 'etc/hosts' it works fine, but after using 
"Ctrl+C" again, it once more cannot start normally. Please help me.
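
One workaround sketch to try (purely an assumption on my part that the failure 
comes from hostname resolution; the config keys are standard Spark settings and 
the values are placeholders): pin the driver's advertised host and bind address 
before the session starts, either as --conf options to spark-shell.cmd or in 
code, for example:

import org.apache.spark.sql.SparkSession

// Sketch only: force the spark://HeartbeatReceiver@... URL to use an address
// that always resolves, instead of the machine's current hostname/IP.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.driver.host", "localhost")        // standard Spark config key
  .config("spark.driver.bindAddress", "127.0.0.1") // standard Spark config key
  .getOrCreate()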

Re: Driver hung and ran out of memory while writing to console progress bar

2017-02-13 Thread Spark User
How much memory have you allocated to the driver? The driver stores some state
for tracking the task, stage and job history that you can see in the Spark
console; it does take up a significant portion of the heap, anywhere from
200 MB to 1 GB, depending on your map/reduce steps.

Either way, that is a good place to start: check how much memory you have
allocated to the driver. If it is sufficient, say on the order of 2-3 GB or
more, then you will have to take heap dumps of the driver
process periodically and find out which objects grow over time.
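
For what it's worth, a minimal sketch of checking what the driver actually
received (assuming the usual spark-shell SparkContext named sc). Driver memory
itself is best set at launch, e.g. spark-submit --driver-memory 4g, since in
client mode the driver JVM is already running before SparkConf is read:

// Sketch: print the configured driver memory and the real max heap in use.
println(sc.getConf.get("spark.driver.memory", "not set (default 1g)"))
println(Runtime.getRuntime.maxMemory / (1024 * 1024) + " MB max driver heap")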

On Fri, Feb 10, 2017 at 9:34 AM, Ryan Blue 
wrote:

> This isn't related to the progress bar, it just happened while in that
> section of code. Something else is taking memory in the driver, usually a
> broadcast table or something else that requires a lot of memory and happens
> on the driver.
>
> You should check your driver memory settings and the query plan (if this
> was SparkSQL) for this stage to investigate further.
>
> rb
>
> On Thu, Feb 9, 2017 at 8:41 PM, John Fang 
> wrote:
>
>> the spark version is 2.1.0
>>
>> --
>> From: 方孝健(玄弟) 
>> Sent: Friday, February 10, 2017, 12:35
>> To: spark-dev ; spark-user 
>> Subject: Driver hung and ran out of memory while writing to console
>> progress bar
>>
>> [Stage 172:==>                                    (10328 + 93) / 16144]
>> [Stage 172:==>                                    (10329 + 93) / 16144]
>> [Stage 172:==>                                    (10330 + 93) / 16144]
>> [... the console progress bar for Stage 172 keeps repeating, advancing from
>> (10331 + 93) / 16144 up to (10359 + 94) / 16144 ...]
>

Re: Question about best Spark tuning

2017-02-13 Thread Spark User
My take on the 2-3 tasks per CPU core is that you want to ensure you are
utilizing the cores to the max, which helps with scaling and performance. The
question would be: why not 1 task per core? The reason is that you can probably
get a good handle on the average execution time per task, but the p90+
execution time can be spiky. In that case you don't want the long-pole task(s)
to slow down your entire batch (which is in general what you tune your
application for). So by having 2-3 tasks per CPU core, you further break the
work down into smaller chunks, completing tasks quicker, and let the Spark
scheduler (which is low-cost and efficient based on my observation; it is never
the bottleneck) do the work of distributing the work among the tasks.
I have experimented with 1 task per core, 2-3 tasks per core, and all the way
up to 20+ tasks per core. The performance was similar between 3 tasks per core
and 20+ tasks per core, but it does make a difference when you compare 1 task
per core versus 2-3 tasks per core.
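
As a concrete illustration (the core count and multiplier below are assumed
values, not from this thread), the parallelism can be sized to roughly 2-3
tasks per core like this:

import org.apache.spark.sql.SparkSession

// Sketch: totalCores = number of executors x cores per executor (assumed here).
val totalCores = 100
val targetTasks = (totalCores * 3).toString // roughly 3 tasks per core

val spark = SparkSession.builder()
  .appName("parallelism-sketch")
  .config("spark.default.parallelism", targetTasks)    // RDD shuffle operations
  .config("spark.sql.shuffle.partitions", targetTasks) // DataFrame/Dataset shuffles
  .getOrCreate()

// Or per operation, e.g. rdd.reduceByKey(_ + _, numPartitions = totalCores * 3)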

Hope this explanation makes sense.
Best,
Bharath


On Thu, Feb 9, 2017 at 2:11 PM, Ji Yan  wrote:

> Dear spark users,
>
> From this site, https://spark.apache.org/docs/latest/tuning.html, which
> offers a recommendation on setting the level of parallelism:
>
> Clusters will not be fully utilized unless you set the level of
>> parallelism for each operation high enough. Spark automatically sets the
>> number of “map” tasks to run on each file according to its size (though you
>> can control it through optional parameters to SparkContext.textFile,
>> etc), and for distributed “reduce” operations, such as groupByKey and
>> reduceByKey, it uses the largest parent RDD’s number of partitions. You
>> can pass the level of parallelism as a second argument (see the
>> spark.PairRDDFunctions documentation), or set the config property
>> spark.default.parallelism to
>> change the default. *In general, we recommend 2-3 tasks per CPU core in
>> your cluster*.
>
>
> Do people have a general theory/intuition about why it is a good idea to
> have 2-3 tasks running per CPU core?
>
> Thanks
> Ji
>
>


Re: Is it better to use Java, Python, or Scala for Spark with big data sets

2017-02-13 Thread Spark User
Spark has more support for Scala; by that I mean more APIs are available for
Scala compared to Python or Java. Also, Scala code will be more concise and
easier to read, whereas Java is very verbose.

On Thu, Feb 9, 2017 at 10:21 PM, Irving Duran 
wrote:

> I would say Java, since it will be somewhat similar to Scala.  Now, this
> assumes that you have some app already written in Scala. If you don't, then
> pick the language that you feel most comfortable with.
>
> Thank you,
>
> Irving Duran
>
> On Feb 9, 2017, at 11:59 PM, nancy henry  wrote:
>
> Hi All,
>
> Is it better to use Java, Python, or Scala for Spark coding?
>
> My work is mainly about getting file data which is in CSV format, and I have
> to do some rule checking and rule aggregation
>
> and put the final filtered data back into Oracle so that real-time apps can
> use it.
>


Re: Performance bug in UDAF?

2017-02-09 Thread Spark User
Pinging again on this topic.

Is there an easy way to select the TopN in a RelationalGroupedDataset?
Basically, in the example below, dataSet.groupBy("Column1").agg(udaf("Column2",
"Column3")) returns a RelationalGroupedDataset. One way to address the data
skew would be to reduce the data per key (Column1 being the key here). And if
we are interested in the TopN values per column (like Column2, Column3), how
can we get the TopN from a RelationalGroupedDataset?

Is the only way to get the TopN to implement it in the UDAF?

Would appreciate any pointers or examples if someone has solved a similar
problem.
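
For reference, a minimal sketch of one common way to take the top N per key
with a window function instead of inside the UDAF (N, the ordering column, and
the reuse of dataSet and the column names above are assumptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val n = 10 // assumed N
val w = Window.partitionBy("Column1").orderBy(col("Column2").desc)

// Keep only the top-n rows per Column1 before (or instead of) the UDAF step.
val topN = dataSet
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") <= n)
  .drop("rn")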

Thanks,
Bharath


On Mon, Oct 31, 2016 at 11:40 AM, Spark User 
wrote:

> Trying again. Hoping to find some help in figuring out the performance
> bottleneck we are observing.
>
> Thanks,
> Bharath
>
> On Sun, Oct 30, 2016 at 11:58 AM, Spark User 
> wrote:
>
>> Hi All,
>>
>> I have a UDAF that seems to perform poorly when its input is skewed. I
>> have been debugging the UDAF implementation but I don't see any code that
>> is causing the performance to degrade. More details on the data and the
>> experiments I have run.
>>
>> DataSet: Assume 3 columns, column1 being the key.
>> Column1   Column2  Column3
>> a   1 x
>> a   2 x
>> a   3 x
>> a   4 x
>> a   5 x
>> a   6 z
>> 5 million rows for a
>> 
>> a   100   y
>> b   9 y
>> b   9 y
>> b   10   y
>> 3 million rows for b
>> ...
>> more rows
>> total rows is 100 million
>>
>>
>> a has 5 million rows. Column2 for a has 1 million unique values.
>> b has 3 million rows. Column2 for b has 80 unique values.
>>
>> Column 3 has just 100s of unique values not in the order of millions, for
>> both a and b.
>>
>> Say totally there are 100 million rows as the input to a UDAF
>> aggregation. And the skew in data is for the keys a and b. All other rows
>> can be ignored and do not cause any performance issue/ hot partitions.
>>
>> The code does dataSet.groupBy("Column1").agg(udaf("Column2",
>> "Column3")).
>>
>> I commented out the UDAF implementation for update and merge methods, so
>> essentially the UDAF was doing nothing.
>>
>> With this code (empty update and merge for the UDAF) the performance is 16
>> minutes per micro-batch, each micro-batch containing 100 million rows, with
>> 5 million rows for a and 1 million unique values for Column2 for a.
>>
>> But when I pass empty values for Column2 with nothing else changed,
>> effectively reducing the 1 million unique values for Column2 to just one
>> unique value (the empty value), the batch processing time goes down to 4 minutes.
>>
>> So I am trying to understand why there is such a big performance
>> difference. What in the UDAF causes the processing time to increase by
>> orders of magnitude when there is skew in the data as observed above?
>>
>> Any insight from spark developers, contributors, or anyone else who has a
>> deeper understanding of UDAF would be helpful.
>>
>> Thanks,
>> Bharath
>>
>>
>>
>


Potential memory leak in yarn ApplicationMaster

2016-11-21 Thread Spark User
Hi All,

It seems like the heap usage for
org.apache.spark.deploy.yarn.ApplicationMaster keeps growing continuously.
The driver crashes with OOM eventually.

More details:
I have a spark streaming app that runs on spark-2.0. The
spark.driver.memory is 10G and spark.yarn.driver.memoryOverhead is 2048.
Looking at driver heap dumps taken every 30 mins, the heap usage for
org.apache.spark.deploy.yarn.ApplicationMaster grows by 100MB every 30 mins.

Also, I suspect it may be caused by the settings below, which I had set to
true (and which I think are true by default):
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \

I am now trying to set them to false, to check whether the heap usage of the
ApplicationMaster stops increasing.

By investigating the heap dump and looking at the code for the
ApplicationMaster, it seems the heap usage is growing because of the
releasedExecutorLossReasons HashMap in
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L124

Has anyone else seen this issue before?

Thanks,
Bharath


Re: Performance bug in UDAF?

2016-10-31 Thread Spark User
Trying again. Hoping to find some help in figuring out the performance
bottleneck we are observing.

Thanks,
Bharath

On Sun, Oct 30, 2016 at 11:58 AM, Spark User 
wrote:

> Hi All,
>
> I have a UDAF that seems to perform poorly when its input is skewed. I
> have been debugging the UDAF implementation but I don't see any code that
> is causing the performance to degrade. More details on the data and the
> experiments I have run.
>
> DataSet: Assume 3 columns, column1 being the key.
> Column1   Column2  Column3
> a   1 x
> a   2 x
> a   3 x
> a   4 x
> a   5 x
> a   6 z
> 5 million rows for a
> 
> a   100   y
> b   9 y
> b   9 y
> b   10   y
> 3 million rows for b
> ...
> more rows
> total rows is 100 million
>
>
> a has 5 million rows. Column2 for a has 1 million unique values.
> b has 3 million rows. Column2 for b has 80 unique values.
>
> Column 3 has just 100s of unique values not in the order of millions, for
> both a and b.
>
> Say totally there are 100 million rows as the input to a UDAF aggregation.
> And the skew in data is for the keys a and b. All other rows can be ignored
> and do not cause any performance issue/ hot partitions.
>
> The code does dataSet.groupBy("Column1").agg(udaf("Column2",
> "Column3")).
>
> I commented out the UDAF implementation for update and merge methods, so
> essentially the UDAF was doing nothing.
>
> With this code (empty update and merge for the UDAF) the performance is 16
> minutes per micro-batch, each micro-batch containing 100 million rows, with
> 5 million rows for a and 1 million unique values for Column2 for a.
>
> But when I pass empty values for Column2 with nothing else changed,
> effectively reducing the 1 million unique values for Column2 to just one
> unique value (the empty value), the batch processing time goes down to 4 minutes.
>
> So I am trying to understand why there is such a big performance
> difference. What in the UDAF causes the processing time to increase by
> orders of magnitude when there is skew in the data as observed above?
>
> Any insight from spark developers, contributors, or anyone else who has a
> deeper understanding of UDAF would be helpful.
>
> Thanks,
> Bharath
>
>
>


Performance bug in UDAF?

2016-10-30 Thread Spark User
Hi All,

I have a UDAF that seems to perform poorly when its input is skewed. I have
been debugging the UDAF implementation but I don't see any code that is
causing the performance to degrade. More details on the data and the
experiments I have run.

DataSet: Assume 3 columns, column1 being the key.
Column1   Column2  Column3
a   1 x
a   2 x
a   3 x
a   4 x
a   5 x
a   6 z
5 million rows for a

a   100   y
b   9 y
b   9 y
b   10   y
3 million rows for b
...
more rows
total rows is 100 million


a has 5 million rows. Column2 for a has 1 million unique values.
b has 3 million rows. Column2 for b has 80 unique values.

Column 3 has just 100s of unique values not in the order of millions, for
both a and b.

Say totally there are 100 million rows as the input to a UDAF aggregation.
And the skew in data is for the keys a and b. All other rows can be ignored
and do not cause any performance issue/ hot partitions.

The code does dataSet.groupBy("Column1").agg(udaf("Column2", "Column3")).

I commented out the UDAF implementation for update and merge methods, so
essentially the UDAF was doing nothing.
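
For concreteness, a minimal sketch of such a do-nothing UDAF (the schema and
names below are assumptions, mirroring the two input columns in the example):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class NoOpUdaf extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(StructField("col2", StringType) :: StructField("col3", StringType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("count", LongType) :: Nil)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {} // intentionally empty
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {} // intentionally empty
  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}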

With this code (empty update and merge for the UDAF) the performance is 16
minutes per micro-batch, each micro-batch containing 100 million rows, with
5 million rows for a and 1 million unique values for Column2 for a.

But when I pass empty values for Column2 with nothing else changed,
effectively reducing the 1 million unique values for Column2 to just one
unique value (the empty value), the batch processing time goes down to 4 minutes.

So I am trying to understand why there is such a big performance
difference. What in the UDAF causes the processing time to increase by
orders of magnitude when there is skew in the data as observed above?

Any insight from spark developers, contributors, or anyone else who has a
deeper understanding of UDAF would be helpful.

Thanks,
Bharath


RDD to Dataset results in fixed number of partitions

2016-10-21 Thread Spark User
Hi All,

I'm trying to create a Dataset from an RDD and do a groupBy on the Dataset.
The groupBy stage runs with 200 partitions, although the RDD had 5000
partitions. I also seem to have no way to change those 200 partitions on the
Dataset to some other, larger number. This seems to be limiting parallelism,
as there are 700 executors and only 200 partitions.

The code looks somewhat like:

val sqsDstream = sparkStreamingContext.union((1 to 3).map(_ =>
  sparkStreamingContext.receiverStream(new SQSReceiver())
)).transform(_.repartition(5000))

sqsDstream.foreachRDD(rdd => {
  val dataSet = sparkSession.createDataset(rdd)
  val aggregatedDataset: Dataset[Row] =
    dataSet.groupBy("primaryKey").agg(udaf("key1"))
  aggregatedDataset.foreachPartition(partition => {
    // write to output stream
  })
})
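
For reference, a minimal sketch of two things commonly tried here (the 5000
value is assumed, and the names reuse the snippet above): raising
spark.sql.shuffle.partitions, which controls the 200 post-groupBy partitions,
or repartitioning the Dataset by the grouping key first:

// Option 1: raise the shuffle partition count used by Dataset/DataFrame groupBy.
sparkSession.conf.set("spark.sql.shuffle.partitions", "5000")

// Option 2: repartition the Dataset by the grouping key before aggregating.
val aggregatedDataset = dataSet
  .repartition(5000, dataSet("primaryKey"))
  .groupBy("primaryKey")
  .agg(udaf("key1"))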


Any pointers would be appreciated.
Thanks,
Bharath


Question about single/multi-pass execution in Spark-2.0 dataset/dataframe

2016-09-27 Thread Spark User
case class Record(keyAttr: String, attr1: String, attr2: String, attr3: String)

import sparkSession.implicits._ // needed for createDataset and as[Record]
val ds = sparkSession.createDataset(rdd).as[Record]

val attr1Counts = ds.groupBy("keyAttr", "attr1").count()

val attr2Counts = ds.groupBy("keyAttr", "attr2").count()

val attr3Counts = ds.groupBy("keyAttr", "attr3").count()

// similar counts for 20 attributes

// code to merge attr1Counts, attr2Counts and attr3Counts,
// translate it to the desired output format and save the result.

Some more details:
1) The application is a spark streaming application with batch interval in
the order of 5 - 10 mins
2) Data set is large in the order of millions of records per batch
3) I'm using spark 2.0

The above implementation doesn't seem efficient at all if the data set is
scanned through all the rows once per count aggregation when computing
attr1Counts, attr2Counts and attr3Counts. I'm concerned about the
performance.

Questions:
1) Does the Catalyst optimizer handle such queries and do a single pass over
the dataset under the hood?
2) Is there a better way to do such aggregations, maybe using UDAFs? Or is it
better to use RDD.reduceByKey for this use case? RDD.reduceByKey performs well
for this data and a batch interval of 5-10 mins. Not sure whether the Dataset
implementation explained above will be equivalent or better.
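
On (2), and purely as a sketch using the Record fields above (whether it
actually beats RDD.reduceByKey is not something I can claim), the repeated
scans can be reduced either by caching the source once or by unpivoting the
attributes and counting them in a single aggregation:

import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Simplest option: cache the source so each groupBy reads from memory.
ds.cache()

// Single-pass alternative: melt the attribute columns, then one aggregation.
val melted = ds.select(
  col("keyAttr"),
  explode(array(
    struct(lit("attr1").as("name"), col("attr1").as("value")),
    struct(lit("attr2").as("name"), col("attr2").as("value")),
    struct(lit("attr3").as("name"), col("attr3").as("value"))
  )).as("attr"))

val allCounts = melted.groupBy(col("keyAttr"), col("attr.name"), col("attr.value")).count()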

Thanks,
Bharath


Does Data Frame support CSV or Excel format?

2015-08-27 Thread spark user
Hi all,
Can we create a data frame from an Excel sheet or CSV file? In the example
below it seems only JSON is supported:


DataFrame df =
sqlContext.read().json("examples/src/main/resources/people.json");
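
As far as I know there is no built-in Excel reader, but CSV works; a minimal
sketch in Scala (the com.databricks:spark-csv package is assumed to be on the
classpath for Spark 1.x, and the path is a placeholder; Spark 2.x+ has
spark.read.csv built in):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line is a header
  .option("inferSchema", "true") // optional: guess column types
  .load("examples/src/main/resources/people.csv") // placeholder path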



Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread spark user
Hi Roberto,
I have a question regarding HiveContext: when you create a HiveContext, where
do you define the Hive connection properties? Suppose Hive is not on the local
machine and I need to connect to it; how will the HiveContext know the database
info like the URL, username and password?
String username = "";
String password = "";
String url = "jdbc:hive2://quickstart.cloudera:1/default";


 On Friday, July 17, 2015 2:29 AM, Roberto Coluccio 
 wrote:
   

 Hello community,
I'm currently using Spark 1.3.1 with Hive support for outputting processed data
on an external Hive table backed by S3. I'm using a manual specification of the
delimiter, but I'd like to know if there is any "clean" way to write in CSV
format:

val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS table_name(field1 STRING, " +
  "field2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
  "LOCATION '" + path_on_s3 + "'")
hiveContext.sql()
I also need the header of the table to be printed on each written file. I tried 
with:
hiveContext.sql("set hive.cli.print.header=true")
But it didn't work.
Any hint?
Thank you.
Best regards,
Roberto
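
One alternative sketch, not from this thread: skip the Hive table for the CSV
output and write the DataFrame directly with a header via the spark-csv package
(df stands for the processed DataFrame, the path is a placeholder, and this
uses the Spark 1.4+ writer API; on 1.3.1 the older DataFrame.save call takes
the same source and options):

df.write
  .format("com.databricks.spark.csv")
  .option("header", "true") // emit a header row in each output file
  .save("s3n://my-bucket/output/") // placeholder S3 path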


  

Re: Java 8 vs Scala

2015-07-15 Thread spark user
I struggled a lot with Scala; after almost 10 days, no improvement. But when I
switched to Java 8, things were so smooth, and I used Data Frames with Redshift
and Hive and everything looks good. If you are very good in Scala then go with
Scala; otherwise Java is the best fit.
This is just my opinion, because I am a Java guy.


 On Wednesday, July 15, 2015 12:33 PM, vaquar khan  
wrote:
   

 My choice is Java 8.

On 15 Jul 2015 18:03, "Alan Burlison"  wrote:

On 15/07/2015 08:31, Ignacio Blasco wrote:


The main advantage of using Scala vs Java 8 is being able to use a console.


https://bugs.openjdk.java.net/browse/JDK-8043364

-- 
Alan Burlison
--





  

Data Frame for nested json

2015-07-14 Thread spark user


Does DataFrame support nested JSON for dumping directly to a database?
For simple JSON it is working fine:

{"id":2,"name":"Gerald","email":"gbarn...@zimbio.com","city":"Štoky","country":"Czech Republic","ip":"92.158.154.75"},

But for nested JSON it failed to load:

root
 |-- rows: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cell: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

2015-07-14 14:50:05 [Thread-0] INFO SparkContext:59 - Invoking stop() from shutdown hook
Exception in thread "main" java.lang.IllegalArgumentException: Don't know how to save
StructField(rows,ArrayType(StructType(StructField(cell,ArrayType(StringType,true),true)),true),true) to JDBC
  at org.apache.spark
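
A sketch of the usual way around this (the column names come from the schema
above, df stands for the DataFrame loaded from the nested JSON, everything else
is assumed): the JDBC writer cannot store array/struct columns, so flatten the
nested field with explode before saving:

import org.apache.spark.sql.functions.{col, explode}

// One record per element of "rows", then one string per "cell" entry.
val flat = df
  .select(explode(col("rows")).as("row"))
  .select(explode(col("row.cell")).as("cell"))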

Java 8 vs Scala

2015-07-14 Thread spark user
Hi All,
To start a new project in Spark, which technology is good: Java 8 or Scala?
I am a Java developer. Can I start with Java 8, or do I need to learn Scala?
Which one is the better technology to quickly start any POC project?
Thanks
- su 

Re: spark - redshift !!!

2015-07-08 Thread spark user
Hi, I am looking at how to load data into Redshift. Thanks


 On Wednesday, July 8, 2015 12:47 AM, shahab  
wrote:
   

 Hi,
I did some experiments with loading data from S3 into Spark. I loaded data from
S3 using sc.textFile(). Have a look at the following code snippet:

val csv = sc.textFile("s3n://mybucket/myfile.csv")
// my data is in CSV format, comma separated
val rdd = csv.map(line => line.split(",").map(elem => elem.trim))
  .map(r => MyObject(r(3), r(4).toLong, r(5).toLong, r(6))) // map to the target object format

Hope this helps,
best,
/Shahab
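
And for the Redshift side of the question, a sketch using the spark-redshift
package (the package name is real; df and every option value below are
placeholders):

// Sketch: write a DataFrame to Redshift, staging the data in an S3 temp dir.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://mybucket/tmp/")
  .mode("append")
  .save()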

On Wed, Jul 8, 2015 at 12:57 AM, spark user  
wrote:

Hi, can you help me with how to load data from an S3 bucket into Redshift? If
you have sample code, can you please send it to me?
Thanks, su



  

spark - redshift !!!

2015-07-07 Thread spark user
Hi, can you help me with how to load data from an S3 bucket into Redshift? If
you have sample code, can you please send it to me?
Thanks, su

Re: s3 bucket access/read file

2015-06-29 Thread spark user
Please check your ACL properties.


 On Monday, June 29, 2015 11:29 AM, didi  wrote:
   

 Hi

*Can't read a text file from S3 to create an RDD*

after setting the configuration
val hadoopConf=sparkContext.hadoopConfiguration;
hadoopConf.set("fs.s3.impl",
"org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId",yourAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey",yourSecretKey)

1. running the following
val hfile = sc.textFile("s3n://mybucket/temp/")

I get the error

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/temp' -
ResponseCode=400, ResponseMessage=Bad Request

2. running the following
val hfile = sc.textFile("s3n://mybucket/*.txt")

I get the error

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
org.jets3t.service.S3ServiceException: S3 GET failed for '/' XML Error
Message: InvalidRequest *The authorization mechanism you have provided is not
supported. Please use AWS4-HMAC-SHA256*.
C2174C316DEC91CB3oPZfZoPZUbvzXJdVaUGl9N0oI1buMx+A/wJiisx7uZ0bpnTkwsaT6i0fhYhjY97JDWBX1x/2Y8=

I read it has something to do with the v4 signature? Isn't it supported by
the SDK?

3. running the following

val hfile = sc.textFile("s3n://mybucket")

get the error
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
org.jets3t.service.S3ServiceException: S3 HEAD request failed for
'/user%2Fdidi' - ResponseCode=400, ResponseMessage=Bad Request

What does the user have to do here? I am using a key & secret!

How can I simply create an RDD from a text file on S3?

Thanks

Didi








   

Re: Scala/Python or Java

2015-06-25 Thread spark user
Spark is based on Scala and is written in Scala. To debug and fix issues, I
guess learning Scala is good for the long term? Any advice?


 On Thursday, June 25, 2015 1:26 PM, ayan guha  wrote:
   

 I am a Python fan so I use Python. But what I have noticed is that some
features are typically 1-2 releases behind for Python. So I strongly agree with
Ted: start with the language you are most familiar with and plan to move to
Scala eventually.

On 26 Jun 2015 06:07, "Ted Yu"  wrote:

The answer depends on the user's experience with these languages as well as the 
most commonly used language in the production environment.
Learning Scala requires some time. If you're very comfortable with Java / 
Python, you can go with that while at the same time familiarizing yourself with 
Scala.
Cheers
On Thu, Jun 25, 2015 at 12:04 PM, spark user  
wrote:

Hi All,
I am new to Spark; I just want to know which technology is good/best for
learning Spark:
1) Scala 2) Java 3) Python
I know Spark supports all 3 languages, but which one is best?
Thanks, su






  

Scala/Python or Java

2015-06-25 Thread spark user
Hi All,
I am new to Spark; I just want to know which technology is good/best for
learning Spark:
1) Scala 2) Java 3) Python
I know Spark supports all 3 languages, but which one is best?
Thanks, su