unsubscribe

2017-05-21 Thread Abir Chakraborty
unsubscribe




unsubscribe

2017-05-21 Thread Bibudh Lahiri
unsubscribe


Re: SparkSQL not able to read a empty table location

2017-05-21 Thread Sea
please try spark.sql.hive.verifyPartitionPath true


-- Original --
From:  "Steve Loughran";;
Date:  Sat, May 20, 2017 09:19 PM
To:  "Bajpai, Amit X. -ND"; 
Cc:  "user@spark.apache.org"; 
Subject:  Re: SparkSQL not able to read a empty table location



 
On 20 May 2017, at 01:44, Bajpai, Amit X. -ND  
wrote:

Hi,

I have a Hive external table whose S3 location contains no files (but the S3 
location "directory" does exist). When I try to use Spark SQL to count the 
number of records in the table, it throws an error saying “File 
s3n://data/xyz does not exist. null/0”.

select * from tablex limit 10

Can someone let me know how we can fix this issue?

Thanks
There isn't really a "directory" in S3, just a set of objects whose paths 
begin with a common string. Try creating an empty file with an _ prefix in the 
directory; it should be ignored by Spark SQL but will cause the "directory" to 
come into being.
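A quick way to materialize such a placeholder is to upload a zero-byte object whose name starts with `_`, which Spark SQL skips. This is only a sketch: the bucket/key below mirror the `s3n://data/xyz` path from the thread, and the `_placeholder` name is an invented example.

```shell
# Create a zero-byte marker object so the "directory" exists.
# Via the Hadoop CLI (uses the cluster's configured S3 credentials):
hadoop fs -touchz s3n://data/xyz/_placeholder

# Or via the AWS CLI (no --body creates an empty object):
aws s3api put-object --bucket data --key xyz/_placeholder
```

Since the object is empty and underscore-prefixed, it does not affect query results; it only makes the listing non-empty.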

Sampling data on RDD vs sampling data on Dataframes

2017-05-21 Thread Marco Didonna
Hello,

My team and I have developed a fairly large big-data application using
only the DataFrame API (Spark 1.6.3). Since our application uses machine
learning to do prediction, we need to sample the training dataset so that
the data is not skewed.

To achieve this we use stratified sampling: as you probably know,
DataFrameStatFunctions provides a useful sampleBy method that supposedly
carries out stratified sampling based on the fraction map passed as input.
A few questions have arisen:

- the sampleBy method seems to return variable results for the same input
data, and therefore looks more like an *approximate* stratified sampling.
Inspection of the Spark source code seems to confirm this hypothesis. The
documentation mentions neither this approximation nor a confidence
interval that guarantees how good the approximation is supposed to be.

- in the RDD world there is a sampleByKeyExact method which clearly states
that it will produce a sampled dataset with tight guarantees ... is there
anything like that in the DataFrame world?

Has anybody in the community worked around these shortcomings of the
DataFrame API? I'm very much aware that I can get an RDD from a DataFrame,
perform sampleByKeyExact, and then convert the RDD back to a DataFrame. I'd
really like to avoid such a conversion, if possible.

Thank you for any help you people can give :)

Best,

Marco
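The difference Marco describes can be illustrated without Spark: sampleBy does one independent Bernoulli draw per row, so the per-stratum sample size varies from run to run, whereas sampleByKeyExact draws an exact count per stratum. A minimal plain-Python sketch of the two behaviours (the function names and data below are invented for illustration, not Spark API):

```python
import random
from collections import Counter

def approx_stratified_sample(rows, fractions, seed):
    # Mimics DataFrame.sampleBy: one independent Bernoulli draw per row,
    # so the per-stratum sample size varies between runs/seeds.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[0], 0.0)]

def exact_stratified_sample(rows, fractions, seed):
    # Mimics RDD.sampleByKeyExact: draw exactly floor(fraction * n_k)
    # rows from each stratum, so the sample size is deterministic.
    rng = random.Random(seed)
    by_key = {}
    for r in rows:
        by_key.setdefault(r[0], []).append(r)
    out = []
    for key, grp in by_key.items():
        k = int(fractions.get(key, 0.0) * len(grp))
        out.extend(rng.sample(grp, k))
    return out

rows = [("a", i) for i in range(100)] + [("b", i) for i in range(50)]
fractions = {"a": 0.1, "b": 0.2}

exact = exact_stratified_sample(rows, fractions, seed=7)
print(Counter(r[0] for r in exact))  # both strata have exactly 10 rows, every run
```

As for the workaround: to my knowledge the DataFrame API (1.6 included) offers no exact variant, so keying the underlying RDD by the stratum column, calling sampleByKeyExact, and converting back with toDF is indeed the usual route, despite the conversion cost.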


Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-21 Thread Alonso Isidoro Roman
could you share the code?

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2017-05-20 7:54 GMT+02:00 Manish Malhotra :

> Hello,
>
> I have implemented a Java-based custom receiver which consumes from a
> messaging system, say JMS.
> Once a message is received, I call store(object) ... I'm storing a Spark
> Row object.
>
> It runs for around 8 hrs and then goes OOM, and the OOM happens on the
> receiver nodes.
> I also tried running multiple receivers to distribute the load, but faced
> the same issue.
>
> We are probably doing something fundamentally wrong that keeps the custom
> receiver/Spark from releasing memory, but I'm not able to crack it, at
> least till now.
>
> any help is appreciated !!
>
> Regards,
> Manish
>
>
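One common cause of receiver-side OOM is the receiver ingesting faster than the cluster processes, so blocks pile up in receiver memory. A hedged starting point (these are real Spark Streaming configuration keys; the values are illustrative only, and this assumes unbounded ingest is the cause here):

```
# spark-defaults.conf (illustrative values)
spark.streaming.backpressure.enabled   true
spark.streaming.receiver.maxRate       10000
```

If the receiver still grows without bound, it is also worth checking that nothing in the receiver holds references to rows after store() returns.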


Re: SparkSQL not able to read a empty table location

2017-05-21 Thread Bajpai, Amit X. -ND
Setting spark.sql.hive.verifyPartitionPath=true didn’t help; I'm still getting 
the same error.

I tried copying in a file with a _ prefix: I no longer get the error, and the 
file is also ignored by SparkSQL. But when the job is scheduled in prod and, 
during one execution, there is no data to be processed, the query will fail 
again. How do I deal with this scenario?




Are tachyon and akka removed from 2.1.1 please

2017-05-21 Thread ??????????
Hi all,
I read some papers about the Spark source code; the papers are based on 
version 1.2, and they refer to Tachyon and Akka. When I read the 2.1 code, 
I cannot find the code for Akka or Tachyon.


Are Tachyon and Akka removed from 2.1.1, please?

unsubscribe

2017-05-21 Thread 刘杰
unsubscribe

unsubscribe

2017-05-21 Thread 信息安全部
unsubscribe