Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Michel Sumbul
Hi Prashant,

I have the problem only on k8s; it works fine when Spark is executed on
top of YARN.
I'm wondering whether the delegation token gets saved; any idea how to check that?
Could it be because KMS is in HA and Spark requests 2 delegation tokens?
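A minimal Scala sketch of one way to check which delegation tokens a JVM actually
holds, assuming a live SparkContext `sc`; it only lists the current UGI's
credentials and does not prove that renewal or propagation works:

import org.apache.hadoop.security.UserGroupInformation
import scala.collection.JavaConverters._

// Tokens held by the driver's current user (UGI).
UserGroupInformation.getCurrentUser.getCredentials.getAllTokens.asScala
  .foreach(t => println(s"driver token: kind=${t.getKind} service=${t.getService}"))

// Same check on an executor; the output lands in that executor's stdout log.
sc.parallelize(Seq(1)).foreach { _ =>
  UserGroupInformation.getCurrentUser.getCredentials.getAllTokens.asScala
    .foreach(t => println(s"executor token: kind=${t.getKind} service=${t.getService}"))
}

With KMS in HA you would expect one kms-dt entry per KMS server alongside the
HDFS_DELEGATION_TOKEN, matching the log lines further down.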

For testing, just running Spark3 on top of any k8s cluster, reading
data from any Hadoop 3 cluster with KMS, should be fine. I'm using an HDP3
cluster, but there is probably an easier way to test.

Michel

On Wed, 19 Aug 2020 at 09:50, Prashant Sharma wrote:

> -dev
> Hi,
>
> I have used Spark with HDFS encrypted with Hadoop KMS, and it worked well.
> Somehow, I could not recall, if I had the kubernetes in the mix. Somehow,
> seeing the error, it is not clear what caused the failure. Can I reproduce
> this somehow?
>
> Thanks,
>
> On Sat, Aug 15, 2020 at 7:18 PM Michel Sumbul 
> wrote:
>
>> Hi guys,
>>
>> Does anyone have an idea on this issue? even some tips to troubleshoot it?
>> I got the impression that after the creation of the delegation for the
>> KMS, the token is not sent to the executor or maybe not saved?
>>
>> I'm sure I'm not the only one using Spark with HDFS encrypted with KMS :-)
>>
>> Thanks,
>> Michel
>>
>> Le jeu. 13 août 2020 à 14:32, Michel Sumbul  a
>> écrit :
>>
>>> Hi guys,
>>>
>>> Does anyone try Spark3 on k8s reading data from HDFS encrypted with KMS
>>> in HA mode (with kerberos)?
>>>
>>> I have a wordcount job running with Spark3 reading data on HDFS (hadoop
>>> 3.1) everything secure with kerberos. Everything works fine if the data
>>> folder is not encrypted (spark on k8s). If the data is on an encrypted
>>> folder, Spark3 on yarn is working fine but it doesn't work when Spark3 is
>>> running on K8S.
>>> I submit the job with spark-submit command and I provide the keytab and
>>> the principal to use.
>>> I got the kerberos error saying that there is no TGT to authenticate to
>>> the KMS (ranger kms, full stack trace of the error at the end of the mail)
>>> servers but in the log I can see that Spark get 2 delegation token, one for
>>> each KMS servers:
>>>
>>> -- --
>>>
>>> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Attempting to login
>>> to KDC using principal: mytestu...@paf.com
>>>
>>> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Successfully logged
>>> into KDC.
>>>
>>> 20/08/13 10:50:52 WARN DomainSocketFactory: The short-circuit local
>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>
>>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token
>>> for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
>>> mytestu...@paf.com (auth:KERBEROS)]] with renewer testuser
>>>
>>> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
>>> HDFS_DELEGATION_TOKEN owner= mytestu...@paf.com, renewer=testuser,
>>> realUser=, issueDate=1597315852353, maxDate=1597920652353,
>>> sequenceNumber=55185062, masterKeyId=1964 on ha-hdfs:cluster2
>>>
>>> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
>>> kms-dt, Service: kms://ht...@server2.paf.com:9393/kms, Ident: (kms-dt
>>> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852642,
>>> maxDate=1597920652642, sequenceNumber=3929883, masterKeyId=623))
>>>
>>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token
>>> for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
>>> testu...@paf.com (auth:KERBEROS)]] with renewer testu...@paf.com
>>>
>>> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
>>> HDFS_DELEGATION_TOKEN owner=testu...@paf.com, renewer=testuser,
>>> realUser=, issueDate=1597315852744, maxDate=1597920652744,
>>> sequenceNumber=55185063, masterKeyId=1964 on ha-hdfs:cluster2
>>>
>>> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
>>> kms-dt, Service: kms://ht...@server.paf.com:9393/kms, Ident: (kms-dt
>>> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852839,
>>> maxDate=1597920652839, sequenceNumber=3929884, masterKeyId=624))
>>>
>>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
>>> is 86400104 for token HDFS_DELEGATION_TOKEN
>>>
>>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
>>> is 86400108 for token kms-dt
>>>
>>

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Prashant Sharma
-dev
Hi,

I have used Spark with HDFS encrypted with Hadoop KMS, and it worked well.
However, I cannot recall whether I had Kubernetes in the mix. From the error
alone, it is not clear what caused the failure. Can I reproduce this somehow?

Thanks,

On Sat, Aug 15, 2020 at 7:18 PM Michel Sumbul 
wrote:

> Hi guys,
>
> Does anyone have an idea on this issue? even some tips to troubleshoot it?
> I got the impression that after the creation of the delegation for the
> KMS, the token is not sent to the executor or maybe not saved?
>
> I'm sure I'm not the only one using Spark with HDFS encrypted with KMS :-)
>
> Thanks,
> Michel
>
> Le jeu. 13 août 2020 à 14:32, Michel Sumbul  a
> écrit :
>
>> Hi guys,
>>
>> Does anyone try Spark3 on k8s reading data from HDFS encrypted with KMS
>> in HA mode (with kerberos)?
>>
>> I have a wordcount job running with Spark3 reading data on HDFS (hadoop
>> 3.1) everything secure with kerberos. Everything works fine if the data
>> folder is not encrypted (spark on k8s). If the data is on an encrypted
>> folder, Spark3 on yarn is working fine but it doesn't work when Spark3 is
>> running on K8S.
>> I submit the job with spark-submit command and I provide the keytab and
>> the principal to use.
>> I got the kerberos error saying that there is no TGT to authenticate to
>> the KMS (ranger kms, full stack trace of the error at the end of the mail)
>> servers but in the log I can see that Spark get 2 delegation token, one for
>> each KMS servers:
>>
>> -- --
>>
>> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Attempting to login
>> to KDC using principal: mytestu...@paf.com
>>
>> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Successfully logged
>> into KDC.
>>
>> 20/08/13 10:50:52 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token
>> for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
>> mytestu...@paf.com (auth:KERBEROS)]] with renewer testuser
>>
>> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
>> HDFS_DELEGATION_TOKEN owner= mytestu...@paf.com, renewer=testuser,
>> realUser=, issueDate=1597315852353, maxDate=1597920652353,
>> sequenceNumber=55185062, masterKeyId=1964 on ha-hdfs:cluster2
>>
>> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
>> kms-dt, Service: kms://ht...@server2.paf.com:9393/kms, Ident: (kms-dt
>> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852642,
>> maxDate=1597920652642, sequenceNumber=3929883, masterKeyId=623))
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token
>> for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
>> testu...@paf.com (auth:KERBEROS)]] with renewer testu...@paf.com
>>
>> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
>> HDFS_DELEGATION_TOKEN owner=testu...@paf.com, renewer=testuser,
>> realUser=, issueDate=1597315852744, maxDate=1597920652744,
>> sequenceNumber=55185063, masterKeyId=1964 on ha-hdfs:cluster2
>>
>> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
>> kms-dt, Service: kms://ht...@server.paf.com:9393/kms, Ident: (kms-dt
>> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852839,
>> maxDate=1597920652839, sequenceNumber=3929884, masterKeyId=624))
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
>> is 86400104 for token HDFS_DELEGATION_TOKEN
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
>> is 86400108 for token kms-dt
>>
>> 20/08/13 10:50:54 INFO HiveConf: Found configuration file null
>>
>> 20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Scheduling renewal
>> in 18.0 h.
>>
>> 20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Updating delegation
>> tokens.
>>
>> 20/08/13 10:50:54 INFO SparkHadoopUtil: Updating delegation tokens for
>> current user.
>>
>> 20/08/13 10:50:55 INFO SparkHadoopUtil: Updating delegation tokens for
>> current user.
>> --- --
>>
>> In the core-site.xml, I have the following property for the 2 kms server
>>
>> --
>>
>> hadoop.security.key.provider.path
>>
>> kms://ht...@server.paf.com;server2.paf.com:9393/kms
>>
>> -
>>
>>
>>

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-15 Thread Michel Sumbul
Hi guys,

Does anyone have an idea on this issue? Even some tips to troubleshoot it?
I have the impression that after the creation of the delegation token for the KMS,
the token is not sent to the executors, or maybe not saved?

I'm sure I'm not the only one using Spark with HDFS encrypted with KMS :-)

Thanks,
Michel

On Thu, 13 Aug 2020 at 14:32, Michel Sumbul wrote:

> Hi guys,
>
> Does anyone try Spark3 on k8s reading data from HDFS encrypted with KMS in
> HA mode (with kerberos)?
>
> I have a wordcount job running with Spark3 reading data on HDFS (hadoop
> 3.1) everything secure with kerberos. Everything works fine if the data
> folder is not encrypted (spark on k8s). If the data is on an encrypted
> folder, Spark3 on yarn is working fine but it doesn't work when Spark3 is
> running on K8S.
> I submit the job with spark-submit command and I provide the keytab and
> the principal to use.
> I got the kerberos error saying that there is no TGT to authenticate to
> the KMS (ranger kms, full stack trace of the error at the end of the mail)
> servers but in the log I can see that Spark get 2 delegation token, one for
> each KMS servers:
>
> -- --
>
> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Attempting to login
> to KDC using principal: mytestu...@paf.com
>
> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Successfully logged
> into KDC.
>
> 20/08/13 10:50:52 WARN DomainSocketFactory: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
>
> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token for:
> DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
> mytestu...@paf.com (auth:KERBEROS)]] with renewer testuser
>
> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
> HDFS_DELEGATION_TOKEN owner= mytestu...@paf.com, renewer=testuser,
> realUser=, issueDate=1597315852353, maxDate=1597920652353,
> sequenceNumber=55185062, masterKeyId=1964 on ha-hdfs:cluster2
>
> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
> kms-dt, Service: kms://ht...@server2.paf.com:9393/kms, Ident: (kms-dt
> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852642,
> maxDate=1597920652642, sequenceNumber=3929883, masterKeyId=623))
>
> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token for:
> DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
> testu...@paf.com (auth:KERBEROS)]] with renewer testu...@paf.com
>
> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
> HDFS_DELEGATION_TOKEN owner=testu...@paf.com, renewer=testuser,
> realUser=, issueDate=1597315852744, maxDate=1597920652744,
> sequenceNumber=55185063, masterKeyId=1964 on ha-hdfs:cluster2
>
> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
> kms-dt, Service: kms://ht...@server.paf.com:9393/kms, Ident: (kms-dt
> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852839,
> maxDate=1597920652839, sequenceNumber=3929884, masterKeyId=624))
>
> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
> is 86400104 for token HDFS_DELEGATION_TOKEN
>
> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
> is 86400108 for token kms-dt
>
> 20/08/13 10:50:54 INFO HiveConf: Found configuration file null
>
> 20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Scheduling renewal in
> 18.0 h.
>
> 20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Updating delegation
> tokens.
>
> 20/08/13 10:50:54 INFO SparkHadoopUtil: Updating delegation tokens for
> current user.
>
> 20/08/13 10:50:55 INFO SparkHadoopUtil: Updating delegation tokens for
> current user.
> --- --
>
> In the core-site.xml, I have the following property for the 2 kms server
>
> --
>
> hadoop.security.key.provider.path
>
> kms://ht...@server.paf.com;server2.paf.com:9393/kms
>
> -
>
>
> Does anyone have an idea how to make it work? Or at least anyone has been
> able to make it work?
> Does anyone know where the delegation tokens are saved during the
> execution of jobs on k8s and how it is shared between the executors?
>
>
> Thanks,
> Michel
>
> PS: The full stack trace of the error:
>
> 
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 22 in stage 0.0 failed 4 times, most recent failure: Lost
> task 22.3 in stage 0.0 (TID 23, 10.5.5.5, executor 1): java.io.IOException:
> org.apache.hadoop.security.authentication.client.AuthenticationException:
> Error while authenticating with end

Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-13 Thread Michel Sumbul
Hi guys,

Has anyone tried Spark3 on k8s reading data from HDFS encrypted with KMS in
HA mode (with Kerberos)?

I have a wordcount job running with Spark3, reading data on HDFS (Hadoop
3.1), everything secured with Kerberos. Everything works fine if the data
folder is not encrypted (Spark on k8s). If the data is in an encrypted
folder, Spark3 on YARN works fine, but it doesn't work when Spark3 is
running on k8s.
I submit the job with the spark-submit command and provide the keytab and the
principal to use.
I get a Kerberos error saying that there is no TGT to authenticate to the
KMS servers (Ranger KMS; the full stack trace of the error is at the end of
the mail), but in the log I can see that Spark gets 2 delegation tokens, one
for each KMS server:

-- --

20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Attempting to login to
KDC using principal: mytestu...@paf.com

20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Successfully logged
into KDC.

20/08/13 10:50:52 WARN DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.

20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token for:
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
mytestu...@paf.com (auth:KERBEROS)]] with renewer testuser

20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
HDFS_DELEGATION_TOKEN owner= mytestu...@paf.com, renewer=testuser,
realUser=, issueDate=1597315852353, maxDate=1597920652353,
sequenceNumber=55185062, masterKeyId=1964 on ha-hdfs:cluster2

20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind: kms-dt,
Service: kms://ht...@server2.paf.com:9393/kms, Ident: (kms-dt
owner=testuser, renewer=testuser, realUser=, issueDate=1597315852642,
maxDate=1597920652642, sequenceNumber=3929883, masterKeyId=623))

20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token for:
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
testu...@paf.com (auth:KERBEROS)]] with renewer testu...@paf.com

20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
HDFS_DELEGATION_TOKEN owner=testu...@paf.com, renewer=testuser, realUser=,
issueDate=1597315852744, maxDate=1597920652744, sequenceNumber=55185063,
masterKeyId=1964 on ha-hdfs:cluster2

20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind: kms-dt,
Service: kms://ht...@server.paf.com:9393/kms, Ident: (kms-dt
owner=testuser, renewer=testuser, realUser=, issueDate=1597315852839,
maxDate=1597920652839, sequenceNumber=3929884, masterKeyId=624))

20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval is
86400104 for token HDFS_DELEGATION_TOKEN

20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval is
86400108 for token kms-dt

20/08/13 10:50:54 INFO HiveConf: Found configuration file null

20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Scheduling renewal in
18.0 h.

20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Updating delegation
tokens.

20/08/13 10:50:54 INFO SparkHadoopUtil: Updating delegation tokens for
current user.

20/08/13 10:50:55 INFO SparkHadoopUtil: Updating delegation tokens for
current user.
--- --

In core-site.xml, I have the following property for the 2 KMS servers:

--

<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://ht...@server.paf.com;server2.paf.com:9393/kms</value>
</property>

-


Does anyone have an idea how to make it work? Or has anyone at least been
able to make it work?
Does anyone know where the delegation tokens are saved during the execution
of jobs on k8s, and how they are shared between the executors?


Thanks,
Michel

PS: The full stack trace of the error:



Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 22 in stage 0.0 failed 4 times, most recent failure: Lost
task 22.3 in stage 0.0 (TID 23, 10.5.5.5, executor 1): java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
Error while authenticating with endpoint:
https://server.paf.com:9393/kms/v1/keyversion/dir_tmp_key%400/_eek?eek_op=decrypt

at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:525)

at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.decryptEncryptedKey(KMSClientProvider.java:826)

at
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:351)

at
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:347)

at
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.doOp(LoadBalancingKMSClientProvider.java:172)

at
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:347

Data from HDFS

2018-04-22 Thread Zois Theodoros

Hello,

I am reading data from HDFS in a Spark application, and as far as I have read,
each HDFS block is one partition for Spark by default. Is there any way to
select only one block from HDFS to read in my Spark application?
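There is no dedicated API for this as far as I know, but here is a rough
workaround sketch in Scala, assuming the default one-partition-per-block
mapping and an illustrative path: keep only the partition whose index
corresponds to the block you want.

val rdd = sc.textFile("hdfs:///data/big.txt")  // by default, roughly one partition per HDFS block
val firstBlockOnly = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter else Iterator.empty       // keep only the first split
}
println(firstBlockOnly.count())

Tasks are still scheduled for every partition; the non-matching ones open
their split but return without iterating over it.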


Thank you,
Thodoris





Re: Spark loads data from HDFS or S3

2017-12-13 Thread Jörn Franke
S3 can be run more cheaply than HDFS on Amazon.

As you correctly describe, it does not support data locality; the data is 
distributed to the workers.

Depending on your use case, it can make sense to use HDFS as a temporary 
“cache” for S3 data.
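A minimal sketch of that pattern, assuming a SparkSession `spark`, a configured
s3a connector, and illustrative bucket and HDFS paths:

val raw = spark.read.text("s3a://my-bucket/raw-logs/")            // remote read, no data locality
raw.write.mode("overwrite").parquet("hdfs:///tmp/raw_logs_cache") // stage the data once on HDFS
val staged = spark.read.parquet("hdfs:///tmp/raw_logs_cache")     // later jobs read with HDFS locality
println(staged.count())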

> On 13. Dec 2017, at 09:39, Philip Lee <philjj...@gmail.com> wrote:
> 
> Hi​
> 
> I have a few of questions about a structure of HDFS and S3 when Spark-like 
> loads data from two storage.
> 
> Generally, when Spark loads data from HDFS, HDFS supports data locality and 
> already own distributed file on datanodes, right? Spark could just process 
> data on workers.
> 
> What about S3? many people in this field use S3 for storage or loading data 
> remotely. When Spark loads data from S3 (sc.textFile('s3://...'), how all 
> data will be spread on Workers? Master node's responsible for this task? It 
> reads all data from S3, then spread the data to Worker? So it migt be a 
> trade-off compared to HDFS? or I got a wrong point of this
> ​.
> ​
> What kind of points in S3 is better than that of HDFS?​
> ​Thanks in Advanced​


Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
> When Spark loads data from S3 (sc.textFile('s3://...'), how all data will be 
> spread on Workers?

The data is read by the workers. Just make sure that the data is splittable,
either by using a splittable format or by passing a list of files, e.g.
 sc.textFile('s3://.../*.txt')
to achieve full parallelism. Otherwise (e.g., if reading a single gzipped file)
only one worker will read the data.
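A small illustration of that point, with made-up s3a paths:

val manyFiles = sc.textFile("s3a://my-bucket/data/*.txt")      // splittable: one partition per file/split
val singleGz  = sc.textFile("s3a://my-bucket/data/all.txt.gz") // gzip is not splittable: one partition
println(s"glob: ${manyFiles.getNumPartitions} partitions, gzip: ${singleGz.getNumPartitions}")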

> So it migt be a trade-off compared to HDFS?

Accessing data on S3 from Hadoop is usually slower than HDFS, cf.
  
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Other_issues

> What kind of points in S3 is better than that of HDFS?

It's independent of your Hadoop cluster: it is easier to share, and you don't
have to worry about the data when maintaining your cluster, ...

Sebastian

On 12/13/2017 09:39 AM, Philip Lee wrote:
> Hi
> ​
> 
> 
> I have a few of questions about a structure of HDFS and S3 when Spark-like 
> loads data from two storage.
> 
> 
> Generally, when Spark loads data from HDFS, HDFS supports data locality and 
> already own distributed
> file on datanodes, right? Spark could just process data on workers.
> 
> 
> What about S3? many people in this field use S3 for storage or loading data 
> remotely. When Spark
> loads data from S3 (sc.textFile('s3://...'), how all data will be spread on 
> Workers? Master node's
> responsible for this task? It reads all data from S3, then spread the data to 
> Worker? So it migt be
> a trade-off compared to HDFS? or I got a wrong point of this
> 
> ​.
> 
> ​
> 
> What kind of points in S3 is better than that of HDFS?
> ​
> 
> ​Thanks in Advanced​
> 





Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi,


I have a few questions about the structure of HDFS and S3 when a Spark-like
system loads data from these two storage systems.


Generally, when Spark loads data from HDFS, HDFS supports data locality and
already owns the distributed files on its datanodes, right? Spark can just
process the data on the workers.


What about S3? Many people in this field use S3 for storage or for loading
data remotely. When Spark loads data from S3 (sc.textFile('s3://...')), how is
all the data spread across the workers? Is the master node responsible for
this task? Does it read all the data from S3 and then spread it to the
workers? So it might be a trade-off compared to HDFS, or did I get this wrong?

In what ways is S3 better than HDFS?

Thanks in advance


Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
I configured HDFS to cache a file in HDFS's centralized cache, as follows:

hdfs cacheadmin -addPool hibench

hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench


But I didn't see much performance impact, no matter how I configured
dfs.datanode.max.locked.memory.


Is it possible that Spark doesn't know the data is in the HDFS cache, and still
reads data from disk instead of from the HDFS cache?
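One hedged way to check what Spark sees, in Scala: list the preferred locations
computed for each input split. As far as I understand, in Spark 1.5+ a replica
served from the HDFS centralized cache shows up as a location prefixed with
"hdfs_cache_", while plain hostnames mean Spark only knows about on-disk replicas.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Use hadoopFile directly so preferredLocations reflects the underlying HDFS splits.
val hadoopRdd = sc.hadoopFile("/HiBench/Kmeans/Input",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
hadoopRdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${hadoopRdd.preferredLocations(p).mkString(", ")}")
}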


Thanks!

Jia


Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Have you read this thread ?

http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+

Cheers

On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou  wrote:

> I configured HDFS to cache file in HDFS's cache, like following:
>
> hdfs cacheadmin -addPool hibench
>
> hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench
>
>
> But I didn't see much performance impacts, no matter how I configure
> dfs.datanode.max.locked.memory
>
>
> Is it possible that Spark doesn't know the data is in HDFS cache, and
> still read data from disk, instead of from HDFS cache?
>
>
> Thanks!
>
> Jia
>


Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Please see also:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

According to Chris Nauroth, an HDFS committer, it's extremely difficult to
use the feature correctly.

The feature also brings operational complexity. Since off-heap memory is
used, you can accidentally use too much RAM on the host, resulting in an OOM
in the JVM, which is hard to debug.

Cheers

On Mon, Jan 25, 2016 at 1:39 PM, Ted Yu  wrote:

> Have you read this thread ?
>
>
> http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+
>
> Cheers
>
> On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou  wrote:
>
>> I configured HDFS to cache file in HDFS's cache, like following:
>>
>> hdfs cacheadmin -addPool hibench
>>
>> hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench
>>
>>
>> But I didn't see much performance impacts, no matter how I configure
>> dfs.datanode.max.locked.memory
>>
>>
>> Is it possible that Spark doesn't know the data is in HDFS cache, and
>> still read data from disk, instead of from HDFS cache?
>>
>>
>> Thanks!
>>
>> Jia
>>
>
>


Re: How to load partial data from HDFS using Spark SQL

2016-01-02 Thread swetha kasireddy
OK. What should the table be? Suppose I have a bunch of parquet files, do I
just specify the directory as the table?
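A hedged sketch of the parquet case (Spark 1.x API; the path and column names are
illustrative): read the directory as a DataFrame, register it as a temporary
table, and filter so Spark can skip data that does not match.

val df = sqlContext.read.parquet("hdfs:///data/events")  // the directory of parquet files
df.registerTempTable("events")                           // expose it as a table for SQL
val filtered = sqlContext.sql("SELECT * FROM events WHERE id = 12345")
println(filtered.count())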

On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
wrote:

> Ok, so whats wrong in using :
>
> var df=HiveContext.sql("Select * from table where id = ")
> //filtered data frame
> df.count
>
> On Sat, Jan 2, 2016 at 11:56 AM, SRK <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How to load partial data from hdfs using Spark SQL? Suppose I want to load
>> data based on a filter like
>>
>> "Select * from table where id = " using Spark SQL with DataFrames,
>> how can that be done? The
>>
>> idea here is that I do not want to load the whole data into memory when I
>> use the SQL and I just want to
>>
>> load the data based on the filter.
>>
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-partial-data-from-HDFS-using-Spark-SQL-tp25855.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


How to load partial data from HDFS using Spark SQL

2016-01-01 Thread SRK
Hi,

How can I load partial data from HDFS using Spark SQL? Suppose I want to load
data based on a filter like

"Select * from table where id = " using Spark SQL with DataFrames,
how can that be done? The idea here is that I do not want to load the whole
data into memory when I use the SQL; I just want to load the data based on
the filter.


Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-partial-data-from-HDFS-using-Spark-SQL-tp25855.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to load partial data from HDFS using Spark SQL

2016-01-01 Thread UMESH CHAUDHARY
OK, so what's wrong with using:

// assuming an existing HiveContext instance `hiveContext`
val df = hiveContext.sql("Select * from table where id = ")  // filtered DataFrame
df.count

On Sat, Jan 2, 2016 at 11:56 AM, SRK <swethakasire...@gmail.com> wrote:

> Hi,
>
> How to load partial data from hdfs using Spark SQL? Suppose I want to load
> data based on a filter like
>
> "Select * from table where id = " using Spark SQL with DataFrames,
> how can that be done? The
>
> idea here is that I do not want to load the whole data into memory when I
> use the SQL and I just want to
>
> load the data based on the filter.
>
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-partial-data-from-HDFS-using-Spark-SQL-tp25855.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread Vinoth Sankar
I'm just reading data from HDFS through Spark. It throws
*java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.BytesWritable* at line no. 6. I never used LongWritable
in my code and have no idea how the data ended up in that format.

Note: I'm not using MapReduce concepts and I'm not creating Jobs
explicitly, so I can't use job.setMapOutputKeyClass and
job.setMapOutputValueClass.

JavaPairRDD<IntWritable, BytesWritable> hdfsContent =
sparkContext.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
JavaRDD lines = hdfsContent.map(new Function<Tuple2<IntWritable,
BytesWritable>, FileData>()
{
public FileData call(Tuple2<IntWritable, BytesWritable> tuple2) throws
InvalidProtocolBufferException
{
byte[] bytes = tuple2._2().getBytes();
return FileData.parseFrom(bytes);
}
});


Re: ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread UMESH CHAUDHARY
As per the exception, it looks like there is a mismatch between the sequence
file's actual value type and the one you provide in your code.
Change BytesWritable
to *LongWritable* and try the execution again.

-Umesh

On Wed, Oct 7, 2015 at 2:41 PM, Vinoth Sankar <vinoth9...@gmail.com> wrote:

> I'm just reading data from HDFS through Spark. It throws 
> *java.lang.ClassCastException:
> org.apache.hadoop.io.LongWritable cannot be cast to
> org.apache.hadoop.io.BytesWritable* at line no 6. I never used
> LongWritable in my code, no idea how the data was in that format.
>
> Note : I'm not using MapReduce Concepts and also I'm not creating Jobs
> explicitly. So i can't use job.setMapOutputKeyClass and
> job.setMapOutputValueClass.
>
> JavaPairRDD<IntWritable, BytesWritable> hdfsContent =
> sparkContext.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
> JavaRDD lines = hdfsContent.map(new Function<Tuple2<IntWritable,
> BytesWritable>, FileData>()
> {
> public FileData call(Tuple2<IntWritable, BytesWritable> tuple2) throws
> InvalidProtocolBufferException
> {
> byte[] bytes = tuple2._2().getBytes();
> return FileData.parseFrom(bytes);
> }
> });
>


Re: SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-02 Thread Michael Armbrust
Once you convert your data to a DataFrame (look at spark-csv), try
df.write.partitionBy("yyyy", "mm").save("...").
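A slightly fuller sketch of that suggestion (Spark 1.x with the spark-csv
package; the input path is illustrative and the yyyy/mm column names follow
the question quoted below):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///path/to/input.csv")              // columns: a, b, c, yyyy, mm
df.write.partitionBy("yyyy", "mm").parquet("/path/to/hdfs/")
// produces directories like /path/to/hdfs/yyyy=2015/mm=09/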

On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram <
haridass.saisri...@gmail.com> wrote:

> Hi,
>
>   I am trying to find a simple example to read a data file on HDFS. The
> file has the following format
> a, b, c, yyyy, mm
> a1,b1,c1,2015,09
> a2,b2,c2,2014,08
>
>
> I would like to read this file and store it in HDFS partitioned by year
> and month. Something like this
> /path/to/hdfs/yyyy/mm
>
> I want to specify the "/path/to/hdfs/" and yyyy/mm should be populated
> automatically based on those columns. Could some one point me in the right
> direction
>
> Thank you,
> Sri Ram
>
>


SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-01 Thread haridass saisriram
Hi,

  I am trying to find a simple example to read a data file on HDFS. The
file has the following format
a, b, c, yyyy, mm
a1,b1,c1,2015,09
a2,b2,c2,2014,08


I would like to read this file and store it in HDFS partitioned by year and
month. Something like this
/path/to/hdfs/yyyy/mm

I want to specify the "/path/to/hdfs/" and the yyyy/mm part should be populated
automatically based on those columns. Could someone point me in the right
direction?

Thank you,
Sri Ram