Re: build spark source code

2017-11-22 Thread Jörn Franke
You can check whether Apache Bigtop provides something like this for Spark on 
Windows (though probably based on mvn rather than sbt).

> On 23. Nov 2017, at 03:34, Michael Artz  wrote:
> 
> It would be nice if I could download the source code of spark from github, 
> then build it with sbt on my windows machine, and use IntelliJ to make little 
> modifications to the code base. I have installed spark before on windows 
> quite a few times, but I just use the packaged artifact.  Has anyone built 
> the source code on a windows machine before? 




Re: Custom Data Source for getting data from Rest based services

2017-11-22 Thread sathich
Hi Sourav,
This is quite a useful addition to the Spark family; this is a use case that
comes up more often than it is talked about:
* getting 3rd-party mapping data (geo coordinates),
* accessing database data through REST,
* downloading data from a bulk-data API service.


It will be really useful to be able to interact with the application layer
through a REST API and send data over to it (the case of a POST request,
which you already mentioned).

I have a few follow-up thoughts:
1) What are your thoughts on a REST API returning more complex nested JSON
data? Will this map seamlessly to a DataFrame, given that DataFrames are
flatter in nature?
2) How can this DataFrame be kept in a distributed cache on the Spark workers
so it stays available, to encourage re-use of slow-changing data (does
broadcast work on a DataFrame? see the sketch after this list). This is
related to your b).
3) The last case in my mind is how this can be extended to streaming:
controlling the frequency of the REST API call and performing a join of two
DataFrames, one slow-moving (maybe a lookup table in a DB accessed over REST)
and one a fast-moving event stream.
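
Regarding 2), a minimal sketch of what I had in mind, assuming the REST
lookup has already been loaded into a DataFrame (all names and paths below
are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("rest-lookup-join").getOrCreate()

// Slow-changing lookup data (e.g. fetched over REST and saved) and a large event DataFrame.
val lookupDF = spark.read.json("/tmp/rest_lookup.json")
val eventsDF = spark.read.parquet("/tmp/events")

// broadcast() marks the small DataFrame so the join ships a copy to every
// executor, which is one way to keep slow-changing data available on the workers.
val joined = eventsDF.join(broadcast(lookupDF), Seq("key"))

For case 3) the open question would be how such a broadcast gets refreshed as
the slow-moving side changes.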


Thanks
Sathi





Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Nick Pentreath
For that package specifically it’s best to see if they have a mailing list
and if not perhaps ask on github issues.

Having said that perhaps the folks involved in that package will reply here
too.

On Wed, 22 Nov 2017 at 20:03, Andy Davidson 
wrote:

> I am starting a new deep learning project currently we do all of our work
> on a single machine using a combination of Keras and Tensor flow.
> https://databricks.github.io/spark-deep-learning/site/index.html looks
> very promising. Any idea how performance is likely to improve as I add
> machines to my cluster?
>
> Kind regards
>
> Andy
>
>
> P.s. Is user@spark.apache.org the best place to ask questions about this
> package?
>
>
>


SparkSQL not support CharType

2017-11-22 Thread 163
Hi,
 when I use a DataFrame with a table schema, it goes wrong:

import org.apache.spark.sql.types._

val test_schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("flag", CharType(1), false),
  StructField("time", DateType, false)))

val df = spark.read.format("com.databricks.spark.csv")
  .schema(test_schema)
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .load("file:///Users/name/b")

The log is below:
Exception in thread "main" scala.MatchError: CharType(1) (of class 
org.apache.spark.sql.types.CharType)
at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$apache$spark$sql$catalyst$encoders$RowEncoder$$serializerFor(RowEncoder.scala:73)
at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$2.apply(RowEncoder.scala:158)
at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$2.apply(RowEncoder.scala:157)

Why? Is this a bug?

But I found Spark will translate the char type to string when using the
CREATE TABLE command:

 create table test(flag char(1));
 desc test;   -- the flag column is reported as string
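
A possible workaround (just a sketch, not a statement about whether this is
intended behaviour) is to declare the column as StringType in the DataFrame
schema, since the row encoder has no case for CharType, and enforce the
length separately if needed:

import org.apache.spark.sql.types._

val test_schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("flag", StringType, false),   // StringType instead of CharType(1)
  StructField("time", DateType, false)))

val df = spark.read.format("com.databricks.spark.csv")
  .schema(test_schema)
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .load("file:///Users/name/b")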




Regards
Wendy He

build spark source code

2017-11-22 Thread Michael Artz
It would be nice if I could download the source code of Spark from GitHub,
then build it with sbt on my Windows machine, and use IntelliJ to make
little modifications to the code base. I have installed Spark on Windows
quite a few times before, but I just use the packaged artifact.  Has
anyone built the source code on a Windows machine before?


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-22 Thread Nicolas Paris
Hey

I finally improved the Spark-Hive SQL performance a lot.

I had a problem with a topology_script.py that produced huge error log
traces and reduced Spark performance in Python mode. I just corrected
the Python 2 scripts to be Python 3 ready.
I had a problem with broadcast variables while joining tables. I just
deactivated this functionality (one way to do that is sketched below).
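
A minimal sketch of the deactivation, assuming the trouble came from
automatic broadcast joins and that `spark` is an existing SparkSession:

// A value of -1 disables automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")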

As a result, our users are now able to use spark-hive with very limited
resources (2 executors with 4 cores) and get decent performance for
analytics.

Compared to Presto over JDBC, this has several advantages:
- integrated solution
- single security layer (Hive/Kerberos)
- direct partitioned lazy datasets versus complicated JDBC dataset management
- more robust for analytics with less memory (apparently)

However, Presto still makes sense for sub-second analytics, OLTP-like
queries, and data discovery.

On 05 Nov. 2017 at 13:57, Nicolas Paris wrote:
> Hi
> 
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
> 
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. This does not allow querying huge
> datasets because too little memory is allocated (or maybe I am missing
> something).
> 
> If using Hive JDBC, Hive resources are shared by all my users and then
> queries are able to finish.
> 
> Then I have been testing other JDBC-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
> 
> In order to load huge datasets into spark, the proposed approach is to
> use presto distributed CTAS to build an ORC dataset, and access that
> dataset through spark's dataframe loader ability (instead of direct jdbc
> access that would break the driver memory).
> 
> 
> 
> On 15 Oct. 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> > 
> > without the hive thrift server, if you try to run a select * on a table 
> > which
> > has around 10,000 partitions, SPARK will give you some surprises. PRESTO 
> > works
> > fine in these scenarios, and I am sure SPARK community will soon learn from
> > their algorithms.
> > 
> > 
> > Regards,
> > Gourav
> > 
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:
> > 
> > > I do not think that SPARK will automatically determine the partitions.
> > Actually
> > > it does not automatically determine the partitions. In case a table 
> > has a
> > few
> > > million records, it all goes through the driver.
> > 
> > Hi Gourav
> > 
> > Actually the Spark JDBC driver is able to deal directly with partitions.
> > Spark creates a JDBC connection for each partition.
> > 
> > All details explained in this post :
> > http://www.gatorsmile.io/numpartitionsinjdbc/
> > 
> > Also an example with greenplum database:
> > http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> > 
> > 
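
For completeness, a minimal sketch of the partitioned JDBC read described in
those posts (URL, table, column and bounds are all placeholders):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "schema.my_table")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")   // numeric column used to split the table
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")      // Spark opens one JDBC connection per partition
  .load()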




Re: Spark Streaming Kerberos Issue

2017-11-22 Thread Georg Heiler
Did you check that the security extensions are installed (JCE)?
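
One quick way to check (a sketch, assuming a Java 7/8 runtime) is to ask the
JCE for the maximum allowed AES key length; with only the limited policy
installed it is capped at 128 bits:

import javax.crypto.Cipher

// Prints 128 with the limited policy; a very large value (Integer.MAX_VALUE)
// once the unlimited-strength policy files are installed.
val maxAes = Cipher.getMaxAllowedKeyLength("AES")
println(s"Max allowed AES key length: $maxAes")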

KhajaAsmath Mohammed  wrote on Wed., 22 Nov. 2017 at 19:36:

> [image: Inline image 1]
>
> This is what we are on.
>
> On Wed, Nov 22, 2017 at 12:33 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> We use oracle JDK. we are on unix.
>>
>> On Wed, Nov 22, 2017 at 12:31 PM, Georg Heiler > > wrote:
>>
>>> Do you use oracle or open jdk? We recently had an issue with open jdk:
>>> formerly, java Security extensions were installed by default - no longer so
>>> on centos 7.3
>>>
>>> Are these installed?
>>>
>>> KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
>>> 2017 um 19:29:
>>>
 I passed keytab, renewal is enabled by running the script every eight
 hours. User gets renewed by the script every eight hours.

 On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler <
 georg.kf.hei...@gmail.com> wrote:

> Did you pass a keytab? Is renewal enabled in your kdc?
> KhajaAsmath Mohammed  schrieb am Mi. 22.
> Nov. 2017 um 19:25:
>
>> Hi,
>>
>> I have written spark stream job and job is running successfully for
>> more than 36 hours. After around 36 hours job gets failed with kerberos
>> issue. Any solution on how to resolve it.
>>
>> org.apache.spark.SparkException: Task failed while wri\
>>
>> ting rows.
>>
>> at
>> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)
>>
>> at
>> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
>>
>> at
>> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
>>
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>>
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.io.IOException: org.apache.hadoop.security.authentication.client.\
>>
>> AuthenticationException:
>> org.apache.hadoop.security.token.SecretManager$InvalidToken: token 
>> (kms-dt
>> owner=va_dflt, renewer=yarn, re\
>>
>> alUser=, issueDate=1511262017635, maxDate=1511866817635,
>> sequenceNumber=1854601, masterKeyId=3392) is expired
>>
>> at org.apache.hadoop.hive.ql.io
>> .HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
>>
>> at
>> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.newOutputWriter$1(hiveWriterContainers.scala:346)
>>
>> at
>> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:304)
>>
>> ... 8 more
>>
>> Caused by: java.io.IOException:
>> org.apache.hadoop.security.authentication.client.AuthenticationException:
>> org.apache.hadoop.securit\
>>
>> y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
>> renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\
>>
>> 17635, sequenceNumber=1854601, masterKeyId=3392) is expired
>>
>> at
>> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:216)
>>
>> at
>> org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
>>
>> at
>> org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1440)
>>
>> at
>> org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1542)
>>
>> at
>> org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1527)
>>
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:428)
>>
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:421)
>>
>> at
>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:421)
>>
>> at
>>>

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
[image: Inline image 1]

This is what we are on.

On Wed, Nov 22, 2017 at 12:33 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> We use oracle JDK. we are on unix.
>
> On Wed, Nov 22, 2017 at 12:31 PM, Georg Heiler 
> wrote:
>
>> Do you use oracle or open jdk? We recently had an issue with open jdk:
>> formerly, java Security extensions were installed by default - no longer so
>> on centos 7.3
>>
>> Are these installed?
>>
>> KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
>> 2017 um 19:29:
>>
>>> I passed keytab, renewal is enabled by running the script every eight
>>> hours. User gets renewed by the script every eight hours.
>>>
>>> On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler <
>>> georg.kf.hei...@gmail.com> wrote:
>>>
 Did you pass a keytab? Is renewal enabled in your kdc?
 KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
 2017 um 19:25:

> Hi,
>
> I have written spark stream job and job is running successfully for
> more than 36 hours. After around 36 hours job gets failed with kerberos
> issue. Any solution on how to resolve it.
>
> org.apache.spark.SparkException: Task failed while wri\
>
> ting rows.
>
> at org.apache.spark.sql.hive.Spar
> kHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterC
> ontainers.scala:328)
>
> at org.apache.spark.sql.hive.exec
> ution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply
> (InsertIntoHiveTable.scala:210)
>
> at org.apache.spark.sql.hive.exec
> ution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply
> (InsertIntoHiveTable.scala:210)
>
> at org.apache.spark.scheduler.Res
> ultTask.runTask(ResultTask.scala:87)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>
> at org.apache.spark.executor.Exec
> utor$TaskRunner.run(Executor.scala:322)
>
> at java.util.concurrent.ThreadPoo
> lExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at java.util.concurrent.ThreadPoo
> lExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
> java.io.IOException: org.apache.hadoop.security.aut
> hentication.client.\
>
> AuthenticationException: 
> org.apache.hadoop.security.token.SecretManager$InvalidToken:
> token (kms-dt owner=va_dflt, renewer=yarn, re\
>
> alUser=, issueDate=1511262017635, maxDate=1511866817635,
> sequenceNumber=1854601, masterKeyId=3392) is expired
>
> at org.apache.hadoop.hive.ql.io.H
> iveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
>
> at org.apache.spark.sql.hive.Spar
> kHiveDynamicPartitionWriterContainer.newOutputWriter$1(hiveW
> riterContainers.scala:346)
>
> at org.apache.spark.sql.hive.Spar
> kHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterC
> ontainers.scala:304)
>
> ... 8 more
>
> Caused by: java.io.IOException: org.apache.hadoop.security.aut
> hentication.client.AuthenticationException: org.apache.hadoop.securit\
>
> y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
> renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\
>
> 17635, sequenceNumber=1854601, masterKeyId=3392) is expired
>
> at org.apache.hadoop.crypto.key.k
> ms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBa
> lancingKMSClientProvider.java:216)
>
> at org.apache.hadoop.crypto.key.K
> eyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCry
> ptoExtension.java:388)
>
> at org.apache.hadoop.hdfs.DFSClie
> nt.decryptEncryptedDataEncryptionKey(DFSClient.java:1440)
>
> at org.apache.hadoop.hdfs.DFSClie
> nt.createWrappedOutputStream(DFSClient.java:1542)
>
> at org.apache.hadoop.hdfs.DFSClie
> nt.createWrappedOutputStream(DFSClient.java:1527)
>
> at org.apache.hadoop.hdfs.Distrib
> utedFileSystem$7.doCall(DistributedFileSystem.java:428)
>
> at org.apache.hadoop.hdfs.Distrib
> utedFileSystem$7.doCall(DistributedFileSystem.java:421)
>
> at org.apache.hadoop.fs.FileSyste
> mLinkResolver.resolve(FileSystemLinkResolver.java:81)
>
> at org.apache.hadoop.hdfs.Distrib
> utedFileSystem.create(DistributedFileSystem.java:421)
>
> at org.apache.hadoop.hdfs.Distrib
> utedFileSystem.create(DistributedFileSystem.java:362)
>
> at org.apache.hadoop.fs.FileSyste
> m.create(FileSystem.java:925)
>
> 

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
We use Oracle JDK. We are on Unix.

On Wed, Nov 22, 2017 at 12:31 PM, Georg Heiler 
wrote:

> Do you use oracle or open jdk? We recently had an issue with open jdk:
> formerly, java Security extensions were installed by default - no longer so
> on centos 7.3
>
> Are these installed?
>
> KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
> 2017 um 19:29:
>
>> I passed keytab, renewal is enabled by running the script every eight
>> hours. User gets renewed by the script every eight hours.
>>
>> On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler > > wrote:
>>
>>> Did you pass a keytab? Is renewal enabled in your kdc?
>>> KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
>>> 2017 um 19:25:
>>>
 Hi,

 I have written spark stream job and job is running successfully for
 more than 36 hours. After around 36 hours job gets failed with kerberos
 issue. Any solution on how to resolve it.

 org.apache.spark.SparkException: Task failed while wri\

 ting rows.

 at org.apache.spark.sql.hive.
 SparkHiveDynamicPartitionWriterContainer.writeToFile(
 hiveWriterContainers.scala:328)

 at org.apache.spark.sql.hive.
 execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.
 apply(InsertIntoHiveTable.scala:210)

 at org.apache.spark.sql.hive.
 execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.
 apply(InsertIntoHiveTable.scala:210)

 at org.apache.spark.scheduler.
 ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at org.apache.spark.executor.Executor$TaskRunner.run(
 Executor.scala:322)

 at java.util.concurrent.ThreadPoolExecutor.runWorker(
 ThreadPoolExecutor.java:1145)

 at java.util.concurrent.ThreadPoolExecutor$Worker.run(
 ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)

 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
 java.io.IOException: org.apache.hadoop.security.authentication.client.\

 AuthenticationException: org.apache.hadoop.security.
 token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
 renewer=yarn, re\

 alUser=, issueDate=1511262017635, maxDate=1511866817635,
 sequenceNumber=1854601, masterKeyId=3392) is expired

 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
 getHiveRecordWriter(HiveFileFormatUtils.java:248)

 at org.apache.spark.sql.hive.
 SparkHiveDynamicPartitionWriterContainer.newOutputWriter$1(
 hiveWriterContainers.scala:346)

 at org.apache.spark.sql.hive.
 SparkHiveDynamicPartitionWriterContainer.writeToFile(
 hiveWriterContainers.scala:304)

 ... 8 more

 Caused by: java.io.IOException: org.apache.hadoop.security.
 authentication.client.AuthenticationException:
 org.apache.hadoop.securit\

 y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
 renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\

 17635, sequenceNumber=1854601, masterKeyId=3392) is expired

 at org.apache.hadoop.crypto.key.kms.
 LoadBalancingKMSClientProvider.decryptEncryptedKey(
 LoadBalancingKMSClientProvider.java:216)

 at org.apache.hadoop.crypto.key.
 KeyProviderCryptoExtension.decryptEncryptedKey(
 KeyProviderCryptoExtension.java:388)

 at org.apache.hadoop.hdfs.DFSClient.
 decryptEncryptedDataEncryptionKey(DFSClient.java:1440)

 at org.apache.hadoop.hdfs.DFSClient.
 createWrappedOutputStream(DFSClient.java:1542)

 at org.apache.hadoop.hdfs.DFSClient.
 createWrappedOutputStream(DFSClient.java:1527)

 at org.apache.hadoop.hdfs.DistributedFileSystem$7.
 doCall(DistributedFileSystem.java:428)

 at org.apache.hadoop.hdfs.DistributedFileSystem$7.
 doCall(DistributedFileSystem.java:421)

 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
 FileSystemLinkResolver.java:81)

 at org.apache.hadoop.hdfs.DistributedFileSystem.create(
 DistributedFileSystem.java:421)

 at org.apache.hadoop.hdfs.DistributedFileSystem.create(
 DistributedFileSystem.java:362)

 at org.apache.hadoop.fs.FileSystem.create(FileSystem.
 java:925)

 at org.apache.hadoop.fs.FileSystem.create(FileSystem.
 java:906)

 at parquet.hadoop.ParquetFileWriter.(
 ParquetFileWriter.java:220)

 at parquet.hadoop.ParquetOutputFormat.getRecordWriter(
 ParquetOutputFormat.java:311)

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread Georg Heiler
Do you use Oracle or OpenJDK? We recently had an issue with OpenJDK:
formerly, the Java security extensions were installed by default; no longer so
on CentOS 7.3.

Are these installed?
KhajaAsmath Mohammed  wrote on Wed., 22 Nov. 2017 at 19:29:

> I passed keytab, renewal is enabled by running the script every eight
> hours. User gets renewed by the script every eight hours.
>
> On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler 
> wrote:
>
>> Did you pass a keytab? Is renewal enabled in your kdc?
>> KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
>> 2017 um 19:25:
>>
>>> Hi,
>>>
>>> I have written spark stream job and job is running successfully for more
>>> than 36 hours. After around 36 hours job gets failed with kerberos issue.
>>> Any solution on how to resolve it.
>>>
>>> org.apache.spark.SparkException: Task failed while wri\
>>>
>>> ting rows.
>>>
>>> at
>>> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)
>>>
>>> at
>>> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
>>>
>>> at
>>> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
>>>
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>>>
>>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>>>
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
>>> java.io.IOException: org.apache.hadoop.security.authentication.client.\
>>>
>>> AuthenticationException:
>>> org.apache.hadoop.security.token.SecretManager$InvalidToken: token (kms-dt
>>> owner=va_dflt, renewer=yarn, re\
>>>
>>> alUser=, issueDate=1511262017635, maxDate=1511866817635,
>>> sequenceNumber=1854601, masterKeyId=3392) is expired
>>>
>>> at org.apache.hadoop.hive.ql.io
>>> .HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
>>>
>>> at
>>> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.newOutputWriter$1(hiveWriterContainers.scala:346)
>>>
>>> at
>>> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:304)
>>>
>>> ... 8 more
>>>
>>> Caused by: java.io.IOException:
>>> org.apache.hadoop.security.authentication.client.AuthenticationException:
>>> org.apache.hadoop.securit\
>>>
>>> y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
>>> renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\
>>>
>>> 17635, sequenceNumber=1854601, masterKeyId=3392) is expired
>>>
>>> at
>>> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:216)
>>>
>>> at
>>> org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1440)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1542)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1527)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:428)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:421)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:421)
>>>
>>> at
>>> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:362)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>>>
>>> at
>>> parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:220)
>>>
>>> at
>>> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
>>>
>>> at
>>> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
>>>
>>> at org.apache.hadoop.hive.ql.io
>>> .parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:65)
>>>
>>> at org.apache.hadoop

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
I passed a keytab; renewal is enabled by running the script every eight
hours. The user gets renewed by the script every eight hours.

On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler 
wrote:

> Did you pass a keytab? Is renewal enabled in your kdc?
> KhajaAsmath Mohammed  schrieb am Mi. 22. Nov.
> 2017 um 19:25:
>
>> Hi,
>>
>> I have written spark stream job and job is running successfully for more
>> than 36 hours. After around 36 hours job gets failed with kerberos issue.
>> Any solution on how to resolve it.
>>
>> org.apache.spark.SparkException: Task failed while wri\
>>
>> ting rows.
>>
>> at org.apache.spark.sql.hive.
>> SparkHiveDynamicPartitionWriterContainer.writeToFile(
>> hiveWriterContainers.scala:328)
>>
>> at org.apache.spark.sql.hive.
>> execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.
>> apply(InsertIntoHiveTable.scala:210)
>>
>> at org.apache.spark.sql.hive.
>> execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.
>> apply(InsertIntoHiveTable.scala:210)
>>
>> at org.apache.spark.scheduler.
>> ResultTask.runTask(ResultTask.scala:87)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>>
>> at org.apache.spark.executor.Executor$TaskRunner.run(
>> Executor.scala:322)
>>
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1145)
>>
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.io.IOException: org.apache.hadoop.security.authentication.client.\
>>
>> AuthenticationException: 
>> org.apache.hadoop.security.token.SecretManager$InvalidToken:
>> token (kms-dt owner=va_dflt, renewer=yarn, re\
>>
>> alUser=, issueDate=1511262017635, maxDate=1511866817635,
>> sequenceNumber=1854601, masterKeyId=3392) is expired
>>
>> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
>> getHiveRecordWriter(HiveFileFormatUtils.java:248)
>>
>> at org.apache.spark.sql.hive.
>> SparkHiveDynamicPartitionWriterContainer.newOutputWriter$1(
>> hiveWriterContainers.scala:346)
>>
>> at org.apache.spark.sql.hive.
>> SparkHiveDynamicPartitionWriterContainer.writeToFile(
>> hiveWriterContainers.scala:304)
>>
>> ... 8 more
>>
>> Caused by: java.io.IOException: org.apache.hadoop.security.
>> authentication.client.AuthenticationException: org.apache.hadoop.securit\
>>
>> y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
>> renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\
>>
>> 17635, sequenceNumber=1854601, masterKeyId=3392) is expired
>>
>> at org.apache.hadoop.crypto.key.kms.
>> LoadBalancingKMSClientProvider.decryptEncryptedKey(
>> LoadBalancingKMSClientProvider.java:216)
>>
>> at org.apache.hadoop.crypto.key.
>> KeyProviderCryptoExtension.decryptEncryptedKey(
>> KeyProviderCryptoExtension.java:388)
>>
>> at org.apache.hadoop.hdfs.DFSClient.
>> decryptEncryptedDataEncryptionKey(DFSClient.java:1440)
>>
>> at org.apache.hadoop.hdfs.DFSClient.
>> createWrappedOutputStream(DFSClient.java:1542)
>>
>> at org.apache.hadoop.hdfs.DFSClient.
>> createWrappedOutputStream(DFSClient.java:1527)
>>
>> at org.apache.hadoop.hdfs.DistributedFileSystem$7.
>> doCall(DistributedFileSystem.java:428)
>>
>> at org.apache.hadoop.hdfs.DistributedFileSystem$7.
>> doCall(DistributedFileSystem.java:421)
>>
>> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
>> FileSystemLinkResolver.java:81)
>>
>> at org.apache.hadoop.hdfs.DistributedFileSystem.create(
>> DistributedFileSystem.java:421)
>>
>> at org.apache.hadoop.hdfs.DistributedFileSystem.create(
>> DistributedFileSystem.java:362)
>>
>> at org.apache.hadoop.fs.FileSystem.create(FileSystem.
>> java:925)
>>
>> at org.apache.hadoop.fs.FileSystem.create(FileSystem.
>> java:906)
>>
>> at parquet.hadoop.ParquetFileWriter.(
>> ParquetFileWriter.java:220)
>>
>> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(
>> ParquetOutputFormat.java:311)
>>
>> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(
>> ParquetOutputFormat.java:287)
>>
>> at org.apache.hadoop.hive.ql.io.parquet.write.
>> ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:65)
>>
>> at org.apache.hadoop.hive.ql.io.parquet.
>> MapredParquetOutputFormat.getParquerRecordWriterWrapper(
>> MapredParquetOutputFormat.java:125)
>>
>> at org.apache.hadoop.hive.ql.io.parquet.
>> MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.
>> java:114)
>>
>> at org.apache.hadoop.hive.ql

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread Georg Heiler
Did you pass a keytab? Is renewal enabled in your KDC?
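
For a long-running streaming job, one option (a sketch, assuming YARN and
Spark 2.x; principal and path are placeholders) is to hand the keytab to
Spark itself so it can re-login and renew tokens, instead of relying on an
external kinit script. This is equivalent to passing --principal and
--keytab to spark-submit:

val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.principal", "user@EXAMPLE.COM")
  .set("spark.yarn.keytab", "/etc/security/keytabs/user.keytab")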
KhajaAsmath Mohammed  wrote on Wed., 22 Nov. 2017 at 19:25:

> Hi,
>
> I have written spark stream job and job is running successfully for more
> than 36 hours. After around 36 hours job gets failed with kerberos issue.
> Any solution on how to resolve it.
>
> org.apache.spark.SparkException: Task failed while wri\
>
> ting rows.
>
> at
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)
>
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
>
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
> java.io.IOException: org.apache.hadoop.security.authentication.client.\
>
> AuthenticationException:
> org.apache.hadoop.security.token.SecretManager$InvalidToken: token (kms-dt
> owner=va_dflt, renewer=yarn, re\
>
> alUser=, issueDate=1511262017635, maxDate=1511866817635,
> sequenceNumber=1854601, masterKeyId=3392) is expired
>
> at
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)
>
> at
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.newOutputWriter$1(hiveWriterContainers.scala:346)
>
> at
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:304)
>
> ... 8 more
>
> Caused by: java.io.IOException:
> org.apache.hadoop.security.authentication.client.AuthenticationException:
> org.apache.hadoop.securit\
>
> y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
> renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\
>
> 17635, sequenceNumber=1854601, masterKeyId=3392) is expired
>
> at
> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:216)
>
> at
> org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
>
> at
> org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1440)
>
> at
> org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1542)
>
> at
> org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1527)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:428)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:421)
>
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:421)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:362)
>
> at
> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)
>
> at
> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>
> at
> parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:220)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:65)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:125)
>
> at
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:114)
>
> at
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:260)
>
> at
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:245)
>
> ... 10 more
>
> Caused by:
> org.apache.hadoop.security.authentication.client.AuthenticationException:
> org.apache.hadoop.security.token.SecretManager\
>
> $I

Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
Hi,

I have written a Spark Streaming job and the job runs successfully for more
than 36 hours. After around 36 hours the job fails with a Kerberos issue.
Any solution on how to resolve it?

org.apache.spark.SparkException: Task failed while wri\

ting rows.

at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
java.io.IOException: org.apache.hadoop.security.authentication.client.\

AuthenticationException:
org.apache.hadoop.security.token.SecretManager$InvalidToken: token (kms-dt
owner=va_dflt, renewer=yarn, re\

alUser=, issueDate=1511262017635, maxDate=1511866817635,
sequenceNumber=1854601, masterKeyId=3392) is expired

at
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:248)

at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.newOutputWriter$1(hiveWriterContainers.scala:346)

at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:304)

... 8 more

Caused by: java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
org.apache.hadoop.securit\

y.token.SecretManager$InvalidToken: token (kms-dt owner=va_dflt,
renewer=yarn, realUser=, issueDate=1511262017635, maxDate=15118668\

17635, sequenceNumber=1854601, masterKeyId=3392) is expired

at
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:216)

at
org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)

at
org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1440)

at
org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1542)

at
org.apache.hadoop.hdfs.DFSClient.createWrappedOutputStream(DFSClient.java:1527)

at
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:428)

at
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:421)

at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

at
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:421)

at
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:362)

at
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:925)

at
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)

at
parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:220)

at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)

at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)

at
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:65)

at
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:125)

at
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:114)

at
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:260)

at
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:245)

... 10 more

Caused by:
org.apache.hadoop.security.authentication.client.AuthenticationException:
org.apache.hadoop.security.token.SecretManager\

$InvalidToken: token (kms-dt owner=va_dflt, renewer=yarn, realUser=,
issueDate=1511262017635, maxDate=1511866817635, sequenceNumber\

=1854601, masterKeyId=3392) is expired

at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at
sun.reflect.NativeConstructorAccessorImpl.newInstance(N

newbie: how to partition data on file system. What are best practices?

2017-11-22 Thread Andy Davidson
I am working on a deep learning project. Currently we do everything on a
single machine. I am trying to figure out how we might be able to move to a
clustered Spark environment.

Clearly it's possible a machine or job on the cluster might fail, so I assume
that the data needs to be replicated to some degree.

Eventually I expect I will need to process multi-petabyte files and will
need to come up with some sort of sharding. Communication costs could be a
problem. Does Spark have any knowledge of how the data is distributed and
replicated across the machines in my cluster?

Let's say my data source is S3. Should I copy the data to my EC2 cluster or
try to read directly from S3?

If our pilot is successful we expect to need to process multi-petabyte files.

What are best practices?

Kind regards

Andy

P.s. We expect to use AWS or some other cloud solution.




Re: Writing files to s3 with out temporary directory

2017-11-22 Thread Haoyuan Li
This blog / tutorial may be helpful for running Spark in the cloud with
Alluxio.

Best regards,

Haoyuan

On Mon, Nov 20, 2017 at 2:12 PM, lucas.g...@gmail.com 
wrote:

> That sounds like a lot of work, and if I understand you correctly it sounds
> like a piece of middleware that already exists (I could be wrong).  Alluxio?
>
> Good luck and let us know how it goes!
>
> Gary
>
> On 20 November 2017 at 14:10, Jim Carroll  wrote:
>
>> Thanks. In the meantime I might just write a custom file system that maps
>> writes to parquet file parts to their final locations and then skips the
>> move.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Andy Davidson
I am starting a new deep learning project; currently we do all of our work on
a single machine using a combination of Keras and TensorFlow.
https://databricks.github.io/spark-deep-learning/site/index.html looks very
promising. Any idea how performance is likely to improve as I add machines
to my cluster?

Kind regards

Andy


P.s. Is user@spark.apache.org the best place to ask questions about this
package?






Spark Streaming Hive Dynamic Partitions Issue

2017-11-22 Thread KhajaAsmath Mohammed
Hi,

I am able to write data into Hive tables from Spark Streaming. The job ran
successfully for 37 hours and then I started getting errors from task
failures as below. The Hive table has data up until the tasks failed.

Job aborted due to stage failure: Task 0 in stage 691.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 691.0 (TID 10884,
brksvl171.brk.navistar.com, executor 2): org.apache.spark.SparkException:
Task failed while writing rows.

 at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)

 at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)

 at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)

 at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.NullPointerException

 at
parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)

 at
parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)

 at
parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)

 at
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:102)

 at
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:119)

 at
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:320)

 ... 8 more



Driver stacktrace:


Any solution for this, please?


Thanks,

Asmath


Re: Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-22 Thread Vadim Semenov
The error message seems self-explanatory; try to figure out what disk quota
you have for your user.
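
If the quota turns out to apply to the default scratch location, one thing to
try (an assumption on my part, not a confirmed fix) is pointing Spark's
shuffle/spill directory at a volume with more room, e.g.:

// Placeholder path; note that on YARN the NodeManager's
// yarn.nodemanager.local-dirs setting takes precedence over spark.local.dir.
val conf = new org.apache.spark.SparkConf()
  .set("spark.local.dir", "/data/spark-tmp")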

On Wed, Nov 22, 2017 at 8:23 AM, Chetan Khatri 
wrote:

> Anybody reply on this ?
>
> On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>>
>> Hello Spark Users,
>>
>> I am getting the below error when I am trying to write a dataset to a parquet
>> location. I have enough disk space available. Last time I was facing the same
>> kind of error, which was resolved by increasing the number of cores in the
>> hyper parameters. Currently the result set data size is almost 400 GB with the
>> below hyper parameters:
>>
>> Driver memory: 4g
>> Executor Memory: 16g
>> Executor cores=12
>> num executors= 8
>>
>> Still it's failing. Any idea whether it could get resolved if I increase
>> executor memory and the number of executors?
>>
>>
>> 17/11/21 04:29:37 ERROR storage.DiskBlockObjectWriter: Uncaught exception
>> while reverting partial writes to file /mapr/chetan/local/david.com/t
>> mp/hadoop/nm-local-dir/usercache/david-khurana/appcache/
>> application_1509639363072_10572/blockmgr-008604e6-37cb-
>> 421f-8cc5-e94db75684e7/12/temp_shuffle_ae885911-a1ef-
>> 404f-9a6a-ded544bb5b3c
>> java.io.IOException: Disk quota exceeded
>> at java.io.FileOutputStream.close0(Native Method)
>> at java.io.FileOutputStream.access$000(FileOutputStream.java:53)
>> at java.io.FileOutputStream$1.close(FileOutputStream.java:356)
>> at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
>> at java.io.FileOutputStream.close(FileOutputStream.java:354)
>> at org.apache.spark.storage.TimeTrackingOutputStream.close(Time
>> TrackingOutputStream.java:72)
>> at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>> at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStr
>> eam.java:178)
>> at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>> at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>> at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$
>> anon$2.close(UnsafeRowSerializer.scala:96)
>> at org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$
>> close$2.apply$mcV$sp(DiskBlockObjectWriter.scala:108)
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:
>> 1316)
>> at org.apache.spark.storage.DiskBlockObjectWriter.close(DiskBlo
>> ckObjectWriter.scala:107)
>> at org.apache.spark.storage.DiskBlockObjectWriter.revertPartial
>> WritesAndClose(DiskBlockObjectWriter.scala:159)
>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.s
>> top(BypassMergeSortShuffleWriter.java:234)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:85)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
>> Task.scala:47)
>> at org.apache.spark.scheduler.Task.run(Task.scala:86)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.
>> scala:274)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 17/11/21 04:29:37 WARN netty.OneWayOutboxMessage: Failed to send one-way
>> RPC.
>> java.io.IOException: Failed to connect to /192.168.123.43:58889
>> at org.apache.spark.network.client.TransportClientFactory.creat
>> eClient(TransportClientFactory.java:228)
>> at org.apache.spark.network.client.TransportClientFactory.creat
>> eClient(TransportClientFactory.java:179)
>> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpc
>> Env.scala:197)
>> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:
>> 191)
>> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:
>> 187)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.net.ConnectException: Connection refused: /
>> 192.168.123.43:58889
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl
>> .java:717)
>> at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect
>> (NioSocketChannel.java:224)
>> at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fi
>> nishConnect(AbstractNioChannel.java:289)
>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEven
>> tLoop.java:528)
>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimiz
>> ed(NioEventLoop.java:468)
>> at io.netty.channel.nio.NioEventLoop.processSelecte

Re: Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-22 Thread Chetan Khatri
Can anybody reply on this?

On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri 
wrote:

>
> Hello Spark Users,
>
> I am getting the below error when I am trying to write a dataset to a parquet
> location. I have enough disk space available. Last time I was facing the same
> kind of error, which was resolved by increasing the number of cores in the
> hyper parameters. Currently the result set data size is almost 400 GB with the
> below hyper parameters:
>
> Driver memory: 4g
> Executor Memory: 16g
> Executor cores=12
> num executors= 8
>
> Still it's failing. Any idea whether it could get resolved if I increase
> executor memory and the number of executors?
>
>
> 17/11/21 04:29:37 ERROR storage.DiskBlockObjectWriter: Uncaught exception
> while reverting partial writes to file /mapr/chetan/local/david.com/
> tmp/hadoop/nm-local-dir/usercache/david-khurana/appcache/application_
> 1509639363072_10572/blockmgr-008604e6-37cb-421f-8cc5-
> e94db75684e7/12/temp_shuffle_ae885911-a1ef-404f-9a6a-ded544bb5b3c
> java.io.IOException: Disk quota exceeded
> at java.io.FileOutputStream.close0(Native Method)
> at java.io.FileOutputStream.access$000(FileOutputStream.java:53)
> at java.io.FileOutputStream$1.close(FileOutputStream.java:356)
> at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
> at java.io.FileOutputStream.close(FileOutputStream.java:354)
> at org.apache.spark.storage.TimeTrackingOutputStream.close(
> TimeTrackingOutputStream.java:72)
> at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
> at net.jpountz.lz4.LZ4BlockOutputStream.close(
> LZ4BlockOutputStream.java:178)
> at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
> at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
> at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$
> anon$2.close(UnsafeRowSerializer.scala:96)
> at org.apache.spark.storage.DiskBlockObjectWriter$$
> anonfun$close$2.apply$mcV$sp(DiskBlockObjectWriter.scala:108)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.
> scala:1316)
> at org.apache.spark.storage.DiskBlockObjectWriter.close(
> DiskBlockObjectWriter.scala:107)
> at org.apache.spark.storage.DiskBlockObjectWriter.
> revertPartialWritesAndClose(DiskBlockObjectWriter.scala:159)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.
> stop(BypassMergeSortShuffleWriter.java:234)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:85)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/11/21 04:29:37 WARN netty.OneWayOutboxMessage: Failed to send one-way
> RPC.
> java.io.IOException: Failed to connect to /192.168.123.43:58889
> at org.apache.spark.network.client.TransportClientFactory.
> createClient(TransportClientFactory.java:228)
> at org.apache.spark.network.client.TransportClientFactory.
> createClient(TransportClientFactory.java:179)
> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(
> NettyRpcEnv.scala:197)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.
> scala:191)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.
> scala:187)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: /
> 192.168.123.43:58889
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(
> SocketChannelImpl.java:717)
> at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(
> NioSocketChannel.java:224)
> at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.
> finishConnect(AbstractNioChannel.java:289)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(
> NioEventLoop.java:528)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(
> NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(
> NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:111)
>   ... 1 more
>