Unsubscribe

2017-10-14 Thread serkan taş


Get Outlook for Android


From: Darshan Pandya 
Sent: Saturday, October 14, 2017 11:55:06 AM
To: user
Subject: Spark Structured Streaming not connecting to Kafka using kerberos

Hello,

I'm using Spark 2.1.0 on CDH 5.8 with Kafka 0.10.0.1 + Kerberos.

I am unable to connect to the Kafka broker, with the following message:

17/10/14 14:29:10 WARN clients.NetworkClient: Bootstrap broker 10.197.19.25:9092 disconnected

and it is unable to consume any messages.

I am using it as follows:


jaas.conf

KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./gandalf.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka"
principal="gand...@domain.com";
};

$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --files jaas.conf,gandalf.keytab \
  --driver-java-options "-Djava.security.auth.login.config=./jaas.conf -Dhdp.version=2.4.2.0-258" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
  --class com.example.ClassName uber-jar-with-deps-and-hive-site.jar
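
A minimal PySpark sketch of Kafka source options that often accompany this kind of jaas.conf setup; the SASL_PLAINTEXT protocol, the service name, and the topic name "events" are assumptions for illustration, not taken from the original message:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-kerberos-sketch").getOrCreate()

# kafka.* options are passed through to the underlying Kafka consumer.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "10.197.19.25:9092")
      .option("kafka.security.protocol", "SASL_PLAINTEXT")   # assumption: the broker advertises a SASL listener
      .option("kafka.sasl.kerberos.service.name", "kafka")
      .option("subscribe", "events")                          # hypothetical topic name
      .load())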

Thanks in advance.


--
Sincerely,
Darshan



Running multiple spark jobs on yarn

2017-08-02 Thread serkan taş
Hi,

Where should I start to be able to run multiple Spark jobs concurrently on a 3-node Spark cluster?
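
One hedged starting point, sketched in PySpark: cap each application's resources so several applications fit on the cluster at once; the numbers below are assumptions for a small 3-node cluster, and the YARN scheduler configuration (capacity or fair queues) is the other half of the picture:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("job-1")
         .master("yarn")
         .config("spark.executor.instances", "2")   # assumed caps, leaving room for other concurrent jobs
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "2g")
         .getOrCreate())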

Get Outlook for Android



Re: Spark parquet file read problem !

2017-07-31 Thread serkan taş
Thank you very much.

Schema merge fixed the structure problem, but the fields with the same name but different types are still an issue I should work on.
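
A hedged sketch of one way to handle the remaining type conflict: read the two files separately, cast the conflicting column to a common type, then combine them; the column name "value" and the target type are assumptions for illustration:

from pyspark.sql.functions import col

df_a = spark.read.parquet("hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")
df_b = spark.read.parquet("hdfs://xxx/20170719/part-0-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")

# Cast the conflicting column to one agreed type in both frames (assumed here: string),
# then union; this assumes both frames end up with the same columns in the same order.
df = df_a.withColumn("value", col("value").cast("string")) \
         .union(df_b.withColumn("value", col("value").cast("string")))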

Get Outlook for Android<https://aka.ms/ghei36>



From: 萝卜丝炒饭
Sent: Monday, July 31, 11:16
Subject: Re: Spark parquet file read problem !
To: serkan taş, pandees waran
Cc: user@spark.apache.org


Please add the mergeSchema option.
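
A minimal sketch of this suggestion, assuming the option meant here is Spark's standard Parquet "mergeSchema" option, using the folder from the original message:

parquetFile = spark.read.option("mergeSchema", "true").parquet("hdfs://xxx/20170719/")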

---Original---
From: "serkan taş"<serkan_...@hotmail.com>
Date: 2017/7/31 13:54:14
To: "pandees waran"<pande...@gmail.com>;
Cc: "user@spark.apache.org"<user@spark.apache.org>;
Subject: Re: Spark parquet file read problem !

I checked and realised that the schemas of the files differ, with some missing fields and some fields having the same name but different types.

How may I overcome the issue?

Get Outlook for Android<https://aka.ms/ghei36>

From: pandees waran <pande...@gmail.com>
Sent: Sunday, July 30, 2017 7:12:55 PM
To: serkan taş
Cc: user@spark.apache.org
Subject: Re: Spark parquet file read problem !

I have encountered a similar error when the schemas / datatypes conflict between the two parquet files. Are you sure that the two individual files have the same structure with matching datatypes? If not, you have to fix this by enforcing default values for the missing fields to make the structure and data types identical.
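
A hedged PySpark sketch of the "enforce default values" idea above: add the field that one file lacks with a null (or default) value so both files expose the same columns before they are combined; the column name "missing_field" and its type are assumptions:

from pyspark.sql.functions import lit

df_b = spark.read.parquet("hdfs://xxx/20170719/part-0-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")
df_b = df_b.withColumn("missing_field", lit(None).cast("long"))   # hypothetical missing column and type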

Sent from my iPhone

On Jul 30, 2017, at 8:11 AM, serkan taş <serkan_...@hotmail.com> wrote:

Hi,

I have a problem while reading parquet files located in hdfs.


If I read the files individually, nothing is wrong and I can get the file content.

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")

and

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-0-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")


But when I try to read the folder,

This is how I read the folder:

parquetFile = spark.read.parquet("hdfs://xxx/20170719/")

then I get the below exception:

Note : Only these two files are in the folder. Please find the parquet files 
attached.


Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 
5, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: 
Can not read value at 1 in block 0 in file 
hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
   at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
   at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
   at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
   at 
org.a

Re: Spark parquet file read problem !

2017-07-30 Thread serkan taş
I checked and realised that the schemas of the files differ, with some missing fields and some fields having the same name but different types.

How may I overcome the issue?

Get Outlook for Android<https://aka.ms/ghei36>


From: pandees waran <pande...@gmail.com>
Sent: Sunday, July 30, 2017 7:12:55 PM
To: serkan taş
Cc: user@spark.apache.org
Subject: Re: Spark parquet file read problem !

I have encountered a similar error when the schemas / datatypes conflict between the two parquet files. Are you sure that the two individual files have the same structure with matching datatypes? If not, you have to fix this by enforcing default values for the missing fields to make the structure and data types identical.

Sent from my iPhone

On Jul 30, 2017, at 8:11 AM, serkan taş <serkan_...@hotmail.com> wrote:

Hi,

I have a problem while reading parquet files located in hdfs.


If I read the files individually, nothing is wrong and I can get the file content.

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")

and

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-0-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")


But when I try to read the folder,

This is how I read the folder:

parquetFile = spark.read.parquet("hdfs://xxx/20170719/")

then I get the below exception:

Note : Only these two files are in the folder. Please find the parquet files 
attached.


Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 
5, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: 
Can not read value at 1 in block 0 in file 
hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
   at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
   at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
   at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
   at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
   at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
   at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
   ... 16 more

Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.

Re: Spark parquet file read problem !

2017-07-30 Thread serkan taş
* for what ?


Yehuda Finkelshtein <yeh...@veracity-group.com> wrote (Jul 30, 2017 20:45):

Try to add "*" at the end of the folder

and

parquetFile = spark.read.parquet("hdfs://xxx/20170719/*")



On Jul 30, 2017 19:13, "pandees waran" <pande...@gmail.com> wrote:
I have encountered a similar error when the schemas / datatypes conflict between the two parquet files. Are you sure that the two individual files have the same structure with matching datatypes? If not, you have to fix this by enforcing default values for the missing fields to make the structure and data types identical.

Sent from my iPhone

On Jul 30, 2017, at 8:11 AM, serkan taş <serkan_...@hotmail.com> wrote:

Hi,

I have a problem while reading parquet files located in hdfs.


If I read the files individually, nothing is wrong and I can get the file content.





But when I try to read the folder,

This is how I read the folder:

parquetFile = spark.read.parquet("hdfs://xxx/20170719/")

then I get the below exception:

Note : Only these two files are in the folder. Please find the parquet files 
attached.


Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 
5, localhost, executor driver): 
org.apache.parquet.io.ParquetDecodingException: 
Can not read value at 1 in block 0 in file 
hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
   at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
   at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
   at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
   at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
   at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
   at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
   ... 16 more

Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
   at 
scala.collection.mutable.ResizableArray$class.foreach(Resiz

Spark parquet file read problem !

2017-07-30 Thread serkan taş
Hi,

I have a problem while reading parquet files located in hdfs.


If I read the files individually, nothing is wrong and I can get the file content.

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")

and

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-0-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")


But when I try to read the folder,

This is how I read the folder:

parquetFile = spark.read.parquet("hdfs://xxx/20170719/")

then I get the below exception:

Note : Only these two files are in the folder. Please find the parquet files 
attached.


Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 
5, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: 
Can not read value at 1 in block 0 in file 
hdfs://xxx/20170719/part-0-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
   at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
   at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
   at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
   at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
   at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
   at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
   at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
   at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
   at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
   ... 16 more

Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
   at scala.Option.foreach(Option.scala:257)
   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)