Hello Satish,
Thanks for your answer.

*I guess you have already made sure that the paths for your file are exactly the same on each of your nodes*
Yes.

*I'd also check the perms on your path*
The files all have the same permissions.

*Believe the sample code you pasted is only for testing - and you are already aware that a distributed count on a local file has no benefits.*
Yes, I've only done this to force the file to be read:

spark.sqlContext.read.text("file:///pathToFile").count
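For what it's worth, here is a rough way (just a sketch, using the same placeholder path and assuming the SparkSession is available as spark, as in the snippet above) to check from inside the job whether every executor host can actually see the local file:

import java.io.File
import java.net.InetAddress

// placeholder path, same file as above; java.io.File needs the plain
// filesystem path, without the file:// scheme
val path = "/pathToFile"

// run enough small tasks to land on every executor, then report one
// (hostname, fileExists) pair per host
val report = spark.sparkContext
  .parallelize(1 to 100, 100)
  .map(_ => (InetAddress.getLocalHost.getHostName, new File(path).exists))
  .distinct()
  .collect()

report.foreach { case (host, found) => println(s"$host -> file visible: $found") }

If any host prints false, that would be the node failing the lookup.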
*Typing the path explicitly resolved*
I'll try this; I'm not sure if I've tested the varargs path option yet (there's a quick hidden-character check sketched at the end of this message).

*Alternately - if the file size is small, you could do spark-submit with a --files option which will ship the file to every executor and is available for all executors.*
Unfortunately I can't do this, because it's a continuous reading process.

Thanks.

2017-06-14 4:35 GMT-03:00 satish lalam <satish.la...@gmail.com>:

> I guess you have already made sure that the paths for your file are exactly the same on each of your nodes. I'd also check the perms on your path.
> Believe the sample code you pasted is only for testing - and you are already aware that a distributed count on a local file has no benefits.
> Once I ran into a similar issue while copy-pasting file paths, probably due to encoding issues in some text editors. I'd copied a hidden char at the end of the path from the source file, which made my file lookup fail, but the code looked perfectly alright. Typing the path explicitly resolved it. But this is a corner case.
>
> Alternately - if the file size is small, you could do spark-submit with a --files option which will ship the file to every executor and is available for all executors.
>
> On Tue, Jun 13, 2017 at 11:02 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>
>> Hi all,
>> I'm trying to read a file from the local filesystem. I have 4 workstations, 1 master and 3 slaves, running with Ambari and Yarn, with Spark version *2.1.1.2.6.1.0-129*.
>>
>> The code that I'm trying to run is quite simple:
>>
>> spark.sqlContext.read.text("file:///pathToFile").count
>>
>> I've copied the file to all 4 workstations, and every time I try to run this I get the following exception:
>> 17/06/13 17:57:37 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 12, ip, executor 1): java.io.FileNotFoundException: File file:pathToFile does not exist
>> It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:175)
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> 17/06/13 17:57:37 ERROR TaskSetManager: Task 0 in stage 6.0 failed 4 times; aborting job
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 15, ip, executor 1): java.io.FileNotFoundException: File file:file:pathToFile does not exist
>> It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:175)
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>> at scala.Option.foreach(Option.scala:257)
>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
>> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>> at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>> at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>> at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
>> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>> at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
>> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
>> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
>> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2420)
>> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2419)
>> at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2801)
>> at org.apache.spark.sql.Dataset.count(Dataset.scala:2419)
>> ... 48 elided
>> Caused by: java.io.FileNotFoundException: File file:file:pathToFile does not exist
>> It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:175)
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Can somebody help me fix this?
>>
>> Kind Regards,
>> Dirceu
>
>
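PS: Regarding the hidden-character possibility mentioned above, a quick check I can run before retyping the path by hand (just a sketch; the path literal below is the same placeholder I used in my snippet) is to print any character in the path string that falls outside plain printable ASCII:

val path = "file:///pathToFile"  // placeholder, same value passed to spark.sqlContext.read.text
path.zipWithIndex.foreach { case (c, i) =>
  // flag anything outside the printable ASCII range (space to '~'),
  // e.g. a stray non-breaking space or control char copied from an editor
  if (c < ' ' || c > '~') println(f"index $i: U+${c.toInt}%04X")
}

If nothing is printed, the path string itself is clean and the problem is elsewhere.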