[
https://issues.apache.org/jira/browse/CRUNCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Micah Whitacre updated CRUNCH-509:
----------------------------------
Attachment: CRUNCH-509.patch
Still working on a solution for this. The change to add name support is pretty
simple. The downstream effect, however, is that all calls to materialize the
output (which is what we do in the IT for Spark) fail because they cannot find
the files.
{noformat}
4500 [Thread-29] INFO org.apache.spark.scheduler.DAGScheduler - Job 0
finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.874098 s
15/04/08 20:57:48 INFO DAGScheduler: Job 0 finished: saveAsNewAPIHadoopFile at
SparkRuntime.java:332, took 0.874098 s
4573 [main] INFO org.apache.crunch.io.avro.AvroFileReaderFactory - Could not
read avro file at path: file:/tmp/crunch-109470525/p1/part-r-00000
java.io.IOException: Not a data file.
at
org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at
org.apache.crunch.io.avro.AvroFileReaderFactory.read(AvroFileReaderFactory.java:74)
at
org.apache.crunch.io.CompositePathIterable$2.<init>(CompositePathIterable.java:87)
at
org.apache.crunch.io.CompositePathIterable.iterator(CompositePathIterable.java:85)
at com.google.common.collect.Iterables$3.next(Iterables.java:512)
at com.google.common.collect.Iterables$3.next(Iterables.java:505)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
at
org.apache.crunch.materialize.pobject.FirstElementPObject.process(FirstElementPObject.java:45)
at
org.apache.crunch.materialize.pobject.PObjectImpl.getValue(PObjectImpl.java:71)
at org.apache.crunch.SparkPageRankIT.run(SparkPageRankIT.java:156)
at
org.apache.crunch.SparkPageRankIT.testAvroReflects(SparkPageRankIT.java:97)
{noformat}
One of the behavior changes I noticed is that when run without a name, the job
produces files named part-r-00000.avro. When we add the name, we now get files
without the file extension. I believe this might be related to it not being
able to detect the files as containing data, but I haven't found where in the
code that extension might be getting dropped.
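To narrow down whether the extensionless files are really Avro data, a small
standalone check along these lines could help. It is only a sketch: the class
name AvroMagicCheck and its methods are made up for illustration and are not
part of Crunch or Avro. It relies on the fact that Avro object container files
start with the four bytes 'O', 'b', 'j', 1, which is, as far as I can tell,
the header check behind the "Not a data file." IOException in the trace above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of Crunch or Avro.
class AvroMagicCheck {
    // Avro object container files begin with these four bytes.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given bytes start with the Avro container magic.
    static boolean hasAvroMagic(byte[] header) {
        if (header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first four bytes of a local file and checks them.
    static boolean hasAvroMagic(Path file) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int off = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop.
            while (off < header.length) {
                int n = in.read(header, off, header.length - off);
                if (n < 0) {
                    break; // file is shorter than the magic header
                }
                off += n;
            }
        }
        return off == MAGIC.length && hasAvroMagic(header);
    }

    // Usage: java AvroMagicCheck <local-path> [<local-path> ...]
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            String verdict = hasAvroMagic(p) ? "Avro container" : "not an Avro container";
            System.out.println(p + ": " + verdict);
        }
    }
}
```

Running it against a local copy of the part-r-00000 file from the failing run
should show whether the file has a valid Avro header but lost its extension,
or whether it was written in some other format entirely.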
> Crunch with Spark doesn't name all outputs
> ------------------------------------------
>
> Key: CRUNCH-509
> URL: https://issues.apache.org/jira/browse/CRUNCH-509
> Project: Crunch
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.11.0
> Reporter: Micah Whitacre
> Assignee: Josh Wills
> Fix For: 0.12.0
>
> Attachments: CRUNCH-509.patch
>
>
> Crunch currently does not "name" all outputs when running with a
> SparkPipeline. This becomes a problem as some Targets (based on CRUNCH-82)
> have coded-in checks to ensure that the name is populated.
> Specifically the implementation I'm running into issues with is the Kite
> DatasetTarget[2].
> Need to read up a bit on context to see if it is a Crunch/Kite issue or where
> it is easiest/correct to fix. [~jwills] or [~tomwhite] feedback would be
> welcome.
> [1] -
> https://github.com/apache/crunch/blob/3ab0b078c47f23b3ba893fdfb05fd723f663d02b/crunch-spark/src/main/java/org/apache/crunch/impl/spark/SparkRuntime.java#L337
> [2] -
> https://github.com/kite-sdk/kite/blob/e080f0237e7383a16fff8547ad43387ccf55c473/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L178
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)