[jira] [Updated] (CRUNCH-509) Crunch with Spark doesn't name all outputs

Micah Whitacre (JIRA) Wed, 08 Apr 2015 19:05:12 -0700

     [ https://issues.apache.org/jira/browse/CRUNCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Micah Whitacre updated CRUNCH-509:
----------------------------------
    Attachment: CRUNCH-509.patch

Still working on a solution for this.  The change to add name support is 
pretty simple.  The downstream effect, however, is that all calls to 
materialize the output (which is what we do in the IT for Spark) fail because 
they cannot find the files.

{noformat}
4500 [Thread-29] INFO  org.apache.spark.scheduler.DAGScheduler  - Job 0 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.874098 s
15/04/08 20:57:48 INFO DAGScheduler: Job 0 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.874098 s
4573 [main] INFO  org.apache.crunch.io.avro.AvroFileReaderFactory  - Could not read avro file at path: file:/tmp/crunch-109470525/p1/part-r-00000
java.io.IOException: Not a data file.
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
        at org.apache.crunch.io.avro.AvroFileReaderFactory.read(AvroFileReaderFactory.java:74)
        at org.apache.crunch.io.CompositePathIterable$2.<init>(CompositePathIterable.java:87)
        at org.apache.crunch.io.CompositePathIterable.iterator(CompositePathIterable.java:85)
        at com.google.common.collect.Iterables$3.next(Iterables.java:512)
        at com.google.common.collect.Iterables$3.next(Iterables.java:505)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
        at org.apache.crunch.materialize.pobject.FirstElementPObject.process(FirstElementPObject.java:45)
        at org.apache.crunch.materialize.pobject.PObjectImpl.getValue(PObjectImpl.java:71)
        at org.apache.crunch.SparkPageRankIT.run(SparkPageRankIT.java:156)
        at org.apache.crunch.SparkPageRankIT.testAvroReflects(SparkPageRankIT.java:97)
{noformat}

One of the behavior changes I noticed is that when run without a name, the job 
produces files named part-r-00000.avro.  When we add the name, we now get 
files without the file extension.  I believe this might be related to the 
reader not being able to detect the files as containing data, but I haven't 
found where in the code that extension might be getting dropped.
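
One way to narrow this down is to check whether the extension-less part files 
are Avro container files at all.  Avro raises "Not a data file" when the 
header does not match the container magic bytes 'O', 'b', 'j', 1.  The 
following is only a diagnostic sketch: the class name AvroMagicCheck and its 
method are hypothetical and are not part of Crunch or Avro, and the directory 
to scan is passed as an argument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical diagnostic helper; not part of Crunch or Avro.
public class AvroMagicCheck {
    // Avro object container files begin with the four bytes 'O', 'b', 'j', 1.
    private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

    /** Returns true if the file starts with the Avro container magic bytes. */
    public static boolean looksLikeAvroContainer(Path file) throws IOException {
        byte[] header = new byte[AVRO_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    return false; // file is shorter than the magic header
                }
                read += n;
            }
        }
        return Arrays.equals(header, AVRO_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the output directory to inspect,
        // e.g. the p1 directory from the log above.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    System.out.println(f.getFileName() + " -> " + looksLikeAvroContainer(f));
                }
            }
        }
    }
}
```

If part-r-00000 reports false, the named output is probably being written 
in a different format; if it reports true, the problem is more likely in how 
the reader selects or opens the files.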

> Crunch with Spark doesn't name all outputs
> ------------------------------------------
>
>                 Key: CRUNCH-509
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-509
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>             Fix For: 0.12.0
>
>         Attachments: CRUNCH-509.patch
>
>
> Crunch currently does not "name" all outputs when running with a 
> SparkPipeline.  This becomes a problem because some Targets (based on 
> CRUNCH-82) have checks coded in to ensure that the name is populated.  
> Specifically, the implementation I'm running into issues with is the Kite 
> DatasetTarget[2].
> I need to read up a bit on the context to see whether this is a Crunch or a 
> Kite issue, and where it is easiest and most correct to fix.  Feedback from 
> [~jwills] or [~tomwhite] would be welcome.
> [1] - 
> https://github.com/apache/crunch/blob/3ab0b078c47f23b3ba893fdfb05fd723f663d02b/crunch-spark/src/main/java/org/apache/crunch/impl/spark/SparkRuntime.java#L337
> [2] - 
> https://github.com/kite-sdk/kite/blob/e080f0237e7383a16fff8547ad43387ccf55c473/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L178



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
