Hi, I'm moving my infrastructure from 1.5.2 to 1.6.0 and experiencing a
serious issue. I successfully updated the Spark Thrift Server from 1.5.2 to
1.6.0, but I have a standalone application which worked fine with 1.5.2 and
is now failing on 1.6.0 with:

NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)

Inside this application I work with a Hive table whose data is in JSON
format.
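For context, the access pattern looks roughly like this (a minimal sketch;
the table name json_events and the query are hypothetical placeholders, and
the real HiveContext is built from the config at the end of this mail):

import org.apache.spark.sql.hive.HiveContext

// The application goes through a HiveContext to reach the Hive metastore;
// using it is what ends up calling into DataNucleus via JDOHelper.
def readEvents(hiveContext: HiveContext) = {
  // Hypothetical table; its rows hold JSON-encoded records
  hiveContext.sql("SELECT * FROM json_events")
}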

When I add

<dependency>
    <groupId>org.datanucleus</groupId>
    <artifactId>datanucleus-core</artifactId>
    <version>4.0.0-release</version>
</dependency>

<dependency>
    <groupId>org.datanucleus</groupId>
    <artifactId>datanucleus-api-jdo</artifactId>
    <version>4.0.0-release</version>
</dependency>

<dependency>
    <groupId>org.datanucleus</groupId>
    <artifactId>datanucleus-rdbms</artifactId>
    <version>3.2.9</version>
</dependency>

I'm getting:

Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence
process has been specified to use a ClassLoaderResolver of name
"datanucleus" yet this has not been found by the DataNucleus plugin
mechanism. Please check your CLASSPATH and plugin specification.
    at org.datanucleus.AbstractNucleusContext.<init>(AbstractNucleusContext.java:102)
    at org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:162)

I have CDH 5.5. I build Spark with:

./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.0 -Phive -DskipTests

Then I publish the fat jar locally:

mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file -Dfile=./spark-assembly.jar -DgroupId=org.spark-project -DartifactId=my-spark-assembly -Dversion=1.6.0-SNAPSHOT -Dpackaging=jar

Then I include a dependency on this fat jar:

<dependency>
    <groupId>org.spark-project</groupId>
    <artifactId>my-spark-assembly</artifactId>
    <version>1.6.0-SNAPSHOT</version>
</dependency>

Then I build my application with the shade plugin:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
        <artifactSet>
            <includes>
                <include>*:*</include>
            </includes>
        </artifactSet>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/services/org.apache.hadoop.fs.FileSystem</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>reference.conf</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                        <resource>log4j.properties</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer"/>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheNoticeResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>

The shade plugin configuration is copy-pasted from the Spark assembly POM.

This workflow worked for 1.5.2 and broke for 1.6.0. If my approach to
building this standalone application is not a good one, please recommend
another one, but spark-submit does not work for me - it is hard to hook
it into Oozie.

Any suggestion would be appreciated - I'm stuck.

My Spark config:

lazy val sparkConf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName(appName)
  .set("spark.yarn.queue", "jenkins")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "2000")
  .set("spark.executor.cores", "3")
  .set("spark.driver.memory", "4g")
  .set("spark.shuffle.io.numConnectionsPerPeer", "5")
  .set("spark.sql.autoBroadcastJoinThreshold", "200483647")
  .set("spark.network.timeout", "1000s")
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=2g")
  .set("spark.driver.maxResultSize", "2g")
  .set("spark.rpc.lookupTimeout", "1000s")
  .set("spark.sql.hive.convertMetastoreParquet", "false")
  .set("spark.kryoserializer.buffer.max", "200m")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.yarn.driver.memoryOverhead", "1000")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.sql.tungsten.enabled", "false")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "100s")
  .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath()))
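The contexts themselves are created from this config roughly as follows (a
sketch for completeness; the HiveContext is where the Hive metastore client,
and therefore DataNucleus, gets initialized and where the stack traces above
are thrown):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

lazy val sc = new SparkContext(sparkConf)
// First use of the HiveContext sets up the Hive metastore client, which
// calls javax.jdo.JDOHelper.getPersistenceManagerFactory under the hood.
lazy val hiveContext = new HiveContext(sc)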

-- 
Sincerely yours,
Egor Pakhomov
