Yeah, fair point about Python.

spark-streaming-kafka should not contain third-party dependencies.
However, there's nothing stopping the build from producing an assembly
jar from these modules. I think there is an assembly target already,
though?
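
For reference, a rough sketch of what that could look like for a Python job,
assuming the external/kafka-assembly module (and the resulting
spark-streaming-kafka-assembly jar) exists for this Spark version; check the
top-level pom.xml before relying on these names:

mvn -DskipTests -pl external/kafka-assembly -am package

bin/spark-submit \
    --jars external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.3.1.jar \
    affected_hosts.py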

On Tue, May 12, 2015 at 3:37 PM, Lee McFadden <splee...@gmail.com> wrote:
> Sorry to flog this dead horse, but this is something every Python user
> is going to run into, as we *cannot* build the dependencies into our
> app. There is no way to do that with a Python script.
>
> As I see it, this is not a third-party integration. The package missing
> its dependencies is built by the Spark team, I believe?
> org.apache.spark:spark-streaming-kafka_2.10:1.3.1 is the problem
> package; if I remove the Cassandra package I still run into the same
> error.
>
> Is there really no way to add these dependencies to the Kafka package? I
> tried to add these dependencies in a number of ways, but the maze of
> pom.xml files makes that difficult for those not familiar with Java.
>
> Thanks again. I really don't want other Python users to run into the
> same brick wall I did, as I nearly gave up on Spark over what turns out
> to be a relatively simple thing.
>
>
> On Tue, May 12, 2015, 1:11 AM Sean Owen <so...@cloudera.com> wrote:
>>
>> The question is really whether all the third-party integrations should
>> be built into Spark's main assembly. I think reasonable people could
>> disagree, but I think the current state (not built in) is reasonable.
>> It means you have to bring the integration with you.
>>
>> That is, no, third-party queue integrations aren't built in out of the
>> box.
>>
>> The way you got it to work is one way, but not the preferred way: build
>> this into your app, and your packaging tool would have resolved the
>> dependencies.
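>>
>> For a JVM app, a minimal sketch of that route (assuming a Maven build;
>> the coordinates below are the ones discussed in this thread): declare
>> the integration as an ordinary dependency and let Maven pull in
>> metrics-core and the other transitive dependencies when the app jar is
>> packaged.
>>
>>     <!-- sketch only: in the application's pom.xml -->
>>     <dependency>
>>       <groupId>org.apache.spark</groupId>
>>       <artifactId>spark-streaming-kafka_2.10</artifactId>
>>       <version>1.3.1</version>
>>     </dependency>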
>>
>> I agree with resolving this as basically working-as-intended.
>>
>> On Tue, May 12, 2015 at 3:19 AM, Lee McFadden <splee...@gmail.com> wrote:
>> > I opened a ticket on this (without posting here first - bad etiquette,
>> > apologies), which was closed as 'fixed'.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-7538
>> >
>> > I don't believe that having my script running means this is fixed; I
>> > think it is still an issue.
>> >
>> > I downloaded the Spark source, ran `mvn -DskipTests clean package`,
>> > then simply launched my Python script (which shouldn't be introducing
>> > additional *Java* dependencies itself?).
>> >
>> > Doesn't this mean these dependencies are missing from the Spark build,
>> > since I didn't modify any files within the distribution and my
>> > application itself can't be introducing Java dependency clashes?
>> >
>> >
>> > On Mon, May 11, 2015, 4:34 PM Lee McFadden <splee...@gmail.com> wrote:
>> >>
>> >> Ted, many thanks.  I'm not used to Java dependencies so this was a real
>> >> head-scratcher for me.
>> >>
>> >> Downloading the two metrics packages from the Maven repository
>> >> (metrics-core, metrics-annotation) and supplying them on the
>> >> spark-submit command line worked.
>> >>
>> >> My final spark-submit for a Python project using Kafka as an input
>> >> source:
>> >>
>> >> /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \
>> >>     --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \
>> >>     --jars /home/ubuntu/jars/metrics-core-2.2.0.jar,/home/ubuntu/jars/metrics-annotation-2.2.0.jar \
>> >>     --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \
>> >>     --master spark://127.0.0.1:7077 \
>> >>     affected_hosts.py
>> >>
>> >> Now we're seeing data from the stream.  Thanks again!
>> >>
>> >> On Mon, May 11, 2015 at 2:43 PM Sean Owen <so...@cloudera.com> wrote:
>> >>>
>> >>> Ah yes, the Kafka + streaming code isn't in the assembly, is it? You'd
>> >>> have to provide it and all its dependencies with your app. You could
>> >>> also build this into your own app jar. Tools like Maven will add in
>> >>> the transitive dependencies.
>> >>>
>> >>> On Mon, May 11, 2015 at 10:04 PM, Lee McFadden <splee...@gmail.com>
>> >>> wrote:
>> >>> > Thanks Ted,
>> >>> >
>> >>> > The issue is that I'm using packages (see spark-submit definition)
>> >>> > and I do not know how to add com.yammer.metrics:metrics-core to my
>> >>> > classpath so Spark can see it.
>> >>> >
>> >>> > Should metrics-core not be part of the
>> >>> > org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can
>> >>> > work correctly?
>> >>> >
>> >>> > If not, any clues as to how I can add metrics-core to my project
>> >>> > (bearing in mind that I'm using Python, not a JVM language) would be
>> >>> > much appreciated.
>> >>> >
>> >>> > Thanks, and apologies for my newbness with Java/Scala.
>> >>> >
>> >>> > On Mon, May 11, 2015 at 1:42 PM Ted Yu <yuzhih...@gmail.com> wrote:
>> >>> >>
>> >>> >> com.yammer.metrics.core.Gauge is in the metrics-core jar,
>> >>> >> e.g., in the master branch:
>> >>> >> [INFO] |  \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile
>> >>> >> [INFO] |     +- com.yammer.metrics:metrics-core:jar:2.2.0:compile
>> >>> >>
>> >>> >> Please make sure the metrics-core jar is on the classpath.
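>> >>> >>
>> >>> >> As a sketch, one way to confirm that transitive dependency from a
>> >>> >> Spark source checkout (module path assumed; adjust for your version):
>> >>> >>
>> >>> >>     mvn -pl external/kafka dependency:tree -Dincludes=com.yammer.metrics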
>> >>> >>
>> >>> >> On Mon, May 11, 2015 at 1:32 PM, Lee McFadden <splee...@gmail.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Hi,
>> >>> >>>
>> >>> >>> We've been having some issues getting Spark Streaming running
>> >>> >>> correctly using a Kafka stream, and we've been going around in
>> >>> >>> circles trying to resolve this dependency.
>> >>> >>>
>> >>> >>> Details of our environment and the error are below; if anyone can
>> >>> >>> help resolve this it would be much appreciated.
>> >>> >>>
>> >>> >>> Submit command line:
>> >>> >>>
>> >>> >>> /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \
>> >>> >>>     --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \
>> >>> >>>     --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \
>> >>> >>>     --master spark://127.0.0.1:7077 \
>> >>> >>>     affected_hosts.py
>> >>> >>>
>> >>> >>> When we run the streaming job everything starts just fine; then we
>> >>> >>> see the following in the logs:
>> >>> >>>
>> >>> >>> 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal): java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge
>> >>> >>>         at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151)
>> >>> >>>         at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:115)
>> >>> >>>         at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:128)
>> >>> >>>         at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89)
>> >>> >>>         at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
>> >>> >>>         at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>> >>> >>>         at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>> >>> >>>         at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298)
>> >>> >>>         at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290)
>> >>> >>>         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>> >>> >>>         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>> >>> >>>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> >>> >>>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> >>> >>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> >>> >>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >>> >>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >>> >>>         at java.lang.Thread.run(Thread.java:745)
>> >>> >>> Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge
>> >>> >>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> >>> >>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> >>> >>>         at java.security.AccessController.doPrivileged(Native Method)
>> >>> >>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> >>> >>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> >>> >>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> >>> >>>         ... 17 more
>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>> >
