sbt "org.apache.spark#spark-streaming-kafka_2.11;2.0.0: not found"

2016-12-12 Thread Luke Adolph
Hi all,
My project uses the spark-streaming-kafka module. When I migrated Spark from
1.6.0 to 2.0.0 and rebuilt the project, I ran into the error below:

[warn] module not found: org.apache.spark#spark-streaming-kafka_2.11;2.0.0
[warn]  local: tried
[warn]   /home/linker/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.11/2.0.0/ivys/ivy.xml
[warn]  public: tried
[warn]   https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.11/2.0.0/spark-streaming-kafka_2.11-2.0.0.pom
[warn]  Akka Repository: tried
[warn]   http://repo.akka.io/releases/org/apache/spark/spark-streaming-kafka_2.11/2.0.0/spark-streaming-kafka_2.11-2.0.0.pom
[warn]  sonatype-public: tried
[warn]   https://oss.sonatype.org/content/repositories/public/org/apache/spark/spark-streaming-kafka_2.11/2.0.0/spark-streaming-kafka_2.11-2.0.0.pom
[info] Resolving jline#jline;2.12.1 ...
[warn] ::
[warn] ::  UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: org.apache.spark#spark-streaming-kafka_2.11;2.0.0: not found
[warn] ::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.spark:spark-streaming-kafka_2.11:2.0.0
(/home/linker/workspace/linkerwp/linkerStreaming/build.sbt#L12-23)
sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka_2.11;2.0.0: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:128)
at sbt.IvySbt.withIvy(Ivy.scala:125)
at sbt.IvySbt$Module.withModule(Ivy.scala:156)
at sbt.IvyActions$.updateEither(IvyActions.scala:168)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1442)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1438)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$90.apply(Defaults.scala:1473)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$90.apply(Defaults.scala:1471)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1476)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1470)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1493)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1420)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1372)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[error] (*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka_2.11;2.0.0: not found
[error] Total time: 8 s, completed Dec 13, 2016 3:16:37 PM


My dependencies in build.sbt are:
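For context, Spark 2.0.0 does not publish a spark-streaming-kafka_2.11 artifact at all; the Kafka integration was split into separate spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10 modules. A minimal build.sbt sketch, assuming the Kafka 0.8 connector (which keeps the 1.6-style KafkaUtils DStream API), would be:

// Hypothetical build.sbt excerpt; the actual dependency list from the message
// was not included. In Spark 2.0.x the Kafka connector is published per Kafka
// version, so the 2.11;2.0.0 coordinate in the error above cannot resolve.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
)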

Re: Dependencies with running Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Luke Adolph
Update:
I rebuilt my Mesos executor image and downloaded
*spark-streaming-kafka_2.10-1.6.0.jar* into *`/linker/jars`*.

I changed my submit command:

dcos spark run \
> --submit-args='--jars local:/linker/jars/spark-streaming-kafka_2.10-1.6.0.jar spark2cassandra.py 10.140.0.14:2181 wlu_spark2cassandra' \
> --docker-image adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2


Here is the new stderr output I get on Mesos:

[inline screenshot of the new stderr output]
My only remaining problem is submitting the dependency
spark-streaming-kafka_2.10-1.6.0.jar to the workers.

Thanks.


2016-07-13 18:57 GMT+08:00 Luke Adolph <kenan3...@gmail.com>:

> Hi all:
> My Spark runs on Mesos. I wrote a Spark Streaming app in Python; the code is
> on GitHub <https://github.com/adolphlwq/linkerProcessorSample>.
>
> The app depends on
> "*org.apache.spark:spark-streaming-kafka_2.10:1.6.1*".
>
> Spark on Mesos has two important concepts: the Spark framework and the Spark
> executor.
>
> I run my executor in a Docker image. The image's Dockerfile
> <https://github.com/adolphlwq/linkerProcessorSample/blob/master/docker/Dockerfile>
> is below:
>
> # refer '
>> http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties'
>> on 'spark.mesos.executor.docker.image' section
>
> FROM ubuntu:14.04
>> WORKDIR /linker
>> RUN ln -f -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
>> #download mesos
>> RUN echo "deb http://repos.mesosphere.io/ubuntu/ trusty main" >
>> /etc/apt/sources.list.d/mesosphere.list && \
>> apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF && \
>> apt-get update && \
>> apt-get -y install mesos=0.28.1-2.0.20.ubuntu1404 openjdk-7-jre
>> python-pip git vim curl
>> RUN git clone https://github.com/adolphlwq/linkerProcessorSample.git && \
>> pip install -r linkerProcessorSample/docker/requirements.txt
>> RUN curl -fL
>> http://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
>> | tar xzf - -C /usr/local && \
>> apt-get clean
>> ENV MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so \
>> JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 \
>> SPARK_HOME=/usr/local/spark-1.6.0-bin-hadoop2.6
>> ENV PATH=$JAVA_HOME/bin:$PATH
>> WORKDIR $SPARK_HOME
>
>
> When I use the command below to submit my app:
>
> dcos spark run --submit-args='--packages
>> org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
>>spark2cassandra.py zk topic' \
>> -docker-image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta
>
>
> The executor Docker container runs successfully, but it has no package for
> *org.apache.spark:spark-streaming-kafka_2.10:1.6.1*.
>
> The *stderr* on Mesos is:
>
> I0713 09:34:52.715551 18124 logging.cpp:188] INFO level logging started!
>> I0713 09:34:52.717797 18124 fetcher.cpp:424] Fetcher Info:
>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/home\/ubuntu\/spark2cassandra.py"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.spark_spark-streaming-kafka_2.10-1.6.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/com.101tec_zkclient-0.3.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.kafka_kafka_2.10-0.8.2.1.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.slf4j_slf4j-api-1.7.10.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.spark-project.spark_unused-1.0.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/net.jpountz.lz4_lz4-1.3.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/log4j_log4j-1.2.17.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/com.yammer.metrics_metrics-core-2.2.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.kafka_kafka-clients-0.8.2.1.jar"}},{"action":&qu

Dependencies with running Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Luke Adolph
Hi all:
My Spark runs on Mesos. I wrote a Spark Streaming app in Python; the code is on
GitHub <https://github.com/adolphlwq/linkerProcessorSample>.

The app depends on "*org.apache.spark:spark-streaming-kafka_2.10:1.6.1*".

Spark on Mesos has two important concepts: the Spark framework and the Spark
executor.

I run my executor in a Docker image. The image's Dockerfile
<https://github.com/adolphlwq/linkerProcessorSample/blob/master/docker/Dockerfile>
is below:

# refer '
> http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties'
> on 'spark.mesos.executor.docker.image' section

FROM ubuntu:14.04
> WORKDIR /linker
> RUN ln -f -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
> #download mesos
> RUN echo "deb http://repos.mesosphere.io/ubuntu/ trusty main" >
> /etc/apt/sources.list.d/mesosphere.list && \
> apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF && \
> apt-get update && \
> apt-get -y install mesos=0.28.1-2.0.20.ubuntu1404 openjdk-7-jre
> python-pip git vim curl
> RUN git clone https://github.com/adolphlwq/linkerProcessorSample.git && \
> pip install -r linkerProcessorSample/docker/requirements.txt
> RUN curl -fL
> http://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
> | tar xzf - -C /usr/local && \
> apt-get clean
> ENV MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so \
> JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 \
> SPARK_HOME=/usr/local/spark-1.6.0-bin-hadoop2.6
> ENV PATH=$JAVA_HOME/bin:$PATH
> WORKDIR $SPARK_HOME


When I use the command below to submit my app:

dcos spark run --submit-args='--packages
> org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
>spark2cassandra.py zk topic' \
> -docker-image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta


The executor Docker container runs successfully, but it has no package for
*org.apache.spark:spark-streaming-kafka_2.10:1.6.1*.

The *stderr* on Mesos is:

I0713 09:34:52.715551 18124 logging.cpp:188] INFO level logging started!
> I0713 09:34:52.717797 18124 fetcher.cpp:424] Fetcher Info:
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/home\/ubuntu\/spark2cassandra.py"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.spark_spark-streaming-kafka_2.10-1.6.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/com.101tec_zkclient-0.3.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.kafka_kafka_2.10-0.8.2.1.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.slf4j_slf4j-api-1.7.10.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.spark-project.spark_unused-1.0.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/net.jpountz.lz4_lz4-1.3.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/log4j_log4j-1.2.17.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/com.yammer.metrics_metrics-core-2.2.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.kafka_kafka-clients-0.8.2.1.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.xerial.snappy_snappy-java-1.1.2.jar"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4\/frameworks\/7399b6f7-5dcd-4a9b-9846-e7948d5ffd11-0024\/executors\/driver-20160713093451-0015\/runs\/84419372-9482-4c58-8f87-4ba528b6885c"}
> I0713 09:34:52.719846 18124 fetcher.cpp:379] Fetching URI
> '/home/ubuntu/spark2cassandra.py'
> I0713 09:34:52.719866 18124 fetcher.cpp:250] Fetching directly into the
> sandbox directory
> I0713 09:34:52.719925 18124 fetcher.cpp:187] Fetching URI
> '/home/ubuntu/spark2cassandra.py'
> I0713 09:34:52.719945 18124 fetcher.cpp:167] Copying resource with
> command:cp '/home/ubuntu/spark2cassandra.py'
> '/tmp/mesos/slaves/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4/frameworks/7399b6f7-5dcd-4a9b-9846-e7948d5ffd11-0024/executors/driver-20160713093451-0015/runs/84419372-9482-4c58-8f87-4ba528b6885c/spark2cassandra.py'
> W0713 09:34:52.722587 18124 fetcher.cpp:272] Copying instead of extracting
> resource from URI with 'extract' flag, because it does not seem to be an
> archive: /home/ubuntu/spark2cassandra.py
> I0713 09:34:52.724138 18124 fetcher.cpp:456] Fetched
> '/home/ubuntu/spark2cassandra.py' to
> '/tmp/mesos/slaves/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4/frameworks/7399b6f7-5dcd-4a9b-9846-e7948d5ffd11-0024/executors/driver-20160713093451-0015/runs/84419372-9482-4c58-8f87-4ba528b6885c/spark2cassandra.py'
> I0713 09:34:52.724148 18124 fetcher.cpp:379] Fetching URI
> '/root/.ivy2/jars/org.apache.spark_spark-streaming-kafka_2.10-1.6.0.jar'
> I0713 09:34:52.724153 18124 fetcher.cpp:250] Fetching directly 

Save RDD to HDFS using Spark Python API

2016-04-26 Thread Luke Adolph
Hi, all:
Below is my code:

from pyspark import *
import re

def getDateByLine(input_str):
    str_pattern = '^\d{4}-\d{2}-\d{2}'
    pattern = re.compile(str_pattern)
    match = pattern.match(input_str)
    if match:
        return match.group()
    else:
        return None

file_url = "hdfs://192.168.10.130:9000/dev/test/test.log"
input_file = sc.textFile(file_url)
line = input_file.filter(getDateByLine).map(lambda x: (x[:10], 1))
counts = line.reduceByKey(lambda a,b: a+b)
print counts.collect()
counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",\
    "org.apache.hadoop.mapred.SequenceFileOutputFormat")


What confuses me is the method *saveAsHadoopFile*. I have read the PySpark
API docs, but I still don't understand what the second argument means.
Below is the output when I run the above code:

[(u'2016-02-29', 99), (u'2016-03-02', 30)]

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
 in ()
     18 counts = line.reduceByKey(lambda a,b: a+b)
     19 print counts.collect()
---> 20 counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",
                                "org.apache.hadoop.mapred.SequenceFileOutputFormat")

/mydata/softwares/spark-1.6.1/python/pyspark/rdd.pyc in saveAsHadoopFile(self, path, outputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, compressionCodecClass)
   1419                                          keyClass, valueClass,
   1420                                          keyConverter, valueConverter,
-> 1421                                          jconf, compressionCodecClass)
   1422
   1423     def saveAsSequenceFile(self, path, compressionCodecClass=None):

/mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

/mydata/softwares/spark-1.6.1/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://192.168.10.130:9000/dev/output/test already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1179)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026)
at org.apache.spark.api.python.PythonRDD$.saveAsHadoopFile(PythonRDD.scala:753)
at org.apache.spark.api.python.PythonRDD.saveAsHadoopFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at