Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
What Spark version are you using? Also, a small code snippet of how you use
Spark Streaming would be greatly helpful.

On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V  wrote:

> I am able to read and print a few lines. After that I'm getting this
> exception. Any idea about this?
>
> *Thanks*,
> 
>
>
> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V 
> wrote:
>
>> Hi,
>>
>> I'm trying to read from a Kafka stream and print it to a text file. I'm using
>> Java on Spark. I don't know why I'm getting the following exception.
>> Also, the exception message is very abstract. Can anyone please help me?
>>
>> Log Trace :
>>
>> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job generator
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> at
>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>> at
>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>> at
>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>> at
>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
>> exception: java.lang.NullPointerException
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> at
>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>> at
>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>> at
>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>> at
>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>>
>>
>> *Thanks*,
>> 
>>
>>
>


Spark 1.5.1 Build Failure

2015-10-30 Thread Raghuveer Chanda
Hi,

I am trying to build Spark 1.5.1 for Hadoop 2.5, but I get the following
error.


*build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests
clean package*


[INFO] Spark Project Parent POM ... SUCCESS [
 9.812 s]
[INFO] Spark Project Launcher . SUCCESS [
27.701 s]
[INFO] Spark Project Networking ... SUCCESS [
16.721 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
 8.617 s]
[INFO] Spark Project Unsafe ... SUCCESS [
27.124 s]
[INFO] Spark Project Core . FAILURE [09:08
min]

Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
(scala-test-compile-first) on project spark-core_2.10: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
CompileFailed -> [Help 1]



-- 
Regards,
Raghuveer Chanda


Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Raghuveer Chanda
Thanks for the reply.

I am using the mvn and scala from the source tree (build/mvn) only, and I get
the same error even without Hadoop after clean package.


*Java Version:*

*rchanda@ubuntu:~/Downloads/spark-1.5.1$ java -version*
*java version "1.7.0_85"*
*OpenJDK Runtime Environment (IcedTea 2.6.1) (7u85-2.6.1-5ubuntu0.14.04.1)*
*OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)*

*Complete Error:*

*rchanda@ubuntu:~/Downloads/spark-1.5.1$ build/mvn -DskiptTests clean
package*
*Using `mvn` from path:
/home/rchanda/Downloads/spark-1.5.1/build/apache-maven-3.3.3/bin/mvn*
*[INFO] Scanning for projects...*
*[INFO]
*
*[INFO] Reactor Build Order:*
*[INFO] *
*[INFO] Spark Project Parent POM*
*[INFO] Spark Project Launcher*
*[INFO] Spark Project Networking*
*[INFO] Spark Project Shuffle Streaming Service*
*[INFO] Spark Project Unsafe*
*[INFO] Spark Project Core*
*[INFO] Spark Project Bagel*
*[INFO] Spark Project GraphX*
*[INFO] Spark Project Streaming*
*[INFO] Spark Project Catalyst*
*[INFO] Spark Project SQL*
*[INFO] Spark Project ML Library*
*[INFO] Spark Project Tools*
*[INFO] Spark Project Hive*
*[INFO] Spark Project REPL*
*[INFO] Spark Project Assembly*
*[INFO] Spark Project External Twitter*
*[INFO] Spark Project External Flume Sink*
*[INFO] Spark Project External Flume*
*[INFO] Spark Project External Flume Assembly*
*[INFO] Spark Project External MQTT*
*[INFO] Spark Project External MQTT Assembly*
*[INFO] Spark Project External ZeroMQ*
*[INFO] Spark Project External Kafka*
*[INFO] Spark Project Examples*
*[INFO] Spark Project External Kafka Assembly*
*[INFO]
*
*[INFO]
*
*[INFO] Building Spark Project Parent POM 1.5.1*
*[INFO]
*
*[INFO] *
*[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
spark-parent_2.10 ---*
*[INFO] *
*[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
spark-parent_2.10 ---*
*[INFO] *
*[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
spark-parent_2.10 ---*
*[INFO] Add Source directory:
/home/rchanda/Downloads/spark-1.5.1/src/main/scala*
*[INFO] Add Test Source directory:
/home/rchanda/Downloads/spark-1.5.1/src/test/scala*
*[INFO] *
*[INFO] --- maven-remote-resources-plugin:1.5:process (default) @
spark-parent_2.10 ---*
*[INFO] *
*[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-parent_2.10 ---*
*[INFO] No sources to compile*
*[INFO] *
*[INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @
spark-parent_2.10 ---*
*[INFO] Executing tasks*

*main:*
*[mkdir] Created dir: /home/rchanda/Downloads/spark-1.5.1/target/tmp*
*[INFO] Executed tasks*
*[INFO] *
*[INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first)
@ spark-parent_2.10 ---*
*[INFO] No sources to compile*
*[INFO] *
*[INFO] --- maven-dependency-plugin:2.10:build-classpath (default) @
spark-parent_2.10 ---*
*[INFO] *
*[INFO] --- scalatest-maven-plugin:1.0:test (test) @ spark-parent_2.10 ---*
*Discovery starting.*
*Discovery completed in 178 milliseconds.*
*Run starting. Expected test count is: 0*
*DiscoverySuite:*
*Run completed in 403 milliseconds.*
*Total number of tests run: 0*
*Suites: completed 1, aborted 0*
*Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0*
*No tests were executed.*
*[INFO] *
*[INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
spark-parent_2.10 ---*
*[INFO] Building jar:
/home/rchanda/Downloads/spark-1.5.1/target/spark-parent_2.10-1.5.1-tests.jar*
*[INFO] *
*[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @
spark-parent_2.10 ---*
*[INFO] *
*[INFO] --- maven-shade-plugin:2.4.1:shade (default) @ spark-parent_2.10
---*
*[INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
jar.*
*[INFO] Replacing original artifact with shaded artifact.*
*[INFO] *
*[INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @
spark-parent_2.10 ---*
*[INFO] *
*[INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @
spark-parent_2.10 ---*
*[INFO]
*
*[INFO]
*
*[INFO] Building Spark Project Launcher 1.5.1*
*[INFO]
*
*[INFO] *
*[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
spark-launcher_2.10 ---*
*[INFO] *
*[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
spark-launcher_2.10 ---*
*[INFO] *
*[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
spark-launcher_2.10 ---*
*[INFO] Add Source directory:
/home/rchanda/Downloads/spark-1.5.1/launcher/src/main/scala*
*[INFO] Add Test Source directory:
/home/rchanda/Downloads/spark-1.5.1/launcher/src/test/scala*
*[INFO] *
*[INFO] --- maven-remote-resources-plugin:1.5:process (default) @

Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Raghuveer Chanda
There seems to be an error at the zinc server. How can I shut down the zinc
server completely? *build/zinc-0.3.5.3/bin/zinc -shutdown* will shut it down,
but it restarts again with the build/mvn command.



*Error in Debug mode :*

*[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
(scala-test-compile-first) on project spark-core_2.10: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
CompileFailed -> [Help 1]*
*org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
(scala-test-compile-first) on project spark-core_2.10: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.*
* at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)*
* at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)*
* at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)*
* at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)*
* at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)*
* at
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)*
* at
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)*
* at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)*
* at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)*
* at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)*
* at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)*
* at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)*
* at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)*
* at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)*
* at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)*
* at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)*
* at java.lang.reflect.Method.invoke(Method.java:606)*
* at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)*
* at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)*
* at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)*
* at
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)*
*Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.*
* at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)*
* at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)*
* ... 20 more*
*Caused by: Compile failed via zinc server*
* at
sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)*
* at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)*
* at
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)*
* at
scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)*
* at
scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)*
* at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)*
* at scala_maven.ScalaTestCompileMojo.execute(ScalaTestCompileMojo.java:48)*
* at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)*
* ... 21 more*
*[ERROR] *
*[ERROR] *
*[ERROR] For more information about the errors and possible solutions,
please read the following articles:*
*[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
*
*[ERROR] *
*[ERROR] After correcting the problems, you can resume the build with the
command*
*[ERROR]   mvn  -rf :spark-core_2.10*

*Regards*
*Raghuveer*

On Fri, Oct 30, 2015 at 1:18 PM, Raghuveer Chanda <
raghuveer.cha...@gmail.com> wrote:

> Thanks for the reply.
>
> I am using the mvn and scala from the source code build/mvn only and I get
> the same error without hadoop also after clean package.
>
>
> *Java Version:*
>
> *rchanda@ubuntu:~/Downloads/spark-1.5.1$ java -version*
> *java version "1.7.0_85"*
> *OpenJDK Runtime Environment (IcedTea 2.6.1) (7u85-2.6.1-5ubuntu0.14.04.1)*
> *OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)*
>
> *Complete Error:*
>
> *rchanda@ubuntu:~/Downloads/spark-1.5.1$ build/mvn -DskiptTests clean
> package*
> *Using `mvn` from path:
> /home/rchanda/Downloads/spark-1.5.1/build/apache-maven-3.3.3/bin/mvn*
> *[INFO] Scanning for projects...*
> *[INFO]
> *
> *[INFO] Reactor Build Order:*
> *[INFO] *
> *[INFO] Spark Project Parent POM*
> *[INFO] Spark 

Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Jia Zhan
Hi,

Have you tried building it successfully without Hadoop?

$ build/mvn -DskipTests clean package

Can you check whether build/mvn started successfully, or whether it's using
your own mvn? Let us know your JDK version as well.

On Thu, Oct 29, 2015 at 11:34 PM, Raghuveer Chanda <
raghuveer.cha...@gmail.com> wrote:

> Hi,
>
> I am trying to build spark 1.5.1 for hadoop 2.5 but I get the following
> error.
>
>
> *build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests
> clean package*
>
>
> [INFO] Spark Project Parent POM ... SUCCESS [
>  9.812 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 27.701 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 16.721 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  8.617 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 27.124 s]
> [INFO] Spark Project Core . FAILURE [09:08
> min]
>
> Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-core_2.10: Execution
> scala-test-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
> CompileFailed -> [Help 1]
>
>
>
> --
> Regards,
> Raghuveer Chanda
>
>


-- 
Jia Zhan


Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
spark version - spark 1.4.1

my code snippet:

String brokers = "ip:port,ip:port";
String topics = "x,y,z";
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(",")));
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);

// jssc is the JavaStreamingContext created earlier (not shown in this snippet)
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    public Void call(JavaPairRDD<String, String> tuple) {
        JavaRDD<String> rdd = tuple.values();
        rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
        return null;
    }
});


*Thanks*,



On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao  wrote:

> What Spark version are you using, also a small code snippet of how you use
> Spark Streaming would be greatly helpful.
>
> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V 
> wrote:
>
>> I can able to read and print few lines. Afterthat i'm getting this
>> exception. Any idea for this ?
>>
>> *Thanks*,
>> 
>>
>>
>> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read from kafka stream and printing it textfile. I'm using
>>> java over spark. I dont know why i'm getting the following exception.
>>> Also exception message is very abstract.  can anyone please help me ?
>>>
>>> Log Trace :
>>>
>>> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job generator
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>>> at
>>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>>> at
>>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>>> at
>>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>>> at
>>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>>> at
>>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
>>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>>> at
>>> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
>>> exception: java.lang.NullPointerException
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>>> at
>>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>>> at
>>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>>> at
>>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>>> at
>>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>>> at
>>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
>>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>>> at
>>> 

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
I am able to read and print a few lines. After that I'm getting this
exception. Any idea about this?

*Thanks*,



On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V  wrote:

> Hi,
>
> I'm trying to read from kafka stream and printing it textfile. I'm using
> java over spark. I dont know why i'm getting the following exception.
> Also exception message is very abstract.  can anyone please help me ?
>
> Log Trace :
>
> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job generator
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
> at
> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
> at
> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
> at org.apache.spark.streaming.scheduler.JobGenerator.org
> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
> exception: java.lang.NullPointerException
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
> at
> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
> at
> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
> at org.apache.spark.streaming.scheduler.JobGenerator.org
> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
>
>
> *Thanks*,
> 
>
>


Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
Do you have any special settings? From your code, I don't think it would
incur an NPE at that place.

On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V  wrote:

> spark version - spark 1.4.1
>
> my code snippet:
>
> String brokers = "ip:port,ip:port";
> String topics = "x,y,z";
> HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(",")));
> HashMap<String, String> kafkaParams = new HashMap<String, String>();
> kafkaParams.put("metadata.broker.list", brokers);
>
> JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
>         jssc,
>         String.class,
>         String.class,
>         StringDecoder.class,
>         StringDecoder.class,
>         kafkaParams,
>         topicsSet);
>
> messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
>     public Void call(JavaPairRDD<String, String> tuple) {
>         JavaRDD<String> rdd = tuple.values();
>         rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
>         return null;
>     }
> });
>
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao 
> wrote:
>
>> What Spark version are you using, also a small code snippet of how you
>> use Spark Streaming would be greatly helpful.
>>
>> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V 
>> wrote:
>>
>>> I can able to read and print few lines. Afterthat i'm getting this
>>> exception. Any idea for this ?
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V 
>>> wrote:
>>>
 Hi,

 I'm trying to read from kafka stream and printing it textfile. I'm
 using java over spark. I dont know why i'm getting the following exception.
 Also exception message is very abstract.  can anyone please help me ?

 Log Trace :

 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job generator
 java.lang.NullPointerException
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
 at
 scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at
 scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
 at
 scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
 at
 scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
 at
 scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
 at
 org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
 at
 org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
 at org.apache.spark.streaming.scheduler.JobGenerator.org
 $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
 at
 org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
 at
 org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
 at
 org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
 exception: java.lang.NullPointerException
 java.lang.NullPointerException
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
 at
 scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at
 scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
 at
 scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
 at
 scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
 at
 scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
 at
 org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
 at
 org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
 at org.apache.spark.streaming.scheduler.JobGenerator.org
 

Re: Save data to different S3

2015-10-30 Thread William Li
I see. Thanks!

From: Steve Loughran
Date: Friday, October 30, 2015 at 12:03 PM
To: William Li
Cc: "Zhang, Jingyu", user
Subject: Re: Save data to different S3


On 30 Oct 2015, at 18:05, William Li wrote:

Thanks for your response. My secret has a slash (/) in it, so it didn't work...

That's a recurrent problem with the Hadoop/Java S3 clients. Keep regenerating
the secret until you get one that works.
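
For what it's worth, a sketch of one way to keep a "/" in the secret out of
the URI entirely (if that's where the credentials are being passed): set them
through the Hadoop configuration instead. The fs.s3n.* property names below
are the standard ones; whether this sidesteps the problem on your particular
Hadoop version is not guaranteed.

// Sketch: credentials via configuration instead of s3n://KEY:SECRET@bucket/... URIs.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
rdd.saveAsTextFile("s3n://some-bucket/some/path")   // bucket and path are placeholders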



Extending Spark ML LogisticRegression Object

2015-10-30 Thread njoshi
Hi,

I am extending the Spark ML package locally to include one of the specialized
models I need to try. In particular, I am trying to extend the
LogisticRegression model with one that takes a custom Weights object as
weights, and I am getting the following compilation error:


could not find implicit value for parameter space:
breeze.math.MutableInnerProductModule[breeze.linalg.DenseMatrix[Double],Double]
[error]   new BreezeLBFGS[BDM[Double]]($(maxIter), 10, $(tol))

While the LogisticRegressionModel object uses Breeze DenseVectors for the
weights, I also tried using a Breeze DenseMatrix and it generated the same
error as before.

I imagine I need to provide the MutableInnerProductModule implicit somehow,
but I'm not sure how and where. Could someone hint/help?
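
For what it's worth, a minimal sketch of where that implicit comes from (the
constructor shape is taken from the snippet above; whether Breeze ships a
ready-made instance for DenseMatrix[Double] is exactly the open question
here):

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import breeze.math.MutableInnerProductModule
import breeze.optimize.{LBFGS => BreezeLBFGS}

// Compiles: Breeze resolves MutableInnerProductModule[BDV[Double], Double]
// implicitly, which is why the stock dense-vector version builds fine.
val vecLbfgs = new BreezeLBFGS[BDV[Double]](100, 10, 1e-6)

// The failing case: an implicit MutableInnerProductModule[BDM[Double], Double]
// has to be in scope (imported or defined by hand) before this compiles.
// val matLbfgs = new BreezeLBFGS[BDM[Double]](100, 10, 1e-6)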

Thanks in advance,

Nikhil



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Extending-Spark-ML-LogisticRegression-Object-tp25241.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark tunning increase number of active tasks

2015-10-30 Thread YI, XIAOCHUAN

Hi,

Our team has a 40-node Hortonworks Hadoop 2.2.4.2-2 cluster (36 data nodes)
with Apache Spark 1.2 and 1.4 installed. Each node has 64G RAM and 8 cores.

We are only able to use <= 72 executors with executor-cores=2, so we only get
144 active tasks when running PySpark programs:
[Stage 1:===>(596 + 144) / 2042]
If we use a larger number for --num-executors, the PySpark program exits with
errors:
ERROR YarnScheduler: Lost executor 113 on hag017.example.com: remote Rpc client
disassociated

I tried Spark 1.4 and conf.set("dynamicAllocation.enabled", "true"). However,
it does not help us increase the number of active tasks.
I expect a larger number of active tasks given the cluster we have.
Could anyone advise on this? Thank you very much!
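
(For reference, a minimal sketch of the dynamic-allocation settings usually
involved here: in 1.4 the key needs the spark. prefix, and on YARN dynamic
allocation also requires the external shuffle service on the NodeManagers.
Scala shown, but the same keys apply from PySpark or spark-defaults.conf; the
executor numbers are placeholders, not tuned for this cluster.)

// Sketch: enabling dynamic allocation on Spark 1.4 / YARN.
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")    // note the spark. prefix
  .set("spark.shuffle.service.enabled", "true")      // external shuffle service required
  .set("spark.dynamicAllocation.minExecutors", "36")
  .set("spark.dynamicAllocation.maxExecutors", "144")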

Shaun



RE: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Silvio Fiorito
It's something due to the columnar compression. I've seen similar intermittent 
issues when caching DataFrames. "sportingpulse.com" is a value in one of the 
columns of the DF.
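
If you're stuck on 1.5.0, one thing that might be worth trying (an assumption
on my part, not a confirmed fix) is turning off the in-memory columnar
compression before caching, since that's the code path in the stack trace:

// Sketch: disable columnar compression for cached DataFrames (key exists in 1.5.x).
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")
df.cache()   // "df" stands in for whatever DataFrame is being cached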

From: Ted Yu
Sent: 10/30/2015 6:33 PM
To: Zhang, Jingyu
Cc: user
Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0

I searched for sportingpulse in *.scala and *.java files under 1.5 branch.
There was no hit.

mvn dependency doesn't show sportingpulse either.

Is it possible this is specific to EMR ?

Cheers

On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu wrote:

There is not a problem in Spark SQL 1.5.1 but the error of "key not found: 
sportingpulse.com" shown up when I use 1.5.0.

I have to use the version of 1.5.0 because that the one AWS EMR support.  Can 
anyone tell me why Spark uses "sportingpulse.com" 
and how to fix it?

Thanks.

Caused by: java.util.NoSuchElementException: key not found: 
sportingpulse.com

at scala.collection.MapLike$class.default(MapLike.scala:228)

at scala.collection.AbstractMap.default(Map.scala:58)

at scala.collection.mutable.HashMap.apply(HashMap.scala:64)

at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)

at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)

at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)

at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)

at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)


Re: foreachPartition

2015-10-30 Thread Mark Hamstra
The closure is sent to and executed on an Executor, so you need to be looking
at the stdout of the Executors, not at the Driver.
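
If the goal is just to see the values from the shell, a minimal sketch
(reusing the RDD `a` from the quoted snippet below; only sensible for small
RDDs):

// Runs on the executors: output lands in each executor's stdout (visible in
// the executor logs / web UI), not in the driver's shell.
a.foreachPartition(p => p.foreach(x => println(s"foo $x")))

// Runs on the driver: collect first, then print locally.
a.collect().foreach(println)

// Per-partition view, also printed on the driver.
a.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i -> ${part.mkString(",")}")
}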

On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> I'm just trying to do some operation inside foreachPartition, but I can't
> even get a simple println to work. Nothing gets printed.
>
> scala> val a = sc.parallelize(List(1,2,3))
> a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
> at :21
>
> scala> a.foreachPartition(p => println("foo"))
> 2015-10-30 23:38:54,643 INFO  [main] spark.SparkContext
> (Logging.scala:logInfo(59)) - Starting job: foreachPartition at :24
> 2015-10-30 23:38:54,644 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Got job 9
> (foreachPartition at :24) with 3 output partitions
> (allowLocal=false)
> 2015-10-30 23:38:54,644 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Final stage:
> ResultStage 9(foreachPartition at :24)
> 2015-10-30 23:38:54,645 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Parents of final
> stage: List()
> 2015-10-30 23:38:54,645 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Missing parents: List()
> 2015-10-30 23:38:54,646 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Submitting ResultStage
> 9 (ParallelCollectionRDD[2] at parallelize at :21), which has no
> missing parents
> 2015-10-30 23:38:54,648 INFO  [dag-scheduler-event-loop]
> storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(1224)
> called with curMem=14486, maxMem=280496701
> 2015-10-30 23:38:54,649 INFO  [dag-scheduler-event-loop]
> storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_9 stored
> as values in memory (estimated size 1224.0 B, free 267.5 MB)
> 2015-10-30 23:38:54,680 INFO  [dag-scheduler-event-loop]
> storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(871)
> called with curMem=15710, maxMem=280496701
> 2015-10-30 23:38:54,681 INFO  [dag-scheduler-event-loop]
> storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_9_piece0
> stored as bytes in memory (estimated size 871.0 B, free 267.5 MB)
> 2015-10-30 23:38:54,685 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
> (Logging.scala:logInfo(59)) - Added broadcast_9_piece0 in memory on
> 10.170.11.94:35814 (size: 871.0 B, free: 267.5 MB)
> 2015-10-30 23:38:54,688 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
> 10.170.11.94:35814 in memory (size: 2.0 KB, free: 267.5 MB)
> 2015-10-30 23:38:54,691 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
> ip-10-51-144-180.ec2.internal:35111 in memory (size: 2.0 KB, free: 535.0 MB)
> 2015-10-30 23:38:54,691 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
> ip-10-51-144-180.ec2.internal:49833 in memory (size: 2.0 KB, free: 535.0 MB)
> 2015-10-30 23:38:54,694 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
> ip-10-51-144-180.ec2.internal:34776 in memory (size: 2.0 KB, free: 535.0 MB)
> 2015-10-30 23:38:54,702 INFO  [dag-scheduler-event-loop]
> spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 9 from
> broadcast at DAGScheduler.scala:874
> 2015-10-30 23:38:54,703 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Submitting 3 missing
> tasks from ResultStage 9 (ParallelCollectionRDD[2] at parallelize at
> :21)
> 2015-10-30 23:38:54,703 INFO  [dag-scheduler-event-loop]
> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Adding task set 9.0
> with 3 tasks
> 2015-10-30 23:38:54,708 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
> (Logging.scala:logInfo(59)) - Starting task 0.0 in stage 9.0 (TID 27,
> ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
> 2015-10-30 23:38:54,711 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
> (Logging.scala:logInfo(59)) - Starting task 1.0 in stage 9.0 (TID 28,
> ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
> 2015-10-30 23:38:54,713 INFO
>  [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerInfo
> (Logging.scala:logInfo(59)) - Removed broadcast_5_piece0 on
> 10.170.11.94:35814 in memory (size: 802.0 B, free: 267.5 MB)
> 2015-10-30 23:38:54,714 INFO
>  [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
> (Logging.scala:logInfo(59)) - Starting task 2.0 in stage 9.0 (TID 29,
> ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
> 2015-10-30 23:38:54,716 

foreachPartition

2015-10-30 Thread Alex Nastetsky
I'm just trying to do some operation inside foreachPartition, but I can't
even get a simple println to work. Nothing gets printed.

scala> val a = sc.parallelize(List(1,2,3))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
at :21

scala> a.foreachPartition(p => println("foo"))
2015-10-30 23:38:54,643 INFO  [main] spark.SparkContext
(Logging.scala:logInfo(59)) - Starting job: foreachPartition at :24
2015-10-30 23:38:54,644 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Got job 9
(foreachPartition at :24) with 3 output partitions
(allowLocal=false)
2015-10-30 23:38:54,644 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Final stage:
ResultStage 9(foreachPartition at :24)
2015-10-30 23:38:54,645 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Parents of final
stage: List()
2015-10-30 23:38:54,645 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Missing parents: List()
2015-10-30 23:38:54,646 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Submitting ResultStage
9 (ParallelCollectionRDD[2] at parallelize at :21), which has no
missing parents
2015-10-30 23:38:54,648 INFO  [dag-scheduler-event-loop]
storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(1224)
called with curMem=14486, maxMem=280496701
2015-10-30 23:38:54,649 INFO  [dag-scheduler-event-loop]
storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_9 stored
as values in memory (estimated size 1224.0 B, free 267.5 MB)
2015-10-30 23:38:54,680 INFO  [dag-scheduler-event-loop]
storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(871)
called with curMem=15710, maxMem=280496701
2015-10-30 23:38:54,681 INFO  [dag-scheduler-event-loop]
storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_9_piece0
stored as bytes in memory (estimated size 871.0 B, free 267.5 MB)
2015-10-30 23:38:54,685 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
(Logging.scala:logInfo(59)) - Added broadcast_9_piece0 in memory on
10.170.11.94:35814 (size: 871.0 B, free: 267.5 MB)
2015-10-30 23:38:54,688 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
(Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
10.170.11.94:35814 in memory (size: 2.0 KB, free: 267.5 MB)
2015-10-30 23:38:54,691 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
(Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
ip-10-51-144-180.ec2.internal:35111 in memory (size: 2.0 KB, free: 535.0 MB)
2015-10-30 23:38:54,691 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
(Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
ip-10-51-144-180.ec2.internal:49833 in memory (size: 2.0 KB, free: 535.0 MB)
2015-10-30 23:38:54,694 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
(Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
ip-10-51-144-180.ec2.internal:34776 in memory (size: 2.0 KB, free: 535.0 MB)
2015-10-30 23:38:54,702 INFO  [dag-scheduler-event-loop] spark.SparkContext
(Logging.scala:logInfo(59)) - Created broadcast 9 from broadcast at
DAGScheduler.scala:874
2015-10-30 23:38:54,703 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Submitting 3 missing
tasks from ResultStage 9 (ParallelCollectionRDD[2] at parallelize at
:21)
2015-10-30 23:38:54,703 INFO  [dag-scheduler-event-loop]
cluster.YarnScheduler (Logging.scala:logInfo(59)) - Adding task set 9.0
with 3 tasks
2015-10-30 23:38:54,708 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
(Logging.scala:logInfo(59)) - Starting task 0.0 in stage 9.0 (TID 27,
ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
2015-10-30 23:38:54,711 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
(Logging.scala:logInfo(59)) - Starting task 1.0 in stage 9.0 (TID 28,
ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
2015-10-30 23:38:54,713 INFO  [sparkDriver-akka.actor.default-dispatcher-2]
storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Removed
broadcast_5_piece0 on 10.170.11.94:35814 in memory (size: 802.0 B, free:
267.5 MB)
2015-10-30 23:38:54,714 INFO
 [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
(Logging.scala:logInfo(59)) - Starting task 2.0 in stage 9.0 (TID 29,
ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
2015-10-30 23:38:54,716 INFO  [sparkDriver-akka.actor.default-dispatcher-2]
storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Removed
broadcast_5_piece0 on ip-10-51-144-180.ec2.internal:34776 in memory (size:
802.0 B, free: 535.0 MB)
2015-10-30 23:38:54,719 INFO  [sparkDriver-akka.actor.default-dispatcher-2]
storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Removed
broadcast_5_piece0 on 

key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Zhang, Jingyu
There is no problem in Spark SQL 1.5.1, but the error "key not found:
sportingpulse.com" shows up when I use 1.5.0.

I have to use version 1.5.0 because that is the one AWS EMR supports.
Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it?

Thanks.

Caused by: java.util.NoSuchElementException: key not found:
sportingpulse.com

at scala.collection.MapLike$class.default(MapLike.scala:228)

at scala.collection.AbstractMap.default(Map.scala:58)

at scala.collection.mutable.HashMap.apply(HashMap.scala:64)

at
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(
compressionSchemes.scala:258)

at
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(
CompressibleColumnBuilder.scala:110)

at org.apache.spark.sql.columnar.NativeColumnBuilder.build(
ColumnBuilder.scala:87)

at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
InMemoryColumnarTableScan.scala:152)

at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
InMemoryColumnarTableScan.scala:152)

at scala.collection.TraversableLike$$anonfun$map$1.apply(
TraversableLike.scala:244)

at scala.collection.TraversableLike$$anonfun$map$1.apply(
TraversableLike.scala:244)

at scala.collection.IndexedSeqOptimized$class.foreach(
IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
InMemoryColumnarTableScan.scala:152)

at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
InMemoryColumnarTableScan.scala:120)

at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)

at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(
MapPartitionsWithPreparationRDD.scala:63)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73
)

at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41
)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)



Re: Stack overflow error caused by long lineage RDD created after many recursions

2015-10-30 Thread Tathagata Das
You have to run some action after rdd.checkpoint() for the checkpointing
to actually occur. Have you done that?
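
A minimal sketch of that pattern, written against the loop quoted below (the
checkpoint directory and the choice of count() as the action are assumptions,
not taken from the original code):

// checkpoint() only marks the RDD; an action is what actually writes the
// checkpoint files and truncates the lineage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // assumed location

var current = oldRDD
var keepGoing = true
while (keepGoing) {
  val next = Transformation(eRDD, current)   // Transformation/Compare as in the snippet below
  next.cache()
  next.checkpoint()
  next.count()          // forces materialization, so the checkpoint is written here
  keepGoing = Compare(current, next)
  current = next
}
// "current" now plays the role of the recursive Algorithm's return value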


On Fri, Oct 30, 2015 at 3:10 PM, Panos Str  wrote:

> Hi all!
>
> Here's a part of a Scala recursion that produces a stack overflow after
> many
> recursions. I've tried many things but I've not managed to solve it.
>
> val eRDD: RDD[(Int,Int)] = ...
>
> val oldRDD: RDD[Int,Int]= ...
>
> val result = *Algorithm*(eRDD,oldRDD)
>
>
> *Algorithm*(eRDD: RDD[(Int,Int)] , oldRDD: RDD[(Int,Int)]) :
> RDD[(Int,Int)]{
>
> val newRDD = *Transformation*(eRDD,oldRDD)//only transformations
>
> if(*Compare*(oldRDD,newRDD)) //Compare has the "take" action!!
>
>   return *Algorithm*(eRDD,newRDD)
>
> else
>
>  return newRDD
> }
>
> The above code is recursive and performs many iterations (until the compare
> returns false)
>
> After some iterations I get a stack overflow error. Probably the lineage
> chain has become too long. Is there any way to solve this problem?
> (persist/unpersist, checkpoint, sc.saveAsObjectFile).
>
> Note1: Only compare function performs Actions on RDDs
>
> Note2: I tried some combinations of persist/unpersist but none of them
> worked!
>
> I tried checkpointing from spark.streaming. I put a checkpoint at every
> recursion but still received an overflow error
>
> I also tried using sc.saveAsObjectFile per iteration and then reading from
> file (sc.objectFile) during the next iteration. Unfortunately I noticed
> that
> the folders are created per iteration are increasing while I was expecting
> from them to have equal size per iteration.
>
> please help!!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Stack-overflow-error-caused-by-long-lineage-RDD-created-after-many-recursions-tp25240.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Zhang, Jingyu
Thanks Silvio and Ted,

Can you please let me know how to fix this intermittent issue? Should I
wait for EMR to upgrade to Spark 1.5.1, or change my code from
DataFrames to plain Spark map-reduce?

Regards,

Jingyu

On 31 October 2015 at 09:40, Silvio Fiorito 
wrote:

> It's something due to the columnar compression. I've seen similar
> intermittent issues when caching DataFrames. "sportingpulse.com" is a
> value in one of the columns of the DF.
> --
> From: Ted Yu 
> Sent: ‎10/‎30/‎2015 6:33 PM
> To: Zhang, Jingyu 
> Cc: user 
> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>
> I searched for sportingpulse in *.scala and *.java files under 1.5 branch.
> There was no hit.
>
> mvn dependency doesn't show sportingpulse either.
>
> Is it possible this is specific to EMR ?
>
> Cheers
>
> On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu 
> wrote:
>
>> There is not a problem in Spark SQL 1.5.1 but the error of "key not
>> found: sportingpulse.com" shown up when I use 1.5.0.
>>
>> I have to use the version of 1.5.0 because that the one AWS EMR support.
>> Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it?
>>
>> Thanks.
>>
>> Caused by: java.util.NoSuchElementException: key not found:
>> sportingpulse.com
>>
>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>>
>> at scala.collection.AbstractMap.default(Map.scala:58)
>>
>> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>>
>> at
>> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(
>> compressionSchemes.scala:258)
>>
>> at
>> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(
>> CompressibleColumnBuilder.scala:110)
>>
>> at org.apache.spark.sql.columnar.NativeColumnBuilder.build(
>> ColumnBuilder.scala:87)
>>
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
>> InMemoryColumnarTableScan.scala:152)
>>
>> at
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
>> InMemoryColumnarTableScan.scala:152)
>>
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>>
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>>
>> at scala.collection.IndexedSeqOptimized$class.foreach(
>> IndexedSeqOptimized.scala:33)
>>
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>
>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>> InMemoryColumnarTableScan.scala:152)
>>
>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>> InMemoryColumnarTableScan.scala:120)
>>
>> at org.apache.spark.storage.MemoryStore.unrollSafely(
>> MemoryStore.scala:278)
>>
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171
>> )
>>
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(
>> MapPartitionsWithPreparationRDD.scala:63)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
>> ShuffleMapTask.scala:73)
>>
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
>> ShuffleMapTask.scala:41)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1142)
>>
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:617)
>>

Re: foreachPartition

2015-10-30 Thread Alex Nastetsky
Ahh, makes sense. Knew it was going to be something simple. Thanks.

On Fri, Oct 30, 2015 at 7:45 PM, Mark Hamstra 
wrote:

> The closure is sent to and executed an Executor, so you need to be looking
> at the stdout of the Executors, not on the Driver.
>
> On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> I'm just trying to do some operation inside foreachPartition, but I can't
>> even get a simple println to work. Nothing gets printed.
>>
>> scala> val a = sc.parallelize(List(1,2,3))
>> a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at
>> parallelize at :21
>>
>> scala> a.foreachPartition(p => println("foo"))
>> 2015-10-30 23:38:54,643 INFO  [main] spark.SparkContext
>> (Logging.scala:logInfo(59)) - Starting job: foreachPartition at :24
>> 2015-10-30 23:38:54,644 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Got job 9
>> (foreachPartition at :24) with 3 output partitions
>> (allowLocal=false)
>> 2015-10-30 23:38:54,644 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Final stage:
>> ResultStage 9(foreachPartition at :24)
>> 2015-10-30 23:38:54,645 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Parents of final
>> stage: List()
>> 2015-10-30 23:38:54,645 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Missing parents: List()
>> 2015-10-30 23:38:54,646 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Submitting ResultStage
>> 9 (ParallelCollectionRDD[2] at parallelize at :21), which has no
>> missing parents
>> 2015-10-30 23:38:54,648 INFO  [dag-scheduler-event-loop]
>> storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(1224)
>> called with curMem=14486, maxMem=280496701
>> 2015-10-30 23:38:54,649 INFO  [dag-scheduler-event-loop]
>> storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_9 stored
>> as values in memory (estimated size 1224.0 B, free 267.5 MB)
>> 2015-10-30 23:38:54,680 INFO  [dag-scheduler-event-loop]
>> storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(871)
>> called with curMem=15710, maxMem=280496701
>> 2015-10-30 23:38:54,681 INFO  [dag-scheduler-event-loop]
>> storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_9_piece0
>> stored as bytes in memory (estimated size 871.0 B, free 267.5 MB)
>> 2015-10-30 23:38:54,685 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
>> (Logging.scala:logInfo(59)) - Added broadcast_9_piece0 in memory on
>> 10.170.11.94:35814 (size: 871.0 B, free: 267.5 MB)
>> 2015-10-30 23:38:54,688 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
>> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
>> 10.170.11.94:35814 in memory (size: 2.0 KB, free: 267.5 MB)
>> 2015-10-30 23:38:54,691 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
>> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
>> ip-10-51-144-180.ec2.internal:35111 in memory (size: 2.0 KB, free: 535.0 MB)
>> 2015-10-30 23:38:54,691 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
>> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
>> ip-10-51-144-180.ec2.internal:49833 in memory (size: 2.0 KB, free: 535.0 MB)
>> 2015-10-30 23:38:54,694 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] storage.BlockManagerInfo
>> (Logging.scala:logInfo(59)) - Removed broadcast_4_piece0 on
>> ip-10-51-144-180.ec2.internal:34776 in memory (size: 2.0 KB, free: 535.0 MB)
>> 2015-10-30 23:38:54,702 INFO  [dag-scheduler-event-loop]
>> spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 9 from
>> broadcast at DAGScheduler.scala:874
>> 2015-10-30 23:38:54,703 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Submitting 3 missing
>> tasks from ResultStage 9 (ParallelCollectionRDD[2] at parallelize at
>> :21)
>> 2015-10-30 23:38:54,703 INFO  [dag-scheduler-event-loop]
>> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Adding task set 9.0
>> with 3 tasks
>> 2015-10-30 23:38:54,708 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
>> (Logging.scala:logInfo(59)) - Starting task 0.0 in stage 9.0 (TID 27,
>> ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
>> 2015-10-30 23:38:54,711 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager
>> (Logging.scala:logInfo(59)) - Starting task 1.0 in stage 9.0 (TID 28,
>> ip-10-51-144-180.ec2.internal, PROCESS_LOCAL, 1313 bytes)
>> 2015-10-30 23:38:54,713 INFO
>>  [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerInfo
>> (Logging.scala:logInfo(59)) - Removed broadcast_5_piece0 on
>> 10.170.11.94:35814 in memory (size: 802.0 B, free: 267.5 MB)
>> 2015-10-30 

Using model saved by MLlib with out creating spark context

2015-10-30 Thread vijuks
I want to load a model saved by a spark machine learning job, in a web
application. 

model.save(jsc.sc(), "myModelPath");

LogisticRegressionModel model =
    LogisticRegressionModel.load(jsc.sc(), "myModelPath");

When I do that, I need to pass a Spark context to load the model. The model is
small and can be saved to the local file system, so is there any way to use it
without a Spark context? Creating a Spark context looks like an expensive step
that starts an HTTP server listening on multiple ports.

15/10/30 10:53:39 INFO HttpServer: Starting HTTP Server
15/10/30 10:53:39 INFO Utils: Successfully started service 'HTTP file
server' on port 63341.
15/10/30 10:53:39 INFO SparkEnv: Registering OutputCommitCoordinator
15/10/30 10:53:40 INFO Utils: Successfully started service 'SparkUI' on port
4040.
15/10/30 10:53:40 INFO SparkUI: Started SparkUI at http://171.142.49.18:4040
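
One workaround, sketched below under the assumption that a plain
LogisticRegressionModel is enough: export the weights and intercept once
(where a SparkContext exists) and score in the web app without Spark at all.
The file path and method names here are made up for illustration.

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel

// One-time export, run wherever a SparkContext (and thus the loaded model) exists.
def exportCoefficients(model: LogisticRegressionModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  out.writeObject((model.weights.toArray, model.intercept))
  out.close()
}

// In the web application: no SparkContext, just the coefficients and the logistic link.
def score(path: String, features: Array[Double]): Double = {
  val in = new ObjectInputStream(new FileInputStream(path))
  val (weights, intercept) = in.readObject().asInstanceOf[(Array[Double], Double)]
  in.close()
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  1.0 / (1.0 + math.exp(-margin))   // probability of the positive class
}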








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-model-saved-by-MLlib-with-out-creating-spark-context-tp25239.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
I just did a local test with your code and everything seems fine; the only
difference is that I used the master branch, but I don't think it changes much
in this part. Did you hit any other exceptions or errors besides this one?
Probably this is due to other exceptions that make the system unstable.

On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V  wrote:

> No, i dont have any special settings. if i keep only reading line in my
> code, it's throwing NPE.
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao 
> wrote:
>
>> Do you have any special settings, from your code, I don't think it will
>> incur NPE at that place.
>>
>> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V 
>> wrote:
>>
>>> spark version - spark 1.4.1
>>>
>>> my code snippet:
>>>
>>> String brokers = "ip:port,ip:port";
>>> String topics = "x,y,z";
>>> HashSet TopicsSet = new
>>> HashSet(Arrays.asList(topics.split(",")));
>>> HashMap kafkaParams = new HashMap();
>>> kafkaParams.put("metadata.broker.list", brokers);
>>>
>>> JavaPairInputDStream messages =
>>> KafkaUtils.createDirectStream(
>>>jssc,
>>>String.class,
>>>String.class,
>>>StringDecoder.class,
>>>StringDecoder.class,
>>>kafkaParams,
>>> TopicsSet
>>>);
>>>
>>> messages.foreachRDD(new Function () {
>>> public Void call(JavaPairRDD tuple) {
>>> JavaRDDrdd = tuple.values();
>>>
>>> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
>>> return null;
>>> }
>>>});
>>>
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao 
>>> wrote:
>>>
 What Spark version are you using, also a small code snippet of how you
 use Spark Streaming would be greatly helpful.

 On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V 
 wrote:

> I can able to read and print few lines. Afterthat i'm getting this
> exception. Any idea for this ?
>
> *Thanks*,
> 
>
>
> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V 
> wrote:
>
>> Hi,
>>
>> I'm trying to read from kafka stream and printing it textfile. I'm
>> using java over spark. I dont know why i'm getting the following 
>> exception.
>> Also exception message is very abstract.  can anyone please help me ?
>>
>> Log Trace :
>>
>> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job generator
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> at
>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>> at
>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>> at
>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>> at
>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>> at
>> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
>> exception: java.lang.NullPointerException
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>> at
>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
I don't think Spark Streaming supports multiple streaming contexts in one JVM,
so you cannot use it that way. Instead you could run multiple streaming
applications, since you're using YARN.

On Friday, October 30, 2015, Ramkumar V  wrote:

> I found NPE is mainly because of im using the same JavaStreamingContext
> for some other kafka stream. if i change the name , its running
> successfully. how to run multiple JavaStreamingContext in a program ?  I'm
> getting following exception if i run multiple JavaStreamingContext in
> single file.
>
> 15/10/30 11:04:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception:
> java.lang.IllegalStateException: Only one StreamingContext may be started
> in this JVM. Currently running StreamingContext was started
> atorg.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:622)
>
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao  > wrote:
>
>> From the code, I think this field "rememberDuration" shouldn't be null,
>> it will be verified at the start, unless some place changes it's value in
>> the runtime that makes it null, but I cannot image how this happened. Maybe
>> you could add some logs around the place where exception happens if you
>> could reproduce it.
>>
>> On Fri, Oct 30, 2015 at 5:31 PM, Ramkumar V > > wrote:
>>
>>> No. this is the only exception that im getting multiple times in my log.
>>> Also i was reading some other topics earlier but im not faced this NPE.
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao >> > wrote:
>>>
 I just did a local test with your code, seems everything is fine, the
 only difference is that I use the master branch, but I don't think it
 changes a lot in this part. Do you met any other exceptions or errors
 beside this one? Probably this is due to other exceptions that makes this
 system unstable.

 On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V > wrote:

> No, i dont have any special settings. if i keep only reading line in
> my code, it's throwing NPE.
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao  > wrote:
>
>> Do you have any special settings, from your code, I don't think it
>> will incur NPE at that place.
>>
>> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V > > wrote:
>>
>>> spark version - spark 1.4.1
>>>
>>> my code snippet:
>>>
>>> String brokers = "ip:port,ip:port";
>>> String topics = "x,y,z";
>>> HashSet TopicsSet = new
>>> HashSet(Arrays.asList(topics.split(",")));
>>> HashMap kafkaParams = new HashMap();
>>> kafkaParams.put("metadata.broker.list", brokers);
>>>
>>> JavaPairInputDStream messages =
>>> KafkaUtils.createDirectStream(
>>>jssc,
>>>String.class,
>>>String.class,
>>>StringDecoder.class,
>>>StringDecoder.class,
>>>kafkaParams,
>>> TopicsSet
>>>);
>>>
>>> messages.foreachRDD(new Function
>>> () {
>>> public Void call(JavaPairRDD tuple) {
>>> JavaRDDrdd = tuple.values();
>>>
>>> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
>>> return null;
>>> }
>>>});
>>>
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao >> > wrote:
>>>
 What Spark version are you using, also a small code snippet of how
 you use Spark Streaming would be greatly helpful.

 On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V <
 ramkumar.c...@gmail.com
 > wrote:

> I can able to read and print few lines. Afterthat i'm getting this
> exception. Any idea for this ?
>
> *Thanks*,
> 
>
>
> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V <
> 

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Adrian Tanase
It's actually a bit tougher, as you'll first need all the years. Also, I'm not sure
how you would represent your “columns” given they are dynamic based on the input
data.

Depending on your downstream processing, I’d probably try to emulate it with a 
hash map with years as keys instead of the columns.

There is probably a nicer solution using the DataFrames API but I'm not
familiar with it.

If you actually need vectors, I think this article I saw recently on the Databricks
blog will highlight some options (look for gather encoder):
https://databricks.com/blog/2015/10/20/audience-modeling-with-spark-ml-pipelines.html

-adrian
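
A minimal sketch of the hash-map idea, using the (name, year, value) rows from
the question quoted below; years missing for a client simply come out as empty
cells:

// Each client ends up with a Map(year -> value); the set of years is computed
// from the data itself, then used to emit the columns in a fixed order.
val rows = sc.parallelize(Seq(
  ("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
  ("B", 2015, 24), ("B", 2013, 4)))

val years = rows.map(_._2).distinct().collect().sorted.reverse   // 2015, 2014, 2013

val pivoted = rows
  .map { case (name, year, value) => (name, Map(year -> value)) }
  .reduceByKey(_ ++ _)                                            // merge the per-client maps
  .map { case (name, byYear) =>
    (name +: years.map(y => byYear.get(y).map(_.toString).getOrElse(""))).mkString(",")
  }

pivoted.collect().foreach(println)   // A,4,12,1  and  B,24,,4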

From: Deng Ching-Mallete
Date: Friday, October 30, 2015 at 4:35 AM
To: Ascot Moss
Cc: User
Subject: Re: Pivot Data in Spark and Scala

Hi,

You could transform it into a pair RDD then use the combineByKey function.

HTH,
Deng

On Thu, Oct 29, 2015 at 7:29 PM, Ascot Moss 
> wrote:
Hi,

I have data as follows:

A, 2015, 4
A, 2014, 12
A, 2013, 1
B, 2015, 24
B, 2013 4


I need to convert the data to a new format:
A ,4,12,1
B,   24,,4

Any idea how to make it in Spark Scala?

Thanks




Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
No, this is the only exception that I'm getting (multiple times) in my log.
Also, I was reading some other topics earlier and didn't face this NPE.

*Thanks*,



On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao  wrote:

> I just did a local test with your code, seems everything is fine, the only
> difference is that I use the master branch, but I don't think it changes a
> lot in this part. Do you met any other exceptions or errors beside this
> one? Probably this is due to other exceptions that makes this system
> unstable.
>
> On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V 
> wrote:
>
>> No, i dont have any special settings. if i keep only reading line in my
>> code, it's throwing NPE.
>>
>> *Thanks*,
>> 
>>
>>
>> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao 
>> wrote:
>>
>>> Do you have any special settings, from your code, I don't think it will
>>> incur NPE at that place.
>>>
>>> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V 
>>> wrote:
>>>
 spark version - spark 1.4.1

 my code snippet:

 String brokers = "ip:port,ip:port";
 String topics = "x,y,z";
 HashSet TopicsSet = new
 HashSet(Arrays.asList(topics.split(",")));
 HashMap kafkaParams = new HashMap();
 kafkaParams.put("metadata.broker.list", brokers);

 JavaPairInputDStream messages =
 KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
 TopicsSet
);

 messages.foreachRDD(new Function () {
 public Void call(JavaPairRDD tuple) {
 JavaRDDrdd = tuple.values();

 rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
 return null;
 }
});


 *Thanks*,
 


 On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao 
 wrote:

> What Spark version are you using, also a small code snippet of how you
> use Spark Streaming would be greatly helpful.
>
> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V 
> wrote:
>
>> I can able to read and print few lines. Afterthat i'm getting this
>> exception. Any idea for this ?
>>
>> *Thanks*,
>> 
>>
>>
>> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read from kafka stream and printing it textfile. I'm
>>> using java over spark. I dont know why i'm getting the following 
>>> exception.
>>> Also exception message is very abstract.  can anyone please help me ?
>>>
>>> Log Trace :
>>>
>>> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job
>>> generator
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
>>> at
>>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>>> at
>>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>>> at
>>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>>> at
>>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>>> at
>>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
>>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>>> at
>>> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
>>> exception: java.lang.NullPointerException
>>> 

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Jörn Franke
What Storage Format?



> On 30 Oct 2015, at 12:05, Rex Xiong  wrote:
> 
> Hi folks,
> 
> I have a Hive external table with partitions.
> Every day, an App will generate a new partition day=-MM-dd stored by 
> parquet and run add-partition Hive command.
> In some cases, we will add additional column to new partitions and update 
> Hive table schema, then a query across new and old partitions will fail with 
> exception:
> 
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: cannot resolve 'newcolumn' given 
> input columns 
> 
> We have tried schema merging feature, but it's too slow, there're hundreds of 
> partitions.
> Is it possible to bypass this schema check and return a default value (such 
> as null) for missing columns?
> 
> Thank you


Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
From the code, I think this field "rememberDuration" shouldn't be null; it is
verified at start, unless some place changes its value at runtime and makes it
null, but I cannot imagine how that happened. Maybe you could add some logs
around the place where the exception happens, if you can reproduce it.

On Fri, Oct 30, 2015 at 5:31 PM, Ramkumar V  wrote:

> No. this is the only exception that im getting multiple times in my log.
> Also i was reading some other topics earlier but im not faced this NPE.
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao 
> wrote:
>
>> I just did a local test with your code, seems everything is fine, the
>> only difference is that I use the master branch, but I don't think it
>> changes a lot in this part. Do you met any other exceptions or errors
>> beside this one? Probably this is due to other exceptions that makes this
>> system unstable.
>>
>> On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V 
>> wrote:
>>
>>> No, i dont have any special settings. if i keep only reading line in my
>>> code, it's throwing NPE.
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao 
>>> wrote:
>>>
 Do you have any special settings, from your code, I don't think it will
 incur NPE at that place.

 On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V 
 wrote:

> spark version - spark 1.4.1
>
> my code snippet:
>
> String brokers = "ip:port,ip:port";
> String topics = "x,y,z";
> HashSet TopicsSet = new
> HashSet(Arrays.asList(topics.split(",")));
> HashMap kafkaParams = new HashMap();
> kafkaParams.put("metadata.broker.list", brokers);
>
> JavaPairInputDStream messages =
> KafkaUtils.createDirectStream(
>jssc,
>String.class,
>String.class,
>StringDecoder.class,
>StringDecoder.class,
>kafkaParams,
> TopicsSet
>);
>
> messages.foreachRDD(new Function ()
> {
> public Void call(JavaPairRDD tuple) {
> JavaRDDrdd = tuple.values();
>
> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
> return null;
> }
>});
>
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao 
> wrote:
>
>> What Spark version are you using, also a small code snippet of how
>> you use Spark Streaming would be greatly helpful.
>>
>> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V 
>> wrote:
>>
>>> I can able to read and print few lines. Afterthat i'm getting this
>>> exception. Any idea for this ?
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V >> > wrote:
>>>
 Hi,

 I'm trying to read from kafka stream and printing it textfile. I'm
 using java over spark. I dont know why i'm getting the following 
 exception.
 Also exception message is very abstract.  can anyone please help me ?

 Log Trace :

 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job
 generator
 java.lang.NullPointerException
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
 at
 scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at
 scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
 at
 scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
 at
 scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
 at
 scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
 at
 org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
 at
 org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
 at org.apache.spark.streaming.scheduler.JobGenerator.org
 

Best practises

2015-10-30 Thread Deepak Sharma
Hi
I am looking for any blog/doc on developer best practices when using Spark. I
have already looked at the tuning guide on spark.apache.org.
Please let me know if anyone is aware of any such resource.

Thanks
Deepak


Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
I found the NPE is mainly because I'm using the same JavaStreamingContext for
some other Kafka stream; if I change the name, it runs successfully. How do I
run multiple JavaStreamingContexts in a program? I'm getting the following
exception if I run multiple JavaStreamingContexts in a single file.

15/10/30 11:04:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
java.lang.IllegalStateException: Only one StreamingContext may be started
in this JVM. Currently running StreamingContext was started
atorg.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:622)


*Thanks*,



On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao  wrote:

> From the code, I think this field "rememberDuration" shouldn't be null,
> it will be verified at the start, unless some place changes it's value in
> the runtime that makes it null, but I cannot image how this happened. Maybe
> you could add some logs around the place where exception happens if you
> could reproduce it.
>
> On Fri, Oct 30, 2015 at 5:31 PM, Ramkumar V 
> wrote:
>
>> No. this is the only exception that im getting multiple times in my log.
>> Also i was reading some other topics earlier but im not faced this NPE.
>>
>> *Thanks*,
>> 
>>
>>
>> On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao 
>> wrote:
>>
>>> I just did a local test with your code, seems everything is fine, the
>>> only difference is that I use the master branch, but I don't think it
>>> changes a lot in this part. Do you met any other exceptions or errors
>>> beside this one? Probably this is due to other exceptions that makes this
>>> system unstable.
>>>
>>> On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V 
>>> wrote:
>>>
 No, i dont have any special settings. if i keep only reading line in my
 code, it's throwing NPE.

 *Thanks*,
 


 On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao 
 wrote:

> Do you have any special settings, from your code, I don't think it
> will incur NPE at that place.
>
> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V 
> wrote:
>
>> spark version - spark 1.4.1
>>
>> my code snippet:
>>
>> String brokers = "ip:port,ip:port";
>> String topics = "x,y,z";
>> HashSet TopicsSet = new
>> HashSet(Arrays.asList(topics.split(",")));
>> HashMap kafkaParams = new HashMap();
>> kafkaParams.put("metadata.broker.list", brokers);
>>
>> JavaPairInputDStream messages =
>> KafkaUtils.createDirectStream(
>>jssc,
>>String.class,
>>String.class,
>>StringDecoder.class,
>>StringDecoder.class,
>>kafkaParams,
>> TopicsSet
>>);
>>
>> messages.foreachRDD(new Function
>> () {
>> public Void call(JavaPairRDD tuple) {
>> JavaRDDrdd = tuple.values();
>>
>> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
>> return null;
>> }
>>});
>>
>>
>> *Thanks*,
>> 
>>
>>
>> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao 
>> wrote:
>>
>>> What Spark version are you using, also a small code snippet of how
>>> you use Spark Streaming would be greatly helpful.
>>>
>>> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V >> > wrote:
>>>
 I can able to read and print few lines. Afterthat i'm getting this
 exception. Any idea for this ?

 *Thanks*,
 


 On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V <
 ramkumar.c...@gmail.com> wrote:

> Hi,
>
> I'm trying to read from kafka stream and printing it textfile. I'm
> using java over spark. I dont know why i'm getting the following 
> exception.
> Also exception message is very abstract.  can anyone please help me ?
>
> Log Trace :
>
> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job
> generator
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> 

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-30 Thread Deng Ching-Mallete
Hi,

You seem to be creating a new RDD for each element in your files RDD. What
I would suggest is to load and process only one sequence file in your Spark
job, then just execute multiple spark jobs to process each sequence file.

With regard to your question of where to view the logs inside the closures,
you should be able to see them in the executor log (via the Spark UI, in
the Executors page).

HTH,
Deng

On Fri, Oct 30, 2015 at 1:20 PM, Vinoth Sankar  wrote:

> Hi Adrian,
>
> Yes. I need to load all files and process it in parallel. Following code
> doesn't seem working(Here I used map, even tried foreach) ,I just
> downloading the files from HDFS to local system and printing the logs count
> in each file. Its not throwing any Exceptions,but not working. Files are
> not getting downloaded. I even didn't get that LOGGER print. Same code
> works if I iterate the files, but its not Parallelized. How do I get my
> code Parallelized and Working.
>
> JavaRDD files = sparkContext.parallelize(fileList);
>
> files.map(new Function()
> {
> public static final long serialVersionUID = 1L;
>
> @Override
> public Void call(String hdfsPath) throws Exception
> {
> JavaPairRDD hdfsContent =
> sc.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
> JavaRDD logs = hdfsContent.map(new Function BytesWritable>, Message>()
> {
> public Message call(Tuple2 tuple2) throws
> Exception
> {
> BytesWritable value = tuple2._2();
> BytesWritable tmp = new BytesWritable();
> tmp.setCapacity(value.getLength());
> tmp.set(value);
> return (Message) getProtos(1, tmp.getBytes());
> }
> });
>
> String path = "/home/sas/logsfilterapp/temp/" + hdfsPath.substring(26);
>
> Thread.sleep(2000);
> logs.saveAsObjectFile(path);
>
> LOGGER.log(Level.INFO, "HDFS Path : {0} Count : {1}", new Object[] {
> hdfsPath, logs.count() });
> return null;
> }
> });
>
>
>
> Note : In another scenario also i didn't get the logs which are present
> inside map,filter closures. But logs outside these closures are getting
> printed as usual. If i can't get the logger prints inside these closures
> how do i debug them ?
>
> Thanks
> Vinoth Sankar
>
> On Wed, Oct 28, 2015 at 8:29 PM Adrian Tanase  wrote:
>
>> The first line is distributing your fileList variable in the cluster as a
>> RDD, partitioned using the default partitioner settings (e.g. Number of
>> cores in your cluster).
>>
>> Each of your workers would one or more slices of data (depending on how
>> many cores each executor has) and the abstraction is called partition.
>>
>> What is your use case? If you want to load the files and continue
>> processing in parallel, then a simple .map should work.
>> If you want to execute arbitrary code based on the list of files that
>> each executor received, then you need to use .foreach that will get
>> executed for each of the entries, on the worker.
>>
>> -adrian
>>
>> From: Vinoth Sankar
>> Date: Wednesday, October 28, 2015 at 2:49 PM
>> To: "user@spark.apache.org"
>> Subject: How do I parallize Spark Jobs at Executor Level.
>>
>> Hi,
>>
>> I'm reading and filtering large no of files using Spark. It's getting
>> parallized at Spark Driver level only. How do i make it parallelize to
>> Executor(Worker) Level. Refer the following sample. Is there any way to
>> paralleling iterate the localIterator ?
>>
>> Note : I use Java 1.7 version
>>
>> JavaRDD files = javaSparkContext.parallelize(fileList)
>> Iterator localIterator = files.toLocalIterator();
>>
>> Regards
>> Vinoth Sankar
>>
>


Re: How do I parallize Spark Jobs at Executor Level.

2015-10-30 Thread Deng Ching-Mallete
Yes, it's also possible. Just pass in the sequence files you want to
process as a comma-separated list in sc.sequenceFile().

-Deng
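
A minimal sketch of that suggestion in Scala; the paths and the per-record
handling are placeholders, since the real decoding (getProtos) lives in the
original job:

import org.apache.hadoop.io.{BytesWritable, IntWritable}

// Placeholder list of HDFS paths; in the real job this comes from fileList.
val paths = Seq("hdfs:///logs/part-0001.seq", "hdfs:///logs/part-0002.seq").mkString(",")

// One distributed load over all files: no driver-side loop, no RDD-per-file.
val hdfsContent = sc.sequenceFile(paths, classOf[IntWritable], classOf[BytesWritable])

// Per-record decoding happens on the executors; copyBytes() stands in for the
// original getProtos(...) call.
val decoded = hdfsContent.map { case (_, value) => value.copyBytes() }

decoded.saveAsObjectFile("hdfs:///tmp/logsfilterapp/combined")   // output path is an assumption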

On Fri, Oct 30, 2015 at 5:46 PM, Vinoth Sankar  wrote:

> Hi Deng.
>
> Thanks for the response.
>
> Is it possible to load sequence files parallely and process each of it in
> parallel...?
>
>
> Regards
> Vinoth Sankar
>
> On Fri, Oct 30, 2015 at 2:56 PM Deng Ching-Mallete 
> wrote:
>
>> Hi,
>>
>> You seem to be creating a new RDD for each element in your files RDD.
>> What I would suggest is to load and process only one sequence file in your
>> Spark job, then just execute multiple spark jobs to process each sequence
>> file.
>>
>> With regard to your question of where to view the logs inside the
>> closures, you should be able to see them in the executor log (via the Spark
>> UI, in the Executors page).
>>
>> HTH,
>> Deng
>>
>>
>> On Fri, Oct 30, 2015 at 1:20 PM, Vinoth Sankar 
>> wrote:
>>
>>> Hi Adrian,
>>>
>>> Yes. I need to load all files and process it in parallel. Following code
>>> doesn't seem working(Here I used map, even tried foreach) ,I just
>>> downloading the files from HDFS to local system and printing the logs count
>>> in each file. Its not throwing any Exceptions,but not working. Files are
>>> not getting downloaded. I even didn't get that LOGGER print. Same code
>>> works if I iterate the files, but its not Parallelized. How do I get my
>>> code Parallelized and Working.
>>>
>>> JavaRDD files = sparkContext.parallelize(fileList);
>>>
>>> files.map(new Function()
>>> {
>>> public static final long serialVersionUID = 1L;
>>>
>>> @Override
>>> public Void call(String hdfsPath) throws Exception
>>> {
>>> JavaPairRDD hdfsContent =
>>> sc.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
>>> JavaRDD logs = hdfsContent.map(new Function>> BytesWritable>, Message>()
>>> {
>>> public Message call(Tuple2 tuple2) throws
>>> Exception
>>> {
>>> BytesWritable value = tuple2._2();
>>> BytesWritable tmp = new BytesWritable();
>>> tmp.setCapacity(value.getLength());
>>> tmp.set(value);
>>> return (Message) getProtos(1, tmp.getBytes());
>>> }
>>> });
>>>
>>> String path = "/home/sas/logsfilterapp/temp/" + hdfsPath.substring(26);
>>>
>>> Thread.sleep(2000);
>>> logs.saveAsObjectFile(path);
>>>
>>> LOGGER.log(Level.INFO, "HDFS Path : {0} Count : {1}", new Object[] {
>>> hdfsPath, logs.count() });
>>> return null;
>>> }
>>> });
>>>
>>>
>>>
>>> Note : In another scenario also i didn't get the logs which are present
>>> inside map,filter closures. But logs outside these closures are getting
>>> printed as usual. If i can't get the logger prints inside these closures
>>> how do i debug them ?
>>>
>>> Thanks
>>> Vinoth Sankar
>>>
>>> On Wed, Oct 28, 2015 at 8:29 PM Adrian Tanase  wrote:
>>>
 The first line is distributing your fileList variable in the cluster as
 a RDD, partitioned using the default partitioner settings (e.g. Number of
 cores in your cluster).

 Each of your workers would one or more slices of data (depending on how
 many cores each executor has) and the abstraction is called partition.

 What is your use case? If you want to load the files and continue
 processing in parallel, then a simple .map should work.
 If you want to execute arbitrary code based on the list of files that
 each executor received, then you need to use .foreach that will get
 executed for each of the entries, on the worker.

 -adrian

 From: Vinoth Sankar
 Date: Wednesday, October 28, 2015 at 2:49 PM
 To: "user@spark.apache.org"
 Subject: How do I parallize Spark Jobs at Executor Level.

 Hi,

 I'm reading and filtering large no of files using Spark. It's getting
 parallized at Spark Driver level only. How do i make it parallelize to
 Executor(Worker) Level. Refer the following sample. Is there any way to
 paralleling iterate the localIterator ?

 Note : I use Java 1.7 version

 JavaRDD files = javaSparkContext.parallelize(fileList)
 Iterator localIterator = files.toLocalIterator();

 Regards
 Vinoth Sankar

>>>


Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Silvio Fiorito
I don’t believe I have it on 1.5.1. Are you able to test the data locally to 
confirm, or is it too large?

From: "Zhang, Jingyu" 
>
Date: Friday, October 30, 2015 at 7:31 PM
To: Silvio Fiorito 
>
Cc: Ted Yu >, user 
>
Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0

Thanks Silvio and Ted,

Can you please let me know how to fix this intermittent issue? Should I wait for
EMR to upgrade to Spark 1.5.1, or change my code from DataFrames to normal
Spark map-reduce?

Regards,

Jingyu

On 31 October 2015 at 09:40, Silvio Fiorito 
> wrote:
It's something due to the columnar compression. I've seen similar intermittent
issues when caching DataFrames. "sportingpulse.com" is a value in one of the
columns of the DF.

From: Ted Yu
Sent: ‎10/‎30/‎2015 6:33 PM
To: Zhang, Jingyu
Cc: user
Subject: Re: key not found: sportingpulse.com in 
Spark SQL 1.5.0

I searched for sportingpulse in *.scala and *.java files under 1.5 branch.
There was no hit.

mvn dependency doesn't show sportingpulse either.

Is it possible this is specific to EMR ?

Cheers

On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu 
> wrote:

There is no problem in Spark SQL 1.5.1, but the "key not found: sportingpulse.com"
error shows up when I use 1.5.0.

I have to use 1.5.0 because that is the version AWS EMR supports. Can anyone tell
me why Spark uses "sportingpulse.com" and how to fix it?

Thanks.

Caused by: java.util.NoSuchElementException: key not found: 
sportingpulse.com

at scala.collection.MapLike$class.default(MapLike.scala:228)

at scala.collection.AbstractMap.default(Map.scala:58)

at scala.collection.mutable.HashMap.apply(HashMap.scala:64)

at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)

at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)

at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)

at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)

at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)

at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

at 

CompositeInputFormat in Spark

2015-10-30 Thread Alex Nastetsky
Does Spark have an implementation similar to CompositeInputFormat in
MapReduce?

CompositeInputFormat joins multiple datasets that are partitioned the same way,
with the same number of partitions, before the mapper runs, using the "part"
number in each file name to figure out which file to join with its counterparts
in the other datasets.

Here is a similar question from earlier this year:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201505.mbox/%3CCADrn=epwl6ghs9hfyo3csuxhshtycsrlbujcmpxrtz4zype...@mail.gmail.com%3E

From what I can tell, there's no way to tell Spark about how a dataset had
been previously partitioned, other than repartitioning it in order to
achieve a map-side join with a similarly partitioned dataset.
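
The closest analogue I know of, sketched below under the assumption that both
datasets are key-value pairs and you can afford one up-front partitioning pass:
once both RDDs share the same partitioner (and are persisted), the join over
them is a narrow dependency and does not shuffle again. Paths are placeholders.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)

// Pay the shuffle once, then keep the co-partitioned RDDs around.
val left = sc.sequenceFile[String, String]("hdfs:///data/left")
  .partitionBy(partitioner)
  .persist()
val right = sc.sequenceFile[String, String]("hdfs:///data/right")
  .partitionBy(partitioner)
  .persist()

// Both sides report the same partitioner, so this join adds no extra shuffle stage.
val joined = left.join(right)
joined.count()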


Performance issues in SSSP using GraphX

2015-10-30 Thread Khaled Ammar
Hi all,

I'm seeing interesting behavior from GraphX while running SSSP. I use standalone
mode with 16+1 machines, each with 30GB of memory and 4 cores. The dataset is
63GB; however, the input for some stages is huge, about 16 TB!

The computation was taking a very long time, so I stopped it.

For your information, I use the same SSSP code mentioned in the GraphX
documentation:
http://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api
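
For reference, a minimal sketch of that SSSP pattern (essentially the guide's
example), assuming the graph is loaded from an edge list with unit weights and
vertex 1 is the source; the input path is a placeholder:

import org.apache.spark.graphx._

// Load an edge list and give every edge weight 1.0; any Graph[_, Double] works here.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt").mapEdges(_ => 1.0)

val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),            // vertex program
  triplet =>                                                 // send messages along cheaper paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b)                                   // merge messages
)

println(sssp.vertices.take(10).mkString("\n"))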

I use StorageLevel.MEMORY_ONLY since I have plenty of memory.

I appreciate your comment/help about this issue.

-- 
Thanks,
-Khaled



Re: Exception while reading from kafka stream

2015-10-30 Thread Cody Koeninger
Just put them all in one stream and switch processing based on the topic
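
A minimal sketch of that single-stream approach, assuming the direct Kafka API
and the broker/topic strings from the snippet quoted further down; the topic of
each record is recovered from the offset ranges so the processing can branch on
it:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val conf = new SparkConf().setAppName("multi-topic-direct-stream")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "ip:port,ip:port")
val topics = Set("x", "y", "z")                       // one stream covering every topic

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

messages.foreachRDD { (rdd, time) =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = ranges(i).topic                       // partition i maps to offset range i
    iter.map { case (_, value) => (topic, value) }    // switch on `topic` here as needed
  }.saveAsTextFile(s"hdfs://myuser:8020/user/hdfs/output-${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()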

On Fri, Oct 30, 2015 at 6:29 AM, Ramkumar V  wrote:

> i want to join all those logs in some manner. That's what i'm trying to do.
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 4:57 PM, Saisai Shao 
> wrote:
>
>> I don't think Spark Streaming supports multiple streaming context in one
>> jvm, you cannot use in such way. Instead you could run multiple streaming
>> applications, since you're using Yarn.
>>
>> 2015年10月30日星期五,Ramkumar V  写道:
>>
>>> I found NPE is mainly because of im using the same JavaStreamingContext
>>> for some other kafka stream. if i change the name , its running
>>> successfully. how to run multiple JavaStreamingContext in a program ?  I'm
>>> getting following exception if i run multiple JavaStreamingContext in
>>> single file.
>>>
>>> 15/10/30 11:04:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
>>> exitCode: 15, (reason: User class threw exception:
>>> java.lang.IllegalStateException: Only one StreamingContext may be started
>>> in this JVM. Currently running StreamingContext was started
>>> atorg.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:622)
>>>
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao 
>>> wrote:
>>>
 From the code, I think this field "rememberDuration" shouldn't be
 null, it will be verified at the start, unless some place changes it's
 value in the runtime that makes it null, but I cannot image how this
 happened. Maybe you could add some logs around the place where exception
 happens if you could reproduce it.

 On Fri, Oct 30, 2015 at 5:31 PM, Ramkumar V 
 wrote:

> No. this is the only exception that im getting multiple times in my
> log. Also i was reading some other topics earlier but im not faced this 
> NPE.
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao 
> wrote:
>
>> I just did a local test with your code, seems everything is fine, the
>> only difference is that I use the master branch, but I don't think it
>> changes a lot in this part. Do you met any other exceptions or errors
>> beside this one? Probably this is due to other exceptions that makes this
>> system unstable.
>>
>> On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V 
>> wrote:
>>
>>> No, i dont have any special settings. if i keep only reading line in
>>> my code, it's throwing NPE.
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao >> > wrote:
>>>
 Do you have any special settings, from your code, I don't think it
 will incur NPE at that place.

 On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V <
 ramkumar.c...@gmail.com> wrote:

> spark version - spark 1.4.1
>
> my code snippet:
>
> String brokers = "ip:port,ip:port";
> String topics = "x,y,z";
> HashSet TopicsSet = new
> HashSet(Arrays.asList(topics.split(",")));
> HashMap kafkaParams = new HashMap String>();
> kafkaParams.put("metadata.broker.list", brokers);
>
> JavaPairInputDStream messages =
> KafkaUtils.createDirectStream(
>jssc,
>String.class,
>String.class,
>StringDecoder.class,
>StringDecoder.class,
>kafkaParams,
> TopicsSet
>);
>
> messages.foreachRDD(new Function,Void> () {
> public Void call(JavaPairRDD tuple) {
> JavaRDDrdd = tuple.values();
>
> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
> return null;
> }
>});
>
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao <
> sai.sai.s...@gmail.com> wrote:
>
>> What Spark version are you using, also a small code snippet of
>> how you use Spark Streaming would be greatly helpful.
>>
>> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V <
>> ramkumar.c...@gmail.com> wrote:
>>
>>> I can able to read and print few lines. Afterthat i'm getting
>>> this exception. Any idea for this ?

sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-10-30 Thread Tom Stewart
I have the following script in a file named test.R:

library(SparkR)
sc <- sparkR.init(master="yarn-client")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
showDF(df)
sparkR.stop()
q(save="no")

If I submit this with "sparkR test.R" or "R  CMD BATCH test.R" or "Rscript 
test.R" it fails with this error:
15/10/29 08:08:49 INFO r.BufferedStreamThread: Fatal error: cannot open file 
'/mnt/hdfs9/yarn/nm-local-dir/usercache/hadoop/appcache/application_1446058618330_0171/container_e805_1446058618330_0171_01_05/sparkr/SparkR/worker/daemon.R':
 No such file or directory
15/10/29 08:08:59 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.net.SocketTimeoutException: Accept timed out


However, if I launch just an interactive sparkR shell and cut/paste those 
commands - it runs fine.
It also runs fine on the same Hadoop cluster with Spark 1.4.1.
And, it runs fine from batch mode if I just use sparkR.init() and not 
sparkR.init(master="yarn-client") 



Spark Streaming (1.5.0) flaky when recovering from checkpoint

2015-10-30 Thread David P. Kleinschmidt
I have a Spark Streaming job that runs great the first time around (Elastic
MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job runs
but Spark itself seems to be jacked-up in lots of little ways:

   - Executors, which are normally stable for days, are terminated within a
   couple hours. I can see the termination notices in the logs, but no related
   exceptions. The nodes are active in YARN, but Spark doesn't pick them up
   again.
   - Hadoop web proxy can't find Spark web UI ("no route to host")
   - When I get to the web UI, the Streaming tab is missing
   - The web UI appears to stop updating after a few thousand jobs

I'm kind of at wits end here. I've been banging my head against this for a
couple weeks now, and any help would be greatly appreciated. Below is the
configuration that I'm sending to EMR.

- dpk

[
  {
"Classification": "emrfs-site",
"Properties": {"fs.s3.consistent": "true"}
  },
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.default.parallelism": "8",
  "spark.dynamicAllocation.enabled": "true",
  "spark.dynamicAllocation.minExecutors": "1",
  "spark.executor.cores": "4",
  "spark.executor.memory": "4148M",
  "spark.streaming.receiver.writeAheadLog.enable": "true",
  "spark.yarn.executor.memoryOverhead": "460"
}
  },
  {
"Classification": "spark-env",
"Configurations": [{
"Classification": "export",
"Properties": {"SPARK_YARN_MODE": "true"}
}]
  },
  {
"Classification": "spark-log4j",
"Properties": {"log4j.rootCategory": "WARN,console"}
  }
]


Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-10-30 Thread karthik kadiyam
Hi Shahid,

I played around with the Spark driver memory too. In the conf file it was set
to "--driver-memory 20G" first. When I changed spark.driver.maxResultSize from
the default to 2g, I also changed the driver memory to 30G and tried again. It
gave me the same error saying "bigger than spark.driver.maxResultSize (1024.0
MB)".

One other thing I observed: in one of the tasks, the data it is trying to
process is more than 100 MB, and that executor and task keep losing the
connection and retrying. I also tried increasing the number of tasks by
repartitioning from 120 to 240 and then to 480. Still, one of my tasks is
trying to process more than 100 MB, while the other tasks hardly process 1 MB
to 10 MB, some around 20 MB, and some 0 MB.

Any idea how I can even out the data distribution across the nodes?
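
On the original config question quoted below: a value set through getConf()
after the context already exists is not picked up, which is why the limit stays
at 1024 MB. A minimal sketch of setting it up front (the 2g value is the one
from the thread; the app name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The conf must carry the setting before the (streaming) context is constructed;
// alternatively pass --conf spark.driver.maxResultSize=2g to spark-submit.
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.driver.maxResultSize", "2g")

val ssc = new StreamingContext(conf, Seconds(10))   // build the job against this context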


On Fri, Oct 30, 2015 at 12:09 AM, shahid ashraf  wrote:

> Hi
> I guess you need to increase spark driver memory as well. But that should
> be set in conf files
> Let me know if that resolves
> On Oct 30, 2015 7:33 AM, "karthik kadiyam" 
> wrote:
>
>> Hi,
>>
>> In spark streaming job i had the following setting
>>
>> this.jsc.getConf().set("spark.driver.maxResultSize", “0”);
>> and i got the error in the job as below
>>
>> User class threw exception: Job aborted due to stage failure: Total size
>> of serialized results of 120 tasks (1082.2 MB) is bigger than
>> spark.driver.maxResultSize (1024.0 MB)
>>
>> Basically i realized that as default value is 1 GB. I changed
>> the configuration as below.
>>
>> this.jsc.getConf().set("spark.driver.maxResultSize", “2g”);
>>
>> and when i ran the job it gave the error
>>
>> User class threw exception: Job aborted due to stage failure: Total size
>> of serialized results of 120 tasks (1082.2 MB) is bigger than
>> spark.driver.maxResultSize (1024.0 MB)
>>
>> So, basically the change i made is not been considered in the job. so my
>> question is
>>
>> - "spark.driver.maxResultSize", “2g” is this the right way to change or
>> any other way to do it.
>> - Is this a bug in spark 1.3 or something or any one had this issue
>> before?
>>
>>


RE: Pivot Data in Spark and Scala

2015-10-30 Thread Andrianasolo Fanilo
Hey,

The question is tricky; here is a possible answer that defines the years as keys of
a hash map per client and then merges those maps:


import org.apache.spark.SparkContext

import scalaz._
import Scalaz._

val sc = new SparkContext("local[*]", "sandbox")

// Create an RDD of your rows
val rdd = sc.parallelize(Seq(
  ("A", 2015, 4),
  ("A", 2014, 12),
  ("A", 2013, 1),
  ("B", 2015, 24),
  ("B", 2013, 4)
))

// Find all the years present in the RDD
val minYear = rdd.map(_._2).reduce(Math.min)   // minimum year
val maxYear = rdd.map(_._2).reduce(Math.max)   // maximum year
val sequenceOfYears = maxYear to minYear by -1 // sequence of years from max down to min

// Define how to build, for each client, a Map of year -> value, and how those maps are
// merged. I'm lazy, so I use Scalaz's |+| to merge two maps; this assumes no two lines
// share the same client and year.
def createCombiner(obj: (Int, Int)): Map[Int, String] = Map(obj._1 -> obj._2.toString)
def mergeValue(accum: Map[Int, String], obj: (Int, Int)) = accum + (obj._1 -> obj._2.toString)
def mergeCombiners(accum1: Map[Int, String], accum2: Map[Int, String]) = accum1 |+| accum2

// For each client, check every year from maxYear to minYear; if a year is missing from the
// computed map, emit a blank. This assumes the sequence of all years fits in memory; if you
// had to compute per day it might not, and you would want a specialized time-series library.
val result = rdd
  .map { case (client, year, value) => (client, (year, value)) }
  .combineByKey(createCombiner, mergeValue, mergeCombiners)
  .map { case (name, mapOfYearsToValues) =>
    (Seq(name) ++ sequenceOfYears.map(year => mapOfYearsToValues.getOrElse(year, " "))).mkString(",")
  }

result.foreach(println)

sc.stop()

Best regards,
Fanilo

De : Adrian Tanase [mailto:atan...@adobe.com]
Envoyé : vendredi 30 octobre 2015 11:50
À : Deng Ching-Mallete; Ascot Moss
Cc : User
Objet : Re: Pivot Data in Spark and Scala

Its actually a bit tougher as you’ll first need all the years. Also not sure 
how you would reprsent your “columns” given they are dynamic based on the input 
data.

Depending on your downstream processing, I’d probably try to emulate it with a 
hash map with years as keys instead of the columns.

There is probably a nicer solution using the data frames API but I’m not 
familiar with it.

If you actually need vectors I think this article I saw recently on the data 
bricks blog will highlight some options (look for gather encoder)
https://databricks.com/blog/2015/10/20/audience-modeling-with-spark-ml-pipelines.html

-adrian

From: Deng Ching-Mallete
Date: Friday, October 30, 2015 at 4:35 AM
To: Ascot Moss
Cc: User
Subject: Re: Pivot Data in Spark and Scala

Hi,

You could transform it into a pair RDD then use the combineByKey function.

HTH,
Deng

On Thu, Oct 29, 2015 at 7:29 PM, Ascot Moss 
> wrote:
Hi,

I have data as follows:

A, 2015, 4
A, 2014, 12
A, 2013, 1
B, 2015, 24
B, 2013 4


I need to convert the data to a new format:
A ,4,12,1
B,   24,,4

Any idea how to make it in Spark Scala?

Thanks







heap memory

2015-10-30 Thread Younes Naguib
Hi all,

I'm running a spark shell: bin/spark-shell --executor-memory 32G 
--driver-memory 8G
I keep getting :
15/10/30 13:41:59 WARN MemoryManager: Total allocation exceeds 
95.00% (2,147,483,647 bytes) of heap memory

Any help ?

Thanks,
Younes Naguib
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | 
younes.nag...@tritondigital.com 



Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Silvio Fiorito
In the update function you can return None for a key and it will remove it. If 
you’re restarting your app you can delete your checkpoint directory to start 
from scratch, rather than continuing from the previous state.
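
A minimal sketch of that behaviour, assuming a word-count DStream; the input
source, checkpoint path, and the reset condition are placeholders for whatever
the real job uses:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("running-word-count")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/wordcount-checkpoint")     // stateful ops need a checkpoint dir

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// Returning None for a key removes it from the state; that is the "reset".
val runningTotals = words.updateStateByKey[Long] { (newCounts: Seq[Int], state: Option[Long]) =>
  val total = state.getOrElse(0L) + newCounts.sum
  val resetThisKey = false                             // placeholder: e.g. flip at midnight
  if (resetThisKey) None else Some(total)
}

runningTotals.print()
ssc.start()
ssc.awaitTermination()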

From: Sandeep Giri >
Date: Friday, October 30, 2015 at 9:29 AM
To: skaarthik oss >
Cc: dev >, user 
>
Subject: Re: Maintaining overall cumulative data in Spark Streaming

How do we reset the aggregated statistics to null?

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com.
Phone: +1-253-397-1945 (Office)



On Fri, Oct 30, 2015 at 9:49 AM, Sandeep Giri 
> wrote:

Yes, update state by key worked.

Though there are some more complications.

On Oct 30, 2015 8:27 AM, "skaarthik oss" 
> wrote:
Did you consider UpdateStateByKey operation?

From: Sandeep Giri 
[mailto:sand...@knowbigdata.com]
Sent: Thursday, October 29, 2015 3:09 PM
To: user >; dev 
>
Subject: Maintaining overall cumulative data in Spark Streaming

Dear All,

If a continuous stream of text is coming in and you have to keep publishing the 
overall word count so far since 0:00 today, what would you do?

Publishing the results for a window is easy but if we have to keep aggregating 
the results, how to go about it?

I have tried to keep an StreamRDD with aggregated count and keep doing a 
fullouterjoin but didn't work. Seems like the StreamRDD gets reset.

Kindly help.

Regards,
Sandeep Giri




Re: Saving RDDs in Tachyon

2015-10-30 Thread Akhil Das
I guess you can do a .saveAsObjectFile and read it back with sc.objectFile.

Thanks
Best Regards
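
A minimal sketch of that, with a stand-in record type and a made-up Tachyon
master address, assuming the Tachyon Hadoop client is on the classpath; any URI
Hadoop can resolve works for saveAsObjectFile:

// Stand-in for the real record type read out of the Parquet files.
case class Reading(id: Long, value: Double)

val rdd = sc.parallelize(Seq(Reading(1L, 0.5), Reading(2L, 1.5)))

// Write the RDD to Tachyon as Hadoop object files (Java serialization under the hood).
rdd.saveAsObjectFile("tachyon://tachyon-master:19998/shared/readings")

// Any other Spark job can read it back with the matching element type.
val restored = sc.objectFile[Reading]("tachyon://tachyon-master:19998/shared/readings")
restored.count()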

On Fri, Oct 23, 2015 at 7:57 AM, mark  wrote:

> I have Avro records stored in Parquet files in HDFS. I want to read these
> out as an RDD and save that RDD in Tachyon for any spark job that wants the
> data.
>
> How do I save the RDD in Tachyon? What format do I use? Which RDD
> 'saveAs...' method do I want?
>
> Thanks
>


Spark 1.5.1 Dynamic Resource Allocation

2015-10-30 Thread Tom Stewart
I am running the following command on a Hadoop cluster to launch Spark shell 
with DRA:
spark-shell  --conf spark.dynamicAllocation.enabled=true --conf 
spark.shuffle.service.enabled=true --conf 
spark.dynamicAllocation.minExecutors=4 --conf 
spark.dynamicAllocation.maxExecutors=12 --conf 
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=120 --conf 
spark.dynamicAllocation.schedulerBacklogTimeout=300 --conf 
spark.dynamicAllocation.executorIdleTimeout=60 --executor-memory 512m --master 
yarn-client --queue default

This is the code I'm running within the Spark Shell - just demo stuff from the
web site.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("hdfs://ns/public/sample/kmeans_data.txt")

val parsedData = data.map(s => Vectors.dense(s.split(' 
').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

This works fine on Spark 1.4.1 but is failing on Spark 1.5.1. Did something 
change that I need to do differently for DRA on 1.5.1?

This is the error I am getting:
15/10/29 21:44:19 WARN YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/10/29 21:44:34 WARN YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/10/29 21:44:49 WARN YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources

That happens to be the same error you get if you haven't followed the steps to
enable DRA. However, I have done those, and as I said, if I just flip to Spark
1.4.1 on the same cluster it works with my YARN config.



Re: Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-30 Thread Akhil Das
You can set it to MEMORY_AND_DISK, in this case data will fall back to disk
when it exceeds the memory.
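
For a plain RDD the setting looks like the sketch below (the input path is a
placeholder); how Hive on Spark exposes the storage level is a separate question.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object PersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))
    val rdd = sc.textFile("hdfs:///data/input")   // placeholder path

    // MEMORY_AND_DISK spills partitions that don't fit in memory to disk;
    // with MEMORY_ONLY they would be dropped and recomputed when needed again.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    println(rdd.count())
    sc.stop()
  }
}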

Thanks
Best Regards

On Fri, Oct 23, 2015 at 9:52 AM, JoneZhang  wrote:

> 1.Whether Spark will use disk when the memory is not enough on MEMORY_ONLY
> Storage Level?
> 2.If not, How can i set Storage Level when i use Hive on Spark?
> 3.Do Spark have any intention of  dynamically determined Hive on MapReduce
> or Hive on Spark, base on SQL features.
>
> Thanks in advance
> Best regards
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Whether-Spark-will-use-disk-when-the-memory-is-not-enough-on-MEMORY-ONLY-Storage-Level-tp25171.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Sandeep Giri
How to we reset the aggregated statistics to null?

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. 
Phone: +1-253-397-1945 (Office)



On Fri, Oct 30, 2015 at 9:49 AM, Sandeep Giri 
wrote:

> Yes, update state by key worked.
>
> Though there are some more complications.
> On Oct 30, 2015 8:27 AM, "skaarthik oss"  wrote:
>
>> Did you consider UpdateStateByKey operation?
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
>> *Sent:* Thursday, October 29, 2015 3:09 PM
>> *To:* user ; dev 
>> *Subject:* Maintaining overall cumulative data in Spark Streaming
>>
>>
>>
>> Dear All,
>>
>>
>>
>> If a continuous stream of text is coming in and you have to keep
>> publishing the overall word count so far since 0:00 today, what would you
>> do?
>>
>>
>>
>> Publishing the results for a window is easy but if we have to keep
>> aggregating the results, how to go about it?
>>
>>
>>
>> I have tried to keep an StreamRDD with aggregated count and keep doing a
>> fullouterjoin but didn't work. Seems like the StreamRDD gets reset.
>>
>>
>>
>> Kindly help.
>>
>>
>>
>> Regards,
>>
>> Sandeep Giri
>>
>>
>>
>


Re: [Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-30 Thread Yifan LI
Thanks Deng,

Yes, I agree that there is a partition larger than 2GB which caused this
exception.

But in my case it did not help to fix this problem by directly increasing the
number of partitions in the sortBy operation.

I think the partitioning produced by sortBy is not balanced: in my dataset
there are large numbers of elements with the same value, so they all end up
in a single partition whose size becomes quite large.

So I called the repartition method after sorting to get balanced partitions,
and it works well.
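
A sketch of that two-step approach with made-up data, key column, and partition
counts:

import org.apache.spark.{SparkConf, SparkContext}

object SortThenRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortThenRepartition"))

    // skewed toy data: many elements share the same sort key
    val rdd = sc.parallelize(1 to 1000000).map(i => (i % 10, i))

    // sortBy range-partitions on the key, so heavily repeated keys
    // can end up concentrated in one oversized partition...
    val sorted = rdd.sortBy(_._1, ascending = true, numPartitions = 200)

    // ...so shuffle again into evenly sized partitions afterwards
    // (global ordering across partitions is not preserved by repartition)
    val balanced = sorted.repartition(400)

    println(balanced.count())
    sc.stop()
  }
}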

On 30 October 2015 at 03:13, Deng Ching-Mallete  wrote:

> Hi Yifan,
>
> This is a known issue, please refer to
> https://issues.apache.org/jira/browse/SPARK-6235 for more details. In
> your case, it looks like you are caching to disk a partition > 2G. A
> workaround would be to increase the number of your RDD partitions in order
> to make them smaller in size.
>
> HTH,
> Deng
>
> On Thu, Oct 29, 2015 at 8:40 PM, Yifan LI  wrote:
>
>> I have a guess that before scanning that RDD, I sorted it and set
>> partitioning, so the result is not balanced:
>>
>> sortBy[S](f: Function
>> 
>> [T, S], ascending: Boolean, *numPartitions*: Int)
>>
>> I will try to repartition it to see if it helps.
>>
>> Best,
>> Yifan LI
>>
>>
>>
>>
>>
>> On 29 Oct 2015, at 12:52, Yifan LI  wrote:
>>
>> Hey,
>>
>> I was just trying to scan a large RDD sortedRdd, ~1billion elements,
>> using toLocalIterator api, but an exception returned as it was almost
>> finished:
>>
>> java.lang.RuntimeException: java.lang.IllegalArgumentException: Size
>> exceeds Integer.MAX_VALUE
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:821)
>> at
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
>> at
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
>> at
>> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
>> at
>> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:302)
>> at
>> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>> at
>> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>> org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
>> at
>> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
>> at
>> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
>> at
>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>> at
>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>> at
>> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>> at
>> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>> at
>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>> at
>> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>> at
>> 

Re: Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-30 Thread Ted Yu
Jone:
For #3, consider ask on vendor's mailing list.

On Fri, Oct 30, 2015 at 7:11 AM, Akhil Das 
wrote:

> You can set it to MEMORY_AND_DISK, in this case data will fall back to
> disk when it exceeds the memory.
>
> Thanks
> Best Regards
>
> On Fri, Oct 23, 2015 at 9:52 AM, JoneZhang  wrote:
>
>> 1.Whether Spark will use disk when the memory is not enough on MEMORY_ONLY
>> Storage Level?
>> 2.If not, How can i set Storage Level when i use Hive on Spark?
>> 3.Do Spark have any intention of  dynamically determined Hive on MapReduce
>> or Hive on Spark, base on SQL features.
>>
>> Thanks in advance
>> Best regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Whether-Spark-will-use-disk-when-the-memory-is-not-enough-on-MEMORY-ONLY-Storage-Level-tp25171.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Caching causes later actions to get stuck

2015-10-30 Thread Sampo Niskanen
Hi,

I'm facing a problem where Spark is able to perform an action on a cached
RDD correctly the first time it is run, but running it immediately
afterwards (or an action depending on that RDD) causes it to get stuck.

I'm using a MongoDB connector for fetching all documents from a collection
to an RDD and caching that (though according to the error message it
doesn't fully fit).  The first action on it always succeeds, but later
actions fail.  I just upgraded from Spark 0.9.x to 1.5.1, and didn't have
that problem with the older version.


The output I get:


scala> analyticsRDD.cache
res10: analyticsRDD.type = MapPartitionsRDD[84] at map at Mongo.scala:69

scala> analyticsRDD.count
[Stage 2:=> (472 + 8) /
524]15/10/30 14:20:00 WARN MemoryStore: Not enough space to cache
rdd_84_469 in memory! (computed 13.0 MB so far)
15/10/30 14:20:00 WARN MemoryStore: Not enough space to cache rdd_84_470 in
memory! (computed 12.1 MB so far)
15/10/30 14:20:00 WARN MemoryStore: Not enough space to cache rdd_84_476 in
memory! (computed 5.6 MB so far)
...
15/10/30 14:20:06 WARN MemoryStore: Not enough space to cache rdd_84_522 in
memory! (computed 5.3 MB so far)
[Stage 2:==>(522 + 2) /
524]15/10/30 14:20:06 WARN MemoryStore: Not enough space to cache
rdd_84_521 in memory! (computed 13.9 MB so far)
res11: Long = 7754957


scala> analyticsRDD.count
[Stage 3:=> (474 + 0) /
524]


*** Restart Spark ***

scala> analyticsRDD.count
res10: Long = 7755043


scala> analyticsRDD.count
res11: Long = 7755050



The cached RDD always gets stuck at the same point.  I tried enabling full
debug logging, but couldn't make out anything useful.


I'm also facing another issue with loading a lot of data from MongoDB,
which might be related, but the error is different:
https://groups.google.com/forum/#!topic/mongodb-user/Knj406szd74


Any ideas?


*Sampo Niskanen*

*Lead developer / Wellmo*
sampo.niska...@wellmo.com
+358 40 820 5291


SparkR job with >200 tasks hangs when calling from web server

2015-10-30 Thread rporcio
Hi,

I have a web server which can execute R codes using SparkR.
The R session is created with the Rscript init.R command, where the init.R
file contains a sparkR initialization section:

library(SparkR, lib.loc = paste("/opt/Spark/spark-1.5.1-bin-hadoop2.6",
"R", "lib", sep = "/"))
sc <<- sparkR.init(master = "local[4]", appName = "TestR", sparkHome =
"/opt/Spark/spark-1.5.1-bin-hadoop2.6", sparkPackages =
"com.databricks:spark-csv_2.10:1.2.0")
sqlContext <<- sparkRSQL.init(sc)

I have the below example R code that I want to execute (flights.csv comes
from SparkR examples):

df <- read.df(sqlContext, "/opt/Spark/flights.csv", source =
"com.databricks.spark.csv", header="true")
registerTempTable(df, "flights")
depDF <- sql(sqlContext, "SELECT dep FROM flights")
deps <- collect(depDF)

If I run this code, it executes successfully. When I check the Spark UI,
I see that the corresponding job has only 2 tasks.

But if I change the first row to
df <- repartition(read.df(sqlContext, "/opt/Spark/flights.csv", source =
"com.databricks.spark.csv", header="true"), 200)
and execute the R code again, the corresponding job has 202 tasks, of which it
successfully finishes some (like 132/202) but then hangs forever.

If I check the stderr of the executor I can see that the executor can't
communicate with the driver:

15/10/30 15:34:24 WARN AkkaRpcEndpointRef: Error sending message [message =
Heartbeat(0,[Lscala.Tuple2;@36834e15,BlockManagerId(0, 192.168.178.198,
7092))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [30
seconds]. This timeout is controlled by spark.rpc.askTimeout

I tried to change memory (e.g. 4g for the driver), akka and timeout settings,
but with no luck.

Executing the same code (with the repartition part) directly from R finishes
successfully, so I assume the problem is somehow related to the web server, but
I can't figure it out.

I'm using Centos.

Can someone give me some advice what should I try?

Thanks





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-job-with-200-tasks-hangs-when-calling-from-web-server-tp25237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Save data to different S3

2015-10-30 Thread Steve Loughran

On 30 Oct 2015, at 18:05, William Li 
> wrote:

Thanks for your response. My secret has a slash (/) so it didn’t work…

That's a recurrent problem with the hadoop/java s3 clients. Keep trying to
regenerate a secret until you get one that works.



how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks,

I have a need to "append" two dataframes -- I was hoping to use UnionAll
but it seems that this operation treats the underlying dataframes as
sequence of columns, rather than a map.

In particular, my problem is that the columns in the two DFs are not in the
same order --notice that my customer_id somehow comes out a string:

This is Spark 1.4.1

case class Test(epoch: Long,browser:String,customer_id:Int,uri:String)
val test = Test(1234l,"firefox",999,"http://foobar")

case class Test1( customer_id :Int,uri:String,browser:String,
 epoch :Long)
val test1 = Test1(888,"http://foobar","ie",12343)
val df=sc.parallelize(Seq(test)).toDF
val df1=sc.parallelize(Seq(test1)).toDF
df.unionAll(df1)

//res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser:
string, customer_id: string, uri: string]

​

Is unionAll the wrong operation? Any special incantations? Or advice on how
to otherwise get this to succeeed?


Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ali Tajeldin EDU
You can take a look at the smvPivot function in the SMV library ( 
https://github.com/TresAmigosSD/SMV ).  Should look for method "smvPivot" in 
SmvDFHelper (
http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvDFHelper).
  You can also perform the pivot on a group-by-group basis.  See smvPivot and 
smvPivotSum in SmvGroupedDataFunc 
(http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc).

Docs from smvPivotSum are copied below.  Note that you don't have to specify 
the baseOutput columns, but if you don't, it will force an additional action on 
the input data frame to build the cross products of all possible values in your 
input pivot columns. 

Perform a normal SmvPivot operation followed by a sum on all the output pivot 
columns.
For example:
df.smvGroupBy("id").smvPivotSum(Seq("month", "product"))("count")("5_14_A", 
"5_14_B", "6_14_A", "6_14_B")
and the following input:
Input
| id  | month | product | count |
| --- | - | --- | - |
| 1   | 5/14  |   A |   100 |
| 1   | 6/14  |   B |   200 |
| 1   | 5/14  |   B |   300 |
will produce the following output:
| id  | count_5_14_A | count_5_14_B | count_6_14_A | count_6_14_B |
| --- |  |  |  |  |
| 1   | 100  | 300  | NULL | 200  |
pivotCols
The sequence of column names whose values will be used as the output pivot 
column names.
valueCols
The columns whose value will be copied to the pivoted output columns.
baseOutput
The expected base output column names (without the value column prefix). The 
user is required to supply the list of expected pivot column output names to 
avoid and extra action on the input DataFrame just to extract the possible 
pivot columns. if an empty sequence is provided, then the base output columns 
will be extracted from values in the pivot columns (will cause an action on the 
entire DataFrame!)

--
Ali
PS: shoot me an email if you run into any issues using SMV.


On Oct 30, 2015, at 6:33 AM, Andrianasolo Fanilo 
 wrote:

> Hey,
>  
> The question is tricky, here is a possible answer by defining years as keys 
> for a hashmap per client and merging those :
>  
> import scalaz._
> import Scalaz._
>  
> val sc = new SparkContext("local[*]", "sandbox")
> 
> // Create RDD of your objects
> val rdd = sc.parallelize(Seq(
>   ("A", 2015, 4),
>   ("A", 2014, 12),
>   ("A", 2013, 1),
>   ("B", 2015, 24),
>   ("B", 2013, 4)
> ))
> 
> // Search for all the years in the RDD
> val minYear = rdd.map(_._2).reduce(Math.min)// look for minimum year
> val maxYear = rdd.map(_._2).reduce(Math.max)// look for maximum year
> val sequenceOfYears = maxYear to minYear by -1 // create sequence of years 
> from max to min
> 
> // Define functions to build, for each client, a Map of year -> value for 
> year, and how those maps will be merged
> def createCombiner(obj: (Int, Int)): Map[Int, String] = Map(obj._1 -> 
> obj._2.toString)
> def mergeValue(accum: Map[Int, String], obj: (Int, Int)) = accum + (obj._1 -> 
> obj._2.toString)
> def mergeCombiners(accum1: Map[Int, String], accum2: Map[Int, String]) = 
> accum1 |+| accum2 // I’m lazy so I use Scalaz to merge two maps of year -> 
> value, I assume we don’t have two lines with same client and year…
> 
> // For each client, check for each year from maxYear to minYear if it exists 
> in the computed map. If not input blank.
> val result = rdd
>   .map { case obj => (obj._1, (obj._2, obj._3)) }
>   .combineByKey(createCombiner, mergeValue, mergeCombiners)
>   .map{ case (name, mapOfYearsToValues) => (Seq(name) ++ 
> sequenceOfYears.map(year => mapOfYearsToValues.getOrElse(year, " 
> "))).mkString(",")} // here we assume that sequence of all years isn’t too 
> big to not fit in memory. If you had to compute for each day, it may break 
> and you would definitely need to use a specialized timeseries library…
> 
> result.foreach(println)
> 
> sc.stop()
>  
> Best regards,
> Fanilo
>  
> De : Adrian Tanase [mailto:atan...@adobe.com] 
> Envoyé : vendredi 30 octobre 2015 11:50
> À : Deng Ching-Mallete; Ascot Moss
> Cc : User
> Objet : Re: Pivot Data in Spark and Scala
>  
> It's actually a bit tougher as you’ll first need all the years. Also not sure 
> how you would represent your “columns” given they are dynamic based on the 
> input data.
>  
> Depending on your downstream processing, I’d probably try to emulate it with 
> a hash map with years as keys instead of the columns.
>  
> There is probably a nicer solution using the data frames API but I’m not 
> familiar with it.
>  
> If you actually need vectors I think this article I saw recently on the data 
> bricks blog will highlight some options (look for gather encoder)
> https://databricks.com/blog/2015/10/20/audience-modeling-with-spark-ml-pipelines.html
>  
> -adrian
>  
> From: Deng Ching-Mallete
> Date: Friday, October 30, 2015 at 4:35 AM
> To: 

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-8992

Should be in 1.6?
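
If so, a hedged sketch of what it could look like on Ascot's sample data, written
against the pivot API proposed in SPARK-8992 (column names are illustrative):

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PivotSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PivotSketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq(
      ("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
      ("B", 2015, 24), ("B", 2013, 4)
    ).toDF("name", "year", "value")

    // the distinct years become columns; missing (name, year) pairs come out null
    val pivoted = df.groupBy("name").pivot("year", Seq(2015, 2014, 2013)).sum("value")
    pivoted.show()

    sc.stop()
  }
}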



-- 
Ruslan Dautkhanov

On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss  wrote:

> Hi,
>
> I have data as follows:
>
> A, 2015, 4
> A, 2014, 12
> A, 2013, 1
> B, 2015, 24
> B, 2013 4
>
>
> I need to convert the data to a new format:
> A ,4,12,1
> B,   24,,4
>
> Any idea how to make it in Spark Scala?
>
> Thanks
>
>


Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
How about the following ?

scala> df.registerTempTable("df")
scala> df1.registerTempTable("df1")
scala> sql("select customer_id, uri, browser, epoch from df union select
customer_id, uri, browser, epoch from df1").show()
+-----------+-------------+-------+-----+
|customer_id|          uri|browser|epoch|
+-----------+-------------+-------+-----+
|        999|http://foobar|firefox| 1234|
|        888|http://foobar|     ie|12343|
+-----------+-------------+-------+-----+

Cheers

On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska 
wrote:

> Hi folks,
>
> I have a need to "append" two dataframes -- I was hoping to use UnionAll
> but it seems that this operation treats the underlying dataframes as
> sequence of columns, rather than a map.
>
> In particular, my problem is that the columns in the two DFs are not in
> the same order --notice that my customer_id somehow comes out a string:
>
> This is Spark 1.4.1
>
> case class Test(epoch: Long,browser:String,customer_id:Int,uri:String)
> val test = Test(1234l,"firefox",999,"http://foobar;)
>
> case class Test1( customer_id :Int,uri:String,browser:String,   epoch 
> :Long)
> val test1 = Test1(888,"http://foobar","ie",12343)
> val df=sc.parallelize(Seq(test)).toDF
> val df1=sc.parallelize(Seq(test1)).toDF
> df.unionAll(df1)
>
> //res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string, 
> customer_id: string, uri: string]
>
> ​
>
> Is unionAll the wrong operation? Any special incantations? Or advice on
> how to otherwise get this to succeeed?
>


RE: Error building Spark on Windows with sbt

2015-10-30 Thread Judy Nash
I have not had any success building using sbt/sbt on Windows.
However, I have been able to build the binary by using the maven command directly.

From: Richard Eggert [mailto:richard.egg...@gmail.com]
Sent: Sunday, October 25, 2015 12:51 PM
To: Ted Yu 
Cc: User 
Subject: Re: Error building Spark on Windows with sbt

Yes, I know, but it would be nice to be able to test things myself before I 
push commits.

On Sun, Oct 25, 2015 at 3:50 PM, Ted Yu 
> wrote:
If you have a pull request, Jenkins can test your change for you.

FYI

On Oct 25, 2015, at 12:43 PM, Richard Eggert 
> wrote:
Also, if I run the Maven build on Windows or Linux without setting 
-DskipTests=true, it hangs indefinitely when it gets to 
org.apache.spark.JavaAPISuite.

It's hard to test patches when the build doesn't work. :-/

On Sun, Oct 25, 2015 at 3:41 PM, Richard Eggert 
> wrote:
By "it works", I mean, "It gets past that particular error". It still fails 
several minutes later with a different error:

java.lang.IllegalStateException: impossible to get artifacts when data has not 
been loaded. IvyNode = org.scala-lang#scala-library;2.10.3


On Sun, Oct 25, 2015 at 3:38 PM, Richard Eggert 
> wrote:

When I try to start up sbt for the Spark build,  or if I try to import it in 
IntelliJ IDEA as an sbt project, it fails with a "No such file or directory" 
error when it attempts to "git clone" sbt-pom-reader into 
.sbt/0.13/staging/some-sha1-hash.

If I manually create the expected directory before running sbt or importing 
into IntelliJ, then it works. Why is it necessary to do this,  and what can be 
done to make it not necessary?

Rich



--
Rich



--
Rich



--
Rich


Stack overflow error caused by long lineage RDD created after many recursions

2015-10-30 Thread Panos Str
Hi all!

Here's a part of a Scala recursion that produces a stack overflow after many
recursions. I've tried many things but I've not managed to solve it.

val eRDD: RDD[(Int,Int)] = ...

val oldRDD: RDD[(Int,Int)] = ...

val result = Algorithm(eRDD, oldRDD)


def Algorithm(eRDD: RDD[(Int,Int)], oldRDD: RDD[(Int,Int)]): RDD[(Int,Int)] = {

  val newRDD = Transformation(eRDD, oldRDD)   // only transformations

  if (Compare(oldRDD, newRDD))                // Compare has the "take" action!!

    Algorithm(eRDD, newRDD)

  else

    newRDD
}

The above code is recursive and performs many iterations (until the compare
returns false)

After some iterations I get a stack overflow error. Probably the lineage
chain has become too long. Is there any way to solve this problem?
(persist/unpersist, checkpoint, sc.saveAsObjectFile).

Note1: Only compare function performs Actions on RDDs

Note2: I tried some combinations of persist/unpersist but none of them
worked!

I tried checkpointing from spark.streaming. I put a checkpoint at every
recursion but still received an overflow error

I also tried using saveAsObjectFile per iteration and then reading from
file (sc.objectFile) during the next iteration. Unfortunately I noticed that
the folders created per iteration keep increasing in size, while I was
expecting them to have equal size per iteration.

please help!!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stack-overflow-error-caused-by-long-lineage-RDD-created-after-many-recursions-tp25240.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Ted Yu
I searched for sportingpulse in *.scala and *.java files under 1.5 branch.
There was no hit.

mvn dependency doesn't show sportingpulse either.

Is it possible this is specific to EMR ?

Cheers

On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu 
wrote:

> There is not a problem in Spark SQL 1.5.1 but the error of "key not found:
> sportingpulse.com" shown up when I use 1.5.0.
>
> I have to use the version of 1.5.0 because that the one AWS EMR support.
> Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it?
>
> Thanks.
>
> Caused by: java.util.NoSuchElementException: key not found:
> sportingpulse.com
>
> at scala.collection.MapLike$class.default(MapLike.scala:228)
>
> at scala.collection.AbstractMap.default(Map.scala:58)
>
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>
> at
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(
> compressionSchemes.scala:258)
>
> at
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(
> CompressibleColumnBuilder.scala:110)
>
> at org.apache.spark.sql.columnar.NativeColumnBuilder.build(
> ColumnBuilder.scala:87)
>
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
> InMemoryColumnarTableScan.scala:152)
>
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(
> InMemoryColumnarTableScan.scala:152)
>
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
>
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
>
> at scala.collection.IndexedSeqOptimized$class.foreach(
> IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
> InMemoryColumnarTableScan.scala:152)
>
> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
> InMemoryColumnarTableScan.scala:120)
>
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278
> )
>
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(
> MapPartitionsWithPreparationRDD.scala:63)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:73)
>
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>


RE: Spark tunning increase number of active tasks

2015-10-30 Thread YI, XIAOCHUAN
Hi
Our team has a 40 node hortonworks Hadoop cluster 2.2.4.2-2  (36 data node) 
with apache spark 1.2 and 1.4 installed.
Each node has 64G RAM and 8 cores.

We are only able to use <= 72 executors with executor-cores=2,
so we only get 144 active tasks when running PySpark programs:
[Stage 1:===>(596 + 144) / 2042]
If we use a larger number for --num-executors, the PySpark program exits with
errors:
ERROR YarnScheduler: Lost executor 113 on hag017.example.com: remote Rpc client 
disassociated

I tried Spark 1.4 and conf.set("dynamicAllocation.enabled", "true"). However, it
did not help us increase the number of active tasks.
I expect a larger number of active tasks with the cluster we have.
Could anyone advise on this? Thank you very much!

Shaun



Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
No, i dont have any special settings. if i keep only reading line in my
code, it's throwing NPE.

*Thanks*,



On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao  wrote:

> Do you have any special settings, from your code, I don't think it will
> incur NPE at that place.
>
> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V 
> wrote:
>
>> spark version - spark 1.4.1
>>
>> my code snippet:
>>
>> String brokers = "ip:port,ip:port";
>> String topics = "x,y,z";
>> HashSet<String> TopicsSet = new
>> HashSet<String>(Arrays.asList(topics.split(",")));
>> HashMap<String, String> kafkaParams = new HashMap<String, String>();
>> kafkaParams.put("metadata.broker.list", brokers);
>>
>> JavaPairInputDStream<String, String> messages =
>> KafkaUtils.createDirectStream(
>>jssc,
>>String.class,
>>String.class,
>>StringDecoder.class,
>>StringDecoder.class,
>>kafkaParams,
>> TopicsSet
>>);
>>
>> messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
>> public Void call(JavaPairRDD<String, String> tuple) {
>> JavaRDD<String> rdd = tuple.values();
>> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
>> return null;
>> }
>>});
>>
>>
>> *Thanks*,
>> 
>>
>>
>> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao 
>> wrote:
>>
>>> What Spark version are you using, also a small code snippet of how you
>>> use Spark Streaming would be greatly helpful.
>>>
>>> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V 
>>> wrote:
>>>
 I can able to read and print few lines. Afterthat i'm getting this
 exception. Any idea for this ?

 *Thanks*,
 


 On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V 
 wrote:

> Hi,
>
> I'm trying to read from kafka stream and printing it textfile. I'm
> using java over spark. I dont know why i'm getting the following 
> exception.
> Also exception message is very abstract.  can anyone please help me ?
>
> Log Trace :
>
> 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job generator
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
> at
> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
> at
> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:267)
> at org.apache.spark.streaming.scheduler.JobGenerator.org
> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
> at
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/10/29 12:15:09 ERROR yarn.ApplicationMaster: User class threw
> exception: java.lang.NullPointerException
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
> at
> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
> at
> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
> at
> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>   

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-30 Thread Vinoth Sankar
Hi Deng.

Thanks for the response.

Is it possible to load sequence files in parallel and process each of them in
parallel...?
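
For reference, sc.sequenceFile accepts a comma-separated list or glob of paths,
so a single call already spreads the read across partitions that the executors
process in parallel. A hedged sketch, in Scala for brevity, with placeholder
paths and types:

import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.spark.{SparkConf, SparkContext}

object ParallelSequenceFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParallelSequenceFiles"))

    // one RDD over every matching sequence file; the glob is a placeholder
    val hdfsContent = sc.sequenceFile("hdfs:///user/hdfs/logs/*.seq",
      classOf[IntWritable], classOf[BytesWritable])

    // this map runs on the executors, one task per partition
    val recordSizes = hdfsContent.map { case (_, value) => value.getLength.toLong }
    println(recordSizes.sum())

    sc.stop()
  }
}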


Regards
Vinoth Sankar

On Fri, Oct 30, 2015 at 2:56 PM Deng Ching-Mallete 
wrote:

> Hi,
>
> You seem to be creating a new RDD for each element in your files RDD. What
> I would suggest is to load and process only one sequence file in your Spark
> job, then just execute multiple spark jobs to process each sequence file.
>
> With regard to your question of where to view the logs inside the
> closures, you should be able to see them in the executor log (via the Spark
> UI, in the Executors page).
>
> HTH,
> Deng
>
>
> On Fri, Oct 30, 2015 at 1:20 PM, Vinoth Sankar 
> wrote:
>
>> Hi Adrian,
>>
>> Yes. I need to load all files and process it in parallel. Following code
>> doesn't seem working(Here I used map, even tried foreach) ,I just
>> downloading the files from HDFS to local system and printing the logs count
>> in each file. Its not throwing any Exceptions,but not working. Files are
>> not getting downloaded. I even didn't get that LOGGER print. Same code
>> works if I iterate the files, but its not Parallelized. How do I get my
>> code Parallelized and Working.
>>
>> JavaRDD<String> files = sparkContext.parallelize(fileList);
>>
>> files.map(new Function<String, Void>()
>> {
>> public static final long serialVersionUID = 1L;
>>
>> @Override
>> public Void call(String hdfsPath) throws Exception
>> {
>> JavaPairRDD<IntWritable, BytesWritable> hdfsContent =
>> sc.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
>> JavaRDD<Message> logs = hdfsContent.map(new Function<Tuple2<IntWritable, BytesWritable>, Message>()
>> {
>> public Message call(Tuple2<IntWritable, BytesWritable> tuple2) throws
>> Exception
>> {
>> BytesWritable value = tuple2._2();
>> BytesWritable tmp = new BytesWritable();
>> tmp.setCapacity(value.getLength());
>> tmp.set(value);
>> return (Message) getProtos(1, tmp.getBytes());
>> }
>> });
>>
>> String path = "/home/sas/logsfilterapp/temp/" + hdfsPath.substring(26);
>>
>> Thread.sleep(2000);
>> logs.saveAsObjectFile(path);
>>
>> LOGGER.log(Level.INFO, "HDFS Path : {0} Count : {1}", new Object[] {
>> hdfsPath, logs.count() });
>> return null;
>> }
>> });
>>
>>
>>
>> Note : In another scenario also i didn't get the logs which are present
>> inside map,filter closures. But logs outside these closures are getting
>> printed as usual. If i can't get the logger prints inside these closures
>> how do i debug them ?
>>
>> Thanks
>> Vinoth Sankar
>>
>> On Wed, Oct 28, 2015 at 8:29 PM Adrian Tanase  wrote:
>>
>>> The first line is distributing your fileList variable in the cluster as
>>> a RDD, partitioned using the default partitioner settings (e.g. Number of
>>> cores in your cluster).
>>>
>>> Each of your workers would one or more slices of data (depending on how
>>> many cores each executor has) and the abstraction is called partition.
>>>
>>> What is your use case? If you want to load the files and continue
>>> processing in parallel, then a simple .map should work.
>>> If you want to execute arbitrary code based on the list of files that
>>> each executor received, then you need to use .foreach that will get
>>> executed for each of the entries, on the worker.
>>>
>>> -adrian
>>>
>>> From: Vinoth Sankar
>>> Date: Wednesday, October 28, 2015 at 2:49 PM
>>> To: "user@spark.apache.org"
>>> Subject: How do I parallize Spark Jobs at Executor Level.
>>>
>>> Hi,
>>>
>>> I'm reading and filtering large no of files using Spark. It's getting
>>> parallized at Spark Driver level only. How do i make it parallelize to
>>> Executor(Worker) Level. Refer the following sample. Is there any way to
>>> paralleling iterate the localIterator ?
>>>
>>> Note : I use Java 1.7 version
>>>
>>> JavaRDD<String> files = javaSparkContext.parallelize(fileList)
>>> Iterator<String> localIterator = files.toLocalIterator();
>>>
>>> Regards
>>> Vinoth Sankar
>>>
>>


Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Rex Xiong
Hi folks,

I have a Hive external table with partitions.
Every day, an App will generate a new partition day=-MM-dd stored by
parquet and run add-partition Hive command.
In some cases, we will add additional column to new partitions and update
Hive table schema, then a query across new and old partitions will fail
with exception:

org.apache.hive.service.cli.HiveSQLException:
org.apache.spark.sql.AnalysisException: cannot resolve 'newcolumn' given
input columns 

We have tried schema merging feature, but it's too slow, there're hundreds
of partitions.
Is it possible to bypass this schema check and return a default value (such
as null) for missing columns?

Thank you


Re: Best practises

2015-10-30 Thread huangzheng
I have the same question. Can anyone help us?




------------------ Original Message ------------------
From: "Deepak Sharma";
Date: 2015-10-30 (Fri) 7:23
To: "user";
Subject: Best practises




Hi
 I am looking for any blog / doc on developer best practices for using
Spark. I have already looked at the tuning guide on spark.apache.org.
 Please do let me know if anyone is aware of any such resource.
 
Thanks
 Deepak

Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
In general, I need to consume five different types of logs from Kafka in
Spark, with a different set of topics for each log. How do I start five
different streams in Spark?
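
For reference, a hedged sketch of attaching several direct streams to a single
StreamingContext and unioning them, in Scala for brevity; broker list, topic
names, and output path are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiLogStreams {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("MultiLogStreams"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "ip:port,ip:port")

    // one direct stream per log type, all on the same StreamingContext
    val topicGroups = Seq(Set("accessLog"), Set("errorLog"), Set("auditLog"),
      Set("appLog"), Set("metricLog"))
    val streams = topicGroups.map { topics =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)
    }

    // union them (or transform/join each stream separately) before writing out
    val allLogs = ssc.union(streams).map(_._2)
    allLogs.saveAsTextFiles("hdfs:///user/hdfs/output/logs")

    ssc.start()
    ssc.awaitTermination()
  }
}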

*Thanks*,



On Fri, Oct 30, 2015 at 4:40 PM, Ramkumar V  wrote:

> I found NPE is mainly because of im using the same JavaStreamingContext
> for some other kafka stream. if i change the name , its running
> successfully. how to run multiple JavaStreamingContext in a program ?  I'm
> getting following exception if i run multiple JavaStreamingContext in
> single file.
>
> 15/10/30 11:04:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception:
> java.lang.IllegalStateException: Only one StreamingContext may be started
> in this JVM. Currently running StreamingContext was started
> atorg.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:622)
>
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao 
> wrote:
>
>> From the code, I think this field "rememberDuration" shouldn't be null,
>> it will be verified at the start, unless some place changes it's value in
>> the runtime that makes it null, but I cannot image how this happened. Maybe
>> you could add some logs around the place where exception happens if you
>> could reproduce it.
>>
>> On Fri, Oct 30, 2015 at 5:31 PM, Ramkumar V 
>> wrote:
>>
>>> No. this is the only exception that im getting multiple times in my log.
>>> Also i was reading some other topics earlier but im not faced this NPE.
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao 
>>> wrote:
>>>
 I just did a local test with your code, seems everything is fine, the
 only difference is that I use the master branch, but I don't think it
 changes a lot in this part. Do you met any other exceptions or errors
 beside this one? Probably this is due to other exceptions that makes this
 system unstable.

 On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V 
 wrote:

> No, i dont have any special settings. if i keep only reading line in
> my code, it's throwing NPE.
>
> *Thanks*,
> 
>
>
> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao 
> wrote:
>
>> Do you have any special settings, from your code, I don't think it
>> will incur NPE at that place.
>>
>> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V 
>> wrote:
>>
>>> spark version - spark 1.4.1
>>>
>>> my code snippet:
>>>
>>> String brokers = "ip:port,ip:port";
>>> String topics = "x,y,z";
>>> HashSet TopicsSet = new
>>> HashSet(Arrays.asList(topics.split(",")));
>>> HashMap kafkaParams = new HashMap();
>>> kafkaParams.put("metadata.broker.list", brokers);
>>>
>>> JavaPairInputDStream messages =
>>> KafkaUtils.createDirectStream(
>>>jssc,
>>>String.class,
>>>String.class,
>>>StringDecoder.class,
>>>StringDecoder.class,
>>>kafkaParams,
>>> TopicsSet
>>>);
>>>
>>> messages.foreachRDD(new Function
>>> () {
>>> public Void call(JavaPairRDD tuple) {
>>> JavaRDDrdd = tuple.values();
>>>
>>> rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
>>> return null;
>>> }
>>>});
>>>
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao >> > wrote:
>>>
 What Spark version are you using, also a small code snippet of how
 you use Spark Streaming would be greatly helpful.

 On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V <
 ramkumar.c...@gmail.com> wrote:

> I can able to read and print few lines. Afterthat i'm getting this
> exception. Any idea for this ?
>
> *Thanks*,
> 
>
>
> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V <
> ramkumar.c...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to read from kafka stream and printing it textfile.
>> I'm using java over spark. I dont know why i'm getting the following
>> exception. Also exception message is very abstract.  can anyone 
>> please help
>> me ?
>>
>> Log Trace :
>>
>> 15/10/29 12:15:09 

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Michael Armbrust
>
> We have tried schema merging feature, but it's too slow, there're hundreds
> of partitions.
>
Which version of Spark?


Re: Exception while reading from kafka stream

2015-10-30 Thread Ramkumar V
I want to join all those logs in some manner. That's what I'm trying to do.

*Thanks*,



On Fri, Oct 30, 2015 at 4:57 PM, Saisai Shao  wrote:

> I don't think Spark Streaming supports multiple streaming context in one
> jvm, you cannot use in such way. Instead you could run multiple streaming
> applications, since you're using Yarn.
>
> 2015年10月30日星期五,Ramkumar V  写道:
>
>> I found NPE is mainly because of im using the same JavaStreamingContext
>> for some other kafka stream. if i change the name , its running
>> successfully. how to run multiple JavaStreamingContext in a program ?  I'm
>> getting following exception if i run multiple JavaStreamingContext in
>> single file.
>>
>> 15/10/30 11:04:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
>> exitCode: 15, (reason: User class threw exception:
>> java.lang.IllegalStateException: Only one StreamingContext may be started
>> in this JVM. Currently running StreamingContext was started
>> atorg.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:622)
>>
>>
>> *Thanks*,
>> 
>>
>>
>> On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao 
>> wrote:
>>
>>> From the code, I think this field "rememberDuration" shouldn't be null,
>>> it will be verified at the start, unless some place changes it's value in
>>> the runtime that makes it null, but I cannot image how this happened. Maybe
>>> you could add some logs around the place where exception happens if you
>>> could reproduce it.
>>>
>>> On Fri, Oct 30, 2015 at 5:31 PM, Ramkumar V 
>>> wrote:
>>>
 No. this is the only exception that im getting multiple times in my
 log. Also i was reading some other topics earlier but im not faced this 
 NPE.

 *Thanks*,
 


 On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao 
 wrote:

> I just did a local test with your code, seems everything is fine, the
> only difference is that I use the master branch, but I don't think it
> changes a lot in this part. Do you met any other exceptions or errors
> beside this one? Probably this is due to other exceptions that makes this
> system unstable.
>
> On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V 
> wrote:
>
>> No, i dont have any special settings. if i keep only reading line in
>> my code, it's throwing NPE.
>>
>> *Thanks*,
>> 
>>
>>
>> On Fri, Oct 30, 2015 at 2:14 PM, Saisai Shao 
>> wrote:
>>
>>> Do you have any special settings, from your code, I don't think it
>>> will incur NPE at that place.
>>>
>>> On Fri, Oct 30, 2015 at 4:32 PM, Ramkumar V >> > wrote:
>>>
 spark version - spark 1.4.1

 my code snippet:

 String brokers = "ip:port,ip:port";
 String topics = "x,y,z";
 HashSet TopicsSet = new
 HashSet(Arrays.asList(topics.split(",")));
 HashMap kafkaParams = new HashMap();
 kafkaParams.put("metadata.broker.list", brokers);

 JavaPairInputDStream messages =
 KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
 TopicsSet
);

 messages.foreachRDD(new Function
 () {
 public Void call(JavaPairRDD tuple) {
 JavaRDDrdd = tuple.values();

 rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output");
 return null;
 }
});


 *Thanks*,
 


 On Fri, Oct 30, 2015 at 1:57 PM, Saisai Shao <
 sai.sai.s...@gmail.com> wrote:

> What Spark version are you using, also a small code snippet of how
> you use Spark Streaming would be greatly helpful.
>
> On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V <
> ramkumar.c...@gmail.com> wrote:
>
>> I can able to read and print few lines. Afterthat i'm getting
>> this exception. Any idea for this ?
>>
>> *Thanks*,
>> 
>>
>>
>> On Thu, Oct 29, 2015 at 6:14 PM, Ramkumar V <
>> ramkumar.c...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read from kafka stream and 

RE: Pulling data from a secured SQL database

2015-10-30 Thread Young, Matthew T
> Can the driver pull data and then distribute execution?



Yes, as long as your dataset will fit in the driver's memory. Execute arbitrary
code to read the data on the driver as you normally would if you were writing a
single-node application. Once you have the data in a collection in the driver's
memory you can call sc.parallelize(data) to distribute the data out to the
workers for parallel processing as an RDD. You can then convert to a DataFrame
if that is more appropriate for your workflow.
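
A hedged sketch of that pattern: the JDBC URL, query, and row shape are
placeholders, and the driver is assumed to run on one of the machines allowed to
reach SQL Server, with the Microsoft JDBC driver on its classpath.

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object DriverSideJdbcPull {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DriverSideJdbcPull"))

    // 1) Read on the driver only -- this JVM is the one with database access.
    val url = "jdbc:sqlserver://dbhost;databaseName=mydb;integratedSecurity=true"
    val conn = DriverManager.getConnection(url)
    val rows = ArrayBuffer[(Int, String)]()
    try {
      val rs = conn.createStatement().executeQuery("SELECT id, name FROM dbo.MyTable")
      while (rs.next()) rows += ((rs.getInt("id"), rs.getString("name")))
    } finally {
      conn.close()
    }

    // 2) Ship the collected rows to the executors as an RDD for parallel processing.
    val rdd = sc.parallelize(rows, numSlices = 64)
    println(rdd.count())

    sc.stop()
  }
}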





-Original Message-
From: Thomas Ginter [mailto:thomas.gin...@utah.edu]
Sent: Friday, October 30, 2015 10:49 AM
To: user@spark.apache.org
Subject: Pulling data from a secured SQL database



I am working in an environment where data is stored in MS SQL Server.  It has 
been secured so that only a specific set of machines can access the database 
through an integrated security Microsoft JDBC connection.  We also have a 
couple of beefy linux machines we can use to host a Spark cluster but those 
machines do not have access to the databases directly.  How can I pull the data 
from the SQL database on the smaller development machine and then have it 
distribute to the Spark cluster for processing?  Can the driver pull data and 
then distribute execution?



Thanks,



Thomas Ginter

801-448-7676

thomas.gin...@utah.edu

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Pulling data from a secured SQL database

2015-10-30 Thread Thomas Ginter
I am working in an environment where data is stored in MS SQL Server.  It has 
been secured so that only a specific set of machines can access the database 
through an integrated security Microsoft JDBC connection.  We also have a 
couple of beefy linux machines we can use to host a Spark cluster but those 
machines do not have access to the databases directly.  How can I pull the data 
from the SQL database on the smaller development machine and then have it 
distribute to the Spark cluster for processing?  Can the driver pull data and 
then distribute execution?

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Save data to different S3

2015-10-30 Thread William Li
Thanks for your response. My secret has a slash (/) so it didn't work...

From: "Zhang, Jingyu" 
>
Date: Thursday, October 29, 2015 at 5:16 PM
To: William Li >
Cc: user >
Subject: Re: Save data to different S3

Try s3://aws_key:aws_secret@bucketName/folderName with your access key and 
secret to save the data.

On 30 October 2015 at 10:55, William Li 
> wrote:
Hi - I have a simple app running fine with Spark, it reads data from S3 and 
performs calculation.

When reading data from S3, I use hadoopConfiguration.set for both 
fs.s3n.awsAccessKeyId, and the fs.s3n.awsSecretAccessKey to it has permissions 
to load the data from customer sources.

However, after I complete the analysis, how do I save the results (it's a 
org.apache.spark.rdd.RDD[String]) into my own s3 bucket which requires 
different access key and secret? It seems one option is that I could save the 
results as local file to the spark cluster, then create a new SQLContext with 
the different access, then load the data from the local file.

Is there any other options without requiring save and re-load files?


Thanks,

William.




Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Not a bad idea I suspect but doesn't help me. I dumbed down the repro to
ask for help. In reality one of my dataframes is a cassandra DF.
So cassDF.registerTempTable("df1") registers the temp table in a different
SQL Context (new CassandraSQLContext(sc)).


scala> sql("select customer_id, uri, browser, epoch from df union all
select customer_id, uri, browser, epoch from df1").show()
org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)


On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu  wrote:

> How about the following ?
>
> scala> df.registerTempTable("df")
> scala> df1.registerTempTable("df1")
> scala> sql("select customer_id, uri, browser, epoch from df union select
> customer_id, uri, browser, epoch from df1").show()
> +---+-+---+-+
> |customer_id|  uri|browser|epoch|
> +---+-+---+-+
> |999|http://foobar|firefox| 1234|
> |888|http://foobar| ie|12343|
> +---+-+---+-+
>
> Cheers
>
> On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska 
> wrote:
>
>> Hi folks,
>>
>> I have a need to "append" two dataframes -- I was hoping to use UnionAll
>> but it seems that this operation treats the underlying dataframes as
>> sequence of columns, rather than a map.
>>
>> In particular, my problem is that the columns in the two DFs are not in
>> the same order --notice that my customer_id somehow comes out a string:
>>
>> This is Spark 1.4.1
>>
>> case class Test(epoch: Long,browser:String,customer_id:Int,uri:String)
>> val test = Test(1234l,"firefox",999,"http://foobar;)
>>
>> case class Test1( customer_id :Int,uri:String,browser:String,   
>> epoch :Long)
>> val test1 = Test1(888,"http://foobar","ie",12343)
>> val df=sc.parallelize(Seq(test)).toDF
>> val df1=sc.parallelize(Seq(test1)).toDF
>> df.unionAll(df1)
>>
>> //res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string, 
>> customer_id: string, uri: string]
>>
>> ​
>>
>> Is unionAll the wrong operation? Any special incantations? Or advice on
>> how to otherwise get this to succeed?
>>
>
>
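
For the column-ordering problem in the quoted example, a minimal sketch on Spark
1.4.x (same df/df1 names as above): unionAll matches columns by position rather
than by name, so selecting the columns in the same order on both sides before
the union keeps customer_id an integer.

// Align the column order explicitly, then union by position
val cols = Seq("customer_id", "uri", "browser", "epoch")
val merged = df.select(cols.head, cols.tail: _*)
  .unionAll(df1.select(cols.head, cols.tail: _*))
merged.printSchema()   // customer_id now stays an int on both sides
merged.show()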


Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
I see - you were trying to union a non-Cassandra DF with a Cassandra DF :-(

On Fri, Oct 30, 2015 at 12:57 PM, Yana Kadiyska wrote:

> Not a bad idea I suspect but doesn't help me. I dumbed down the repro to
> ask for help. In reality one of my dataframes is a cassandra DF.
> So cassDF.registerTempTable("df1") registers the temp table in a different
> SQL Context (new CassandraSQLContext(sc)).
>
>
> scala> sql("select customer_id, uri, browser, epoch from df union all
> select customer_id, uri, browser, epoch from df1").show()
> org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
> at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
>
>
> On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu  wrote:
>
>> How about the following ?
>>
>> scala> df.registerTempTable("df")
>> scala> df1.registerTempTable("df1")
>> scala> sql("select customer_id, uri, browser, epoch from df union select
>> customer_id, uri, browser, epoch from df1").show()
>> +-----------+-------------+-------+-----+
>> |customer_id|          uri|browser|epoch|
>> +-----------+-------------+-------+-----+
>> |        999|http://foobar|firefox| 1234|
>> |        888|http://foobar|     ie|12343|
>> +-----------+-------------+-------+-----+
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska wrote:
>>
>>> Hi folks,
>>>
>>> I have a need to "append" two dataframes -- I was hoping to use unionAll,
>>> but it seems that this operation treats the underlying dataframes as a
>>> sequence of columns rather than as a map.
>>>
>>> In particular, my problem is that the columns in the two DFs are not in
>>> the same order -- notice that my customer_id somehow comes out a string:
>>>
>>> This is Spark 1.4.1
>>>
>>> case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
>>> val test = Test(1234L, "firefox", 999, "http://foobar")
>>>
>>> case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
>>> val test1 = Test1(888, "http://foobar", "ie", 12343)
>>> val df = sc.parallelize(Seq(test)).toDF
>>> val df1 = sc.parallelize(Seq(test1)).toDF
>>> df.unionAll(df1)
>>>
>>> // res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string,
>>> // customer_id: string, uri: string]
>>>
>>> Is unionAll the wrong operation? Any special incantations? Or advice on
>>> how to otherwise get this to succeed?
>>>
>>
>>
>


Re: how to merge two dataframes

2015-10-30 Thread Silvio Fiorito
Are you able to upgrade to Spark 1.5.1 and the Cassandra connector to the latest
version? It no longer requires a separate CassandraSQLContext.
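
A minimal sketch of what that looks like, assuming Spark 1.5.x with
spark-cassandra-connector 1.5.x, where the Cassandra table is read through the
standard DataFrame reader and both frames share one SQLContext (the keyspace and
table names below are placeholders):

val cassDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// Both DataFrames now live in the same SQLContext, so the earlier union works
df.registerTempTable("df")
cassDF.registerTempTable("df1")
sqlContext.sql(
  "select customer_id, uri, browser, epoch from df union all " +
  "select customer_id, uri, browser, epoch from df1").show()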

From: Yana Kadiyska
Reply-To: "yana.kadiy...@gmail.com"
Date: Friday, October 30, 2015 at 3:57 PM
To: Ted Yu
Cc: "user@spark.apache.org"
Subject: Re: how to merge two dataframes

Not a bad idea I suspect but doesn't help me. I dumbed down the repro to ask 
for help. In reality one of my dataframes is a cassandra DF. So 
cassDF.registerTempTable("df1") registers the temp table in a different SQL 
Context (new CassandraSQLContext(sc)).


scala> sql("select customer_id, uri, browser, epoch from df union all select 
customer_id, uri, browser, epoch from df1").show()
org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)


On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu wrote:
How about the following ?

scala> df.registerTempTable("df")
scala> df1.registerTempTable("df1")
scala> sql("select customer_id, uri, browser, epoch from df union select 
customer_id, uri, browser, epoch from df1").show()
+-----------+-------------+-------+-----+
|customer_id|          uri|browser|epoch|
+-----------+-------------+-------+-----+
|        999|http://foobar|firefox| 1234|
|        888|http://foobar|     ie|12343|
+-----------+-------------+-------+-----+

Cheers

On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska wrote:
Hi folks,

I have a need to "append" two dataframes -- I was hoping to use unionAll, but it
seems that this operation treats the underlying dataframes as a sequence of
columns rather than as a map.

In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my customer_id somehow comes out a string:

This is Spark 1.4.1

case class Test(epoch: Long, browser: String, customer_id: Int, uri: String)
val test = Test(1234L, "firefox", 999, "http://foobar")

case class Test1(customer_id: Int, uri: String, browser: String, epoch: Long)
val test1 = Test1(888, "http://foobar", "ie", 12343)
val df = sc.parallelize(Seq(test)).toDF
val df1 = sc.parallelize(Seq(test1)).toDF
df.unionAll(df1)

// res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string,
// customer_id: string, uri: string]

Is unionAll the wrong operation? Any special incantations? Or advice on how to
otherwise get this to succeed?




Re: Spark -- Writing to Partitioned Persistent Table

2015-10-30 Thread Bryan Jeffrey
Deenar,

This worked perfectly - I moved to SQL Server and things are working well.

Regards,

Bryan Jeffrey
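
One other possible approach to the HiveContext-sharing question quoted below,
sketched on the assumption of Spark 1.4+ with the spark-hive-thriftserver module
on the classpath (stream and the "results" table are placeholders): start the
JDBC/ODBC thrift server inside the streaming application so that the streaming
job and JDBC clients share a single HiveContext.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
HiveThriftServer2.startWithContext(hiveContext)

// The streaming side can then publish results through the same context, e.g.
// by appending each micro-batch to a table that JDBC clients query:
// stream.foreachRDD { rdd =>
//   hiveContext.createDataFrame(rdd).write.mode("append").saveAsTable("results")
// }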

On Thu, Oct 29, 2015 at 8:14 AM, Deenar Toraskar wrote:

> Hi Bryan
>
> For your use case you don't need to have multiple metastores. The default
> metastore uses embedded Derby, which cannot be shared amongst multiple
> processes. Just switch to a metastore that supports multiple connections,
> e.g. networked Derby or MySQL; see
> https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
>
> Deenar
>
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
> On 29 October 2015 at 00:56, Bryan  wrote:
>
>> Yana,
>>
>> My basic use-case is that I want to process streaming data, and publish
>> it to a persistent spark table. After that I want to make the published
>> data (results) available via JDBC and spark SQL to drive a web API. That
>> would seem to require two drivers starting separate HiveContexts (one for
>> sparksql/jdbc, one for streaming)
>>
>> Is there a way to share a HiveContext between the driver for the Spark SQL
>> thrift instance and the streaming driver? Or is there a better method to do
>> this?
>>
>> An alternate option might be to create the table in two separate
>> metastores and simply use the same storage location for the data. That
>> seems very hacky though, and likely to result in maintenance issues.
>>
>> Regards,
>>
>> Bryan Jeffrey
>> --
>> From: Yana Kadiyska 
>> Sent: ‎10/‎28/‎2015 8:32 PM
>> To: Bryan Jeffrey 
>> Cc: Susan Zhang ; user 
>> Subject: Re: Spark -- Writing to Partitioned Persistent Table
>>
>> For this issue in particular (ERROR XSDB6: Another instance of Derby
>> may have already booted the database /spark/spark-1.4.1/metastore_db) --
>> I think it depends on where you start your application and the HiveThriftServer
>> from. I've run into a similar issue running a driver app first, which would
>> create a directory called metastore_db. If I then try to start SparkShell
>> from the same directory, I will see this exception. So it is like
>> SPARK-9776. It's not so much that the two are in the same process (as the
>> bug resolution states); rather, I think you can't run two drivers that start a
>> HiveContext from the same directory.
>>
>>
>> On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey wrote:
>>
>>> All,
>>>
>>> One issue I'm seeing is that I start the thrift server (for jdbc access)
>>> via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
>>> spark://master:7077 --hiveconf "spark.cores.max=2"
>>>
>>> After about 40 seconds the Thrift server is started and available on the
>>> default port (10000).
>>>
>>> I then submit my application - and the application throws the following
>>> error:
>>>
>>> Caused by: java.sql.SQLException: Failed to start database
>>> 'metastore_db' with class loader
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
>>> see the next exception for details.
>>> at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
>>> Source)
>>> at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>>> Source)
>>> ... 86 more
>>> Caused by: java.sql.SQLException: Another instance of Derby may have
>>> already booted the database /spark/spark-1.4.1/metastore_db.
>>> at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
>>> Source)
>>> at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>>> Source)
>>> at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown
>>> Source)
>>> at
>>> org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>>> ... 83 more
>>> Caused by: ERROR XSDB6: Another instance of Derby may have already
>>> booted the database /spark/spark-1.4.1/metastore_db.
>>>
>>> This also happens if I do the opposite (submit the application first,
>>> and then start the thrift server).
>>>
>>> It looks similar to the following issue -- but not quite the same:
>>> https://issues.apache.org/jira/browse/SPARK-9776
>>>
>>> It seems like this set of steps works fine if the metadata database is
>>> not yet created - but once it's created this happens every time.  Is this a
>>> known issue? Is there a workaround?
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey wrote:
>>>
 Susan,

 I did give that a shot -- I'm seeing a number of oddities:

 (1) 'Partition By' appears to only accept alphanumeric lowercase
 fields. It will