key class requirement for PairedRDD ?

2014-10-16 Thread Jaonary Rabarisoa
Dear all,

Is it possible to use any kind of object as a key in a PairedRDD? When I use
a case class as the key, the groupByKey operation doesn't behave as I expect. I
want to use a case class instead of a large tuple because it is easier to
manipulate.
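
A minimal sketch of the kind of usage in question (the UserKey class and the
sample data are made up for illustration); case classes get compiler-generated
equals/hashCode, so in principle they should work as keys:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations in Spark 1.x

// Hypothetical key type standing in for a large tuple.
case class UserKey(site: String, userId: Long)

object CaseClassKeyDemo extends App {
  val sc = new SparkContext("local[*]", "case-class-key")
  val pairs = sc.parallelize(Seq(
    (UserKey("a", 1L), 10),
    (UserKey("a", 1L), 20),
    (UserKey("b", 2L), 30)))

  // The two UserKey("a", 1L) records should be grouped together.
  pairs.groupByKey().collect().foreach(println)
  sc.stop()
}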


Cheers,

Jaonary


how to submit multiple jar files when using spark-submit script in shell?

2014-10-16 Thread eric wong
Hi,

I am using the comma-separated style to submit multiple jar files with the
following shell command, but it does not work:

bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans \
  --master yarn-cluster --executor-memory 2g \
  --jars lib/spark-examples-1.0.2-hadoop2.2.0.jar,lib/spark-mllib_2.10-1.0.0.jar \
  hdfs://master:8000/srcdata/kmeans 8 4


Thanks!


-- 
WangHaihua


Re: rdd caching and use thereof

2014-10-16 Thread Nathan Kronenfeld
Oh, I forgot - I've set the following parameters at the moment (besides the
standard location, memory, and core setup):

spark.logConf  true
spark.shuffle.consolidateFiles true
spark.ui.port  4042
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb   500
spark.speculation  true
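
For anyone reproducing this setup, a sketch of the same properties applied
programmatically (the master and app name are placeholders; spark-defaults.conf
or --conf flags on spark-submit work equally well):

import org.apache.spark.{SparkConf, SparkContext}

object ConfSketch extends App {
  // Same properties as listed above, set in code.
  val conf = new SparkConf()
    .setMaster("local[*]") // placeholder; the real job ran on a cluster
    .setAppName("conf-sketch")
    .set("spark.logConf", "true")
    .set("spark.shuffle.consolidateFiles", "true")
    .set("spark.ui.port", "4042")
    .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
    .set("spark.shuffle.file.buffer.kb", "500")
    .set("spark.speculation", "true")

  val sc = new SparkContext(conf)
  sc.stop()
}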



On Fri, Oct 17, 2014 at 2:46 AM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> I'm trying to understand two things about how spark is working.
>
> (1) When I try to cache an rdd that fits well within memory (about 60g
> with about 600g of memory), I get seemingly random levels of caching, from
> around 60% to 100%, given the same tuning parameters.  What governs how
> much of an RDD gets cached when there is enough memory?
>
> (2) Even when cached, when I run some tasks over the data, I get various
> locality states.  Sometimes it works perfectly, with everything
> PROCESS_LOCAL, and sometimes I get 10-20% of the data on locality ANY (and
> the task takes minutes instead of seconds); often this will vary if I run
> the task twice in a row in the same shell.  Is there anything I can do to
> affect this?  I tried caching with replication, but that caused everything
> to run out of memory nearly instantly (with the same 60g data set in 4-600g
> of memory)
>
> Thanks for the help,
>
> -Nathan
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com
>



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


rdd caching and use thereof

2014-10-16 Thread Nathan Kronenfeld
I'm trying to understand two things about how spark is working.

(1) When I try to cache an rdd that fits well within memory (about 60g with
about 600g of memory), I get seemingly random levels of caching, from
around 60% to 100%, given the same tuning parameters.  What governs how
much of an RDD gets cached when there is enough memory?

(2) Even when cached, when I run some tasks over the data, I get various
locality states.  Sometimes it works perfectly, with everything
PROCESS_LOCAL, and sometimes I get 10-20% of the data on locality ANY (and
the task takes minutes instead of seconds); often this will vary if I run
the task twice in a row in the same shell.  Is there anything I can do to
affect this?  I tried caching with replication, but that caused everything
to run out of memory nearly instantly (with the same 60g data set in 4-600g
of memory).
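
For readers following along, a small self-contained sketch of the two storage
levels being compared here (the data is a stand-in for the real 60g set);
MEMORY_ONLY_2 keeps two copies of every partition, which roughly doubles the
memory footprint and is likely why the replicated run blew past the heap:

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CachingSketch extends App {
  val sc = new SparkContext("local[*]", "caching-sketch")

  // Plain in-memory caching: one copy of each partition.
  val once = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)

  // Replicated caching: each partition stored on two nodes.
  val twice = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_2)

  println(once.count() + twice.count())
  sc.stop()
}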

Thanks for the help,

-Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: Can't create Kafka stream in spark shell

2014-10-16 Thread Akhil Das
This is how you deal with deduplicate errors:

libraryDependencies ++= Seq(
  ("org.apache.spark" % "spark-streaming_2.10" % "1.1.0" % "provided").
    exclude("org.eclipse.jetty.orbit", "javax.transaction").
    exclude("org.eclipse.jetty.orbit", "javax.mail").
    exclude("org.eclipse.jetty.orbit", "javax.activation").
    exclude("com.esotericsoftware.minlog", "minlog").
    exclude("commons-beanutils", "commons-beanutils-core").
    exclude("commons-logging", "commons-logging").
    exclude("commons-collections", "commons-collections").
    exclude("org.eclipse.jetty.orbit", "javax.servlet")
)
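
Another route people take for these META-INF deduplicate failures, as an
alternative to the excludes above, is an sbt-assembly merge strategy; a rough
sketch for build.sbt, assuming a recent sbt-assembly (key names differ across
plugin versions):

assemblyMergeStrategy in assembly := {
  // Blunt but common: drop everything under META-INF (the signature files such
  // as ECLIPSEF.RSA are what usually collide) and fall back to the default
  // strategy for every other path.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}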


Thanks
Best Regards

On Fri, Oct 17, 2014 at 2:53 AM, Gary Zhao  wrote:

> Same error. I saw someone reported the same issue, e.g.
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-kafka-error-td9106.html
>
> Should I use "sbt assembly"? It failed with deduplicate errors though.
>
> [error] (*:assembly) deduplicate: different file contents found in the
> following:
> [error]
> /Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
> [error]
> /Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
> [error]
> /Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
> [error]
> /Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
> [error] Total time: 4 s, completed Oct 16, 2014 1:58:41 PM
>
>
> On Thu, Oct 16, 2014 at 12:11 PM, Akhil Das 
> wrote:
>
>> Can you try:
>>
>> sbt:
>>
>> name := "Simple Project"
>>
>>
>> version := "1.1"
>>
>>
>> scalaVersion := "2.10.4"
>>
>>
>> libraryDependencies ++= Seq(
>>
>> "org.apache.spark" %% "spark-core" % "1.1.0",
>>
>> "org.apache.spark" %% "spark-streaming" % "1.1.0",
>>
>> "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
>>
>>   )
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Oct 17, 2014 at 12:36 AM, Gary Zhao  wrote:
>>
>>> Thanks Akhil. I tried spark-submit and saw the same issue. I double
>>> checked the versions and they look ok. Are you seeing any obvious issues?
>>>
>>> sbt:
>>>
>>> name := "Simple Project"
>>>
>>>
>>> version := "1.1"
>>>
>>>
>>> scalaVersion := "2.10.4"
>>>
>>>
>>> libraryDependencies ++= Seq(
>>>
>>> "org.apache.spark" %% "spark-core" % "1.1.0",
>>>
>>> "org.apache.spark" %% "spark-streaming" % "1.1.0",
>>>
>>> "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0",
>>>
>>> "org.apache.kafka" %% "kafka" % "0.8.0"
>>>
>>>   )
>>>
>>>
>>> spark-1.1.0-bin-hadoop1/bin/spark-submit --class "main.scala.SimpleApp"
>>> --master "local[2]" simple-project_2.10-1.1.jar --jars
>>> spark-streaming-kafka_2.10-1.1.0.jar,kafka_2.10-0.8.0.jar
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/streaming/kafka/KafkaUtils$
>>> at main.scala.SimpleApp$delayedInit$body.apply(SimpleApp.scala:15)
>>> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>>> at
>>> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>>> at scala.App$$anonfun$main$1.apply(App.scala:71)
>>> at scala.App$$anonfun$main$1.apply(App.scala:71)
>>> at scala.collection.immutable.List.foreach(List.scala:318)
>>> at
>>> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>>> at scala.App$class.main(App.scala:71)
>>> at main.scala.SimpleApp$.main(SimpleApp.scala:11)
>>> at main.scala.SimpleApp.main(SimpleApp.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.streaming.kafka.KafkaUtils$
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 17 more
>>>
>>>
>>> On Tue, Oct 14, 2014 at 12:05 AM, Akhil Das 
>>> wrote:
>>>
 Just make sure you have the same version of spark-streaming-kafka
 
 jar and spark in your classpath.

 Thanks
 Best Regards

 On 

RE: error when maven build spark 1.1.0 with message "You have 1 Scalastyle violation"

2014-10-16 Thread Henry Hung
Found the root cause:

[SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to
multi-byte character contained in Gradient.scala

Author: Kousuke Saruta

Closes #2248 from sarutak/SPARK-3372 and squashes the following commits:

73a28b8 [Kousuke Saruta] Replaced UTF-8 hyphen with ascii hyphen

(cherry picked from commit 1bed0a3)

Signed-off-by: Xiangrui Meng


From: MA33 YTHung1
Sent: Friday, October 17, 2014 1:10 PM
To: user@spark.apache.org
Subject: RE: error when maven build spark 1.1.0 with message "You have 1 
Scalastyle violation"

Hi All,

Another piece of information, somehow the Gradient.scala throws an exception 
"input length = 2":

error 
file=D:\tools\spark-1.1.0\mllib\src\main\scala\org\apache\spark\mllib\optimization\Gradient.scala
 message=Input length = 2

Best regards,
Henry Hung

From: MA33 YTHung1
Sent: Friday, October 17, 2014 1:05 PM
To: user@spark.apache.org
Subject: error when maven build spark 1.1.0 with message "You have 1 Scalastyle 
violation"

Hi All,

I'm using windows 8.1 to build spark 1.1.0 using this command:

C:\apache-maven-3.0.5\bin\mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 
-DskipTests clean package -e

Below is the error message:

[ERROR] Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project 
spark-mllib_2.10: Failed during scalastyle execution: You have 1 Scalastyle 
violation(s). -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project 
spark-mllib_2.10: Failed during scalastyle execution
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoExecutionException: Failed during 
scalastyle execution
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.performCheck(ScalastyleViolationCheckMojo.java:238)
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.execute(ScalastyleViolationCheckMojo.java:199)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: org.apache.maven.plugin.MojoFailureException: You have 1 Scalastyle 
violation(s).
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.performCheck(ScalastyleViolationCheckMojo.java:230)
... 22 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-mllib_2.10

Best regards,
Henry



Re: Join with large data set

2014-10-16 Thread Sonal Goyal
Hi Ankur,

If your rdds have common keys, you can look at partitioning both your
datasets using a custom partitioner based on keys so that you can avoid
shuffling and optimize join performance.
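
A minimal sketch of what that co-partitioning looks like (the data sets and the
partition count are placeholders):

import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations in Spark 1.x

object CoPartitionedJoin extends App {
  val sc = new SparkContext("local[*]", "co-partitioned-join")

  // Stand-ins for the real application and reference data sets.
  val appData = sc.parallelize(Seq(("u1", "event-a"), ("u2", "event-b")))
  val referenceData = sc.parallelize(Seq(("u1", "profile-1"), ("u2", "profile-2")))

  // Use the same partitioner for both sides so the join sees co-located
  // partitions and does not need to re-shuffle either RDD.
  val partitioner = new HashPartitioner(8) // partition count is a placeholder

  val appByKey = appData.partitionBy(partitioner).cache()
  val refByKey = referenceData.partitionBy(partitioner).cache()

  appByKey.join(refByKey).collect().foreach(println)
  sc.stop()
}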

HTH

Best Regards,
Sonal
Nube Technologies 





On Fri, Oct 17, 2014 at 4:27 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi,
>
> I have an RDD which holds my application data and is huge. I want to join it
> with reference data which is also too huge to fit in memory, and thus I do not
> want to use a broadcast variable.
>
> What other options do I have to perform such joins?
>
> I am using Cassandra as my data store, so should I just query cassandra to
> get the reference data needed?
>
> Also when I join two rdds, will it result in rdd scan or would spark do a
> hash partition on the two rdds to get the data with same keys on same node?
>
> Thanks
> Ankur
>


Re: How to close resources shared in executor?

2014-10-16 Thread Fengyun RAO
Of course, I could create a connection in

val result = rdd.map(line => {
  val conf = HBaseConfiguration.create
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")
  ...
  table.close()
  connection.close()
})

but that would be too slow, which is also the reason I share the conf and
connection in the Util object.

Maybe I did need a shutdown hook as the Javadoc says.
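
For what it's worth, a pattern that often comes up for this situation (not
something suggested in this thread, just a common alternative) is to open and
close the connection once per partition with mapPartitions; a sketch reusing
the same HBase calls quoted above, where rdd is the RDD from those snippets:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HConnectionManager

val result = rdd.mapPartitions { lines =>
  // One connection per partition instead of per record or per executor.
  val conf = HBaseConfiguration.create
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")

  // Materialize the partition's results before closing the resources,
  // because the iterator would otherwise be consumed lazily after this
  // function returns.
  val out = lines.map { line =>
    // ... look up / enrich `line` using `table` ...
    line
  }.toVector

  table.close()
  connection.close()
  out.iterator
}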

Thank you!

2014-10-17 12:18 GMT+08:00 Ted Yu :

> Looking at Apache 0.98 code, you can follow the example in the class
> javadoc (line 144 of HConnectionManager.java):
>
>  * HTableInterface table = connection.getTable("table1");
>  * try {
>  *   // Use the table as needed, for a single operation and a single thread
>  * } finally {
>  *   table.close();
>  *   connection.close();
>  * }
>
> Cheers
>
> On Thu, Oct 16, 2014 at 9:03 PM, Fengyun RAO  wrote:
>
>> Thanks, Ted,
>>
>> We use CDH 5.1 and the HBase version is 0.98.1-cdh5.1.0, in which the
>> javadoc of HConnectionManager.java still recommends shutdown hook.
>>
>> I look into val table = Util.Connection.getTable("user"), and find it
>> didn't invoke
>>
>> public HTable(Configuration conf, final byte[] tableName, final 
>> ExecutorService pool)
>>
>> but
>>
>> public HTable(TableName tableName, final HConnection connection,
>>   final ExecutorService pool) throws IOException {
>> if (connection == null || connection.isClosed()) {
>>   throw new IllegalArgumentException("Connection is null or closed.");
>> }
>> this.tableName = tableName;
>> this.cleanupPoolOnClose = this.cleanupConnectionOnClose = false;
>> this.connection = connection;
>> this.configuration = connection.getConfiguration();
>> this.pool = pool;
>>
>> this.finishSetup();
>>   }
>>
>> in which cleanupConnectionOnClose is false
>>
>> 2014-10-16 22:51 GMT+08:00 Ted Yu :
>>
>>> Which hbase release are you using ?
>>>
>>> Let me refer to 0.94 code hbase.
>>>
>>> Take a look at the following method
>>> in src/main/java/org/apache/hadoop/hbase/client/HTable.java :
>>>
>>>   public void close() throws IOException {
>>> ...
>>> if (cleanupConnectionOnClose) {
>>>   if (this.connection != null) {
>>> this.connection.close();
>>>
>>> When Connection.getTable() is called, the following is invoked:
>>>   public HTable(Configuration conf, final byte[] tableName, final
>>> ExecutorService pool)
>>> which sets cleanupConnectionOnClose to true.
>>>
>>> w.r.t. javadoc, the paragraph on shutdown hook is
>>> in HConnectionManager.java of 0.94
>>> You don't need to use shutdown hook for 0.94+
>>>
>>> Cheers
>>>
>>> On Wed, Oct 15, 2014 at 11:41 PM, Fengyun RAO 
>>> wrote:
>>>
 I may have misunderstood your point.

 val result = rdd.map(line => {
   val table = Util.Connection.getTable("user")
   ...
   table.close()
 }

 Did you mean this is enough, and there’s no need to call
 Util.Connection.close(),
 or HConnectionManager.deleteAllConnections()?

 Where is the documentation that states HConnectionManager would release the
 underlying connection automatically?
 If that’s true, maybe the Javadoc which recommends a shutdown hook
 needs update
 ​

 2014-10-16 14:20 GMT+08:00 Fengyun RAO :

> Thanks, Ted.
> Util.Connection.close() should be called only once, so it can NOT be
> in a map function
>
> val result = rdd.map(line => {
>   val table = Util.Connection.getTable("user")
>   ...
>   Util.Connection.close()
> }
>
> As you mentioned:
>
> Calling table.close() is the recommended approach.
> HConnectionManager does reference counting. When all references to the
> underlying connection are gone, connection would be released.
>
> Yes, we should call table.close(), but it won’t remove HConnection in
> HConnectionManager which is a HConnection pool.
> As I look into the HconnectionManager Javadoc, it seems I have to
> implement a shutdown hook
>
>  * Cleanup used to be done inside in a shutdown hook.  On startup we'd
>  * register a shutdown hook that called {@link #deleteAllConnections()}
>  * on its way out but the order in which shutdown hooks run is not 
> defined so
>  * were problematic for clients of HConnection that wanted to register 
> their
>  * own shutdown hooks so we removed ours though this shifts the onus for
>  * cleanup to the client.
>
> ​
>
> 2014-10-15 22:31 GMT+08:00 Ted Yu :
>
>> Pardon me - there was typo in previous email.
>>
>> Calling table.close() is the recommended approach.
>> HConnectionManager does reference counting. When all references to
>> the underlying connection are gone, connection would be released.
>>
>> Cheers
>>
>> On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu  wrote:
>>
>>> Have you tried the following ?
>>>
>>> val result = rdd.map(line => { val tab

RE: Play framework

2014-10-16 Thread Mohammed Guller
That makes sense. Thanks, Raju.

Just as an FYI for other Spark users trying to integrate Play with Spark: Play 
2.3.x is not compatible with Spark 1.1.0, but Play 2.2.x works with Spark 1.1.0.

Mohammed

From: Ramaraju Indukuri [mailto:iramar...@gmail.com]
Sent: Thursday, October 16, 2014 8:06 PM
To: Mohammed Guller
Cc: US Office Admin; Surendranauth Hiraman; Daniel Siegmann; 
user@spark.apache.org
Subject: Re: Play framework

In our case, Play libraries are not required to run spark jobs. Hence they are 
available only on master and play runs as a regular scala application. I can't 
think of a case where you need play to run on slaves.

Raju


On Thu, Oct 16, 2014 at 10:21 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
What about all the play dependencies since the jar created by the ‘Play 
package’ won’t include the play jar or any of the 100+ jars on which play 
itself depends?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 7:05 PM
To: Mohammed Guller; Surendranauth Hiraman
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


The remaining dependencies (Spark libraries) are available for the context from
the sparkhome. I have installed spark such that all the slaves have the same
sparkhome. The code looks like this.


val conf = new SparkConf()
  .setSparkHome("/home/dev/spark")
  .setMaster("spark://99.99.99.999:7077")
  .setAppName("xxx")
  .setJars(Seq("/home/dev/play/target/scala-2.10/xxx_2.10-1.0.jar"))

val sc = new SparkContext(conf)

If you have more dependencies, you can keep adding them to the setJars.



Raju



From: Mohammed Guller mailto:moham...@glassbeam.com>>
Sent: Thursday, October 16, 2014 4:00 PM
To: US Office Admin; Surendranauth Hiraman
Cc: Daniel Siegmann; user@spark.apache.org
Subject: RE: Play framework

Thanks, Suren and Raju.

Raju – if I remember correctly, Play package command just creates a jar for 
your app. That jar file will not include other dependencies. So it is not 
really a full jar as you mentioned below. So how you are passing all the other 
dependency jars to spark? Can you share that piece of code? Also is there any 
specific reason why you are not using play dist instead?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 11:41 AM
To: Surendranauth Hiraman; Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


We integrated Spark into Play and use SparkSQL extensively on an ec2 spark 
cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.



Step 1: Create a play scala application as usual

Step 2. In Build.sbt put all your spark dependencies. What works for us is Play 
2.2.3, Scala 2.10.4 and Spark 1.1. We have Akka 2.2.3. This is straightforward.

step3: As Daniel mentioned, create spark context within Play. And rest of the 
application is as usual.

Step4: Create a full jar using Play Package and use that package to be included 
in library of jars passed to spark context.

Step 5: Play run as usual.
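
As a rough illustration of what the Build.sbt in Step 2 can look like (versions
taken from the line above; the module names are the standard Spark 1.1.0
artifacts, and everything else is a placeholder):

// build.sbt sketch for a Play 2.2.x / Scala 2.10 application using Spark 1.1.0.
name := "my-play-spark-app" // placeholder

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-sql"  % "1.1.0" // for the SparkSQL usage described above
)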



It works very well, and the convenience is, we have all scala application 
throughout.



Regards

Raju




From: Surendranauth Hiraman 
mailto:suren.hira...@velos.io>>
Sent: Thursday, October 16, 2014 12:42 PM
To: Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework

Mohammed,

Jumping in for Daniel, we actually address the configuration issue by pulling 
values from environment variables or command line options. Maybe that may 
handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt: 
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exception. I have seen some blog posts that Spark and Play use 
different version of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka related exceptions. I suspect that the 
settings in the build.scala file for my play project is incorrect.

Mohammed

From: Daniel Siegmann 
[mailto:daniel.siegm...@velos.io

RE: error when maven build spark 1.1.0 with message "You have 1 Scalastyle violation"

2014-10-16 Thread Henry Hung
Hi All,

Another piece of information, somehow the Gradient.scala throws an exception 
"input length = 2":

error 
file=D:\tools\spark-1.1.0\mllib\src\main\scala\org\apache\spark\mllib\optimization\Gradient.scala
 message=Input length = 2

Best regards,
Henry Hung

From: MA33 YTHung1
Sent: Friday, October 17, 2014 1:05 PM
To: user@spark.apache.org
Subject: error when maven build spark 1.1.0 with message "You have 1 Scalastyle 
violation"

Hi All,

I'm using windows 8.1 to build spark 1.1.0 using this command:

C:\apache-maven-3.0.5\bin\mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 
-DskipTests clean package -e

Below is the error message:

[ERROR] Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project 
spark-mllib_2.10: Failed during scalastyle execution: You have 1 Scalastyle 
violation(s). -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project 
spark-mllib_2.10: Failed during scalastyle execution
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoExecutionException: Failed during 
scalastyle execution
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.performCheck(ScalastyleViolationCheckMojo.java:238)
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.execute(ScalastyleViolationCheckMojo.java:199)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: org.apache.maven.plugin.MojoFailureException: You have 1 Scalastyle 
violation(s).
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.performCheck(ScalastyleViolationCheckMojo.java:230)
... 22 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-mllib_2.10

Best regards,
Henry



error when maven build spark 1.1.0 with message "You have 1 Scalastyle violation"

2014-10-16 Thread Henry Hung
Hi All,

I'm using windows 8.1 to build spark 1.1.0 using this command:

C:\apache-maven-3.0.5\bin\mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 
-DskipTests clean package -e

Below is the error message:

[ERROR] Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project 
spark-mllib_2.10: Failed during scalastyle execution: You have 1 Scalastyle 
violation(s). -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project 
spark-mllib_2.10: Failed during scalastyle execution
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoExecutionException: Failed during 
scalastyle execution
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.performCheck(ScalastyleViolationCheckMojo.java:238)
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.execute(ScalastyleViolationCheckMojo.java:199)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: org.apache.maven.plugin.MojoFailureException: You have 1 Scalastyle 
violation(s).
at 
org.scalastyle.maven.plugin.ScalastyleViolationCheckMojo.performCheck(ScalastyleViolationCheckMojo.java:230)
... 22 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-mllib_2.10

Best regards,
Henry




Re: How to close resources shared in executor?

2014-10-16 Thread Ted Yu
Looking at Apache 0.98 code, you can follow the example in the class
javadoc (line 144 of HConnectionManager.java):

 * HTableInterface table = connection.getTable("table1");
 * try {
 *   // Use the table as needed, for a single operation and a single thread
 * } finally {
 *   table.close();
 *   connection.close();
 * }

Cheers

On Thu, Oct 16, 2014 at 9:03 PM, Fengyun RAO  wrote:

> Thanks, Ted,
>
> We use CDH 5.1 and the HBase version is 0.98.1-cdh5.1.0, in which the
> javadoc of HConnectionManager.java still recommends shutdown hook.
>
> I look into val table = Util.Connection.getTable("user"), and find it
> didn't invoke
>
> public HTable(Configuration conf, final byte[] tableName, final 
> ExecutorService pool)
>
> but
>
> public HTable(TableName tableName, final HConnection connection,
>   final ExecutorService pool) throws IOException {
> if (connection == null || connection.isClosed()) {
>   throw new IllegalArgumentException("Connection is null or closed.");
> }
> this.tableName = tableName;
> this.cleanupPoolOnClose = this.cleanupConnectionOnClose = false;
> this.connection = connection;
> this.configuration = connection.getConfiguration();
> this.pool = pool;
>
> this.finishSetup();
>   }
>
> in which cleanupConnectionOnClose is false
>
> 2014-10-16 22:51 GMT+08:00 Ted Yu :
>
>> Which hbase release are you using ?
>>
>> Let me refer to 0.94 code hbase.
>>
>> Take a look at the following method
>> in src/main/java/org/apache/hadoop/hbase/client/HTable.java :
>>
>>   public void close() throws IOException {
>> ...
>> if (cleanupConnectionOnClose) {
>>   if (this.connection != null) {
>> this.connection.close();
>>
>> When Connection.getTable() is called, the following is invoked:
>>   public HTable(Configuration conf, final byte[] tableName, final
>> ExecutorService pool)
>> which sets cleanupConnectionOnClose to true.
>>
>> w.r.t. javadoc, the paragraph on shutdown hook is
>> in HConnectionManager.java of 0.94
>> You don't need to use shutdown hook for 0.94+
>>
>> Cheers
>>
>> On Wed, Oct 15, 2014 at 11:41 PM, Fengyun RAO 
>> wrote:
>>
>>> I may have misunderstood your point.
>>>
>>> val result = rdd.map(line => {
>>>   val table = Util.Connection.getTable("user")
>>>   ...
>>>   table.close()
>>> }
>>>
>>> Did you mean this is enough, and there’s no need to call
>>> Util.Connection.close(),
>>> or HConnectionManager.deleteAllConnections()?
>>>
>>> Where is the documentation that states HConnectionManager would release the
>>> underlying connection automatically?
>>> If that’s true, maybe the Javadoc which recommends a shutdown hook needs
>>> update
>>> ​
>>>
>>> 2014-10-16 14:20 GMT+08:00 Fengyun RAO :
>>>
 Thanks, Ted.
 Util.Connection.close() should be called only once, so it can NOT be
 in a map function

 val result = rdd.map(line => {
   val table = Util.Connection.getTable("user")
   ...
   Util.Connection.close()
 }

 As you mentioned:

 Calling table.close() is the recommended approach.
 HConnectionManager does reference counting. When all references to the
 underlying connection are gone, connection would be released.

 Yes, we should call table.close(), but it won’t remove HConnection in
 HConnectionManager which is a HConnection pool.
 As I look into the HconnectionManager Javadoc, it seems I have to
 implement a shutdown hook

  * Cleanup used to be done inside in a shutdown hook.  On startup we'd
  * register a shutdown hook that called {@link #deleteAllConnections()}
  * on its way out but the order in which shutdown hooks run is not defined 
 so
  * were problematic for clients of HConnection that wanted to register 
 their
  * own shutdown hooks so we removed ours though this shifts the onus for
  * cleanup to the client.

 ​

 2014-10-15 22:31 GMT+08:00 Ted Yu :

> Pardon me - there was typo in previous email.
>
> Calling table.close() is the recommended approach.
> HConnectionManager does reference counting. When all references to the
> underlying connection are gone, connection would be released.
>
> Cheers
>
> On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu  wrote:
>
>> Have you tried the following ?
>>
>> val result = rdd.map(line => { val table = 
>> Util.Connection.getTable("user")
>> ...
>> Util.Connection.close() }
>>
>> On Wed, Oct 15, 2014 at 6:09 AM, Fengyun RAO 
>> wrote:
>>
>>> In order to share an HBase connection pool, we create an object
>>>
>>> Object Util {
>>> val HBaseConf = HBaseConfiguration.create
>>> val Connection= HConnectionManager.createConnection(HBaseConf)
>>> }
>>>
>>> which would be shared among tasks on the same executor. e.g.
>>>
>>> val result = rdd.map(line => {
>>>   val table = Util.Connection.getTable("user")
>>>   ...
>>> }

Re: How to close resources shared in executor?

2014-10-16 Thread Fengyun RAO
Thanks, Ted,

We use CDH 5.1 and the HBase version is 0.98.1-cdh5.1.0, in which the
javadoc of HConnectionManager.java still recommends shutdown hook.

I look into val table = Util.Connection.getTable("user"), and find it
didn't invoke

public HTable(Configuration conf, final byte[] tableName, final
ExecutorService pool)

but

public HTable(TableName tableName, final HConnection connection,
  final ExecutorService pool) throws IOException {
if (connection == null || connection.isClosed()) {
  throw new IllegalArgumentException("Connection is null or closed.");
}
this.tableName = tableName;
this.cleanupPoolOnClose = this.cleanupConnectionOnClose = false;
this.connection = connection;
this.configuration = connection.getConfiguration();
this.pool = pool;

this.finishSetup();
  }

in which cleanupConnectionOnClose is false

2014-10-16 22:51 GMT+08:00 Ted Yu :

> Which hbase release are you using ?
>
> Let me refer to 0.94 code hbase.
>
> Take a look at the following method
> in src/main/java/org/apache/hadoop/hbase/client/HTable.java :
>
>   public void close() throws IOException {
> ...
> if (cleanupConnectionOnClose) {
>   if (this.connection != null) {
> this.connection.close();
>
> When Connection.getTable() is called, the following is invoked:
>   public HTable(Configuration conf, final byte[] tableName, final
> ExecutorService pool)
> which sets cleanupConnectionOnClose to true.
>
> w.r.t. javadoc, the paragraph on shutdown hook is
> in HConnectionManager.java of 0.94
> You don't need to use shutdown hook for 0.94+
>
> Cheers
>
> On Wed, Oct 15, 2014 at 11:41 PM, Fengyun RAO 
> wrote:
>
>> I may have misunderstood your point.
>>
>> val result = rdd.map(line => {
>>   val table = Util.Connection.getTable("user")
>>   ...
>>   table.close()
>> }
>>
>> Did you mean this is enough, and there’s no need to call
>> Util.Connection.close(),
>> or HConnectionManager.deleteAllConnections()?
>>
>> Where is the documentation that states HConnectionManager would release the
>> underlying connection automatically?
>> If that’s true, maybe the Javadoc which recommends a shutdown hook needs
>> update
>> ​
>>
>> 2014-10-16 14:20 GMT+08:00 Fengyun RAO :
>>
>>> Thanks, Ted.
>>> Util.Connection.close() should be called only once, so it can NOT be in
>>> a map function
>>>
>>> val result = rdd.map(line => {
>>>   val table = Util.Connection.getTable("user")
>>>   ...
>>>   Util.Connection.close()
>>> }
>>>
>>> As you mentioned:
>>>
>>> Calling table.close() is the recommended approach.
>>> HConnectionManager does reference counting. When all references to the
>>> underlying connection are gone, connection would be released.
>>>
>>> Yes, we should call table.close(), but it won’t remove HConnection in
>>> HConnectionManager which is a HConnection pool.
>>> As I look into the HconnectionManager Javadoc, it seems I have to
>>> implement a shutdown hook
>>>
>>>  * Cleanup used to be done inside in a shutdown hook.  On startup we'd
>>>  * register a shutdown hook that called {@link #deleteAllConnections()}
>>>  * on its way out but the order in which shutdown hooks run is not defined 
>>> so
>>>  * were problematic for clients of HConnection that wanted to register their
>>>  * own shutdown hooks so we removed ours though this shifts the onus for
>>>  * cleanup to the client.
>>>
>>> ​
>>>
>>> 2014-10-15 22:31 GMT+08:00 Ted Yu :
>>>
 Pardon me - there was typo in previous email.

 Calling table.close() is the recommended approach.
 HConnectionManager does reference counting. When all references to the
 underlying connection are gone, connection would be released.

 Cheers

 On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu  wrote:

> Have you tried the following ?
>
> val result = rdd.map(line => { val table = 
> Util.Connection.getTable("user")
> ...
> Util.Connection.close() }
>
> On Wed, Oct 15, 2014 at 6:09 AM, Fengyun RAO 
> wrote:
>
>> In order to share an HBase connection pool, we create an object
>>
>> Object Util {
>> val HBaseConf = HBaseConfiguration.create
>> val Connection= HConnectionManager.createConnection(HBaseConf)
>> }
>>
>> which would be shared among tasks on the same executor. e.g.
>>
>> val result = rdd.map(line => {
>>   val table = Util.Connection.getTable("user")
>>   ...
>> }
>>
>> However, we don’t how to close the Util.Connection.
>> If we write Util.Connection.close() in the main function,
>> it’ll only run on the driver, not the executor.
>>
>> So, How to make sure every Connection closed before exist?
>> ​
>>
>
>

>>>
>>
>


Re: spark1.0 principal component analysis

2014-10-16 Thread Xiangrui Meng
computePrincipalComponents returns a local matrix X, whose columns are
the principal components (ordered), while those column vectors are in
the same feature space as the input feature vectors. -Xiangrui
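
To make that concrete, a small sketch (data and k are made up) using MLlib's
RowMatrix; row j of the returned matrix holds the weights of original feature j
in each component, which is how the components map back to the input features:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object PcaSketch extends App {
  val sc = new SparkContext("local[*]", "pca-sketch")

  // Made-up data: each row is an observation in the original feature space.
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 10.0)))
  val mat = new RowMatrix(rows)

  // pc is a local (numFeatures x k) matrix: column i is the i-th principal
  // component, expressed as weights over the original features.
  val pc = mat.computePrincipalComponents(2)
  println(pc)

  // Project the data into the 2-dimensional principal-component space.
  val projected = mat.multiply(pc)
  projected.rows.collect().foreach(println)

  sc.stop()
}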

On Thu, Oct 16, 2014 at 2:39 AM, al123  wrote:
> Hi,
>
> I don't think anybody answered this question...
>
>
> fintis wrote
>> How do I match the principal components to the actual features since there
>> is some sorting?
>
> Would anybody be able to shed a little light on it since I too am struggling
> with this?
>
> Many thanks!!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249p16556.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



Re: Play framework

2014-10-16 Thread Ramaraju Indukuri
In our case, Play libraries are not required to run spark jobs. Hence they
are available only on master and play runs as a regular scala application.
I can't think of a case where you need play to run on slaves.

Raju


On Thu, Oct 16, 2014 at 10:21 PM, Mohammed Guller 
wrote:

>  What about all the play dependencies since the jar created by the ‘Play
> package’ won’t include the play jar or any of the 100+ jars on which play
> itself depends?
>
>
>
> Mohammed
>
>
>
> *From:* US Office Admin [mailto:ad...@vectorum.com]
> *Sent:* Thursday, October 16, 2014 7:05 PM
> *To:* Mohammed Guller; Surendranauth Hiraman
> *Cc:* Daniel Siegmann; user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> The remaining dependencies (Spark libraries) are available for the context
> from the sparkhome. I have installed spark such that all the slaves have the
> same sparkhome. The code looks like this.
>
>
>
> val conf = new SparkConf()
>   .setSparkHome("/home/dev/spark")
>   .setMaster("spark://99.99.99.999:7077")
>   .setAppName("xxx")
>   .setJars(Seq("/home/dev/play/target/scala-2.10/xxx_2.10-1.0.jar"))
>
> val sc = new SparkContext(conf)
>
>  If you have more dependencies, you can keep adding them to the setJars.
>
>
>
> Raju
>
>
>  --
>
> *From:* Mohammed Guller 
> *Sent:* Thursday, October 16, 2014 4:00 PM
> *To:* US Office Admin; Surendranauth Hiraman
> *Cc:* Daniel Siegmann; user@spark.apache.org
> *Subject:* RE: Play framework
>
>
>
> Thanks, Suren and Raju.
>
>
>
> Raju – if I remember correctly, Play package command just creates a jar
> for your app. That jar file will not include other dependencies. So it is
> not really a full jar as you mentioned below. So how you are passing all
> the other dependency jars to spark? Can you share that piece of code? Also
> is there any specific reason why you are not using play dist instead?
>
>
>
> Mohammed
>
>
>
> *From:* US Office Admin [mailto:ad...@vectorum.com ]
> *Sent:* Thursday, October 16, 2014 11:41 AM
> *To:* Surendranauth Hiraman; Mohammed Guller
> *Cc:* Daniel Siegmann; user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> We integrated Spark into Play and use SparkSQL extensively on an ec2
> spark cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.
>
>
>
> Step 1: Create a play scala application as usual
>
> Step 2. In Build.sbt put all your spark dependencies. What works for us is
> Play 2.2.3, Scala 2.10.4 and Spark 1.1. We have Akka 2.2.3. This is
> straightforward.
>
> step3: As Daniel mentioned, create spark context within Play. And rest of
> the application is as usual.
>
> Step4: Create a full jar using Play Package and use that package to be
> included in library of jars passed to spark context.
>
> Step 5: Play run as usual.
>
>
>
> It works very well, and the convenience is, we have all scala application
> throughout.
>
>
>
> Regards
>
> Raju
>
>
>
>
> 
>--
>
> *From:* Surendranauth Hiraman 
> *Sent:* Thursday, October 16, 2014 12:42 PM
> *To:* Mohammed Guller
> *Cc:* Daniel Siegmann; user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> Mohammed,
>
>
>
> Jumping in for Daniel, we actually address the configuration issue by
> pulling values from environment variables or command line options. Maybe
> that may handle at least some of your needs.
>
>
>
> For the akka issue, here is the akka version we include in build.sbt:
> "com.typesafe.akka" %% "akka-actor" % "2.2.1"
>
>
>
> -Suren
>
>
>
>
>
> On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
> wrote:
>
> Daniel,
>
> Thanks for sharing this. It is very helpful.
>
>
>
> The reason I want to use Spark submit is that it provides more
> flexibility. For example, with spark-submit, I don’t need to hard code the
> master info in the code. I can easily change the config without having to
> change and recompile code.
>
>
>
> Do you mind sharing the sbt build file for your play app? I tried to build
> an uber jar using sbt-assembly. It gets built, but when I run it, it throws
> all sorts of exception. I have seen some blog posts that Spark and Play use
> different version of the Akka library. So I included Akka in my build.scala
> file, but still cannot get rid of Akka related exceptions. I suspect that
> the settings in the build.scala file for my play project is incorrect.
>
>
>
> Mohammed
>
>
>
> *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
> *Sent:* Thursday, October 16, 2014 7:15 AM
> *To:* Mohammed Guller
> *Cc:* user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> We execute Spark jobs from a Play application but we don't use
> spark-submit. I don't know if you really want to use spark-submit, but if
> not you can just create a SparkContext programmatically in your app.
>
> In development I typically run Spark loc

Re: object in an rdd: serializable?

2014-10-16 Thread Boromir Widas
Making it a case class should work.
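
A minimal sketch of both routes (a case class, or keeping a plain class and
mixing in Serializable), run locally with made-up data:

import org.apache.spark.SparkContext

// Route 1: case classes are Serializable (and get equals/hashCode) by default.
case class HelloWorld(count: Int)

// Route 2: keep a plain class but mark it Serializable explicitly.
class HelloWorldClass(val count: Int) extends Serializable

object SerializableDemo extends App {
  val sc = new SparkContext("local[*]", "serializable-demo")
  val data = sc.parallelize(List(HelloWorld(1), HelloWorld(2)))
  println(data.map(_.count).reduce(_ + _)) // 3
  sc.stop()
}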

On Thu, Oct 16, 2014 at 8:30 PM, ll  wrote:

> i got an exception complaining about serializable.  the sample code is
> below...
>
> class HelloWorld(val count: Int) {
>   ...
>   ...
> }
>
> object Test extends App {
>   ...
>   val data = sc.parallelize(List(new HelloWorld(1), new HelloWorld(2)))
>   ...
> }
>
> what is the best way to serialize HelloWorld so that it can be contained in
> an RDD?
>
> thanks!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/object-in-an-rdd-serializable-tp16638.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


RE: Play framework

2014-10-16 Thread Mohammed Guller
What about all the play dependencies since the jar created by the ‘Play 
package’ won’t include the play jar or any of the 100+ jars on which play 
itself depends?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 7:05 PM
To: Mohammed Guller; Surendranauth Hiraman
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


The remaining dependencies (Spark libraries) are available for the context from
the sparkhome. I have installed spark such that all the slaves have the same
sparkhome. The code looks like this.


val conf = new SparkConf()
  .setSparkHome("/home/dev/spark")
  .setMaster("spark://99.99.99.999:7077")
  .setAppName("xxx")
  .setJars(Seq("/home/dev/play/target/scala-2.10/xxx_2.10-1.0.jar"))

val sc = new SparkContext(conf)


If you have more dependencies, you can keep adding them to the setJars.



Raju



From: Mohammed Guller mailto:moham...@glassbeam.com>>
Sent: Thursday, October 16, 2014 4:00 PM
To: US Office Admin; Surendranauth Hiraman
Cc: Daniel Siegmann; user@spark.apache.org
Subject: RE: Play framework

Thanks, Suren and Raju.

Raju – if I remember correctly, Play package command just creates a jar for 
your app. That jar file will not include other dependencies. So it is not 
really a full jar as you mentioned below. So how you are passing all the other 
dependency jars to spark? Can you share that piece of code? Also is there any 
specific reason why you are not using play dist instead?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 11:41 AM
To: Surendranauth Hiraman; Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


We integrated Spark into Play and use SparkSQL extensively on an ec2 spark 
cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.



Step 1: Create a play scala application as usual

Step 2. In Build.sbt put all your spark dependencies. What works for us is Play 
2.2.3, Scala 2.10.4 and Spark 1.1. We have Akka 2.2.3. This is straightforward.

step3: As Daniel mentioned, create spark context within Play. And rest of the 
application is as usual.

Step4: Create a full jar using Play Package and use that package to be included 
in library of jars passed to spark context.

Step 5: Play run as usual.



It works very well, and the convenience is, we have all scala application 
throughout.



Regards

Raju




From: Surendranauth Hiraman 
mailto:suren.hira...@velos.io>>
Sent: Thursday, October 16, 2014 12:42 PM
To: Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework

Mohammed,

Jumping in for Daniel, we actually address the configuration issue by pulling 
values from environment variables or command line options. Maybe that may 
handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt: 
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exception. I have seen some blog posts that Spark and Play use 
different version of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka related exceptions. I suspect that the 
settings in the build.scala file for my play project is incorrect.

Mohammed

From: Daniel Siegmann 
[mailto:daniel.siegm...@velos.io]
Sent: Thursday, October 16, 2014 7:15 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Play framework

We execute Spark jobs from a Play application but we don't use spark-submit. I 
don't know if you really want to use spark-submit, but if not you can just 
create a SparkContext programmatically in your app.
In development I typically run Spark locally. Creating the Spark context is 
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(conf)

It is important to keep in mind you cannot have multiple local contexts (you 
can create them but you'll get odd errors), so if you are running things in 
p

Re: Play framework

2014-10-16 Thread US Office Admin
The remaining dependencies (Spark libraries) are available for the context from
the sparkhome. I have installed spark such that all the slaves have the same
sparkhome. The code looks like this.


val conf = new SparkConf()
  .setSparkHome("/home/dev/spark")
  .setMaster("spark://99.99.99.999:7077")
  .setAppName("xxx")
  .setJars(Seq("/home/dev/play/target/scala-2.10/xxx_2.10-1.0.jar"))

val sc = new SparkContext(conf)


If you have more dependencies, you can keep adding them to the setJars.


Raju



From: Mohammed Guller 
Sent: Thursday, October 16, 2014 4:00 PM
To: US Office Admin; Surendranauth Hiraman
Cc: Daniel Siegmann; user@spark.apache.org
Subject: RE: Play framework

Thanks, Suren and Raju.

Raju – if I remember correctly, Play package command just creates a jar for 
your app. That jar file will not include other dependencies. So it is not 
really a full jar as you mentioned below. So how you are passing all the other 
dependency jars to spark? Can you share that piece of code? Also is there any 
specific reason why you are not using play dist instead?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 11:41 AM
To: Surendranauth Hiraman; Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


We integrated Spark into Play and use SparkSQL extensively on an ec2 spark 
cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.



Step 1: Create a play scala application as usual

Step 2. In Build.sbt put all your spark dependencies. What works for us is Play 
2.2.3, Scala 2.10.4 and Spark 1.1. We have Akka 2.2.3. This is straightforward.

step3: As Daniel mentioned, create spark context within Play. And rest of the 
application is as usual.

Step4: Create a full jar using Play Package and use that package to be included 
in library of jars passed to spark context.

Step 5: Play run as usual.



It works very well, and the convenience is, we have all scala application 
throughout.



Regards

Raju




From: Surendranauth Hiraman 
mailto:suren.hira...@velos.io>>
Sent: Thursday, October 16, 2014 12:42 PM
To: Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework

Mohammed,

Jumping in for Daniel, we actually address the configuration issue by pulling 
values from environment variables or command line options. Maybe that may 
handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt: 
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exception. I have seen some blog posts that Spark and Play use 
different version of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka related exceptions. I suspect that the 
settings in the build.scala file for my play project is incorrect.

Mohammed

From: Daniel Siegmann 
[mailto:daniel.siegm...@velos.io]
Sent: Thursday, October 16, 2014 7:15 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Play framework

We execute Spark jobs from a Play application but we don't use spark-submit. I 
don't know if you really want to use spark-submit, but if not you can just 
create a SparkContext programmatically in your app.
In development I typically run Spark locally. Creating the Spark context is 
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(sparkConf)

It is important to keep in mind you cannot have multiple local contexts (you 
can create them but you'll get odd errors), so if you are running things in 
parallel within your app (even unit tests) you'd need to share a context in 
this case. If you are running sequentially you can create a new local context 
each time, but you must make sure to call SparkContext.stop() when you're done.
Running against a cluster is a bit more complicated because you need to add all 
your dependency jars. I'm not sure how to get this to work with play run. I 
stick to building the app with play dist and then running against the packaged 
ap

how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-16 Thread Henry Hung
HI All,

I try to build spark 1.1.0 using sbt with command:
sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

but the resulting spark-assembly-1.1.0-hadoop2.2.0.jar is still missing the 
Apache commons-math3 classes.

How to add the math3 into package?

Best regards,
Henry
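
One way to get those classes onto the classpath without patching the Spark build is
to declare commons-math3 as a dependency of the application you submit, so the
classes travel with the application jar rather than the assembly. A minimal sketch
for the application's build.sbt (the version number is an assumption):

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"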




RE: Spark Hive Snappy Error

2014-10-16 Thread Shao, Saisai
Hi Arthur,

I think this is a known issue in Spark; you can check 
(https://issues.apache.org/jira/browse/SPARK-3958). I'm curious about it: can 
you always reproduce this issue? Is it related to a specific data set? Would 
you mind giving me some information about your workload, Spark configuration, 
JDK version and OS version?

Thanks
Jerry

From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com]
Sent: Friday, October 17, 2014 7:13 AM
To: user
Cc: arthur.hk.c...@gmail.com
Subject: Spark Hive Snappy Error

Hi,

When trying Spark with Hive table, I got the “java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error,


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select count(1) from 
q8_national_market_share").collect().foreach(println)
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
 at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
 at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
 at 
org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79)
 at 
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
 at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
 at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
 at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68)
 at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
 at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
 at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
 at 
org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:68)
 at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68)
 at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
 at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
 at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
 at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
 at 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
 at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
 at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
 at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
 at $iwC$$iwC$$iwC$$iwC.(:15)
 at $iwC$$iwC$$iwC.(:20)
 at $iwC$$iwC.(:22)
 at $iwC.(:24)
 at (:26)
 at .(:30)
 at .()
 at .(:7)
 at .()
 at $print()
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Earthson
I'm trying to provide an API for Java users. I need to accept their
JavaSchemaRDDs and convert them to SchemaRDDs for Scala users.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482p16641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Play framework

2014-10-16 Thread Mohammed Guller
Manu,

I had looked at that example before starting this thread. I was specifically 
looking for some suggestions on how to run a Play app with the Spark-submit 
script on a real cluster.

Mohammed

From: Manu Suryavansh [mailto:suryavanshi.m...@gmail.com]
Sent: Thursday, October 16, 2014 3:32 PM
To: Mohammed Guller
Cc: US Office Admin; Surendranauth Hiraman; Daniel Siegmann; 
user@spark.apache.org
Subject: Re: Play framework

Hi,

Below is the link for a simple Play + SparkSQL example -

http://blog.knoldus.com/2014/07/14/play-with-spark-building-apache-spark-with-play-framework-part-3/

https://github.com/knoldus/Play-Spark-Scala

Manu



On Thu, Oct 16, 2014 at 1:00 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Thanks, Suren and Raju.

Raju – if I remember correctly, Play package command just creates a jar for 
your app. That jar file will not include other dependencies. So it is not 
really a full jar as you mentioned below. So how are you passing all the other 
dependency jars to spark? Can you share that piece of code? Also is there any 
specific reason why you are not using play dist instead?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 11:41 AM
To: Surendranauth Hiraman; Mohammed Guller

Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


​We integrated Spark into Play and use SparkSQL extensively on an ec2 spark 
cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.



Step 1: Create a play scala application as usual

Step 2. In Build.sbt put all your spark dependencies. What works for us is Play 
2.2.3 Scala 2.10.4 Spark 1.1. We have Akka 2.2.3. This is straight forward

step3: As Daniel mentioned, create spark context within Play. And rest of the 
application is as usual.

Step4: Create a full jar using Play Package and use that package to be included 
in library of jars passed to spark context.

Step 5: Play run as usual.



It works very well, and the convenience is that we have a Scala application 
throughout.



Regards

Raju




From: Surendranauth Hiraman 
mailto:suren.hira...@velos.io>>
Sent: Thursday, October 16, 2014 12:42 PM
To: Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework

Mohammed,

Jumping in for Daniel, we actually address the configuration issue by pulling 
values from environment variables or command line options. Maybe that may 
handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt: 
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exception. I have seen some blog posts that Spark and Play use 
different version of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka related exceptions. I suspect that the 
settings in the build.scala file for my play project is incorrect.

Mohammed

From: Daniel Siegmann 
[mailto:daniel.siegm...@velos.io]
Sent: Thursday, October 16, 2014 7:15 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Play framework

We execute Spark jobs from a Play application but we don't use spark-submit. I 
don't know if you really want to use spark-submit, but if not you can just 
create a SparkContext programmatically in your app.
In development I typically run Spark locally. Creating the Spark context is 
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(sparkConf)

It is important to keep in mind you cannot have multiple local contexts (you 
can create them but you'll get odd errors), so if you are running things in 
parallel within your app (even unit tests) you'd need to share a context in 
this case. If you are running sequentially you can create a new local context 
each time, but you must make sure to call SparkContext.stop() when you're done.
Running against a cluster is a bit more complicated because you need to add all 
your dependency jars. I'm not sure how to get this to work with play run. I 
stick to build

Re: Exception Logging

2014-10-16 Thread Yana Kadiyska
You can put a try/catch block in the map function and log the exception.
The only "tricky" part is that the exception log will be located on the
executor machine. Even if you don't do any trapping you should see the
exception stacktrace in the executors' stderr log, which is visible through
the UI (if your app crashes the executor you can still see it as the last
executor that ran on a given worker). But things like println and logging
work inside map; you just have to remember that everything happens on the
remote machine.
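
A minimal sketch of that pattern in Scala, assuming the spark-shell's sc (the input
path and field index are made up):

val lines = sc.textFile("hdfs:///data/input.txt")
val parsed = lines.flatMap { line =>
  try {
    Some(line.split(",")(1).toDouble)
  } catch {
    case e: Exception =>
      // this lands in the executor's stderr, visible through the web UI
      System.err.println(s"bad record '$line': $e")
      None
  }
}

If you also need the source file name, sc.wholeTextFiles keeps (path, content)
pairs, at the cost of reading each file as a single record.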

On Thu, Oct 16, 2014 at 8:11 PM, Ge, Yao (Y.)  wrote:

>  I need help to better trap Exception in the map functions. What is the
> best way to catch the exception and provide some helpful diagnostic
> information such as source of the input such as file name (and ideally line
> number if I am processing a text file)?
>
>
>
> -Yao
>


object in an rdd: serializable?

2014-10-16 Thread ll
i got an exception complaining about serializable.  the sample code is
below...

class HelloWorld(val count: Int) {
  ...
  ...
}

object Test extends App {
  ...
  val data = sc.parallelize(List(new HelloWorld(1), new HelloWorld(2))) 
  ... 
}

what is the best way to serialize HelloWorld so that it can be contained in
an RDD?

thanks!
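
A minimal sketch of the usual fix, assuming the default Java serializer stays in
use (with Kryo you would register the class instead): mark the class Serializable.

import org.apache.spark.{SparkConf, SparkContext}

class HelloWorld(val count: Int) extends Serializable

object Test extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("serializable-demo"))
  val data = sc.parallelize(List(new HelloWorld(1), new HelloWorld(2)))
  println(data.map(_.count).reduce(_ + _))  // forces the elements to be serialized
  sc.stop()
}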




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/object-in-an-rdd-serializable-tp16638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Exception Logging

2014-10-16 Thread Ge, Yao (Y.)
I need help to better trap Exception in the map functions. What is the best way 
to catch the exception and provide some helpful diagnostic information such as 
source of the input such as file name (and ideally line number if I am 
processing a text file)?

-Yao


Re: scala: java.net.BindException?

2014-10-16 Thread Duy Huynh
thanks marcelo.  i only instantiated sparkcontext once, at the beginning,
in this code.  the exception was thrown right at the beginning.

i also tried to run other programs, which worked fine previously, but now
also got the same error.

it looks like there is a global block on creating a SparkContext that prevents
any program from creating one.



On Oct 16, 2014 6:26 PM, "Marcelo Vanzin"  wrote:

> This error is not fatal, since Spark will retry on a different port..
> but this might be a problem, for different reasons, if somehow your
> code is trying to instantiate multiple SparkContexts.
>
> I assume "nn.SimpleNeuralNetwork" is part of your application, and
> since it seems to be instantiating a new SparkContext and also is
> being called from an iteration, that looks sort of fishy.
>
> On Thu, Oct 16, 2014 at 2:51 PM, ll  wrote:
> > hello... does anyone know how to resolve this issue?  i'm running this
> > locally on my computer.  keep getting this BindException.  much
> appreciated.
> >
> > 14/10/16 17:48:13 WARN component.AbstractLifeCycle: FAILED
> > SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
> already
> > in use
> > java.net.BindException: Address already in use
> > at sun.nio.ch.Net.bind0(Native Method)
> > at sun.nio.ch.Net.bind(Net.java:444)
> > at sun.nio.ch.Net.bind(Net.java:436)
> > at
> > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
> > at
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> > at
> >
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
> > at
> >
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
> > at
> >
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
> > at
> >
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> > at org.eclipse.jetty.server.Server.doStart(Server.java:293)
> > at
> >
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> > at
> >
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:192)
> > at
> org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
> > at
> org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
> > at
> >
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
> > at
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > at
> org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
> > at
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:202)
> > at org.apache.spark.ui.WebUI.bind(WebUI.scala:102)
> > at org.apache.spark.SparkContext.(SparkContext.scala:224)
> > at
> >
> nn.SimpleNeuralNetwork$delayedInit$body.apply(SimpleNeuralNetwork.scala:15)
> > at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> > at
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> > at scala.App$$anonfun$main$1.apply(App.scala:71)
> > at scala.App$$anonfun$main$1.apply(App.scala:71)
> > at scala.collection.immutable.List.foreach(List.scala:318)
> > at
> >
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> > at scala.App$class.main(App.scala:71)
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/scala-java-net-BindException-tp16624.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
>
> --
> Marcelo
>


Re: local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
Yes, I removed my Spark dir and scp’ed the master’s build to all cluster 
machines suspecting that problem.

My app (Apache Mahout) had Spark 1.0.1 in the POM but changing it to 1.0.2 (the 
Spark version installed) gave another error. I guess I’ll have to install Spark 
1.0.1 or get Mahout to update their dependencies.


On Oct 16, 2014, at 4:03 PM, Paweł Szulc  wrote:

This looks like typical issue with serialization of same class between 
different versions of an application.

I've run into similar (though not the same) issues before. Are you 100% sure that 
you have the same version of Apache Spark on each node of the cluster? And I am 
not only asking about current project version (1.0.0, 1.1.0 etc.) but also 
about package type (hadoop 1.x, hadoop 2.x).

On Fri, Oct 17, 2014 at 12:35 AM, Pat Ferrel  wrote:
I’ve read several discussions of the error here and so have wiped all cluster 
machines and copied the master’s spark build to the rest of the cluster. I’ve 
built my job on the master using the correct Spark version as a dependency and 
even built that version of Spark. I still get the incompatible serialVersionUID 
error.

If I run the job locally with master = local[8] it completes fine.

I thought I had incompatible builds but in the end I’m not quite sure what this 
error is telling me

14/10/16 15:21:03 WARN scheduler.TaskSetManager: Loss was due to 
java.io.InvalidClassException
java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class 
incompatible: stream classdesc serialVersionUID = 385418487991259089, local 
class serialVersionUID = -6766554341038829528
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:560)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1599)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1494)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1599)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1494)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1748)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask$.deserializeInfo(ShuffleMapTask.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:135)
at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1814)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1773)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Spark Hive Snappy Error

2014-10-16 Thread arthur.hk.c...@gmail.com
Hi,

When trying Spark with Hive table, I got the “java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error,


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select count(1) from 
q8_national_market_share").collect().foreach(println)
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at 
org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79)
at 
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
at 
org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:68)
at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.(:15)
at $iwC$$iwC$$iwC.(:20)
at $iwC$$iwC.(:22)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at

Join with large data set

2014-10-16 Thread Ankur Srivastava
Hi,

I have a rdd which is my application data and is huge. I want to join this
with reference data which is also huge to fit in-memory and thus I do not
want to use Broadcast variable.

What other options do I have to perform such joins?

I am using Cassandra as my data store, so should I just query cassandra to
get the reference data needed?

Also, when I join two RDDs, will it result in a full scan, or will Spark
hash-partition the two RDDs so that records with the same key end up on the
same node?

Thanks
Ankur
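
A minimal sketch of a plain RDD-to-RDD join (no broadcast), with tiny made-up data
standing in for the two large key/value RDDs. Spark hash-partitions both sides by
key during the shuffle, and pre-partitioning the reference side lets later joins
against it reuse that partitioning:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object JoinSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("join-sketch"))

  val appData = sc.parallelize(Seq(("u1", "event-a"), ("u2", "event-b")))
  val refData = sc.parallelize(Seq(("u1", "profile-1"), ("u2", "profile-2")))

  // records with equal keys are shuffled to the same partition, so no broadcast
  val partitionedRef = refData.partitionBy(new HashPartitioner(4)).persist()

  appData.join(partitionedRef).collect().foreach(println)
  sc.stop()
}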


local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
I’ve read several discussions of the error here and so have wiped all cluster 
machines and copied the master’s spark build to the rest of the cluster. I’ve 
built my job on the master using the correct Spark version as a dependency and 
even built that version of Spark. I still get the incompatible serialVersionUID 
error.

If I run the job locally with master = local[8] it completes fine.

I thought I had incompatible builds but in the end I’m not quite sure what this 
error is telling me

14/10/16 15:21:03 WARN scheduler.TaskSetManager: Loss was due to 
java.io.InvalidClassException
java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class 
incompatible: stream classdesc serialVersionUID = 385418487991259089, local 
class serialVersionUID = -6766554341038829528
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:560)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1599)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1494)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1599)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1494)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1748)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask$.deserializeInfo(ShuffleMapTask.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:135)
at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1814)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1773)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Play framework

2014-10-16 Thread Manu Suryavansh
Hi,

Below is the link for a simple Play + SparkSQL example -

http://blog.knoldus.com/2014/07/14/play-with-spark-building-apache-spark-with-play-framework-part-3/

https://github.com/knoldus/Play-Spark-Scala

Manu



On Thu, Oct 16, 2014 at 1:00 PM, Mohammed Guller 
wrote:

>  Thanks, Suren and Raju.
>
>
>
> Raju – if I remember correctly, Play package command just creates a jar
> for your app. That jar file will not include other dependencies. So it is
> not really a full jar as you mentioned below. So how you are passing all
> the other dependency jars to spark? Can you share that piece of code? Also
> is there any specific reason why you are not using play dist instead?
>
>
>
> Mohammed
>
>
>
> *From:* US Office Admin [mailto:ad...@vectorum.com]
> *Sent:* Thursday, October 16, 2014 11:41 AM
> *To:* Surendranauth Hiraman; Mohammed Guller
>
> *Cc:* Daniel Siegmann; user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> ​We integrated Spark into Play and use SparkSQL extensively on an ec2
> spark cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.
>
>
>
> Step 1: Create a play scala application as usual
>
> Step 2. In Build.sbt put all your spark dependencies. What works for us is
> Play 2.2.3 Scala 2.10.4 Spark 1.1. We have Akka 2.2.3. This is straight
> forward
>
> step3: As Daniel mentioned, create spark context within Play. And rest of
> the application is as usual.
>
> Step4: Create a full jar using Play Package and use that package to be
> included in library of jars passed to spark context.
>
> Step 5: Play run as usual.
>
>
>
> It works very well, and the convenience is, we have all scala application
> throughout.
>
>
>
> Regards
>
> Raju
>
>
>
>
> 
>--
>
> *From:* Surendranauth Hiraman 
> *Sent:* Thursday, October 16, 2014 12:42 PM
> *To:* Mohammed Guller
> *Cc:* Daniel Siegmann; user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> Mohammed,
>
>
>
> Jumping in for Daniel, we actually address the configuration issue by
> pulling values from environment variables or command line options. Maybe
> that may handle at least some of your needs.
>
>
>
> For the akka issue, here is the akka version we include in build.sbt:
> "com.typesafe.akka" %% "akka-actor" % "2.2.1"
>
>
>
> -Suren
>
>
>
>
>
> On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
> wrote:
>
> Daniel,
>
> Thanks for sharing this. It is very helpful.
>
>
>
> The reason I want to use Spark submit is that it provides more
> flexibility. For example, with spark-submit, I don’t need to hard code the
> master info in the code. I can easily change the config without having to
> change and recompile code.
>
>
>
> Do you mind sharing the sbt build file for your play app? I tried to build
> an uber jar using sbt-assembly. It gets built, but when I run it, it throws
> all sorts of exception. I have seen some blog posts that Spark and Play use
> different version of the Akka library. So I included Akka in my build.scala
> file, but still cannot get rid of Akka related exceptions. I suspect that
> the settings in the build.scala file for my play project is incorrect.
>
>
>
> Mohammed
>
>
>
> *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
> *Sent:* Thursday, October 16, 2014 7:15 AM
> *To:* Mohammed Guller
> *Cc:* user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> We execute Spark jobs from a Play application but we don't use
> spark-submit. I don't know if you really want to use spark-submit, but if
> not you can just create a SparkContext programmatically in your app.
>
> In development I typically run Spark locally. Creating the Spark context
> is pretty trivial:
>
> val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome
> App")
>
> // call conf.set for any other configuration you want
>
> val sc = new SparkContext(sparkConf)
>
> It is important to keep in mind you cannot have multiple local contexts
> (you can create them but you'll get odd errors), so if you are running
> things in parallel within your app (even unit tests) you'd need to share a
> context in this case. If you are running sequentially you can create a new
> local context each time, but you must make sure to call SparkContext.stop()
> when you're done.
>
> Running against a cluster is a bit more complicated because you need to
> add all your dependency jars. I'm not sure how to get this to work with play
> run. I stick to building the app with play dist and then running against
> the packaged application, because it very conveniently provides all the
> dependencies in a lib folder. Here is some code to load all the paths you
> need from the dist:
>
>
> def libs : Seq[String] = {
> val libDir = play.api.Play.application.getFile("lib")
>
> logger.info(s"SparkContext will be initialized with libraries
> from d
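
A minimal sketch of the same idea, assuming Play's standard dist layout with a lib/
directory next to the packaged app (the master URL and app name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

def libs: Seq[String] = {
  val libDir = play.api.Play.application.getFile("lib")
  libDir.listFiles().filter(_.getName.endsWith(".jar")).map(_.getAbsolutePath).toSeq
}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("My Awesome App")
  .setJars(libs)
val sc = new SparkContext(conf)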

Re: scala: java.net.BindException?

2014-10-16 Thread Marcelo Vanzin
This error is not fatal, since Spark will retry on a different port..
but this might be a problem, for different reasons, if somehow your
code is trying to instantiate multiple SparkContexts.

I assume "nn.SimpleNeuralNetwork" is part of your application, and
since it seems to be instantiating a new SparkContext and also is
being called from an iteration, that looks sort of fishy.
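
A minimal sketch of the usual pattern for a single context per JVM (the UI port
below is arbitrary, just something other than the already-bound 4040):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("nn")
  .set("spark.ui.port", "4050")
val sc = new SparkContext(conf)
try {
  // ... run the job ...
} finally {
  sc.stop()  // releases the UI port so the next run can bind it
}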

On Thu, Oct 16, 2014 at 2:51 PM, ll  wrote:
> hello... does anyone know how to resolve this issue?  i'm running this
> locally on my computer.  keep getting this BindException.  much appreciated.
>
> 14/10/16 17:48:13 WARN component.AbstractLifeCycle: FAILED
> SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already
> in use
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:444)
> at sun.nio.ch.Net.bind(Net.java:436)
> at
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
> at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> at
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
> at
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
> at
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
> at
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at org.eclipse.jetty.server.Server.doStart(Server.java:293)
> at
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
> at
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:202)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:102)
> at org.apache.spark.SparkContext.(SparkContext.scala:224)
> at
> nn.SimpleNeuralNetwork$delayedInit$body.apply(SimpleNeuralNetwork.scala:15)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.App$class.main(App.scala:71)
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/scala-java-net-BindException-tp16624.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Print dependency graph as DOT file

2014-10-16 Thread Soumitra Kumar
Hello,

Is there a way to print the dependency graph of complete program or RDD/DStream 
as a DOT file? It would be very helpful to have such a thing.

Thanks,
-Soumitra.
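
As far as I know there is no built-in DOT export (rdd.toDebugString prints a
plain-text lineage), but a rough sketch that walks rdd.dependencies and emits DOT
edges is short; the node labels below are just class name plus RDD id:

import org.apache.spark.rdd.RDD

def toDot(root: RDD[_]): String = {
  val edges = scala.collection.mutable.LinkedHashSet[String]()
  def label(r: RDD[_]) = "\"" + r.getClass.getSimpleName + "_" + r.id + "\""
  def visit(r: RDD[_]): Unit =
    for (dep <- r.dependencies) {
      edges += s"  ${label(dep.rdd)} -> ${label(r)};"
      visit(dep.rdd)
    }
  visit(root)
  (Seq("digraph rdd {") ++ edges ++ Seq("}")).mkString("\n")
}

// e.g. println(toDot(sc.parallelize(1 to 10).map(_ + 1)))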

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
Maybe I should create a private AMI to use for my question No.1? Assuming I
use the default instance type as the base image. Has anyone tried this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EC2-cluster-set-up-and-access-to-HBase-in-a-different-cluster-tp16622p16628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark streaming on data at rest.

2014-10-16 Thread ameyc
Apologies if this is something very obvious but I've perused the spark
streaming guide and this isn't very evident to me still. So I have files
with data of the format: timestamp,column1,column2,column3.. etc. and I'd
like to use the spark streaming's window operations on them.

However from what I notice, the streams are expected to be "live". Is there
a way to do window operations on timestamps from my dataset without somehow
"replaying" the messages?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-on-data-at-rest-tp16627.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Strange duplicates in data when scaling up

2014-10-16 Thread Jacob Maloney
I have a flatmap function that shouldn't possibly emit duplicates and yet it 
does. The output of my function is a HashSet so the function itself cannot 
output duplicates, and yet I see many copies of keys emitted from it (in one 
case up to 62). The curious thing is I can't get this to happen until I ramp up 
the size of the input lines to about 100,000. For example:
(3587005221,[[(80530632,0.20824391739360665)], 
[(80530632,0.20824391739360665)]])

Will expand to
(3587005221,(80530632,0.37312230565577803))
(3587005221,(80530632,0.37312230565577803))
(3587005221,(80530632,0.37312230565577803))
.
.
.
(3587005221,(80530632,0.37312230565577803))
62 total times

If I run this line only as input I only get the one line of output as expected. 
It seems to be a scaling up issue.

My code is as follows:
JavaPairRDD>>> preAggData = 
indidKeyedJoinedData.groupByKey();

JavaPairRDD> aggregatedData = 
preAggData.flatMapToPair(new AggregateLikeSims());

Where:
static class AggregateLikeSims implements 
PairFlatMapFunction>>>, 
Long,Tuple2>{
HashSet>> 
output = new HashSet>>();
Map> intermediateMap = new 
HashMap>();
Iterator> intIterator;
Tuple2 currentTuple;
Double MAX_RECO_VALUE = 1.0;
Iterator>> 
itIterator;
Accumulator accum;

@Override
public Iterable>> call(Tuple2>>> inTuple){
itIterator = 
inTuple._2.iterator();

while(itIterator.hasNext()){
intIterator = 
itIterator.next().iterator();

while(intIterator.hasNext()){

currentTuple = intIterator.next();

if (intermediateMap.containsKey(currentTuple._1)){

intermediateMap.get(currentTuple._1).add(currentTuple._2);

} else {

List listOfDoubles = new ArrayList();

listOfDoubles.add(currentTuple._2);

intermediateMap.put(currentTuple._1, listOfDoubles);

}
}
}


Iterator>> it = 
intermediateMap.entrySet().iterator();
while (it.hasNext()) {
Map.Entry> pairs = 
it.next();
if (pairs.getValue().size() > 1) {
output.add(new Tuple2>(inTuple._1,new 
Tuple2(pairs.getKey(),aggregate(pairs.getValue();
} else {
output.add(new Tuple2>(inTuple._1,new 
Tuple2(pairs.getKey(),pairs.getValue().get(0;
}
it.remove();
}

return output;
}

private double aggregate(List simsList) 
{
if (simsList == null) {
return 0;
}
if (simsList.size() == 1) {
return 
simsList.get(0);
}

double max = 0;
double sum = 0;

// Find max and sum up all 
elements of array.
for (double sim : simsList) {
if (sim > max) {

max = sim;
}

Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
I can run the following code against Spark 1.1

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext()
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1)

Davies

On Thu, Oct 16, 2014 at 2:45 PM, Davies Liu  wrote:
> Could you post the code that have problem with pyspark? thanks!
>
> Davies
>
> On Thu, Oct 16, 2014 at 12:27 PM, Gen  wrote:
>> I tried the same data with scala. It works pretty well.
>> It seems that it is the problem of pyspark.
>> In the console, it shows the following logs:
>>
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> *  File "/root/spark/python/pyspark/mllib/recommendation.py", line 76, in
>> trainImplicit
>> 14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 4.3 in stage
>> 975.0 (TID 1653, ip-172-31-35-240.ec2.internal): TaskKilled (killed
>> intentionally)
>> ratingBytes._jrdd, rank, iterations, lambda_, blocks, alpha)*
>>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 538, in __call__
>>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
>> 300, in get_return_value
>> py4j.protocol.Py4JJavaError14/10/16 19:22:44 WARN scheduler.TaskSetManager:
>> Lost task 8.2 in stage 975.0 (TID 1650, ip-172-31-35-241.ec2.internal):
>> TaskKilled (killed intentionally)
>> : An error occurred while calling o32.trainImplicitALSModel.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
>> in stage 975.0 failed 4 times, most recent failure: Lost task 6.3 in stage
>> 975.0 (TID 1651, ip-172-31-35-237.ec2.internal):
>> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException:
>> scala.collection.mutable.HashSet
>> Serialization trace:
>> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
>>
>> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>>
>> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
>> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
>> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>>
>> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
>>
>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
>> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>>
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
>>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>>
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> org.apache.spark.scheduler.Task.run(Task.scala:54)
>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:745)
>> Driver stacktrace:
>> at
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D

Re: How to delete file/folder in amazon s3 using pyspark?

2014-10-16 Thread freedafeng
boto?

This is not a spark question, but a python question.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-delete-file-folder-in-amazon-s3-using-pyspark-tp16616p16623.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



scala: java.net.BindException?

2014-10-16 Thread ll
hello... does anyone know how to resolve this issue?  i'm running this
locally on my computer.  keep getting this BindException.  much appreciated.

14/10/16 17:48:13 WARN component.AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already
in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
at org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:202)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:102)
at org.apache.spark.SparkContext.(SparkContext.scala:224)
at
nn.SimpleNeuralNetwork$delayedInit$body.apply(SimpleNeuralNetwork.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/scala-java-net-BindException-tp16624.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
The plan is to create an EC2 cluster and run the (py) spark on it. Input data
is from s3, output data goes to an hbase in a persistent cluster (also EC2).
My questions are:

1. I need to install some software packages on all the workers (sudo apt-get
install ...). Is there a better way to do this than going to every node to
manually install them?

2. I assume the spark can access the hbase which is in a different cluster.
Am I correct? if yes, how?

Thanks!
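
For question 2, this generally works as long as the workers can reach the other
cluster's ZooKeeper quorum and region servers. A minimal Scala sketch (hostnames and
table name are made up; the same two settings are what matter when driving it from
pyspark):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("hbase-read-sketch"))

  val hbaseConf = HBaseConfiguration.create()
  // point the client at the other cluster's ZooKeeper quorum
  hbaseConf.set("hbase.zookeeper.quorum", "zk1.internal,zk2.internal,zk3.internal")
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

  val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  println(rdd.count())
  sc.stop()
}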



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EC2-cluster-set-up-and-access-to-HBase-in-a-different-cluster-tp16622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
Could you post the code that have problem with pyspark? thanks!

Davies

On Thu, Oct 16, 2014 at 12:27 PM, Gen  wrote:
> I tried the same data with scala. It works pretty well.
> It seems that it is the problem of pyspark.
> In the console, it shows the following logs:
>
> Traceback (most recent call last):
>   File "", line 1, in 
> *  File "/root/spark/python/pyspark/mllib/recommendation.py", line 76, in
> trainImplicit
> 14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 4.3 in stage
> 975.0 (TID 1653, ip-172-31-35-240.ec2.internal): TaskKilled (killed
> intentionally)
> ratingBytes._jrdd, rank, iterations, lambda_, blocks, alpha)*
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
> py4j.protocol.Py4JJavaError14/10/16 19:22:44 WARN scheduler.TaskSetManager:
> Lost task 8.2 in stage 975.0 (TID 1650, ip-172-31-35-241.ec2.internal):
> TaskKilled (killed intentionally)
> : An error occurred while calling o32.trainImplicitALSModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
> in stage 975.0 failed 4 times, most recent failure: Lost task 6.3 in stage
> 975.0 (TID 1651, ip-172-31-35-237.ec2.internal):
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException:
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
>
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
>
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.

Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
On Thu, Oct 16, 2014 at 9:53 AM, Gen  wrote:
> Hi,
>
> I am trying to use ALS.trainImplicit method in the
> pyspark.mllib.recommendation. However it didn't work. So I tried use the
> example in the python API documentation such as:
>
> /r1 = (1, 1, 1.0)
> r2 = (1, 2, 2.0)
> r3 = (2, 1, 2.0)
> ratings = sc.parallelize([r1, r2, r3])
> model = ALS.trainImplicit(ratings, 1) /
>
> It didn't work neither. After searching in google, I found that there are
> only two overloads for ALS.trainImplicit in the scala script. So I tried
> /model = ALS.trainImplicit(ratings, 1, 1)/, it worked. But if I set the
> iterations other than 1,  /model = ALS.trainImplicit(ratings, 1, 2)/ or
> /model = ALS.trainImplicit(ratings, 4, 2)/ for example, it generated error.
> The information is as follows:

The Python API has default values for all of the other arguments, so you can
call it with only rank=1 (the Scala API has no default for iterations).

I'm curious how you ran into this problem.

> count at ALS.scala:314
>
> Job aborted due to stage failure: Task 6 in stage 189.0 failed 4 times, most
> recent failure: Lost task 6.3 in stage 189.0 (TID 626,
> ip-172-31-35-239.ec2.internal): com.esotericsoftware.kryo.KryoException:
> java.lang.ArrayStoreException: scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
>
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
>
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>
> It is really strange, because count at ALS.scala:314 is already out the loop
> of iterations. Any idea?
> Thanks a lot for advance.
>
> FYI: I used spark 1.1.0 and ALS.train() works pretty well for all the cases.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/ALS-implicit-error-pyspark-tp16595.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...

Re: Can's create Kafka stream in spark shell

2014-10-16 Thread Gary Zhao
Same error. I saw someone reported the same issue, e.g.
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-kafka-error-td9106.html

Should I use "sbt assembly"? It failed with a deduplicate error though.

[error] (*:assembly) deduplicate: different file contents found in the
following:
[error]
/Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
[error]
/Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
[error]
/Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
[error]
/Users/gzhao/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
[error] Total time: 4 s, completed Oct 16, 2014 1:58:41 PM
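
One approach that often reduces these deduplicate conflicts (a sketch, not something
suggested in this thread): mark the Spark artifacts that the cluster already provides
as "provided", so only the Kafka pieces end up in the assembled jar. The
META-INF/ECLIPSEF.RSA signature files above typically come in via Spark's transitive
Jetty (orbit) dependencies, so keeping spark-core and spark-streaming out of the fat
jar usually sidesteps the merge:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0",
  "org.apache.kafka" %% "kafka"                 % "0.8.0"
)

Alternatively, sbt-assembly's merge strategy can be configured to discard META-INF
signature files; see the sbt-assembly README for the syntax matching your plugin version.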


On Thu, Oct 16, 2014 at 12:11 PM, Akhil Das 
wrote:

> Can you try:
>
> sbt:
>
> name := "Simple Project"
>
>
> version := "1.1"
>
>
> scalaVersion := "2.10.4"
>
>
> libraryDependencies ++= Seq(
>
> "org.apache.spark" %% "spark-core" % "1.1.0",
>
> "org.apache.spark" %% "spark-streaming" % "1.1.0",
>
> "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
>
>   )
>
> Thanks
> Best Regards
>
> On Fri, Oct 17, 2014 at 12:36 AM, Gary Zhao  wrote:
>
>> Thanks Akhil. I tried spark-submit and saw the same issue. I double
>> checked the versions and they look ok. Are you seeing any obvious issues?
>>
>> sbt:
>>
>> name := "Simple Project"
>>
>>
>> version := "1.1"
>>
>>
>> scalaVersion := "2.10.4"
>>
>>
>> libraryDependencies ++= Seq(
>>
>> "org.apache.spark" %% "spark-core" % "1.1.0",
>>
>> "org.apache.spark" %% "spark-streaming" % "1.1.0",
>>
>> "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0",
>>
>> "org.apache.kafka" %% "kafka" % "0.8.0"
>>
>>   )
>>
>>
>> spark-1.1.0-bin-hadoop1/bin/spark-submit --class "main.scala.SimpleApp"
>> --master "local[2]" simple-project_2.10-1.1.jar --jars
>> spark-streaming-kafka_2.10-1.1.0.jar,kafka_2.10-0.8.0.jar
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/streaming/kafka/KafkaUtils$
>> at main.scala.SimpleApp$delayedInit$body.apply(SimpleApp.scala:15)
>> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>> at
>> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>> at scala.App$$anonfun$main$1.apply(App.scala:71)
>> at scala.App$$anonfun$main$1.apply(App.scala:71)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>> at scala.App$class.main(App.scala:71)
>> at main.scala.SimpleApp$.main(SimpleApp.scala:11)
>> at main.scala.SimpleApp.main(SimpleApp.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.streaming.kafka.KafkaUtils$
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 17 more
>>
>>
>> On Tue, Oct 14, 2014 at 12:05 AM, Akhil Das 
>> wrote:
>>
>>> Just make sure you have the same version of spark-streaming-kafka
>>> 
>>> jar and spark in your classpath.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Tue, Oct 14, 2014 at 9:02 AM, Gary Zhao  wrote:
>>>
 Hello

 I'm trying to connect kafka in spark shell, but failed as below. Could
 you take a look what I missed.

 scala>  val kafkaStream = KafkaUtils.createStream(ssc,
 "test-vip.snc1:2181", "test_spark", Map("user-test"->1))
 error: bad symbolic reference. A signature in KafkaUtils.class refers
 to term serializer
 in value kafka which is not available.
 It may be completely missing from the current classpath, or the version
 on
 the classpath might be incompatible with the version used when
 compiling KafkaUtils.class.

 Thanks
 Gary

>>>
>>>
>>
>


Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-16 Thread Michael Campbell
TL;DR - a spark SQL job fails with an OOM (Out of heap space) error.  If
given "--executor-memory" values, it won't even start.  Even (!) if the
values given ARE THE SAME AS THE DEFAULT.



Without --executor-memory:

14/10/16 17:14:58 INFO TaskSetManager: Serialized task 1.0:64 as 14710
bytes in 1 ms
14/10/16 17:14:58 WARN TaskSetManager: Lost TID 26 (task 1.0:25)
14/10/16 17:14:58 WARN TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at
parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:609)
at
parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
...


USING --executor-memory (WITH ANY VALUE), even "1G" which is the default:

Parsed arguments:
  master  spark://:7077
  deployMode  null
  executorMemory  1G
...

System properties:
spark.executor.memory -> 1G
spark.eventLog.enabled -> true
...

14/10/16 17:14:23 INFO TaskSchedulerImpl: Adding task set 1.0 with 678 tasks
14/10/16 17:14:38 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory



Spark 1.0.0.  Is this a bug?


Re: Spark assembly for YARN/CDH5

2014-10-16 Thread Marcelo Vanzin
Hi Philip,

The assemblies are part of the CDH distribution. You can get them here:
http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html

As of Spark 1.1 (and, thus, CDH 5.2), assemblies are not published to
maven repositories anymore (you can see commit [1] for details).

[1] https://github.com/apache/spark/commit/f493f7982b50


On Thu, Oct 16, 2014 at 11:12 AM, Philip Ogren  wrote:
> Does anyone know if there are Spark assemblies created and available for
> download that have been built for CDH5 and YARN?
>
> Thanks,
> Philip
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: TF-IDF in Spark 1.1.0

2014-10-16 Thread Burke Webster
Thanks for the response.  Appreciate the help!

Burke

On Tue, Oct 14, 2014 at 3:00 PM, Xiangrui Meng  wrote:

> You cannot recover the document from the TF-IDF vector, because
> HashingTF is not reversible. You can assign each document a unique ID,
> and join back the result after training. HasingTF can transform
> individual record:
>
> val docs: RDD[(String, Seq[String])] = ...
>
> val tf = new HashingTF()
> val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)
>
> ...
>
> Best,
> Xiangrui
>
> On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster 
> wrote:
> > I'm following the Mllib example for TF-IDF and ran into a problem due to
> my
> > lack of knowledge of Scala and spark.  Any help would be greatly
> > appreciated.
> >
> > Following the Mllib example I could do something like this:
> >
> > import org.apache.spark.rdd.RDD
> > import org.apache.spark.SparkContext
> > import org.apache.spark.mllib.feature.HashingTF
> > import org.apache.spark.mllib.linalg.Vector
> > import org.apache.spark.mllib.feature.IDF
> >
> > val sc: SparkContext = ...
> > val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split("
> ").toSeq)
> >
> > val hashingTF = new HashingTF()
> > val tf: RDD[Vector] = hasingTF.transform(documents)
> > tf.cache()
> >
> > val idf = new IDF().fit(tf)
> > val tfidf: RDD[Vector] = idf.transform(tf)
> >
> > As a result I would have an RDD containing the TF-IDF vectors for the
> input
> > documents.  My question is how do I map the vector back to the original
> > input document?
> >
> > My end goal is to compute document similarity using cosine similarity.
> From
> > what I can tell, I can compute TF-IDF, apply the L2 norm, and then
> compute
> > the dot-product.  Has anybody done this?
> >
> > Currently, my example looks more like this:
> >
> > import org.apache.spark.SparkContext._
> > import org.apache.spark.SparkConf
> > import org.apache.spark.mllib.feature.HashingTF
> > import org.apache.spark.mllib.feature.IDF
> > import org.apache.spark.mllib.linalg.Vector
> > import org.apache.spark.rdd.RDD
> > import org.apache.spark.SparkContext
> >
> > val sc: SparkContext = ...
> >
> > // input is sequence file of the form (docid: Text, content: Text)
> > val data: RDD[(String, String)] = sc.sequenceFile[String,
> String](“corpus”)
> >
> > val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split("
> > ").toSeq)
> >
> > val hashingTF = new HashingTF()
> > val tf: RDD[(String, Vector)] = hashingTF.??
> >
> > I'm trying to maintain some linking from the document identifier to its
> > eventual vector representation.  Am I going about this incorrectly?
> >
> > Thanks
>
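
A rough sketch of the cosine-similarity step described above (illustrative only, not
from the thread; it densifies the vectors with toArray, so it is fine for explanation
but too slow for the 2^20-dimensional vectors HashingTF produces by default):

import org.apache.spark.mllib.linalg.Vector

def cosine(a: Vector, b: Vector): Double = {
  val x = a.toArray
  val y = b.toArray
  val dot   = x.zip(y).map { case (xi, yi) => xi * yi }.sum   // dot product
  val normA = math.sqrt(x.map(v => v * v).sum)                // L2 norms
  val normB = math.sqrt(y.map(v => v * v).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

If the TF-IDF vectors are L2-normalized up front, the dot product alone already gives
the cosine similarity, which is what the original question sketches.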


Re: Exception while reading SendingConnection to ConnectionManagerId

2014-10-16 Thread Jimmy Li
Does anyone know anything re: this error? Thank you!

On Wed, Oct 15, 2014 at 3:38 PM, Jimmy Li  wrote:

> Hi there, I'm running spark on ec2, and am running into an error there
> that I don't get locally. Here's the error:
>
> 11335 [handle-read-write-executor-3] ERROR
> org.apache.spark.network.SendingConnection  - Exception while reading
> SendingConnection to ConnectionManagerId([IP HERE])
> java.nio.channels.ClosedChannelException
>
> Does anyone know what might be causing this? Spark is running on my ec2
> instances.
>
> Thanks,
> Jimmy
>


RE: Play framework

2014-10-16 Thread Mohammed Guller
Thanks, Suren and Raju.

Raju – if I remember correctly, Play package command just creates a jar for 
your app. That jar file will not include other dependencies. So it is not 
really a full jar as you mentioned below. So how are you passing all the other 
dependency jars to Spark? Can you share that piece of code? Also is there any 
specific reason why you are not using play dist instead?

Mohammed

From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 11:41 AM
To: Surendranauth Hiraman; Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework


​We integrated Spark into Play and use SparkSQL extensively on an ec2 spark 
cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.



Step 1: Create a play scala application as usual

Step 2: In Build.sbt put all your Spark dependencies. What works for us is Play 
2.2.3, Scala 2.10.4, Spark 1.1. We have Akka 2.2.3. This is straightforward.

Step 3: As Daniel mentioned, create the Spark context within Play. The rest of the 
application is as usual.

Step 4: Create a full jar using Play package and use that package to be included 
in the library of jars passed to the Spark context.

Step 5: Play run as usual.



It works very well, and the convenience is, we have all scala application 
throughout.



Regards

Raju




From: Surendranauth Hiraman 
mailto:suren.hira...@velos.io>>
Sent: Thursday, October 16, 2014 12:42 PM
To: Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework

Mohammed,

Jumping in for Daniel, we actually address the configuration issue by pulling 
values from environment variables or command line options. Maybe that can 
handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt: 
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exceptions. I have seen some blog posts that Spark and Play use 
different versions of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka-related exceptions. I suspect that the 
settings in the build.scala file for my play project are incorrect.

Mohammed

From: Daniel Siegmann 
[mailto:daniel.siegm...@velos.io]
Sent: Thursday, October 16, 2014 7:15 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Play framework

We execute Spark jobs from a Play application but we don't use spark-submit. I 
don't know if you really want to use spark-submit, but if not you can just 
create a SparkContext programmatically in your app.
In development I typically run Spark locally. Creating the Spark context is 
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(conf)

It is important to keep in mind you cannot have multiple local contexts (you 
can create them but you'll get odd errors), so if you are running things in 
parallel within your app (even unit tests) you'd need to share a context in 
this case. If you are running sequentially you can create a new local context 
each time, but you must make sure to call SparkContext.stop() when you're done.
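
A minimal sketch of the create-use-stop pattern just described (illustrative only):

val conf = new SparkConf().setMaster("local[*]").setAppName("My Awesome App")
val sc = new SparkContext(conf)
try {
  // ... run the Spark work ...
} finally {
  sc.stop()   // always stop the local context so the next one can start cleanly
}
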
Running against a cluster is a bit more complicated because you need to add all 
your dependency jars. I'm not sure how to get this to work with play run. I 
stick to building the app with play dist and then running against the packaged 
application, because it very conveniently provides all the dependencies in a 
lib folder. Here is some code to load all the paths you need from the dist:

def libs: Seq[String] = {
  val libDir = play.api.Play.application.getFile("lib")

  logger.info(s"SparkContext will be initialized with libraries from directory $libDir")

  return if (libDir.exists) {
    libDir.listFiles().map(_.getCanonicalFile().getAbsolutePath()).filter(_.endsWith(".jar"))
  } else {
    throw new IllegalStateException(s"lib dir is missing: $libDir")
  }
}
Creating the context is similar to above, but with this extra line:

conf.setJars(libs)
I hope this helps. I shoul

Re: reverse an rdd

2014-10-16 Thread Paweł Szulc
Never mind, I've just run the code in the REPL. Indeed, if we do not sort,
then the order is totally random, which actually makes sense if you think
about it.


On Thu, Oct 16, 2014 at 9:58 PM, Paweł Szulc  wrote:

> Just to have this clear, can you answer with quick yes or no:
>
> Does it mean that when I create RDD from a file and I simply iterate
> through it like this:
>
>  sc.textFile("some_text_file.txt").foreach(line => println(line))
>
> then the actual lines might come in a different order than they are in the
> file?
>
> On Thu, Oct 16, 2014 at 9:13 PM, Sean Owen  wrote:
>
>> Since you're concerned with the particular ordering, you will need to
>> sort your RDD to ensure the ordering you have in mind. Simply reverse
>> the Ordering with Ordering.reverse() and sort by that instead, and
>> then use toLocalIterator() I suppose.
>>
>> Depending on what you're really trying to achieve, there may be a better
>> way.
>>
>> On Thu, Oct 16, 2014 at 2:49 PM, ll  wrote:
>> > hello... what is the best way to iterate through an rdd backward (last
>> > element first, first element last)?  thanks!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/reverse-an-rdd-tp16602.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


How to name a DStream

2014-10-16 Thread Soumitra Kumar
Hello,

I am debugging my code to find out what else to cache.

Following is a line in log:

14/10/16 12:00:01 INFO TransformedDStream: Persisting RDD 6 for time 
141348600 ms to StorageLevel(true, true, false, false, 1) at time 
141348600 ms

Is there a way to name a DStream? RDD has a name method, but DStream does not.

Please let me know if there is a way to map the DStream to a location in my source.

Thanks,
-Soumitra.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: reverse an rdd

2014-10-16 Thread Paweł Szulc
Just to have this clear, can you answer with quick yes or no:

Does it mean that when I create RDD from a file and I simply iterate
through it like this:

 sc.textFile("some_text_file.txt").foreach(line => println(line))

then the actual lines might come in a different order than they are in the
file?

On Thu, Oct 16, 2014 at 9:13 PM, Sean Owen  wrote:

> Since you're concerned with the particular ordering, you will need to
> sort your RDD to ensure the ordering you have in mind. Simply reverse
> the Ordering with Ordering.reverse() and sort by that instead, and
> then use toLocalIterator() I suppose.
>
> Depending on what you're really trying to achieve, there may be a better
> way.
>
> On Thu, Oct 16, 2014 at 2:49 PM, ll  wrote:
> > hello... what is the best way to iterate through an rdd backward (last
> > element first, first element last)?  thanks!
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/reverse-an-rdd-tp16602.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


hi all

2014-10-16 Thread Paweł Szulc
Hi,

I just wanted to say hi to the Spark community. I'm developing some
stuff right now using Spark (we've started very recently). As the API
documentation of Spark is really good, I'd like to get deeper
knowledge of the internal stuff - you know, the goodies. Watching videos
from Spark Summits helps; nevertheless, I hope to learn a lot from reading
this mailing list.

Regards,
Pawel Szulc


Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
It seems like a bug. Could you create a JIRA for it? Thanks!

Davies

On Thu, Oct 16, 2014 at 12:27 PM, Gen  wrote:
> I tried the same data with Scala. It works pretty well.
> It seems that it is a problem with pyspark.
> In the console, it shows the following logs:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> *  File "/root/spark/python/pyspark/mllib/recommendation.py", line 76, in
> trainImplicit
> 14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 4.3 in stage
> 975.0 (TID 1653, ip-172-31-35-240.ec2.internal): TaskKilled (killed
> intentionally)
> ratingBytes._jrdd, rank, iterations, lambda_, blocks, alpha)*
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
> py4j.protocol.Py4JJavaError14/10/16 19:22:44 WARN scheduler.TaskSetManager:
> Lost task 8.2 in stage 975.0 (TID 1650, ip-172-31-35-241.ec2.internal):
> TaskKilled (killed intentionally)
> : An error occurred while calling o32.trainImplicitALSModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
> in stage 975.0 failed 4 times, most recent failure: Lost task 6.3 in stage
> 975.0 (TID 1651, ip-172-31-35-237.ec2.internal):
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException:
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
>
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
>
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGSched

Re: ALS implicit error pyspark

2014-10-16 Thread Gen
I tried the same data with Scala. It works pretty well.
It seems that it is a problem with pyspark.
In the console, it shows the following logs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
*  File "/root/spark/python/pyspark/mllib/recommendation.py", line 76, in
trainImplicit
14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 4.3 in stage
975.0 (TID 1653, ip-172-31-35-240.ec2.internal): TaskKilled (killed
intentionally)
ratingBytes._jrdd, rank, iterations, lambda_, blocks, alpha)*
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError14/10/16 19:22:44 WARN scheduler.TaskSetManager:
Lost task 8.2 in stage 975.0 (TID 1650, ip-172-31-35-241.ec2.internal):
TaskKilled (killed intentionally)
: An error occurred while calling o32.trainImplicitALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6
in stage 975.0 failed 4 times, most recent failure: Lost task 6.3 in stage
975.0 (TID 1651, ip-172-31-35-237.ec2.internal):
com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException:
scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
   
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
   
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
   
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
   
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
   
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.

Re: reverse an rdd

2014-10-16 Thread Sean Owen
Since you're concerned with the particular ordering, you will need to
sort your RDD to ensure the ordering you have in mind. Simply reverse
the Ordering with Ordering.reverse() and sort by that instead, and
then use toLocalIterator() I suppose.

Depending on what you're really trying to achieve, there may be a better way.
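
One way to get the effect described above (a sketch only; it pairs each element with
its position and sorts on the position in descending order):

val rdd = sc.textFile("some_text_file.txt")
val reversed = rdd.zipWithIndex()                  // (line, original position)
  .map { case (line, idx) => (idx, line) }
  .sortByKey(ascending = false)                    // largest position first
  .values
reversed.toLocalIterator.foreach(println)          // last line of the file comes out first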

On Thu, Oct 16, 2014 at 2:49 PM, ll  wrote:
> hello... what is the best way to iterate through an rdd backward (last
> element first, first element last)?  thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/reverse-an-rdd-tp16602.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can's create Kafka stream in spark shell

2014-10-16 Thread Akhil Das
Can you try:

sbt:

name := "Simple Project"


version := "1.1"


scalaVersion := "2.10.4"


libraryDependencies ++= Seq(

"org.apache.spark" %% "spark-core" % "1.1.0",

"org.apache.spark" %% "spark-streaming" % "1.1.0",

"org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"

  )

Thanks
Best Regards

On Fri, Oct 17, 2014 at 12:36 AM, Gary Zhao  wrote:

> Thanks Akhil. I tried spark-submit and saw the same issue. I double
> checked the versions and they look ok. Are you seeing any obvious issues?
>
> sbt:
>
> name := "Simple Project"
>
>
> version := "1.1"
>
>
> scalaVersion := "2.10.4"
>
>
> libraryDependencies ++= Seq(
>
> "org.apache.spark" %% "spark-core" % "1.1.0",
>
> "org.apache.spark" %% "spark-streaming" % "1.1.0",
>
> "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0",
>
> "org.apache.kafka" %% "kafka" % "0.8.0"
>
>   )
>
>
> spark-1.1.0-bin-hadoop1/bin/spark-submit --class "main.scala.SimpleApp"
> --master "local[2]" simple-project_2.10-1.1.jar --jars
> spark-streaming-kafka_2.10-1.1.0.jar,kafka_2.10-0.8.0.jar
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/streaming/kafka/KafkaUtils$
> at main.scala.SimpleApp$delayedInit$body.apply(SimpleApp.scala:15)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.App$class.main(App.scala:71)
> at main.scala.SimpleApp$.main(SimpleApp.scala:11)
> at main.scala.SimpleApp.main(SimpleApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.streaming.kafka.KafkaUtils$
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 17 more
>
>
> On Tue, Oct 14, 2014 at 12:05 AM, Akhil Das 
> wrote:
>
>> Just make sure you have the same version of spark-streaming-kafka
>> 
>> jar and spark in your classpath.
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Oct 14, 2014 at 9:02 AM, Gary Zhao  wrote:
>>
>>> Hello
>>>
>>> I'm trying to connect kafka in spark shell, but failed as below. Could
>>> you take a look what I missed.
>>>
>>> scala>  val kafkaStream = KafkaUtils.createStream(ssc,
>>> "test-vip.snc1:2181", "test_spark", Map("user-test"->1))
>>> error: bad symbolic reference. A signature in KafkaUtils.class refers to
>>> term serializer
>>> in value kafka which is not available.
>>> It may be completely missing from the current classpath, or the version
>>> on
>>> the classpath might be incompatible with the version used when compiling
>>> KafkaUtils.class.
>>>
>>> Thanks
>>> Gary
>>>
>>
>>
>


Re: Spark assembly for YARN/CDH5

2014-10-16 Thread Sean Owen
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-assembly_2.10/
 ?

I'm not sure why the 5.2 + 1.1 final artifacts don't show up there yet though.

On Thu, Oct 16, 2014 at 2:12 PM, Philip Ogren  wrote:
> Does anyone know if there are Spark assemblies created and available for
> download that have been built for CDH5 and YARN?
>
> Thanks,
> Philip
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can's create Kafka stream in spark shell

2014-10-16 Thread Gary Zhao
Thanks Akhil. I tried spark-submit and saw the same issue. I double checked
the versions and they look ok. Are you seeing any obvious issues?

sbt:

name := "Simple Project"


version := "1.1"


scalaVersion := "2.10.4"


libraryDependencies ++= Seq(

"org.apache.spark" %% "spark-core" % "1.1.0",

"org.apache.spark" %% "spark-streaming" % "1.1.0",

"org.apache.spark" %% "spark-streaming-kafka" % "1.1.0",

"org.apache.kafka" %% "kafka" % "0.8.0"

  )


spark-1.1.0-bin-hadoop1/bin/spark-submit --class "main.scala.SimpleApp"
--master "local[2]" simple-project_2.10-1.1.jar --jars
spark-streaming-kafka_2.10-1.1.0.jar,kafka_2.10-0.8.0.jar

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
at main.scala.SimpleApp$delayedInit$body.apply(SimpleApp.scala:15)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at main.scala.SimpleApp$.main(SimpleApp.scala:11)
at main.scala.SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaUtils$
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 17 more
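
One thing worth double-checking in the command above (a general spark-submit point,
not something raised in this thread): spark-submit treats everything after the
application jar as arguments to the application itself, so a --jars flag placed after
the jar is ignored. Passing it before the jar would look like:

spark-1.1.0-bin-hadoop1/bin/spark-submit --class "main.scala.SimpleApp" \
  --master "local[2]" \
  --jars spark-streaming-kafka_2.10-1.1.0.jar,kafka_2.10-0.8.0.jar \
  simple-project_2.10-1.1.jar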


On Tue, Oct 14, 2014 at 12:05 AM, Akhil Das 
wrote:

> Just make sure you have the same version of spark-streaming-kafka
> 
> jar and spark in your classpath.
>
> Thanks
> Best Regards
>
> On Tue, Oct 14, 2014 at 9:02 AM, Gary Zhao  wrote:
>
>> Hello
>>
>> I'm trying to connect kafka in spark shell, but failed as below. Could
>> you take a look what I missed.
>>
>> scala>  val kafkaStream = KafkaUtils.createStream(ssc,
>> "test-vip.snc1:2181", "test_spark", Map("user-test"->1))
>> error: bad symbolic reference. A signature in KafkaUtils.class refers to
>> term serializer
>> in value kafka which is not available.
>> It may be completely missing from the current classpath, or the version on
>> the classpath might be incompatible with the version used when compiling
>> KafkaUtils.class.
>>
>> Thanks
>> Gary
>>
>
>


reverse an rdd

2014-10-16 Thread ll
hello... what is the best way to iterate through an rdd backward (last
element first, first element last)?  thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reverse-an-rdd-tp16602.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Larger heap leads to perf degradation due to GC

2014-10-16 Thread Andrew Ash
I wonder if it's worth logging a warning on executor startup if the heap
size is >32GB?  Given that this has come up a few times it might be good to
embed this community knowledge into the product instead of email threads.
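
For reference, the executor-side knobs discussed below can also be set
programmatically; a minimal sketch (the values are illustrative, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-tuning-sketch")
  // keeping each executor heap below ~32g lets the JVM keep using compressed oops
  .set("spark.executor.memory", "30g")
  // the G1 collector mentioned below tends to behave better on large heaps
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
val sc = new SparkContext(conf)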

On Thu, Oct 16, 2014 at 7:47 AM, Akshat Aranya  wrote:

> I just want to pitch in and say that I ran into the same problem with
> running with 64GB executors.  For example, some of the tasks take 5 minutes
> to execute, out of which 4 minutes are spent in GC.  I'll try out smaller
> executors.
>
> On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> The other option to consider is using G1 GC, which should behave better
>> with large heaps.  But pointers are not compressed in heaps > 32 GB in
>> size, so you may be better off staying under 32 GB.
>>
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Mon, Oct 6, 2014 at 8:08 PM, Mingyu Kim  wrote:
>>
>>> Ok, cool. This seems to be general issues in JVM with very large heaps.
>>> I agree that the best workaround would be to keep the heap size below 32GB.
>>> Thanks guys!
>>>
>>> Mingyu
>>>
>>> From: Arun Ahuja 
>>> Date: Monday, October 6, 2014 at 7:50 AM
>>> To: Andrew Ash 
>>> Cc: Mingyu Kim , "user@spark.apache.org" <
>>> user@spark.apache.org>, Dennis Lawler 
>>> Subject: Re: Larger heap leads to perf degradation due to GC
>>>
>>> We have used the strategy that you suggested, Andrew - using many
>>> workers per machine and keeping the heaps small (< 20gb).
>>>
>>> Using a large heap resulted in workers hanging or not responding
>>> (leading to timeouts).  The same dataset/job for us will fail (most often
>>> due to akka disassociated or fetch failures errors) with 10 cores / 100
>>> executors, 60 gb per executor while succceed with 1 core / 1000 executors /
>>> 6gb per executor.
>>>
>>> When the job does succceed with more cores per executor and larger heap
>>> it is usually much slower than the smaller executors (the same 8-10 min job
>>> taking 15-20 min to complete)
>>>
>>> The unfortunate downside of this has been, we have had some large
>>> broadcast variables which may not fit into memory (and unnecessarily
>>> duplicated) when using the smaller executors.
>>>
>>> Most of this is anecdotal but for the most part we have had more success
>>> and consistency with more executors with smaller memory requirements.
>>>
>>> On Sun, Oct 5, 2014 at 7:20 PM, Andrew Ash  wrote:
>>>
 Hi Mingyu,

 Maybe we should be limiting our heaps to 32GB max and running multiple
 workers per machine to avoid large GC issues.

 For a 128GB memory, 32 core machine, this could look like:

 SPARK_WORKER_INSTANCES=4
 SPARK_WORKER_MEMORY=32
 SPARK_WORKER_CORES=8

 Are people running with large (32GB+) executor heaps in production?
 I'd be curious to hear if so.

 Cheers!
 Andrew

 On Thu, Oct 2, 2014 at 1:30 PM, Mingyu Kim  wrote:

> This issue definitely needs more investigation, but I just wanted to
> quickly check if anyone has run into this problem or has general guidance
> around it. We’ve seen a performance degradation with a large heap on a
> simple map task (I.e. No shuffle). We’ve seen the slowness starting around
> from 50GB heap. (I.e. spark.executor.memory=50g) And, when we checked the
> CPU usage, there were just a lot of GCs going on.
>
> Has anyone seen a similar problem?
>
> Thanks,
> Mingyu
>


>>>
>>
>


Re: Play framework

2014-10-16 Thread US Office Admin
​We integrated Spark into Play and use SparkSQL extensively on an ec2 spark 
cluster on Hadoop hdfs 1.2.1 and tachyon 0.4.


Step 1: Create a play scala application as usual

Step 2: In Build.sbt put all your Spark dependencies. What works for us is Play 
2.2.3, Scala 2.10.4, Spark 1.1. We have Akka 2.2.3. This is straightforward.

Step 3: As Daniel mentioned, create the Spark context within Play. The rest of the 
application is as usual.

Step 4: Create a full jar using Play package and use that package to be included 
in the library of jars passed to the Spark context.

Step 5: Play run as usual.


It works very well, and the convenience is, we have all scala application 
throughout.


Regards

Raju





From: Surendranauth Hiraman 
Sent: Thursday, October 16, 2014 12:42 PM
To: Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re: Play framework

Mohammed,

Jumping in for Daniel, we actually address the configuration issue by pulling 
values from environment variables or command line options. Maybe that can 
handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt: 
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exceptions. I have seen some blog posts that Spark and Play use 
different versions of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka-related exceptions. I suspect that the 
settings in the build.scala file for my play project are incorrect.

Mohammed

From: Daniel Siegmann 
[mailto:daniel.siegm...@velos.io]
Sent: Thursday, October 16, 2014 7:15 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Play framework

We execute Spark jobs from a Play application but we don't use spark-submit. I 
don't know if you really want to use spark-submit, but if not you can just 
create a SparkContext programmatically in your app.
In development I typically run Spark locally. Creating the Spark context is 
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(conf)

It is important to keep in mind you cannot have multiple local contexts (you 
can create them but you'll get odd errors), so if you are running things in 
parallel within your app (even unit tests) you'd need to share a context in 
this case. If you are running sequentially you can create a new local context 
each time, but you must make sure to call SparkContext.stop() when you're done.
Running against a cluster is a bit more complicated because you need to add all 
your dependency jars. I'm not sure how to get this to work with play run. I 
stick to building the app with play dist and then running against the packaged 
application, because it very conveniently provides all the dependencies in a 
lib folder. Here is some code to load all the paths you need from the dist:

def libs: Seq[String] = {
  val libDir = play.api.Play.application.getFile("lib")

  logger.info(s"SparkContext will be initialized with libraries from directory $libDir")

  return if (libDir.exists) {
    libDir.listFiles().map(_.getCanonicalFile().getAbsolutePath()).filter(_.endsWith(".jar"))
  } else {
    throw new IllegalStateException(s"lib dir is missing: $libDir")
  }
}
Creating the context is similar to above, but with this extra line:

conf.setJars(libs)
I hope this helps. I should note that I don't use play run very much, at least 
not for when I'm actually executing Spark jobs. So I'm not sure if this 
integrates properly with that. I have unit tests which execute on Spark and 
have executed the dist package both locally and on a cluster. To make working 
with the dist locally easier, I wrote myself a little shell script to unzip and 
run the dist.


On Wed, Oct 15, 2014 at 10:51 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Hi –

Has anybody figured out how to integrate a Play application with Spark and run 
it on a Spark cluster using spark-submit script? I have seen some blogs about 
creating a simple Play app and running it

Re: TaskNotSerializableException when running through Spark shell

2014-10-16 Thread Jimmy McErlain
I actually only ran into this issue recently, after we upgraded to Spark
1.1.  Within the REPL for Spark 1.0 everything works fine, but within the
REPL for 1.1 it does not.  FYI, I am also only doing simple regex matching
functions within an RDD... When I run the same code as an app, everything
works fine... which leads me to believe that it is a bug within the REPL
for 1.1.

Can anyone else confirm this?





*JIMMY MCERLAIN*

DATA SCIENTIST (NERD)

*. . . . . . . . . . . . . . . . . .*


*IF WE CAN’T DOUBLE YOUR SALES,*



*ONE OF US IS IN THE WRONG BUSINESS.*

*E*: ji...@sellpoints.com

*M*: *510.303.7751*

On Thu, Oct 16, 2014 at 7:56 AM, Akshat Aranya  wrote:

> Hi,
>
> Can anyone explain how things get captured in a closure when running
> through the REPL.  For example:
>
> def foo(..) = { .. }
>
> rdd.map(foo)
>
> sometimes complains about classes not being serializable that are
> completely unrelated to foo.  This happens even when I write it such:
>
> object Foo {
>   def foo(..) = { .. }
> }
>
> rdd.map(Foo.foo)
>
> It also doesn't happen all the time.
>
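
A general pattern that often helps with closure-capture surprises (a sketch with
made-up names, not a confirmed fix for the 1.1 REPL behaviour above): copy whatever
the closure needs into a local val, so Spark only has to serialize that value rather
than the enclosing object.

import org.apache.spark.rdd.RDD

class Matcher(val pattern: String) {
  def countMatches(rdd: RDD[String]): Long = {
    val p = pattern                     // local copy: the closure captures only this String,
    rdd.filter(_.matches(p)).count()    // not the whole (possibly non-serializable) Matcher
  }
}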


Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-16 Thread Gen
Hi,

You just need to add list() in the sorted function.
For example,
map((lambda (x,y): (x, (list(y[0]), list(y[1])))),
sorted(list(rdd1.cogroup(rdd2).collect())))


I think you just forgot the list...

PS: your post has NOT been accepted by the mailing list yet.

Best 
Gen


pm wrote
> Hi ,
> 
> Thanks for reply ,
> 
> 
> now after doing cogroup mentioned in below,
> 
> merge_rdd = map((lambda (x,y): (x, (list(y[0]), list(y[1],
> sorted((rdd1.cogroup(rdd2).collect(
> 
> map((lambda (x,y): (x, (list(y[0]), list(y[1],
> sorted((merge_rdd.cogroup(rdd3).collect(
> 
> 
> i m getting output like  
> 
> 
> [((u'abc', u'0010'),
>   ([(<pyspark.resultiterable.ResultIterable object at 0x...>,
>      <pyspark.resultiterable.ResultIterable object at 0x...>)],
>    [[(u'address', u'2017 CAN'),
>      (u'address_city', u'VESTAVIA '),
> ]])),
>  ((u'abc', u'0020'),
>   ([(<pyspark.resultiterable.ResultIterable object at 0x...>,
>      <pyspark.resultiterable.ResultIterable object at 0x...>)],
>    [[(u'address', u'2017 CAN'),
>      (u'address_city', u'VESTAV'),
> ]]))]
> 
> How do I show the value for the object pyspark.resultiterable.ResultIterable at
> 0x4b1b4d0?
> 
> I want to show data for pyspark.resultiterable.ResultIterable at
> 0x4b1bd50.
> 
> 
> Could you please tell me the way to show data for those objects? I'm using
> Python.
> 
> 
> 
> Thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-operation-like-cogrop-groupbykey-on-pair-RDD-tp16487p16598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Sorry - I'll furnish some details below. However, union is not an option for
the business logic I have. The function will generate a specific file based
on a variable passed in as the setter for the function. This variable
changes with each RDD. I annotated the log line where the first run
succeeds. 

Logs output to give you some context:
...
14/10/16 15:32:25 INFO SparkContext: Starting job: count at
GenerateJSONContent.java:145
14/10/16 15:32:25 INFO DAGScheduler: Got job 0 (count at
GenerateJSONContent.java:145) with 1 output partitions (allowLocal=false)
14/10/16 15:32:25 INFO DAGScheduler: Final stage: Stage 0(count at
GenerateJSONContent.java:145)
14/10/16 15:32:25 INFO DAGScheduler: Parents of final stage: List()
14/10/16 15:32:25 INFO DAGScheduler: Missing parents: List()
14/10/16 15:32:25 INFO DAGScheduler: Submitting Stage 0
(ParallelCollectionRDD[2] at parallelize at GenerateJSONContent.java:82),
which has no missing parents
14/10/16 15:32:25 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0
(ParallelCollectionRDD[2] at parallelize at GenerateJSONContent.java:82)
14/10/16 15:32:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/16 15:32:25 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
executor localhost: localhost (PROCESS_LOCAL)
14/10/16 15:32:25 INFO TaskSetManager: Serialized task 0.0:0 as 2048 bytes
in 3 ms
14/10/16 15:32:25 INFO Executor: Running task ID 0
14/10/16 15:32:25 INFO Executor: Fetching
http://172.16.1.204:42232/jars/rickshaw-spark-0.0.1-SNAPSHOT.jar with
timestamp 1413473544838
14/10/16 15:32:25 INFO Utils: Fetching
http://172.16.1.204:42232/jars/rickshaw-spark-0.0.1-SNAPSHOT.jar to
/tmp/fetchFileTemp7432615579770034188.tmp
14/10/16 15:32:26 INFO Executor: Adding
file:/tmp/spark-a471145f-ab44-447b-b48c-bb499024d756/rickshaw-spark-0.0.1-SNAPSHOT.jar
to class loader
14/10/16 15:32:26 INFO Executor: Serialized size of result for 0 is 597
14/10/16 15:32:26 INFO Executor: Sending result for 0 directly to driver
14/10/16 15:32:26 INFO Executor: Finished task ID 0
14/10/16 15:32:26 INFO DAGScheduler: Completed ResultTask(0, 0)
14/10/16 15:32:26 INFO DAGScheduler: Stage 0 (count at
GenerateJSONContent.java:145) finished in 0.791 s
14/10/16 15:32:26 INFO SparkContext: Job finished: count at
GenerateJSONContent.java:145, took 0.889171151 s
14/10/16 15:32:26 INFO GenerateJSONContent: Running for tag
:PublicationProductId
14/10/16 15:32:26 INFO TaskSetManager: Finished TID 0 in 778 ms on localhost
(progress: 1/1)
14/10/16 15:32:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
14/10/16 15:32:26 INFO SparkContext: Starting job: foreach at
GenerateJSONContent.java:156
14/10/16 15:32:26 INFO DAGScheduler: Got job 1 (foreach at
GenerateJSONContent.java:156) with 1 output partitions (allowLocal=false)
14/10/16 15:32:26 INFO DAGScheduler: Final stage: Stage 1(foreach at
GenerateJSONContent.java:156)
14/10/16 15:32:26 INFO DAGScheduler: Parents of final stage: List()
14/10/16 15:32:26 INFO DAGScheduler: Missing parents: List()
14/10/16 15:32:26 INFO DAGScheduler: Submitting Stage 1
(ParallelCollectionRDD[2] at parallelize at GenerateJSONContent.java:82),
which has no missing parents
14/10/16 15:32:26 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1
(ParallelCollectionRDD[2] at parallelize at GenerateJSONContent.java:82)
14/10/16 15:32:26 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/10/16 15:32:26 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on
executor localhost: localhost (PROCESS_LOCAL)
14/10/16 15:32:26 INFO TaskSetManager: Serialized task 1.0:0 as 2792 bytes
in 1 ms
14/10/16 15:32:26 INFO Executor: Running task ID 1
14/10/16 15:32:35 INFO Executor: Serialized size of result for 1 is 559
14/10/16 15:32:35 INFO Executor: Sending result for 1 directly to driver
14/10/16 15:32:35 INFO Executor: Finished task ID 1
14/10/16 15:32:35 INFO DAGScheduler: Completed ResultTask(1, 0)
14/10/16 15:32:35 INFO DAGScheduler: Stage 1 (foreach at
GenerateJSONContent.java:156) finished in 9.098 s
14/10/16 15:32:35 INFO SparkContext: Job finished: foreach at
GenerateJSONContent.java:156, took 9.112736939 s
14/10/16 15:32:35 INFO TaskSetManager: Finished TID 1 in 9095 ms on
localhost (progress: 1/1)
14/10/16 15:32:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
14/10/16 15:32:36 INFO SparkContext: Starting job: count at
GenerateJSONContent.java:145
14/10/16 15:32:36 INFO DAGScheduler: Got job 2 (count at
GenerateJSONContent.java:145) with 1 output partitions (allowLocal=false)
14/10/16 15:32:36 INFO DAGScheduler: Final stage: Stage 2(count at
GenerateJSONContent.java:145)
14/10/16 15:32:36 INFO DAGScheduler: Parents of final stage: List()
14/10/16 15:32:36 INFO DAGScheduler: Missing parents: List()
14/10/16 15:32:36 INFO DAGScheduler: Submitting Stage 2
(ParallelCollectionRDD[1] at parallelize at GenerateJSONContent.java:82),
which has no missing parents
14/10/16 15:32:36 IN

Spark assembly for YARN/CDH5

2014-10-16 Thread Philip Ogren
Does anyone know if there are Spark assemblies created and available for 
download that have been built for CDH5 and YARN?


Thanks,
Philip

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ALS implicit error pyspark

2014-10-16 Thread Gen
Hi,

I am trying to use the ALS.trainImplicit method in
pyspark.mllib.recommendation. However, it didn't work. So I tried to use the
example in the python API documentation, such as:

/r1 = (1, 1, 1.0) 
r2 = (1, 2, 2.0) 
r3 = (2, 1, 2.0) 
ratings = sc.parallelize([r1, r2, r3]) 
model = ALS.trainImplicit(ratings, 1) /

It didn't work either. After searching on Google, I found that there are
only two overloads for ALS.trainImplicit in the scala script. So I tried
/model = ALS.trainImplicit(ratings, 1, 1)/, and it worked. But if I set the
iterations to something other than 1, /model = ALS.trainImplicit(ratings, 1, 2)/ or
/model = ALS.trainImplicit(ratings, 4, 2)/ for example, it generated an error.
The information is as follows:

count at ALS.scala:314

Job aborted due to stage failure: Task 6 in stage 189.0 failed 4 times, most
recent failure: Lost task 6.3 in stage 189.0 (TID 626,
ip-172-31-35-239.ec2.internal): com.esotericsoftware.kryo.KryoException:
java.lang.ArrayStoreException: scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
   
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
   
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
   
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
   
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
   
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:

It is really strange, because count at ALS.scala:314 is already outside the loop
of iterations. Any idea?
Thanks a lot in advance.

FYI: I used spark 1.1.0 and ALS.train() works pretty well for all the cases.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-implicit-error-pyspark-tp16595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark output to s3 extremely slow

2014-10-16 Thread Anny Chen
Hi Rafal,

Thanks for the explanation and solution! I need to write maybe 100 GB to
s3. I will try your way and see whether it works for me.

Thanks again!

On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny  wrote:

> Hi,
> How large is the dataset you're saving into S3?
>
> Actually saving to S3 is done in two steps:
> 1) writing temporary files
> 2) commiting them to proper directory
> Step 2) could be slow because S3 does not have a quick atomic "move"
> operation; you have to copy (server side, but it still takes time) and then
> delete the original.
>
> I've overcome this but using a jobconf with NullOutputCommitter
>   jobConf.setOutputCommitter(classOf[NullOutputCommitter])
>
> Where NullOutputCommitter is a class that doesn't do anything:
>
>   class NullOutputCommitter extends OutputCommitter {
> def abortTask(taskContext: TaskAttemptContext) =  { }
> override  def cleanupJob(jobContext: JobContext ) = { }
> def commitTask(taskContext: TaskAttemptContext ) = { }
> def needsTaskCommit(taskContext: TaskAttemptContext ) = {  false  }
> def setupJob(jobContext: JobContext) { }
> def setupTask(taskContext: TaskAttemptContext) { }
>   }
>
> This works but maybe someone has a better solution.
>
> /Raf
>
> anny9699 wrote:
> > Hi,
> >
> > I found writing output back to s3 using rdd.saveAsTextFile() is extremely
> > slow, much slower than reading from s3. Is there a way to make it faster?
> > The rdd has 150 partitions so parallelism is enough I assume.
> >
> > Thanks a lot!
> > Anny
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
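
For reference, a minimal sketch (not from the thread; sc, rdd and the bucket
path are illustrative assumptions) of wiring the no-op committer above into
saveAsHadoopFile through a JobConf. The trade-off is that nothing is committed
atomically, so a failed task can leave partial files behind.

  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
  import org.apache.spark.SparkContext._   // PairRDDFunctions (needed in Spark 1.x apps)

  val jobConf = new JobConf(sc.hadoopConfiguration)
  // use the NullOutputCommitter class defined in the message above, so the
  // copy-and-delete "commit" step on S3 is skipped entirely
  jobConf.setOutputCommitter(classOf[NullOutputCommitter])

  rdd.map(line => (NullWritable.get(), new Text(line)))
    .saveAsHadoopFile(
      "s3n://my-bucket/output",            // hypothetical destination path
      classOf[NullWritable],
      classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]],
      jobConf)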


Re: Play framework

2014-10-16 Thread Surendranauth Hiraman
Mohammed,

Jumping in for Daniel, we actually address the configuration issue by
pulling values from environment variables or command line options. Maybe
that can handle at least some of your needs.

For the akka issue, here is the akka version we include in build.sbt:
"com.typesafe.akka" %% "akka-actor" % "2.2.1"

-Suren


On Thu, Oct 16, 2014 at 12:23 PM, Mohammed Guller 
wrote:

>  Daniel,
>
> Thanks for sharing this. It is very helpful.
>
>
>
> The reason I want to use Spark submit is that it provides more
> flexibility. For example, with spark-submit, I don’t need to hard code the
> master info in the code. I can easily change the config without having to
> change and recompile code.
>
>
>
> Do you mind sharing the sbt build file for your play app? I tried to build
> an uber jar using sbt-assembly. It gets built, but when I run it, it throws
> all sorts of exception. I have seen some blog posts that Spark and Play use
> different version of the Akka library. So I included Akka in my build.scala
> file, but still cannot get rid of Akka related exceptions. I suspect that
> the settings in the build.scala file for my play project is incorrect.
>
>
>
> Mohammed
>
>
>
> *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
> *Sent:* Thursday, October 16, 2014 7:15 AM
> *To:* Mohammed Guller
> *Cc:* user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> We execute Spark jobs from a Play application but we don't use
> spark-submit. I don't know if you really want to use spark-submit, but if
> not you can just create a SparkContext programmatically in your app.
>
> In development I typically run Spark locally. Creating the Spark context
> is pretty trivial:
>
> val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome
> App")
>
> // call conf.set for any other configuration you want
>
> val sc = new SparkContext(sparkConf)
>
> It is important to keep in mind you cannot have multiple local contexts
> (you can create them but you'll get odd errors), so if you are running
> things in parallel within your app (even unit tests) you'd need to share a
> context in this case. If you are running sequentially you can create a new
> local context each time, but you must make sure to call SparkContext.stop()
> when you're done.
>
> Running against a cluster is a bit more complicated because you need to
> add all your dependency jars. I'm not sure how to get this to work with play
> run. I stick to building the app with play dist and then running against
> the packaged application, because it very conveniently provides all the
> dependencies in a lib folder. Here is some code to load all the paths you
> need from the dist:
>
>
> def libs : Seq[String] = {
> val libDir = play.api.Play.application.getFile("lib")
>
> logger.info(s"SparkContext will be initialized with libraries
> from directory $libDir")
>
> return if ( libDir.exists ) {
>
> libDir.listFiles().map(_.getCanonicalFile().getAbsolutePath()).filter(_.endsWith(".jar"))
> } else {
> throw new IllegalStateException(s"lib dir is missing: $libDir")
> }
> }
>
> Creating the context is similar to above, but with this extra line:
>
>
> conf.setJars(libs)
>
> I hope this helps. I should note that I don't use play run very much, at
> least not for when I'm actually executing Spark jobs. So I'm not sure if
> this integrates properly with that. I have unit tests which execute on
> Spark and have executed the dist package both locally and on a cluster. To
> make working with the dist locally easier, I wrote myself a little shell
> script to unzip and run the dist.
>
>
>
>
>
> On Wed, Oct 15, 2014 at 10:51 PM, Mohammed Guller 
> wrote:
>
> Hi –
>
>
>
> Has anybody figured out how to integrate a Play application with Spark and
> run it on a Spark cluster using spark-submit script? I have seen some blogs
> about creating a simple Play app and running it locally on a dev machine
> with sbt run command. However, those steps don’t work for Spark-submit.
>
>
>
> If you have figured out how to build and run a Play app with Spark-submit,
> I would appreciate if you could share the steps and the sbt settings for
> your Play app.
>
>
>
> Thanks,
>
> Mohammed
>
>
>
>
>
>
> --
>
> Daniel Siegmann, Software Developer
> Velos
>
> Accelerating Machine Learning
>
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>



-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
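
For reference, a minimal build.sbt sketch along these lines. Apart from the
akka-actor line quoted above, the names and versions are assumptions, not a
complete or verified Play + Spark build:

  name := "play-spark-app"

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.1.0",
    "com.typesafe.akka" %% "akka-actor" % "2.2.1"
  )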


Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
RDD.aggregate doesn’t require the RDD elements to be pairs, so you don’t 
need to make user_id the key of the RDD. For example, you can use 
an empty Map as the zero value of the aggregation. The key of the Map is 
the user_id you extracted from each tuple, and the value is the 
aggregated value.


keyByTimestamp.aggregate(Map.empty[Long, Float].withDefaultValue(0.0f))({ (agg, rec) =>
  val (time, (user, amount)) = rec
  agg.updated(user, agg(user) + amount)
}, { (lhs, rhs) =>
  lhs.keys.foldLeft(rhs) { (combined, user) =>
    combined.updated(user, lhs(user) + rhs(user))
  }
})

Of course, you may use mutable Map for optimized performance. One thing 
to notice, foldByKey is a transformation, while aggregate is an action. 
The final result of the code above is a single Map object rather than an 
RDD. If this map can be very large (say you have billions of users), 
then aggregate may OOM.


On 10/17/14 12:01 AM, Michael Misiewicz wrote:

Thanks for the suggestion! That does look really helpful, I see what 
you mean about it being more general than fold. I think I will replace 
my fold with aggregate - it should give me more control over the process.


I think the problem will still exist though - which is that I can't 
get the correct partitioning I need. When I change my key to user_id, 
I lose the timestamp partitioning. My problem is that I'm trying to 
retain a parent RDD's partitioning in an RDD that no longer has the 
same keys as its parent.


On Thu, Oct 16, 2014 at 11:46 AM, Cheng Lian wrote:


Hi Michael,

I'm not sure I fully understood your question, but I think
RDD.aggregate can be helpful in your case. You can see it as a
more general version of fold.

Cheng



On 10/16/14 11:15 PM, Michael Misiewicz wrote:

Hi,

I'm working on a problem where I'd like to sum items in an RDD
/in order (/approximately/)/. I am currently trying to implement
this using a fold, but I'm having some issues because the sorting
key of my data is not the same as the folding key for my data. I
have data that looks like this:

user_id, transaction_timestamp, transaction_amount

And I'm interested in doing a foldByKey on user_id to sum
transaction amounts - taking care to note approximately when a
user surpasses a total transaction threshold. I'm using
RangePartitioner to make sure that data is ordered sequentially
between partitions, and I'd also make sure that data is sorted
within partitions, though I'm not sure how to do this exactly (I
was going to look at the code for sortByKey to figure this out -
I believe sorting in place in a mapPartitions should work). What
do you think about the approach? Here's some sample code that
demonstrates what I'm thinking:

def myFold(V1:Float, V2:Float) : Float = {
val partialSum = V1 + V2
if (partialSum >= 500) {
// make a note of it, do things
}
return partialSum
}

val rawData = sc.textFile("hdfs://path/to/data").map{ x => //
load data
l = x.split()
(l(0).toLong, l(1).toLong, l(2).toFloat) // user_id:long,
transaction_timestamp:long, transaction_amount:float
}
val keyByTimestamp = rawData.map(x=> (x._2, (x._1, x._3))) //
rearrange to make timestamp the key (for sorting), convert to PairRDD
val sortedByTimestamp = keyByTimestamp.sortByKey()
val partitionedByTimestamp = sortedByTimestamp.partitionBy(
new org.apache.spark.RangePartitioner(partitions=500,
rdd=sortedByTimestamp)).persist()
// By this point, the RDD should be sorted and partitioned
according to the timestamp. However, I need to now make user_id
the key,
// because the output must be per user. At this point, since I
change the keys of the PairRDD, I understand that I lose the
partitioning
// the consequence of this is that I can no longer be sure in my
fold function that the ordering is retained.

val keyByUser = partitionedByTimestamp.map(x => (x._2._1, x._2._2))
val finalResult = keyByUser.foldByKey(zeroValue=0)(myFold)
finalResult.saveAsTextFile("hdfs://...")

The problem as you'd expect takes place in the folding function,
after I've re-arranged my RDD to no longer be keyed by timestamp
(when I produce keyByUser, I lose the correct partitioning). As
I've read in the documentation, partitioning is not preserved
when keys are changed (which makes sense).

Reading this thread:
https://groups.google.com/forum/#!topic/spark-users/Fx7DNtWiSx4

it appears that one possible solution might be to subclass RDD (à
la MappedValuesRDD) to define my own RDD that retains the
partitions of its parent. This seems simple enough, but I've
never done anything like that before, but I'm not sure where to
start. I'm also willing to write my own custom partitioner class,
b

Re: Play framework

2014-10-16 Thread Debasish Das
It will really help if the spark users can point to github examples that
integrated spark and playspecifically SparkSQL and Play...


On Thu, Oct 16, 2014 at 9:23 AM, Mohammed Guller 
wrote:

>  Daniel,
>
> Thanks for sharing this. It is very helpful.
>
>
>
> The reason I want to use Spark submit is that it provides more
> flexibility. For example, with spark-submit, I don’t need to hard code the
> master info in the code. I can easily change the config without having to
> change and recompile code.
>
>
>
> Do you mind sharing the sbt build file for your play app? I tried to build
> an uber jar using sbt-assembly. It gets built, but when I run it, it throws
> all sorts of exception. I have seen some blog posts that Spark and Play use
> different version of the Akka library. So I included Akka in my build.scala
> file, but still cannot get rid of Akka related exceptions. I suspect that
> the settings in the build.scala file for my play project is incorrect.
>
>
>
> Mohammed
>
>
>
> *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
> *Sent:* Thursday, October 16, 2014 7:15 AM
> *To:* Mohammed Guller
> *Cc:* user@spark.apache.org
> *Subject:* Re: Play framework
>
>
>
> We execute Spark jobs from a Play application but we don't use
> spark-submit. I don't know if you really want to use spark-submit, but if
> not you can just create a SparkContext programmatically in your app.
>
> In development I typically run Spark locally. Creating the Spark context
> is pretty trivial:
>
> val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome
> App")
>
> // call conf.set for any other configuration you want
>
> val sc = new SparkContext(sparkConf)
>
> It is important to keep in mind you cannot have multiple local contexts
> (you can create them but you'll get odd errors), so if you are running
> things in parallel within your app (even unit tests) you'd need to share a
> context in this case. If you are running sequentially you can create a new
> local context each time, but you must make sure to call SparkContext.stop()
> when you're done.
>
> Running against a cluster is a bit more complicated because you need to
> add all your dependency jars. I'm not sure how to get this to work with play
> run. I stick to building the app with play dist and then running against
> the packaged application, because it very conveniently provides all the
> dependencies in a lib folder. Here is some code to load all the paths you
> need from the dist:
>
>
> def libs : Seq[String] = {
> val libDir = play.api.Play.application.getFile("lib")
>
> logger.info(s"SparkContext will be initialized with libraries
> from directory $libDir")
>
> return if ( libDir.exists ) {
>
> libDir.listFiles().map(_.getCanonicalFile().getAbsolutePath()).filter(_.endsWith(".jar"))
> } else {
> throw new IllegalStateException(s"lib dir is missing: $libDir")
> }
> }
>
> Creating the context is similar to above, but with this extra line:
>
>
> conf.setJars(libs)
>
> I hope this helps. I should note that I don't use play run very much, at
> least not for when I'm actually executing Spark jobs. So I'm not sure if
> this integrates properly with that. I have unit tests which execute on
> Spark and have executed the dist package both locally and on a cluster. To
> make working with the dist locally easier, I wrote myself a little shell
> script to unzip and run the dist.
>
>
>
>
>
> On Wed, Oct 15, 2014 at 10:51 PM, Mohammed Guller 
> wrote:
>
> Hi –
>
>
>
> Has anybody figured out how to integrate a Play application with Spark and
> run it on a Spark cluster using spark-submit script? I have seen some blogs
> about creating a simple Play app and running it locally on a dev machine
> with sbt run command. However, those steps don’t work for Spark-submit.
>
>
>
> If you have figured out how to build and run a Play app with Spark-submit,
> I would appreciate if you could share the steps and the sbt settings for
> your Play app.
>
>
>
> Thanks,
>
> Mohammed
>
>
>
>
>
>
> --
>
> Daniel Siegmann, Software Developer
> Velos
>
> Accelerating Machine Learning
>
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>


RE: Play framework

2014-10-16 Thread Mohammed Guller
Daniel,
Thanks for sharing this. It is very helpful.

The reason I want to use Spark submit is that it provides more flexibility. For 
example, with spark-submit, I don’t need to hard code the master info in the 
code. I can easily change the config without having to change and recompile 
code.

Do you mind sharing the sbt build file for your play app? I tried to build an 
uber jar using sbt-assembly. It gets built, but when I run it, it throws all 
sorts of exceptions. I have seen some blog posts saying that Spark and Play use 
different versions of the Akka library. So I included Akka in my build.scala 
file, but still cannot get rid of Akka-related exceptions. I suspect that the 
settings in the build.scala file for my play project are incorrect.

Mohammed

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Thursday, October 16, 2014 7:15 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Play framework

We execute Spark jobs from a Play application but we don't use spark-submit. I 
don't know if you really want to use spark-submit, but if not you can just 
create a SparkContext programmatically in your app.
In development I typically run Spark locally. Creating the Spark context is 
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName(s"My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(sparkConf)

It is important to keep in mind you cannot have multiple local contexts (you 
can create them but you'll get odd errors), so if you are running things in 
parallel within your app (even unit tests) you'd need to share a context in 
this case. If you are running sequentially you can create a new local context 
each time, but you must make sure to call SparkContext.stop() when you're done.
Running against a cluster is a bit more complicated because you need to add all 
your dependency jars. I'm not sure how to get this to work with play run. I 
stick to building the app with play dist and then running against the packaged 
application, because it very conveniently provides all the dependencies in a 
lib folder. Here is some code to load all the paths you need from the dist:

def libs : Seq[String] = {
val libDir = play.api.Play.application.getFile("lib")

logger.info(s"SparkContext will be initialized with 
libraries from directory $libDir")

return if ( libDir.exists ) {

libDir.listFiles().map(_.getCanonicalFile().getAbsolutePath()).filter(_.endsWith(".jar"))
} else {
throw new IllegalStateException(s"lib dir is missing: $libDir")
}
}
Creating the context is similar to above, but with this extra line:

conf.setJars(libs)
I hope this helps. I should note that I don't use play run very much, at least 
not for when I'm actually executing Spark jobs. So I'm not sure if this 
integrates properly with that. I have unit tests which execute on Spark and 
have executed the dist package both locally and on a cluster. To make working 
with the dist locally easier, I wrote myself a little shell script to unzip and 
run the dist.


On Wed, Oct 15, 2014 at 10:51 PM, Mohammed Guller 
<moham...@glassbeam.com> wrote:
Hi –

Has anybody figured out how to integrate a Play application with Spark and run 
it on a Spark cluster using spark-submit script? I have seen some blogs about 
creating a simple Play app and running it locally on a dev machine with sbt run 
command. However, those steps don’t work for Spark-submit.

If you have figured out how to build and run a Play app with Spark-submit, I 
would appreciate if you could share the steps and the sbt settings for your 
Play app.

Thanks,
Mohammed




--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: 
www.velos.io


Re: PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Davies Liu
It's a bug, could you file a JIRA for this? Thanks!

Davies

On Thu, Oct 16, 2014 at 8:28 AM, Griffiths, Michael (NYC-RPM)
 wrote:
> Hi,
>
>
>
> I’m running into an error on Windows (x64, 8.1) running Spark 1.1.0
> (pre-built for Hadoop 2.4: spark-1.1.0-bin-hadoop2.4.tgz) with Java SE
> Version 8 Update 20 (build 1.8.0_20-b26); just getting started with Spark.
>
>
>
> When running sc.wholeTextFiles() on a directory, I can run the command but
> not do anything with the resulting RDD – specifically, I get an error in
> py4j.protocol.Py4JJavaError; the error is unspecified, though the location
> is included. I’ve attached the traceback below.
>
>
>
> In this situation, I’m trying to load all files from a folder on the local
> filesystem, located at D:\testdata. The folder contains one file, which can
> be loaded successfully with sc.textFile(“d:/testdata/filename”) – no
> problems at all – so I do not believe the file is throwing the error.
>
>
>
> Is there any advice on what I should look at further to isolate or fix the
> error? Am I doing something obviously wrong?
>
>
>
> Thanks,
>
> Michael
>
>
>
>
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/__ / .__/\_,_/_/ /_/\_\   version 1.1.0
>
>   /_/
>
>
>
> Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
>
> SparkContext available as sc.
>
 file =
 sc.textFile("d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884")
>
 file.count()
>
> 732
>
 file.first()
>
> u''
>
 data = sc.wholeTextFiles('d:/testdata')
>
 data.first()
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
>   File "D:\spark\python\pyspark\rdd.py", line 1167, in first
>
> return self.take(1)[0]
>
>   File "D:\spark\python\pyspark\rdd.py", line 1126, in take
>
> totalParts = self._jrdd.partitions().size()
>
>   File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line
> 538, in __call__
>
>   File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line
> 300, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
>
> : java.lang.NullPointerException
>
> at java.lang.ProcessBuilder.start(Unknown Source)
>
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
>
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
>
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
>
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
>
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
>
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
>
> at
> org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
>
>at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
>
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
>
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
>
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
>
> at
> org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
>
> at
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>
> at
> org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
>
> at
> org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>
> at java.lang.reflect.Method.invoke(Unknown Source)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>
> at py4j.Gateway.invoke(Gateway.java:259)
>
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
>
> at java.lang.Thread.run(Unknown Source)
>
>
>
 data.count()
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>

Re: Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
I note that one of the listed variants of aggregateByKey accepts a
partitioner as an argument:

def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V)
⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Would it be possible to extract my sorted parent's partitioner and pass
that into aggregateByKey on the re-keyed data being aggregated?
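
A sketch of what that could look like (an assumption, not a recommendation
from the thread). Note that the parent's RangePartitioner was built over
timestamp keys, so reusing it directly on user_id keys would not mean the same
thing; the sketch below only reuses the parent's partition count, and the
partitioner argument controls placement, not within-partition ordering:

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // PairRDDFunctions (needed in Spark 1.1 apps)

  // fall back to a fixed partition count if the parent has no partitioner
  val numPartitions =
    partitionedByTimestamp.partitioner.map(_.numPartitions).getOrElse(500)

  val perUserTotals = keyByUser.aggregateByKey(0.0f, new HashPartitioner(numPartitions))(
    (acc, amount) => acc + amount,  // seqOp: sum a user's amounts within a partition
    (a, b) => a + b)                // combOp: merge per-partition sums for the same user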

On Thu, Oct 16, 2014 at 12:01 PM, Michael Misiewicz 
wrote:

> Thanks for the suggestion! That does look really helpful, I see what you
> mean about it being more general than fold. I think I will replace my fold
> with aggregate - it should give me more control over the process.
>
> I think the problem will still exist though - which is that I can't get
> the correct partitioning I need. When I change my key to user_id, I lose
> the timestamp partitioning. My problem is that I'm trying to retain a
> parent RDD's partitioning in an RDD that no longer has the same keys as its
> parent.
>
> On Thu, Oct 16, 2014 at 11:46 AM, Cheng Lian 
> wrote:
>
>>  Hi Michael,
>>
>> I'm not sure I fully understood your question, but I think RDD.aggregate
>> can be helpful in your case. You can see it as a more general version of
>> fold.
>>
>> Cheng
>>
>>
>>
>> On 10/16/14 11:15 PM, Michael Misiewicz wrote:
>>
>> Hi,
>>
>>  I'm working on a problem where I'd like to sum items in an RDD *in
>> order (*approximately*)*. I am currently trying to implement this using
>> a fold, but I'm having some issues because the sorting key of my data is
>> not the same as the folding key for my data. I have data that looks like
>> this:
>>
>>  user_id, transaction_timestamp, transaction_amount
>>
>>  And I'm interested in doing a foldByKey on user_id to sum transaction
>> amounts - taking care to note approximately when a user surpasses a total
>> transaction threshold. I'm using RangePartitioner to make sure that data
>> is ordered sequentially between partitions, and I'd also make sure that
>> data is sorted within partitions, though I'm not sure how to do this
>> exactly (I was going to look at the code for sortByKey to figure this
>> out - I believe sorting in place in a mapPartitions should work). What
>> do you think about the approach? Here's some sample code that demonstrates
>> what I'm thinking:
>>
>>  def myFold(V1:Float, V2:Float) : Float = {
>>  val partialSum = V1 + V2
>>  if (partialSum >= 500) {
>>  // make a note of it, do things
>>  }
>>  return partialSum
>> }
>>
>>  val rawData = sc.textFile("hdfs://path/to/data").map{ x => // load data
>>  l = x.split()
>>  (l(0).toLong, l(1).toLong, l(2).toFloat) // user_id:long,
>> transaction_timestamp:long, transaction_amount:float
>>  }
>>  val keyByTimestamp = rawData.map(x=> (x._2, (x._1, x._3))) // rearrange
>> to make timestamp the key (for sorting), convert to PairRDD
>> val sortedByTimestamp = keyByTimestamp.sortByKey()
>> val partitionedByTimestamp = sortedByTimestamp.partitionBy(
>>  new org.apache.spark.RangePartitioner(partitions=500,
>> rdd=sortedByTimestamp)).persist()
>> // By this point, the RDD should be sorted and partitioned according to
>> the timestamp. However, I need to now make user_id the key,
>> // because the output must be per user. At this point, since I change the
>> keys of the PairRDD, I understand that I lose the partitioning
>> // the consequence of this is that I can no longer be sure in my fold
>> function that the ordering is retained.
>>
>>  val keyByUser = partitionedByTimestamp.map(x => (x._2._1, x._2._2))
>> val finalResult = keyByUser.foldByKey(zeroValue=0)(myFold)
>> finalResult.saveAsTextFile("hdfs://...")
>>
>>  The problem as you'd expect takes place in the folding function, after
>> I've re-arranged my RDD to no longer be keyed by timestamp (when I produce
>> keyByUser, I lose the correct partitioning). As I've read in the
>> documentation, partitioning is not preserved when keys are changed (which
>> makes sense).
>>
>>  Reading this thread:
>> https://groups.google.com/forum/#!topic/spark-users/Fx7DNtWiSx4 it
>> appears that one possible solution might be to subclass RDD (à la
>> MappedValuesRDD) to define my own RDD that retains the partitions of its
>> parent. This seems simple enough, but I've never done anything like that
>> before, but I'm not sure where to start. I'm also willing to write my own
>> custom partitioner class, but it appears that the getPartition method
>> only accepts a "key" argument - and since the value I need to partition on
>> in the final step (the timestamp) would be in the Value, my
>> partitioner class doesn't have the data it needs to make the right
>> decision. I cannot have timestamp in my key.
>>
>>  Alternatively, has anyone else encountered a problem like this (i.e. an
>> approximately ordered sum) and did they find a good solution? Does my
>> approach of subclassing RDD make sense? Would there be some way to
>> finagle a custom partitioner into making this work? Perhaps this might be a
>> job for some other tool, like spark str

Re: Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
Thanks for the suggestion! That does look really helpful, I see what you
mean about it being more general than fold. I think I will replace my fold
with aggregate - it should give me more control over the process.

I think the problem will still exist though - which is that I can't get the
correct partitioning I need. When I change my key to user_id, I lose the
timestamp partitioning. My problem is that I'm trying to retain a parent
RDD's partitioning in an RDD that no longer has the same keys as its
parent.

On Thu, Oct 16, 2014 at 11:46 AM, Cheng Lian  wrote:

>  Hi Michael,
>
> I'm not sure I fully understood your question, but I think RDD.aggregate
> can be helpful in your case. You can see it as a more general version of
> fold.
>
> Cheng
>
>
>
> On 10/16/14 11:15 PM, Michael Misiewicz wrote:
>
> Hi,
>
>  I'm working on a problem where I'd like to sum items in an RDD *in order
> (*approximately*)*. I am currently trying to implement this using a fold,
> but I'm having some issues because the sorting key of my data is not the
> same as the folding key for my data. I have data that looks like this:
>
>  user_id, transaction_timestamp, transaction_amount
>
>  And I'm interested in doing a foldByKey on user_id to sum transaction
> amounts - taking care to note approximately when a user surpasses a total
> transaction threshold. I'm using RangePartitioner to make sure that data
> is ordered sequentially between partitions, and I'd also make sure that
> data is sorted within partitions, though I'm not sure how to do this
> exactly (I was going to look at the code for sortByKey to figure this out
> - I believe sorting in place in a mapPartitions should work). What do you
> think about the approach? Here's some sample code that demonstrates what
> I'm thinking:
>
>  def myFold(V1:Float, V2:Float) : Float = {
>  val partialSum = V1 + V2
>  if (partialSum >= 500) {
>  // make a note of it, do things
>  }
>  return partialSum
> }
>
>  val rawData = sc.textFile("hdfs://path/to/data").map{ x => // load data
>  l = x.split()
>  (l(0).toLong, l(1).toLong, l(2).toFloat) // user_id:long,
> transaction_timestamp:long, transaction_amount:float
>  }
>  val keyByTimestamp = rawData.map(x=> (x._2, (x._1, x._3))) // rearrange
> to make timestamp the key (for sorting), convert to PairRDD
> val sortedByTimestamp = keyByTimestamp.sortByKey()
> val partitionedByTimestamp = sortedByTimestamp.partitionBy(
>  new org.apache.spark.RangePartitioner(partitions=500,
> rdd=sortedByTimestamp)).persist()
> // By this point, the RDD should be sorted and partitioned according to
> the timestamp. However, I need to now make user_id the key,
> // because the output must be per user. At this point, since I change the
> keys of the PairRDD, I understand that I lose the partitioning
> // the consequence of this is that I can no longer be sure in my fold
> function that the ordering is retained.
>
>  val keyByUser = partitionedByTimestamp.map(x => (x._2._1, x._2._2))
> val finalResult = keyByUser.foldByKey(zeroValue=0)(myFold)
> finalResult.saveAsTextFile("hdfs://...")
>
>  The problem as you'd expect takes place in the folding function, after
> I've re-arranged my RDD to no longer be keyed by timestamp (when I produce
> keyByUser, I lose the correct partitioning). As I've read in the
> documentation, partitioning is not preserved when keys are changed (which
> makes sense).
>
>  Reading this thread:
> https://groups.google.com/forum/#!topic/spark-users/Fx7DNtWiSx4 it
> appears that one possible solution might be to subclass RDD (à la
> MappedValuesRDD) to define my own RDD that retains the partitions of its
> parent. This seems simple enough, but I've never done anything like that
> before, but I'm not sure where to start. I'm also willing to write my own
> custom partitioner class, but it appears that the getPartition method
> only accepts a "key" argument - and since the value I need to partition on
> in the final step (the timestamp) would be in the Value, my
> partitioner class doesn't have the data it needs to make the right
> decision. I cannot have timestamp in my key.
>
>  Alternatively, has anyone else encountered a problem like this (i.e. an
> approximately ordered sum) and did they find a good solution? Does my
> approach of subclassing RDD make sense? Would there be some way to
> finagle a custom partitioner into making this work? Perhaps this might be a
> job for some other tool, like spark streaming?
>
>  Thanks,
> Michael
>
>
>


Re: How to add HBase dependencies and conf with spark-submit?

2014-10-16 Thread Soumitra Kumar
Great, it worked.

I don't have an answer as to what is special about SPARK_CLASSPATH vs --jars; I just 
found the working setting through trial and error.

- Original Message -
From: "Fengyun RAO" 
To: "Soumitra Kumar" 
Cc: user@spark.apache.org, u...@hbase.apache.org
Sent: Thursday, October 16, 2014 12:50:01 AM
Subject: Re: How to add HBase dependencies and conf with spark-submit?





Thanks, Soumitra Kumar, 

I didn’t know why you put hbase-protocol.jar in SPARK_CLASSPATH while adding 
hbase-protocol.jar, hbase-common.jar, hbase-client.jar, and htrace-core.jar in 
--jars, but it did work. 

Actually, I put all four of these jars in SPARK_CLASSPATH along with the HBase 
conf directory. 
 


2014-10-15 22:39 GMT+08:00 Soumitra Kumar < kumar.soumi...@gmail.com > : 


I am writing to HBase, following are my options: 

export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar 

spark-submit \ 
--jars 
/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
 \ 



- Original Message - 
From: "Fengyun RAO" < raofeng...@gmail.com > 
To: user@spark.apache.org , u...@hbase.apache.org 
Sent: Wednesday, October 15, 2014 6:29:21 AM 
Subject: Re: How to add HBase dependencies and conf with spark-submit? 


+user@hbase 



2014-10-15 20:48 GMT+08:00 Fengyun RAO < raofeng...@gmail.com > : 



We use Spark 1.1, and HBase 0.98.1-cdh5.1.0, and need to read and write an 
HBase table in Spark program. 



I notice there are: 

spark.driver.extraClassPath and spark.executor.extraClassPath properties to manage 
the extra classpath, or even the deprecated SPARK_CLASSPATH. 


The problem is what classpath or jars should we append? 
I can simply add the whole `hbase classpath`, which is huge, 
but this leads to dependency conflicts, e.g. HBase uses guava-12 while Spark 
uses guava-14. 
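
For completeness, a sketch of setting the executor-side property mentioned
above programmatically; the jar paths are the CDH parcel locations quoted
earlier in the thread and must exist on the worker nodes. The driver-side
spark.driver.extraClassPath usually has to be supplied before the driver JVM
starts (spark-defaults.conf or --conf on spark-submit), so it is not set here:

  val hbaseJars = Seq(
    "/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar",
    "/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar",
    "/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar",
    "/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar"
  ).mkString(":")

  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.extraClassPath", hbaseJars)  // executors launch after this, so it takes effect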




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread Cheng Lian
You can first union them into a single RDD and then call foreach. In 
Scala:


rddList.reduce(_.union(_)).foreach(myFunc)

For the serialization issue, I don’t have any clue unless more code can 
be shared.


On 10/16/14 11:39 PM, /soumya/ wrote:


Hi, my programming model requires me to generate multiple RDDs for various
datasets across a single run and then run an action on it - E.g.

MyFunc myFunc = ... //It implements VoidFunction
//set some extra variables - all serializable
...
for (JavaRDD rdd: rddList) {
...
sc.foreach(myFunc);

}

The problem I'm seeing is that after the first run of the loop - which
succeeds on foreach, the second one fails with
java.io.NotSerializableException for a specific object I'm setting. In my
particular case, the object contains a reference to
org.apache.hadoop.conf.Configuration. Question is:

1. Why does this succeed the first time, and fail the second?
2. Any alternatives to this programming model?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-an-action-inside-a-loop-across-multiple-RDDs-java-io-NotSerializableException-tp16580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL DDL, DML commands

2014-10-16 Thread Cheng Lian
I guess you're referring to the simple SQL dialect recognized by the 
SqlParser component.


Spark SQL supports most DDL and DML of Hive. But the simple SQL dialect 
is still very limited. Usually it's used together with some Spark 
application written in Java/Scala/Python. Within a Spark application, 
you can always register case class RDDs as temporary tables, which partly 
replaces the functionality of DDL/DML in pure SQL scripts.


On the other hand, we do plan to support SQL 92 in the future.
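
A minimal sketch of the temporary-table pattern in Spark 1.1 (the case class,
table name and data are illustrative):

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD   // implicit RDD[Person] => SchemaRDD conversion

  val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
  people.registerTempTable("people")  // stands in for a CREATE TABLE statement

  sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)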

On 10/16/14 10:50 PM, neeraj wrote:

Hi,

Does Spark SQL have DDL, DML commands to be executed directly. If yes,
please share the link.

If No, please help me understand why is it not there?

Regards,
Neeraj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDL-DML-commands-tp16572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian

Hi Michael,

I'm not sure I fully understood your question, but I think RDD.aggregate 
can be helpful in your case. You can see it as a more general version of 
fold.


Cheng


On 10/16/14 11:15 PM, Michael Misiewicz wrote:

Hi,

I'm working on a problem where I'd like to sum items in an RDD /in 
order (/approximately/)/. I am currently trying to implement this 
using a fold, but I'm having some issues because the sorting key of my 
data is not the same as the folding key for my data. I have data that 
looks like this:


user_id, transaction_timestamp, transaction_amount

And I'm interested in doing a foldByKey on user_id to sum transaction 
amounts - taking care to note approximately when a user surpasses a 
total transaction threshold. I'm using RangePartitioner to make sure 
that data is ordered sequentially between partitions, and I'd also 
make sure that data is sorted within partitions, though I'm not sure 
how to do this exactly (I was going to look at the code for sortByKey 
to figure this out - I believe sorting in place in a mapPartitions 
should work). What do you think about the approach? Here's some sample 
code that demonstrates what I'm thinking:


def myFold(V1:Float, V2:Float) : Float = {
val partialSum = V1 + V2
if (partialSum >= 500) {
// make a note of it, do things
}
return partialSum
}

val rawData = sc.textFile("hdfs://path/to/data").map{ x => // load data
l = x.split()
(l(0).toLong, l(1).toLong, l(2).toFloat) // user_id:long, 
transaction_timestamp:long, transaction_amount:float

}
val keyByTimestamp = rawData.map(x=> (x._2, (x._1, x._3))) // 
rearrange to make timestamp the key (for sorting), convert to PairRDD

val sortedByTimestamp = keyByTimestamp.sortByKey()
val partitionedByTimestamp = sortedByTimestamp.partitionBy(
new org.apache.spark.RangePartitioner(partitions=500, 
rdd=sortedByTimestamp)).persist()
// By this point, the RDD should be sorted and partitioned according 
to the timestamp. However, I need to now make user_id the key,
// because the output must be per user. At this point, since I change 
the keys of the PairRDD, I understand that I lose the partitioning
// the consequence of this is that I can no longer be sure in my fold 
function that the ordering is retained.


val keyByUser = partitionedByTimestamp.map(x => (x._2._1, x._2._2))
val finalResult = keyByUser.foldByKey(zeroValue=0)(myFold)
finalResult.saveAsTextFile("hdfs://...")

The problem as you'd expect takes place in the folding function, after 
I've re-arranged my RDD to no longer be keyed by timestamp (when I 
produce keyByUser, I lose the correct partitioning). As I've read in 
the documentation, partitioning is not preserved when keys are changed 
(which makes sense).


Reading this thread: 
https://groups.google.com/forum/#!topic/spark-users/Fx7DNtWiSx4 
 it 
appears that one possible solution might be to subclass RDD (à la 
MappedValuesRDD) to define my own RDD that retains the partitions of 
its parent. This seems simple enough, but I've never done anything 
like that before, but I'm not sure where to start. I'm also willing to 
write my own custom partitioner class, but it appears that the 
getPartition method only accepts a "key" argument - and since the value 
I need to partition on in the final step (the timestamp) would be in 
the Value, my partitioner class doesn't have the data it needs to make 
the right decision. I cannot have timestamp in my key.


Alternatively, has anyone else encountered a problem like this (i.e. 
an approximately ordered sum) and did they find a good solution? Does 
my approach of subclassing RDD make sense? Would there be some way to 
finagle a custom partitioner into making this work? Perhaps this might 
be a job for some other tool, like spark streaming?


Thanks,
Michael




Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Excuse me - the line inside the loop should read: rdd.foreach(myFunc) - not
sc.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-an-action-inside-a-loop-across-multiple-RDDs-java-io-NotSerializableException-tp16580p16581.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Standalone Apps and ClassNotFound

2014-10-16 Thread Ashic Mahtab
I'm relatively new to Spark and have got a couple of questions:

*
 I've got an IntelliJ SBT project that's using Spark Streaming with a 
custom RabbitMQ receiver in the same project. When I run it against 
local[2], all's well. When I put in spark://masterip:7077, I get a 
ClassNotFoundException for RmqReceiver (the name of the custom 
receiver). Note, this is being executed inside IntelliJ, and no jars are
 built in the target folder. I guess using spark-submit would work, but 
was wondering if there's a way to simply run the app in IntelliJ and 
have it work.
* I see there's an sc.addJars(..) method that would (I 
imagine) submit additional jars. Is there a way for it to submit the 
"current project's" classes as well. Or would building and submitting 
the package take care of this?

Any pointers are appreciated.

Regards,
Ashic.

Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Hi, my programming model requires me to generate multiple RDDs for various
datasets across a single run and then run an action on it - E.g. 

MyFunc myFunc = ... //It implements VoidFunction
//set some extra variables - all serializable
...
for (JavaRDD rdd: rddList) {
...
sc.foreach(myFunc);

}

The problem I'm seeing is that after the first run of the loop - which
succeeds on foreach, the second one fails with
java.io.NotSerializableException for a specific object I'm setting. In my
particular case, the object contains a reference to
org.apache.hadoop.conf.Configuration. Question is:

1. Why does this succeed the first time, and fail the second?
2. Any alternatives to this programming model?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-an-action-inside-a-loop-across-multiple-RDDs-java-io-NotSerializableException-tp16580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL DDL, DML commands

2014-10-16 Thread Yi Tian
What do you mean by "executed directly"?



Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Oct 16, 2014, at 22:50, neeraj  wrote:

> Hi,
> 
> Does Spark SQL have DDL, DML commands to be executed directly. If yes,
> please share the link.
> 
> If No, please help me understand why is it not there?
> 
> Regards,
> Neeraj
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDL-DML-commands-tp16572.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Griffiths, Michael (NYC-RPM)
Hi,

I'm running into an error on Windows (x64, 8.1) running Spark 1.1.0 (pre-built 
for Hadoop 2.4: 
spark-1.1.0-bin-hadoop2.4.tgz)
 with Java SE Version 8 Update 20 (build 1.8.0_20-b26); just getting started 
with Spark.

When running sc.wholeTextFiles() on a directory, I can run the command but not 
do anything with the resulting RDD - specifically, I get an error in 
py4j.protocol.Py4JJavaError; the error is unspecified, though the location is 
included. I've attached the traceback below.

In this situation, I'm trying to load all files from a folder on the local 
filesystem, located at D:\testdata. The folder contains one file, which can be 
loaded successfully with sc.textFile("d:/testdata/filename") - no problems at 
all - so I do not believe the file is throwing the error.

Is there any advice on what I should look at further to isolate or fix the 
error? Am I doing something obviously wrong?

Thanks,
Michael


Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
  /_/

Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile("d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884")
>>> file.count()
732
>>> file.first()
u''
>>> data = sc.wholeTextFiles('d:/testdata')
>>> data.first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 1167, in first
return self.take(1)[0]
  File "D:\spark\python\pyspark\rdd.py", line 1126, in take
totalParts = self._jrdd.partitions().size()
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 
538, in __call__
  File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
at 
org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
   at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
at 
org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
at 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at 
org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)

>>> data.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\spark\python\pyspark\rdd.py", line 847, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "D:\spark\python\pyspark\rdd.py", line 838, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "D:\spark\python\pyspark\rdd.py", line 759, in reduce
vals = self.mapPartitions(func).collect()
  File "D:\spark\python\p

Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
Hi,

I'm working on a problem where I'd like to sum items in an RDD *in order (*
approximately*)*. I am currently trying to implement this using a fold, but
I'm having some issues because the sorting key of my data is not the same
as the folding key for my data. I have data that looks like this:

user_id, transaction_timestamp, transaction_amount

And I'm interested in doing a foldByKey on user_id to sum transaction
amounts - taking care to note approximately when a user surpasses a total
transaction threshold. I'm using RangePartitioner to make sure that data is
ordered sequentially between partitions, and I'd also make sure that data
is sorted within partitions, though I'm not sure how to do this exactly (I
was going to look at the code for sortByKey to figure this out - I believe
sorting in place in a mapPartitions should work). What do you think about
the approach? Here's some sample code that demonstrates what I'm thinking:

def myFold(v1: Float, v2: Float): Float = {
  val partialSum = v1 + v2
  if (partialSum >= 500) {
    // make a note of it, do things
  }
  partialSum
}

val rawData = sc.textFile("hdfs://path/to/data").map { x => // load data
  val l = x.split(",") // assuming comma-separated fields; adjust to your data
  // user_id: Long, transaction_timestamp: Long, transaction_amount: Float
  (l(0).toLong, l(1).toLong, l(2).toFloat)
}
// rearrange to make timestamp the key (for sorting), convert to PairRDD
val keyByTimestamp = rawData.map(x => (x._2, (x._1, x._3)))
val sortedByTimestamp = keyByTimestamp.sortByKey()
val partitionedByTimestamp = sortedByTimestamp.partitionBy(
  new org.apache.spark.RangePartitioner(partitions = 500,
    rdd = sortedByTimestamp)).persist()
// By this point, the RDD should be sorted and partitioned according to the
// timestamp. However, I need to now make user_id the key, because the output
// must be per user. At this point, since I change the keys of the PairRDD,
// I understand that I lose the partitioning; the consequence of this is that
// I can no longer be sure in my fold function that the ordering is retained.

val keyByUser = partitionedByTimestamp.map(x => (x._2._1, x._2._2))
val finalResult = keyByUser.foldByKey(zeroValue = 0)(myFold)
finalResult.saveAsTextFile("hdfs://...")
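
For the "sorted within partitions" step I mention above, here is a rough,
untested sketch of what I had in mind (note that it materializes each
partition in memory, so partitions have to be small enough to fit):

// sort each partition in place by the timestamp key, keeping the partitioner
val sortedWithinPartitions = partitionedByTimestamp.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator,
  preservesPartitioning = true)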

The problem as you'd expect takes place in the folding function, after I've
re-arranged my RDD to no longer be keyed by timestamp (when I produce
keyByUser, I lose the correct partitioning). As I've read in the
documentation, partitioning is not preserved when keys are changed (which
makes sense).

Reading this thread:
https://groups.google.com/forum/#!topic/spark-users/Fx7DNtWiSx4 it appears
that one possible solution might be to subclass RDD (à la MappedValuesRDD)
to define my own RDD that retains the partitions of its parent. This seems
simple enough, but I've never done anything like that before, but I'm not
sure where to start. I'm also willing to write my own custom partitioner
class, but it appears that the getPartition method only accepts a "key"
argument - and since the value I need to partition on in the final step
(the timestamp) would be in the Value, my partitioner class doesn't have
the data it needs to make the right decision. I cannot have timestamp in my
key.

Alternatively, has anyone else encountered a problem like this (i.e. an
approximately ordered sum) and did they find a good solution? Does my
approach of subclassing RDD make sense? Would there be some way to finagle
a custom partitioner into making this work? Perhaps this might be a job
for some other tool, like spark streaming?

Thanks,
Michael


Re: Help required on exercise Data Exploration using Spark SQL

2014-10-16 Thread Cheng Lian

Hi Neeraj,

The Spark Summit 2014 tutorial uses Spark 1.0. I guess you're using 
Spark 1.1? Parquet support got polished quite a bit since then, and 
changed the string representation of the query plan, but this output 
should be OK :)
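
To double check, you can still query the loaded data as usual regardless of
how the plan is printed. A quick sanity check (just a sketch; the temp table
name is arbitrary, and on Spark 1.0 the method is registerAsTable rather than
registerTempTable):

wikiData1.registerTempTable("wikiData")
sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect()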


Cheng

On 10/16/14 10:45 PM, neeraj wrote:

Hi,

I'm exploring the exercise "Data Exploration using Spark SQL" from Spark Summit
2014. While running the command "val wikiData =
sqlContext.parquetFile("data/wiki_parquet")", I'm getting the following
output, which doesn't match the expected output.

*Output I'm getting*:
  val wikiData1 =
sqlContext.parquetFile("/data/wiki_parquet/part-r-1.parquet")
14/10/16 19:26:49 INFO parquet.ParquetTypesConverter: Falling back to schema
conversion from Parquet types; result: ArrayBuffer(id#5, title#6,
modified#7L, text#8, username#9)
wikiData1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[1] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
ParquetTableScan [id#5,title#6,modified#7L,text#8,username#9],
(ParquetRelation /data/wiki_parquet/part-r-1.parquet, Some(Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
org.apache.spark.sql.SQLContext@27a5dac0, []), []

*Expected Output*:
wikiData: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ParquetTableScan [id#0,title#1,modified#2L,text#3,username#4],
(ParquetRelation data/wiki_parquet), []

Please help with the possible issue.

I'm using the pre-built package of Spark with Hadoop 2.4.

Please let me know in case more information is required.

Regards,
Neeraj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread Cheng Lian


On 10/16/14 10:48 PM, neeraj wrote:

1. I'm trying to use Spark SQL as a data source. Is it possible?
Unfortunately Spark SQL ODBC/JDBC support is based on the Thrift 
server, so at least you need HDFS and a working Hive Metastore instance 
(used to persist catalogs) to make things work.

2. Please share the link to the ODBC/JDBC drivers at Databricks; I'm not able
to find them.
Sorry, forgot to mention that Denny's article mentioned the ODBC driver 
link: http://www.datastax.com/download#dl-datastax-drivers


For JDBC access, you can just use Hive 0.12.0 JDBC driver, the Thrift 
server is compatible with it.
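
For example, connecting from plain JDBC looks roughly like this (a sketch
only; it assumes the Hive 0.12 JDBC driver and its dependencies are on the
classpath, and that the Thrift server is running on the default host/port):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()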


P.S. The ODBC driver is not from Databricks, but provided by 3rd party 
companies like DataStax and Simba.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/YARN-deployment-of-Spark-and-Thrift-JDBC-server-tp16374p16571.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







TaskNotSerializableException when running through Spark shell

2014-10-16 Thread Akshat Aranya
Hi,

Can anyone explain how things get captured in a closure when running through
the REPL.  For example:

def foo(..) = { .. }

rdd.map(foo)

sometimes complains about classes not being serializable that are
completely unrelated to foo.  This happens even when I write it such:

object Foo {
  def foo(..) = { .. }
}

rdd.map(Foo.foo)

It also doesn't happen all the time.


Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread neeraj
1. I'm trying to use Spark SQL as a data source. Is it possible?
2. Please share the link to the ODBC/JDBC drivers at Databricks; I'm not able
to find them.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/YARN-deployment-of-Spark-and-Thrift-JDBC-server-tp16374p16571.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to close resources shared in executor?

2014-10-16 Thread Ted Yu
Which hbase release are you using ?

Let me refer to the HBase 0.94 code.

Take a look at the following method
in src/main/java/org/apache/hadoop/hbase/client/HTable.java :

  public void close() throws IOException {
...
if (cleanupConnectionOnClose) {
  if (this.connection != null) {
this.connection.close();

When Connection.getTable() is called, the following is invoked:
  public HTable(Configuration conf, final byte[] tableName, final
ExecutorService pool)
which sets cleanupConnectionOnClose to true.

W.r.t. the javadoc, the paragraph on the shutdown hook is
in HConnectionManager.java of 0.94.
You don't need to use a shutdown hook for 0.94+.
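
If you'd rather not depend on reference counting at all, another option is to
scope the connection to each partition. A rough, untested sketch (using the
same 0.94-style API as in this thread; note it materializes one partition's
results in memory before closing the resources):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HConnectionManager

val result = rdd.mapPartitions { lines =>
  val conf = HBaseConfiguration.create()
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")
  val out = lines.map { line =>
    // ... per-record work against `table` goes here ...
    line
  }.toList // materialize before closing, since the iterator is lazy
  table.close()
  connection.close()
  out.iterator
}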

Cheers

On Wed, Oct 15, 2014 at 11:41 PM, Fengyun RAO  wrote:

> I may have misunderstood your point.
>
> val result = rdd.map(line => {
>   val table = Util.Connection.getTable("user")
>   ...
>   table.close()
> }
>
> Did you mean this is enough, and there’s no need to call
> Util.Connection.close(),
> or HConnectionManager.deleteAllConnections()?
>
> Where is the documentation that states HConnectionManager would release the
> underlying connection automatically?
> If that’s true, maybe the Javadoc which recommends a shutdown hook needs
> update
> ​
>
> 2014-10-16 14:20 GMT+08:00 Fengyun RAO :
>
>> Thanks, Ted.
>> Util.Connection.close() should be called only once, so it can NOT be in
>> a map function
>>
>> val result = rdd.map(line => {
>>   val table = Util.Connection.getTable("user")
>>   ...
>>   Util.Connection.close()
>> }
>>
>> As you mentioned:
>>
>> Calling table.close() is the recommended approach.
>> HConnectionManager does reference counting. When all references to the
>> underlying connection are gone, connection would be released.
>>
>> Yes, we should call table.close(), but it won’t remove HConnection in
>> HConnectionManager which is a HConnection pool.
>> As I look into the HconnectionManager Javadoc, it seems I have to
>> implement a shutdown hook
>>
>>  * Cleanup used to be done inside in a shutdown hook.  On startup we'd
>>  * register a shutdown hook that called {@link #deleteAllConnections()}
>>  * on its way out but the order in which shutdown hooks run is not defined so
>>  * were problematic for clients of HConnection that wanted to register their
>>  * own shutdown hooks so we removed ours though this shifts the onus for
>>  * cleanup to the client.
>>
>> ​
>>
>> 2014-10-15 22:31 GMT+08:00 Ted Yu :
>>
>>> Pardon me - there was typo in previous email.
>>>
>>> Calling table.close() is the recommended approach.
>>> HConnectionManager does reference counting. When all references to the
>>> underlying connection are gone, connection would be released.
>>>
>>> Cheers
>>>
>>> On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu  wrote:
>>>
 Have you tried the following ?

 val result = rdd.map(line => { val table = Util.Connection.getTable("user")
 ...
 Util.Connection.close() }

 On Wed, Oct 15, 2014 at 6:09 AM, Fengyun RAO 
 wrote:

> In order to share an HBase connection pool, we create an object
>
> object Util {
>   val HBaseConf = HBaseConfiguration.create
>   val Connection = HConnectionManager.createConnection(HBaseConf)
> }
>
> which would be shared among tasks on the same executor. e.g.
>
> val result = rdd.map(line => {
>   val table = Util.Connection.getTable("user")
>   ...
> }
>
> However, we don’t how to close the Util.Connection.
> If we write Util.Connection.close() in the main function,
> it’ll only run on the driver, not the executor.
>
> So, how do I make sure every Connection is closed before exit?
> ​
>


>>>
>>
>


Spark SQL DDL, DML commands

2014-10-16 Thread neeraj
Hi,

Does Spark SQL have DDL and DML commands that can be executed directly? If yes,
please share the link.

If no, please help me understand why they are not there.

Regards,
Neeraj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDL-DML-commands-tp16572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Larger heap leads to perf degradation due to GC

2014-10-16 Thread Akshat Aranya
I just want to pitch in and say that I ran into the same problem with
running with 64GB executors.  For example, some of the tasks take 5 minutes
to execute, out of which 4 minutes are spent in GC.  I'll try out smaller
executors.
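
For reference, this is roughly what I plan to try next (a sketch only; the
numbers are illustrative rather than recommendations, and the G1 flag follows
Otis's suggestion below):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-tuning-sketch")
  .set("spark.executor.memory", "24g")                    // stay below the ~32GB compressed-oops limit
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // G1 is supposed to cope better with large heaps
val sc = new SparkContext(conf)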

On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic  wrote:

> Hi,
>
> The other option to consider is using G1 GC, which should behave better
> with large heaps.  But pointers are not compressed in heaps > 32 GB in
> size, so you may be better off staying under 32 GB.
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Mon, Oct 6, 2014 at 8:08 PM, Mingyu Kim  wrote:
>
>> Ok, cool. This seems to be general issues in JVM with very large heaps. I
>> agree that the best workaround would be to keep the heap size below 32GB.
>> Thanks guys!
>>
>> Mingyu
>>
>> From: Arun Ahuja 
>> Date: Monday, October 6, 2014 at 7:50 AM
>> To: Andrew Ash 
>> Cc: Mingyu Kim , "user@spark.apache.org" <
>> user@spark.apache.org>, Dennis Lawler 
>> Subject: Re: Larger heap leads to perf degradation due to GC
>>
>> We have used the strategy that you suggested, Andrew - using many workers
>> per machine and keeping the heaps small (< 20gb).
>>
>> Using a large heap resulted in workers hanging or not responding (leading
>> to timeouts).  The same dataset/job for us will fail (most often due to
>> akka disassociated or fetch failures errors) with 10 cores / 100 executors,
>> 60 gb per executor, while succeeding with 1 core / 1000 executors / 6gb per
>> executor.
>>
>> When the job does succeed with more cores per executor and a larger heap
>> it is usually much slower than the smaller executors (the same 8-10 min job
>> taking 15-20 min to complete)
>>
>> The unfortunate downside of this has been, we have had some large
>> broadcast variables which may not fit into memory (and are unnecessarily
>> duplicated) when using the smaller executors.
>>
>> Most of this is anecdotal but for the most part we have had more success
>> and consistency with more executors with smaller memory requirements.
>>
>> On Sun, Oct 5, 2014 at 7:20 PM, Andrew Ash  wrote:
>>
>>> Hi Mingyu,
>>>
>>> Maybe we should be limiting our heaps to 32GB max and running multiple
>>> workers per machine to avoid large GC issues.
>>>
>>> For a 128GB memory, 32 core machine, this could look like:
>>>
>>> SPARK_WORKER_INSTANCES=4
>>> SPARK_WORKER_MEMORY=32g
>>> SPARK_WORKER_CORES=8
>>>
>>> Are people running with large (32GB+) executor heaps in production?  I'd
>>> be curious to hear if so.
>>>
>>> Cheers!
>>> Andrew
>>>
>>> On Thu, Oct 2, 2014 at 1:30 PM, Mingyu Kim  wrote:
>>>
 This issue definitely needs more investigation, but I just wanted to
 quickly check if anyone has run into this problem or has general guidance
 around it. We’ve seen a performance degradation with a large heap on a
 simple map task (i.e. no shuffle). We've seen the slowness starting at around
 a 50GB heap (i.e. spark.executor.memory=50g), and when we checked the
 CPU usage, there were just a lot of GCs going on.

 Has anyone seen a similar problem?

 Thanks,
 Mingyu

>>>
>>>
>>
>


Help required on exercise Data Exploration using Spark SQL

2014-10-16 Thread neeraj
Hi,

I'm exploring the exercise "Data Exploration using Spark SQL" from Spark Summit
2014. While running the command "val wikiData =
sqlContext.parquetFile("data/wiki_parquet")", I'm getting the following
output, which doesn't match the expected output.

*Output I'm getting*:
 val wikiData1 =
sqlContext.parquetFile("/data/wiki_parquet/part-r-1.parquet")
14/10/16 19:26:49 INFO parquet.ParquetTypesConverter: Falling back to schema
conversion from Parquet types; result: ArrayBuffer(id#5, title#6,
modified#7L, text#8, username#9)
wikiData1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[1] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
ParquetTableScan [id#5,title#6,modified#7L,text#8,username#9],
(ParquetRelation /data/wiki_parquet/part-r-1.parquet, Some(Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
org.apache.spark.sql.SQLContext@27a5dac0, []), []

*Expected Output*:
wikiData: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ParquetTableScan [id#0,title#1,modified#2L,text#3,username#4],
(ParquetRelation data/wiki_parquet), []

Please help with the possible issue.

I'm using the pre-built package of Spark with Hadoop 2.4.

Please let me know in case more information is required.

Regards,
Neeraj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
Mocking these things is difficult; executing your unit tests in a local
Spark context is preferred, as recommended in the programming guide.
I know this may not technically be a unit test, but it is hopefully close
enough.

You can load your test data using SparkContext.parallelize and retrieve the
data (for verification) using RDD.collect.
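
For example, a minimal word-count test against a local context might look like
this (just a sketch, assuming ScalaTest's FunSuite; Spark 1.1 still needs the
SparkContext._ import for the pair-RDD functions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSpec extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    sc.stop() // stop the context so later suites can create their own
  }

  test("counts words") {
    val input = sc.parallelize(Seq("a b", "a"))
    val counts = input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap
    assert(counts === Map("a" -> 2, "b" -> 1))
  }
}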

On Thu, Oct 16, 2014 at 9:07 AM, Saket Kumar  wrote:

> Hello all,
>
> I am trying to unit test the classes involved in my Spark job. I am trying to
> mock out the Spark classes (like SparkContext and Broadcast) so that I can
> unit test my classes in isolation. However I have realised that these are
> classes instead of traits. My first question is why?
>
> It is quite hard to mock out classes using ScalaTest+ScalaMock as the
> classes which need to be mocked out need to be annotated with
> org.scalamock.annotation.mock as per
> http://www.scalatest.org/user_guide/testing_with_mock_objects#generatedMocks.
> I cannot do that in my case as I am trying to mock out the spark classes.
>
> Am I missing something? Is there a better way to do this?
>
> val sparkContext = mock[SparkInteraction]
> val trainingDatasetLoader = mock[DatasetLoader]
> val broadcastTrainingDatasetLoader = mock[Broadcast[DatasetLoader]]
> def transformerFunction(source: Iterator[(HubClassificationData,
> String)]): Iterator[String] = {
>   source.map(_._2)
> }
> val classificationResultsRDD = mock[RDD[String]]
> val classificationResults = Array("","","")
> val inputRDD = mock[RDD[(HubClassificationData, String)]]
>
> inSequence{
>   inAnyOrder{
> (sparkContext.broadcast[DatasetLoader]
> _).expects(trainingDatasetLoader).returns(broadcastTrainingDatasetLoader)
>   }
> }
>
> val sparkInvoker = new SparkJobInvoker(sparkContext,
> trainingDatasetLoader)
>
> when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD)
> sparkInvoker.invoke(inputRDD)
>
> Thanks,
> Saket
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Play framework

2014-10-16 Thread Daniel Siegmann
We execute Spark jobs from a Play application but we don't use
spark-submit. I don't know if you really want to use spark-submit, but if
not you can just create a SparkContext programmatically in your app.

In development I typically run Spark locally. Creating the Spark context is
pretty trivial:

val conf = new SparkConf().setMaster("local[*]").setAppName("My Awesome App")
// call conf.set for any other configuration you want
val sc = new SparkContext(conf)

It is important to keep in mind you cannot have multiple local contexts
(you can create them but you'll get odd errors), so if you are running
things in parallel within your app (even unit tests) you'd need to share a
context in this case. If you are running sequentially you can create a new
local context each time, but you must make sure to call SparkContext.stop()
when you're done.

Running against a cluster is a bit more complicated because you need to add
all your dependency jars. I'm not sure how to get this to work with play run.
I stick to building the app with play dist and then running against the
packaged application, because it very conveniently provides all the
dependencies in a lib folder. Here is some code to load all the paths you
need from the dist:

def libs: Seq[String] = {
  val libDir = play.api.Play.application.getFile("lib")

  logger.info(s"SparkContext will be initialized with libraries from directory $libDir")

  if (libDir.exists) {
    libDir.listFiles().map(_.getCanonicalFile().getAbsolutePath()).filter(_.endsWith(".jar"))
  } else {
    throw new IllegalStateException(s"lib dir is missing: $libDir")
  }
}

Creating the context is similar to above, but with this extra line:

conf.setJars(libs)
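
Putting the pieces together, context creation against a cluster ends up looking
roughly like this (a sketch; the master URL is a placeholder for your own):

val conf = new SparkConf()
  .setMaster("spark://your-master:7077") // placeholder
  .setAppName("My Awesome App")
  .setJars(libs) // the jars collected from the dist's lib folder above
val sc = new SparkContext(conf)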

I hope this helps. I should note that I don't use play run very much, at
least not for when I'm actually executing Spark jobs. So I'm not sure if
this integrates properly with that. I have unit tests which execute on
Spark and have executed the dist package both locally and on a cluster. To
make working with the dist locally easier, I wrote myself a little shell
script to unzip and run the dist.


On Wed, Oct 15, 2014 at 10:51 PM, Mohammed Guller 
wrote:

>  Hi –
>
>
>
> Has anybody figured out how to integrate a Play application with Spark and
> run it on a Spark cluster using spark-submit script? I have seen some blogs
> about creating a simple Play app and running it locally on a dev machine
> with sbt run command. However, those steps don’t work for Spark-submit.
>
>
>
> If you have figured out how to build and run a Play app with Spark-submit,
> I would appreciate if you could share the steps and the sbt settings for
> your Play app.
>
>
>
> Thanks,
>
> Mohammed
>
>
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
Hello Terry,

I guess you hit this bug: the list of needed column ids was messed up. Can you
try the master branch, or apply the corresponding code change to your 1.1, and
see if the problem is resolved?

Thanks,

Yin

On Wed, Oct 15, 2014 at 12:08 PM, Terry Siu 
wrote:

>  Hi Yin,
>
>  pqt_rdt_snappy has 76 columns. These two parquet tables were created via
> Hive 0.12 from existing Avro data using CREATE TABLE following by an INSERT
> OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition
> while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I
> noticed that when I populated it with a single INSERT OVERWRITE over all
> the partitions and then executed the Spark code, it would report an illegal
> index value of 29.  However, if I manually did INSERT OVERWRITE for every
> single partition, I would get an illegal index value of 21. I don’t know if
> this will help in debugging, but here’s the DESCRIBE output for
> pqt_segcust_snappy:
>
>   OK
>
> col_namedata_type   comment
>
> customer_id string  from deserializer
>
> age_range   string  from deserializer
>
> gender  string  from deserializer
>
> last_tx_datebigint  from deserializer
>
> last_tx_date_ts string  from deserializer
>
> last_tx_date_dt string  from deserializer
>
> first_tx_date   bigint  from deserializer
>
> first_tx_date_tsstring  from deserializer
>
> first_tx_date_dtstring  from deserializer
>
> second_tx_date  bigint  from deserializer
>
> second_tx_date_ts   string  from deserializer
>
> second_tx_date_dt   string  from deserializer
>
> third_tx_date   bigint  from deserializer
>
> third_tx_date_tsstring  from deserializer
>
> third_tx_date_dtstring  from deserializer
>
> frequency   double  from deserializer
>
> tx_size double  from deserializer
>
> recency double  from deserializer
>
> rfm double  from deserializer
>
> tx_countbigint  from deserializer
>
> sales   double  from deserializer
>
> coll_def_id string  None
>
> seg_def_id  string  None
>
>
>
> # Partition Information
>
> # col_name  data_type   comment
>
>
>
> coll_def_id string  None
>
> seg_def_id  string  None
>
> Time taken: 0.788 seconds, Fetched: 29 row(s)
>
>
>  As you can see, I have 21 data columns, followed by the 2 partition
> columns, coll_def_id and seg_def_id. Output shows 29 rows, but that looks
> like it’s just counting the rows in the console output. Let me know if you
> need more information.
>
>
>  Thanks
>
> -Terry
>
>
>   From: Yin Huai 
> Date: Tuesday, October 14, 2014 at 6:29 PM
> To: Terry Siu 
> Cc: Michael Armbrust , "user@spark.apache.org" <
> user@spark.apache.org>
>
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
>   Hello Terry,
>
>  How many columns does pqt_rdt_snappy have?
>
>  Thanks,
>
>  Yin
>
> On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu 
> wrote:
>
>>  Hi Michael,
>>
>>  That worked for me. At least I’m now further than I was. Thanks for the
>> tip!
>>
>>  -Terry
>>
>>   From: Michael Armbrust 
>> Date: Monday, October 13, 2014 at 5:05 PM
>> To: Terry Siu 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>>
>>   There are some known bug with the parquet serde and spark 1.1.
>>
>>  You can try setting spark.sql.hive.convertMetastoreParquet=true to
>> cause spark sql to use built in parquet support when the serde looks like
>> parquet.
>>
>> On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu 
>> wrote:
>>
>>>  I am currently using Spark 1.1.0 that has been compiled against Hadoop
>>> 2.3. Our cluster is CDH5.1.2 which is runs Hive 0.12. I have two external
>>> Hive tables that point to Parquet (compressed with Snappy), which were
>>> converted over from Avro if that matters.
>>>
>>>  I am trying to perform a join with these two Hive tables, but am
>>> encountering an exception. In a nutshell, I launch a spark shell, create my
>>> HiveContext (pointing to the correct metastore on our cluster), and then
>>> proceed to do the following:
>>>
>>>  scala> val hc = new HiveContext(sc)
>>>
>>>  scala> val txn = hc.sql(“select * from pqt_rdt_snappy where transdate
>>> >= 132537600 and transdate <= 134006399”)

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Cheng Lian
Why do you need to convert a JavaSchemaRDD to SchemaRDD? Are you trying 
to use some API that doesn't exist in JavaSchemaRDD?


On 10/15/14 5:50 PM, Earthson wrote:

I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found
that DataTypeConversions is protected[sql].

Finally I find this solution:



 jrdd.registerTempTable("transform_tmp")
 jrdd.sqlContext.sql("select * from transform_tmp")





Could Any One tell me that: Is it a good idea for me to *use catalyst as
DSL's execution engine?*

I am trying to build a DSL, And I want to confirm this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread Cheng Lian


On 10/16/14 12:44 PM, neeraj wrote:

I would like to reiterate that I don't have Hive installed on the Hadoop
cluster.
I have some queries on following comment from Cheng Lian-2:
"The Thrift server is used to interact with existing Hive data, and thus
needs Hive Metastore to access Hive catalog. In your case, you need to build
Spark with sbt/sbt -Phive,hadoop-2.4 clean package. But since you’ve already
started Thrift server successfully, this step should already have been done
properly."

1. Even though, I don't have Hive installed, How can I connect my
application (Microsoft Excel etc.) to Spark SQL. Do I must have Hive
installed.
Are you trying to use Excel as a data source of Spark SQL, or using 
Spark SQL as a data source of Excel? You can use Spark SQL in your own 
Spark applications without involving Hive, but the Thrift server is 
designed to interact with existing Hive data. Actually it's just a 
HiveServer2 port for Spark SQL.

2. Where can I download/get Spark SQL JDBC/ODBC drivers as I could not find
it on databricks site.
3. Could somebody point me to the steps to connect Excel with Spark SQL and get
some data. Is this possible at all?
I think this article from Denny Lee can be helpful, although it's about 
Tableau rather than Excel: 
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql

4. Which all applications can be used to connect Spark SQL.

In theory, all applications that support ODBC/JDBC can connect to Spark SQL.


Regards,
Neeraj








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/YARN-deployment-of-Spark-and-Thrift-JDBC-server-tp16374p16537.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Re: SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-16 Thread Cheng Lian
The warehouse location needs to be specified before the HiveContext is
initialized; you can set it via:

./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/home/spark/hive/warehouse

On 10/15/14 8:55 PM, Hao Ren wrote:


Hi,

The following query in sparkSQL 1.1.0 CLI doesn't work.

*SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse
;

create table test as
select v1.*, v2.card_type, v2.card_upgrade_time_black,
v2.card_upgrade_time_gold
from customer v1 left join customer_loyalty v2
on v1.account_id = v2.account_id
limit 5
;*

StackTrack =>

org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(*message:file:/user/hive/warehouse/test* is not a directory or
unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:559)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog.createTable(HiveMetastoreCatalog.scala:99)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:116)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:111)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
at
org.apache.spark.sql.hive.HiveMetastoreCatalog$CreateTables$.apply(HiveMetastoreCatalog.scala:111)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: MetaException(message:file:/user/hive/warehouse/test is not a
directory or unable to create one)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1060)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1107)
at sun.reflect.GeneratedMethodAccessor133.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
at com.sun.proxy.$Proxy15.create_table_with_environment_context(Unknown
Source)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:482)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:471)
at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy16.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive

Unit testing: Mocking out Spark classes

2014-10-16 Thread Saket Kumar
Hello all,

I am trying to unit test the classes involved in my Spark job. I am trying to
mock out the Spark classes (like SparkContext and Broadcast) so that I can
unit test my classes in isolation. However I have realised that these are
classes instead of traits. My first question is why?

It is quite hard to mock out classes using ScalaTest+ScalaMock as the
classes which need to be mocked out need to be annotated with
org.scalamock.annotation.mock as per
http://www.scalatest.org/user_guide/testing_with_mock_objects#generatedMocks.
I cannot do that in my case as I am trying to mock out the spark classes.

Am I missing something? Is there a better way to do this?

val sparkContext = mock[SparkInteraction]
val trainingDatasetLoader = mock[DatasetLoader]
val broadcastTrainingDatasetLoader = mock[Broadcast[DatasetLoader]]
def transformerFunction(source: Iterator[(HubClassificationData,
String)]): Iterator[String] = {
  source.map(_._2)
}
val classificationResultsRDD = mock[RDD[String]]
val classificationResults = Array("","","")
val inputRDD = mock[RDD[(HubClassificationData, String)]]

inSequence{
  inAnyOrder{
(sparkContext.broadcast[DatasetLoader]
_).expects(trainingDatasetLoader).returns(broadcastTrainingDatasetLoader)
  }
}

val sparkInvoker = new SparkJobInvoker(sparkContext,
trainingDatasetLoader)

when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD)
sparkInvoker.invoke(inputRDD)

Thanks,
Saket


Re: How to write data into Hive partitioned Parquet table?

2014-10-16 Thread Michael Armbrust
Support for dynamic partitioning is available in master and will be part of
Spark 1.2

On Thu, Oct 16, 2014 at 1:08 AM, Banias H  wrote:

> I got tipped by an expert that the error of "Unsupported language
> features in query" that I had was due to the fact that SparkSQL does not
> support dynamic partitions, and I can do saveAsParquetFile() for each
> partition.
>
> My inefficient implementation is to:
>
> //1. run the query without the INSERT INTO ... PARTITION clause (which
> //   SparkSQL cannot handle) and without DISTRIBUTE BY field1 SORT BY field2.
> JavaSchemaRDD rawRdd = hiveCtx.sql("SELECT field1, field2, partition_field
> FROM source_table");
> rawRdd.registerAsTempTable("temp");
>
> //2. Get a list of unique partition_field values
> JavaSchemaRDD partFieldsRdd = hiveCtx.sql("SELECT DISTINCT partition_field
> FROM temp");
>
> //3. Iterate each partition_field value. Run a query to get JavaSchemaRDD.
> Then save the result as ParquetFile
> for (Row row : partFieldsRdd.toArray()) {
> String partitionVal = row.getString(0);
> hiveCtx.sql("SELECT * FROM temp WHERE partition_field="+partitionVal).
> saveAsParquetFile("partition_field="+partitionVal);
> }
>
> It ran and produced the desired output. However Hive runs orders of
> magnitude faster than the code above. Anyone who can shed some light on a
> more efficient implementation is much appreciated.  Many thanks.
>
> Regards,
> BH
>
> On Tue, Oct 14, 2014 at 8:44 PM, Banias H  wrote:
>
>> Hi,
>>
>> I am still new to Spark. Sorry if similar questions are asked here
>> before. I am trying to read a Hive table; then run a query and save the
>> result into a Hive partitioned Parquet table.
>>
>> For example, I was able to run the following in Hive:
>> INSERT INTO TABLE target_table PARTITION (partition_field) select field1,
>> field2, partition_field FROM source_table DISTRIBUTE BY field1 SORT BY
>> field2
>>
>> But when I tried running it in spark-sql, it gave me the following error:
>>
>> java.lang.RuntimeException:
>> Unsupported language features in query: INSERT INTO TABLE ...
>>
>> I also tried the following Java code and I saw the same error:
>>
>> SparkConf sparkConf = new SparkConf().setAppName("Example");
>> JavaSparkContext ctx = new JavaSparkContext(sparkConf);
>> JavaHiveContext hiveCtx = new JavaHiveContext(ctx);
>> JavaSchemaRDD rdd = hiveCtx.sql("INSERT INTO TABLE target_table PARTITION
>> (partition_field) select field1, field2, partition_field FROM source_table
>> DISTRIBUTE BY field1 SORT BY field2");
>> ...
>> rdd.count(); //Just for running the query
>>
>> If I take out "INSERT INTO TABLE target_table PARTITION
>> (partition_field)" from the sql statement and run that in hiveCtx.sql(), I
>> get an RDD, but then I can only do
>> rdd.saveAsParquetFile(target_table_location), and that is not partitioned
>> correctly.
>>
>> Any help is much appreciated. Thanks.
>>
>> Regards,
>> BH
>>
>
>

