Re: Efficient Key Structure in pairRDD
Spark Dev / Users, help in this regard would be appreciated; we are kind of stuck at this point.
How to change the default limiter for textFile function
I am a newbie to Spark, and I program in Python. I use the textFile function to make an RDD from a file. I notice that the default delimiter is the newline; however, I want to change this default delimiter to something else. After searching the web, I came to know about the textinputformat.record.delimiter property, but it doesn't seem to have any effect when I use it with SparkConf. So my question is, how do I change the default delimiter in Python?
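For reference, one workaround commonly suggested for this (a sketch, assuming Spark 1.1 with the Hadoop 2 new-API TextInputFormat, which honours textinputformat.record.delimiter) is to skip textFile and pass a Configuration to newAPIHadoopFile; PySpark exposes a similar sc.newAPIHadoopFile that accepts a conf dictionary. The Scala form looks roughly like this, where "/path/to/file" and the "\u0001" delimiter are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Assumes the spark-shell's sc; "\u0001" is just an example delimiter.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\u0001")

val records = sc
  .newAPIHadoopFile("/path/to/file", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }  // keep only the record text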
Re: Fwd: Executor Lost Failure
Yes... found the output on the web UI of the slave. Thanks :) On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Tasks are now getting submitted, but many tasks don't happen. Like, after opening the spark-shell, I load a text file from disk and try printing its contents as: sc.textFile("/path/to/file").foreach(println) It does not give me any output. That's because foreach launches tasks on the slaves. When each task tries to print its lines, they go to the stdout file on the slave rather than to your console at the driver. You should see the file's contents in each of the slaves' stdout files in the web UI. This only happens when running on a cluster. In local mode, all the tasks are running locally and can output to the driver, so foreach(println) is more useful. Ankur
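To make that point concrete, a small sketch (assuming the spark-shell's sc and a result small enough to bring back) of getting the output at the driver instead of in each executor's stdout file:

val lines = sc.textFile("/path/to/file")

// foreach(println) runs on the executors; bring the data to the driver first
// if you want to see it in your console.
lines.take(10).foreach(println)      // a few records, safe for large files
lines.collect().foreach(println)     // the whole file, only if it is small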
Re: disable log4j for spark-shell
Go to your Spark home, then into the conf/ directory, and edit the log4j.properties file, i.e.: gedit $SPARK_HOME/conf/log4j.properties and set the root logger to: log4j.rootCategory=WARN, console You don't need to rebuild Spark for the changes to take effect. Whenever you open spark-shell, it looks into the conf directory by default and loads all the properties. Thanks On Tue, Nov 11, 2014 at 6:34 AM, lordjoe lordjoe2...@gmail.com wrote: public static void main(String[] args) throws Exception { System.out.println("Set Log to Warn"); Logger rootLogger = Logger.getRootLogger(); rootLogger.setLevel(Level.WARN); ... works for me
Is there any benchmark suite for spark?
I need to do some testing on Spark and am looking for some open source tools. Is there any benchmark suite for Spark that covers common use cases, like Intel's HiBench (https://github.com/intel-hadoop/HiBench)? -- Best Regards, Hu Liu
Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
Hi, I have a Spark application which uses the Cassandra connector (spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar) to load data from Cassandra into Spark. Everything works fine in local mode, when I run it in my IDE. But when I submit the application to be executed on a standalone Spark server, I get the following exception (which is apparently related to Guava versions?!). Does anyone know how to solve this? I create a jar file of my Spark application using assembly.bat, and the following are the dependencies I used. I put the connector spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar in the lib/ folder of my Eclipse project, that's why it is not included in the dependencies: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-catalyst" % "1.1.0" % "provided", "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(), "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(), "net.jpountz.lz4" % "lz4" % "1.2.0", "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-api") exclude("javax.servlet", "servlet-api"), "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" intransitive(), "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"), "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided", "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided", "com.github.nscala-time" %% "nscala-time" % "1.0.0", "org.scalatest" %% "scalatest" % "1.9.1" % "test", "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided", "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided", "org.json4s" %% "json4s-jackson" % "3.2.5", "junit" % "junit" % "4.8.1" % "test", "org.slf4j" % "slf4j-api" % "1.7.7", "org.slf4j" % "slf4j-simple" % "1.7.7", "org.clapper" %% "grizzled-slf4j" % "1.0.2", "log4j" % "log4j" % "1.2.17", "com.google.guava" % "guava" % "16.0" ) best, /Shahab And this is the exception I get: Exception in thread "main" com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set; at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:39) at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:60) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:123) at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:65)
JdbcRDD and ClassTag issue
Hi All, I am trying to access SQL Server through JdbcRDD, but I am getting an error on the ClassTag placeholder. Here is the code I wrote: public void readFromDB() { String sql = "Select * from Table_1 where values = ? and values = ?"; class GetJDBCResult extends AbstractFunction1<ResultSet, Integer> { public Integer apply(ResultSet rs) { Integer result = null; try { result = rs.getInt(1); } catch (SQLException e) { // TODO Auto-generated catch block e.printStackTrace(); } return result; } } JdbcRDD<Integer> jdbcRdd = new JdbcRDD<Integer>(sc, jdbcInitialization(), sql, 0, 120, 2, new GetJDBCResult(), scala.reflect.ClassTag$.MODULE$.apply(Object.class)); } Can anybody here recommend a solution to this? Thanks Nitin
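For comparison, a minimal Scala sketch of the same kind of call (hypothetical connection URL; the query and bounds mirror the Java code above): in Scala the ClassTag is filled in implicitly by the compiler, which is exactly the placeholder the Java version has to pass explicitly.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Hypothetical SQL Server URL; bounds and partition count mirror the Java code.
val jdbcRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:sqlserver://dbhost;databaseName=mydb;user=me;password=secret"),
  "Select * from Table_1 where values = ? and values = ?",
  0L, 120L, 2,
  (rs: ResultSet) => rs.getInt(1)  // ClassTag[Int] supplied implicitly
)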
Where can I find logs from workers PySpark
Hi, I am trying to do some logging in my PySpark jobs, particularly in the map that is performed on the workers. Unfortunately I am not able to find these logs. Based on the documentation it seems that the logs should be on the master, in the work directory under SPARK_HOME (http://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging). But unfortunately I can't see this directory. So I would like to ask, how do you do logging on workers? Is it somehow possible to send these logs also to the master, for example after the job finishes? Thank you in advance for any suggestions and advice. Best regards, Jan
java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener
Hi I am getting the following error while executing a Scala Twitter program for Spark 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V 14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) I think it might be a dependency issue so sharing pom.xml too <dependencies> <dependency> <groupId>org.twitter4j</groupId> <artifactId>twitter4j-core</artifactId> <version>3.0.3</version> </dependency> <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.0-beta1</version> </dependency> <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.3.5</version> </dependency> <dependency> <groupId>oauth.signpost</groupId> <artifactId>signpost-commonshttp4</artifactId> <version>1.2</version> </dependency> <dependency> <groupId>org.scalatest</groupId> <artifactId>scalatest_2.10</artifactId> <version>3.0.0-SNAP2</version> </dependency> <dependency> <groupId>commons-io</groupId> <artifactId>commons-io</artifactId> <version>2.4</version>
</dependency> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>4.4</version> </dependency> <dependency> <groupId>org.twitter4j</groupId> <artifactId>twitter4j-stream</artifactId> <version>3.0.3</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.10</artifactId>
Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener
You are not having the twitter4j jars in the classpath. While running it in the cluster mode you need to ship those dependency jars. You can do like: sparkConf.setJars(/home/akhld/jars/twitter4j-core-3.0.3.jar, /home/akhld/jars/twitter4j-stream-3.0.3.jar) You can make sure they are shipped by checking the Application WebUI (4040) environment tab. Thanks Best Regards On Tue, Nov 11, 2014 at 5:48 PM, Jishnu Menath Prathap (WT01 - BAS) jishnu.prat...@wipro.com wrote: *Hi I am getting the following error while executing a scala_twitter program for spark* 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V 14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 1 *I think it might be a dependency issue so sharing pom.xml too* dependencies dependency groupIdorg.twitter4j/groupId artifactIdtwitter4j-core/artifactId version3.0.3/version /dependency dependency groupIdorg.apache.httpcomponents/groupId artifactIdhttpclient/artifactId 
version4.0-beta1/version /dependency dependency groupIdorg.apache.httpcomponents/groupId artifactIdhttpclient/artifactId version4.3.5/version /dependency dependency groupIdoauth.signpost/groupId artifactIdsignpost-commonshttp4/artifactId version1.2/version /dependency dependency groupIdorg.scalatest/groupId artifactIdscalatest_2.10/artifactId version3.0.0-SNAP2/version /dependency dependency groupIdcommons-io/groupId artifactIdcommons-io/artifactId version2.4/version /dependency
save as file
Hi, I am using Spark 1.1.0. I need help with saving an RDD as a JSON file. How do I do that? And how do I mention an HDFS path in the program? -Naveen
RE: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener
Hi Thank you Akhil for reply. I am not using cluster mode I am doing in local mode val sparkConf = new SparkConf().setAppName(TwitterPopularTags).setMaster(local).set(spark.eventLog.enabled,true) Also is there anywhere documented which Twitter4j version to be used for different versions of spark. Thanks Regards Jishnu Menath Prathap From: Akhil [via Apache Spark User List] [mailto:ml-node+s1001560n18576...@n3.nabble.com] Sent: Tuesday, November 11, 2014 6:30 PM To: Jishnu Menath Prathap (WT01 - BAS) Subject: Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener You are not having the twitter4j jars in the classpath. While running it in the cluster mode you need to ship those dependency jars. You can do like: sparkConf.setJars(/home/akhld/jars/twitter4j-core-3.0.3.jar,/home/akhld/jars/twitter4j-stream-3.0.3.jar) You can make sure they are shipped by checking the Application WebUI (4040) environment tab. Thanks Best Regards On Tue, Nov 11, 2014 at 5:48 PM, Jishnu Menath Prathap (WT01 - BAS) [hidden email]/user/SendEmail.jtp?type=nodenode=18576i=0 wrote: Hi I am getting the following error while executing a scala_twitter program for spark 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V 14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 1 I think it might be a dependency issue so sharing pom.xml too dependencies dependency groupIdorg.twitter4j/groupId artifactIdtwitter4j-core/artifactId version3.0.3/version /dependency dependency groupIdorg.apache.httpcomponents/groupId artifactIdhttpclient/artifactId version4.0-beta1/version /dependency dependency groupIdorg.apache.httpcomponents/groupId artifactIdhttpclient/artifactId version4.3.5/version /dependency dependency groupIdoauth.signpost/groupId
Spark-submit and Windows / Linux mixed network
Hi, I'm trying to submit a spark application from a network share to the spark master. Network shares are configured so that the master and all nodes have access to the target jar at (say): \\shares\publish\Spark\app1\someJar.jar And this is mounted on each linux box (i.e. master and workers) at: /mnt/spark/app1/someJar.jar I'm using the following to submit the app from a windows machine: spark-submit.cmd --class Main --master spark://mastername:7077 local:/mnt/spark/app1/someJar.jar However, I get an error saying: Warning: Local jar \mnt\spark\app1\someJar.jar does not exist, skipping. Followed by a ClassNotFoundException: Main. Notice the \'s instead of /s. Does anybody have experience getting something similar to work? Regards, Ashic.
Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener
You can pick the dependency version from here http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.10 Thanks Best Regards On Tue, Nov 11, 2014 at 6:36 PM, Jishnu Menath Prathap (WT01 - BAS) jishnu.prat...@wipro.com wrote: Hi Thank you Akhil for reply. I am not using cluster mode I am doing in local mode val sparkConf = new SparkConf().setAppName(TwitterPopularTags). *setMaster(local)*.set(spark.eventLog.enabled,true) Also is there anywhere documented which Twitter4j version to be used for different versions of spark. Thanks Regards Jishnu Menath Prathap *From:* Akhil [via Apache Spark User List] [mailto: ml-node+s1001560n18576...@n3.nabble.com] *Sent:* Tuesday, November 11, 2014 6:30 PM *To:* Jishnu Menath Prathap (WT01 - BAS) *Subject:* Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener You are not having the twitter4j jars in the classpath. While running it in the cluster mode you need to ship those dependency jars. You can do like: sparkConf.setJars(/home/akhld/jars/twitter4j-core-3.0.3.jar,/home/akhld/jars/twitter4j-stream-3.0.3.jar) You can make sure they are shipped by checking the Application WebUI (4040) environment tab. Thanks Best Regards On Tue, Nov 11, 2014 at 5:48 PM, Jishnu Menath Prathap (WT01 - BAS) [hidden email] http://user/SendEmail.jtp?type=nodenode=18576i=0 wrote: *Hi I am getting the following error while executing a scala_twitter program for spark* 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V 14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 1 *I think it might be a dependency issue so sharing pom.xml too* dependencies dependency groupIdorg.twitter4j/groupId artifactIdtwitter4j-core/artifactId version3.0.3/version /dependency dependency groupIdorg.apache.httpcomponents/groupId
Best practice for multi-user web controller in front of Spark
We are relatively new to spark and so far have been manually submitting single jobs at a time for ML training, during our development process, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our hdfs corpus, which only changes incrementally on a daily basis. (that detail is relevant to question 3 below) Now we are ready to start building out the front-end, which will allow a team of data scientists to submit their problems to the system via a web front-end (web tier will be java). Users could of course be submitting jobs more or less simultaneously. We want to make sure we understand how to best structure this. Questions: 1 - Does a new SparkContext get created in the web tier for each new request for processing? 2 - If so, how much time should we expect it to take for setting up the context? Our goal is to return a response to the users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell one might conclude that it might take more than 10 seconds to create a context, however it's not clear how much of that is context-creation vs other things. 3 - (This last question perhaps deserves a post in and of itself:) if every job is always comparing some little data structure to the same HDFS corpus of data, what is the best pattern to use to cache the RDD's from HDFS so they don't have to always be re-constituted from disk? I.e. how can RDD's be shared from the context of one job to the context of subsequent jobs? Or does something like memcache have to be used? Thanks! David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
How to kill a Spark job running in cluster mode ?
I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode, which means the driver is not running on the local node. So how can I kill such a job? Is there a command like hadoop job -kill job-id, which kills a running MapReduce job? Thanks
Re: save as file
One approach would be to use saveAsNewAPIHadoopFile and specify a JSON output format. Another simple one would be like: val rdd = sc.parallelize(1 to 100) val json = rdd.map(x => { val m: Map[String, Int] = Map("id" -> x) new JSONObject(m) }) json.saveAsTextFile("output") Thanks Best Regards On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, I am using Spark 1.1.0. I need help with saving an RDD as a JSON file. How do I do that? And how do I mention an HDFS path in the program? -Naveen
Re: save as file
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output to any specified location. The params to be provided are: the path of the storage location and the number of partitions. For giving an HDFS path we use the following format: /user/user-name/directory-to-store/ On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, I am using Spark 1.1.0. I need help with saving an RDD as a JSON file. How do I do that? And how do I mention an HDFS path in the program? -Naveen
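A short sketch pulling the two answers together (the namenode host/port and directory names are hypothetical):

val rdd = sc.parallelize(1 to 100)

// Text output to HDFS, using the full URI form; a bare path like
// "/user/user-name/directory-to-store" also works when HDFS is the default FS.
rdd.saveAsTextFile("hdfs://namenode:9000/user/user-name/directory-to-store")

// Binary (sequence-file) output, for comparison.
rdd.saveAsObjectFile("hdfs://namenode:9000/user/user-name/directory-to-store-obj")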
Re: Custom persist or cache of RDD?
But that requires an (unnecessary) load from disk. I have run into this same issue, where we want to save intermediate results but continue processing. The cache / persist feature of Spark doesn't seem designed for this case. Unfortunately I'm not aware of a better solution with the current version of Spark. On Mon, Nov 10, 2014 at 5:15 PM, Sean Owen so...@cloudera.com wrote: Well you can always create C by loading B from disk, and likewise for E / D. No need for any custom procedure. On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote: When I have a multi-step process flow like this: A -> B -> C -> D -> E -> F I need to store B and D's results into parquet files: B.saveAsParquetFile D.saveAsParquetFile If I don't cache/persist any step, Spark might recompute from A, B, C, D and E if something is wrong in F. Of course, I'd better cache all steps if I have enough memory to avoid this re-computation, or persist the results to disk. But persisting B and D seems to duplicate saving B and D as parquet files. I'm wondering if Spark can restore B and D from the parquet files using a customized persist and restore procedure? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
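A sketch of the load-it-back approach Sean describes (assuming Spark SQL 1.1; the Record schema and checkpoint paths are hypothetical stand-ins): write the intermediate SchemaRDD once, and on a later run start from the saved copy instead of recomputing the upstream steps.

import org.apache.spark.sql.SQLContext

case class Record(id: Int, value: String)  // stand-in schema for step B

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val b = sc.parallelize(1 to 100).map(i => Record(i, "v" + i))  // "B" in the flow

// Persist the intermediate result once...
b.saveAsParquetFile("hdfs:///checkpoints/B.parquet")

// ...and later resume from the saved copy rather than recomputing from A.
val bReloaded = sqlContext.parquetFile("hdfs:///checkpoints/B.parquet")
val c = bReloaded.map(row => row.getInt(0))  // continue the pipeline from here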
Re: How to kill a Spark job running in cluster mode ?
There is a property, spark.ui.killEnabled, which needs to be set to true for killing applications directly from the web UI. Check the link: Kill Enable spark job http://spark.apache.org/docs/latest/configuration.html#spark-ui Thanks On Tue, Nov 11, 2014 at 7:42 PM, Sonal Goyal sonalgoy...@gmail.com wrote: The web interface has a kill link. You can try using that. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Nov 11, 2014 at 7:28 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode, which means the driver is not running on the local node. So how can I kill such a job? Is there a command like hadoop job -kill job-id which kills a running MapReduce job? Thanks
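For reference, a minimal sketch of turning the property on programmatically (it can equally go into conf/spark-defaults.conf as "spark.ui.killEnabled true"; the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Enables the kill links in the web UI for this application's stages/jobs.
val conf = new SparkConf()
  .setAppName("KillableApp")
  .set("spark.ui.killEnabled", "true")
val sc = new SparkContext(conf)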
Re: Spark + Tableau
I finally solved the issue with the Spark Tableau connection. Thanks to Denny Lee for the blog post: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql The solution was to use the Authentication type Username, and then use the username for the metastore. Best regards Bojan
Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?
Hi, Just wondering if anyone has any advice about this issue, as I am experiencing the same thing. I'm working with multiple broadcast variables in PySpark, most of which are small, but one of around 4.5GB, using 10 workers at 31GB memory each and driver with same spec. It's not running out of memory as far as I can see, but definitely only happens when I add the large broadcast. Would be most grateful for advice. I tried playing around with the last 3 conf settings below, but no luck: SparkConf().set(spark.master.memory, 26) .set(spark.executor.memory, 26) .set(spark.worker.memory, 26) .set(spark.driver.memory, 26). .set(spark.storage.memoryFraction,1) .set(spark.core.connection.ack.wait.timeout,6000) .set(spark.akka.frameSize,50) Thanks, Tom On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote: Hi All, I am relatively new to spark and currently having troubles with broadcasting large variables ~500mb in size. Th e broadcast fails with an error shown below and the memory usage on the hosts also blow up. Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb (workers)) and we are using Spark 1.1 (Python) via Cloudera CDH 5.2. We have managed to replicate the error using a test script shown below. I would be interested to know if anyone has seen this before with broadcasting or know of a fix. === ERROR == 14/10/24 08:20:04 INFO BlockManager: Found block rdd_11_31 locally 14/10/24 08:20:08 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fbc6caf 14/10/24 08:20:08 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon3.ldn.ebs.io,55316) 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon3.ldn.ebs.io,55316) 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon3.ldn.ebs.io,55316) 14/10/24 08:20:08 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fbc6caf java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) 14/10/24 08:20:13 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon7.ldn.ebs.io,52524) 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon7.ldn.ebs.io,52524) 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon7.ldn.ebs.io,52524) 14/10/24 08:20:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(toppigeon.ldn.ebs.io,34370) 14/10/24 08:20:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3ecfdb7e 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(toppigeon.ldn.ebs.io,34370) 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(toppigeon.ldn.ebs.io,34370) 14/10/24 08:20:15 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@3ecfdb7e java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628) 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628) 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996) 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996) 14/10/24 08:20:27 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628) java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) at org.apache.spark.network.SendingConnection.read(Connection.scala:390) at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/24 08:20:27 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996) java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
Re: Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
Hi, It looks like you are building from master (spark-cassandra-connector-assembly-1.2.0). - Append this to your com.google.guava declaration: % provided - Be sure your version of the connector dependency is the same as the assembly build. For instance, if you are using 1.1.0-beta1, build your assembly with that vs master. - You can upgrade your version of cassandra if that is plausible for your deploy environment, to 2.1.0. Side note: we are releasing 1.1.0-beta2 today or tomorrow which allows usage of Cassandra 2.1.1 and fixes any guava issues - Make your version of cassandra server + dependencies match your cassandra driver version. You currently have 2.0.9 with 2.0.4 - Helena @helenaedelson On Nov 11, 2014, at 6:13 AM, shahab shahab.mok...@gmail.com wrote: Hi, I have a spark application which uses Cassandra connectorspark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar to load data from Cassandra into spark. Everything works fine in the local mode, when I run in my IDE. But when I submit the application to be executed in standalone Spark server, I get the following exception, (which is apparently related to Guava versions ???!). Does any one know how to solve this? I create a jar file of my spark application using assembly.bat, and the followings is the dependencies I used: I put the connectorspark-cassandra-connector-assembly-1.2.0-SNAPSHOT.ja in the lib/ folder of my eclipse project thats why it is not included in the dependencies libraryDependencies ++= Seq( org.apache.spark%% spark-catalyst% 1.1.0 % provided, org.apache.cassandra % cassandra-all % 2.0.9 intransitive(), org.apache.cassandra % cassandra-thrift % 2.0.9 intransitive(), net.jpountz.lz4 % lz4 % 1.2.0, org.apache.thrift % libthrift % 0.9.1 exclude(org.slf4j, slf4j-api) exclude(javax.servlet, servlet-api), com.datastax.cassandra % cassandra-driver-core % 2.0.4 intransitive(), org.apache.spark %% spark-core % 1.1.0 % provided exclude(org.apache.hadoop, hadoop-core), org.apache.spark %% spark-streaming % 1.1.0 % provided, org.apache.hadoop % hadoop-client % 1.0.4 % provided, com.github.nscala-time %% nscala-time % 1.0.0, org.scalatest %% scalatest % 1.9.1 % test, org.apache.spark %% spark-sql % 1.1.0 % provided, org.apache.spark %% spark-hive % 1.1.0 % provided, org.json4s %% json4s-jackson % 3.2.5, junit % junit % 4.8.1 % test, org.slf4j % slf4j-api % 1.7.7, org.slf4j % slf4j-simple % 1.7.7, org.clapper %% grizzled-slf4j % 1.0.2, log4j % log4j % 1.2.17, com.google.guava % guava % 16.0 ) best, /Shahab And this is the exception I get: Exception in thread main com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set; at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:39) at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:60) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123) at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:123) at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:65)
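If it helps, the Guava change Helena suggests would look roughly like this in the build definition above (a sketch; the point is that the Spark / connector classpath, not the application jar, supplies Guava at runtime):

// build.sbt fragment (sketch)
libraryDependencies += "com.google.guava" % "guava" % "16.0" % "provided"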
Re: Spark-submit and Windows / Linux mixed network
Never tried this form, but just guessing: what's the output when you submit this jar: \\shares\publish\Spark\app1\someJar.jar using spark-submit.cmd?
Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId
Yes please can you share. I am getting this error after expanding my application to include a large broadcast variable. Would be good to know if it can be fixed with configuration. On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com wrote: Can you list what your fix was so others can benefit? On Wed, Oct 22, 2014 at 8:15 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have managed to resolve it because a wrong setting. Please ignore this . Regards Arthur On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up
scala.MatchError
Hi, This is my Instrument Java constructor: public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issuing = issuing; } I am trying to create a JavaSchemaRDD: JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class); Remarks: Instrument, Issue, Issuer, Issuing are all Java classes; distData is holding a List<Instrument>. I am getting the following error: Exception in thread "Driver" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: scala.MatchError: class sample.spark.test.Issue (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at sample.spark.test.SparkJob.main(SparkJob.java:33) ... 5 more Please help me. Regards, Naveen.
Combining data from two tables in two databases postgresql, JdbcRDD.
I want to be able to perform a query on two tables in different databases. I want to know whether it can be done. I've heard about the union of two RDDs, but here I want to connect to something like different partitions of a table. Any help is appreciated. import java.io.Serializable; //import org.junit.*; //import static org.junit.Assert.*; import scala.*; import scala.runtime.AbstractFunction0; import scala.runtime.AbstractFunction1; import scala.runtime.*; import scala.collection.mutable.LinkedHashMap; //import static scala.collection.Map.Projection; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.*; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.rdd.*; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.api.java.JavaSQLContext; import org.apache.spark.sql.api.java.JavaSchemaRDD; import org.apache.spark.sql.api.java.Row; import org.apache.spark.sql.api.java.DataType; import org.apache.spark.sql.api.java.StructType; import org.apache.spark.sql.api.java.StructField; import org.apache.spark.sql.api.java.Row; import org.apache.spark.SparkConf; import org.apache.spark.SparkContext; import java.sql.*; import java.util.*; import com.mysql.jdbc.Driver; import com.mysql.jdbc.*; import java.io.*; public class Spark_Mysql { static class Z extends AbstractFunction0<java.sql.Connection> implements Serializable { java.sql.Connection con; public java.sql.Connection apply() { try { con = DriverManager.getConnection("jdbc:mysql://localhost:3306/azkaban?user=azkaban&password=password"); } catch(Exception e) { e.printStackTrace(); } return con; } } static public class Z1 extends AbstractFunction1<ResultSet, Integer> implements Serializable { int ret; public Integer apply(ResultSet i) { try { ret = i.getInt(1); } catch(Exception e) { e.printStackTrace(); } return ret; } } public static void main(String[] args) throws Exception { String arr[] = new String[1]; arr[0] = "/home/hduser/Documents/Credentials/Newest_Credentials_AX/spark-1.1.0-bin-hadoop1/lib/mysql-connector-java-5.1.33-bin.jar"; JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("JavaSparkSQL").setJars(arr)); SparkContext sctx = new SparkContext(new SparkConf().setAppName("JavaSparkSQL").setJars(arr)); JavaSQLContext sqlCtx = new JavaSQLContext(ctx); try { Class.forName("com.mysql.jdbc.Driver"); } catch(Exception ex) { ex.printStackTrace(); System.exit(1); } JdbcRDD rdd = new JdbcRDD(sctx, new Z(), "SELECT * FROM spark WHERE ? = id AND id = ?", 0L, 1000L, 10, new Z1(), scala.reflect.ClassTag$.MODULE$.AnyRef()); rdd.saveAsTextFile("hdfs://127.0.0.1:9000/user/hduser/mysqlrdd"); rdd.saveAsTextFile("/home/hduser/mysqlrdd"); } }
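For what it's worth, a hedged Scala sketch of the union approach (hypothetical URLs, credentials, table and column names; both JDBC drivers must be on the classpath): build one JdbcRDD per database and then union them into a single dataset for further processing.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// One JdbcRDD per source database; sc is the usual SparkContext.
def tableRdd(url: String) = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url),
  "SELECT id FROM spark WHERE ? <= id AND id <= ?",
  0L, 1000L, 10,
  (rs: ResultSet) => rs.getInt(1)
)

val mysqlRdd    = tableRdd("jdbc:mysql://host1:3306/db1?user=u&password=p")
val postgresRdd = tableRdd("jdbc:postgresql://host2:5432/db2?user=u&password=p")

// union concatenates the two sources into one RDD for further processing.
val combined = mysqlRdd.union(postgresRdd)
combined.saveAsTextFile("hdfs://127.0.0.1:9000/user/hduser/combinedrdd")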
Re: Best practice for multi-user web controller in front of Spark
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote: We are relatively new to spark and so far have been manually submitting single jobs at a time for ML training, during our development process, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our hdfs corpus, which only changes incrementally on a daily basis. (that detail is relevant to question 3 below) Now we are ready to start building out the front-end, which will allow a team of data scientists to submit their problems to the system via a web front-end (web tier will be java). Users could of course be submitting jobs more or less simultaneously. We want to make sure we understand how to best structure this. Questions: 1 - Does a new SparkContext get created in the web tier for each new request for processing? 2 - If so, how much time should we expect it to take for setting up the context? Our goal is to return a response to the users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell one might conclude that it might take more than 10 seconds to create a context, however it's not clear how much of that is context-creation vs other things. 3 - (This last question perhaps deserves a post in and of itself:) if every job is always comparing some little data structure to the same HDFS corpus of data, what is the best pattern to use to cache the RDD's from HDFS so they don't have to always be re-constituted from disk? I.e. how can RDD's be shared from the context of one job to the context of subsequent jobs? Or does something like memcache have to be used? Thanks! David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark and Play
Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm getting into issues of akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine with Spark 1.1.0. Is there something I should do with libraryDependencies in my build.sbt to make it work? Thanks, Akshat
Re: thrift jdbc server probably running queries as hive query
Hi Cheng, I made sure the only hive server running on the machine is hivethriftserver2. /usr/lib/jvm/default-java/bin/java -cp /usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master yarn --jars reporting.jar spark-internal The query I am running is a simple count(*): select count(*) from Xyz where date_prefix=20141031 and pretty sure it's submitting a map reduce job based on the spark logs: TakesRest=false Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=number In order to limit the maximum number of reducers: set hive.exec.reducers.max=number In order to set a constant number of reducers: set mapreduce.job.reduces=number 14/11/11 16:23:17 INFO ql.Context: New scratch dir is hdfs://fdsfdsfsdfsdf:9000/tmp/hive-ubuntu/hive_2014-11-11_16-23-17_333_5669798325805509526-2 Starting Job = job_1414084656759_0142, Tracking URL = http://xxx:8100/proxy/application_1414084656759_0142/ http://t.signauxdix.com/e1t/c/5/f18dQhb0S7lC8dDMPbW2n0x6l2B9nMJW7t5XYg2zGvG-W8rBGxP1p8d-TW64zBkx56dS1Dd58vwq02?t=http%3A%2F%2Fec2-54-83-34-89.compute-1.amazonaws.com%3A8100%2Fproxy%2Fapplication_1414084656759_0142%2Fsi=6222577584832512pi=626685a9-b628-43cc-91a1-93636171ce77 Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414084656759_0142 On Mon, Nov 10, 2014 at 9:59 PM, Cheng Lian lian.cs@gmail.com wrote: Hey Sadhan, I really don't think this is Spark log... Unlike Shark, Spark SQL doesn't even provide a Hive mode to let you execute queries against Hive. Would you please check whether there is an existing HiveServer2 running there? Spark SQL HiveThriftServer2 is just a Spark port of HiveServer2, and they share the same default listening port. I guess the Thrift server didn't start successfully because the HiveServer2 occupied the port, and your Beeline session was probably linked against HiveServer2. Cheng On 11/11/14 8:29 AM, Sadhan Sood wrote: I was testing out the spark thrift jdbc server by running a simple query in the beeline client. The spark itself is running on a yarn cluster. However, when I run a query in beeline - I see no running jobs in the spark UI(completely empty) and the yarn UI seem to indicate that the submitted query is being run as a map reduce job. This is probably also being indicated from the spark logs but I am not completely sure: 2014-11-11 00:19:00,492 INFO ql.Context (Context.java:getMRScratchDir(267)) - New scratch dir is hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-1 2014-11-11 00:19:00,877 INFO ql.Context (Context.java:getMRScratchDir(267)) - New scratch dir is hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2 2014-11-11 00:19:04,152 INFO ql.Context (Context.java:getMRScratchDir(267)) - New scratch dir is hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2 2014-11-11 00:19:04,425 INFO Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.submit.replication is deprecated. 
Instead, use mapreduce.client.submit.file.replication 2014-11-11 00:19:04,516 INFO client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at :8032 2014-11-11 00:19:04,607 INFO client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at :8032 2014-11-11 00:19:04,639 WARN mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this 2014-11-11 00:00:08,806 INFO input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 14912 2014-11-11 00:00:08,864 INFO lzo.GPLNativeCodeLoader (GPLNativeCodeLoader.java:clinit(34)) - Loaded native gpl library 2014-11-11 00:00:08,866 INFO lzo.LzoCodec (LzoCodec.java:clinit(76)) - Successfully loaded initialized native-lzo library [hadoop-lzo rev 8e266e052e423af592871e2dfe09d54c03f6a0e8] 2014-11-11 00:00:09,873 INFO input.CombineFileInputFormat (CombineFileInputFormat.java:createSplits(413)) - DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 194541317 2014-11-11 00:00:10,017 INFO mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:615 2014-11-11 00:00:10,095 INFO mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_1414084656759_0115 2014-11-11 00:00:10,241 INFO impl.YarnClientImpl (YarnClientImpl.java:submitApplication(167))
What should be the number of partitions after a union and a subtractByKey
Assume the following, where updatePairRDD and deletePairRDD are both HashPartitioned. Before the union, each of these has 512 partitions. The newly created updateDeletePairRDD has 1024 partitions. Is this the general/expected behavior for a union (for the number of partitions to double)? JavaPairRDD<String,String> updateDeletePairRDD = updatePairRDD.union(deletePairRDD); Then a similar question for subtractByKey. In the example below, baselinePairRDD is HashPartitioned (with 512 partitions). We know from above that updateDeletePairRDD has 1024 partitions. The newly created workSubtractBaselinePairRDD has 512 partitions. This makes sense because we are only 'subtracting' records from the baselinePairRDD, and one wouldn't think the number of partitions would increase. Is this the general/expected behavior for subtractByKey? JavaPairRDD<String,String> workSubtractBaselinePairRDD = baselinePairRDD.subtractByKey(updateDeletePairRDD);
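One way to answer this empirically in the shell, since the exact behaviour depends on the operations and partitioners involved, is to inspect the partition count and partitioner after each step. A small Scala sketch with toy data (the keys, values and partition counts are placeholders):

import org.apache.spark.HashPartitioner

// Toy stand-ins for the RDDs in the question, hash-partitioned up front.
val updates  = sc.parallelize(Seq(("k1", "u1"), ("k2", "u2"))).partitionBy(new HashPartitioner(512))
val deletes  = sc.parallelize(Seq(("k3", "d1"))).partitionBy(new HashPartitioner(512))
val baseline = sc.parallelize(Seq(("k1", "b1"), ("k9", "b9"))).partitionBy(new HashPartitioner(512))

val unioned    = updates.union(deletes)
val subtracted = baseline.subtractByKey(unioned)

// Check, rather than assume, what each operation did to the partitioning.
println(unioned.partitions.size + " " + unioned.partitioner)
println(subtracted.partitions.size + " " + subtracted.partitioner)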
Re: Best practice for multi-user web controller in front of Spark
For sharing RDDs across multiple jobs - you could also have a look at Tachyon. It provides an HDFS compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/ . - On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com wrote: I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote: We are relatively new to spark and so far have been manually submitting single jobs at a time for ML training, during our development process, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our hdfs corpus, which only changes incrementally on a daily basis. (that detail is relevant to question 3 below) Now we are ready to start building out the front-end, which will allow a team of data scientists to submit their problems to the system via a web front-end (web tier will be java). Users could of course be submitting jobs more or less simultaneously. We want to make sure we understand how to best structure this. Questions: 1 - Does a new SparkContext get created in the web tier for each new request for processing? 2 - If so, how much time should we expect it to take for setting up the context? Our goal is to return a response to the users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell one might conclude that it might take more than 10 seconds to create a context, however it's not clear how much of that is context-creation vs other things. 3 - (This last question perhaps deserves a post in and of itself:) if every job is always comparing some little data structure to the same HDFS corpus of data, what is the best pattern to use to cache the RDD's from HDFS so they don't have to always be re-constituted from disk? I.e. how can RDD's be shared from the context of one job to the context of subsequent jobs? Or does something like memcache have to be used? Thanks! David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
data locality, task distribution
Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some (relatively complex) values. So I see several inconsistencies I don't understand: (1) If I run it a couple times, as separate applications (i.e., reloading, recaching, etc), I will get different %'s cached each time. I've got about 5x as much memory as I need overall, so it isn't running out. But one time, 100% of the data will be cached; the next, 83%, the next, 92%, etc. (2) Also, the data is very unevenly distributed. I've got 400 partitions, and 4 workers (with, I believe, 3x replication), and on my last run, my distribution was 165/139/25/71. Is there any way to get spark to distribute the tasks more evenly? (3) If I run the problem several times in the same execution (to take advantage of caching etc.), I get very inconsistent results. My latest try, I get: - 1st run: 3.1 min - 2nd run: 2 seconds - 3rd run: 8 minutes - 4th run: 2 seconds - 5th run: 2 seconds - 6th run: 6.9 minutes - 7th run: 2 seconds - 8th run: 2 seconds - 9th run: 3.9 minuts - 10th run: 8 seconds I understand the difference for the first run; it was caching that time. Later times, when it manages to work in 2 seconds, it's because all the tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the tasks end up with locality level ANY. Why would that change when running the exact same task twice in a row on cached data? Any help or pointers that I could get would be much appreciated. Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
Re: Status of MLLib exporting models to PMML
Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?
There is a open PR [1] to support broadcast larger than 2G, could you try it? [1] https://github.com/apache/spark/pull/2659 On Tue, Nov 11, 2014 at 6:39 AM, Tom Seddon mr.tom.sed...@gmail.com wrote: Hi, Just wondering if anyone has any advice about this issue, as I am experiencing the same thing. I'm working with multiple broadcast variables in PySpark, most of which are small, but one of around 4.5GB, using 10 workers at 31GB memory each and driver with same spec. It's not running out of memory as far as I can see, but definitely only happens when I add the large broadcast. Would be most grateful for advice. I tried playing around with the last 3 conf settings below, but no luck: SparkConf().set(spark.master.memory, 26) .set(spark.executor.memory, 26) .set(spark.worker.memory, 26) .set(spark.driver.memory, 26). .set(spark.storage.memoryFraction,1) .set(spark.core.connection.ack.wait.timeout,6000) .set(spark.akka.frameSize,50) There is some invalid configs here, spark.master.memory and spark.worker.memory are not valid. spark.storage.memoryFraction is too large, then you will have not memory for general use (such as shuffle). The Python jobs run in separated processes, so you should leave some memory for them in slaves. For example, if it has 8 CPUs, each Python process will need at least 8G (for 4G broadcast plus object overhead in Python), then you can use only 3 process in the same time, use spark.cores.max=2 OR spark.task.cpus=4. Also you can only set spark.executor.memory to 10G, leave 16G for Python. Also, you could specify spark.python.worker.memory = 8G to have better shuffle performance in Python. (it's not necessary) So, for large broadcast, maybe you should use Scala, which uses multiple threads, the broadcast will be shared by multiple tasks in same executor. Thanks, Tom On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote: Hi All, I am relatively new to spark and currently having troubles with broadcasting large variables ~500mb in size. Th e broadcast fails with an error shown below and the memory usage on the hosts also blow up. Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb (workers)) and we are using Spark 1.1 (Python) via Cloudera CDH 5.2. We have managed to replicate the error using a test script shown below. I would be interested to know if anyone has seen this before with broadcasting or know of a fix. === ERROR == 14/10/24 08:20:04 INFO BlockManager: Found block rdd_11_31 locally 14/10/24 08:20:08 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fbc6caf 14/10/24 08:20:08 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon3.ldn.ebs.io,55316) 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon3.ldn.ebs.io,55316) 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon3.ldn.ebs.io,55316) 14/10/24 08:20:08 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@fbc6caf java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) 14/10/24 08:20:13 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon7.ldn.ebs.io,52524) 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon7.ldn.ebs.io,52524) 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon7.ldn.ebs.io,52524) 14/10/24 08:20:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(toppigeon.ldn.ebs.io,34370) 14/10/24 08:20:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3ecfdb7e 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(toppigeon.ldn.ebs.io,34370) 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(toppigeon.ldn.ebs.io,34370) 14/10/24 08:20:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@3ecfdb7e java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628) 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628) 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996) 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996) 14/10/24
Re: MLLib Decision Trees algorithm hangs, others fine
Could you provide more information? For example, spark version, dataset size (number of instances/number of features), cluster size, error messages from both the drive and the executor. -Xiangrui On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote: Hello all, I have some text data that I am running different algorithms on. I had no problems with LibSVM and Naive Bayes on the same data, but when I run Decision Tree, the execution hangs in the middle of DecisionTree.trainClassifier(). The only difference from the example given on the site is that I am using 6 categories instead of 2, and the input is text that is transformed to labeled points using TF-IDF. It halts shortly after this log output: spark.SparkContext: Job finished: collect at DecisionTree.scala:1347, took 1.019579676 s Any ideas as to what could be causing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tress-algorithm-hangs-others-fine-tp18515.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: scala.MatchError
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issuing = issuing; } I am trying to create javaschemaRDD JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class); Remarks: Instrument, Issue, Issuer, Issuing all are java classes distData is holding List Instrument I am getting the following error. Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: scala.MatchError: class sample.spark.test.Issue (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at sample.spark.test.SparkJob.main(SparkJob.java:33) ... 5 more Please help me. Regards, Naveen. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: scala.MatchError
Xiangrui is correct that is must be a java bean, also nested classes are not yet supported in java. On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote: I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issuing = issuing; } I am trying to create javaschemaRDD JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class); Remarks: Instrument, Issue, Issuer, Issuing all are java classes distData is holding List Instrument I am getting the following error. Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: scala.MatchError: class sample.spark.test.Issue (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at sample.spark.test.SparkJob.main(SparkJob.java:33) ... 5 more Please help me. Regards, Naveen. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Still struggling with building documentation
Nichols and Patrick, Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors: [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected Type type() { return Type.UPLOAD_BLOCK; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type [error] protected Type type() { return Type.STREAM_HANDLE; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type [error] protected Type type() { return Type.OPEN_BLOCKS; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type [error] protected Type type() { return Type.REGISTER_EXECUTOR; } [error] ^ ... [error] four errors found [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM Moving back into docs dir. Making directory api/scala cp -r ../target/scala-2.10/unidoc/. api/scala Making directory api/java cp -r ../target/javaunidoc/. api/java Moving to python/docs directory and building sphinx. Makefile:14: *** The 'sphinx-build' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the 'sphinx-build' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/. Stop. Moving back into home dir. Making directory api/python cp -r python/docs/_build/html/. docs/api/python /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - python/docs/_build/html/. (Errno::ENOENT) from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0' from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r' from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top (required)' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize' from /usr/bin/jekyll:224:in `new' from /usr/bin/jekyll:224:in `main' What next? Alex On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I believe the web docs need to be built separately according to the instructions here https://github.com/apache/spark/blob/master/docs/README.md. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the moment. Nick On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com wrote: I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive on: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. 
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta
Re: How to execute a function from class in distributed jar on each worker node?
I'm not sure that this will work but it makes sense to me. Basically you write the functionality in a static block in a class and broadcast that class. Not sure what your use case is but I need to load a native library and want to avoid running the init in mapPartitions if it's not necessary (just to make the code look cleaner) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-execute-a-function-from-class-in-distributed-jar-on-each-worker-node-tp3870p18611.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
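For what it's worth, a minimal Scala sketch of run-once-per-executor initialization. This uses a Scala object's initializer rather than broadcasting a class with a static block, but the effect is similar: the body runs the first time any task on that JVM touches the object. The library and function names are hypothetical:

object NativeInit {
  // Runs once per executor JVM, the first time the object is referenced.
  System.loadLibrary("mynative") // hypothetical native library name
  val ready = true
}

// Referencing the object inside the closure forces the load on every executor,
// without an explicit init step in mapPartitions:
// rdd.map { x => NativeInit.ready; doSomethingNative(x) } // doSomethingNative is hypothetical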
S3 table to spark sql
How can I create a date field in Spark SQL? I have a S3 table and I load it into an RDD. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int, sku: String, unidades: Double, monto: Double) val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p => trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString, p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble, p(6).trim.toDouble)) tabla.registerTempTable("trx_u3m") Now my problem is: how can I transform the string variable (fechau3m) into a date variable? Franco Barrientos Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 franco.barrien...@exalitica.com www.exalitica.com
pyspark get column family and qualifier names from hbase table
Hello there, I am wondering how to get the column family names and column qualifier names when using pyspark to read an hbase table with multiple column families. I have a hbase table as follows, hbase(main):007:0> scan 'data1' ROW COLUMN+CELL row1 column=f1:, timestamp=1411078148186, value=value1 row1 column=f2:, timestamp=1415732470877, value=value7 row2 column=f2:, timestamp=1411078160265, value=value2 when I ran the examples/hbase_inputformat.py code: conf2 = {"hbase.zookeeper.quorum": "localhost", "hbase.mapreduce.inputtable": 'data1'} hbase_rdd = sc.newAPIHadoopRDD( "org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter", valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter", conf=conf2) output = hbase_rdd.collect() for (k, v) in output: print (k, v) I only see (u'row1', u'value1') (u'row2', u'value2') What I really want is (row_id, column family:column qualifier, value) tuples. Any comments? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Status of MLLib exporting models to PMML
JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote: Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
filtering out non English tweets using TwitterUtils
Hi, Is there a way to extract only the English language tweets when using TwitterUtils.createStream()? The filters argument specifies the strings that need to be contained in the tweets, but I am not sure how this can be used to specify the language. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Status of MLLib exporting models to PMML
Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're 3-clause BSD: https://github.com/jpmml/jpmml-model So some of the scoring components are off-limits for an AL2 project but the core model components are OK. On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com wrote: JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote: Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: filtering out non English tweets using TwitterUtils
You could get all the tweets in the stream, and then apply filter transformation on the DStream of tweets to filter away non-english tweets. The tweets in the DStream is of type twitter4j.Status which has a field describing the language. You can use that in the filter. Though in practice, a lot of non-english tweets are also marked as english by Twitter. To really filter out ALL non-english tweets, you will have to probably do some machine learning stuff to identify English tweets. On Tue, Nov 11, 2014 at 11:41 AM, SK skrishna...@gmail.com wrote: Hi, Is there a way to extract only the English language tweets when using TwitterUtils.createStream()? The filters argument specifies the strings that need to be contained in the tweets, but I am not sure how this can be used to specify the language. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
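A minimal sketch of the filter described above (assuming a twitter4j version that actually exposes Status.getLang(), i.e. 3.0.6 or later; as the follow-up messages below show, the version bundled with Spark Streaming may not have it):

import org.apache.spark.streaming.twitter.TwitterUtils

// ssc is an existing StreamingContext; OAuth credentials are assumed to be
// configured via the twitter4j.oauth.* system properties.
val tweets = TwitterUtils.createStream(ssc, None)

// Keep only tweets Twitter tags as English; as noted above, this is approximate.
val englishTweets = tweets.filter(_.getLang == "en")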
Failed jobs showing as SUCCEEDED on web UI
I'm running a Python script using spark-submit on YARN in an EMR cluster, and if I have a job that fails due to ExecutorLostFailure or if I kill the job, it still shows up on the web UI with a FinalStatus of SUCCEEDED. Is this due to PySpark, or is there potentially some other issue with the job failure status not propagating to the logs?
Re: pyspark get column family and qualifier names from hbase table
checked the source, found the following, class HBaseResultToStringConverter extends Converter[Any, String] { override def convert(obj: Any): String = { val result = obj.asInstanceOf[Result] Bytes.toStringBinary(result.value()) } } I feel using 'result.value()' here is a big limitation. Converting from the 'list()' from the 'Result' is more general and easy to use. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18619.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
failed to create a table with python (single node)
I'm executing this example from the documentation (in single node mode) # sc is an existing SparkContext. from pyspark.sql import HiveContext sqlContext = HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") # Queries can be expressed in HiveQL. results = sqlContext.sql("FROM src SELECT key, value").collect() 1. Would it be possible to get the results from collect in a more human-readable format? For example, I would like to have a result similar to what I would get using the Hive CLI. 2. The first query does not seem to create the table. I tried 'show tables;' from Hive after doing it, and the table src did not show up.
Re: filtering out non English tweets using TwitterUtils
Thanks for the response. I tried the following: tweets.filter(_.getLang()="en") I get a compilation error: value getLang is not a member of twitter4j.Status But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at: http://twitter4j.org/javadoc/twitter4j/Status.html#getLang-- What version of twitter4j does Spark Streaming use? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18621.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: filtering out non English tweets using TwitterUtils
Small typo in my code in the previous post. That should be: tweets.filter(_.getLang()=="en") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18622.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
groupBy for DStream
Hi. 1) I don't see a groupBy() method for a DStream object. Not sure why that is not supported. Currently I am using filter() to separate out the different groups. I would like to know if there is a way to convert a DStream object to a regular RDD so that I can apply the RDD methods like groupBy. 2) The count() method for a DStream object returns a DStream[Long] instead of a simple Long (like RDD does). How can I extract the simple Long count value? I tried dstream(0) but got a compilation error that it does not take parameters. I also tried dstream[0], but that also resulted in a compilation error. I am not able to use the head() or take(0) method for DStream either. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-for-DStream-tp18623.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
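For reference, a sketch of the usual workarounds with Spark 1.1's DStream API (the stream name and element type are made up): transform exposes each batch as an RDD, so RDD-only methods such as groupBy can be applied per batch, and foreachRDD gives access to the per-batch count as a plain Long.

// dstream is an existing DStream[(String, Int)]
val groupedPerBatch = dstream.transform(rdd => rdd.groupBy(_._1))

dstream.foreachRDD { rdd =>
  val n: Long = rdd.count() // a plain Long for this batch
  println(s"records in this batch: $n")
}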
Re: closure serialization behavior driving me crazy
I tried turning on the extended debug info. The Scala output is a little opaque (lots of - field (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC), but it seems like, as expected, somehow the full array of OLSMultipleLinearRegression objects is getting pulled in. I'm not sure I understand your comment about Array.ofDim being large. When serializing the array alone, it only takes up about 80K, which is close to 1867*5*sizeof(double). The 400MB comes when referencing the array from a function, which pulls in all the extra data. Copying the global variable into a local one seems to work. Much appreciated, Matei. On Mon, Nov 10, 2014 at 9:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Sandy, Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to print the contents of the objects. In addition, something else that helps is to do the following: { val _arr = arr models.map(... _arr ...) } Basically, copy the global variable into a local one. Then the field access from outside (from the interpreter-generated object that contains the line initializing arr) is no longer required, and the closure no longer has a reference to that. I'm really confused as to why Array.ofDim would be so large by the way, but are you sure you haven't flipped around the dimensions (e.g. it should be 5 x 1800)? A 5-double array will consume more than 5*8 bytes (probably something like 60 at least), and an array of those will still have a pointer to each one, so I'd expect that many of them to be more than 80 MB (which is very close to 1867*5*8). Matei On Nov 10, 2014, at 1:01 AM, Sandy Ryza sandy.r...@cloudera.com wrote: I'm experiencing some strange behavior with closure serialization that is totally mind-boggling to me. It appears that two arrays of equal size take up vastly different amount of space inside closures if they're generated in different ways. The basic flow of my app is to run a bunch of tiny regressions using Commons Math's OLSMultipleLinearRegression and then reference a 2D array of the results from a transformation. I was running into OOME's and NotSerializableExceptions and tried to get closer to the root issue by calling the closure serializer directly. scala val arr = models.map(_.estimateRegressionParameters()).toArray The result array is 1867 x 5. It serialized is 80k bytes, which seems about right: scala SparkEnv.get.closureSerializer.newInstance().serialize(arr) res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 cap=80027] If I reference it from a simple function: scala def func(x: Long) = arr.length scala SparkEnv.get.closureSerializer.newInstance().serialize(func) I get a NotSerializableException. If I take pains to create the array using a loop: scala val arr = Array.ofDim[Double](1867, 5) scala for (s - 0 until models.length) { | factorWeights(s) = models(s).estimateRegressionParameters() | } Serialization works, but the serialized closure for the function is a whopping 400MB. If I pass in an array of the same length that was created in a different way, the size of the serialized closure is only about 90K, which seems about right. Naively, it seems like somehow the history of how the array was created is having an effect on what happens to it inside a closure. Is this expected behavior? Can anybody explain what's going on? any insight very appreciated, Sandy
Re: Still struggling with building documentation
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta alexbare...@gmail.com wrote: Nichols and Patrick, Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors: [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected Type type() { return Type.UPLOAD_BLOCK; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type [error] protected Type type() { return Type.STREAM_HANDLE; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type [error] protected Type type() { return Type.OPEN_BLOCKS; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type [error] protected Type type() { return Type.REGISTER_EXECUTOR; } [error] ^ ... [error] four errors found [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM Moving back into docs dir. Making directory api/scala cp -r ../target/scala-2.10/unidoc/. api/scala Making directory api/java cp -r ../target/javaunidoc/. api/java Moving to python/docs directory and building sphinx. Makefile:14: *** The 'sphinx-build' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the 'sphinx-build' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/. Stop. Moving back into home dir. Making directory api/python cp -r python/docs/_build/html/. docs/api/python /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - python/docs/_build/html/. (Errno::ENOENT) from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0' from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r' from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top (required)' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize' from /usr/bin/jekyll:224:in `new' from /usr/bin/jekyll:224:in `main' What next? Alex On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I believe the web docs need to be built separately according to the instructions here. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the moment. 
Nick On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com wrote: I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive on: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark and Play
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote: Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm getting into issues of akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine with Spark 1.1.0. Is there something I should do with libraryDependencies in my build.sbt to make it work? Thanks, Akshat - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Status of MLLib exporting models to PMML
I also worry about that the author of JPMML changed the license of jpmml-evaluator due to his interest of his commercial business, and he might change the license of jpmml-model in the future. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Nov 11, 2014 at 11:43 AM, Sean Owen so...@cloudera.com wrote: Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're 3-clause BSD: https://github.com/jpmml/jpmml-model So some of the scoring components are off-limits for an AL2 project but the core model components are OK. On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com wrote: JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote: Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: which is the recommended workflow engine for Apache Spark jobs?
We've had success with Azkaban from LinkedIn over Oozie and Luigi. http://azkaban.github.io/ Azkaban has support for many different job types, a fleshed out web UI with decent log reporting, a decent failure / retry model, a REST api, and I think support for multiple executor slaves is coming in the future. We found Oozie's configuration and execution cumbersome, and Luigi immature. On Tue, Nov 11, 2014 at 2:50 AM, Adamantios Corais adamantios.cor...@gmail.com wrote: Hi again, As Jimmy said, any thoughts about Luigi and/or any other tools? So far it seems that Oozie is the best and only choice here. Is that right? On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com wrote: I have used Oozie for all our workflows with Spark apps but you will have to use a java event as the workflow element. I am interested in anyones experience with Luigi and/or any other tools. On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais adamantios.cor...@gmail.com wrote: I have some previous experience with Apache Oozie while I was developing in Apache Pig. Now, I am working explicitly with Apache Spark and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use \ recommend? -- Nothing under the sun is greater than education. By educating one person and sending him/her into the society of his/her generation, we make a contribution extending a hundred generations to come. -Jigoro Kano, Founder of Judo-
concat two Dstreams
Hi, Is it possible to concatenate or append two Dstreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined Dstream. Thanks, Josh
Re: concat two Dstreams
I think it's just called union On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote: Hi, Is it possible to concatenate or append two Dstreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined Dstream. Thanks, Josh
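A minimal sketch (stream names are made up; both DStreams must have the same element type):

// incoming and generated are two existing DStream[String]s
val combined = incoming.union(generated)

// Or, via the StreamingContext, which also handles more than two streams:
val combinedAll = ssc.union(Seq(incoming, generated))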
Re: S3 table to spark sql
Simple Scala: val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m) should work. Replace "mmdd" with the format fechau3m is in. If you want to do it at case class level: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) // HiveContext always a good idea import sqlContext.createSchemaRDD case class trx_u3m(id: String, local: String, fechau3m: java.util.Date, rubro: Int, sku: String, unidades: Double, monto: Double) val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p => trx_u3m(p(0).trim.toString, p(1).trim.toString, new java.text.SimpleDateFormat("mmdd").parse(p(2).trim.toString), p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble, p(6).trim.toDouble)) tabla.registerTempTable("trx_u3m") On Tue, Nov 11, 2014 at 11:11 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: How can I create a date field in Spark SQL? I have a S3 table and I load it into an RDD. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int, sku: String, unidades: Double, monto: Double) val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p => trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString, p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble, p(6).trim.toDouble)) tabla.registerTempTable("trx_u3m") Now my problem is: how can I transform the string variable (fechau3m) into a date variable? *Franco Barrientos* Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 franco.barrien...@exalitica.com www.exalitica.com
RE: Spark and Play
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x. Here is a sample build.sbt file: name := "xyz" version := "0.1" scalaVersion := "2.10.4" libraryDependencies ++= Seq( jdbc, anorm, cache, "org.apache.spark" %% "spark-core" % "1.1.0", "com.typesafe.akka" %% "akka-actor" % "2.2.3", "com.typesafe.akka" %% "akka-slf4j" % "2.2.3", "org.apache.spark" %% "spark-sql" % "1.1.0" ) play.Project.playScalaSettings Mohammed -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Tuesday, November 11, 2014 2:06 PM To: Akshat Aranya Cc: user@spark.apache.org Subject: Re: Spark and Play Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote: Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm getting into issues of akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine with Spark 1.1.0. Is there something I should do with libraryDependencies in my build.sbt to make it work? Thanks, Akshat - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Best practice for multi-user web controller in front of Spark
David, Here is what I would suggest: 1 - Does a new SparkContext get created in the web tier for each new request for processing? Create a single SparkContext that gets shared across multiple web requests. Depending on the framework that you are using for the web-tier, it should not be difficult to create a global singleton object that holds the SparkContext. 2 - If so, how much time should we expect it to take for setting up the context? Our goal is to return a response to the users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell one might conclude that it might take more than 10 seconds to create a context, however it's not clear how much of that is context-creation vs other things. 3 - (This last question perhaps deserves a post in and of itself:) if every job is always comparing some little data structure to the same HDFS corpus of data, what is the best pattern to use to cache the RDD's from HDFS so they don't have to always be re-constituted from disk? I.e. how can RDD's be shared from the context of one job to the context of subsequent jobs? Or does something like memcache have to be used? Create a cached RDD in a global singleton object, which gets accessed by multiple web requests. You could put the cached RDD in the same object that holds the SparkContext, if you would like. I need to know more details about the specifics of your application to be more specific, but hopefully you get the idea. Mohammed From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Tuesday, November 11, 2014 8:54 AM To: Sonal Goyal Cc: bethesda; u...@spark.incubator.apache.org Subject: Re: Best practice for multi-user web controller in front of Spark For sharing RDDs across multiple jobs - you could also have a look at Tachyon. It provides an HDFS compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/. - On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.commailto:sonalgoy...@gmail.com wrote: I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologieshttp://www.nubetech.co On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.commailto:swearinge...@mac.com wrote: We are relatively new to spark and so far have been manually submitting single jobs at a time for ML training, during our development process, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our hdfs corpus, which only changes incrementally on a daily basis. (that detail is relevant to question 3 below) Now we are ready to start building out the front-end, which will allow a team of data scientists to submit their problems to the system via a web front-end (web tier will be java). Users could of course be submitting jobs more or less simultaneously. We want to make sure we understand how to best structure this. Questions: 1 - Does a new SparkContext get created in the web tier for each new request for processing? 2 - If so, how much time should we expect it to take for setting up the context? 
Our goal is to return a response to the users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell one might conclude that it might take more than 10 seconds to create a context, however it's not clear how much of that is context-creation vs other things. 3 - (This last question perhaps deserves a post in and of itself:) if every job is always comparing some little data structure to the same HDFS corpus of data, what is the best pattern to use to cache the RDD's from HDFS so they don't have to always be re-constituted from disk? I.e. how can RDD's be shared from the context of one job to the context of subsequent jobs? Or does something like memcache have to be used? Thanks! David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.orgmailto:user-h...@spark.apache.org
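A minimal sketch of the singleton described above (object and path names are made up; the web-framework wiring is omitted):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkHolder {
  // One SparkContext shared by every web request in this JVM.
  lazy val sc: SparkContext = new SparkContext(new SparkConf().setAppName("web-controller"))

  // The HDFS corpus, loaded and cached once, then reused by subsequent jobs.
  lazy val corpus: RDD[String] = sc.textFile("hdfs:///path/to/corpus").cache() // hypothetical path
}

// Inside a request handler, something like:
// val hits = SparkHolder.corpus.filter(_.contains(userData)).count()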
Re: pyspark get column family and qualifier names from hbase table
Just wrote a custom converter in Scala to replace HBaseResultToStringConverter. Just a couple of lines of code. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18639.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
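For reference, a sketch of what such a converter might look like (assuming HBase's Cell/CellUtil API and Spark's org.apache.spark.api.python.Converter trait; the class name is made up). It emits every cell as family:qualifier=value instead of only result.value():

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

class HBaseResultToStringsConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map { cell =>
      Bytes.toStringBinary(CellUtil.cloneFamily(cell)) + ":" +
        Bytes.toStringBinary(CellUtil.cloneQualifier(cell)) + "=" +
        Bytes.toStringBinary(CellUtil.cloneValue(cell))
    }.mkString(";")
  }
}

// Then pass "your.package.HBaseResultToStringsConverter" as the valueConverter
// argument of sc.newAPIHadoopRDD in the pyspark call.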
Re: Native / C/C++ code integration
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data. One would probably need to rewrite much of RDD, CacheManager, etc. in native code; an RDD subclass (e.g. PythonRDD) probably wouldn't work. So exposing raw blocks to native code looks intractable. I wonder how fast Java/Kryo SerDe of byte arrays is. E.g. suppose we have an RDD[T] where T is immutable and most of the memory for a single T is a byte array. What is the overhead of SerDe-ing T? (Does Java/Kryo copy the underlying memory?) If the overhead is small, then native access to raw blocks wouldn't really yield any advantage. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347p18640.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Converting Apache log string into map using delimiter
I have an RDD of logs that look like this: /no_cache/bi_event?Log=0pg_inst=517638988975678942pg=fow_mwever=c.2.1.8site=xyz.compid=156431807121222351rid=156431666543211500srch_id=156431666581865115row=6seq=1tot=1tsp=1cmp=thmb_12co_txt_url=Viewinget=clickthmb_type=pct=uc=579855lnx=SPGOOGBRANDCAMPref_url=http%3A%2F%2Fwww.abcd.com The pairs are separated by '&', and the keys/values of each pair are separated by '='. Hive has a str_to_map function https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions that will convert this String to a map that will make the following work: mappedString["site"] will return xyz.com What's the most efficient way to do this in Scala + Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Converting-Apache-log-string-into-map-using-delimiter-tp18641.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
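For what it's worth, a minimal sketch of doing the split directly in Scala (this assumes the pairs are '&'-separated, as in a normal query string; names are made up):

// Parse one log line into a Map[String, String].
def toParamMap(line: String): Map[String, String] = {
  val query = line.substring(line.indexOf('?') + 1)
  query.split("&").flatMap { pair =>
    pair.split("=", 2) match {
      case Array(k, v) => Some(k -> v)
      case _           => None // skip malformed pairs
    }
  }.toMap
}

// On an RDD of log lines:
// val paramMaps = logs.map(toParamMap)
// paramMaps.first()("site") // "xyz.com"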
Re: Converting Apache log string into map using delimiter
OK I got it working with:

z.map(row => (row.map(element => element.split("=")(0)) zip row.map(element => element.split("=")(1))).toMap)

But I'm guessing there is a more efficient way than to create two separate lists and then zip them together and then convert the result into a map. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Converting-Apache-log-string-into-map-using-delimiter-tp18641p18643.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2
I am creating a workflow; I have an existing call to updateStateByKey that works fine, but when I created a second use where the key is a Tuple2, it's now failing with the dreaded "overloaded method value updateStateByKey with alternatives ... cannot be applied to ...". Comparing the two uses I'm not seeing anything that seems broken, though I do note that all the messages below describe what the code provides as Time as org.apache.spark.streaming.Time. a) Could the Time v org.apache.spark.streaming.Time difference be causing this? (I'm using Time the same in the first use, which appears to work properly.) b) Any suggestions of what else could be causing the error?

--code--

val ssc = new StreamingContext(conf, Seconds(timeSliceArg))
ssc.checkpoint(".")
var lines = ssc.textFileStream(dirArg)
var linesArray = lines.map( line => (line.split("\t")))
var DnsSvr = linesArray.map( lineArray => ( (lineArray(4), lineArray(5)), (1 , Time((lineArray(0).toDouble*1000).toLong) )) )
val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
  val currentCount = if (values.isEmpty) 0 else values.map( x => x._1).sum
  val newMinTime = if (values.isEmpty) Time(Long.MaxValue) else values.map( x => x._2).min
  val (previousCount, minTime) = state.getOrElse((0, Time(System.currentTimeMillis)))
  (currentCount + previousCount, Seq(minTime, newMinTime).min)
}
var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount) // <=== error here

--compilation output--

[error] /Users/spr/Documents/.../cyberSCinet.scala:513: overloaded method value updateStateByKey with alternatives:
[error] (updateFunc: Iterator[((String, String), Seq[(Int, org.apache.spark.streaming.Time)], Option[(Int, org.apache.spark.streaming.Time)])] => Iterator[((String, String), (Int, org.apache.spark.streaming.Time))],partitioner: org.apache.spark.Partitioner,rememberPartitioner: Boolean)(implicit evidence$5: scala.reflect.ClassTag[(Int, org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String, String), (Int, org.apache.spark.streaming.Time))] and
[error] (updateFunc: (Seq[(Int, org.apache.spark.streaming.Time)], Option[(Int, org.apache.spark.streaming.Time)]) => Option[(Int, org.apache.spark.streaming.Time)],partitioner: org.apache.spark.Partitioner)(implicit evidence$4: scala.reflect.ClassTag[(Int, org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String, String), (Int, org.apache.spark.streaming.Time))] and
[error] (updateFunc: (Seq[(Int, org.apache.spark.streaming.Time)], Option[(Int, org.apache.spark.streaming.Time)]) => Option[(Int, org.apache.spark.streaming.Time)],numPartitions: Int)(implicit evidence$3: scala.reflect.ClassTag[(Int, org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String, String), (Int, org.apache.spark.streaming.Time))] and
[error] (updateFunc: (Seq[(Int, org.apache.spark.streaming.Time)], Option[(Int, org.apache.spark.streaming.Time)]) => Option[(Int, org.apache.spark.streaming.Time)])(implicit evidence$2: scala.reflect.ClassTag[(Int, org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String, String), (Int, org.apache.spark.streaming.Time))]
[error] cannot be applied to ((Seq[(Int, org.apache.spark.streaming.Time)], Option[(Int, org.apache.spark.streaming.Time)]) => (Int, org.apache.spark.streaming.Time))
[error] var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/overloaded-method-value-updateStateByKey-cannot-be-applied-to-when-Key-is-a-Tuple2-tp18644.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
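Reading the alternatives in that output, every overload expects the update function to return Option[(Int, Time)], while updateDnsCount above returns a bare (Int, Time). A sketch of the shape the compiler is asking for (not the original poster's code):

import org.apache.spark.streaming.Time

// The update function must produce Option[S]; returning None drops the key's state.
val updateDnsCount: (Seq[(Int, Time)], Option[(Int, Time)]) => Option[(Int, Time)] =
  (values, state) => {
    val currentCount = if (values.isEmpty) 0 else values.map(_._1).sum
    val newMinTime = if (values.isEmpty) Time(Long.MaxValue) else values.map(_._2).min
    val (previousCount, minTime) = state.getOrElse((0, Time(System.currentTimeMillis)))
    Some((currentCount + previousCount, Seq(minTime, newMinTime).min))
  }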
SVMWithSGD default threshold
I'm hoping to get a linear classifier on a dataset. I'm using SVMWithSGD to train the data. After running with the default options: val model = SVMWithSGD.train(training, numIterations), I don't think SVM has done the classification correctly. My observations: 1. the intercept is always 0.0 2. the predicted labels are ALL 1's, no 0's. My questions are: 1. what should the numIterations be? I tried to set it to 10*trainingSetSize, is that sufficient? 2. since MLlib only accepts data with labels 0 or 1, shouldn't the default threshold for SVMWithSGD be 0.5 instead of 0.0? 3. It seems counter-intuitive to me to have the default intercept be 0.0, meaning the line has to go through the origin. 4. Does Spark MLlib provide an API to do grid search like scikit-learn does? Any help would be greatly appreciated! - Thanks! -Caron -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: how to create a Graph in GraphX?
You should be able to construct the edges in a single map() call without using collect():

val edges: RDD[Edge[String]] = sc.textFile(...).map { line =>
  val row = line.split(",")
  Edge(row(0).toLong, row(1).toLong, row(2))
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)

(Note that the source and destination vertex IDs have to be Longs, hence the toLong calls.)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-create-a-Graph-in-GraphX-tp18635p18646.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Does spark can't work with HBase?
Hello all, I have tested reading an HBase table with Spark 1.1 using SparkContext.newAPIHadoopRDD. I found the performance is much slower than reading from Hive. I also tried reading data using HFileScanner on one region's HFile, but the performance is not good either. So, how do I improve the performance of Spark reading from HBase? Or does this show that Spark does not work well with HBase? Thanks for your help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-can-t-work-with-HBase-tp18647.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
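For reference, the read path being described is roughly the following (a sketch; the table name is made up, and the scanner-caching setting is just one common knob for cutting down RPC round trips per scan):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name
hbaseConf.set("hbase.client.scanner.caching", "1000")   // fetch more rows per RPC

val hbaseRdd = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hbaseRdd.count())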
Imbalanced shuffle read
I'm running a job that uses groupByKey(), so it generates a lot of shuffle data. Then it processes this and writes files to HDFS in a forEachPartition block. Looking at the forEachPartition stage details in the web console, all but one executor are idle (SUCCESS in 50-60ms), and one is RUNNING with a huge shuffle read and takes a long time to finish. Can someone explain why the read is all on one node and how to parallelize this better? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Imbalanced-shuffle-read-tp18648.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
ISpark class not found
I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark) Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFound exception. This does work correctly in spark-shell. I was curious if anyone has used ISpark and has encountered this issue. Thanks! Simple example: In [1]: case class Circle(rad:Float) In [2]: val rdd = sc.parallelize(1 to 1).map(i => Circle(i.toFloat)).take(10) 14/11/11 13:03:35 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: Unable to find class: [L$line5.$read$$iwC$$iwC$Circle; Full trace in my gist: https://gist.github.com/benjaminlaird/3e543a9a89fb499a3a14 The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.
Re: pyspark get column family and qualifier names from hbase table
Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to it. How did you write HBaseResultToStringConverter to do what you wanted it to do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Help with processing multiple RDDs
Hi, How is 78g distributed in driver, daemon, executor ? Can you please paste the logs regarding that I don't have enough memory to hold the data in memory Are you collecting any data in driver ? Lastly, did you try doing a re-partition to create smaller and evenly distributed partitions? Regards, Kapil -Original Message- From: akhandeshi [mailto:ami.khande...@gmail.com] Sent: 12 November 2014 03:44 To: u...@spark.incubator.apache.org Subject: Help with processing multiple RDDs I have been struggling to process a set of RDDs. Conceptually, it is is not a large data set. It seems, no matter how much I provide to JVM or partition, I can't seem to process this data. I am caching the RDD. I have tried persit(disk and memory), perist(memory) and persist(off_heap) with no success. Currently I am giving 78g to my driver, daemon and executor memory. Currently, it seems to have trouble with one of the largest partition, rdd_22_29 which is 25.9 GB. The metrics page shows Summary Metrics for 29 Completed Tasks. However, I don't see few partitions on the list below. However, i do seem to have warnings in the log file, indicating that I don't have enough memory to hold the data in memory. I don't understand, what I am doing wrong or how I can troubleshoot. Any pointers will be appreciated... 14/11/11 21:28:45 WARN CacheManager: Not enough space to cache partition rdd_22_20 in memory! Free memory is 17190150496 bytes. 14/11/11 21:29:27 WARN CacheManager: Not enough space to cache partition rdd_22_13 in memory! Free memory is 17190150496 bytes. Block Name Storage Level Size in Memory Size on DiskExecutors rdd_22_0Memory Deserialized 1x Replicated 2.1 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_10 Memory Deserialized 1x Replicated 7.0 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_11 Memory Deserialized 1x Replicated 1290.2 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_12 Memory Deserialized 1x Replicated 1167.7 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_14 Memory Deserialized 1x Replicated 3.8 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_15 Memory Deserialized 1x Replicated 4.0 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_16 Memory Deserialized 1x Replicated 2.4 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_17 Memory Deserialized 1x Replicated 37.6 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_18 Memory Deserialized 1x Replicated 120.9 MB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_19 Memory Deserialized 1x Replicated 755.9 KB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_2Memory Deserialized 1x Replicated 289.5 KB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_21 Memory Deserialized 1x Replicated 11.9 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_22 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_23 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_24 Memory Deserialized 1x Replicated 3.0 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_25 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_26 Memory Deserialized 1x Replicated 4.0 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_27 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_28 Memory Deserialized 1x Replicated 1846.1 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_29 Memory Deserialized 1x Replicated 25.9 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_3Memory 
Deserialized 1x Replicated 267.1 KB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_4Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_5Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_6Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_7Memory Deserialized 1x Replicated 14.8 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_8Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_9Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
Re: Strange behavior of spark-shell while accessing hdfs
Hi, On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy hmx...@gmail.com wrote: If I run bin/spark-shell without connecting a master, it can access a hdfs file on a remote cluster with kerberos authentication. [...] However, if I start the master and slave on the same host and using bin/spark-shell --master spark://*.*.*.*:7077 run the same commands [... ] org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: *.*.*.*.com/98.138.236.95; destination host is: *.*.*.*:8020; When you give no master, it is local[*], so Spark will (implicitly?) authenticate to HDFS from your local machine using local environment variables, key files etc., I guess. When you give a spark://* master, Spark will run on a different machine, where you have not yet authenticated to HDFS, I think. I don't know how to solve this, though, maybe some Kerberos token must be passed on to the Spark cluster? Tobias
Re: Status of MLLib exporting models to PMML
Yes although I think this difference is on purpose as part of that commercial strategy. If future versions change license it would still be possible to not upgrade. Or fork / recreate the bean classes. Not worried so much but it is a good point. On Nov 11, 2014 10:06 PM, DB Tsai dbt...@dbtsai.com wrote: I also worry about that the author of JPMML changed the license of jpmml-evaluator due to his interest of his commercial business, and he might change the license of jpmml-model in the future. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Nov 11, 2014 at 11:43 AM, Sean Owen so...@cloudera.com wrote: Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're 3-clause BSD: https://github.com/jpmml/jpmml-model So some of the scoring components are off-limits for an AL2 project but the core model components are OK. On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com wrote: JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote: Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Strange behavior of spark-shell while accessing hdfs
You need to set the spark configuration property: spark.yarn.access.namenodes to your namenode. e.g. spark.yarn.access.namenodes=hdfs://mynamenode:8020 Similarly, I'm curious if you're also running high availability HDFS with an HA nameservice. I currently have HA HDFS and kerberos and I've noticed that I must set the above property to the currently active namenode's hostname and port. Simply using the HA nameservice to get delegation tokens does NOT seem to work with Spark 1.1.0 (even though I can confirm the token is acquired). I believe this may be a bug. Unfortunately simply adding both the active and standby name nodes does not work as this actually causes an error. This means that when my active name node fails over, my spark configuration becomes invalid. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18656.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Strange behavior of spark-shell while accessing hdfs
Only YARN mode is supported with kerberos. You can't use a spark:// master with kerberos. Tobias Pfeiffer wrote When you give a spark://* master, Spark will run on a different machine, where you have not yet authenticated to HDFS, I think. I don't know how to solve this, though, maybe some Kerberos token must be passed on to the Spark cluster? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18658.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Help with processing multiple RDDs
I think you can try setting a lower spark.storage.memoryFraction, for example 0.4: conf.set("spark.storage.memoryFraction", "0.4") // default is 0.6 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628p18659.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
MLLIB usage: BLAS dependency warning
Hi, I am having trouble using the BLAS libs with the MLLib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a single machine) and running the Spark-shell with the kmeans example code (from https://spark.apache.org/docs/latest/mllib-clustering.html) which runs successfully but I get the following warning in the log: WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS I compiled spark 1.1.0 with mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Pnetlib-lgpl -DskipTests clean package If anyone could please clarify the steps to get the dependencies correctly installed and visible to spark (from https://spark.apache.org/docs/latest/mllib-guide.html), that would be greatly appreciated. Using yum, I installed blas.x86_64, lapack.x86_64, gcc-gfortran.x86_64, libgfortran.x86_64 and then downloaded Breeze and built that successfully with Maven. I verified that I do have /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 present on the machine and ldconf -p shows these listed. I also tried adding /usr/lib/ to spark.executor.extraLibraryPath and I verified it is present in the Spark webUI environment tab. I downloaded and compiled jblas with mvn clean install, which creates jblas-1.2.4-SNAPSHOT.jar, and then also tried adding that to spark.executor.extraClassPath but I still get the same WARN message. Maybe there are a few simple steps that I am missing? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-usage-BLAS-dependency-warning-tp18660.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
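One step that is commonly suggested alongside the -Pnetlib-lgpl build is to put the netlib-java native loader on the application classpath as well. A sketch for an sbt-based application; the coordinates below are the ones usually quoted for netlib-java and should be double-checked against the Spark version in use:

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided",
  // JNI loader plus reference/system BLAS backends for netlib-java.
  ("com.github.fommil.netlib" % "all" % "1.1.2").pomOnly()
)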
RE: Help with processing multiple RDDs
I am running as Local in client mode. I have allocated as high as 85g to the driver, executor, and daemon. When I look at java processes. I see two. I see 20974 SparkSubmitDriverBootstrapper 21650 Jps 21075 SparkSubmit I have tried repartition before, but my understanding is that comes with an overhead. In my previous attempt, I didn't achieve much success. I am not clear, how to best get even partitions, any thoughts?? I am caching the RDD, and performing count on the keys. I am running it again, with repartitioning on the dataset. Let us see if that helps! I will send you the logs as soon as this completes! Thank you, I sincerely appreciate your help! Regards, Ami -Original Message- From: Kapil Malik [mailto:kma...@adobe.com] Sent: Tuesday, November 11, 2014 9:05 PM To: akhandeshi; u...@spark.incubator.apache.org Subject: RE: Help with processing multiple RDDs Hi, How is 78g distributed in driver, daemon, executor ? Can you please paste the logs regarding that I don't have enough memory to hold the data in memory Are you collecting any data in driver ? Lastly, did you try doing a re-partition to create smaller and evenly distributed partitions? Regards, Kapil -Original Message- From: akhandeshi [mailto:ami.khande...@gmail.com] Sent: 12 November 2014 03:44 To: u...@spark.incubator.apache.org Subject: Help with processing multiple RDDs I have been struggling to process a set of RDDs. Conceptually, it is is not a large data set. It seems, no matter how much I provide to JVM or partition, I can't seem to process this data. I am caching the RDD. I have tried persit(disk and memory), perist(memory) and persist(off_heap) with no success. Currently I am giving 78g to my driver, daemon and executor memory. Currently, it seems to have trouble with one of the largest partition, rdd_22_29 which is 25.9 GB. The metrics page shows Summary Metrics for 29 Completed Tasks. However, I don't see few partitions on the list below. However, i do seem to have warnings in the log file, indicating that I don't have enough memory to hold the data in memory. I don't understand, what I am doing wrong or how I can troubleshoot. Any pointers will be appreciated... 14/11/11 21:28:45 WARN CacheManager: Not enough space to cache partition rdd_22_20 in memory! Free memory is 17190150496 bytes. 14/11/11 21:29:27 WARN CacheManager: Not enough space to cache partition rdd_22_13 in memory! Free memory is 17190150496 bytes. 
Block Name Storage Level Size in Memory Size on DiskExecutors rdd_22_0Memory Deserialized 1x Replicated 2.1 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_10 Memory Deserialized 1x Replicated 7.0 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_11 Memory Deserialized 1x Replicated 1290.2 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_12 Memory Deserialized 1x Replicated 1167.7 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_14 Memory Deserialized 1x Replicated 3.8 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_15 Memory Deserialized 1x Replicated 4.0 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_16 Memory Deserialized 1x Replicated 2.4 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_17 Memory Deserialized 1x Replicated 37.6 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_18 Memory Deserialized 1x Replicated 120.9 MB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_19 Memory Deserialized 1x Replicated 755.9 KB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_2Memory Deserialized 1x Replicated 289.5 KB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_21 Memory Deserialized 1x Replicated 11.9 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_22 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_23 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_24 Memory Deserialized 1x Replicated 3.0 MB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_25 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_26 Memory Deserialized 1x Replicated 4.0 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_27 Memory Deserialized 1x Replicated 24.0 B 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_28 Memory Deserialized 1x Replicated 1846.1 KB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_29 Memory Deserialized 1x Replicated 25.9 GB 0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_3Memory Deserialized 1x Replicated 267.1 KB0.0 B mddworker.c.fi-mdd-poc.internal:54974 rdd_22_4Memory Deserialized 1x Replicated 24.0 B 0.0 B
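On the question above about getting more even partitions, the two usual knobs look like this (a sketch; the RDD names and the partition count of 200 are placeholders):

import org.apache.spark.HashPartitioner

// Redistribute any RDD across more, roughly equal partitions (this triggers a shuffle).
val evened = bigRdd.repartition(200)

// For key-value RDDs, partition explicitly by key so downstream per-key work
// is spread across the cluster.
val byKey = pairRdd.partitionBy(new HashPartitioner(200)).persist()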
Pyspark Error when broadcast numpy array
In spark-1.0.2, I have come across an error when I try to broadcast a quite large numpy array(with 35M dimension). The error information except the java.lang.NegativeArraySizeException error and details is listed below. Moreover, when broadcast a relatively smaller numpy array(30M dimension), everything works fine. And 30M dimension numpy array takes 230M memory which, in my opinion, not very large. As far as I have surveyed, it seems related with py4j. However, I have no idea how to fix this. I would be appreciated if I can get some hint. py4j.protocol.Py4JError: An error occurred while calling o23.broadcast. Trace: java.lang.NegativeArraySizeException at py4j.Base64.decode(Base64.java:292) at py4j.Protocol.getBytes(Protocol.java:167) at py4j.Protocol.getObject(Protocol.java:276) at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81) at py4j.commands.CallCommand.execute(CallCommand.java:77) at py4j.GatewayConnection.run(GatewayConnection.java:207) - And the test code is a follows: conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051') conf.set('spark.executor.memory', '4000m') conf.set('spark.akka.timeout', '10') conf.set('spark.ui.port','8081') conf.set('spark.cores.max','150') #conf.set('spark.rdd.compress', 'True') conf.set('spark.default.parallelism', '300') #configure the spark environment sc = SparkContext(conf=conf, batchSize=1) vec = np.random.rand(3500) a = sc.broadcast(vec) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Best practice for multi-user web controller in front of Spark
Hi, also there is Spindle https://github.com/adobe-research/spindle which was introduced on this list some time ago. I haven't looked into it deeply, but you might gain some valuable insights from their architecture, they are also using Spark to fulfill requests coming from the web. Tobias
External table partitioned by date using Spark SQL
I have large json files stored in S3 grouped under a sub-key for each year like this: I've defined an external table that's partitioned by year to keep the year limited queries efficient. The table definition looks like this: But alas, a simple query like: yields no results. If I remove the PARTITIONED BY clause and point directly at s3://spark-data/logs/1980, everything works fine. Am I running into a bug or are range partitions not supported on external tables? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/External-table-partitioned-by-date-using-Spark-SQL-tp18663.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
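The table definition and query did not survive the archive, but the setup described is roughly the following (a purely hypothetical reconstruction; the schema, format and paths are invented). One thing worth checking: partitions of an external table are not discovered automatically, so each year normally has to be registered before queries against the partitioned table return anything:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Hypothetical schema and location -- not the poster's actual definition.
hiveCtx.sql("""
  CREATE EXTERNAL TABLE logs (line STRING)
  PARTITIONED BY (year INT)
  LOCATION 's3://spark-data/logs/'
""")

// Register each partition explicitly so the metastore knows where it lives.
hiveCtx.sql("ALTER TABLE logs ADD PARTITION (year=1980) LOCATION 's3://spark-data/logs/1980/'")

hiveCtx.sql("SELECT count(*) FROM logs WHERE year = 1980").collect().foreach(println)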
Re: filtering out non English tweets using TwitterUtils
Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at: http://twitter4j.org/javadoc/twitter4j/Status.html#getLang-- What version of twitter4j does Spark Streaming use? 3.0.3 https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53 Tobias
Re: concat two Dstreams
Concatenate? no. It doesn't make sense in this context to think about one potentially infinite stream coming after another one. Do you just want the union of batches from two streams? yes, just union(). You can union() with non-streaming RDDs too. On Tue, Nov 11, 2014 at 10:41 PM, Josh J joshjd...@gmail.com wrote: Hi, Is it possible to concatenate or append two Dstreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined Dstream. Thanks, Josh
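Concretely (stream and RDD names are placeholders):

// Union of two DStreams with the same element type:
val combined = streamA.union(streamB)

// Mixing in a static (non-streaming) RDD, batch by batch:
val withStatic = streamA.transform(rdd => rdd.union(staticRdd))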
Re: Converting Apache log string into map using delimiter
I think it would be faster/more compact as:

z.map(_.map { element =>
  val tokens = element.split("=")
  (tokens(0), tokens(1))
}.toMap)

(That's probably 95% right but I didn't compile or test it.) On Wed, Nov 12, 2014 at 12:18 AM, YaoPau jonrgr...@gmail.com wrote: OK I got it working with: z.map(row => (row.map(element => element.split("=")(0)) zip row.map(element => element.split("=")(1))).toMap) But I'm guessing there is a more efficient way than to create two separate lists and then zip them together and then convert the result into a map. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Converting-Apache-log-string-into-map-using-delimiter-tp18641p18643.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Still struggling with building documentation
(I don't think that's the same issue. This looks like some local problem with tool installation?) On Tue, Nov 11, 2014 at 9:56 PM, Patrick Wendell pwend...@gmail.com wrote: The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta alexbare...@gmail.com wrote: Nichols and Patrick, Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors: [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected Type type() { return Type.UPLOAD_BLOCK; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type [error] protected Type type() { return Type.STREAM_HANDLE; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type [error] protected Type type() { return Type.OPEN_BLOCKS; } [error] ^ [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type [error] protected Type type() { return Type.REGISTER_EXECUTOR; } [error] ^ ... [error] four errors found [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM Moving back into docs dir. Making directory api/scala cp -r ../target/scala-2.10/unidoc/. api/scala Making directory api/java cp -r ../target/javaunidoc/. api/java Moving to python/docs directory and building sphinx. Makefile:14: *** The 'sphinx-build' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the 'sphinx-build' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/. Stop. Moving back into home dir. Making directory api/python cp -r python/docs/_build/html/. docs/api/python /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - python/docs/_build/html/. (Errno::ENOENT) from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0' from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest' from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r' from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top (required)' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup' from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize' from /usr/bin/jekyll:224:in `new' from /usr/bin/jekyll:224:in `main' What next? Alex On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I believe the web docs need to be built separately according to the instructions here. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the moment. 
Nick On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com wrote: I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive one: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SVMWithSGD default threshold
I think you need to use setIntercept(true) to get it to allow a non-zero intercept. I also kind of agree that's not obvious or the intuitive default. Is your data set highly imbalanced, with lots of positive examples? that could explain why predictions are heavily skewed. Iterations should definitely not be of the same order of magnitude as your input, which could have millions of elements. 100 should be plenty as a default. Threshold is not related to the 0/1 labels in SVMs. It is a threshold on the SVM margin. Margin is 0 at the decision boundary, not 0.5. There's no grid search at this stage but it's easy to code up in a short method. On Wed, Nov 12, 2014 at 12:41 AM, Caron caron.big...@gmail.com wrote: I'm hoping to get a linear classifier on a dataset. I'm using SVMWithSGD to train the data. After running with the default options: val model = SVMWithSGD.train(training, numIterations), I don't think SVM has done the classification correctly. My observations: 1. the intercept is always 0.0 2. the predicted labels are ALL 1's, no 0's. My questions are: 1. what should the numIterations be? I tried to set it to 10*trainingSetSize, is that sufficient? 2. since MLlib only accepts data with labels 0 or 1, shouldn't the default threshold for SVMWithSGD be 0.5 instead of 0.0? 3. It seems counter-intuitive to me to have the default intercept be 0.0, meaning the line has to go through the origin. 4. Does Spark MLlib provide an API to do grid search like scikit-learn does? Any help would be greatly appreciated! - Thanks! -Caron -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
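In code, those suggestions amount to something like the following, a sketch against the 1.x MLlib API (the regularization and iteration settings are placeholders):

import org.apache.spark.mllib.classification.SVMWithSGD

val svm = new SVMWithSGD()
svm.setIntercept(true) // allow a non-zero intercept
svm.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)

val model = svm.run(training)

// predict() applies a threshold of 0.0 on the SVM margin by default;
// clearThreshold() makes it return the raw margin instead.
model.clearThreshold()
val margins = test.map(p => (model.predict(p.features), p.label))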
How to solve this core dump error
I got a core dump when I used Spark 1.1.0. The environment is shown here.

Software environment: OS: CentOS 6.3; JVM: Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
Hardware environment: memory: 64G

I run three Spark processes with JVM -Xmx args like this: -Xmx 28G, -Xmx 2G, -Xmx 14G. The process which produced the core dump is the one with the -Xmx 14G parameter. I found the core dump happened when the IO was very high. The gdb core dump bt data is like this:

(gdb) bt
#0 0x003fc04328a5 in raise () from /lib64/libc.so.6
#1 0x003fc0434085 in abort () from /lib64/libc.so.6
#2 0x003fc046fa37 in __libc_message () from /lib64/libc.so.6
#3 0x003fc0475366 in malloc_printerr () from /lib64/libc.so.6
#4 0x2ad53773dad0 in Java_java_io_UnixFileSystem_getLength () from /apps/svr/jdk-7/jre/lib/amd64/libjava.so
#5 0x2ad53c5c9b5c in ?? ()
#6 0x00049d9908e0 in ?? ()
#7 0x0004413d7030 in ?? ()
#8 0x0004498f77e0 in ?? ()
#9 0x2ad53c3dff18 in ?? ()
#10 0x0004498f77e0 in ?? ()
#11 0x2ad53c7c1668 in ?? ()
#12 0x0004413d7690 in ?? ()
#13 0x0004413d7678 in ?? ()
#14 0x in ?? ()

Additionally, I set the JVM parameters of the Spark executor like this: -XX:SurvivorRatio=8 -XX:MaxPermSize=1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/apps/logs/spark/ -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSConcurrentMTEnabled -XX:+DisableExplicitGC -Djava.net.preferIPv4Stack=true

I sincerely hope to get your help. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-this-core-dump-error-tp18672.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Pyspark Error when broadcast numpy array
This PR fix the problem: https://github.com/apache/spark/pull/2659 cc @josh Davies On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote: In spark-1.0.2, I have come across an error when I try to broadcast a quite large numpy array(with 35M dimension). The error information except the java.lang.NegativeArraySizeException error and details is listed below. Moreover, when broadcast a relatively smaller numpy array(30M dimension), everything works fine. And 30M dimension numpy array takes 230M memory which, in my opinion, not very large. As far as I have surveyed, it seems related with py4j. However, I have no idea how to fix this. I would be appreciated if I can get some hint. py4j.protocol.Py4JError: An error occurred while calling o23.broadcast. Trace: java.lang.NegativeArraySizeException at py4j.Base64.decode(Base64.java:292) at py4j.Protocol.getBytes(Protocol.java:167) at py4j.Protocol.getObject(Protocol.java:276) at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81) at py4j.commands.CallCommand.execute(CallCommand.java:77) at py4j.GatewayConnection.run(GatewayConnection.java:207) - And the test code is a follows: conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051') conf.set('spark.executor.memory', '4000m') conf.set('spark.akka.timeout', '10') conf.set('spark.ui.port','8081') conf.set('spark.cores.max','150') #conf.set('spark.rdd.compress', 'True') conf.set('spark.default.parallelism', '300') #configure the spark environment sc = SparkContext(conf=conf, batchSize=1) vec = np.random.rand(3500) a = sc.broadcast(vec) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Pyspark Error when broadcast numpy array
Dear Liu: Thank you very much for your help. I will update with that patch. By the way, when I succeeded in broadcasting an array of size 30M, the log said that such an array takes around 230MB of memory. As a result, I think the numpy array that leads to the error is much smaller than 2G. On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List] wrote: This PR fix the problem: https://github.com/apache/spark/pull/2659 cc @josh Davies -- My Homepage: www.cse.ust.hk/~bliuab MPhil student in Hong Kong University of Science and Technology. Clear Water Bay, Kowloon, Hong Kong. Profile at LinkedIn http://www.linkedin.com/pub/liu-bo/55/52b/10b. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18674.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Read a HDFS file from Spark source code
Hi I am trying to access a file in HDFS from spark source code. Basically, I am tweaking the spark source code. I need to access a file in HDFS from the source code of the spark. I am really not understanding how to go about doing this. Can someone please help me out in this regard. Thank you!! Karthik
Re: Strange behavior of spark-shell while accessing hdfs
Thanks guys for the info. I have to use yarn to access a kerberos cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18677.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: pyspark get column family and qualifier names from hbase table
Feel free to add that converter as an option in the Spark examples via a PR :) — Sent from Mailbox On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote: Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to it. How did you write HBaseResultToStringConverter to do what you wanted it to do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: filtering out non English tweets using TwitterUtils
Fwiw if you do decide to handle language detection on your machine this library works great on tweets https://github.com/carrotsearch/langid-java On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at: http://twitter4j.org/javadoc/twitter4j/Status.html#getLang-- What version of twitter4j does Spark Streaming use? 3.0.3 https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53 Tobias
Re: Read a HDFS file from Spark source code
Instead of a file path, use an HDFS URI. For example (in Python):

data = sc.textFile("hdfs://localhost/user/someuser/data")

On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to access a file in HDFS from spark source code. Basically, I am tweaking the spark source code. I need to access a file in HDFS from the source code of the spark. I am really not understanding how to go about doing this. Can someone please help me out in this regard. Thank you!! Karthik
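Since the question is about doing this from inside Spark's own (Scala) source, the Scala equivalent is either sc.textFile with an HDFS URI or the Hadoop FileSystem API directly; a sketch with placeholder host, port and path:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Via an RDD:
val data = sc.textFile("hdfs://localhost:8020/user/someuser/data")

// Or directly through the Hadoop FileSystem API:
val fs = FileSystem.get(new URI("hdfs://localhost:8020"), new Configuration())
val in = fs.open(new Path("/user/someuser/data"))
try {
  Source.fromInputStream(in).getLines().take(10).foreach(println)
} finally {
  in.close()
}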